TCObjectNotFoundException killed TC master with a tight loop

Type: Bug
Status: Open
Priority: 2 Major
Resolution:
Component/s: DSO:L1
Labels:

Assignee: dconnard
Reporter: dconnard
Created: March 10, 2009
Votes: 1
Watchers: 1
Updated:
March 10, 2009
Resolved:

Attachments

dconnard (283.00 k) application/gzip terracotta-client.log.2.gz
dconnard (4.00 M) application/gzip terracotta-server.log.2.gz

Description

This text has been copied and edited from http://forums.terracotta.org/forums/posts/list/0/1884.page

We’ve got a TC cluster which has (well, had…) been running for over 3 months now. We’re using it to cluster our Tomcat sessions. Last night, during our peak load, it fell over badly and effectively stopped all of our Tomcat instances in one go. The basic summary is that we got a couple of TCObjectNotFoundException raised in the clients. This subsequently sent the TC master into a tight loop, creating vast amounts of logs (1GB in 10 minutes), and making it unresponsive to further client requests. A complete tomcat and TC restart was required to recover from the situation.

The sequence of events that I can roughly gather was:

Our Tomcats logged a couple of Object Not Found errors. These were also logged in the TC client-log on those hosts:

2009-03-09 21:49:18,082 [worker-1436] WARN com.tc.object.ClientObjectManager - ChannelID=[886]: Exception: com.tc.exception.TCObjectNotFoundException: *************************** Requested Object is missing : ObjectID=[2051858398] Missing Oids = [ObjectID=[2051858398]] *************************** at com.tc.object.RemoteObjectManagerImpl.basicRetrieve(RemoteObjectManagerImpl.java:175) at com.tc.object.RemoteObjectManagerImpl.retrieve(RemoteObjectManagerImpl.java:154) at com.tc.object.ClientObjectManagerImpl.lookup(ClientObjectManagerImpl.java:572) at com.tc.object.ClientObjectManagerImpl.lookupObject(ClientObjectManagerImpl.java:483) at com.tc.object.ClientObjectManagerImpl.lookupObject(ClientObjectManagerImpl.java:472) at com.tc.object.TCObjectPhysical.resolveReference(TCObjectPhysical.java:133
The TC master logged the same Object Not Found error:

2009-03-09 21:49:18,076 [WorkerThread(managed_object_fault_stage,3)] WARN com.tc.objectserver.api.ObjectManager - Request for non-exisitent object : ObjectID=[2051858398] context = ObjectManagerLookupContext : [ processed count = 1, responseContext = ManagedObjectRequestContext@12704120 [ ClientID[886] , ObjectRequestID=[2521214] , [ObjectID=[2051858398]], requestingThread = worker-1436, isServerInitiatedRequest = false ]]

(note that these start out as “isServerInitiatedRequest = false”, and this appears to be logged a few ms before the client did, which kind of makes sense)

Several more ObjectNotFound errors are logged in the clients & server. The list of missing object ids grows to about 20 or so.
The TC Master then goes into a tight loop, logging huge numbers of these:

2009-03-09 21:51:57,880 [WorkerThread(managed_object_fault_stage,2)] WARN com.tc.objectserver.api.ObjectManager - Request for non-exisitent object : ObjectID=[2051858555] context = ObjectManagerLookupContext : [ processed count = 1, responseContext = ManagedObjectRequestContext@22330968 [ ClientID[886] , ObjectRequestID=[2563140] , [ObjectID=[2051858555], ObjectID=[2051858565], ObjectID=[2051858577], ObjectID=[2051858576], ObjectID=[2051858566], ObjectID=[2051858572], ObjectID=[2051858556], ObjectID=[2051858568], ObjectID=[2051858561], ObjectID=[2051858569], ObjectID=[2051858560], ObjectID=[2051858573], ObjectID=[2051858574], ObjectID=[2051858579], ObjectID=[2051858557], ObjectID=[2051858570], ObjectID=[2051858563], ObjectID=[2051858575], ObjectID=[2051858571], ObjectID=[2051858558]], requestingThread = WorkerThread(respond_to_request_stage,0), isServerInitiatedRequest = true ]]

(note that the isServerInitiatedRequest is now set to “true”, and the “(respond_to_request_stage,NN)” field - the NN is observed to vary between 0,1,2,3 in the various log entries. I don’t know what that means, but did the server kick off an infinite loop as the result of the clients original (failed) request?

At this point, the TC master went high CPU. Many log entries are being written each millisecond. The TC master then went on to consume about 1GB of log files (neatly rolling them at the 512MB mark, well done guys!). Our Tomcats are effectively dead to further user requests from this point, as they are blocked trying to fetch session data from the unresponsive master.

Several minutes later, the TC clients started logging:

2009-03-09 21:57:32,355 [worker-1231] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ChannelID=[886]: Flush for LockID(B9D8B1C8601866BD8A2E) took longer than: 15 sec. Took : 16160 ms. # Transactions not yet Acked = 4

(and these go on and on, with the time and #Transactions increasing)

So…… has anyone else experieced this..? Anyone care to take a look at the possible causes for the fault in the Master?

We are running on CentOS 4.4 (final), TC: 2009-03-09 14:02:51,459 [main] INFO com.terracottatech.console - Terracotta 2.6.4, as of 20080821-180810 (Revision 9831 by cruise@rh4mo0 from 2.6)

I’ve attached part of the server log file (the final one of the 3, while the server was tight looping before we shut it down… the first in the set is 17MB even when compressed - I’ll have to upload this to you some other way). I’ve attached the complete client log file for the time.

Comments

Fiona OShea 2009-03-10

We notice on the forum post that you are going to try with 2.7.3 which has many improvements and bug fixes. Please let us know if you see this issue again

David Connard 2009-03-10

OK, will do… but you may as well just close the bug then. It’s clearly a rare incident (we’ve been running for over 3 months continuous prior to this happening), and I think the chances of it happening again in the same timeframe are reasonably unlikely… and if it did ever happen again, it’ll probably take 3 months or more of running, and the TC stable version will have moved on by then, and your response will presumably be that we should try 3.0.N…!

I was lead to believe in the forum post that it might be worth raising a bug for someone to take a look at our logs from this particular incident, else I wouldn’t have bothered raising this.