• Bug
  • Status: Open
  • 2 Major
  • Resolution:
  • Sessions
  • fern
  • Reporter: fern
  • August 09, 2008
  • 1
  • Watchers: 2
  • September 02, 2008

Attachments

Description

After running for under 24 hours our server freezes up. And it looks like all 400 tomcat threads are tied up waiting for their sessions.

“http-8086-Processor67” daemon prio=1 tid=0x0000002b080465b0 nid=0x7370 in Object.wait() [0x000000004dc29000..0x000000004dc2acb0] at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:474) at com.tc.object.lockmanager.impl.ClientLock.waitForLock(ClientLock.java:597) - locked <0x0000002aae5a6a50> (a java.lang.Object) at com.tc.object.lockmanager.impl.ClientLock.basicLock(ClientLock.java:212) at com.tc.object.lockmanager.impl.ClientLock.lock(ClientLock.java:117) at com.tc.object.lockmanager.impl.ClientLock.lock(ClientLock.java:107) at com.tc.object.lockmanager.impl.ClientLockManagerImpl.lock(ClientLockManagerImpl.java:225) at com.tc.object.lockmanager.impl.ThreadLockManagerImpl.lock(ThreadLockManagerImpl.java:46) at com.tc.object.tx.ClientTransactionManagerImpl.begin(ClientTransactionManagerImpl.java:181) at com.tc.object.bytecode.ManagerImpl.begin(ManagerImpl.java:321) at com.tc.object.bytecode.ManagerImpl.beginLock(ManagerImpl.java:298) at com.tc.object.bytecode.ManagerUtil.beginLock(ManagerUtil.java:200) at com.terracotta.session.util.Lock.getWriteLock(Lock.java:36) at com.terracotta.session.util.DefaultSessionId.getWriteLock(DefaultSessionId.java:58) at com.terracotta.session.SessionDataStore.find(SessionDataStore.java:83) at com.terracotta.session.TerracottaSessionManager.doGetSession(TerracottaSessionManager.java:431) at com.terracotta.session.TerracottaSessionManager.getSession(TerracottaSessionManager.java:335) at com.terracotta.session.SessionRequest.getTerracottaSession(SessionRequest.java:108) at com.terracotta.session.SessionRequest.getSession(SessionRequest.java:66) at com.protrade.facebook.base.USSPFacebook.addUsspConfigToSession(USSPFacebook.java:107) at com.protrade.facebook.base.USSPFacebook.newUnifiedSocialServices(USSPFacebook.java:202) at com.protrade.unifiedsocial.USSPCollection.newUnifiedSocialServices(USSPCollection.java:48) at com.protrade.unifiedsocial.filters.USSPFilter.handle(USSPFilter.java:37) at com.protrade.unifiedsocial.filters.USSPFilter.doFilter(USSPFilter.java:29) at com.protrade.common.web.ProtradeFilter.doFilter(ProtradeFilter.java:50) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)

Comments

Fernando Padilla 2008-08-09

This is the thread dump for when it hung.

Fernando Padilla 2008-08-09

Assuming that you guys don’t have a minor race condition in your session code, then maybe it’s a communication error? how could someone trouble shoot that?

We are also accessing the sessions pretty much on every single request coming to the server, including resources ( images/css/js, etc ), maybe the TC can’t handle the shear load of requests??

Steve Harris 2008-08-09

We’ll check it out but please include all terracotta logs from all nodes (Hopefully with thread dumps in them). thx!

Fernando Padilla 2008-08-09

Here is the client log associated with that server (www3)

Interesting how the session invalidator stopped running after the client hung.

Fernando Padilla 2008-08-09

Thank you for suggesting to get all logs from all clients.. this looks good. Here is the log of www4, which was the second client running at the moment.

complaining about the flushing of a lock taking way too long.. and started just around the time that www3 was hung.

Fernando Padilla 2008-08-09

Here is the logfile for the TC server. Nothing much interesting, except around the time of the hangup there was this line, from an ehcache instance:

2008-08-07 19:42:22,375 [WorkerThread(managed_object_fault_stage,3)] WARN com.tc.objectserver.api.ObjectManager - Request for non-exisitent object : ObjectID=[707083] context = ObjectManagerLookupContext : [ processed count = 1, responseContext = ManagedObjectRequestContext@1071879504 [ ClientID[0] , ObjectRequestID=[170152] , [ObjectID=[707083]], requestingThread = CacheInvalidator - tc-lockCache invalidation thread0 ]]

not sure if that is cause or effect..

Fernando Padilla 2008-08-09

I am going to attach more logs, for another time that the servers hung (i am guessing for sessions). Here are the notes about the second incident:

– different application, different terracotta server, different tomcat servers :) tc-server: msupply2 tomcat severs: www7, www8, www9 time-of-incident: last night it looks like www7 or www8 hung up, and our ops reaction was to restart them without TC (since we know we’re having TC issues). but we forgot to deal with www9, and it seems like it is behaving just fine, weird. The only indicator in the client-logs was that the session invalidator stops working. I know that these servers didn’t print out a thread dump, but I’ll see if I can find something interesting in their logs to put up here as well, else we’ll have to do with the tc-client-logs.

thank you!

Fernando Padilla 2008-08-09

www7 tc-client-logs for last night

Fernando Padilla 2008-08-09

www8 tc-client-logs for last night

Fernando Padilla 2008-08-09

www9 tc-client-logs for last night

Fernando Padilla 2008-08-09

msupply2 tc-server-logs for last night ( this one serves www7, www8, www9 )

Fernando Padilla 2008-08-09

by the way, www9 doesn’t through running the same code as www7/www8 it serves a smaller population of our users, so it doesn’t use sessions quite as much, that could be why it never hung last night and was able to recover.

Fernando Padilla 2008-08-09

www8 app-logs last night,

it hung at midnight, briefly recovered at 12:30 when www7 was restarted..

Fernando Padilla 2008-08-09

www7 app-logs last night

it hung at 12:08 or so (less than 10 minutes after www8 hung), it never recovered, we restarted it without TC.

Tim Eck 2008-08-11

The majority of threads are blocking trying to get an exclusive lock on some session. We can’t tell from the thread dumps, but I suspect they are trying to get the same lock. Is this real load or simulated load? If simulated, how many sessions are ther?

There are two threads (“http-8086-Processor66” and “http-8086-Processor51”) in catalina.out.080807213852-dump that looks different from the rest. They appear to be blocked in terracotta locks inside ehcache.

I’ll have to look into the 2.5.2 code, but those “Request for non-exisitent object” errors on the terracotta server could very well be relevant here. My theory (still needs investigation) is that the missing objects cause locks to be held in thread ““CacheInvalidator - tc-lockCache invalidation thread0” since it never finishes it’s lookup.

If the two tomcat threads that are blocked in ehcache have also taken session locks, I can believe more requests on those sessions will just hang forever. The session lock is taken when request.getSession() is called and is held for the duration of the request.

p.s. If there are more nodes in the cluster, getting thread dumps of ALL of them at the time things are hung would be good.

Fernando Padilla 2008-08-13

This is real load.

So you’re saying that ehcache is really the one blocking causing a request to block for ever; Secondarily tc-sessions can only deal with a single request accessing the session at a time, thus all future requests to that session are blocked, and thus exhausting our tomcat threads? Hmm, thank you for shedding some light.

This is real load, and I did give up all the logs I had for the 2 incidences ( one with 2 clients and one with 3 clients ).

It sounds like 2.6.2 has done lots of work around locks and stuff, I’ll try upgrading to that again and see how things behave then.

Eric Ellis 2008-08-28

Hi Tim, I’m having just about the same issue with 2.6.2 while attempting to load test and deploy the Terracotta implementation for Ziff. If this is something your working on right now? I’m more than happy to provide you with whatever you need to replicate and resolve it. I’d even be willing to bring my entire tc-sessions code-base on site (notebook) if need be. Everything is tested and ready to go at this point. Things are just freezing up during load testing.

INFO: Server startup in 39463 ms Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out Error reading URL content: java.net.SocketTimeoutException: Read timed out

Feel free to email me or give me a call: Iyer has my email and my phone should be on file

This is basically what I’m working on right now. I’ll be gathering debug output and general information surrounding the problem.

Thanks, Eric

Eric Ellis 2008-08-28

Just another quick note: the exact same code-base is currently running in out production environment without Terracotta clustered sessions. And running the load test against our dev servers w/o Terracotta runs with approx 15 requests per second and doesn’t freeze. The only variable is Terracotta Sessions. CPU is almost nothing and memory usage is low (my first suspicion).

Fernando Padilla 2008-08-28

I hope you can track this down :), because we have given up on TC-sessions until this is fixed.

We are currently seeing locks up when clients are starting up, while other clients are up and doing work. So we can never restart servers during the day, just turn them off and wait until off hours.. :( :( :(

And this is after we turned of TC-Sessions, TC-Ehcache, and Spring Integration. So.. I hope that a version of TC higher than 2.5.2 will fix this.. I hope that we can upgrade soon and not run into any issues..

Eric Ellis 2008-09-02

I’m still looking into things but at this point the finger seems to be pointing at my attempt to cluster ACEGI security. Terracotta has had ACEGI support on their todo list for about 2 years now and given these two message threads it appears to have some serious implementation issues. I really need to discuss it’s importance with my client (though I know the answer is going to be “Terracotta said: no code changes”).

http://forums.terracotta.org/forums/posts/list/830.page http://forums.terracotta.org/forums/posts/list/226.page

Eric Ellis 2008-09-02

Yep, case closed for me. With ACEGI disabled my load test is cruising at break neck speeds. I’ll be upgrading to the new Spring native security/authentication mechanism.