• Bug
  • Status: Closed
  • 1 Critical
  • Resolution: Fixed
  • Byte Code Transform
  • wharley
  • Reporter: teck
  • February 28, 2009
  • 0
  • Watchers: 0
  • April 10, 2009
  • March 06, 2009

Attachments

Description

Attached are all the relevant logs (with thread dumps from all processes) from a run of AppGroupTest which failed. this type of bug has existed in the past (not sure if it is new or not). In server_0.log you can see two threads that I believe are deadlocked. One thread is calling size() which takes read locks on all segments. The other is the applicator thread (receive_transaction_stage) that is trying to apply a change which would ultimately allow the other thread to proceeed.

Comments

Chris Dennis 2009-03-02

Email sent to Walter & Transparency

Hi Walter,

I’ll try and explain to the best of my ability what I believe to be going on in CDV-1160.

Applicator Thread: “WorkerThread(receive_transaction_stage,0)” daemon prio=1 tid=0x0872c500 nid=0x2376 waiting on condition [0x896f2000..0x896f3150] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:118) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:716) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:746) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1076) at java.util.concurrent.locks.ReentrantReadWriteLock$FairSync.wlock(ReentrantReadWriteLock.java:392) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.__RWL__tc_lock(ReentrantReadWriteLock.java:637) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java) at java.util.concurrent.ConcurrentHashMap$Segment.lock(ConcurrentHashMap.java) at java.util.concurrent.ConcurrentHashMap$Segment.__tc_origPut(ConcurrentHashMap.java:408) at java.util.concurrent.ConcurrentHashMap.__tc_applicator_put(Unknown Source) at com.tc.object.applicator.HashMapApplicator.apply(HashMapApplicator.java:74) at com.tc.object.applicator.ConcurrentHashMapApplicator.hydrate(ConcurrentHashMapApplicator.java:103)

This is trying to grab a local write lock on one of the CHM’s segments (lets call it ‘segments[n].local(w)’). Its holding nothing else of any significance except that it will be blocking acquires of segments[m].dso(r/w). Here n is the segment index on this node, and m is the segment index on the other node - they should be the same for all possible keys.

User Thread: “http-15699-Processor24” daemon prio=1 tid=0x88220f20 nid=0x23a7 in Object.wait() [0x876e8000..0x876e8ed0] at java.lang.Object.wait(Native Method) - waiting on <0xae808ab0> (a java.lang.Object) at java.lang.Object.wait(Object.java:474) at com.tc.object.lockmanager.impl.ClientLock.waitForLock(ClientLock.java:653) - locked <0xae808ab0> (a java.lang.Object) at com.tc.object.lockmanager.impl.ClientLock.basicLock(ClientLock.java:239) at com.tc.object.lockmanager.impl.ClientLock.lock(ClientLock.java:130) at com.tc.object.lockmanager.impl.ClientLock.lock(ClientLock.java:117) at com.tc.object.lockmanager.impl.ClientLockManagerImpl.lock(ClientLockManagerImpl.java:342) at com.tc.object.lockmanager.impl.StripedClientLockManagerImpl.lock(StripedClientLockManagerImpl.java:106) at com.tc.object.lockmanager.impl.ThreadLockManagerImpl.lock(ThreadLockManagerImpl.java:46) at com.tc.object.tx.ClientTransactionManagerImpl.begin(ClientTransactionManagerImpl.java:227) at com.tc.object.bytecode.ManagerImpl.begin(ManagerImpl.java:355) at com.tc.object.bytecode.ManagerImpl.monitorEnter(ManagerImpl.java:531) at com.tc.object.bytecode.ManagerUtil.monitorEnter(ManagerUtil.java:472) at com.tc.object.bytecode.ManagerUtil.monitorEnter(ManagerUtil.java:461) at java.util.concurrent.locks.ReentrantReadWriteLock$DsoLock.lock(ReentrantReadWriteLock.java:43) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java) at java.util.concurrent.ConcurrentHashMap$Segment.__tc_readLock(ConcurrentHashMap.java) at java.util.concurrent.ConcurrentHashMap.__tc_fullyReadLock(Unknown Source) at java.util.concurrent.ConcurrentHashMap.size(Unknown Source) at com.tctest.webapp.servlets.SharedCache.size(SharedCache.java:23) at com.tctest.webapp.servlets.RootSharingServletA.doGet(RootSharingServletA.java:36)

This is trying to grab a dso read lock on one of the CHM’s segments. Assuming that the lock in question is ‘segments[m].dso(r)’ (all other threads seem to be moving) then we know it is holding (RRWLs acquire dso first then local - see ReentrantReadWriteLockTC.java - and the fullyReadLock method goes through the array calling lock on each RRWL):

segments[0].dso(r), segments[0].local(r) ……… segments[m-1].dso(r), segments[m-1].local(r), pending on segments[m].dso(r) and then following that will acquire segments[m].local(r).

Hence the only way that one of the local holds of this thread can be blocking the local acquire of the applicator is if m > n. I.e. the segment index being used here is not the same as the segment index on the other node - which means they don’t have the same hashcode.

I’m sure you know the test code better than me - its in the tim-session-system-tests tim.

I’m sure this is all clear as mud now…

Chris _______________________________________________ Transparency mailing list [email protected] http://mail.terracottatech.com/mailman/listinfo/transparency

Chris Dennis 2009-03-02

I subsequently became more panicked.

Was looking at this some more and I can’t see that we do anything clever with the hashcodes of literals. This is fine for literals like Integer and String, where the hashcode implementation is entirely data dependent - but means that for things like Class, Currency, (and a crap load of others I imagine) that use object identity as part of their hashing calculation maps will be really buggered (but in really subtle ways).

This means for maps that use literal keys with object identity based hashcodes you’ll get:

  • The occasional random lockup if map level and key levels ops are concurrent (like in the monkey issue).
  • Nodes seeing applications in the wrong order because segment assignment aren’t consistent across the cluster and hence all the locking is unreliable.

I just hope to crap that none of these objects rely on identity for equality tests.

Someone please tell me all this is wrong.

Chris

Walter Harley 2009-03-03

I’ve committed ConcurrentHashMapBadKeyDeadlockTest, which reproduces this failure reliably. The test is currently disabled.

Walter Harley 2009-03-06

Fixed in 3.0 branch with change 12034, in trunk with 12030. We no longer inadvertently attempt to share literals; instead, for literals without stable hash codes, we calculate our own.

nadeem ghani 2009-03-10

Test is running on monkeys, so closing