• Type: Bug
• Status: Open
• Priority: 2 Major
• Resolution:
• Component: DSO:L1
• Assignee: cdennis
• Reporter: jrray
• Created: June 13, 2011
• Votes: 0
• Watchers: 3
• Updated: December 15, 2011

Description

There is a code path through TerracottaLock.tryLock(…) that calls PendingTryLockHold.park() with no timeout, resulting in the tryLock(…) caller blocking forever. The code path is still present in 3.5.1.

The scenario in which I encounter this is when an L1 is ejected from the cluster after failing the healthcheck. There appears to be no way for the Semaphore.acquire() call to complete.

{noformat}
Ice.ThreadPool.Server-44 [WAITING] CPU time: 0:00
java.util.concurrent.Semaphore.acquire()
com.tc.object.locks.LockStateNode$PendingLockHold.park()
com.tc.object.locks.ClientLockImpl.acquireQueuedTimeout(RemoteLockManager, ThreadID, LockLevel, long)
com.tc.object.locks.ClientLockImpl.tryLock(RemoteLockManager, ThreadID, LockLevel, long)
com.tc.object.locks.ClientLockManagerImpl.tryLock(LockID, LockLevel, long)
com.tc.object.bytecode.ManagerImpl.tryLock(LockID, LockLevel, long)
org.terracotta.locking.TerracottaLock.tryLock(LockType, long, TimeUnit)
{noformat}
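
To illustrate the failure mode from the caller's side, here is a minimal sketch; the class and method names are invented and the timeout value is arbitrary:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

public class TryLockHangExample {
  // "lock" stands in for a clustered org.terracotta.locking.TerracottaLock;
  // how it is obtained is elided here.
  static void doWork(Lock lock) throws InterruptedException {
    // Despite the explicit five-second timeout, this call can block forever
    // if the L1 is ejected from the cluster while the request is in flight,
    // because the internal PendingTryLockHold.park() is issued with no timeout.
    if (lock.tryLock(5, TimeUnit.SECONDS)) {
      try {
        // critical section
      } finally {
        lock.unlock();
      }
    }
  }
}
{code}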

Comments

J Robert Ray 2011-06-13

Here’s another case:

{noformat}
Ice.ThreadPool.Server-31 [WAITING] CPU time: 0:00
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()
com.tc.object.locks.ClientLockManagerImpl.waitUntilRunning()
com.tc.object.locks.ClientLockManagerImpl.tryLock(LockID, LockLevel, long)
com.tc.object.bytecode.ManagerImpl.tryLock(LockID, LockLevel, long)
org.terracotta.locking.TerracottaLock.tryLock(LockType, long, TimeUnit)
{noformat}

Fiona OShea 2011-06-21

Can you take a look at this and see if you can give an answer/comment?

Thanks

Chris Dennis 2011-06-24

This is a very complex part of the Terracotta locking code.

The reason we do an infinite park comes down to how we interpret the {{tryLock()}} contract in {{java.util.concurrent.locks.Lock}}. Quoting the javadoc:

{quote}
Acquires the lock only if it is free at the time of invocation.

Acquires the lock if it is available and returns immediately with the value {{true}}. If the lock is not available then this method will return immediately with the value {{false}}.
{quote}

On a single machine this contract yields an almost instantaneous response, since checking the lock's status is very quick. In a clustered environment, unless the lock is either held locally or held greedily (i.e. this client has been given temporary exclusive access to the lock), we are forced to make a server call to determine the lock hold state. This means we send a server request and then wait for its response so we can determine the correct course of action.
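
A minimal sketch of that decision follows; the names are invented, and the real logic in {{ClientLockImpl}} is considerably more involved:

{code:java}
import java.util.concurrent.Semaphore;

// Hypothetical sketch only: invented names, not the actual Terracotta code.
class ClusteredTryLockSketch {
  private final Semaphore serverResponse = new Semaphore(0);
  private volatile boolean heldLocally, heldGreedily, awarded;

  boolean tryLock() throws InterruptedException {
    if (heldLocally || heldGreedily) {
      // Lock state is known on this client: answer immediately (grant or
      // refuse), as the untimed tryLock() contract expects.
      return tryAcquireLocally();
    }
    // Lock state is unknown: ask the server and block, with no timeout, until
    // it replies. If the client has been ejected from the cluster the reply
    // never arrives and the caller hangs here.
    sendTryLockRequestToServer();
    serverResponse.acquire(); // the Semaphore.acquire() in the stack traces
    return awarded;
  }

  private boolean tryAcquireLocally() { return true; /* local CAS elided */ }

  private void sendTryLockRequestToServer() { /* network call elided */ }
}
{code}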

In your case the client has been disconnected from the cluster. A pedantic person would reason that the tryLock call cannot return until the client rejoins the cluster, as it cannot know the lock state, and so any return value would breach the above contract. The reasoning (and contract) for a timed {{tryLock(…)}} is very similar:

{quote}
Acquires the lock if it is free within the given waiting time and the current thread has not been {{interrupted}}.

If the lock is available this method returns immediately with the value {{true}}. If the lock is not available then the current thread becomes disabled for thread scheduling purposes and lies dormant until one of three things happens…
{quote}

The logic here is that the timed wait does not start until after the lock status is known, which requires a server response. Hence for a timed call we only do a timed park when the lock state is known locally; otherwise we wait, without a timeout, for a server response, which in your case will never come.
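
Sketching the timed variant the same way (again with invented names; compare {{ClientLockImpl.acquireQueuedTimeout(…)}}):

{code:java}
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: invented names, not the actual Terracotta code.
class ClusteredTimedTryLockSketch {
  private final Semaphore lockFreed = new Semaphore(0);
  private final Semaphore serverResponse = new Semaphore(0);
  private volatile boolean stateKnownLocally, awarded;

  boolean tryLock(long timeout, TimeUnit unit) throws InterruptedException {
    if (stateKnownLocally) {
      // State is known: the contractual waiting time starts now, so the
      // park is bounded by the caller's timeout.
      return lockFreed.tryAcquire(timeout, unit);
    }
    // State is unknown: the waiting time has not started yet, so the park
    // waiting for the server's answer is unbounded. A client that has been
    // ejected from the cluster blocks here forever.
    sendTryLockRequestToServer();
    serverResponse.acquire();
    return awarded;
  }

  private void sendTryLockRequestToServer() { /* network call elided */ }
}
{code}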

Having said this, in your case I think the permanent disconnect (due to the healthchecker) is probably sufficient cause for us to allow the tryLock calls to return {{false}}, given that the server response will now never come. It could in fact be argued that the lock is now deterministically “not available”, since we can never receive the award from the server, so a {{false}} return would conform to the contract.
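
One possible shape of such a fix, sketched with invented names (this is not the actual Terracotta change): a disconnect event fails every pending hold so parked threads wake up and return {{false}}:

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch only. When the healthchecker declares the client
// permanently disconnected, every waiter parked on a pending try-lock
// request is released with a negative answer, so tryLock(...) returns
// false instead of hanging.
class PendingTryLockHolds {
  static final class Hold {
    final Semaphore response = new Semaphore(0);
    volatile boolean awarded;
  }

  private final Queue<Hold> pending = new ConcurrentLinkedQueue<Hold>();

  Hold register() {
    Hold h = new Hold();
    pending.add(h);
    return h;
  }

  // Would be driven by a transport/healthcheck disconnect event.
  void onPermanentDisconnect() {
    for (Hold h; (h = pending.poll()) != null; ) {
      h.awarded = false;     // the award can now never arrive, so the lock
      h.response.release();  // is deterministically "not available"
    }
  }
}
{code}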

Chris Dennis 2011-06-24

I think this probably deserves a fix. However, I’m pretty certain that this is a toolkit-only issue (I don’t think it has any impact on Ehcache or Quartz). I don’t know if this guy is a customer or anything, so I’ll leave the targeting up to IRB, but my opinion would be Vicente unless there is a compelling reason to push it into Ulloa.

Fiona OShea 2011-12-15

If this is easy, can we do it in the dot releases? Thanks