Multiple XA Caches potentially deadlock each other

Type: Bug
Status: Open
Priority: 2 Major
Resolution:
Component/s: ehcache-core
Labels:

Assignee: prodmgmt
Reporter: alexsnaps
Created: September 02, 2010
Votes: 0
Watchers: 2
Updated: April 20, 2011
Resolved:

Description

This comes from: http://forums.terracotta.org/forums/posts/list/4136.page#22155

Given that a TransactionManager does not execute two-phase commits on XAResources in an ordered fashion, multiple XA transactions involving multiple XA caches can result in 2PC deadlocks.

Comments

Alexander Snaps 2010-09-02

One way of solving this would be to register multiple Caches as a single XAResource (per CacheManager, or even across multiple ones?). We could have a layer that acts as an XAResource to the TransactionManager, and as the TransactionManager to the current EhcacheXAResource. It’d bridge operations from one to the other, but on prepare and commit/rollback it’d execute the operation on the underlying caches (XAResources) in an ordered fashion. This would solve the deadlock issue, but potentially slow the 2PC down (no more parallelism between XAResources). Another approach would be the SoftLock one, with some back-off mechanism when a deadlock is detected… It is currently less clear to me how to achieve this nicely, still investigating.
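A minimal sketch of the "ordered fashion" idea above, not the Ehcache implementation. `CacheResource` is a hypothetical stand-in for a per-cache XA resource, and ordering by cache name is an assumption; any total order shared by all transactions would do. Failure handling (rollback when a prepare fails) is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderedTwoPhaseFacade {

    /** Hypothetical stand-in for a per-cache XA resource; not the Ehcache or JTA API. */
    public interface CacheResource {
        String name();
        void prepare();
        void commit();
    }

    private final List<CacheResource> ordered;

    public OrderedTwoPhaseFacade(List<CacheResource> resources) {
        // Sort once by cache name so every transaction visits caches in the same order.
        List<CacheResource> copy = new ArrayList<>(resources);
        copy.sort(Comparator.comparing(CacheResource::name));
        this.ordered = copy;
    }

    /** Runs both phases sequentially, in the fixed order (no parallel 2PC). */
    public void prepareAndCommit() {
        for (CacheResource r : ordered) {
            r.prepare();
        }
        for (CacheResource r : ordered) {
            r.commit();
        }
    }

    /** Helper for demonstration: a resource that appends each call to a shared log. */
    public static CacheResource recording(String name, List<String> log) {
        return new CacheResource() {
            @Override public String name() { return name; }
            @Override public void prepare() { log.add("prepare " + name); }
            @Override public void commit() { log.add("commit " + name); }
        };
    }
}
```

Because two transactions touching the same caches always prepare them in the same order, neither can hold one cache while waiting on another that the other transaction already holds. The cost is the one noted above: the phases run sequentially rather than in parallel.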

Ludovic Orban 2010-09-02

I’m not in favor of the single XAResource facade for all EhcacheXAResources, as it sounds like a hack and would kill performance (running parallel 2PC proved to largely improve it in h2lcperf), but having a single XAResource for the whole CacheManager sounds more and more like a good idea to me.

I need to ponder that too.

Alexander Snaps 2010-09-02

Not sure how you expect to not lose performance with the latter… The gain is basically about being able to execute the 2PC in parallel. So, imho, having a single XAResource per CacheManager will reduce performance whether it acts like a facade to other XAResources or executes the actual work directly. Also, having one per CacheManager would still have the problem arise as soon as one uses multiple CacheManagers within a single XA transaction (which could easily be the case if one uses Hibernate w/ Ehcache and some other CacheManager for some other purpose). Hence the other option of detecting deadlocks… Or maybe even better, prevent them! We could deny parallel 2PC when one XAResource (whatever it represents, but the finer the control the better, i.e. Cache) is already in process… The mechanism for doing so throughout the cluster could potentially come at quite a high price though! Throwing out wild ideas…

Fiona OShea 2010-09-09

Is this something we want to try and fix in 2.3 (i.e. soon), or can we wait until Fremantle? It is not a regression from what I can tell.

Alexander Snaps 2010-09-29

As of now, I still think the best solution to this issue is the SoftLock approach. This would indeed enable “per key” locking with standalone. If the two-phase commit only locks a single segment at a time, we should be able to prevent any deadlocks. We would probably still need to roll back the TX if, while owning a segment lock, some key has previously been locked by another TX (or wait?). Still trying to figure it all out in my little brain. I’ll send a recap in a mail when I think I have a solution nailed. But that would definitely have to wait for Fremantle though…
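A rough illustration of per-key soft locking as described above, under stated assumptions: `SoftLockTable` and its methods are hypothetical names, keys are plain strings, and a transaction is identified by a `long`. It shows only the "roll back if the key is already locked by another TX" branch, not the "wait" alternative and not segment locking.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SoftLockTable {

    // Maps each soft-locked key to the id of the transaction that owns it.
    private final ConcurrentMap<String, Long> owners = new ConcurrentHashMap<>();

    /**
     * Tries to soft-lock a key for the given transaction without blocking.
     * Returns false if another transaction already owns the key, in which
     * case the caller is expected to roll back.
     */
    public boolean tryLock(String key, long txId) {
        Long previous = owners.putIfAbsent(key, txId);
        return previous == null || previous == txId;
    }

    /** Releases every key owned by the transaction, on commit or rollback. */
    public void releaseAll(long txId) {
        owners.values().removeIf(owner -> owner == txId);
    }
}
```

Since `tryLock` never blocks, two transactions cannot end up waiting on each other; the loser of a conflict simply fails and has to roll back.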

Fiona OShea 2010-12-10

Has this already been resolved?

Ludovic Orban 2011-01-25

I don’t think it is possible to implement a ‘back-off and retry’ policy without compromising isolation.

The new XA store doesn’t completely get rid of the deadlocks but some effort has been made to try to avoid them as much as possible.

From what I noticed in my testing they became rare enough that we probably don’t have to worry about them anymore.

Alexander Snaps 2011-01-25

Not that I disagree with that statement, but I must admit I don’t understand the reasons this could compromise isolation… Maybe we should organize a meeting about this, should we want to proceed with addressing this.

Fiona OShea 2011-01-25

Can you and Ludovic have the meetings needed and then update the Jira? Thanks.

Alexander Snaps 2011-01-27

After looking through the code with Ludovic, it turns out XA caches don’t deadlock anymore… Local transaction ones, on the other hand, can still “deadlock”, but an Exception is thrown if that happens. So at least the VM will not grind to a halt. XA stores are more drastic: they will fail right away, should the expected value not be present in the store. That includes SoftLock instances… So as soon as a SoftLock is encountered, an OptimisticLockFailureException is thrown… Currently that means an ABA problem on XA stores (which wasn’t the case before the use of SoftLocks).

Potential improvements are:

LocalTransaction could “rewind and replay” the tx being prepared when they encounter a “more important TxID” rather than throw the Exception
XA Store could fail “less” fast, when encountering a SoftLock and share the LocalTransaction “rewind & replay” behavior
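One possible shape for the “rewind and replay” idea, sketched under assumptions: `RewindAndReplay` and `ConflictException` are hypothetical names (the latter standing in for whatever exception the store throws on a conflict), and the attempt limit is arbitrary. The sketch does not address the isolation concern raised earlier in this thread; it only shows the control flow.

```java
import java.util.function.Supplier;

public class RewindAndReplay {

    /** Hypothetical exception signalling that a conflicting lock was encountered. */
    public static class ConflictException extends RuntimeException {
        public ConflictException(String message) {
            super(message);
        }
    }

    /**
     * Runs the unit of work, replaying it from the start each time it hits a
     * conflict, up to maxAttempts times. Rethrows the last conflict if every
     * attempt fails.
     */
    public static <T> T run(Supplier<T> work, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.get();
            } catch (ConflictException e) {
                last = e; // rewind: drop this attempt and replay
            }
        }
        throw last;
    }
}
```

The unit of work must be safe to run more than once, since every replay starts it over from the beginning.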

But currently, afaict, the cross-cache deadlock we experienced before shouldn’t occur anymore…

Alexander Snaps 2011-01-31

I think PM should decide what to do from here…

Fiona OShea 2011-01-31

Review at next DRB

Fiona OShea 2011-02-08

Let’s wait until we have more customer feedback to help us decide which path to take.