Write lock leak

Type: Bug
Status: Closed
Priority: 2 Major
Resolution: Not a Bug
Component/s: ehcache-core
Labels:

Assignee: alexsnaps
Reporter: jandam
Created: June 07, 2012
Votes: 0
Watchers: 5
Updated:
July 27, 2012
Resolved:
June 29, 2012

Attachments

jandam (4.00 k) application/octet-stream ehcache-deadlock.tdump
jandam (2.00 k) application/octet-stream exception.tdump

Description

SelfPolulatingCache is leaking write locks when exception is thrown in BlockingCache.get() at line 160: element = underlyingCache.getQuiet(key); Write lock is acquired but SelfPopulatingCache don't catch this exception to call put(new Element(key, null)) to release write lock.

Comments

Martin JANDA 2012-06-07

Attached striped threaddump

Martin JANDA 2012-06-07

Prerequisities: 1) Corrupted disk storage 2) tested on 4 core machine in ForkJoinPool 3) high load for cache get Data dimension: ~700k mem hit, ~500 values in persistent disk cache, except corrupted element 100% hit ratio (mem+disk)

            CacheConfiguration configuration = new CacheConfiguration(cacheName, 150)
                    .statistics(false)
                    .memoryStoreEvictionPolicy(MemoryStoreEvictionPolicy.LFU);
            configuration.eternal(true)
                        .overflowToDisk(true)
                        .diskSpoolBufferSizeMB(2)
                        .maxElementsOnDisk(2000)
                        .diskStorePath(System.getProperty("java.io.tmpdir") + File.separatorChar + "data-cache-" + cacheName + File.separator)
                        .diskPersistent(true);
            }
            SelfPopulatingCache selfPopulatingCache = new SelfPopulatingCache(new Cache(configuration), new CacheEntryFactory() 
         ...

Martin JANDA 2012-06-07

What leads to leak of write lock followed by deadlock!

T1, T2 = threads

T1: cache.get() - miss in memory FrontEndCacheTier.getQuiet - new Fault is added to faults authority.getQuiet()

T2: cache.get() - is same as from thread T1 ... BlockingCache.get... FrontEndCacheTier.getQuiet - Fault is found in faults fault.get() - waiting till 'complete' T1: BlockingCache.get... authority.getQuiet() - throws exception due to corrupted disk storage (line numbers can differ from 2.5.2 release) -------- net.sf.ehcache.CacheException: java.io.StreamCorruptedException: invalid stream header: 35013501 at net.sf.ehcache.store.disk.DiskStorageFactory.retrieve(DiskStorageFactory.java:954) at net.sf.ehcache.store.disk.Segment.decodeHit(Segment.java:178) at net.sf.ehcache.store.disk.Segment.get(Segment.java:216) at net.sf.ehcache.store.disk.DiskStore.get(DiskStore.java:506) at net.sf.ehcache.store.disk.DiskStore.getQuiet(DiskStore.java:513) at net.sf.ehcache.store.FrontEndCacheTier.getQuiet(FrontEndCacheTier.java:231) -------- fault.complete(null) - exception is thrown and Element e stays with value null, fault removed from faults T2: fault.get() - is completed, value is null element in BlockingCahce.get:154 is null -> element not found !!! acquireWriteLock at line 159 element = underlyingCache.getQuiet(key) - this call fails because fault is not registered any more so another try to read value from authority.getQuiet()

Write lock is not released in SelfPopulatingCache nor BlockingCache in this scenario.

Alexander Snaps 2012-06-11

I’m not sure I follow everything in your last comment, but SelfPopulatingCache does require you (since it’s behaviour is based on BlockingCache) to handle failure situations manually. So if a get() throws, you need to put back to release the lock as per BlockingCache.get() javadoc. Or am I missing the point you’re making here ?

Fiona OShea 2012-06-11

Gladstone Phase II and merge to trunk, 2.5.x

So after July Gladstone Phase I release

Martin JANDA 2012-06-12

Sorry for my horrible english. Yes SelfPopulatingCache is resposible for calling put(new Element(key, null)) when exception occurs to release write lock. But it is not called when exception occurs on line 160 “element = underlyingCache.getQuiet(key);” in BlockingCache.get()

Alexander Snaps 2012-06-12

I think we’ve indeed talked pass each other. My point is that it is up to your code to handle exceptions being thrown. SelfPopulatingCache will only unlock if the creation of the value is responsible for the failure. Now, I think you have a larger problem here. Indeed, since your DiskStore is corrupted, you probably have no other choice than stopping the server and deleting the file. Ehcache won’t ever be able to recover from this situation. While we do try to discover corrupted state of the DiskStore when bootstrapping the cache, if it did manage to go STATUS_ALIVE, nothing will be done about corrupted state anymore. Basically, all get to that key will throw… forever.

Alexander Snaps 2012-06-29

Works as designed

Martin JANDA 2012-06-29

If I understand you say that internal exception in ehcache core leading to application deadlock is correct. I expect that EHCACHE will throw exceptions when DiskStore is corrupted or will not be working at all. But I don’t expect deadlock in client application.