• Bug
  • Status: Open
  • 2 Major
  • Resolution:
  • ehcache-core
  • Reporter: jchristi
  • March 04, 2010
  • 2
  • Watchers: 4
  • October 02, 2012


If using ManualRMICacheManager (when not using automatic/multicast discovery), I believe I’m hitting what appears to be a significant performance issue. I am also using the RMISynchronousCacheReplicator in this case.

For every replicated cache change going through, the call goes through listRemoteCachePeers (which is synchronized) Every call to listRemoteCachePeers makes a call to lookupRemoteCachePeer, which will cause an outbound network connection to be made.

Perhaps a separate thread similar to the heartbeat sender used for the MulticastRMICacheManagerPeerProvider could be employed to maintain a CopyOnWriteArrayList of peers rather than every single message needing replication doing an additional lookup to each peer node.

I’m currently looking at Thread Dump Analyzer showing where I have 40 different threads all waiting on the thread below to get out of the sync block.


“httpSSLWorkerThread-8080-33” daemon prio=10 tid=0x00002aac546e6800 nid=0x3a60 runnable [0x000000004783d000..0x000000004783eb10] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x00002aaba059dce8> (a java.io.BufferedInputStream) at java.io.DataInputStream.readByte(DataInputStream.java:248) at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:195) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:359) at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source) at java.rmi.Naming.lookup(Naming.java:84) at net.sf.ehcache.distribution.RMICacheManagerPeerProvider.lookupRemoteCachePeer(RMICacheManagerPeerProvider.java:127) at net.sf.ehcache.distribution.ManualRMICacheManagerPeerProvider.listRemoteCachePeers(ManualRMICacheManagerPeerProvider.java:95) - locked <0x00002aaae6fe8270> (a net.sf.ehcache.distribution.ManualRMICacheManagerPeerProvider) at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.listRemoteCachePeers(RMISynchronousCacheReplicator.java:335) at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.replicateRemovalNotification(RMISynchronousCacheReplicator.java:239) at net.sf.ehcache.distribution.RMISynchronousCacheReplicator.notifyElementRemoved(RMISynchronousCacheReplicator.java:229) at net.sf.ehcache.event.RegisteredEventListeners.notifyElementRemoved(RegisteredEventListeners.java:77) at net.sf.ehcache.Cache.remove(Cache.java:1567) at net.sf.ehcache.Cache.remove(Cache.java:1463) at net.sf.ehcache.Cache.remove(Cache.java:1421) at net.sf.ehcache.constructs.blocking.BlockingCache.put(BlockingCache.java:507) at net.sf.ehcache.constructs.blocking.SelfPopulatingCache.get(SelfPopulatingCache.java:74) at net.sf.ehcache.constructs.blocking.BlockingCache.get(BlockingCache.java:563)


Fiona OShea 2010-09-01

Moving all unresolved “Fix Revision 2.2.1” to fix revision “unknown” as we are releasing Magnum first which is 2.3. Currently not sure which fix version these will actually be in, but they are targeted for Fremantle release

Fiona OShea 2011-02-22

MOving unresolved P2 jiras to Ulloa - to be reviewed by Chris, Fiona, Greg soon

Volker Kleinschmidt 2012-10-02

This is a pretty major bug - you obviously can’t make the slow network requests from lookupRemoteCachePeer inside a synchronized method (listRemoteCachePeers), that contradicts the basics of designing for scalability.

This has increased overhead the more peer nodes there are, so in a non-trivial cluster it is quickly crippling. And the issue is compounded by having multiple caches defined in the same application, since they all share/lock the same ManualRMICacheManager (same peers after all), but there certainly isn’t any sense in having each cache do its own validation of the peer’s reachability!

Note that while this issue makes RMISynchronousCacheReplicator completely unusable, it still creates massive blocking with asynchronous replication as well.