Problems when peer lookup fails when receiving heartbeats

Type: Bug
Status: Closed
Priority:
Resolution: Fixed
Component/s:
Labels:

Assignee: drb
Reporter: sourceforgetracker
Created: September 21, 2009
Votes: 0
Watchers: 0
Updated:
September 22, 2009
Resolved:
September 22, 2009

Description

The scenario:

Three cache managers, a, b and c, all using a RMICacheManagerPeerProviderFactory with peerDiscovery=automatic and a RMICacheManagerPeerListenerFactory. Each cache manager contains a number of distributed caches.

c has a problem, all Naming.lookup() to it timeout and throws a RemoteException (perhaps due to a firewall blocking the RMI port). It still sends out heartbeats however.

What happens:

That changes will not be replicated to c is expected. However, neither a nor b receive any cache updates. It turns out that the MulticastKeepaliveHeartbeatReceiver.MulticastReceiverThread and the MulticastRMICacheManagerPeerProvider.listRemoteCachePeers() are effectively blocked.

Why:

Each call to MulticastRMICacheManagerPeerProvider.registerPeer() will take several seconds if the rmiUrl comes from c. This method is called n number of times for each heartbeat from c, once for each (distributed) cache in c. Hence, processing a single heartbeat from c takes a looooong time. Since the MulticastReceiverThread only processes one heartbeat at a time, no other heartbeats will be processed during this time.

In addition to this, MulticastRMICacheManagerPeerProvider.registerPeer() is synchronized. This prevents MulticastRMICacheManagerPeerProvider.listRemoteCachePeers() from running, effectively preventing any peers from being notified about cache changes.

Fix:

Made the heartbeat processing multithreaded
When one rmiUri failed, stop processing the rest (make sense for us at least)
Synchronize only peerUrls.put() in MulticastRMICacheManagerPeerProvider.registerPeer()

What do you guys think?

Attached are the patched source files we used to get our caches up and running again.

Sourceforge Ticket ID: 1715492 - Opened By: daniel_wiell - 9 May 2007 07:14 UTC

Comments

Sourceforge Tracker 2009-09-21

Logged In: YES user_id=693320 Originator: NO

Daniel

Fredrik Bertilson reported a similar problem a week or so ago. I have studied the code and agree there is a problem.

I have applied your patch. I changed the JDK1.5 stuff to use backport concurrent to support 1.4 as well. The classes have the same names and behaviour, only the packages changed.

Just running tests now and will commit it. I am looking at a beta 3 release because this is a big (but necessary) change. I have pinged Fredrik to test this too.

Thanks

Greg

Comment by: gregluck - 21 May 2007 10:36 UTC

Fiona OShea 2009-09-22

Re-opening so that I can properly close out these issues and have correct Resolution status in Jira