“Synchronous” JGroups replication is not really synchronous

Type: Bug
Status: Open
Priority: 2 Major
Resolution:
Component/s: ehcache-jgroupsreplication
Labels:

Assignee:
Reporter: mads1980
Created: July 27, 2011
Votes: 1
Watchers: 3
Updated:
May 28, 2013
Resolved:

Attachments

mads1980 (11.00 k) text/plain patch.txt
mads1980 (13.00 k) text/plain patch-3.0.0.txt
mads1980 (31.00 k) application/octet-stream patch-custom-settings-per-cache-3.0.0.diff

Description

Version 1.4 of the ehcache-jgroupsreplication module has synchronous and asynchronous replication modes. But actually, BOTH are asynchronous.

This is the deal: internally, the module buffers replication notifications when operating in asynchronous mode, and send them in bulk every second by default by invoking JGroups’ channel.send() for the whole bulk. When a synchronous notification is required, no bufffering occurs, and channel.send() is invoked directly.

This might seems correct behaviour, but it’s not. The reason is that channel.send() does NOT block until the replication notification is acknowledged by all peers. It simply dispatches the message to all peers, without waiting for the replication to complete.

Some use cases require real 100% synchronous operation. We’re currently working on a custom distributed HttpSession management solution based on EhCache + JGroups, and we need to make sure the puts() are replicated before finishing request processing (the idea is to have distributed sessions with an optimized SessionDelta updater, and have a plain load-balancer in front without any container-specific configuration to deal with).

The solution is fairly simple. JGroups provides a MessageDispatcher building block with a castMessage() method that allows specifying whether we should block until all, some or none of the recipients complete processing of the message. We have patched the code to allow configuring two parameters: syncMode and syncTimeout in order to control this behaviour. If the parameters are not supplied, then we have the current non-blocking, pseudo-synchronous but really asynchronous behaviour.

We would appreciate peer review from the module maintainers, and if this patch is accepted, we’d like to see it in ehcache-jgroupsreplication version 1.5

Comments

Manuel Dominguez Sarmiento 2011-07-27

The attached patch for current version 1.4 implements the proposed solution.

Manuel Dominguez Sarmiento 2011-07-27

Example config:

<cacheManagerPeerProviderFactory class="net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory"
	properties="file=jgroups.xml, syncMode=GET\_ALL, syncTimeout=10000" />

Sync modes are documented in org.jgroups.blocks.Request:

GET_ALL blocks until all peers complete processing
GET_NONE does not block, just like current behaviour using channel.send()
GET_FIRST blocks until the first peer completes
GET_MAJORITY blocks until a majority of the peers (non-faulty members) complete
GET_ABS_MAJORITY blocks until a majority of the peers (including faulty members) complete
GET_N has not been implemented (seems redundant and non-adaptive to peer groups changing the number of participants)

Manuel Dominguez Sarmiento 2011-08-03

This patch does the same as the other one, but applies all necessary changes to be compatible with JGroups 3.0.0 which was just released and contains some minor API changes.

Fiona OShea 2011-08-09

Manuel has signed a contributor agreement (rec’d 05/24/2010)

Manuel Dominguez Sarmiento 2011-08-29

Attached a new patch which allows setting syncMode and syncTimeout per-cache. The general setting per cacheManager is still available (default if no per-cache custom values have been set).