Clients are not receiving the operations enabled event properly

Type: Bug
Status: Closed
Priority: 2 Major
Resolution: Fixed
Component/s: DSO:L1,DSO:L2
Labels:

Assignee: kkannaiy
Reporter: rsingh
Created: December 08, 2009
Votes: 0
Watchers: 0
Updated: February 12, 2013
Resolved: January 04, 2010

Attachments

pierop (37.00 k) application/zip client-server-logs.zip
rsingh (58.00 k) application/x-zip-compressed postOfficeApp.zip

Description

Attached is the app that reproduces this problem.

Steps to reproduce

Start an active and passive server.
Start 5 clients C0-C4 using the attached app on the same machine.
Kill the active server.
Kill C4 and start a new client C5 while the passive is taking over.
When the passive takes over, all the clients should get the operations enabled event and the connected clients should resume their work. Instead, the cluster freezes.

Comments

Raghvendra Singh 2009-12-08

More discussion of this issue is at http://forums.terracotta.org/forums/posts/list/2775.page

Raghvendra Singh 2009-12-08

It seems the servers are indeed firing the events, but the clients are stuck here:

"WorkerThread(client_coordination_stage, 0)" daemon prio=10 tid=0x00002aab32814400 nid=0x925 in Object.wait() [0x0000000042951000..0x0000000042951aa0]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00002aab0b51adc0> (a com.tc.object.ClusterMetaDataManagerImpl)
    at java.lang.Object.wait(Object.java:485)
    at com.tc.object.ClusterMetaDataManagerImpl.waitUntilRunning(ClusterMetaDataManagerImpl.java:297)
    - locked <0x00002aab0b51adc0> (a com.tc.object.ClusterMetaDataManagerImpl)
    at com.tc.object.ClusterMetaDataManagerImpl.retrieveMetaDataForDsoNode(ClusterMetaDataManagerImpl.java:139)
    at com.tc.cluster.DsoClusterImpl.retrieveMetaDataForDsoNode(DsoClusterImpl.java:247)
    at com.tc.cluster.DsoClusterImpl.fireNodeJoinedInternal(DsoClusterImpl.java:328)
    at com.tc.cluster.DsoClusterImpl.fireNodeJoined(DsoClusterImpl.java:322)
    at com.tc.object.handler.ClientCoordinationHandler.handleClusterMembershipMessage(ClientCoordinationHandler.java:54)
    at com.tc.object.handler.ClientCoordinationHandler.handleEvent(ClientCoordinationHandler.java:30)
    at com.tc.async.impl.StageImpl$WorkerThread.run(StageImpl.java:127)
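The trace shows the single client_coordination_stage worker blocked in waitUntilRunning() while handling a node-joined message. The ticket does not state the root cause. One possible reading, offered purely as an assumption, is that the signal that would wake the waiter is queued behind the blocked handler on the same stage. The sketch below is not Terracotta code: Gate, simulate, and StageStallSketch are hypothetical names, and a single-threaded executor stands in for a stage. A timeout is added so the demo ends instead of hanging.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StageStallSketch {

    // Hypothetical stand-in for a manager that blocks callers until it is "running".
    static final class Gate {
        private boolean running = false;

        // Waits until unpause() is called or the timeout expires.
        // Returns true if the gate opened, false on timeout.
        synchronized boolean waitUntilRunning(long timeoutMillis) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!running) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                wait(remaining);
            }
            return true;
        }

        synchronized void unpause() {
            running = true;
            notifyAll();
        }
    }

    // Runs a "node joined" handler that waits on the gate, then delivers the
    // unpause either on the same single-threaded stage or on another thread.
    // Returns true if the handler saw the gate open before its timeout.
    static boolean simulate(boolean unpauseOnSameStage, long timeoutMillis) throws Exception {
        Gate gate = new Gate();
        ExecutorService stage = Executors.newSingleThreadExecutor(); // one worker, like a stage
        ExecutorService other = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> joined = stage.submit(() -> gate.waitUntilRunning(timeoutMillis));
            Runnable unpause = gate::unpause;
            if (unpauseOnSameStage) {
                stage.submit(unpause); // queued behind the blocked handler
            } else {
                other.submit(unpause); // delivered independently
            }
            return joined.get();
        } finally {
            stage.shutdownNow();
            other.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("unpause on same stage:   " + simulate(true, 200));
        System.out.println("unpause on other thread: " + simulate(false, 2000));
    }
}
```

In the first case the handler times out because the unpause can never run ahead of it. Without a timeout, as in the trace above, the worker would wait indefinitely. In the second case the handler proceeds as soon as the other thread opens the gate.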

Piero Positivo 2009-12-08

Here are the logs from the postOfficeApp. I have run it many times on both Mac OS X and Linux machines, and the problem reproduces on all of them. There are two TC servers, TC1 and TC2, in active-passive mode, and 3 clients. I have included 4 client logs; the fourth is from the client that attempts to join the cluster while the passive takes over, after client 3 has been killed.

Steve Harris 2009-12-08

If this is a bug, we should probably look at it in the darwin timeframe.

Raghvendra Singh 2010-01-04

Fixed in trunk with r14254; merged into 3.2 with r14255.