• Type: Bug
• Status: Closed
• Priority: Minor
• Resolution: Fixed
• Assignee: klalithr
• Reporter: allenq
• Created: October 01, 2012
• Watchers: 10
• Updated: October 17, 2012
• Resolved: October 15, 2012

Description

I have 2 Tomcat session servers running in active/standby mode.

There are 14 app servers running Tomcat 6.0 and connecting to the session servers.

Problem: The Tomcat server sometimes crashes and is unable to recover. This has happened in production a few times, and I have been able to reproduce it in our test environment. Load seems to be a contributing factor, but it is not the root cause; sometimes the problem happens when the load is normal.

I believe this has something to do with the Terracotta setup, based on what I found in the TC server and client log files.

Below are some of the errors I found after a load test. The problem started to happen after 13:40 in the log file.

I’ve attached the following info:

  1. TC server log
  2. TC client log
  3. Tomcat thread dump
  4. TC config

Please let me know if you need anything else.

TC Client:

{noformat}
2012-10-01 13:41:02,563 [http-8080-275] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[17]: TransactionID=[879335] : Took more than 1000ms to add to sequencer : 1004 ms
2012-10-01 13:41:02,564 [http-8080-728] INFO com.tc.object.tx.RemoteTransactionManagerImpl - ClientID[17]: TransactionID=[879338] : Took more than 1000ms to add to sequencer : 1012 ms
2012-10-01 13:41:11,397 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 might be in Long GC. Ping-probe cycles completed since last reply : 1
2012-10-01 13:41:15,388 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 might be in Long GC. Ping-probe cycles completed since last reply : 2
2012-10-01 13:41:19,491 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 might be in Long GC. Ping-probe cycles completed since last reply : 3
2012-10-01 13:41:23,472 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 might be in Long GC. Ping-probe cycles completed since last reply : 4
2012-10-01 13:41:27,412 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 might be in Long GC. Ping-probe cycles completed since last reply : 5
2012-10-01 13:41:31,420 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 might be in Long GC. Ping-probe cycles completed since last reply : 6
2012-10-01 13:41:34,210 [ClientLockManager LockGC] INFO com.tc.object.locks.ClientLockManager - ClientID[17]: Lock GC collected 254 garbage locks
2012-10-01 13:41:41,425 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Socket Connect to lt-mobses01.sb.com.au:9510(callbackport:9510) taking long time. probably not reachable.
2012-10-01 13:41:41,425 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - lt-mobses01.sb.com.au:9510 is DEAD
2012-10-01 13:41:41,432 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Client - Declared connection dead ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[] idle time 38812ms
2012-10-01 13:41:41,560 [L1_L2:TCComm Main Selector Thread_W (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[]: CLOSE EVENT : com.tc.net.core.TCConnectionImpl@430312049: connected: false, closed: true local=10.32.4.81:12735 remote=10.32.4.95:9510 connect=[Mon Oct 01 12:19:32 EST 2012] idle=7149ms [627982104 read, 908628612 write]. STATUS : DISCONNECTED
2012-10-01 13:41:41,564 [L1_L2:TCComm Main Selector Thread_W (listen 0.0.0.0:64336)] INFO com.tc.net.protocol.tcm.ClientMessageChannel - ClientMessageChannel moves to SessionID=[1]
2012-10-01 13:41:41,564 [WorkerThread(client_coordination_stage, 0)] INFO com.tc.object.handshakemanager.ClientHandshakeManagerImpl - ClientID[17]: Disconnected: Pausing from State[ RUNNING ] RemoteNode : GroupID[0]. Disconnect count: 0
2012-10-01 13:41:41,564 [WorkerThread(client_coordination_stage, 0)] INFO com.tc.object.handshakemanager.ClientHandshakeManagerImpl - ClientID[17]: ClientHandshakeManager moves to SessionID=[1]
2012-10-01 13:41:41,679 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientConnectionEstablisher$AsyncReconnect - Handling connection request: ConnectionRequest [type=RECONNECT, cmt=com.tc.net.protocol.transport.ClientMessageTransport@32f0c513]
2012-10-01 13:41:41,687 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[]: Reconnect attempt 0 of unlimited reconnect tries to lt-mobses02.sb.com.au:9510:Tc-Group-0, timeout=10000
2012-10-01 13:41:41,695 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[]: Connection refused [lossy interval: 10000ms]
2012-10-01 13:41:45,698 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[]: Attaching new connection: com.tc.net.core.TCConnectionImpl@1318131925: connected: true, closed: false local=10.32.4.81:39188 remote=10.32.4.95:9510 connect=[Mon Oct 01 13:41:45 EST 2012] idle=3001ms [0 read, 0 write]
2012-10-01 13:41:45,775 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Health monitoring agent started for lt-mobses01.sb.com.au:9510
2012-10-01 13:41:45,790 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - HealthCheckCallbackPort verification PASSED for lt-mobses01.sb.com.au:9510(callbackport: 9510)
2012-10-01 13:41:45,821 [WorkerThread(client_coordination_stage, 0)] INFO com.tc.object.handshakemanager.ClientHandshakeManagerImpl - ClientID[17]: Connected: Unpausing from State[ PAUSED ] RemoteNode : GroupID[0]. Disconnect count : 1
2012-10-01 13:41:45,847 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[]: CLOSE EVENT : com.tc.net.core.TCConnectionImpl@1318131925: connected: false, closed: true local=10.32.4.81:39188 remote=10.32.4.95:9510 connect=[Mon Oct 01 13:41:45 EST 2012] idle=23ms [216 read, 352 write]. STATUS : ESTABLISHED
2012-10-01 13:41:45,848 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Client - Connection to [lt-mobses01.sb.com.au:9510] DISCONNECTED. Health Monitoring for this node is now disabled.
2012-10-01 13:41:45,859 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientConnectionEstablisher$AsyncReconnect - Handling connection request: RestoreConnectionRequest [type=RESTORE_CONNECTION, clientMessageTransport=com.tc.net.protocol.transport.ClientMessageTransport@32f0c513, callback=com.tc.net.protocol.delivery.OOOConnectionWatcher@506bda8, timeoutMillis=5000, sa=10.32.4.95:9510]
2012-10-01 13:41:45,870 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[]: Attaching new connection: com.tc.net.core.TCConnectionImpl@2106417963: connected: true, closed: false local=10.32.4.81:39190 remote=10.32.4.95:9510 connect=[Mon Oct 01 13:41:45 EST 2012] idle=11ms [0 read, 0 write]
2012-10-01 13:41:45,960 [L1_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:64336)] ERROR com.tc.net.protocol.transport.TransportHandshakeErrorHandlerForL1 - com.tc.net.protocol.transport.TransportHandshakeErrorContext: com.tc.net.protocol.transport.TransportHandshakeErrorContext: "Client Cannot Reconnect. ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[] not found. Connection attempts from the Terracotta node at 10.32.4.81:39190 are being rejected by the Terracotta server array."
Message Class: com.tc.net.protocol.transport.TransportMessageImpl
Sealed: true, Header Length: 32, Data Length: 426, Total Length: 458
Header (com.tc.net.protocol.transport.WireProtocolHeader)
Version: 2, Header Length: 8, TOS: 0, TTL: 64, Protocol: TRANSPORT HANDSHAKE
Total Packet Length: 458
Adler32 Checksum: 1988298213 (valid: true)
Source Addresss: 10.32.4.95
Destination Addresss: 10.32.4.81
Source Port: 9510, Destination Port: 39190
Total Msg Count: 1
Header Validity: true (no message)
Payload: type: SYN_ACK, connectionId: ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[], errorContext com.tc.net.protocol.transport.TransportHandshakeErrorContext: "Client Cannot Reconnect. ConnectionID(17.c8dd256f25c04e5eabe8bfff0c9ffbea.31348ea2-7275-420a-9e1f-e41b63c9fadc-13a1a1eea42)[] not found. Connection attempts from the Terracotta node at 10.32.4.81:39190 are being rejected by the Terracotta server array."

{noformat}

TC Server:

{noformat}
2012-10-01 13:52:31,852 [L2_L2:TCWorkerComm # 10_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(7.3efdc8a7efb44662af7065f819369af7.e1c35822-4c48-4c11-831a-f8c4a58ec3e5-13a1a1ec057)[]: closing down connection - com.tc.net.core.TCConnectionImpl@845175459: connected: false, closed: true local=10.32.4.95:9530 remote=10.32.4.96:7112 connect=[Mon Oct 01 13:52:31 EST 2012] idle=448ms [136 read, 0 write]
2012-10-01 13:52:31,755 [L2_L2:TCWorkerComm # 6_W] WARN com.tc.net.core.CoreNIOServices - Exception trying to process interest request: java.nio.channels.ClosedChannelException
2012-10-01 13:52:31,880 [L2_L2:TCWorkerComm # 12_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(8.3efdc8a7efb44662af7065f819369af7.e1c35822-4c48-4c11-831a-f8c4a58ec3e5-13a1a1ec057)[]: CLOSE EVENT : com.tc.net.core.TCConnectionImpl@1651010777: connected: false, closed: true local=10.32.4.95:9530 remote=10.32.4.96:7113 connect=[Mon Oct 01 13:52:31 EST 2012] idle=492ms [136 read, 135 write]. STATUS : START
2012-10-01 13:52:31,880 [WorkerThread(group_handshake_message_stage, 0)] WARN tc.operator.event - NODE : lt-mobses01.sb.com.au Subsystem: CLUSTER_TOPOLOGY Message: Node lt-mobses02.sb.com.au left the cluster
2012-10-01 13:52:31,882 [WorkerThread(group_handshake_message_stage, 0)] WARN com.tc.net.groups.TCGroupManagerImpl - Remove dead member from waitFor response list, dead member: NodeID[lt-mobses02.sb.com.au:9510]
2012-10-01 13:52:31,933 [L2_L2:TCWorkerComm # 12_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(8.3efdc8a7efb44662af7065f819369af7.e1c35822-4c48-4c11-831a-f8c4a58ec3e5-13a1a1ec057)[]: closing down connection - com.tc.net.core.TCConnectionImpl@1651010777: connected: false, closed: true local=10.32.4.95:9530 remote=10.32.4.96:7113 connect=[Mon Oct 01 13:52:31 EST 2012] idle=492ms [136 read, 135 write]
2012-10-01 13:52:31,970 [WorkerThread(garbage_collect_stage, 0)] INFO tc.operator.event - NODE : lt-mobses01.sb.com.au Subsystem: DGC Message: DGC[ 24 ] started.
2012-10-01 13:52:31,936 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ConnectionPolicyImpl - Allocated connection license for jvm f78856f5-791e-400e-8757-1393a2f133d1-13a1a1eea1f; ConnectionPolicy[maxConnections=2147483647, connectedJvmCount=1]
2012-10-01 13:52:31,935 [WorkerThread(group_events_dispatch_stage, 0)] WARN com.tc.l2.ha.L2HACoordinator - NodeID[lt-mobses02.sb.com.au:9510] left the cluster
2012-10-01 13:52:31,935 [WorkerThread(group_handshake_message_stage, 0)] WARN com.tc.net.groups.TCGroupManagerImpl - Removed old member Group Member: NodeID[lt-mobses01.sb.com.au:9510] <-> NodeID[lt-mobses02.sb.com.au:9510] ChannelID[NULL_ID, Status:CLOSED]:10.32.4.95:9530 <-> 10.32.4.96:6895; Ready:false; Joined: false; memberAdding:false; HighPri: true for NodeID[lt-mobses02.sb.com.au:9510]
2012-10-01 13:52:31,973 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ConnectionPolicyImpl - New connection [11] from jvm f78856f5-791e-400e-8757-1393a2f133d1-13a1a1eea1f
2012-10-01 13:52:31,990 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ServerStackProvider - User null successfully authenticated
2012-10-01 13:52:31,991 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.core.CoreNIOServices - selection key cancelled key@735607632
2012-10-01 13:52:31,993 [WorkerThread(group_handshake_message_stage, 0)] INFO tc.operator.event - NODE : lt-mobses01.sb.com.au Subsystem: CLUSTER_TOPOLOGY Message: Node lt-mobses02.sb.com.au joined the cluster
2012-10-01 13:52:32,060 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(2.c8dd256f25c04e5eabe8bfff0c9ffbea.88dd022a-04b0-4e0d-b9ec-c2c785ced97c-13a1a1eea1e)[]: Attaching new connection: com.tc.net.core.TCConnectionImpl@1712347668: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.124:22752 connect=[Mon Oct 01 13:52:31 EST 2012] idle=68ms [0 read, 0 write]
2012-10-01 13:52:31,973 [WorkerThread(group_events_dispatch_stage, 0)] WARN com.terracottatech.console - NodeID[lt-mobses02.sb.com.au:9510] left the cluster
2012-10-01 13:52:32,092 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client NodeID[lt-mobses02.sb.com.au:9510]
2012-10-01 13:52:32,093 [WorkerThread(l2_state_message_handler_stage, 0)] ERROR com.tc.l2.state.StateManagerImpl - State[ ACTIVE-COORDINATOR ] Received Election Won Msg : L2StateMessage [ NodeID[lt-mobses02.sb.com.au:9510], type = ELECTION_WON_ALREADY, Enrollment [ NodeID[lt-mobses02.sb.com.au:9510], isNew = false, weights = 9223372036854775807,9223372036854775807 ]]. A Terracotta server tried to join the mirror group as a second ACTIVE
2012-10-01 13:52:32,093 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.l2.ha.ReplicatedClusterStateManagerImpl - Recd wrong response from : NodeID[lt-mobses02.sb.com.au:9510] : msg = com.tc.l2.msg.ClusterStateMessage@490cb88d while publishing Cluster State: Killing the node
2012-10-01 13:52:32,093 [WorkerThread(group_discovery_stage, 3, 0)] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(-1.ffffffffffffffffffffffffffffffff.92f6a363-81ef-4ab7-955c-9bf242cd6c15-13a1a1dd3fe)[]: Attaching new connection: com.tc.net.core.TCConnectionImpl@618420666: connected: true, closed: false local=10.32.4.95:31166 remote=10.32.4.96:9530 connect=[Mon Oct 01 13:52:32 EST 2012] idle=156ms [0 read, 0 write]
2012-10-01 13:52:32,094 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.db.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
2012-10-01 13:52:32,135 [L2_L1:TCWorkerComm # 2_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(11.c8dd256f25c04e5eabe8bfff0c9ffbea.f78856f5-791e-400e-8757-1393a2f133d1-13a1a1eea1f)[]: CLOSE EVENT : com.tc.net.core.TCConnectionImpl@1874829407: connected: false, closed: true local=10.32.4.95:9510 remote=10.32.4.85:43507 connect=[Mon Oct 01 13:52:31 EST 2012] idle=144ms [136 read, 0 write]. STATUS : START
2012-10-01 13:52:32,136 [L2_L1:TCWorkerComm # 2_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(11.c8dd256f25c04e5eabe8bfff0c9ffbea.f78856f5-791e-400e-8757-1393a2f133d1-13a1a1eea1f)[]: closing down connection - com.tc.net.core.TCConnectionImpl@1874829407: connected: false, closed: true local=10.32.4.95:9510 remote=10.32.4.85:43507 connect=[Mon Oct 01 13:52:31 EST 2012] idle=145ms [136 read, 0 write]
2012-10-01 13:52:32,136 [L2_L1:TCWorkerComm # 2_W] WARN com.tc.net.core.CoreNIOServices - Exception trying to process interest request: java.nio.channels.ClosedChannelException
2012-10-01 13:52:32,136 [WorkerThread(group_discovery_stage, 3, 0)] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - Health monitoring agent started for lt-mobses02.sb.com.au:9530
2012-10-01 13:52:32,136 [WorkerThread(receive_group_message_stage, 0)] WARN com.tc.l2.ha.L2HAZapNodeRequestProcessor - State[ ACTIVE-COORDINATOR ] received Zap Node request from another State[ ACTIVE-COORDINATOR ] NodeID : NodeID[lt-mobses02.sb.com.au:9510] Error Type : Two or more Active servers detected in the cluster Details : Recd ClusterStateMessage from : NodeID[lt-mobses01.sb.com.au:9510] while in ACTIVE-COORDINATOR state Exception : java.lang.Throwable
    at com.tc.l2.ha.ReplicatedClusterStateManagerImpl.handleClusterStateMessage(ReplicatedClusterStateManagerImpl.java:155)
    at com.tc.l2.ha.ReplicatedClusterStateManagerImpl.messageReceived(ReplicatedClusterStateManagerImpl.java:143)
    at com.tc.net.groups.TCGroupManagerImpl.fireMessageReceivedEvent(TCGroupManagerImpl.java:721)
    at com.tc.net.groups.TCGroupManagerImpl.messageReceived(TCGroupManagerImpl.java:681)
    at com.tc.objectserver.handler.ReceiveGroupMessageHandler.handleEvent(ReceiveGroupMessageHandler.java:22)
    at com.tc.async.impl.StageImpl$WorkerThread.run(StageImpl.java:145)
2012-10-01 13:52:32,812 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.terracottatech.console - Requesting node to quit : NodeID : NodeID[lt-mobses02.sb.com.au:9510] Error Type : Two or more Active servers detected in the cluster
2012-10-01 13:52:32,812 [WorkerThread(l2_state_message_handler_stage, 0)] WARN com.terracottatech.console - Requesting node to quit : NodeID : NodeID[lt-mobses02.sb.com.au:9510] Error Type : Two or more Active servers detected in the cluster
2012-10-01 13:52:33,002 [WorkerThread(group_events_dispatch_stage, 0)] INFO com.tc.l2.ha.L2HACoordinator - NodeID[lt-mobses02.sb.com.au:9510] joined the cluster
2012-10-01 13:52:33,317 [L2_L2:TCComm Main Selector Thread_R (listen 0.0.0.0:9530)] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. TCGroupManager - HealthCheckCallbackPort verification PASSED for lt-mobses02.sb.com.au:9530(callbackport: 9530)
2012-10-01 13:52:33,003 [WorkerThread(group_discovery_stage, 3, 0)] WARN com.tc.net.groups.TCGroupMemberDiscoveryStatic - DiscoverStateMachine [Node{host=lt-mobses02.sb.com.au:9510}]: Ignored switching state from: Connecting, to: Connected, current: Member-In-Group
2012-10-01 13:52:33,363 [WorkerThread(group_events_dispatch_stage, 0)] INFO com.terracottatech.console - NodeID[lt-mobses02.sb.com.au:9510] joined the cluster
2012-10-01 13:52:33,621 [WorkerThread(group_handshake_message_stage, 0)] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: TCGroupManager - Connection to [lt-mobses02.sb.com.au:9530] CLOSED. Health Monitoring for this node is now disabled.
2012-10-01 13:52:33,622 [WorkerThread(group_handshake_message_stage, 0)] WARN com.tc.net.groups.TCGroupMemberDiscoveryStatic - DiscoverStateMachine [Node{host=lt-mobses02.sb.com.au:9510}]: Ignored switching state from: Connecting, to: Init, current: Member-In-Group
2012-10-01 13:52:33,622 [WorkerThread(group_handshake_message_stage, 0)] WARN com.tc.net.groups.TCGroupMemberDiscoveryStatic - DiscoverStateMachine [Node{host=lt-mobses02.sb.com.au:9510}]: Ignored switching state from: Connected, to: Init, current: Member-In-Group
2012-10-01 13:52:33,745 [WorkerThread(receive_group_message_stage, 0)] INFO com.tc.net.core.TCConnectionManager - com.tc.net.core.TCConnectionImpl@835486647: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.131:57157 connect=[Mon Oct 01 13:52:31 EST 2012] idle=1753ms [0 read, 0 write] is not active; Max allowed Idle time:8000; Transport Established: false
2012-10-01 13:52:33,745 [WorkerThread(receive_group_message_stage, 0)] INFO com.tc.net.core.TCConnectionManager - com.tc.net.core.TCConnectionImpl@1712347668: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.124:22752 connect=[Mon Oct 01 13:52:31 EST 2012] idle=1753ms [0 read, 0 write] is not active; Max allowed Idle time:8000; Transport Established: false
2012-10-01 13:52:33,799 [WorkerThread(receive_group_message_stage, 0)] INFO com.tc.net.core.TCConnectionManager - Active connections : 0 out of 2
2012-10-01 13:52:33,800 [WorkerThread(receive_group_message_stage, 0)] WARN com.tc.l2.ha.L2HAZapNodeRequestProcessor - A Terracotta server tried to join the mirror group as a second ACTIVE : My weights = 0,-3576867000,12980939,11132628575526,8778404219038062035 Other servers weights = 7,-42000,227168,11132628641062,-2552508849076222291
2012-10-01 13:52:33,800 [WorkerThread(l2_state_message_handler_stage, 0)] INFO com.tc.net.core.TCConnectionManager - com.tc.net.core.TCConnectionImpl@835486647: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.131:57157 connect=[Mon Oct 01 13:52:31 EST 2012] idle=1808ms [0 read, 0 write] is not active; Max allowed Idle time:8000; Transport Established: false
2012-10-01 13:52:33,800 [WorkerThread(receive_group_message_stage, 0)] FATAL tc.operator.event - NODE : lt-mobses01.sb.com.au Subsystem: CLUSTER_TOPOLOGY Message: SPLIT BRAIN, lt-mobses01.sb.com.au and lt-mobses02.sb.com.au are ACTIVE
2012-10-01 13:52:33,800 [WorkerThread(l2_state_message_handler_stage, 0)] INFO com.tc.net.core.TCConnectionManager - com.tc.net.core.TCConnectionImpl@1712347668: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.124:22752 connect=[Mon Oct 01 13:52:31 EST 2012] idle=1808ms [0 read, 0 write] is not active; Max allowed Idle time:8000; Transport Established: false
2012-10-01 13:52:33,800 [WorkerThread(receive_group_message_stage, 0)] WARN com.tc.l2.ha.L2HAZapNodeRequestProcessor - NodeID[lt-mobses02.sb.com.au:9510] wins : Backing off : Exiting !!!
2012-10-01 13:52:33,800 [WorkerThread(l2_state_message_handler_stage, 0)] INFO com.tc.net.core.TCConnectionManager - Active connections : 0 out of 2
2012-10-01 13:52:33,801 [WorkerThread(receive_group_message_stage, 0)] FATAL tc.operator.event - NODE : lt-mobses01.sb.com.au Subsystem: CLUSTER_TOPOLOGY Message: NodeID[lt-mobses02.sb.com.au:9510] has more clients. Exiting!!
2012-10-01 13:52:33,801 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.core.TCConnectionManager - com.tc.net.core.TCConnectionImpl@835486647: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.131:57157 connect=[Mon Oct 01 13:52:31 EST 2012] idle=1809ms [0 read, 0 write] is not active; Max allowed Idle time:8000; Transport Established: false
2012-10-01 13:52:33,802 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.core.TCConnectionManager - com.tc.net.core.TCConnectionImpl@1712347668: connected: true, closed: false local=10.32.4.95:9510 remote=10.32.4.124:22752 connect=[Mon Oct 01 13:52:31 EST 2012] idle=1810ms [0 read, 0 write] is not active; Max allowed Idle time:8000; Transport Established: false
2012-10-01 13:52:33,802 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.core.TCConnectionManager - Active connections : 0 out of 2
2012-10-01 13:52:33,801 [WorkerThread(l2_state_message_handler_stage, 0)] WARN com.tc.net.groups.TCGroupManagerImpl - Zapping node : NodeID[lt-mobses02.sb.com.au:9510] type = 255 reason = State[ ACTIVE-COORDINATOR ] Received Election Won Msg : L2StateMessage [ NodeID[lt-mobses02.sb.com.au:9510], type = ELECTION_WON_ALREADY, Enrollment [ NodeID[lt-mobses02.sb.com.au:9510], isNew = false, weights = 9223372036854775807,9223372036854775807 ]]. A Terracotta server tried to join the mirror group as a second ACTIVE my weight = [0, -3578404000, 12980939, 11132628575526, 4964923724550007331]
2012-10-01 13:52:33,802 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.groups.TCGroupManagerImpl - Zapping node : NodeID[lt-mobses02.sb.com.au:9510] type = 255 reason = Recd wrong response from : NodeID[lt-mobses02.sb.com.au:9510] while publishing Cluster State Exception : java.lang.Throwable
    at com.tc.l2.ha.ReplicatedClusterStateManagerImpl.validateResponse(ReplicatedClusterStateManagerImpl.java:89)
    at com.tc.l2.ha.ReplicatedClusterStateManagerImpl.publishToAll(ReplicatedClusterStateManagerImpl.java:132)
    at com.tc.l2.ha.ReplicatedClusterStateManagerImpl.connectionIDCreated(ReplicatedClusterStateManagerImpl.java:118)
    at com.tc.objectserver.persistence.db.ConnectionIDFactoryImpl.fireCreationEvent(ConnectionIDFactoryImpl.java:71)
    at com.tc.objectserver.persistence.db.ConnectionIDFactoryImpl.restoreConnectionId(ConnectionIDFactoryImpl.java:65)
    at com.tc.net.protocol.transport.ServerStackProvider.attachNewConnection(ServerStackProvider.java:124)
    at com.tc.net.protocol.transport.ServerStackProvider$MessageSink.handleSyn(ServerStackProvider.java:301)
    at com.tc.net.protocol.transport.ServerStackProvider$MessageSink.verifyAndHandleSyn(ServerStackProvider.java:245)
    at com.tc.net.protocol.transport.ServerStackProvider$MessageSink.putMessage(ServerStackProvider.java:226)
    at com.tc.net.protocol.transport.WireProtocolAdaptorImpl.addReadData(WireProtocolAdaptorImpl.java:54)
    at com.tc.net.protocol.ProtocolSwitch.addReadData(ProtocolSwitch.java:50)
    at com.tc.net.core.TCConnectionImpl.addNetworkData(TCConnectionImpl.java:777)
    at com.tc.net.core.TCConnectionImpl.doReadFromBufferInternal(TCConnectionImpl.java:463)
    at com.tc.net.core.TCConnectionImpl.doReadFromBuffer(TCConnectionImpl.java:306)
    at com.tc.net.core.TCConnectionImpl.doReadInternal(TCConnectionImpl.java:290)
    at com.tc.net.core.TCConnectionImpl.doRead(TCConnectionImpl.java:266)
    at com.tc.net.core.CoreNIOServices$CommThread.selectLoop(CoreNIOServices.java:624)
    at com.tc.net.core.CoreNIOServices$CommThread.run(CoreNIOServices.java:290)
my weight = [0, -3579675000, 12980939, 11132628575526, 359689043280585659]
2012-10-01 13:52:33,946 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ConnectionPolicyImpl - Allocated connection license for jvm 88dd022a-04b0-4e0d-b9ec-c2c785ced97c-13a1a1eea1e; ConnectionPolicy[maxConnections=2147483647, connectedJvmCount=2]
2012-10-01 13:52:33,947 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ConnectionPolicyImpl - New connection [2] from jvm 88dd022a-04b0-4e0d-b9ec-c2c785ced97c-13a1a1eea1e
2012-10-01 13:52:33,947 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.protocol.transport.ServerStackProvider - User null successfully authenticated
2012-10-01 13:52:33,947 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.core.TCWorkerCommManager - Selecting [L2_L1:TCWorkerComm # 3, FD, wt:0] from [[L2_L1:TCWorkerComm # 0, FD, wt:0], [L2_L1:TCWorkerComm # 1, FD, wt:0], [L2_L1:TCWorkerComm # 2, FD, wt:0], [L2_L1:TCWorkerComm # 3, FD, wt:0], [L2_L1:TCWorkerComm # 4, FD, wt:0], [L2_L1:TCWorkerComm # 5, FD, wt:0], [L2_L1:TCWorkerComm # 6, FD, wt:0], [L2_L1:TCWorkerComm # 7, FD, wt:0], [L2_L1:TCWorkerComm # 8, FD, wt:0], [L2_L1:TCWorkerComm # 9, FD, wt:0], [L2_L1:TCWorkerComm # 10, FD, wt:0], [L2_L1:TCWorkerComm # 11, FD, wt:0], [L2_L1:TCWorkerComm # 12, FD, wt:0], [L2_L1:TCWorkerComm # 13, FD, wt:0], [L2_L1:TCWorkerComm # 14, FD, wt:0], [L2_L1:TCWorkerComm # 15, FD, wt:0]]
2012-10-01 13:52:33,947 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] INFO com.tc.net.core.CoreNIOServices - selection key cancelled key@998902629
{noformat}

When I had a look at the Tomcat thread dump, it seems all the threads are in a BLOCKED state.

{noformat}
Thread 28127: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Interpreted frame)
  • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Interpreted frame)
  • com.tc.object.locks.ClientLockManagerImpl.waitUntilRunning() @bci=98, line=636 (Interpreted frame)
  • com.tc.object.locks.ClientLockManagerImpl.lock(com.tc.object.locks.LockID, com.tc.object.locks.LockLevel) @bci=1, line=93 (Compiled frame)
  • com.tc.object.bytecode.ManagerImpl.lock(com.tc.object.locks.LockID, com.tc.object.locks.LockLevel) @bci=14, line=869 (Compiled frame)
  • com.tc.object.bytecode.ManagerUtil.beginLock(java.lang.String, int) @bci=18, line=221 (Compiled frame)
  • com.tc.object.bytecode.ManagerUtil.beginLockWithoutTxn(java.lang.String, int) @bci=2, line=229 (Compiled frame)
  • com.terracotta.session.util.Lock.getReadLock() @bci=5, line=48 (Compiled frame)
  • com.terracotta.session.util.DefaultSessionId.getSessionInvalidatorReadLock() @bci=12, line=83 (Compiled frame)
  • com.terracotta.session.SessionDataStore.find(com.terracotta.session.SessionId) @bci=15, line=140 (Compiled frame)
  • com.terracotta.session.TerracottaSessionManager.getSessionIfExists(com.terracotta.session.SessionId, javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=46, line=579 (Compiled frame)
  • com.terracotta.session.SessionRequest.getTerracottaSession(boolean) @bci=55, line=150 (Compiled frame)
  • com.terracotta.session.SessionRequest.getSession(boolean) @bci=2, line=109 (Compiled frame)
  • org.springframework.web.context.request.ServletRequestAttributes.&lt;init&gt;(javax.servlet.http.HttpServletRequest) @bci=29, line=81 (Compiled frame)
  • org.springframework.web.servlet.DispatcherServlet.doDispatch(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=35, line=825 (Compiled frame)
  • org.springframework.web.servlet.DispatcherServlet.doService(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=241, line=793 (Compiled frame)
  • org.springframework.web.servlet.FrameworkServlet.processRequest(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=10, line=476 (Compiled frame)
  • org.springframework.web.servlet.FrameworkServlet.doGet(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=3, line=431 (Compiled frame)
  • javax.servlet.http.HttpServlet.service(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) @bci=35, line=617 (Compiled frame)
  • javax.servlet.http.HttpServlet.service(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=30, line=717 (Compiled frame)
  • org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=376, line=290 (Compiled frame)
  • org.apache.catalina.core.ApplicationFilterChain.doFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=101, line=206 (Compiled frame)
  • com.mobenga.core.filter.DebugFilter.doFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse, javax.servlet.FilterChain) @bci=24, line=36 (Compiled frame)
  • org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=119, line=235 (Compiled frame)
  • org.apache.catalina.core.ApplicationFilterChain.doFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=101, line=206 (Compiled frame)
  • org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, javax.servlet.FilterChain) @bci=59, line=96 (Compiled frame)
  • org.springframework.web.filter.OncePerRequestFilter.doFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse, javax.servlet.FilterChain) @bci=90, line=75 (Compiled frame)
  • org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(javax.servlet.Filter, javax.servlet.ServletRequest, javax.servlet.ServletResponse, javax.servlet.FilterChain) @bci=5, line=183 (Compiled frame)
  • org.springframework.web.filter.DelegatingFilterProxy.doFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse, javax.servlet.FilterChain) @bci=34, line=138 (Compiled frame)
  • org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=119, line=235 (Compiled frame)
  • org.apache.catalina.core.ApplicationFilterChain.doFilter(javax.servlet.ServletRequest, javax.servlet.ServletResponse) @bci=101, line=206 (Compiled frame)
  • org.apache.catalina.core.StandardWrapperValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=804, line=233 (Compiled frame)
  • org.apache.catalina.core.StandardContextValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=365, line=191 (Compiled frame)
  • org.terracotta.modules.tomcat.tomcat_5_5.SessionValve55.tcInvoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=89, line=100 (Compiled frame)
  • org.terracotta.modules.tomcat.tomcat_5_5.SessionValve55.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=24, line=85 (Compiled frame)
  • org.terracotta.session.ModernTomcatSessionValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=6, line=66 (Compiled frame)
  • org.apache.catalina.core.StandardHostValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=64, line=127 (Compiled frame)
  • org.apache.catalina.valves.ErrorReportValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=6, line=102 (Compiled frame)
  • org.apache.catalina.core.StandardEngineValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=42, line=109 (Compiled frame)
  • org.apache.catalina.valves.AccessLogValve.invoke(org.apache.catalina.connector.Request, org.apache.catalina.connector.Response) @bci=24, line=615 (Compiled frame)
  • org.apache.catalina.connector.CoyoteAdapter.service(org.apache.coyote.Request, org.apache.coyote.Response) @bci=158, line=293 (Compiled frame)
  • org.apache.coyote.http11.Http11Processor.process(java.net.Socket) @bci=514, line=859 (Compiled frame)
  • org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(java.net.Socket) @bci=82, line=602 (Compiled frame)
  • org.apache.tomcat.util.net.JIoEndpoint$Worker.run() @bci=41, line=489 (Compiled frame)
  • java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)
{noformat}

Can someone please take a look? Thank you very much.

Comments

Fiona OShea 2012-10-01

Is this a feature request?

Tim Eck 2012-10-01

I don't see that he is asking for a feature, although a non-stop and/or rejoin feature for sessions could be useful to this user.

The thing I don't quite understand is why L1 reconnect is applicable here. The client killed its own connection due to the health check. Does that bypass reconnect?

Sending over to platform as I don’t think this is session related.

Nishant Bangarwa 2012-10-04

It's a split-brain scenario:

2012-10-01 13:52:32,136 [WorkerThread(receive_group_message_stage, 0)] WARN com.tc.l2.ha.L2HAZapNodeRequestProcessor - State[ ACTIVE-COORDINATOR ] received Zap Node request from another State[ ACTIVE-COORDINATOR ]
NodeID : NodeID[lt-mobses02.sb.com.au:9510] Error Type : Two or more Active servers detected in the cluster Details : Recd ClusterStateMessage from : NodeID[lt-mobses01.sb.com.au:9510] while in ACTIVE-COORDINATOR state
Exception : java.lang.Throwable

Since we do not have rejoin functionality for sessions, the clients got stuck after one of the server nodes got zapped, and they keep trying to reconnect.

One possible reason for the split brain is that he is running with high load, which caused long GC pauses (see the health-checker sketch below for why a long pause leads to a node being zapped).
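For context on how a long GC pause turns into a zap: the server-to-server health checker tolerates only a bounded window of silence before declaring the peer dead, and that window is derived from the l2.healthcheck.l2.* settings in tc.properties. The sketch below is illustrative only; the property names and defaults should be verified against the tc.properties reference for this release, and the values shown are examples, not recommendations:

    <!-- Hypothetical tc.properties overrides (e.g. via <tc-properties> in tc-config.xml);
         verify names and defaults for your release. -->
    <property name="l2.healthcheck.l2.ping.idletime" value="5000" />
    <property name="l2.healthcheck.l2.ping.interval" value="1000" />
    <property name="l2.healthcheck.l2.ping.probes" value="3" />
    <property name="l2.healthcheck.l2.socketConnectTimeout" value="5" />
    <property name="l2.healthcheck.l2.socketConnectCount" value="10" />

Roughly, the longest pause a peer survives is ping.idletime plus socketConnectCount cycles of (socketConnectTimeout + ping.probes) x ping.interval. A 140-second pause exceeds any sensible setting of these values, so the paused active was presumably declared dead, the passive took over, and when the pause ended both nodes were ACTIVE.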

Nishant Bangarwa 2012-10-09

Long GC seems to have caused the split brain. There is a GC pause of 140 seconds at the same time the split brain happened, and we have traces of the long GC in the logs.

From the server logs:

2012-10-01 13:52:30,060 [TC Memory Monitor] WARN com.tc.runtime.logging.LongGCLogger - Detected long GC>8,000ms. GC count:1. GC Time:140,725ms. Frequent long GC cycles cause severe performance degradation. Use Terracotta BigMemory to eliminate them.

Can someone from Field help him tune this?

Fiona OShea 2012-10-09

Send to Field? Or ask the user to add a forum post?

Karthik Lalithraj 2012-10-15

A few things:

a) First, you are using DSO mode. I would strongly recommend moving to Express mode, as it has more scaling options (see the sketch below).
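For what it's worth, Express mode for Tomcat 6 web sessions is mostly a client-side packaging change: the terracotta-session jar plus a valve in the webapp's context.xml, with no DSO instrumentation. A minimal sketch, assuming the valve class shipped with the express sessions kit for Tomcat 6 (verify the exact class name and attribute against the docs for your release; the hosts are this cluster's session servers):

    <!-- context.xml of the web application (express-mode sketch; valve class name to be verified) -->
    <Context>
      <Valve className="org.terracotta.session.TerracottaTomcat60xSessionValve"
             tcConfigUrl="lt-mobses01.sb.com.au:9510,lt-mobses02.sb.com.au:9510"/>
    </Context>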

b) You have numerous custom Berkeley DB tuning options. I am not sure why these are present; I would suggest removing them. In fact, as far as I can tell, you can remove ALL of your custom tc properties:

    <property name="l2.berkeleydb.je.cleaner.bytesInterval" value="20000000" />
    <property name="l2.berkeleydb.je.checkpointer.bytesInterval" value="20000000" />
    <property name="l2.berkeleydb.je.cleaner.lookAheadCacheSize" value="65536" />
    <property name="l2.berkeleydb.je.cleaner.minAge" value="1" />
    <property name="l2.berkeleydb.je.cleaner.maxBatchFiles" value="100" />
    <property name="l2.berkeleydb.je.cleaner.rmwFix" value="false" />
    <property name="l2.berkeleydb.je.cleaner.threads" value="8" />
    <property name="l2.objectmanager.deleteBatchSize" value="40000" />
    <property name="l2.objectmanager.loadObjectID.checkpoint.maxlimit" value="4000000" />

c) You have different client reconnect windows for the active and the passive. You need to change these back to the defaults (see the sketch below).
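For reference, this setting lives per-server in tc-config.xml, and the usual default is 120 seconds. A sketch assuming the 3.x schema (the name attributes are placeholders for whatever the servers are actually called; verify the default for your release):

    <servers>
      <server host="lt-mobses01.sb.com.au" name="server1">
        <dso>
          <client-reconnect-window>120</client-reconnect-window>
        </dso>
      </server>
      <server host="lt-mobses02.sb.com.au" name="server2">
        <dso>
          <client-reconnect-window>120</client-reconnect-window>
        </dso>
      </server>
    </servers>

Both servers should carry the same value so failover behaves the same in either direction.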

d) Another recommendation is to declare an explicit mirror group so that your configuration remains clean (see the sketch below).
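Something along these lines inside the &lt;servers&gt; element of tc-config.xml (a sketch only; the group name is arbitrary, and the &lt;member&gt; entries must match the name attributes of the &lt;server&gt; elements):

    <!-- Hypothetical mirror-group declaration; member names are placeholders. -->
    <mirror-groups>
      <mirror-group group-name="session-group">
        <members>
          <member>server1</member>
          <member>server2</member>
        </members>
      </mirror-group>
    </mirror-groups>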

e) You are using a 32768m heap; that's a 32 GB heap. Needless to say, this will cause huge GC pauses and disconnect issues. You should either consider reducing the heap to 2-3 GB and adding more stripes (enterprise), or adding BigMemory (also enterprise) to avoid GC issues.

f) The disconnect issues you are seeing (and the client sessions being rejected by the TSA) are due to your very large heap and the GC pauses associated with it.

For additional technical support please contact [email protected] and someone should be able to assist you.

Karthik Lalithraj 2012-10-15

Advised customer to tune the TSA and move to Express mode.

Allen Qin 2012-10-16

Thank you very much for the reply.

I will talk to our devs about implementing the recommended changes.