• Type: Bug
  • Status: Closed
  • Priority: 1 Critical
  • Resolution: Fixed
  • Assignee: hhuynh
  • Reporter: tgautier
  • Created: October 29, 2007
  • Votes: 0
  • Watchers: 2
  • Updated: February 18, 2008
  • Resolved: January 02, 2008

Description

Submitted on behalf of: James Heanly at Shine Technologies.

We currently have only one client using the application, of which Terracotta forms a core component. Recently, they have been experiencing some memory problems with the Terracotta server JVM. The problem appears to be that, for some reason, the Terracotta server JVM is not freeing memory; eventually the JVM exhausts its available memory and the server crashes (all currently running clients therefore lose their connection to the server).

Now, I’m not entirely sure whether you are the right person to help me, or whether I should be posting my issue on the Terracotta forum (please let me know if there is a more correct mechanism for having my queries answered).

I have attached a document that shows the memory usage (via JConsole) for the Terracotta server JVM. In my simple test case, I have the Terracotta server running along with only one client. Now, the client is not permanently connected to the server. We have a shell script that starts the client (via cron) every minute. The client connects to the Terracotta server via access to the root defined in the tc-config.xml file. It then checks a database table to see if it has any work to do (which in my test case it doesn’t) and completes (the client JVM terminates).
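
For readers without the attached sample, here is a minimal sketch of what such a run-once client can look like under Terracotta DSO. The class and field names are illustrative assumptions, not taken from the attached code; the only Terracotta-specific part is that the field would be declared as a root (with a matching autolock) in tc-config.xml.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical run-once client. Started by cron once a minute; it
    // connects to the Terracotta server on first access to the root,
    // checks for work, then exits, which the server sees as a disconnect.
    public class WorkerClient {
        // Assumed to be declared as a DSO root in tc-config.xml, so this
        // map is shared cluster-wide and faulted in from the server.
        private static final Map<String, Long> workRoot = new HashMap<String, Long>();

        public static void main(String[] args) {
            synchronized (workRoot) { // autolocked per the tc-config lock section
                workRoot.put("lastRun", Long.valueOf(System.currentTimeMillis()));
            }
            if (hasPendingWork()) {
                // process the queued work here
            }
            // JVM terminates; the server must clean up the departed client.
        }

        private static boolean hasPendingWork() {
            // Stands in for the database-table check; in the reported test
            // case there is never any work to do.
            return false;
        }
    }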

I ran this test scenario for approximately one hour, over which time the memory usage of the Terracotta server steadily grew. I then stopped any further execution of the client, but for some reason the server's memory usage continued to grow.

This is where my problem lies. I was expecting the memory usage of the server to at least plateau, or preferably, return to a level somewhere around the usage at the startup of the server. What am I doing wrong? Do I need to tune the Terracotta server garbage collection? Is there some means of more cleanly disconnecting a client from the server (i.e. in our case, the server does not need to check for reconnecting clients; each client that runs is effectively treated as a new client)?
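
For reference, the Terracotta distributed garbage collector (DGC), which reclaims shared objects no longer referenced by any client, is tuned in the server section of tc-config.xml. A sketch of the relevant fragment is below; the host and interval values are illustrative assumptions, not recommendations.

    <servers>
      <server host="localhost">
        <dso>
          <garbage-collection>
            <enabled>true</enabled>
            <verbose>true</verbose>    <!-- log each DGC pass -->
            <interval>300</interval>   <!-- seconds between DGC passes -->
          </garbage-collection>
        </dso>
      </server>
    </servers>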

Any help you can provide would be greatly appreciated.

By the way, my test case was run against the latest version of Terracotta (2.4.5). I can supply source code if required.

Comments

Fiona OShea 2007-10-30

Work with Hung to get a test to recreate this.

Taylor Gautier 2007-11-02

Sorry for the delay getting back to you. I don’t think there is anything specific that we need with respect to a heap dump, but I have asked our field engineering and engineering team leads to let me know if there is anything in particular we need.

We look forward to getting a snapshot from you to analyze. If you have found anything further on your end, please let me know.

You should just attach the heap dump to this bug using the file attachment procedure.

Saravanan Subbiah 2007-11-06

Our test that simulates this use case doesn’t produce this error. Unless we get either the original application or at least a heap dump on OOME, there is nothing else we can do here.

James Heanly 2007-11-19

I have attached some sample code, as well as a chart showing the memory usage of the Terracotta server over an extended period of time.

The attachment also includes some notes on the process (see README.txt).

orion 2007-11-30

Now that we have more info, please look into this.

Also, close the CRT issue.

Hung Huynh 2007-12-01

I’ve set up a similar scenario with one client spawning every minute.

A heap dump of only live objects will be taken every 30 minutes and dropped here: http://fileserver/ftp/pub/users/hhuynh/dumps/

For the first hour, I noticed an 8MB increase in heap size.

The test is still running.


After 18 hours, here’s some preliminary data:

First heap snapshot, taken at 6PM NOV-30: 15.5MB (attachment first-heap.png)

Second heap snapshot, taken at 12PM DEC-01: 86.9MB (attachment second-heap.png)

Check out the delta of the two heaps with attachment delta-in-18-hours.png

Clearly there’s a memory leak: the count of com.tc.stats.counter.sampled.TimeStampedCounterValue objects keeps increasing as the test runs.

Hung Huynh 2007-12-02

Server heap grew to 138MB after 30 hours, and the server crashed with:

java.lang.Error: Untranslated exception
    at sun.nio.ch.Net.translateToSocketException(Net.java:65)
    at sun.nio.ch.SocketAdaptor.close(SocketAdaptor.java:354)
    at com.tc.server.TerracottaConnector$SocketWrapper.close(TerracottaConnector.java:67)
    at org.mortbay.io.bio.SocketEndPoint.close(SocketEndPoint.java:59)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:214)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:475)
Caused by: java.io.IOException: Bad file descriptor
    at sun.nio.ch.FileDispatcher.close0(Native Method)
    at sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:37)
    at sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:725)
    at sun.nio.ch.SocketChannelImpl.implCloseSelectableChannel(SocketChannelImpl.java:707)
    at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:201)
    at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:97)
    at sun.nio.ch.SocketAdaptor.close(SocketAdaptor.java:352)
    … 4 more

Saravanan Subbiah 2007-12-03

After analyzing the heap dumps, we found that ServerMessageTransportImpl is leaking. Further analysis revealed that on every client disconnect we throw an event into a JMX stage for cleanup, and the thread servicing that stage is slow in responding to cleanups: it is stuck in JMX code trying to connect to some server. The theory is that if that thread takes more time to clean up than the time between a client disconnect and the next client connect, we will end up with an OOME.

"WorkerThread(jmxremote_connect_stage,0)" daemon prio=10 tid=0x9fe21c00 nid=0x502 in Object.wait() [0x9f5c4000..0x9f5c4ed0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xa3018298> (a com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$ResponseMsgWrapper)
    at com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl.sendWithReturn(ClientSynchroMessageConnectionImpl.java:240)
    - locked <0xa3018298> (a com.sun.jmx.remote.generic.ClientSynchroMessageConnectionImpl$ResponseMsgWrapper)
    at javax.management.remote.generic.ClientIntermediary.mBeanServerRequest(ClientIntermediary.java:975)
    at javax.management.remote.generic.ClientIntermediary.mBeanServerRequest(ClientIntermediary.java:957)
    at javax.management.remote.generic.ClientIntermediary.getMBeanInfo(ClientIntermediary.java:537)
    at javax.management.remote.generic.GenericConnector$RemoteMBeanServerConnection.getMBeanInfo(GenericConnector.java:597)
    at javax.management.MBeanServerInvocationHandler.invokeBroadcasterMethod(MBeanServerInvocationHandler.java:388)
    at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:239)
    at $Proxy1.getNotificationInfo(Unknown Source)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.findNotifications(MBeanIntrospector.java:415)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.getMBeanInfo(MBeanIntrospector.java:362)
    at com.sun.jmx.mbeanserver.MBeanSupport.<init>(MBeanSupport.java:148)
    at com.sun.jmx.mbeanserver.StandardMBeanSupport.<init>(StandardMBeanSupport.java:81)
    at javax.management.StandardMBean.construct(StandardMBean.java:169)
    at javax.management.StandardMBean.<init>(StandardMBean.java:200)
    at com.tc.management.remote.connect.ClientConnectEventHandler.addJmxConnection(ClientConnectEventHandler.java:130)
    - locked <0xa2dda100> (a java.util.HashMap)
    at com.tc.management.remote.connect.ClientConnectEventHandler.handleEvent(ClientConnectEventHandler.java:85)
    at com.tc.async.impl.StageImpl$WorkerThread.run(StageImpl.java:142)
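
The failure mode described above can be illustrated with a small standalone sketch (not Terracotta source; the timings, names, and payload sizes are assumptions): a single worker thread drains a cleanup queue, and because each cleanup blocks for longer than the interval between disconnects, the queue and everything it references grow without bound.

    import java.util.concurrent.LinkedBlockingQueue;

    // Standalone illustration of the race: cleanup events arrive once a
    // minute, but the lone stage worker takes 90 seconds per event, so
    // queued events accumulate until the heap is exhausted.
    public class SlowStageLeakDemo {
        static final LinkedBlockingQueue<byte[]> jmxStage =
            new LinkedBlockingQueue<byte[]>();

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            jmxStage.take();      // dequeue one cleanup event
                            Thread.sleep(90000L); // simulate being stuck in a JMX connect
                        }
                    } catch (InterruptedException e) {
                        // exit on interrupt
                    }
                }
            }, "jmxremote_connect_stage-worker");
            worker.setDaemon(true);
            worker.start();

            while (true) {
                // One client disconnect per minute, as in the reported test;
                // the 1MB array stands in for a retained transport object.
                jmxStage.put(new byte[1024 * 1024]);
                Thread.sleep(60000L);
            }
        }
    }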

James Heanly 2007-12-17

Hi.

Can someone please give me an update on how the investigation into this problem is progressing? Has a fix been identified, and if so, when is it likely to be available?

Thanks.

James.

James Heanly 2008-02-18

I have attached a new zip file containing the logs from some testing we have been performing against the 2.5.1 release.

I have some concern that the out of memory error has not been fully resolved.

Running the same test code as before, after a period of time the Terracotta Server seems to refuse client connections.

I am going to run the tests again, but wanted to let you know that we are still having problems.

Did you have any problems when running our sample code against the 2.5.1 release?