• Bug
  • Status: Closed
  • 2 Major
  • Resolution: Fixed
  • nelrahma
  • Reporter: nelrahma
  • April 26, 2010
  • 0
  • Watchers: 1
  • July 27, 2012
  • February 27, 2011

Attachments

Description

From Forum post: http://forums.terracotta.org/forums/posts/list/3532.page#19745

Dear All,

We have been evaluating a distributed Ehcache using a simple application that gets references to objects in the cache and modifies them. The set up was:

  • 1 TC server, 1 Mirror on separate servers (not in persistent mode)
  • 3 TC clients accessing the cache (standalone Java app)
  • 1 TC client webapp to view cache contents (in Tomcat)

We have been testing this setup at a rate of 500 updates per second to the cache (strictly speaking they are not “cache updates”, since the object in the cache is modified but not replaced). These updates are made randomly to 70 objects in the cache (there is no cache eviction).

The GC was tuned to run 1/minute for young collections and every 5 minutes for full collections, which stabilized the number of live objects (the application is creating many new Timestamp objects, although this could be modified).

The main TC server crashed after running at this rate for one month, with the following exception, caught during the garbage collection:

2010-04-15 13:16:03,161 [DGC-Thread] INFO com.tc.objectserver.dgc.impl.MarkAndSweepGarbageCollector - DGC[ 41059 ] YoungGen DGC start 2010-04-15 13:16:03,285 [DGC-Thread] INFO com.tc.objectserver.dgc.impl.MarkAndSweepGarbageCollector - DGC[ 41059 ] pre-DGC managed id count: 66386 2010-04-15 13:16:03,313 [DGC-Thread] ERROR com.tc.server.TCServerMain - Thread:Thread[DGC-Thread,5,TC Thread Group] got an uncaught exception. calling Callba\ ckOnExitDefaultHandlers. com.tc.util.TCAssertionError: Assertion failed: Bit index out of range at com.tc.util.Assert.failure(Assert.java:60) at com.tc.util.Assert.eval(Assert.java:80) at com.tc.util.Assert.assertTrue(Assert.java:100) at com.tc.util.OidLongArray.bit(OidLongArray.java:49) at com.tc.util.OidLongArray.isSet(OidLongArray.java:135) at com.tc.util.OidBitsArrayMapImpl.contains(OidBitsArrayMapImpl.java:72) at com.tc.objectserver.impl.NoReferencesIDStoreImpl$OidBitsStore.hasNoReferences(NoReferencesIDStoreImpl.java:84) at com.tc.objectserver.impl.NoReferencesIDStoreImpl.hasNoReferences(NoReferencesIDStoreImpl.java:46) at com.tc.objectserver.impl.ObjectManagerImpl.getObjectReferencesFrom(ObjectManagerImpl.java:641) at com.tc.objectserver.dgc.impl.YoungGCHook.getObjectReferencesFrom(YoungGCHook.java:57) at com.tc.objectserver.dgc.impl.MarkAndSweepGCAlgorithm.collectRoot(MarkAndSweepGCAlgorithm.java:167) at com.tc.objectserver.dgc.impl.MarkAndSweepGCAlgorithm.collect(MarkAndSweepGCAlgorithm.java:150) at com.tc.objectserver.dgc.impl.MarkAndSweepGCAlgorithm.doGC(MarkAndSweepGCAlgorithm.java:67) at com.tc.objectserver.dgc.impl.MarkAndSweepGarbageCollector.doGC(MarkAndSweepGarbageCollector.java:71) at com.tc.objectserver.dgc.impl.GarbageCollectorThread.doYoungGC(GarbageCollectorThread.java:94) at com.tc.objectserver.dgc.impl.GarbageCollectorThread.run(GarbageCollectorThread.java:75)

The mirror server took over correctly at this point, but crashed with the same message during the next distributed garbage collection.

Although some of the machines were being stretched by the testing, I did not detect any significant problems during the month of operation (using the metrics provided in the TC Dev. console. The TC temporary disk storage did vary quite a bit, but never seemed to grow out of control (max around 100MB). So this crash has left me rather puzzled so far…

Does this ring a bell for anybody? I realize this is not much to go by… Let me know if the full thread JVM dump in the TC log would help.

Thanks for any indications of where I should be looking.

Mark

Comments

Nabib El-Rahman 2010-04-27

Email thread comments:

Ran a test maps with null values test and found out that a map.put(“key1”, null) does lead to an ObjectID.NULL_ID in the references map in PartialMapManagedObjectState.

We don’t filter Object.NULL_ID these at the managed object state class for DGC, instead we filter it in the DGC algorithm.

private void collectRoot(Filter filter, ObjectID rootId, Set managedObjectIds, LifeCycleState lifeCycleState) { Set toBeVisited = new ObjectIDSet(); toBeVisited.add(rootId);

while (!toBeVisited.isEmpty()&& !managedObjectIds.isEmpty()) {

for (Iterator i = new ObjectIDSet(toBeVisited).iterator(); i.hasNext()&& !managedObjectIds.isEmpty();) { ObjectID id = (ObjectID) i.next(); if (lifeCycleState.isStopRequested()) return; Set references = gcHook.getObjectReferencesFrom(id); toBeVisited.remove(id);

for (Iterator r = references.iterator(); r.hasNext();) { ObjectID mid = (ObjectID) r.next(); if (mid == null) { // see CDV-765 MarkAndSweepGarbageCollector.logger.error(“null value returned from getObjectReferences() on “ + id); continue; } if (mid.isNull() || !managedObjectIds.contains(mid)) continue; if (filter.shouldVisit(mid)) toBeVisited.add(mid); managedObjectIds.remove(mid); } } } }

Notice the mid.isNull() line..

The only way this can fail is if the rootId happens to be an Object.NULL_ID, if you notice in YoungGenChangeCollectorImpl.notifyObjectEvicted. If PartialMapManagedObjectState happens to make it into the rememberedSet where its children does have Object.NULL_ID, then this can happen.

public synchronized void notifyObjectsEvicted(Collection evicted) { for (Iterator i = evicted.iterator(); i.hasNext();) { ManagedObject mo = (ManagedObject) i.next(); ObjectID id = mo.getID(); removeReferencesTo(id); Set references = mo.getObjectReferences(); references.retainAll(this.youngGenObjectIDs.keySet()); this.rememberedSet.addAll(references); } }

Need more details about the forum failure to see if its the same issue, but the code is reading that way. Still trying to get this to fail in the test, any pointers on how we can expose this?

Nabib El-Rahman 2010-04-27

Next steps:

  1. Fix OidBitsArrayMapImpl to accept negative numbers.
  2. See if it acceptable for YoungGen to have NULL_ID’s

Nabib El-Rahman 2010-04-27

Logs from forum post.

Saravanan Subbiah 2010-05-05

  1. OidBitsArrayMapImpl should atleast be fixed to handle Null ID possibily negative ids
  2. We should probably not have NullIDs in remembered set or youngGenObjectIDs inYDGC. This can be done without paying extra penality of iterating over the set.

Nabib El-Rahman 2010-05-25

Ran a test maps with null values test and found out that when you put a null, the value is an ObjectID.NULL_ID. We don’t filter these at the managed object state class for DGC, instead we filter it in the DGC algorithm.

private void collectRoot(Filter filter, ObjectID rootId, Set managedObjectIds, LifeCycleState lifeCycleState) { Set toBeVisited = new ObjectIDSet(); toBeVisited.add(rootId);

while (!toBeVisited.isEmpty() && !managedObjectIds.isEmpty()) {

  for (Iterator i = new ObjectIDSet(toBeVisited).iterator(); i.hasNext() && !managedObjectIds.isEmpty();) {
    ObjectID id = (ObjectID) i.next();
    if (lifeCycleState.isStopRequested()) return;
    Set references = gcHook.getObjectReferencesFrom(id);
    toBeVisited.remove(id);

    for (Iterator r = references.iterator(); r.hasNext();) {
      ObjectID mid = (ObjectID) r.next();
      if (mid == null) {
        // see CDV-765
        MarkAndSweepGarbageCollector.logger.error("null value returned from getObjectReferences() on " + id);
        continue;
      }
      if (mid.isNull() || !managedObjectIds.contains(mid)) continue;
      if (filter.shouldVisit(mid)) toBeVisited.add(mid);
      managedObjectIds.remove(mid);
    }
  }
}   \}

Notice the mid.isNull() line..

The only way this can fail is if the rootId happens to be an Object.NULL_ID, if you notice in YoungGenChangeCollectorImpl.notifyObjectEvicted. If a logical object happens to make it into the rememberedSet where its children does have Object.NULL_ID, then this can happen, the the failure we saw in the forum can happen.

public synchronized void notifyObjectsEvicted(Collection evicted) { for (Iterator i = evicted.iterator(); i.hasNext();) { ManagedObject mo = (ManagedObject) i.next(); ObjectID id = mo.getID(); removeReferencesTo(id); Set references = mo.getObjectReferences(); references.retainAll(this.youngGenObjectIDs.keySet()); this.rememberedSet.addAll(references); } }

Still trying to get this to fail in the test, any pointers on how we can expose this?

-Nabib

Michal L 2010-07-07

Is this bug fixed in TC 3.2.1_2? If not - Is there any known workaround for it? Is there anything we can do to avoid it? Do you have a fix for it? It is a real show stopper!!! Thanks.

Nabib El-Rahman 2010-07-07

This assertion shouldn’t happen in TC_3.2.1_2. You should be ok with that revision.

Erh-Yuan Tsai 2010-07-20

Made ObitBitsArrayMapImpl to accept negative ObjectID, trunk: rev15957, branch 3.3: 15958.

Reassign back to Nabib.