• Bug
  • Status: Open
  • Priority: 2 Major
  • Resolution:
  • Component: DSO:L1
  • prodmgmt
  • Reporter: dconnard
  • Created: October 28, 2008
  • 3
  • Watchers: 6
  • Updated: March 19, 2010

Description

This text has been copied in and edited from http://forums.terracotta.org/forums/posts/list/1499.page

We’ve got a web application whose DB access (via hibernate) is not performing as well as it previously did under load (where previously refers to before we instrumented our DAO classes). Our DAO classes use direct field access in Hibernate (which may not be the most common pattern?). When direct field access is used, Hibernate uses reflection to access the fields on the DAO object, and (from below) it appears that Terracotta has inserted itself under the VM reflection field access layer, and inadvertently triggered inter-thread synchronization as a result.
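To make the access pattern concrete, here is a minimal, self-contained sketch (the class and field names are invented for illustration) of the reflective read that Hibernate's DirectPropertyAccessor performs when field-level access is configured; this Field.get() call is the one Terracotta reroutes through FieldUtils.get():

import java.lang.reflect.Field;

// Illustration only: what a direct-field-access read boils down to.
// Under an instrumented VM, the reflective read at the bottom is what
// ends up calling ManagerUtil.isRoot()/ManagerImpl.isRoot() on every access.
public class DirectFieldAccessDemo {

  // A typical entity/DAO class whose identifier is read by direct field access.
  static class Order {
    private Long id = 42L;
  }

  public static void main(String[] args) throws Exception {
    Field idField = Order.class.getDeclaredField("id");
    idField.setAccessible(true);

    // This is the kind of call DirectPropertyAccessor$DirectGetter.get() makes.
    Object id = idField.get(new Order());
    System.out.println("id = " + id);
  }
}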

When we take a stack dump, we are seeing a significant number of threads all blocked on synchronisation locks during the hibernate field-level access, as below. Note that the blocked threads are not for the same fields, nor even the same DAO (I can tell this from further down the stack traces, bits that I haven’t included). The synchronization is occurring VM-wide as far as our web application is concerned.

Common locks - we are seeing 10-15 concurrent waits on this during peak loads:

"worker-800" daemon prio=1 tid=0x09d04220 nid=0x52c4 waiting for monitor entry {0x36485000..0x36486eb0}
    at com.tc.aspectwerkz.reflect.impl.java.JavaClassInfoRepository.getRepository(JavaClassInfoRepository.java:51)
    - waiting to lock <0xad7d73b0> (a java.lang.Class)
    at com.tc.aspectwerkz.reflect.impl.java.JavaClassInfo.getClassInfo(JavaClassInfo.java:154)
    at com.tc.object.bytecode.ManagerImpl.isRoot(ManagerImpl.java:773)
    at com.tc.object.bytecode.ManagerUtil.isRoot(ManagerUtil.java:440)
    at com.tc.util.FieldUtils.get(FieldUtils.java:36)
    at java.lang.reflect.Field.get(Field.java)
    at org.hibernate.property.DirectPropertyAccessor$DirectGetter.get(DirectPropertyAccessor.java:32)
    at org.hibernate.tuple.entity.AbstractEntityTuplizer.getIdentifier(AbstractEntityTuplizer.java:183)
    at org.hibernate.persister.entity.AbstractEntityPersister.getIdentifier(AbstractEntityPersister.java:3596

and, less commonly, we are also seeing some of the following waits:

"worker-814" daemon prio=1 tid=0x0a1dc9e0 nid=0x52d2 runnable {0x362b7000..0x362b8fb0}
    at com.tc.aspectwerkz.reflect.impl.java.JavaClassInfo.getFields(JavaClassInfo.java:333)
    - waiting to lock <0x4da31528> (a com.tc.aspectwerkz.reflect.impl.java.JavaClassInfo)
    at com.tc.object.bytecode.ManagerImpl.isRoot(ManagerImpl.java:775)
    at com.tc.object.bytecode.ManagerUtil.isRoot(ManagerUtil.java:440)
    at com.tc.util.FieldUtils.get(FieldUtils.java:36)
    at java.lang.reflect.Field.get(Field.java)
    at org.hibernate.property.DirectPropertyAccessor$DirectGetter.get(DirectPropertyAccessor.java:32)
    at org.hibernate.tuple.entity.AbstractEntityTuplizer.getPropertyValues(AbstractEntityTuplizer.java:256

and, also some waits on:

"worker-1291" daemon prio=1 tid=0x3de46390 nid=0x1e47 runnable {0x2e321000..0x2e322130}
    at EDU.oswego.cs.dl.util.concurrent.CopyOnWriteArrayList.array(Unknown Source)
    - waiting to lock <0x54ebbea0> (a EDU.oswego.cs.dl.util.concurrent.CopyOnWriteArrayList)
    at EDU.oswego.cs.dl.util.concurrent.CopyOnWriteArrayList.iterator(Unknown Source)
    at com.tc.object.config.StandardDSOClientConfigHelperImpl.findMatchingRootDefinition(StandardDSOClientConfigHelperImpl.java:1269)
    at com.tc.object.config.StandardDSOClientConfigHelperImpl.isRoot(StandardDSOClientConfigHelperImpl.java:1259)
    at com.tc.object.bytecode.ManagerImpl.isRoot(ManagerImpl.java:778)
    at com.tc.object.bytecode.ManagerUtil.isRoot(ManagerUtil.java:440)
    at com.tc.util.FieldUtils.get(FieldUtils.java:36)
    at java.lang.reflect.Field.get(Field.java)
    at org.hibernate.property.DirectPropertyAccessor$DirectGetter.get(DirectPropertyAccessor.java:32)
    at org.hibernate.tuple.entity.AbstractEntityTuplizer.getPropertyValues(AbstractEntityTuplizer.java:256)

What these three stack traces have in common is that they all start from the FieldUtils.get() method, going into the ManagerImpl.isRoot() method.

It seems that ManagerImpl.isRoot() can be an expensive call to make in a heavily threaded environment. It would appear that the AspectWerkz library (the underlying culprit in the first stack pattern) uses a WeakHashMap, and they have (correctly) synchronized access to it on get() too (yes, you need to do this with WeakHashMaps; that’s not an error on their part).
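For illustration, the following sketch (the names are ours, not AspectWerkz’s actual code) shows why a WeakHashMap-based cache has to be locked even for lookups: WeakHashMap.get() can internally expunge entries whose keys have been garbage collected, i.e. it mutates the map, so every reader and writer ends up contending on the same monitor:

import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the synchronized-WeakHashMap pattern described above (illustrative names).
public class SynchronizedWeakCacheSketch {

  private static final Map<Class<?>, Object> CACHE =
      Collections.synchronizedMap(new WeakHashMap<Class<?>, Object>());

  static Object lookup(Class<?> key) {
    // Even a read takes the map's single monitor, which is what turns a hot
    // reflective path into a VM-wide synchronization point under load.
    return CACHE.get(key);
  }
}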

While we found it via Hibernate, the problem isn’t Hibernate-specific. It affects any reflective access to fields on instrumented objects, and triggers synchronisation between all threads doing that reflective access, regardless of the object. Hibernate in particular can be sorely affected (if you use accessType=field, like we have), but the effects don’t stop there.

Tests in our production environment with a patch for this have shown a solid 12-20% improvement in CPU load, with the higher figures occurring during peak loads. Not quite as dramatic an improvement as I’d hoped, but very significant nonetheless. The big variation in the observed improvement (between normal and peak loads) is unsurprising: the issue is a concurrency one, and doesn’t really start to hurt badly until there is a significant degree of concurrent reflection-based field access.

The load-related variation makes this issue quite critical. When your servers are under load, the Terracotta client is artificially increasing that load even further. This can be seen in the shape of our load profile graphs (contained within the forum page mentioned at the top): once the load reaches a certain point, the unpatched server’s load starts to increase at a faster rate, which in turn further compounds the load and worsens the concurrency problem. This is, in effect, a feedback cycle, and could conceivably cripple a server under higher load than ours.

The patch I applied to test this is below. It was made against the 2.6.4 source. This patch leaves the cache unbounded in size, but discards it periodically based on an expiry time. This solution is not necessarily appropriate for everyone (the unbounded cache could be an issue; an LRU cache, as sketched after the patch below, would be better). For us, we have set the expiry to 3600s (i.e. 1 hour) and the initialSize to 1400 (which was just above our usual cache size after 1 hour).

ManagerImpl.java, line 765:

// rootFields is a local cache, not clustered
transient final long rootFieldsCacheAge = Long.parseLong(System.getProperty("tc.rootFieldsCache.expiry", "0"));
transient final int rootFieldsCacheInitialSize = Integer.parseInt(System.getProperty("tc.rootFieldsCache.initialSize", "0"));
transient ConcurrentHashMap rootFields = new ConcurrentHashMap(rootFieldsCacheInitialSize);
transient long rootFieldsCacheCreated = System.currentTimeMillis();
transient long rootFieldsCacheExpiry = System.currentTimeMillis() + rootFieldsCacheAge * 1000;

public boolean isRoot(Field field) {
  Boolean isRoot = null;

 // check if root field caching is enabled (off by default)
 if (rootFieldsCacheAge > 0) {
   // manage the root fields cache
   final long now = System.currentTimeMillis();
   if (now > rootFieldsCacheExpiry) {
     rootFieldsCacheExpiry = now + rootFieldsCacheAge * 1000;
     rootFieldsCacheCreated = now;
     ConcurrentHashMap previousCache = rootFields;
     rootFields = new ConcurrentHashMap(rootFieldsCacheInitialSize);
     System.out.println("Purging rootFieldsCache as age limit of " + rootFieldsCacheAge + "s was reached, size reached " + previousCache.size());
   }
 
   // fetch first from the cache
   isRoot = (Boolean) rootFields.get(field);
 }
 
 if (isRoot == null) {
   isRoot = Boolean.valueOf(isRootImpl(field));
   if (rootFieldsCacheAge > 0) {
     rootFields.put(field, isRoot);
   }
 }
 
  return isRoot.booleanValue();
}

private boolean isRootImpl(Field field) {
  // … original isRoot() implementation
}
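As a point of comparison, here is a rough sketch (not part of the patch, and not tested against the 2.6.4 source) of the bounded LRU alternative mentioned above, using an access-ordered LinkedHashMap capped at 1400 entries:

import java.lang.reflect.Field;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache for the same Field -> Boolean mapping the patch keeps.
public class RootFieldLruCache {

  private static final int MAX_ENTRIES = 1400; // roughly the cache size we observed after 1 hour

  private final Map<Field, Boolean> cache =
      Collections.synchronizedMap(new LinkedHashMap<Field, Boolean>(MAX_ENTRIES, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Field, Boolean> eldest) {
          // evict the least-recently-used entry once the cap is exceeded
          return size() > MAX_ENTRIES;
        }
      });

  Boolean get(Field field) {
    return cache.get(field);
  }

  void put(Field field, Boolean isRoot) {
    cache.put(field, isRoot);
  }
}

The synchronizedMap wrapper reintroduces a single, very short lock per lookup, whereas the ConcurrentHashMap in the patch avoids blocking readers; either is far cheaper than holding the class-level monitor inside JavaClassInfoRepository. With the patch as written, our settings correspond to starting the JVM with -Dtc.rootFieldsCache.expiry=3600 and -Dtc.rootFieldsCache.initialSize=1400.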

The patch above is one possible solution. Looking at FieldUtils.java (the caller), it’s clear that it wasn’t written with the possibility in mind that the ManagerUtil.isRoot() call could prove expensive: the call is sometimes made multiple times in sequence (e.g. within isStaticAndNonRootField() and then again explicitly afterwards). Perhaps a starting point would be a slight rework of the FieldUtils logic, and perhaps the caching idea should be applied there instead.
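To illustrate that rework (this is not the real com.tc.util.FieldUtils source, just a sketch of the idea with invented helper names), the expensive isRoot() answer could be computed once per access and passed down to the helper checks instead of being re-queried:

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical sketch of the suggested FieldUtils rework: one isRoot() lookup per access.
public final class FieldUtilsReworkSketch {

  public static Object get(Field field, Object target) throws IllegalAccessException {
    final boolean isRoot = isRoot(field); // single (potentially expensive) lookup

    if (Modifier.isStatic(field.getModifiers()) && !isRoot) {
      // static non-root field: plain reflective read, no DSO involvement
      return field.get(target);
    }
    if (isRoot) {
      // root field: the DSO-aware resolution would happen here in the real code
    }
    return field.get(target);
  }

  // Stand-in for ManagerUtil.isRoot(field); always false in this sketch.
  private static boolean isRoot(Field field) {
    return false;
  }
}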

Comments

Steve Harris 2009-06-23

One thing worth noting is that if the objects being accessed via reflection aren’t being clustered, you can actually turn that instrumentation off completely. There is a config element in tc-config to turn off reflection instrumentation.