• Bug
  • Status: Open
  • 2 Major
  • Resolution:
  • ehcache
  • frs,gafeedback,reliability
  • cdennis
  • Reporter: mads1980
  • July 02, 2013
  • 0
  • Watchers: 7
  • May 08, 2014

Attachments

Description

We’re randomly getting recovery errors from the Fast Restart Store, particularly after killing the java process. It does not always happen, but when it does, it is particularly problematic, because not only are the cache contents lost, but the startup procedure halts with an unrecoverable exception, and our webapp fails to start. This requires cleaning up the DiskStore to be able to restart the webapp.

Of course, the point of the Fast Restart Store is to be able to survive crashes, and we find this is not always happening.

Besides trying to fix what is causing these failures, we’d like to suggest a feature: allow an option to discard the cache contents (logging a warning of course) when such a scenario happens. We’d much rather have the app start and lose the cache, than not having the webapp start at all. These needs to be configurable, since other users may prefer not to discard the cache contents, even if corrupted.

Comments

Manuel Dominguez Sarmiento 2013-07-02

This is an example error log:

Caused by: net.sf.ehcache.CacheException: com.terracottatech.frs.recovery.RecoveryException: Caught an error recovering from log at /home/cms/cache/persistence/cachedata at net.sf.ehcache.store.restartability.EhcacheRestartability.startStore(EhcacheRestartability.java:257) at net.sf.ehcache.store.restartability.EhcacheRestartability.bootstrapRestartability(EhcacheRestartability.java:223) at net.sf.ehcache.store.restartability.EhcacheRestartability.startup(EhcacheRestartability.java:96) at net.sf.ehcache.EnterpriseFeaturesManager.startup(EnterpriseFeaturesManager.java:100) at net.sf.ehcache.CacheManager.doInit(CacheManager.java:523) at net.sf.ehcache.CacheManager.init(CacheManager.java:392) at net.sf.ehcache.CacheManager.(CacheManager.java:352) at org.springframework.cache.ehcache.EhCacheManagerFactoryBean.afterPropertiesSet(EhCacheManagerFactoryBean.java:139) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1547) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1485) ... 161 more Caused by: com.terracottatech.frs.recovery.RecoveryException: Caught an error recovering from log at /home/cms/cache/persistence/cachedata at com.terracottatech.frs.recovery.RecoveryManagerImpl$ReplayFilter.checkError(RecoveryManagerImpl.java:194) at com.terracottatech.frs.recovery.RecoveryManagerImpl.recover(RecoveryManagerImpl.java:83) at com.terracottatech.frs.RestartStoreImpl.startup(RestartStoreImpl.java:75) at net.sf.ehcache.store.restartability.ControlledTransactionRestartStore.startup(ControlledTransactionRestartStore.java:31) at net.sf.ehcache.store.restartability.EhcacheRestartability.startStore(EhcacheRestartability.java:255) ... 170 more Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.Boolean to field com.renxo.cms.domain.network.ShortNumber.networkOperator of type com.renxo.cms.domain.network.NetworkOperator in instance of com.renxo.cms.domain.network.ShortNumber at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2063) at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1241) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369) at java.util.HashMap.readObject(HashMap.java:1154) at sun.reflect.GeneratedMethodAccessor53.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369) at com.renxo.cms.domain.EntityImpl.readObject(EntityImpl.java:325) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369) at com.renxo.cms.domain.EntityImpl.readObject(EntityImpl.java:325) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369) at java.util.concurrent.ConcurrentHashMap.readObject(ConcurrentHashMap.java:1557) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Manuel Dominguez Sarmiento 2013-07-02

The root cause is this: java.lang.ClassCastException: cannot assign instance of java.lang.Boolean to field com.renxo.cms.domain.network.ShortNumber.networkOperator of type com.renxo.cms.domain.network.NetworkOperator in instance of com.renxo.cms.domain.network.ShortNumber

However, object serialization for our classes works fine (we heavily use serialization both for persistence and for remoting over the wire), and we’ve never had similar exceptions, except for these random Fast Restart Store recovery errors. We’re certainly not assigning Booleans to our custom domain classes.

Manuel Dominguez Sarmiento 2013-07-02

Does the Fast Restart Store do anything fancy, like custom serialization? I mean, to rule out other problems, like possible incorrect processing of corrupted redo-logs after having killed the java process.

Tim Eck 2013-07-02

Both the feature request (configurable drop disk store on failed recovery) and the serialization error are interesting :-)

Any possibility you can distill the serialization error into a reproducible test case? A test case we can debug will ensure it will be fixed (and will be the right fix for you as well)

Manuel Dominguez Sarmiento 2013-07-02

Unfortunately we haven’t found a way to consistently reproduce it, until it happens. Once it happens, it is reproducible on all webapp restart attempts. So I guess the proper way to reproduce it on your end would be for us to ship the corrupted DiskStore contents to you. However, the problem is that for you to be able to deserialize the contents, you would need access to our proprietary classes, which understandably we cannot do that easily.

What do you suggest?

Tim Wu 2013-07-08

There is a bit of custom serialization mainly to optimize out the object stream descriptors I believe. Chris would know more about it.

Manuel Dominguez Sarmiento 2013-07-08

Dealing with object stream descriptors may mess up standard serialization’s built-in class evolution if I’m not mistaken. That is something to look into.

Chris Dennis 2013-07-11

You are correct that without being careful you can cause problems here. Obviously I can’t go in to details regarding what the proprietary code is doing, but I’m pretty sure this is behaving correctly and not causing any issues. It is of course possible that there is a bug here though. With that in mind, has the definition of {{com.renxo.cms.domain.network.ShortNumber}} changed since the Cache was created? If it has, could you show us the stripped down versions of that class? (You can remove most of the methods, but you would need to leave the fields and any custom serialization code intact).

Manuel Dominguez Sarmiento 2013-07-11

Hi Chris, I’ve attached the class hierarchy for ShortNumber for you to inspect. However, please note that we’ve had this happen with many different entity classes in our system, not just ShortNumber (which happens to be in the log we submitted, but just as well could have been any other one).

Regarding custom serialization, we have customized writeObject() and readObject() methods to force the order in which certain properties are serialized (see EntityImpl.java), as well as readResolve(), which has a fairly complex instance management scheme in order to save memory of heavily used deep object graphs.

Now that you mention it, we’ve seen this error happen almost every time we add or remove a property to our entity classes. Although we are also randomly getting this error WITHOUT any changes to our classes. So it might be related, but it is not the only cause. However, all our class changes are designed with standard Java serialization evolution in mind. Adding or removing fields should not trigger serialization errors.

Chris Dennis 2013-07-18

I’ve added a whole load more testing to the serialization logic here and everything is holding up perfectly. I have not been able to reproduce the kinds of corruption you are seeing, nor have I been able to reason a way this could happen through code inspection.

To me this leaves two possible causes for this bug. Firstly - this could be something in the underying binary persistence layer corrupting the data. I think this is highly unlikely since if this were happening I would expect to see a large variety of different errors since the persistence code is ignorant of the stream structure and therefore would not target it’s corruption in such an explicit way. The only other possibility I see is that of a bug on your side of the fence - either in your code or in library code that you are using. I notice that you have references to Hibernate and Javassist in your code. This strongly implies that your classes are subject to load-time weaving by the Javassist library. I think this may be one possible source of the errors we are seeing.

I think the only realistic path forward here is for you to try and supply me with a reproducible test case. At first this could take the form of an existing set of data files, and the domain objects necessary to attempt recovery and reproduce the error. Ideally this would also include a way to reproduce the original corruption - although I understand that may be hard to replicate.

I’m also going to file a new EHC issue for adding new functionality for (subject to configuration) discarding unrecoverable entries so that we don’t mix these two issues up.

Manuel Dominguez Sarmiento 2013-07-23

Hi Chris, thanks for your feedback. Next time we hit this, we’ll save a snapshot of the persistent store, and try to produce a reproducible recovery failure scenario, with all relevant classes necessary for recovery.

Manuel Dominguez Sarmiento 2013-08-20

We’ve run into this issue again, this time in a development environment. I’ve attached the Fast Restart Store cache snapshot for you to analyze.

We cannot post the JAR with our business objects in this public forum for obvious reasons. Please indicate how you would like us to send this.

James House 2013-08-20

I will mail you an address to which to send the Jar. Thanks.

James House 2013-09-16

Please evaluate.

Manuel Dominguez Sarmiento 2014-03-30

Any news regarding this issue? We’re still seeing the same behaviour on 2.8.1 whenever we add fields to our Serializable classes. EhCache’s Fast Restart Store is doing something to the serialization stream that corrupts deserialization when new fields are added. EhCache should tranparently respect the Java Serialization specification, in particular with respect to compatible class changes.

Manuel Dominguez Sarmiento 2014-03-30

SerializablePortability.OOS.writeClassDescriptor() / OIS.readClassDescriptor() are probably messing something up. Is there any way to disable this customized class descriptor handling?

Chris Dennis 2014-04-02

A giant mea culpa on this. After you poked this issue to the top of my internal queue by commenting on it, I took a fresh look at the data logs you provided and discovered the cause of the bug. I managed to reproduce the failure case with an internal test, and have located and fixed the bug. Many thanks for being a good user and staying on this issue, and my apologies for being a bad developer and assuming that my test coverage was good enough in this area (and therefore not giving the bug it’s due attention). The fix is obviously in a separate proprietary module, so I’m going to leave this JIRA open until the fix bubbles up through the various release processes.

Manuel Dominguez Sarmiento 2014-04-03

Hi Chris, thanks for looking into this. This issue has been randomly bugging us since we began testing the Fast Restart Store, however even for us it was difficult to reproduce, at least without known failing diskstore logs.

When do you expect the fix to be available in a mainstream release? Any chance we might get access to an internal build in the meantime, so that we can cross-check the fix?

Lastly, was this bug related to the customized ObjectStreamClass handling, as we suspected?

Chris Dennis 2014-04-03

bq. When do you expect the fix to be available in a mainstream release? Any chance we might get access to an internal build in the meantime, so that we can cross-check the fix?

In order for other people to be able to answer this (this is outside of my sphere) we’ll need to know what product you are using (more specifically what license). Basically: are you a BigMemory Go customer or did you pay for a license?

bq. Lastly, was this bug related to the customized ObjectStreamClass handling, as we suspected? Yes it was in the custom ObjectStreamClass handling, but I can’t really say more than that on a public jira.

Manuel Dominguez Sarmiento 2014-04-03

Hi Chris, we’re testing on BigMemory Go.

Chris Dennis 2014-04-03

I’ve merged the fix for this to 2.7.x, 2.8.x and 2.9.x (trunk).

Manuel Dominguez Sarmiento 2014-04-17

Hi Chris, is the fix available on BigMemory 4.1.2 / EhCache 2.8.2 ?

Manuel Dominguez Sarmiento 2014-05-08

Hi Chris, any idea when will this fix be available?