Attachments

thagg75 (3.00 k) application/zip EHCache-bug.zip

Description

We have deployed an application in production (4 servers with JBoss 4.2.3/EHCache 2.4.1), and the application logs are filling up with the exception mentioned in the subject. It occurs at random, for both get() and put() calls against the configured caches.

Attached: a sample stack trace for put() and the Ehcache configuration (ehcache.xml).

Comments

Thanos Agelatos 2011-12-05

Please note that this is only happening for caches that are set with ‘overflowToDisk=true’. Other caches do not exhibit such behavior even though they participate in the RMI distributed cluster.

Fiona OShea 2011-12-05

Chris, can you take a look and evaluate what it would take to fix? Thanks.

Thanos Agelatos 2011-12-09

Hello, any updates on this issue? Do you need anything more?

Chris Dennis 2011-12-09

What is happening here is that when the cache attempts to read a serialized entry from disk, the data at the file offset recorded in the cache is not a valid serialized form. There are three main reasons that I think could be causing this:

1. Two or more cache managers are accessing the same data files on disk.

2. An unclean shutdown has occurred (failure to call shutdown on the CM when finished with it). This can leave the index file (the serialized version of the keys and file pointers normally stored in heap) out of sync with the data file (the serialized values), which can result in entries in the index file pointing to invalid locations in the data file. This obviously only happens for disk persistent caches.

3. A bug in Ehcache, most likely related to mistakenly handing out the same chunk of disk to two different keys at the same time.

I’ve listed these roughly in order of likelihood. As far as I can tell we haven’t seen any cases of bugs like this in internal testing, and we haven’t fixed any disk store corruption bugs like this since 2.4.1 was released. If you can rule out the first two points as causes, then the next step will be to try to create a reproducible test case that we can run in-house.
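To illustrate the first cause, here is a minimal toy model in plain Java. It is not Ehcache's actual disk store: the `ToyDiskStore` class, its field names, and its trivial allocation scheme are all hypothetical. It only shows the general mechanism described above, where each manager keeps its own in-heap index of offsets into a data file, so a second manager writing to the same file can leave the first one pointing at bytes that are no longer a valid serialized form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class SharedDiskStoreDemo {

    /** Hypothetical stand-in for a disk store: values on disk, offsets in heap. */
    static class ToyDiskStore {
        private final File dataFile;
        // key -> {offset, length}; known only to this instance
        private final Map<String, long[]> index = new HashMap<>();
        // this instance's private view of where free space begins
        private long nextFree = 0;

        ToyDiskStore(File dataFile) {
            this.dataFile = dataFile;
        }

        void put(String key, Serializable value) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            byte[] payload = bytes.toByteArray();
            try (RandomAccessFile raf = new RandomAccessFile(dataFile, "rw")) {
                raf.seek(nextFree);
                raf.write(payload);
            }
            index.put(key, new long[] {nextFree, payload.length});
            nextFree += payload.length;
        }

        Object get(String key) throws IOException, ClassNotFoundException {
            long[] slot = index.get(key);
            byte[] payload = new byte[(int) slot[1]];
            try (RandomAccessFile raf = new RandomAccessFile(dataFile, "r")) {
                raf.seek(slot[0]);
                raf.readFully(payload);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(payload))) {
                return in.readObject();
            }
        }
    }

    /** Returns true if store A still reads back its own value after store B writes. */
    static boolean roundTrips(File fileForA, File fileForB)
            throws IOException, ClassNotFoundException {
        ToyDiskStore a = new ToyDiskStore(fileForA);
        ToyDiskStore b = new ToyDiskStore(fileForB);
        a.put("user:1", "short");
        b.put("user:2", "a much longer value written by the other manager");
        try {
            return "short".equals(a.get("user:1"));
        } catch (IOException invalidSerializedForm) {
            // The bytes at A's recorded offset no longer deserialize.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        File shared = File.createTempFile("toy-shared", ".data");
        File first = File.createTempFile("toy-a", ".data");
        File second = File.createTempFile("toy-b", ".data");
        shared.deleteOnExit();
        first.deleteOnExit();
        second.deleteOnExit();
        System.out.println("shared data file round-trips: " + roundTrips(shared, shared));
        System.out.println("separate data files round-trip: " + roundTrips(first, second));
    }
}
```

In this toy model the failure depends only on two writers sharing one file, not on persistence across restarts, which is why clearing the temporary folder on restart would not by itself rule out the first cause.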

Thanos Agelatos 2011-12-09

Thanks for your reply. A small aside: in our internal testing, we noticed that this version of EHCache caches to disk more aggressively (e.g. inmem size = 120, ondisk = 120), whereas in the past the memory cache size was X and the disk cache size would be a fraction of that.

Taking the points one after the other:

1. We have 4 web archives deployed on each server; all use the same module to load the caches and get a cache manager from the same configuration. This worked in the past (since the 1.x versions; the latest in production was 2.00). Could a change in 2.4.1 have made this stop working? We get the cache manager via the constructor that takes a URL.
2. I doubt that too. Every time we restart we clear the tmp/ folder, and the caches are not persistent.
3. No opinion there. We could try to replicate it via a test case, but keep in mind that it has only happened on our production system (not on the dev, test, or UAT servers). It happens from around 10pm CET onwards.

Hope you can get some further insight.
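One way to test whether point 1 applies is to give each web archive its own disk store directory. Below is a minimal sketch in plain Java, assuming each archive can supply a distinct name such as its context path. The `DiskStorePaths` class, the `diskStoreDirFor` method, and the `ehcache` subdirectory name are hypothetical. The resulting path would then have to be supplied to each archive's Ehcache configuration (for example through the `path` attribute of the `diskStore` element in ehcache.xml), which is not shown here.

```java
import java.io.File;

class DiskStorePaths {

    /**
     * Builds a per-application directory under java.io.tmpdir.
     * Characters outside [A-Za-z0-9._-] are replaced with '_', so two names
     * that differ only in such characters would map to the same directory.
     */
    static File diskStoreDirFor(String appName) {
        String safeName = appName.replaceAll("[^A-Za-z0-9._-]", "_");
        File base = new File(System.getProperty("java.io.tmpdir"), "ehcache");
        return new File(base, safeName);
    }

    public static void main(String[] args) {
        // Hypothetical context paths for two web archives on the same server.
        System.out.println(diskStoreDirFor("/shop"));
        System.out.println(diskStoreDirFor("/admin"));
    }
}
```

If the exception disappears once every archive writes to its own directory, that would point to several cache managers sharing the same data files.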

Thanos Agelatos 2011-12-15

Hello, any updates on this issue?

Fiona OShea 2011-12-21

Mike checking with Field to take this one.

Fiona OShea 2012-02-17

Sending back to DRB; not sure where this ended up.

Rajan Gupta 2012-03-12

Thanos, is there any update on the status of this issue from your side? Has it been resolved, or is it still open? I hope Chris's explanation has helped to clarify the issue.
