• Type: Bug
• Status: Closed
• Priority: Major
• Resolution: As Designed
• Assignee: drb
• Reporter: steve
• Created: December 18, 2009
• Votes: 0
• Watchers: 2
• Updated: February 18, 2010
• Resolved: February 04, 2010


Description

As mentioned on Twitter, we have this small test that was passing under 1.6.1 but fails under 1.7.1 and 1.7.2-SNAPSHOT (which I just pulled from svn and built locally).

Originally we were using an LRU cache and noticed that after 100 items were put in the cache, keys we expected to be there no longer appeared. When we switched to a FIFO cache the problem went away, and this test case tracked that.

Under 1.7.0/1.7.1 and 1.7.2-SNAPSHOT I see it fail consistently.

Cheers, Mark

– Pull me down under…

Comments

Chris Dennis 2010-01-28

In order to increase performance, all of the standard memory stores have been moved to a sampling eviction algorithm to enforce capacity limits. That is, when a cache reaches capacity we select 15 elements at random from the cache and, using the configured eviction policy, remove the least useful of those 15 elements. This means there is no requirement either to sweep the entire cache to discover the absolute least useful element, or to keep a linked list of all entries in the map, which would require synchronization across the entire map on both puts and gets. If you require absolute accuracy and either use or can tolerate an LRU eviction scheme, then by setting the "net.sf.ehcache.use.classic.lru" system property you can enable the old fully locked LinkedHashMap-based MemoryStore, which ensures complete accuracy in LRU eviction at the cost of concurrent access performance.
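To make the sampling idea concrete, here is a rough sketch using hypothetical names (SampledEvictionSketch, CachedEntry and EvictionPolicy are illustrative, not Ehcache classes); the real MemoryStore implementation differs:

import java.util.List;
import java.util.Random;

// Sketch of sampled eviction: instead of scanning the whole store for the globally
// least useful element, draw a small random sample and evict the worst element in it
// according to the configured policy.
class SampledEvictionSketch {

    private static final int SAMPLE_SIZE = 15;
    private final Random random = new Random();

    // Hypothetical policy: returns true if 'a' is a better eviction candidate
    // (less useful) than 'b'.
    interface EvictionPolicy {
        boolean lessUseful(CachedEntry a, CachedEntry b);
    }

    // Example LRU policy: the entry accessed longest ago is the best eviction candidate.
    static final EvictionPolicy LRU = new EvictionPolicy() {
        public boolean lessUseful(CachedEntry a, CachedEntry b) {
            return a.lastAccessTime < b.lastAccessTime;
        }
    };

    // Hypothetical cache entry holding only what the sketch needs.
    static class CachedEntry {
        final Object key;
        volatile long lastAccessTime;
        CachedEntry(Object key) { this.key = key; }
    }

    // Assumes the store is non-empty (it has just reached capacity).
    CachedEntry selectForEviction(List<CachedEntry> entries, EvictionPolicy policy) {
        CachedEntry worst = null;
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            CachedEntry candidate = entries.get(random.nextInt(entries.size()));
            if (worst == null || policy.lessUseful(candidate, worst)) {
                worst = candidate;
            }
        }
        return worst;
    }
}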

Mark Derricutt 2010-02-03

Chris,

The test in question adds entries to the cache and looks back 10 items in a sliding window, making sure the entry still exists. If, as you say, you are taking 15 elements and removing the least useful of them, why would the 10th-most-recent item ever be removed? Am I right in assuming that under FIFO mode the least useful would be the earliest-added of the 15 sampled entries?
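A minimal sketch of that sliding-window pattern, assuming a 100-element LRU cache configured as in the code further down this thread (the class and key names here are hypothetical; this is not the original test):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.store.MemoryStoreEvictionPolicy;

import static org.junit.Assert.assertNotNull;

// Hypothetical reconstruction: fill a 100-element LRU cache and, after each put,
// check that the entry added 10 steps earlier is still present.
public class SlidingWindowSketch {

    public static void main(String[] args) {
        // Uses the default CacheManager configuration.
        CacheManager cacheManager = CacheManager.getInstance();
        Cache cache = new Cache("sliding-window", 100, MemoryStoreEvictionPolicy.LRU, false,
                null, false, 0, 1200, false, Cache.DEFAULT_EXPIRY_THREAD_INTERVAL_SECONDS, null, null);
        cacheManager.addCache(cache);

        for (int i = 0; i < 1000; i++) {
            cache.put(new Element("key" + i, "value" + i));
            if (i >= 10) {
                // Passed under 1.6.1; under 1.7.x the sampled evictor can remove this entry.
                assertNotNull("entry put 10 steps ago should still be cached",
                        cache.get("key" + (i - 10)));
            }
        }
        cacheManager.shutdown();
    }
}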

Could this behavior (of the test failing) be due to adding each entry into the cache -really really really- fast and just tripping up a race condition of some kind?

gluck 2010-02-03

Mark

The observed behaviour is caused by the probabilistic evictor, which is now used for all eviction policies.

You can turn on the Ehcache 1.5 MemoryStore, which uses a deterministic evictor, with System.setProperty(Cache.NET_SF_EHCACHE_USE_CLASSIC_LRU, "true"), or simply by passing it as a system property on your Java command line (-Dnet.sf.ehcache.use.classic.lru=true).

I modified your test as follows and got it to pass every time.

public synchronized Cache findOrBuildEventCache(Class aClass, String id) {

    String cacheName = aClass.getName() + ":" + id;

    // Enable the classic (deterministic) LRU MemoryStore
    System.setProperty(Cache.NET_SF_EHCACHE_USE_CLASSIC_LRU, "true");

    CacheManager cacheManager = CacheManager.getInstance();
    if (!cacheManager.cacheExists(cacheName)) {
        final Cache cache = new Cache(cacheName, 100, MemoryStoreEvictionPolicy.LRU, false,
                null, false, 0, 1200, false, Cache.DEFAULT_EXPIRY_THREAD_INTERVAL_SECONDS, null, null);
        cacheManager.addCache(cache);
    }

    return cacheManager.getCache(cacheName);
}

There is another test that demonstrates the probabilistic eviction behaviour in LruMemoryStoreTest:

/**
 * Test the LRU policy
 */
@Test
public void testProbabilisticEvictionPolicy() throws Exception {
    createMemoryOnlyStore(MemoryStoreEvictionPolicy.LRU, 500);

    //Make sure that the store is empty to start with
    assertEquals(0, cache.getSize());

    // Populate the store up to its maximum capacity of 500 elements
    for (int i = 0; i < 500; i++) {
        cache.put(new Element("" + i, "value1"));
    }
    // Pause so the following accesses are stamped later than the creation times
    // (timestamps are kept with one-second resolution)
    Thread.sleep(3000);
    for (int i = 0; i < 500; i++) {
        cache.get("" + i);
    }

    // Pause again so the original 500 elements are clearly older than those about to be added
    Thread.sleep(3000);
    // Add more elements to push the store over capacity and force evictions
    for (int i = 501; i < 750; i++) {
        cache.put(new Element("" + i, "value1"));
    }

    // Count how many of the most recently added elements survived eviction
    int lastPutCount = 0;
    for (int i = 501; i < 750; i++) {
        if (cache.get("" + i) != null) {
            lastPutCount++;
        }
    }

    assertTrue("Ineffective eviction algorithm. Fewer than 245 of the last 249 put Elements remain: " + lastPutCount, lastPutCount >= 245);
}

What is exacerbating the issue somewhat is that we used to store the last access time to the millisecond. To save space it is now stored only to the second, which makes it much harder to select a good eviction candidate in tests like this.
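A tiny illustration of the effect (hypothetical; it simply models a timestamp kept with one-second granularity rather than the actual Element internals):

public class TimestampResolutionSketch {

    public static void main(String[] args) throws InterruptedException {
        // Two elements created a few hundred milliseconds apart...
        long createdA = System.currentTimeMillis();
        Thread.sleep(300);
        long createdB = System.currentTimeMillis();

        // ...often end up with identical stored values once only whole seconds are kept
        // (note the created=...000 values in the eviction log quoted below), so an
        // age-based policy cannot tell them apart.
        long storedA = (createdA / 1000) * 1000;
        long storedB = (createdB / 1000) * 1000;
        System.out.println(storedA == storedB);   // frequently true
    }
}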

I am going to reopen this because I think we need to think a little more about whether the deterministic evictor should be the default.

Chris Dennis 2010-02-04

Hi Mark,

Apologies for not giving your test my full attention. You’re correct that my explanation did not at all explain why your test is failing. I’ve had another look at it this morning, and added some debugging code to the MemoryStore to show us what is happening in the eviction sampling. Here is the logging from the eviction for the cycle on which the test failed:

Run 108 Eviction Sample:
    100, created=1265296325000
    64, created=1265296325000
    95, created=1265296325000
    19, created=1265296325000
    14, created=1265296325000
    26, created=1265296325000
    43, created=1265296325000
    0, created=1265296325000
    88, created=1265296325000
    7, created=1265296325000
    105, created=1265296325000
    87, created=1265296325000
    80, created=1265296325000
    36, created=1265296325000
    52, created=1265296325000
Selected: 100, created=1265296325000

The eviction sample is selected at random and contains many entries older than the one finally selected. Looking at the creation times we can see why: they all have the same creation time, because the Elements now store timestamps accurate only to the second. If you add a sleep of, say, a couple of hundred milliseconds to the end of the loop in your test, the timestamps become disparate enough for the test to pass.
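In terms of the hypothetical sliding-window sketch shown earlier in this thread, the suggested change amounts to something like the following (again only a sketch, not the actual test):

for (int i = 0; i < 1000; i++) {
    cache.put(new Element("key" + i, "value" + i));
    // A short pause spreads the puts across distinct one-second timestamps,
    // so the sampled evictor has genuinely older entries to choose from.
    Thread.sleep(200);   // the enclosing method must declare or handle InterruptedException
    if (i >= 10) {
        assertNotNull(cache.get("key" + (i - 10)));
    }
}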

This is obviously of some concern. I'm going to close this JIRA and open a new one linked to it, reporting that storing timestamps accurate only to the second causes problems for the statistical eviction when a large fraction of the cache population is (to within a second) of the same age.

Chris Dennis 2010-02-04

I’m resolving this as “As Designed” although I think the design is probably wrong. I’ve created a more suitably named issue describing this problem - and linked it to this JIRA.