• Bug
  • Status: Closed
  • 2 Major
  • Resolution: Fixed
  • ehcache-core
  • alexsnaps
  • Reporter: archie2
  • April 28, 2011
  • 0
  • Watchers: 4
  • July 27, 2012
  • May 19, 2011

Description

Referring to this code in {{Timestamper.java}}: {noformat} public static long next() { int runs = 0; while (true) { if (runs++ > 0 && LOG.isInfoEnabled()) { LOG.info(“Thread spin-waits on time to pass. Looped “ + runs + “ times, you might want to increase -Dnet.sf.ehcache.util.Timestamper.shift”); } long base = System.currentTimeMillis() « BIN_DIGITS; long maxValue = base + ONE_MS - 1;

        for (long current = VALUE.get(), update = Math.max(base, current + 1); update < maxValue;
             current = VALUE.get(), update = Math.max(base, current + 1)) {
            if (VALUE.compareAndSet(current, update)) {
                return update;
            }
        }
    }
} \{noformat\} I have no idea how exactly it happened, but we saw this error message: \{noformat\} Thread spin-waits on time to pass. Looped 1563731743 times,   you might want to increase -Dnet.sf.ehcache.util.Timestamper.shift \{noformat\}

*If this code can ever possibly loop a billion times, it’s by definition broken.*

Replace the too-fancy CAS hack with a simple {{synchronized}} statement.

My guess is that the clock was set backwards (behavior that any user of {{System.currentTimeMillis()}} must be prepared to handle) and so the inner loop only looped one time.

But it doesn’t matter. This algorithm is broken… nothing provably guarantees that a thread will ever exit this method. Just because it seems like that would never happen you can’t assume it won’t. Obviously it did once, at least for a billion loops (same as infinity as far as I’m concerned).

Comments

Archie Cobbs 2011-04-28

I just realized (duh) that when time goes backwards, the {{compareAndSet()}} is never even attempted.

So this code is even more broken than I thought.

Basically, if time goes backwards, every thread that enters this method goes into an infinite loop until the clock catches back up to where it was.

Chris Dennis 2011-04-28

As observed in EHC-777 we are reluctant to revert to a synchronized solution since the original move to a CASing approach was due to this code being a bottleneck in performance testing. A revert here would therefore result in a performance degradation. However I think there are three things that we can fix here:

We should ensure that spin-wait logging is only done on exit from the method when the final spin count is known - this will prevent us from spamming peoples logs with this message.

We should (and can) detect the backwards clock movement and log it’s occurrence. In a well configured NTP client backwards shifts are usually avoided at all costs (skewing is preferred).

We should allow users to configure the subsequent behavior after the clock drifts backwards, three immediately obvious choices are:

Sleep until the system clock has caught up (this is the ‘most correct’ behavior since it prevents the timestamp values from moving backwards or being duplicated).

Immediately cut to the new clock value.

Repeat the current timestamp value on every request until the system clock has caught up.

The last two behaviors could both cause stale data visibility/corruption effects in Hibernate. I’m sure there are many more potential behaviors that I’ve not listed, and I’m also not sure which one of these is the best default behavior.

Archie Cobbs 2011-04-29

*Comment #1*

I definitely agree with your #1 (hard to argue with reducing spam).

*Comment #2*

I don’t know how this class is used so this next comment may be mis-informed.

Anyway, it seems to me that the underlying problem here is that we are attempting to create a single {{long}} value which is both (a) a timestamp and (b) unique.

Things would get much simpler if the goals of (a) and (b) were split, i.e., into two separate fields.

Then timestamping is easy, just invoke {{System.currentTimeMillis()}}, and uniqueness is easy, just invoke {{AtomicLong.incrementAndGet()}}.

*Comment #3*

If there is a requirement that the timestamps in (a) never go backwards, then this is a separate sub-problem, and it should be solved separately. See for example {{[SafeClock http://download.oracle.com/docs/cd/E15357_01/coh.360/e15725/com/tangosol/util/SafeClock.html]}} (never used this but found it via google). Either a proper solution like that should be included, or Ehcache should clearly document “this code doesn’t work if your clock goes backwards”.
But this also brings up the question: is the timestamp used to record _absolute time_, or simply used to compute _time intervals_? I’m guessing the latter. If so, then the solution is to use {{[System.nanoTime() http://download.oracle.com/javase/6/docs/api/java/lang/System.html#nanoTime%28%29]}} instead of {{System.currentTimeMillis()}} and the backwards clock problem disappears.

Alexander Snaps 2011-05-19

Timestamper now:

  • handles time going back, slewing threads to catch up on previously highest value
  • logs only every second at most if we spin

Performance is now slightly lower, but we’re still around ~3x faster than the original implementation. (Used to be ~3.7x)