Deadlock and data corruption when transaction size too small

Type: Bug
Status: Closed
Priority: 1 Critical
Resolution: Fixed
Component/s:
Labels:

Assignee: hhuynh
Reporter: rbodkin
Created: May 03, 2007
Votes: 0
Watchers: 0
Updated: April 25, 2008
Resolved: July 20, 2007

Attachments

rbodkin (286.00 k) application/zip cdv253.zip
rbodkin (9.00 k) application/zip cdv253-bigtxn.zip

Description

I am testing a bulk load transaction into a hash map of hash maps. For efficiency I am inserting a moderate number of items (about 2000 changes) per operation. However, with 2000 changes per transaction, the transaction queue builds up to the maximum allowed (30) and then the L1 hangs, blocking on the queue. Moreover, it becomes impossible to connect to the L2 server from the admin console. If I kill the L1 and L2 processes and restart them, I can’t even start my L1 application. If I erase the server’s persistent data, then I can restart.

If I lower this to 100 changes per transaction, it makes more progress but eventually fails (presumably it hits the transaction size limit). Lowering it to 25 changes, it gets further still but eventually fails (I notice that in the smaller cases I get more than 1 TxnsInBatch). It would be much better if TC failed explicitly when a transaction exceeds the size limit (better yet, if it could grow the size limit dynamically).
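For readers without access to the original program, the loading pattern described above can be approximated as follows. This is only a sketch in plain Java, not the reporter's code: the class name `BulkLoader`, the key scheme, and the group count are all hypothetical, and it assumes that each `synchronized` block on the shared map would correspond to one transaction when the map is clustered.

```java
import java.util.HashMap;
import java.util.Map;

public class BulkLoader {
    // Hash map of hash maps; stands in for the shared root (hypothetical layout).
    private final Map<String, Map<String, String>> root =
            new HashMap<String, Map<String, String>>();
    private int transactions = 0;

    /** Inserts totalItems entries, with at most changesPerTxn inserts per synchronized block. */
    public void load(int totalItems, int changesPerTxn) {
        if (changesPerTxn <= 0) {
            throw new IllegalArgumentException("changesPerTxn must be positive");
        }
        for (int start = 0; start < totalItems; start += changesPerTxn) {
            int end = Math.min(start + changesPerTxn, totalItems);
            // Assumption: one synchronized block == one transaction once the map is shared.
            synchronized (root) {
                for (int i = start; i < end; i++) {
                    String group = "group-" + (i % 10);
                    Map<String, String> inner = root.get(group);
                    if (inner == null) {
                        inner = new HashMap<String, String>();
                        root.put(group, inner);
                    }
                    inner.put("item-" + i, "value-" + i);
                }
                transactions++;
            }
        }
    }

    /** Number of synchronized blocks (batches) executed so far. */
    public int transactionCount() {
        return transactions;
    }

    /** Total number of entries across all inner maps. */
    public int size() {
        int total = 0;
        synchronized (root) {
            for (Map<String, String> inner : root.values()) {
                total += inner.size();
            }
        }
        return total;
    }
}
```

Varying `changesPerTxn` between 25, 100, and 2000 corresponds to the three cases described in this report.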

Comments

Steve Harris 2007-05-03

Saro, can you look into this? I think there must be something more to this than what we know so far, since we test with larger data loads than this every night :-).

Tim Eck 2007-05-03

For tracking purposes, the next steps here are (1) reproduce with trunk, and (2) add more logging to determine if something is getting permanently stuck in a pending state.

Tim Eck 2007-05-03

added tc.property to debug txn flow in the server (use with care)

-Dcom.tc.l2.transactionmanager.logging.enabled=true

Ron Bodkin 2007-05-03

I see the same bug in nightly build 2842. I will try the new -D option when I can download the build from tonight.

Ron Bodkin 2007-05-04

I tried nightly build 2864 with the new -D option when running the L2 server but I didn’t see any additional logging. Was it included in this build?

Hung Huynh 2007-05-04

Tim pushed the change to rev2872 so that build didn’t have it yet. I just pushed another nightly build, you can use that.

Tim Eck 2007-05-04

Sorry, the change was revision 2872. If you’re interested, you can track our source and whatnot in FishEye (http://svn.terracotta.org/fisheye/browse/Terracotta) and/or the FishEye here on this item in JIRA.

Ron Bodkin 2007-05-07

Attached are the client and server logs. I ran the client with -Dcom.tc.l1.transactionmanager.logging.enabled=true and the server with -Dcom.tc.l2.transactionmanager.logging.enabled=true

Note that I killed the l2 server and left the client running for a while, so there are uninteresting messages at the end of that log.

Of note here is that the client maintains a flow of transactions to the server, yet the application is stalled. I believe that somehow the client is failing quietly and resubmitting doomed transactions over and over.

Ron Bodkin 2007-05-07

Here’s another log where the system fails quickly because I made the batch size much larger.

Saravanan Subbiah 2007-05-15

From the logs you attached, I can see that the transactions are getting stuck in the server for some reason. Unfortunately that is all I can tell.

Can you please attach the program that caused this? It would greatly help us in solving this issue.

Ron Bodkin 2007-05-16

Hi Saravanan, the program used for this test ties in with a lot of library code and can’t be shared. It might be possible to capture some logs for an extra diagnostic version of the application, if you could provide a way of running that?

Fiona OShea 2007-05-31

Keep watching this

Saravanan Subbiah 2007-06-19

Hi Ron,

We think this could be related to an OutOfMemory condition in the L2. Just to verify, can you please rerun the test with -verbosegc -XX:+PrintGCDetails for both L1 and L2 and resend the logs? Also, can you please set -Dcom.tc.l2.cachemanager.logging.enabled=true for L2 and -Dcom.tc.l1.cachemanager.logging.enabled=true for L1 when running the test?

This will give us more output that will help us debug.

thanks, Saravanan

Saravanan Subbiah 2007-07-20

We were able to reproduce this issue with an in-house test. The problem can occur when a single transaction contains changes to more than 5000 objects.

The fix was pushed into 2.4 and trunk.
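As an illustration of the "fail explicitly" behaviour requested in the description, an application could count changes itself and abort before a transaction grows past a chosen threshold. The sketch below is not a Terracotta API: `ChangeBudget` is a hypothetical helper, and the value 5000 is taken only from the reproduction note above, not from any documented limit.

```java
public class ChangeBudget {
    private final int limit; // hypothetical per-transaction cap chosen by the application
    private int used = 0;

    public ChangeBudget(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Records one change, failing fast instead of letting the batch grow past the cap. */
    public void recordChange() {
        if (used >= limit) {
            throw new IllegalStateException(
                    "transaction would exceed " + limit + " changes");
        }
        used++;
    }

    /** Number of changes recorded so far. */
    public int used() {
        return used;
    }
}
```

A loader would create one `ChangeBudget` per transaction and call `recordChange()` before each insert, so an oversized batch surfaces as an exception in the client rather than as a stall.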

Fiona OShea 2008-02-22

It appears that this was resolved months ago.