After my last post I spent a full day chasing my bug with very little progress. Around 7pm, just as I was winding down for the day, I found a small clue. It didn't seem like much, but it was nice to end the day on any sort of positive note.
This morning, using the clue, I was able to find the problem. It turned out not to be a low-level synchronization issue but a higher-level logical error, although still related to concurrency. That explained why the error was so consistent. I had missed one type of transaction conflict, which meant that under certain circumstances one transaction would overwrite another. The fix was easy (two lines of code) once I figured it out.
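To illustrate the general shape of this kind of bug, here is a minimal sketch. It is not the actual code: the `Store` and `Transaction` classes, the logical clock, and the `check_write_write` flag are all invented for the example, and a write-write conflict simply stands in for whichever conflict type was really missed.

```python
class Transaction:
    """Hypothetical transaction: buffers reads and writes until commit."""
    def __init__(self, start):
        self.start = start      # logical time the transaction began
        self.reads = set()      # keys read
        self.writes = {}        # key -> new value


class Store:
    """Hypothetical store with optimistic conflict checking at commit."""
    def __init__(self):
        self.data = {}
        self.clock = 0
        self.committed = []     # list of (commit_time, keys_written)

    def begin(self):
        return Transaction(self.clock)

    def commit(self, tran, check_write_write=True):
        for commit_time, keys in self.committed:
            if commit_time <= tran.start:
                continue        # committed before we began, so not concurrent
            if keys & tran.reads:
                return False    # we read something a concurrent commit changed
            # The "missed" check in this sketch: two concurrent transactions
            # wrote the same key. Without it, the later commit silently wins.
            if check_write_write and keys & set(tran.writes):
                return False
        self.clock += 1
        self.data.update(tran.writes)
        self.committed.append((self.clock, set(tran.writes)))
        return True


def run(check_write_write):
    """Two concurrent blind writes to the same key; t1 commits first."""
    store = Store()
    t1, t2 = store.begin(), store.begin()
    t1.writes["x"] = "from t1"
    t2.writes["x"] = "from t2"
    store.commit(t1, check_write_write)
    t2_committed = store.commit(t2, check_write_write)
    return t2_committed, store.data["x"]
```

With the check disabled, `run(False)` lets the second transaction commit and overwrite the first. With it enabled, `run(True)` rejects the second commit and the first transaction's value survives. In this sketch the difference really is only a couple of lines, which is why this sort of error is easy to miss and hard to find.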
Even with the clue, it wasn't exactly easy to track down. I ended up digging through a 100,000-line log file. Luckily I wasn't just reading through it; I was searching for particular things. It was a matter of finding the particular 50 lines where the error happened. After that it was fairly obvious.
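Searching a big log for a pattern and pulling out the lines around each hit can be sketched roughly like this. This is an assumption about the general approach, not the tool actually used; the function name `find_with_context` and the sample log lines are made up.

```python
import re
from collections import deque


def find_with_context(lines, pattern, before=2, after=2):
    """Yield (line_number, line) for each match plus surrounding lines."""
    regex = re.compile(pattern)
    recent = deque(maxlen=before)   # the last few unmatched lines
    remaining = 0                   # trailing context lines still to emit
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield from recent       # leading context
            recent.clear()
            yield number, line
            remaining = after
        elif remaining > 0:
            yield number, line      # trailing context
            remaining -= 1
        else:
            recent.append((number, line))


# Hypothetical log content, just to show the call.
sample = [f"line {i}" for i in range(1, 11)]
sample[4] = "tran 7 commit key=x"
hits = list(find_with_context(sample, r"commit", before=2, after=1))
```

For a real file you would pass the open file object as `lines`, so the 100,000 lines are streamed rather than loaded into memory all at once.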
Since fixing the bug I've run millions of iterations of a variety of scenarios for as long as 30 minutes with no problems. This evening I'll let it run for a couple of hours. I'll also think up some additional testing scenarios - there are still a few things that I'm not exercising.
Cleaning up the code before sending it to version control, I found an entire data structure (a hash map of transactions) that wasn't being used! I was carefully adding to it and removing from it, but I never actually used it. I must have at some point. So I removed it and everything worked the same. Humorous.
I don't want to be overly optimistic. I'm sure there are still bugs (there always are), but it's starting to feel like I'm over the worst of it.