The good news is that I've fixed a number of bugs and come up with a reasonable (I think) solution for my design flaw. The solution involved the classic addition of a level of indirection. Of course, the indirection itself isn't the trick; it's how you use it.
The bad news is that after I'd done all this, I was still getting the original error! It only occurs about once every 200,000 transactions (with 2 threads). (Thank goodness for fast computers - 200,000 transactions only take about 5 seconds.) Frustratingly, it doesn't happen in the debugger. With this kind of problem it's not much use adding print statements because you get way too much irrelevant output. A technique I've been finding useful is to have each transaction keep a log of what it's doing. Then when I get the error I can print the log from just the offending transaction. It's not perfect, because with concurrency problems you really need to see what the other thread was doing, but it's better than nothing.
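To give a rough idea of the per-transaction log, here is a minimal sketch in Python. It is not my actual code; the names (`Transaction`, `note`, `dump_log`, `run_transaction`) and the simulated failure are made up for illustration. The point is just that each transaction appends to its own private list, and nothing gets printed unless that transaction hits the error.

```python
import threading


class Transaction:
    """Hypothetical transaction that records its own actions."""

    def __init__(self, tran_id):
        self.tran_id = tran_id
        # The log is private to this transaction, so it needs no locking
        # as long as a transaction is only used by one thread.
        self.log = []

    def note(self, message):
        # Record the thread name too, since that matters for concurrency bugs.
        self.log.append(f"{threading.current_thread().name}: {message}")

    def dump_log(self):
        return "\n".join(f"t{self.tran_id} {line}" for line in self.log)


def run_transaction(tran_id, fail=False):
    """Run one made-up transaction; return its log only if it failed."""
    tran = Transaction(tran_id)
    try:
        tran.note("begin")
        tran.note("read key=1")
        if fail:
            # Stand-in for the rare error being hunted.
            raise RuntimeError("simulated conflict")
        tran.note("commit")
        return None
    except RuntimeError as e:
        tran.note(f"error: {e}")
        return tran.dump_log()
```

Successful transactions just throw their log away, so the 199,999 good ones produce no output at all. Catching the other thread's side of the story would need something more, such as also keeping the logs of recently finished transactions, which this sketch doesn't attempt.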
It was also annoying because it was the end of the day so I had to leave it with a known error :-(
Thinking about it, I realized I had rushed coding some of the changes, hadn't really reviewed them, and hadn't written any tests. Not good. When I went back to it this morning, sure enough, I had made mistakes in my rush job. Obviously, that self-imposed pressure to get things resolved by the end of the day is not always a good thing.
So now I'll go back and review the code and write some tests before I worry about whether I've fixed the original problem.