Tuesday, September 22, 2015

A Bug Hunter's Life

I recently spent two weeks tracking down a single bug. Needless to say, it was a frustrating process. Like most such things, what made it difficult was not a single problem, it was a whole cascade of errors, including many during the debugging process. I like to think I'm fairly organized and methodical, especially when working on a tough bug. I take notes on what I'm doing, what my assumptions are, what I've tried, what the results were. And yet I still managed to make multiple major mistakes and go a long way down the wrong paths based on false assumptions I should have questioned much sooner.

It started with a customer getting a corrupted database. Most of the time this is from hardware issues or crashing. But this one appeared to be caused by a bug in the code.

It wasn't easy to look into because all I had was the corrupt database, with no clues as to how it might have gotten that way. Thankfully, jSuneido's append-only database means that the database itself forms a kind of log. Based on that I could see that two transactions had committed conflicting updates - something that the whole transaction system is designed to prevent.

It seemed a pretty safe bet that this was a concurrency related issue - my least favorite kind of bug :-( I wasn't surprised that I couldn't make it happen. With roughly 1000 systems running 24x7 this was the only case I was aware of. So my sole sources of information were that database and the code itself.

In order to detect conflicts at commit time (jSuneido uses optimistic concurrency control) you need to know the outstanding overlapping update transactions (i.e. ones that had started after this one and had not completed yet). Perhaps the bug was in determining this set of transactions? I wrote a completely new implementation of this code and used it to cross-check.

We deployed this version to our customers and pretty quickly started to get lots of errors. But only a couple were from the new cross-check. The rest were from code I hadn't touched. And yet they quite clearly started when the new version was deployed. Strange.

The errors indicated that sometimes when it went to complete a transaction it would be missing from the list of outstanding ones. Again, this shouldn't be possible. That would explain why the overlapping set would be wrong, so it made a certain amount of sense. I spent days studying the code and trying (unsuccessfully) to figure out how it was happening.

This turned out to be completely off track - it was a just problem with my new cross checking code.

The stack traces showed a transaction commit that turned into an abort. That happens if an update transaction doesn't end up doing any updates. So I wrote a bunch of tests with those kinds of transactions in the mix, but they all worked fine. If nothing else, the code was getting reviewed and tested pretty heavily.

Finally, after wasting days on this, I realized there was another way for a commit to turn into an abort. The relevant part of the code looked like:

try {
     // commit stuff
} finally {
     abortIfNotComplete();
}

What was happening was that an exception during the commit stuff would get overridden (and lost) if there was an error in abortIfNotComplete(). So I wasn't seeing the original error at all. And the reason that the transaction was missing from the list was that it had got far enough into the commit to remove it.

I'm not sure why I wrote it using finally. In hindsight it was obviously a bad idea. I rewrote it like:

try {
     // commit stuff
} catch (Throwable e) {
     try {
          abortIfNotComplete();
     } catch (Throwable e2) {
          e.addSuppressed(e2);
          throw e;
     }

addSuppressed was added in Java 7. It is used, for example, in try-with-resources for exceptions when closing after another exception.

Part of the reason I didn't realize which path was being taken in the code was that I didn't have line numbers in the stack traces.  That was because in the Ant build task "debuglevel=vars" had been added. The intent had been to increase the amount of debugging information. But because of the way Ant and the Java -g option work, this had actually meant only variable information. I changed it to "debug=true" which becomes "-g" which means all debug information (file, line, and vars). Once this change percolated through and I started to get stack traces with line numbers, then I could see which path was being taken in the code.

One big step forward was when I was finally able to recreate the bug. Running random updates (of certain kinds) in a bunch of threads would encounter the bug within a few seconds. It wasn't the ideal recreation because the bug would occur buried within a lot of other activity, so it still wasn't possible to determine exactly what was happening.

I couldn't be 100% sure that what I was able to recreate was the same bug, but the end result was more or less identical - two transactions deleting the same record. But confusingly, the errors that I saw were about duplicate records being output, leading me to spend a bunch of time looking at the output code.

A huge mistake that I made repeatedly, without realizing it, was that I was building one variation of the jSuneido jar, and then testing with a different out of date one. So I wasn't actually testing with the changes I was making. This was because, for faster turnaround, I wasn't running a full build, I was just building the main jar that we deploy. But I was testing using the jar with Win32 support so I could use the IDE. This explained the totally baffling behavior where the debugging output that I was 100% certain should be triggered didn't appear at all. Which confused me and made me think I was totally wrong in my ideas about what was happening.

This mistake cost me about 4 days. I actually had the bug fixed, but it kept happening because I was testing with the old jar. Of course, since the bug still happened I assumed that I had not located it yet and I went looking for it in other places where, of course, I didn't find it.

Even worse, I thought my method of recreating the bug was quite "fragile". For instance, it would not happen running in the Eclipse debugger. Coming from C and C++, I didn't find that too surprising. And although Java should be more predicable, there would still be timing and threading differences which could affect concurrency. But I was totally wrong - the reason I couldn't recreate it within Eclipse was that I was running the latest code, which had the fix. Arghhh!

The actual bug and fix are obvious in hindsight. Throughout the long process I suspected that would be the likely result. But that didn't make it any easier, in fact it made it even more frustrating.

jSuneido's database uses optimistic concurrency. That means concurrent transactions proceed independently without any locking. Then when it comes time to commit, they verify that nothing has happened that would conflict. For example, they might find that another transaction has output a duplicate key. In which case the transaction will abort due to the conflict. Or it might find that another transaction has deleted a key that we also deleted. And that's where the bug was. I had made the assumption (back when I wrote this code) that it was ok if two concurrent transactions deleted the same key. After all, the end result is the same. And it's still serializable (i.e. the result is as if the transactions had happened one after another instead of concurrently).

The problem is not the deletes by themselves. If that was all the transactions were doing, my logic was correct. But transactions aren't that simple. They perform other actions based on whether the delete succeeded. For example, deleting an old version of a record and outputting a new version.  If you allow two transactions to think they deleted the old version, then they will, incorrectly, both output a new version. (leading to the duplicate outputs I was seeing)

All I had to do to fix the bug was to make duplicate deletes a conflict, like I already did with duplicate outputs. Two weeks for one line of code. How's that for productivity.

One interesting part of this whole adventure was that although it was a concurrency bug, it wasn't the kind of subtle interaction problem that I really fear. It was, in hindsight, a fairly clear, obvious issue.