Tuesday, September 06, 2011

Concurrency Continued

I figured I'd better write a concurrency stress test for my DbHashTrie semi-immutable data structure, to increase my confidence that there were no problems. (Of course, a test can never prove the absence of problems.)

I write my test and it passes. But being a good cynical/skeptical tester, I tweak it so it should fail. It still passes?! Then I realize that I'm running in multiple threads: if an assert fails, it throws an exception, that thread dies silently, and the failure is swallowed. Oops. I add a try-catch around the thread body.
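Something like this sketch (hypothetical names, not the actual test code) records failures from the worker threads instead of letting them disappear:

    // Sketch only: hypothetical names, not the actual DbHashTrie test.
    // Worker threads record the first failure instead of dying silently.
    import java.util.concurrent.atomic.AtomicReference;

    public class StressTest {
        private static final AtomicReference<Throwable> failure = new AtomicReference<>();

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[8];
            for (int i = 0; i < threads.length; ++i) {
                threads[i] = new Thread(() -> {
                    try {
                        exercise(); // asserts inside may throw AssertionError
                    } catch (Throwable t) {
                        failure.compareAndSet(null, t); // keep the first failure
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads)
                t.join();
            if (failure.get() != null)
                throw new AssertionError("worker thread failed", failure.get());
        }

        private static void exercise() {
            // hypothetical body: random puts/gets against the structure under test
        }
    }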

Now it fails, as expected. I remove the tweak, expecting it to now succeed. It still fails! I remove the multi-threading. It still fails. Huh!? This code is in live use, if there's a bug, that's definitely not good.

After thrashing for a (thankfully short) time, I realize it was a bug in my test, not in the actual code. To make it worse, I've been caught by the same thing multiple times. Some day I'll learn! When testing a map-like data structure with random keys, you have to remember that you can get duplicate keys, and in most maps a duplicate key overwrites the previous value. If your test checks for the original value, it will fail because it gets the new value.
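A minimal illustration of the trap, as a hypothetical test against a plain HashMap rather than the actual DbHashTrie test:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // With random keys, duplicates are likely, and a map overwrites the earlier value,
    // so the expected values must be tracked with the same overwrite semantics.
    public class RandomKeyTest {
        public static void main(String[] args) {
            Map<Integer, Integer> underTest = new HashMap<>(); // stand-in for the structure under test
            Map<Integer, Integer> expected = new HashMap<>();
            Random rand = new Random();
            for (int i = 0; i < 100_000; ++i) {
                int key = rand.nextInt(10_000); // duplicate keys will occur
                underTest.put(key, i);
                expected.put(key, i); // recording only the first value here would be the test bug
            }
            for (Map.Entry<Integer, Integer> e : expected.entrySet())
                assert e.getValue().equals(underTest.get(e.getKey()));
        }
    }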

Finally the test seems to be working properly. Except that it's not maxing out my cores. Since it's entirely in memory, with no IO, I can't see why it wouldn't. I play around with various parameters like the number of threads, the size of the data, and the number of iterations, but it still doesn't max out my CPU usage. (I use iStat Menus to show CPU usage in the menu bar all the time so I can easily monitor what's going on.)

The only thing I can think of is the locking. I comment out the two places with synchronized and re-run the tests. Voila, all eight cores at 100%. Not normally what you want to see, but in this case it was.

Hmmm... if it wasn't maxing out the cpu, it must have been running slower. I add some timing. Yikes, the synchronized version is over 10 times slower! (45 seconds with synchronized, 3.8 seconds without.)

Of course, this is close to a worst case scenario since the test doesn't do much more than call the one synchronized method. And although the locking is per node, every access has to start at the root node. Java locks are very fast these days, but only when uncontested. And in this case the nodes at the root of the tree will be heavily contested.
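The hot path looks roughly like this simplified sketch (not the actual jSuneido code); since every lookup starts at the root, the root node's lock is taken on every access:

    // Simplified sketch of per-node synchronized lazy loading.
    class Node {
        private Object[] children = null; // loaded lazily from the memory-mapped file

        synchronized Object[] children() {
            if (children == null)
                children = loadFromFile(); // hypothetical loader
            return children;
        }

        private Object[] loadFromFile() {
            return new Object[16]; // placeholder for reading the node's slots from disk
        }
    }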

If it was a single variable I could just make it volatile, but it's an array, and putting volatile on an array only makes the reference volatile, not the contents. That's where AtomicReferenceArray comes in. It had crossed my mind a few times, but its interface is quite different from a plain array, so switching would be a bit of a hassle. I'd also have to write my own versions of Arrays.copyOf and System.arraycopy.
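Those helpers would look something like this sketch (my guess at what's needed, not necessarily what ended up in the code):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Rough equivalents of Arrays.copyOf and System.arraycopy for AtomicReferenceArray.
    class AtomicArrays {
        static <T> AtomicReferenceArray<T> copyOf(AtomicReferenceArray<T> src, int newLength) {
            AtomicReferenceArray<T> dst = new AtomicReferenceArray<>(newLength);
            int n = Math.min(src.length(), newLength);
            for (int i = 0; i < n; ++i)
                dst.set(i, src.get(i));
            return dst;
        }

        static <T> void arraycopy(AtomicReferenceArray<T> src, int srcPos,
                AtomicReferenceArray<T> dst, int dstPos, int length) {
            for (int i = 0; i < length; ++i)
                dst.set(dstPos + i, src.get(srcPos + i));
        }
    }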

I changed the code to use AtomicReferenceArray. It was a bit slower than the raw array, but not a lot (5 seconds, versus 3.8).

I also did some checking, and the JVM spec does guarantee that reads and writes of references (and of primitives other than long and double) are atomic. So I think I'm safe not doing anything special for concurrent access. Threads could read stale values, which means there is potentially a data race where two threads end up loading the same node at the same time. But as far as I can figure, this should be benign, since they will both load the same node (it's never modified in the memory-mapped file), so it doesn't matter which one wins. I think it's equivalent to how String calculates its hashCode lazily.
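Dropping synchronized from the earlier lazy-load sketch leaves something like this (again, a sketch rather than the actual code); it's racy, but arguably benign, because racing threads load identical node data:

    // Unsynchronized version of the earlier lazy-load sketch.
    // Two threads may both see null and both load the node, but they load the same
    // immutable bytes from the memory-mapped file, so whichever write "wins" is fine.
    class Node {
        private Object[] children = null; // not volatile

        Object[] children() {
            if (children == null)
                children = loadFromFile();
            return children; // note: this reads the shared field twice - see below
        }

        private Object[] loadFromFile() {
            return new Object[16]; // placeholder: same data for every thread
        }
    }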

The question is, do I trust this reasoning? Or do I go with the safer, but slower and uglier AtomicReferenceArray?

A little searching on the web turns up Date-Race-Ful Lazy Initialization for Performance. Ironically, this both confirms my reasoning and identifies a bug in my code: I was accessing the shared variable more than once, which can lead to problems due to memory access re-ordering. Easily fixed, but yet another example of how hard it is to reason correctly about this stuff. (Just to hammer the point home, another article I found, Implementing hashCode, has the same bug of reading the variable multiple times without using volatile.)
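For the earlier sketch, the fix looks something like this: read the shared field exactly once into a local variable.

    // Fixed version of the earlier sketch: the non-volatile shared field is read exactly
    // once, so the null check and the returned value can't disagree. In the buggy version,
    // the JMM doesn't guarantee the second read sees the non-null value the first read saw.
    class Node {
        private Object[] children = null; // still not volatile

        Object[] children() {
            Object[] c = children;   // single read of the shared field
            if (c == null) {
                c = loadFromFile();
                children = c;
            }
            return c;
        }

        private Object[] loadFromFile() {
            return new Object[16];   // placeholder
        }
    }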

The blog post also references a section in Effective Java (2nd ed.) by Joshua Bloch, so I dig my copy out and review it. It confirms that the single-check idiom is valid, but recommends double-checked locking with a volatile variable instead. Even for the single-check idiom, it recommends a volatile variable, although that isn't strictly necessary.
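Paraphrasing the two idioms from memory, with hypothetical names (see the book for the real code):

    // Paraphrase of the lazy-initialization idioms (not Bloch's exact code).
    class Lazy {
        private volatile Object field;

        // Double-checked locking: computes at most once, locks only on the slow path.
        Object doubleChecked() {
            Object result = field;
            if (result == null) {
                synchronized (this) {
                    result = field;
                    if (result == null)
                        field = result = compute();
                }
            }
            return result;
        }

        // Single-check idiom: never locks, but may compute more than once.
        Object singleChecked() {
            Object result = field;
            if (result == null)
                field = result = compute();
            return result;
        }

        private Object compute() {
            return new Object(); // placeholder for the expensive computation
        }
    }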

After all that, I think I'm going to go with the simplest code with the raw array, non-volatile, and unsynchronized. That's not because I trust my reasoning, but because I trust the explanation and examples of people who are much smarter than me.


Monday, September 05, 2011

Concurrency Strikes Again

I thought it would be a relatively quick job to multi-thread the index building when loading a database dump. Three days of thrashing later, I more or less had it working.

But it was only about 10% faster, so I ripped out my three days' worth of work - not a big enough improvement to justify the extra complexity. I'm a bit puzzled by the limited improvement, but the speed is probably dominated by other factors (e.g. reading and writing the files).

The story doesn't end there though. Along the way I was seeing some very puzzling debugging output. So after abandoning my initial goal I started to dig into what was going on.

The bad news is that I found some serious concurrency flaws in my transaction handling. The good news is that I fixed them. Obviously, I need more/better tests.

Immutable data structures are great for concurrency, but they can have significant overhead compared to regular mutable ones. So, to try to get the best of both worlds, I have some semi-immutable data structures. The idea is that they are immutable when shared between threads, but mutable when confined to one thread. As always, the devil is in the details.

I found that I had inadvertently ended up sharing while still mutable, leading to some ugly bugs. The data structure becomes immutable after you persist it to disk, but I had updated the master copy before saving instead of after. A small mistake with big consequences. To make sure I don't make a similar mistake in the future, I added asserts in the relevant places to ensure the structure is immutable when it should be.
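The guard is simple; roughly this, as a simplified sketch rather than the actual jSuneido classes:

    // Simplified sketch of a semi-immutable structure guarded by asserts.
    class SemiImmutableNode {
        private boolean immutable = false;
        private final Object[] slots = new Object[16];

        // Called when the node is persisted/shared; from then on, mutation is a bug.
        void freeze() {
            immutable = true;
        }

        void set(int i, Object value) {
            assert !immutable : "attempt to modify a shared (immutable) node";
            slots[i] = value;
        }

        Object get(int i) {
            return slots[i];
        }
    }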

Another potential issue was that even when immutable, it's only "effectively" immutable, since the data structure is loaded from disk lazily. I had synchronized the updates but not the reads. That might actually be OK in this case, as long as writing a reference (pointer) is atomic. But I decided it was better to synchronize in a few more places than to depend on something I wasn't sure about.

Friday, September 02, 2011

Append-Only Database Performance

I've been working on the append-only storage engine for jSuneido and most recently dumping and loading. I was flipping back and forth between storage engines and it seemed like the new one was faster. I decided to time it and sure enough, with the new append-only storage engine, loading a database was close to twice as fast - nice!

Note: This was not any kind of scientific benchmark. It was a relatively small database (130 MB) and only measured loading. But still, a promising early result.

This is also still single-threaded. I've been itching to try using multiple threads to build indexes. Each index is relatively independent so it should be feasible, but I'm currently using an "exclusive" transaction so it's a little tricky. This should speed up loading even more.