Wednesday, May 07, 2008

More Concurrency

While I was waiting for the iMac transfer I was working on my project to multi-thread Suneido. The next piece I decided to tackle are the global hash tables for symbols and global names.

One obvious solution is to simply use many-readers/single-writer locking.

That still leaves the question of how to implement it. Do I mix the locking into the existing code? Or do I write a "wrapper" class that handles the locking and then delegates to the existing class. Neither is ideal - mixing in the locking will complicate the code but a wrapper will add overhead (although probably minimal).

But I'm not completely happy with adding locking overhead to readers. Especially for these tables. Once the server is "warmed up" these tables will almost never be written to. It would be really nice to avoid any overhead for readers.

My next thought was lock-free hash tables. I've heard various noises about this so I thought it was worth seeing what was available. As far as I could find, the answer was: not much.

The Intel Threading Building Blocks have some concurrent C++ data structures. But Boost doesn't. And STL doesn't (yet).

The most interesting thing I came across was Cliff Click's work. If you've got an hour to spare, check out the video of him giving a presentation about it. I see from his blog that he's extended his work to other data structures. Apart from the impressive benchmarks, what I really like about his approach is that it's simple and relatively understandable. (Although I still have a hard time wrapping my head around re-ordering of memory accesses and the required "fencing".)

Unfortunately, his work is all in Java. The only mention of C++ is someone commenting on the blog that they'd like a C++ version. Part of me is really tempted to implement it. It seems like it wouldn't be that hard. Famous last words.

Side note: I ran across some references to the advantages of power-of-two size + AND hash tables versus prime size + MOD (more traditional and what I'm using). One of the reasons is that MOD is slow (relative to AND). But I wonder how much that really affects overall speed.

Another thought I had was that I could just give each thread their own tables. The downside is the extra memory and redundant work maintaining the tables. This approach is closer to using separate processes instead of threads.

That led me to consider two copies - one for reading and one for writing. The reading one would be lock free. The write copy would be protected by a mutex. To write you would lock the write copy, update it, swap the two copies using an atomic compare-and-swap, and then update the other copy to keep in sync. This seems simple to implement, and simple enough to be reasonably sure it's "correct". It uses double the memory, but only for the hash table of pointers, not for the data itself. (Both copies can point to the same data.)

Strangely, I haven't seen any mention of this approach. Perhaps because it's not general purpose - it would not scale to many writers. But this shouldn't matter for my purposes. Or maybe there is some flaw that I'm missing? (concurrency issues can be like that)

The writing bottleneck could slow down startup. If so, (and obviously I'd want to measure that) then it would be fairly easy to write out symbol tables from a "warmed up" server and use these during startup to initialize.

As far as the atomic compare-and-swap, that seems readily available: in Win32 it's InterlockedCompareExchange. ACE has ACE_Atomic_Op, although it doesn't appear to have compare-and-swap. As of version 4.1 GCC has built-in atomic operations.

Given all the attention on concurrency these days, it's surprising that there aren't more tools available.

1 comment:

Larry Reid said...

I heard a podcast of a YouTube DBA who said that they basically used the one-writer, multiple-readers approach for the YouTube database. The relatively few updates go to one database instance, which all the reads get load-balanced over a bunch of others. Their challenge is to make sure the replication happens fast enough.

It's a different situation, but some validation that a one-writer approach works in the real world.