Tuesday, March 02, 2010

A Faster, Cleaner Database Design for Suneido

Current design:
  • entire database is kept in a single file
  • the database file is accessed via memory mapping
  • data records are immutable, updates add new versions
  • indexes are mutable
  • data records are used like a "log" for crash recovery
  • indexes are not recovered, they are rebuilt
For concurrency, it would be really nice if the database was a persistent immutable data structure. But this is difficult. For example, you don't want to write a new copy of a btree index node to disk every time you update it. (At least, I'm assuming that the resulting database growth would be impractical.)
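To make "persistent immutable" concrete: an update copies only the nodes on the path from the root down to the change, and every other subtree is shared between the old and new versions. Here's a minimal Java sketch of that path-copying idea, using a plain binary search tree instead of a btree to keep it short (the names and structure are illustrative, not jSuneido's):

    // Illustrative only: a persistent (immutable) binary search tree.
    // An insert copies just the nodes on the path from the root to the
    // change; all other subtrees are shared with the previous version.
    final class Node {
        final int key;
        final String value;
        final Node left, right;

        Node(int key, String value, Node left, Node right) {
            this.key = key; this.value = value;
            this.left = left; this.right = right;
        }

        // Returns a NEW root; the original tree is untouched, so old
        // readers can keep using it.
        static Node insert(Node n, int key, String value) {
            if (n == null)
                return new Node(key, value, null, null);
            if (key < n.key)
                return new Node(n.key, n.value, insert(n.left, key, value), n.right);
            if (key > n.key)
                return new Node(n.key, n.value, n.left, insert(n.right, key, value));
            return new Node(key, value, n.left, n.right); // same key: new node, shared subtrees
        }
    }

On disk, every one of those copied nodes would be another write, which is exactly why this looks impractical under the current design.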

In theory, Suneido happily accesses databases bigger than physical memory. In practice, performance drops off rapidly when that happens, due to swapping (page faults). You might expect a "working set" of data that is only a small portion of the database, but users tend to run things like historical reports or searches that access large portions of the data.

The other issue is that memory mapping the database makes reads fast (once the file is mapped into memory), but writes still have to go to disk (albeit somewhat asynchronously). And the index update writes are essentially random, which is the worst case for disk performance. (Maybe SSD drives would help with this?)

Which led me to think: why not just keep the database in memory? But...
  • Despite the performance issues, I'm not sure it's reasonable to limit database size to physical memory.
  • Loading the database into memory on startup and saving it to disk on shutdown would be slow.
  • You still need to write something to disk for crash recovery.
Next iteration: why not memory map the database, write data records to disk (so they can still be used as a "log" for recovery), but keep index changes in memory and only write them out when you close the database. (There's a rough sketch of this after the list below.)

This has some nice features:
  • Similar startup speed (although slower shutdown)
  • Fewer disk writes (only data, not indexes) - faster
  • Disk writes are sequential (index updates were random) - much faster
  • Allows databases bigger than memory
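Here's a rough Java sketch of that split (ignoring the memory mapping of reads for brevity): data records are appended sequentially to the file, which doubles as the recovery log, while the index lives purely in memory. The class and record format are made up for illustration; the real records and indexes would be more involved:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.TreeMap;

    // Illustrative only: data records are appended sequentially to the
    // database file (which doubles as the recovery "log"), while the
    // index lives purely in memory, mapping keys to file offsets.
    class AppendOnlyDb {
        private final RandomAccessFile file;
        private final TreeMap<String, Long> index = new TreeMap<String, Long>();

        AppendOnlyDb(String path) throws IOException {
            file = new RandomAccessFile(path, "rw");
        }

        // Sequential data write; the index change stays in memory,
        // so the random index writes disappear entirely.
        void put(String key, byte[] record) throws IOException {
            file.seek(file.length()); // always append
            long offset = file.getFilePointer();
            file.writeUTF(key); // key stored with the record, for recovery
            file.writeInt(record.length);
            file.write(record);
            index.put(key, offset);
        }

        byte[] get(String key) throws IOException {
            Long offset = index.get(key);
            if (offset == null)
                return null;
            file.seek(offset);
            file.readUTF(); // skip past the stored key
            byte[] record = new byte[file.readInt()];
            file.readFully(record);
            return record;
        }
    }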
Suneido's crash recovery rebuilds the indexes anyway, so it's not a problem that they're kept in memory. Theoretically, if indexes are now immutable, you could take the last good state and roll forward from there for faster recovery.
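Roll-forward could look something like this method, which could be added to the AppendOnlyDb sketch above. It assumes the persisted index snapshot records the log offset it covers, so recovery only re-reads what was appended after that point:

    // Could be added to the AppendOnlyDb sketch above. Assumes the
    // persisted index snapshot records the log offset it covers.
    void rollForward(long snapshotOffset) throws IOException {
        file.seek(snapshotOffset);
        while (file.getFilePointer() < file.length()) {
            long offset = file.getFilePointer();
            String key = file.readUTF();
            file.skipBytes(file.readInt()); // skip the record data itself
            index.put(key, offset); // later versions overwrite earlier ones
        }
    }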

You could also periodically update the on-disk indexes, presumably during idle time on a background thread.
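That could be as simple as a scheduled background task. A hedged sketch, where persistIndexes is a placeholder for whatever writes a consistent snapshot of the in-memory indexes to disk:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative only: flush the in-memory indexes from a background
    // thread at a fixed interval, so a crash costs at most the last
    // interval's worth of index rebuilding (the data log has everything).
    class IndexFlusher {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // persistIndexes is a placeholder, not actual jSuneido code.
        void start(Runnable persistIndexes) {
            scheduler.scheduleWithFixedDelay(persistIndexes, 1, 1, TimeUnit.MINUTES);
        }

        void stop() {
            scheduler.shutdown();
        }
    }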

This seems pretty good, but there's more. Now that index updates are not being written to disk immediately, it becomes feasible to make the indexes persistent immutable data structures.

jSuneido already sort of does this, but in a complicated, roundabout way. Each transaction maintains an "overlay" of updates to indexes. When the transaction commits, these are merged into the real indexes on disk, and other concurrent transactions are given a copy of the old index nodes so they don't see the changes. It works, but it's ugly and error prone.

With this new scheme, each transaction has, in effect, its own copy of the indexes. Since it's a persistent immutable data structure, updates by transactions have no effect on each other. When a transaction commits, you merge its indexes with the master. I think this would make the code quite a bit simpler.
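A minimal sketch of how that could work, reusing the immutable Node from earlier. The compare-and-set stands in for the merge step: if nothing else committed in the meantime, commit just publishes the new root; a real implementation would merge or re-apply the transaction's updates on conflict rather than failing (names here are hypothetical):

    import java.util.concurrent.atomic.AtomicReference;

    // Illustrative only: each transaction snapshots the master root.
    // Because nodes are immutable, its updates are invisible to other
    // transactions until commit publishes a new root.
    class IndexMaster {
        private final AtomicReference<Node> master = new AtomicReference<Node>();

        Node begin() {
            return master.get(); // a "copy" of the indexes is one pointer read
        }

        // Simplified: succeeds only if no one else committed first.
        boolean commit(Node snapshot, Node updated) {
            return master.compareAndSet(snapshot, updated);
        }
    }

So giving each transaction "its own copy" costs a single pointer read, and conflicts are detected at one point instead of being woven through the index code.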

It's hard to say how much faster this would be. Updates would be considerably faster, but reading would be much the same. So the overall speedup would depend on the mix of updates and reads. However, in a multi-user scenario (the current target for jSuneido) if you can reduce the load, that benefits everyone, including reads.

I'm not going to drop everything and implement this, but it's an interesting idea. I'll have to mull it over some more - there may be flaws that I haven't thought of yet. It might be something to work on once jSuneido is "finished" and deployed.

As with any "bright idea" like this, I can't help but wonder if I'm reinventing the wheel. I wouldn't be surprised if someone were already using this approach. Of course, that doesn't mean there's any way to reuse their code, or even learn from them. The internet might give us access to the world's knowledge, but it's still hard to find this kind of stuff.

