Monday, December 20, 2010

Append Only Databases

The more I think about the redirection idea from RethinkDB the more I like it.

The usual way to implement a persistent immutable tree data structure is "copy on write". i.e. when you need to update a node, you copy it and update the copy. The trick is that other nodes that point to this one then also need to be updated (to point to the new node) so they have to be copied as well, and this continues up to the root of the tree. The new root becomes the way to access the new version of the tree. Old roots provide access to old versions of the tree. In memory, this works well since copying nodes is fast and old nodes that are no longer referenced will be garbage collected. But when you're writing to disk, instead of just updating one node, now you have to write every node up the tree to the root. Even if the tree is shallow, as btrees usually are, it's still a lot of extra writing. And on top of the speed issues, it also means your database grows much faster.

The redirection idea replaces creating new copies of all the parent nodes with just adding a "redirection" specifying that accesses to the old version of the leaf node should be redirected to the new leaf node. A new version of the tree now consists of a set of redirections, rather than a new root. And you only need a single redirection for the updated leaf node, regardless of the depth of the tree. There is added overhead checking for redirection as you access the tree, but this is minimal, assuming the redirections are in memory (they're small).

One catch is that redirections will accumulate over time. Although, if you update the same node multiple times (which is fairly common) you will just be "replacing" the same redirection. (Both will exist on disk, but in memory you only need the most recent.)

At first I didn't see the big advantage of redirection. I could get similar performance improvements by lazily writing index nodes.

But the weakness of delayed writes is that if you crash you can't count on the indexes being intact. Any delayed writes that hadn't happened yet would be lost.

The current Suneido database has a similar weakness. Although it doesn't delay writes, it updates index nodes, and if the computer or OS crashes, you don't know if those writes succeeded.

The current solution is that crash recovery simply rebuilds indexes. This is simple and works well for small databases. But for big databases, it can take a significant amount of time, during which the customer's system is down. Crashes are supposed to be rare, but it's amazing how often it happens. (bad hardware, bad power, human factors)

Of course, you don't need the redirection trick to make an append only index. But without it you do a lot more writing to disk and performance suffers.

Even with an append only database, you still don't know for sure that all your data got physically written to disk, especially with memory mapped access, and writes being re-ordered. But the advantage is that you only have to worry about the end of the file, and you can easily use checksums to verify what was written.

On top of the crash recovery benefits, there are a number of other interesting benefits from an append only database. Backups become trivial, even while the database is running - you just copy the file, ignoring any partial writes at the end. Replication is similarly easy - you just copy the new parts as they're added to the database.

Concurrency also benefits, as it usually does with immutable data structures. Read transactions do not require any locking, and so should scale well. Commits need to be appended to the end of the database one at a time, obviously, but writing to a memory mapped file is fast.

Another nice side benefit is that the redirections can also be viewed as a list of the changed nodes. That makes comparing and merging changes a lot easier than tree compares and merges. (When a transaction commits, it needs to merge its changes, which it has been making on top of the state of the database when it started, with the current state of the database, which may have been modified by commits that happened since it started.)

Ironically, I was already using a similar sort of redirection internally to handle updated nodes that hadn't been committed (written) yet. But I never made the connection to tree updating.

I love having a tricky design problem to chew on. Other people might like to do puzzles or play computer games, I like to work on software design problems. I like to get absorbed enough that when I half wake up in the night and stagger to the bathroom, my brain picks up where it left off and I start puzzling over some thorny issue, even though I'm half asleep.


Anonymous said...

I try to find such dbms (append-only) for my logging project (save simple struct's from hosts and then query then and make chart) but can't find simple and durable solution. Where can I look at your Suneido?
Thank you!

andrew said...

The source code is in Mercurial on SourceForge

The append-only database code is not complete yet, I'm still working on it.

Also, the database code is mixed with the rest of Suneido so it would take some work to use it separately.

Daniel Waterworth said...

You may be interested in aodbm ( ).