Wednesday, July 22, 2020

Immutable B-tree Redirects

Regular update-in-place B-trees suffer from write amplification. To update a single key you usually end up (re)writing a whole node. SSD storage adds its own write amplification because it can't update in place; it has to write new copies to an erased block.

Immutable B-trees are worse. Not only do you have to write a whole node, you also have to allocate new space for that node. Plus you need to update the tree path leading to that node, which means allocating and writing several more nodes. (This "path copying" is common in immutable persistent data structures.) This can make index file space grow excessively and require more frequent compaction or garbage collection. (On the positive side, the writing is sequential, which is more efficient.)
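
For illustration, here's a minimal sketch of path copying in Go (made-up types, not Suneido's actual code, and node splitting is omitted):

    package btree

    // A simplified immutable B-tree node. Children are real pointers here;
    // on disk they would be file offsets.
    type Node struct {
        keys     []string
        children []*Node // nil for leaf nodes
    }

    // insert returns a new root. Every node on the path from the root to
    // the affected leaf is copied; the old tree is left untouched.
    func insert(n *Node, key string) *Node {
        cp := *n // allocate and shallow-copy this node
        if cp.children == nil {
            // leaf: copy the keys and add the new one (ordering omitted)
            cp.keys = append(append([]string{}, n.keys...), key)
            return &cp
        }
        // tree node: copy the children, replacing the one on the path
        cp.children = append([]*Node{}, n.children...)
        i := childIndex(n, key)
        cp.children[i] = insert(n.children[i], key)
        return &cp
    }

    // childIndex picks which child to descend into.
    func childIndex(n *Node, key string) int {
        i := 0
        for i < len(n.keys) && key >= n.keys[i] {
            i++
        }
        return i
    }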

I ran into this problem when I wrote the jSuneido immutable B-tree implementation. To reduce it to a manageable level, I ended up persisting indexes to disk periodically (once per minute) rather than after each transaction. This reduces both the write and space amplification.

There are various other approaches to reducing the amplification. For example, instead of writing out a copy of an entire node, you can write only the delta (differences). The downside is that to get the current version of the node, you need to read possibly multiple deltas plus the original node.
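
As a rough sketch of the delta idea (again made-up types, continuing the sketch above, and not something Suneido does), reading a node means fetching the original and then replaying every delta written since:

    // A delta records a single change to a node, e.g. one inserted key.
    type delta struct {
        key string
    }

    // currentKeys reconstructs a node's current contents by reading the
    // original and replaying each delta written since, in order.
    // Every extra delta is another read (and possibly another disk seek).
    func currentKeys(original []string, deltas []delta) []string {
        keys := append([]string{}, original...)
        for _, d := range deltas {
            keys = append(keys, d.key) // ordering omitted for brevity
        }
        return keys
    }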

With the new version of the database that I'm working on for gSuneido, I took the same approach of periodically writing index updates. I knew from jSuneido that this approach was workable but it still bugged me. After all, the whole point of a new design was to make improvements.

To reduce memory allocation (and subsequent garbage collection) I use a hash map of "redirects" and defer path copying till writing to disk. An obvious approach is to write the redirects to disk instead of doing the path copying. But it seemed like that had the same problem as writing node deltas (redirects are really just a kind of delta) - they improve writing at the cost of slowing down reading.
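
Roughly, the redirect map looks something like this (a simplified sketch with made-up names and uint64 file offsets, not gSuneido's actual code):

    package db

    // node stands in for an in-memory B-tree node; details omitted.
    type node []byte

    // redirects maps a node's original file offset to the offset of its
    // updated version. Updating a node adds one entry here instead of
    // immediately path copying its parents up to the root.
    var redirects = map[uint64]uint64{}

    // getNode follows a redirect (if any) before fetching the node, so
    // lookups transparently see the updated version of a redirected node.
    func getNode(off uint64) node {
        if to, ok := redirects[off]; ok {
            off = to
        }
        return readNodeAt(off)
    }

    // redirect records that the node at 'from' is superseded by the node
    // written at 'to'.
    func redirect(from, to uint64) {
        redirects[from] = to
    }

    // readNodeAt stands in for reading a node from the database file.
    func readNodeAt(off uint64) node { return nil }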

My aha moment was when I realized that I already have those redirects in memory in a fast hash table for use during transactions. They would only need to be read from disk when the database starts up. The only additional cost would be larger redirect hash tables, but this has minimal effect on speed.

Of course, at some point too many redirects will accumulate and it makes sense to "flatten" them, i.e. rewrite the nodes and eliminate the redirects.

Thinking through the ramifications of this design (long runs are good for that), I realized that one of the tricky parts was going to be "flattening" the redirects. If all you store are the original and updated offsets for the redirect, then you'd need to search the whole B-tree to find the spot to update. My first thought was that I just needed to also store the node the redirection applied to. But that's not sufficient because you have to update its parents all the way up the tree. Obviously, to path copy, you need the path. But to store the offsets of all the nodes on the path from the root to the leaf would double or triple the size of the redirections. This would somewhat defeat the purpose given that we're storing the redirections to save space.

Instead of storing the path to each redirection, I ended up storing a separate set of tree nodes on the path to redirects. This is smaller because paths share a lot of tree nodes, e.g. the root node is on every path. When traversing the tree to find nodes to flatten, only tree nodes in the set need to be followed.
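
Here's a rough sketch of flattening using that set, continuing the redirect sketch above; childOffsets, setChild, and writeNode are hypothetical stand-ins for the real node access routines:

    // paths holds the offsets of the tree (interior) nodes that lie on
    // the path from the root to at least one redirected node. Paths share
    // nodes (the root is on every path), so this set stays much smaller
    // than storing a full path per redirect.
    var paths = map[uint64]bool{}

    // flatten rewrites the nodes reachable through redirects and returns
    // this node's (possibly new) offset. Subtrees whose root is not in
    // the paths set can't contain a redirect and are skipped entirely.
    func flatten(off uint64) uint64 {
        onPath := paths[off] // check before resolving this node's redirect
        if to, ok := redirects[off]; ok {
            off = to // resolve the redirect so it can be dropped
        }
        if !onPath {
            return off // nothing below here was redirected
        }
        n := getNode(off)
        for i, child := range childOffsets(n) {
            setChild(n, i, flatten(child)) // recurse only down marked paths
        }
        return writeNode(n) // rewrite this node with its updated children
    }

    // Stubs standing in for the real node access routines.
    func childOffsets(n node) []uint64       { return nil }
    func setChild(n node, i int, off uint64) {}
    func writeNode(n node) uint64            { return 0 }

After the traversal, the redirects and the path set can be cleared.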

I have this implemented and it seems to work well. I don't have a complete enough system to benchmark yet. It'll be interesting to see what overall impact it has.

