Thursday, April 14, 2011

B-tree Arcana

I recently ran across a new article on Stratified B-trees and versioning dictionaries. I have to admit I didn't really follow their data structure or algorithm. I looked at some of the references, for example Cache-Oblivious Streaming B-trees, but the ideas seem too complicated to be directly useful to me.

However, there were some bits that were interesting to me. They talked about the space blowups and inefficient caching in path copying CoW (copy-on-write) B-trees.

"each update may cause an entire new path to be written – to update a 16-byte key/value pair in a tree of depth 3 with 256K block size, one must first do 3x256K random reads and then write 768K of data" (one system they cite reported 2GB of log data per 50MB of index data)

I'm hoping to avoid this using redirection (based on ideas from RethinkDB). By redirecting instead of path copying, you only need to copy the node you're updating, not all the nodes on the path to the root. (And the nodes can be orders of magnitude smaller - see below.)
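To make the difference concrete, here's a toy sketch (the names and structure are mine, not RethinkDB's or my actual code): path copying rewrites every node from the leaf to the root, while redirection rewrites only the updated node and records its new address in a side table.

```python
# Toy model of update cost: how many nodes must be rewritten
# when one leaf changes in a copy-on-write B-tree.

def path_copy_writes(depth):
    # Path copying: the leaf and every ancestor up to the root
    # get new copies, so the whole path is rewritten.
    return depth

def redirect_writes(redirections, old_addr, new_addr):
    # Redirection: only the updated node is copied; its parents are
    # left alone and old -> new is recorded in a side table.
    redirections[old_addr] = new_addr
    return 1

redirections = {}
assert path_copy_writes(3) == 3                    # depth-3 tree: 3 node writes
assert redirect_writes(redirections, 17, 42) == 1  # 1 node write + a table entry
```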

Many systems use indirect access through a mapping table all the time. Instead, I use direct access whenever possible, and only go through the redirection table for updated data. That feels like it should be better, but without a lot of analysis it's hard to be sure. It does reduce space overhead somewhat.
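A minimal sketch of that read path (the plain dict is my stand-in for the real table): a read uses the node's on-disk address directly unless that particular node has been redirected.

```python
def resolve(redirections, addr):
    # Direct access by default: untouched nodes are read at their
    # original address; only updated nodes pay the indirection.
    return redirections.get(addr, addr)

redirections = {17: 42}                  # node 17 was rewritten at address 42
assert resolve(redirections, 17) == 42   # updated node: one level of indirection
assert resolve(redirections, 99) == 99   # untouched node: direct access
```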

One of the big issues with CoW B-trees is garbage collecting the old nodes. I have to admit I punt on this, requiring occasional database compaction instead. (This also removes all the redirection.) Not so good if you want to run 24x7, but it's a tricky problem, and for our customers occasional maintenance downtime isn't a big deal.

Another interesting point was on SSDs (solid state drives).

"SSDs handle small random reads very effectively, but perform poorly at random writes (the recent Intel X25-M can perform 35,000 4KB random reads/s, but an order of magnitude fewer writes). However, they can perform very efficiently for large sequential writes."
That fits really well with CoW append-only B-trees since they do exactly that - random reads but sequential writes. And I recently got a new computer at work with an SSD :-) so I'll be able to do some testing.
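A toy append-only store shows why the access pattern fits (this is my illustration, not my actual file format): every write lands at the end of the file, so writes are sequential, while reads seek to arbitrary saved offsets.

```python
import io

# Toy append-only node store. Writes always go to the end of the
# file (sequential); reads seek to whatever offset a node was
# written at (random).

class AppendOnlyStore:
    def __init__(self):
        self.f = io.BytesIO()   # stand-in for a file on disk

    def write_node(self, data):
        self.f.seek(0, io.SEEK_END)   # always append: sequential writes
        addr = self.f.tell()
        self.f.write(data)
        return addr                   # the node's permanent address

    def read_node(self, addr, size):
        self.f.seek(addr)             # arbitrary offset: random reads
        return self.f.read(size)

store = AppendOnlyStore()
a = store.write_node(b"root-v1")
b = store.write_node(b"leaf-v1")
assert store.read_node(a, 7) == b"root-v1"
assert b == 7   # second node lands right after the first
```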

Another point that caught my eye was that their arrays doubled in size between levels. Not understanding their system, I'm not completely sure what this means. But it rang a bell because I've been considering doing something similar with my B-tree nodes. Often, B-trees use large node sizes to reduce the number of random disk IOs. But large nodes don't work well with copy-on-write. And random IOs aren't as bad with SSDs or just large disk caches.

I still need to do some testing to determine the best size, but I think it will be quite small. But I wonder if it would make sense to make nodes larger as you go "up" the tree towards the root. The leaf nodes are updated most often, so it makes sense to make them small. But as you go up the tree, the nodes are read more and updated less. Increasing (e.g. doubling) the size of the nodes at each level would decrease the number of tree levels and reduce the number of random IOs. And since larger nodes would split less often, higher tree levels would get updated even less frequently.
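As a back-of-envelope check on that idea, here's a toy depth calculation (the 16-byte entry size and the node sizes are illustrative assumptions, not measurements): fixed-size nodes versus nodes that double in size at each level above the leaves.

```python
ENTRY = 16   # assumed bytes per key/child entry

def depth_fixed(n_entries, node_size):
    # Depth of a tree where every node is the same size.
    fanout = node_size // ENTRY
    depth, capacity = 1, fanout
    while capacity < n_entries:
        depth += 1
        capacity *= fanout
    return depth

def depth_doubling(n_entries, leaf_size):
    # Depth when node size doubles at each level above the leaves,
    # so fanout grows as you go up the tree.
    depth, size = 1, leaf_size
    capacity = leaf_size // ENTRY
    while capacity < n_entries:
        depth += 1
        size *= 2
        capacity *= size // ENTRY
    return depth

# A billion entries with small 1K leaves: doubling saves a level.
assert depth_fixed(10**9, 1024) == 5
assert depth_doubling(10**9, 1024) == 4
```

With 4K nodes the two schemes come out at the same depth for this size of tree; the saving shows up when the leaves are kept small, which is exactly the case being considered here.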

If I were in the academic world I'd need to figure out the theoretical performance of my approach under the DAM (disk access machine) or CO (cache oblivious) models. Instead, I'll do some quick and dirty benchmarks, and if there's not a big performance difference, go with the simplest code.


Riyad said...

Andrew, what did you find with your research? I have a feeling your redirection table idea for updated nodes, at scale, essentially just becomes a massive, hard-to-read "node" that takes multiple block reads to pull in, making it the same as or worse than standard CoW. Just a thought - I would like to know how your big-root / small-leaf idea benchmarked.

Andrew McKinlay said...

Performance seems good, but I'm still having problems with excessive database growth (see recent posts).

I'm storing the redirections in a hash trie, not in one big blob. And since it just maps int to int, it's not that large and should stay swapped in (memory mapped access).

The approach I'm currently looking at is to only "checkpoint" (write out index data) periodically rather than after every transaction. With this approach it may not be worth having the redirections since you'd only be writing out the top levels of the tree once per checkpoint.

I haven't got around to trying the big root / small leaf idea although I'd still like to.

You can see my posts in this area by selecting the immudb tag.