Thursday, July 23, 2020

Hash Array Mapped Trie in Go

When you want an immutable persistent (in the functional sense) hash table, a Hash Array Mapped Trie is a common choice.

I recently implemented one to use in the gSuneido database. I procrastinated because I had a vague memory that it had been tricky to implement in jSuneido. But when I finally got around to doing it, it turned out to be relatively easy.

A Hash Array Mapped Trie (HAMT) is a tree of nodes. Each node has an integer bit map (commonly 32 bits) that specifies which slots are occupied, and a variable size array of items. The bitmap means you don't waste space on empty slots like you do in most hash tables.

Each level of the tree consumes 5 bits (for a bitmap of 32 bits) of the hash value. In most implementations, the items are pointers, and can point either to a value or to a child node. The items could be void* in C++ or Object in Java. The equivalent in Go would be interface{}, which can hold a value of any type. But a Go interface{} is double the size of a simple pointer. And on top of that, it requires casting to get the proper types back out. This approach also doesn't allow embedding small values; they have to be referenced by pointer, so it's not as cache friendly. (An interface can reference a non-pointer type like a struct, but internally it's still using a pointer.)
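
To make the indexing concrete, here is a minimal sketch (illustrative Go, not the gSuneido code) of the usual bitmap arithmetic: the 5-bit chunk of the hash selects a bit in the bitmap, and a population count of the lower bits gives the index into the compressed array.

    package hamt

    import "math/bits"

    const bitsPerLevel = 5
    const levelMask = 1<<bitsPerLevel - 1 // 0x1f

    // slot reports whether the 5-bit chunk of hash at this shift is present
    // in the bitmap, and the index of its entry in the compressed array.
    func slot(bitmap uint32, hash uint32, shift uint) (present bool, index int) {
        bit := uint32(1) << ((hash >> shift) & levelMask)
        present = bitmap&bit != 0
        // entries are stored in bit order, so the index is simply the
        // number of lower bits that are set
        index = bits.OnesCount32(bitmap & (bit - 1))
        return present, index
    }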

I decided to take a slightly different approach and use two bitmaps and two slices (Go's variable sized arrays). One bitmap and slice for values, and one bitmap and slice for child pointers. That way the slices could be the proper types, no casting was required, and small values could be embedded rather than referenced by pointer. The tradeoff was that it uses slightly more space.
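
Roughly, a node in this design looks something like the following sketch (with a made-up Item type standing in for the key/value pair; the real code is generated per type):

    // Item is a placeholder for whatever key/value type the tree stores.
    type Item struct {
        key string
        val int
    }

    // node pairs one bitmap and slice for embedded values with another
    // bitmap and slice for child pointers, both in slot (bit) order.
    type node struct {
        valBits    uint32  // which of the 32 slots hold a value
        ptrBits    uint32  // which of the 32 slots have a child node
        vals       []Item  // values, embedded directly, no casting needed
        ptrs       []*node // child nodes
        generation uint32  // for the mutable/immutable modes described below
    }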

After finishing an initial implementation, I realized that this design allowed a slight improvement over the usual approach.

In the usual approach, if a slot at a given node already holds a value and you want to insert another value with the same 5 bits of hash (a collision), you replace the value with a pointer to a child node and put both values in the new node. If your hash is bad, they can collide again, forcing yet more child nodes, poor space usage, and longer searches. (When you run out of hash bits you either have to rehash or use overflow nodes.)

But with separate slices for child pointers and values, you can have both for a given slot. So when you get a collision, you still create a child node, but you don't need to move the old value. The new node only has one element, so there cannot be any further collisions from this insertion, even if the hash values are identical. And it actually simplifies the code and reduces the average search length slightly.
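
Continuing the sketch above (again, illustrative rather than the actual implementation), insertion might look roughly like this; note that on a collision the existing value stays where it is and only the new item descends into the child:

    // insert adds or updates it in the subtree rooted at nd.
    // shift selects which 5 bits of the hash this level uses.
    func (nd *node) insert(it Item, hash uint32, shift uint) {
        bit := uint32(1) << ((hash >> shift) & levelMask)
        vi := bits.OnesCount32(nd.valBits & (bit - 1))
        if nd.valBits&bit == 0 {
            // empty slot: store the value here
            nd.valBits |= bit
            nd.vals = append(nd.vals, Item{})
            copy(nd.vals[vi+1:], nd.vals[vi:])
            nd.vals[vi] = it
            return
        }
        if nd.vals[vi].key == it.key {
            nd.vals[vi] = it // same key: just update the value
            return
        }
        // collision: the old value stays in this node,
        // only the new item goes down into the child
        pi := bits.OnesCount32(nd.ptrBits & (bit - 1))
        if nd.ptrBits&bit == 0 {
            // no child yet: create one; it will hold just the new item,
            // so this insertion cannot collide again
            nd.ptrBits |= bit
            nd.ptrs = append(nd.ptrs, nil)
            copy(nd.ptrs[pi+1:], nd.ptrs[pi:])
            nd.ptrs[pi] = &node{}
        }
        nd.ptrs[pi].insert(it, hash, shift+bitsPerLevel)
    }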

The other variation in my implementation is that the data structure operates in two modes: either immutable (read-only) and safely sharable between threads, or mutable (updatable) and not sharable. As with most persistent immutable tree data structures, making a change requires path copying to the root. That's a lot of overhead to do for every insert or delete. My code tends to operate in two modes: either a single thread is updating, or multiple threads are sharing immutably. Having a mutable mode means the path copying is shared/reduced, e.g. every insert/delete doesn't have to make a new root node. I use a "generation" integer value to know whether a node is mutable. If the node's generation doesn't match the current generation, then we make a copy of the node and set its generation to the current one so it will be mutable from now on. To switch everything back to immutable (freeze it), we simply increment the current generation. (We don't have to touch the nodes.)
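
The generation check itself is simple. A rough sketch of the idea (hypothetical code; the type and method names are made up for illustration):

    // Hamt is the root of the tree plus the current generation.
    type Hamt struct {
        root       *node
        generation uint32
    }

    // mutable returns nd itself if it already belongs to the current
    // generation, otherwise a copy stamped with the current generation,
    // which is then mutable from now on.
    func (h *Hamt) mutable(nd *node) *node {
        if nd.generation == h.generation {
            return nd
        }
        cp := *nd // in the real code the slices are copied as well
        cp.generation = h.generation
        return &cp
    }

    // Freeze switches everything back to immutable by bumping the current
    // generation; no existing node matches it, and no nodes are touched.
    func (h *Hamt) Freeze() {
        h.generation++
    }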

Until Go gets generics, I'm using the Genny code generator to create typed versions. It's not perfect, but it works well enough.

The code is on GitHub as usual.

Update 2020-09-15

I discovered a bug in my implementation. If you insert two colliding items, the second will go into a child node. If you then delete the one from the parent node, you end up with a situation my code didn't handle. It assumed that if the value slot of a node was empty, then that hash didn't exist. One solution would be to check child nodes. But it made more sense to put the burden on the less common delete operation. Now when you delete a value from a node, it "pulls up" a value from a child node if there is one. This maintains the invariant that if a value slot of a node is empty, then there are no child nodes of that slot. And by pulling up from the lowest possible child, it also keeps the tree as shallow as possible.
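
In terms of the earlier sketches, the fix looks roughly like this (hypothetical code): deleting a value either drops the slot outright or, if the slot also has a child, refills it by pulling a value up from the child.

    // remove deletes the value in the slot for bit (vi is its index in vals),
    // maintaining the invariant that a slot with a child always has a value.
    func (nd *node) remove(bit uint32, vi int) {
        if nd.ptrBits&bit == 0 {
            // no child for this slot: just drop the value
            nd.valBits &^= bit
            nd.vals = append(nd.vals[:vi], nd.vals[vi+1:]...)
            return
        }
        // the slot has a child: pull a value up from it rather than
        // leaving the slot empty
        pi := bits.OnesCount32(nd.ptrBits & (bit - 1))
        child := nd.ptrs[pi]
        cbit := uint32(1) << uint(bits.TrailingZeros32(child.valBits))
        nd.vals[vi] = child.vals[0]
        child.remove(cbit, 0) // recurse so the child keeps the invariant too
        if child.valBits == 0 && child.ptrBits == 0 {
            // the child is now empty: remove the pointer to it
            nd.ptrBits &^= bit
            nd.ptrs = append(nd.ptrs[:pi], nd.ptrs[pi+1:]...)
        }
    }

The real code pulls up from the lowest possible child to keep the tree as shallow as possible; this sketch just takes the child's first value.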

Wednesday, July 22, 2020

Immutable B-tree Redirects

Regular update-in-place B-trees suffer from write amplification. To update a single key you usually end up (re)writing a whole node. In addition, SSD storage itself suffers from write amplification because it can't update in place; it has to write new copies to an erased block.

Immutable B-trees are worse. Not only do you have to write a whole node, you have to allocate new space for that node. Plus you need to update the tree path leading to that node, which means allocating and writing several more nodes. (This "path copying" is common in immutable persistent data structures.) This can make index file space grow excessively, and require more frequent compaction or garbage collection. (On the positive side the writing is sequential which is more efficient.)
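
As a toy illustration of that path copying (not the actual gSuneido B-tree code): replacing one child means allocating a copy of its parent, and so on all the way up to the root.

    // tnode is a toy immutable tree node; keys and values are omitted.
    type tnode struct {
        children []*tnode
    }

    // withChild returns a new copy of nd with its i'th child replaced.
    // The original is never modified, so every ancestor of a changed leaf
    // ends up being copied (and eventually written) as well.
    func (nd *tnode) withChild(i int, child *tnode) *tnode {
        cp := &tnode{children: append([]*tnode(nil), nd.children...)}
        cp.children[i] = child
        return cp
    }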

I ran into this problem when I wrote the jSuneido immutable B-tree implementation. To reduce it to a manageable level, I ended up persisting indexes to disk periodically (once per minute) rather than after each transaction. This reduces both the write and space amplification.

There are various other approaches to reducing the amplification. For example, instead of writing out a copy of an entire node, you can write only the delta (differences). The downside is that to get the current version of the node, you need to read possibly multiple deltas plus the original node.

With the new version of the database that I'm working on for gSuneido, I took the same approach of periodically writing index updates. I knew from jSuneido that this approach was workable but it still bugged me. After all, the whole point of a new design was to make improvements.

To reduce memory allocation (and subsequent garbage collection) I use a hash map of "redirects" and defer path copying until writing to disk. An obvious approach is to write the redirects to disk instead of doing the path copying. But it seemed like that had the same problem as writing node deltas (redirects are really just a kind of delta): they improve writing at the cost of slowing down reading.

My aha moment was when I realized that I already have those redirects in memory in a fast hash table for use during transactions. They would only need to be read from disk when the database starts up. The only additional cost would be larger redirect hash tables, but this has minimal effect on speed.
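
A rough sketch of the redirect idea (the names and types are illustrative, not the actual gSuneido code): a redirect maps the file offset of an original node to the offset of its replacement, and tree traversal passes each child offset through the table before following it.

    // redirects maps the offset of an original B-tree node to the offset
    // of the node that now replaces it.
    type redirects map[uint64]uint64

    // resolve follows a redirect if there is one for this node offset.
    func (r redirects) resolve(off uint64) uint64 {
        if newOff, ok := r[off]; ok {
            return newOff
        }
        return off
    }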

Of course, at some point too many redirects will accumulate and it makes sense to "flatten" them, i.e. rewrite the nodes and eliminate the redirects.

Thinking through the ramifications of this design (long runs are good for that), I realized that one of the tricky parts was going to be "flattening" the redirects. If all you store are the original and updated offsets for the redirect, then you'd need to search the whole B-tree to find the spot to update. My first thought was that I just needed to also store the node the redirection applied to. But that's not sufficient because you have to update its parents all the way up the tree. Obviously, to path copy, you need the path. But to store the offsets of all the nodes on the path from the root to the leaf would double or triple the size of the redirections. This would somewhat defeat the purpose, given that we're storing the redirections to save space.

Instead of storing the path to each redirection, I ended up storing a separate set of tree nodes on the paths to redirects. This is smaller because paths share a lot of tree nodes, e.g. the root node is on every path. When traversing the tree to find nodes to flatten, only tree nodes in the set need to be followed.
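
Continuing the sketch above (again with hypothetical names): when a redirect is recorded, the offsets of the tree nodes on its path go into a shared set, and flattening only descends into children that appear in the set.

    // pathSet holds the offsets of tree nodes that lie on the path from the
    // root to at least one redirected node. Paths overlap heavily (the root
    // is on every path), so the set stays small.
    type pathSet map[uint64]struct{}

    // addRedirect records that oldOff has been replaced by newOff, along
    // with the tree nodes on the path that was traversed to reach it.
    func addRedirect(redirs redirects, paths pathSet, path []uint64, oldOff, newOff uint64) {
        redirs[oldOff] = newOff
        for _, off := range path {
            paths[off] = struct{}{}
        }
    }

    // shouldDescend reports whether a child subtree could contain redirects
    // and therefore needs to be visited when flattening.
    func shouldDescend(paths pathSet, childOff uint64) bool {
        _, ok := paths[childOff]
        return ok
    }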

I have this implemented and it seems to work well. I don't have a complete enough system to benchmark yet. It'll be interesting to see what overall impact it has.