Thursday, July 23, 2020

Hash Array Mapped Trie in Go

When you want an immutable persistent (in the functional sense) hash table, a Hash Array Mapped Trie is a common choice.

I recently implemented one to use in the gSuneido database. I procrastinated because I had a vague memory that it had been tricky to implement in jSuneido. But when I finally got around to doing it, it turned out to be relatively easy.

A Hash Array Mapped Trie (HAMT) is a tree of nodes. Each node has an integer bit map (commonly 32 bits) that specifies which slots are occupied, and a variable size array of items. The bitmap means you don't waste space on empty slots like you do in most hash tables.

Each level of the tree consumes 5 bits (for a bitmap of 32 bits) of the hash value. In most implementations, the items are pointers, and can either point to a value, or a child node. The items could be void* in C++ or Object in Java. The equivalent in Go would be interface{} which can hold a value of any type. But Go interface{} is double the size of a simple pointer. And on top of that, it requires casting to get the proper types back out. This approach also doesn't allow embedding small values, they have to be referenced by pointer, so it's not as cache friendly. (An interface can reference a non-pointer type like a strict but internally it’s still using a pointer.)

I decided to take a slightly different approach and use two bitmaps and two slices (Go's variable sized arrays). One bitmap and slice for values, and one bitmap and slice for child pointers. That way the slices could be the proper types, no casting was required, and small values could be embedded rather than referenced by pointer. The tradeoff was that it uses slightly more space.

After finishing an initial implementation, I realized that this design allowed a slight improvement over the usual approach.

At a given node, if you have a value in a slot, and you want to insert another value with the same 5 bits of hash (a collision), you replace the value with a pointer to a child node and put the two values in the new node. If your hash is bad, they can collide again forcing yet more child nodes, poor space usage, and longer searches. (When you run out of hash bits you either have to rehash or use overflow nodes.)

But with separate slices for child pointers and values, you can have both for a given slot. So when you get a collision, you still create a child node, but you don't need to move the old value. So the new node only has one element and there cannot be any further collisions from this insertion, even if the hash values are identical. And it actually simplifies the code and reduces the average search length slightly.

The other variation in my implementation is that the data structure operates in two modes, either immutable (read-only) and safely sharable between threads, or mutable (updatable) and not sharable. Like most persistent immutable tree data structures, making a change requires path copying to the root. That's a lot of overhead to do for every insert or delete. My code tends to operate in two modes, either a single thread is updating, or multiple threads are sharing immutably. By having a mutable mode, the path copying is shared/reduced. e.g. every insert/delete doesn't have to make a new root node. I use a "generation" integer value to know whether a node is mutable. If the node's generation doesn't match the current generation, then we make a copy of the node, and set its generation to the current one so it will be mutable from now on. To switch everything back to immutable (freeze it) we simply increment the current generation. (Don't have to touch the nodes.) 

Until Go gets generics, I'm using the Genny code generator to create typed versions. It's not perfect, but it works well enough.

The code is in Github as usual.

Update 2020-09-15

I discovered a bug in my implementation. If you insert two colliding items, the second will go into a child node. If you then delete the one from the parent node, you end up with a situation my code didn't handle. It assumed that if the value slot of a node was empty, then that hash didn't exist. One solution would be to check child nodes. But it made more sense to put the burden on the less common delete operation. Now when you delete a value from a node, it "pulls up" a value from a child node if there is one. This maintains the invariant that if a value slot of a node is empty, then there are no child nodes of that slot. And by pulling up from the lowest possible child, it also keeps the tree as shallow as possible.

No comments: