I have been thinking about how to improve Suneido's Btree implementation.
Btrees were originally designed for disk storage. But Suneido memory-maps the indexes, and Suneido database servers generally have enough memory that the indexes stay resident. Would it make sense to use a different Btree design for in-memory use?
One possibility is to not store the keys in the Btree at all, since there's no extra disk access needed to reach the data when it's in memory. That's more like how you'd design a normal in-memory data structure, and it makes the Btree nodes much smaller.
But memory is not flat these days; we have multiple levels of caches. And in relative terms, a cache miss is almost as bad as a disk access. So what we really want is to minimize cache misses.
So we have a design continuum. Storing no keys makes the index small, but means a lot of indirection and therefore cache misses. Storing entire keys means no indirection, but big indexes.
My first thought was to compromise and store a small fixed portion of each key in the index. This keeps the index small, makes nodes fixed size, and reduces indirection. For example, storing the first two bytes of each key:
| key | entry |
|------|------|
| bill | bi |
| billy | bi |
| erika | er |
| erin | er |
| erma | er |
But ordered keys usually have common prefixes, so storing the first few bytes does not discriminate very well.
To handle that we can use prefix compression (aka incremental encoding or front compression):
| key | entry |
|------|------|
| bill | 0, bill |
| billy | 4, y |
| erika | 0, erika |
| erin | 3, n |
| erma | 2, ma |
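Here's a rough sketch of this front coding in Go. (A sketch only; the names and representation are just for illustration, not Suneido's actual code.)

```go
package main

import "fmt"

// entry is one front-coded key: the number of leading bytes shared with
// the previous key, plus the bytes stored in the index.
type entry struct {
	npre int    // length of the common prefix with the previous key
	diff string // the bytes stored in the index
}

// commonPrefixLen returns the number of leading bytes a and b share.
func commonPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// frontCode encodes a sorted list of keys incrementally.
func frontCode(keys []string) []entry {
	entries := make([]entry, len(keys))
	prev := ""
	for i, key := range keys {
		n := commonPrefixLen(prev, key)
		entries[i] = entry{npre: n, diff: key[n:]}
		prev = key
	}
	return entries
}

func main() {
	for _, e := range frontCode([]string{"bill", "billy", "erika", "erin", "erma"}) {
		fmt.Println(e.npre, e.diff)
	}
	// prints: 0 bill / 4 y / 0 erika / 3 n / 2 ma
}
```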
Prefix compression means we have to scan each node linearly (i.e. you can't do a binary search). But Btree nodes are reasonably small, and a linear scan is actually the optimal access pattern for memory caches.
We can combine these techniques by only including the minimum information needed to differentiate each key from its predecessor:
| key | entry |
|------|------|
| bill | 0, b |
| billy | 4, y |
| erika | 0, e |
| erin | 3, n |
| erma | 2, m |
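Continuing the sketch from above (same entry type and commonPrefixLen helper), the combined encoding keeps just the single byte past the common prefix. This assumes the keys are distinct and sorted, so the differentiating byte always exists:

```go
// diffCode keeps only the minimum needed to differentiate each key from
// its predecessor: the common prefix length plus a single byte.
// Assumes keys are distinct and sorted, so key[n] always exists.
func diffCode(keys []string) []entry {
	entries := make([]entry, len(keys))
	prev := ""
	for i, key := range keys {
		n := commonPrefixLen(prev, key)
		entries[i] = entry{npre: n, diff: key[n : n+1]}
		prev = key
	}
	return entries
}
```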
Interestingly, this looks a lot like a radix tree, branching on a single character.
This nicely handles both dense (e.g. sequential values) and sparse (widely separated) distributions of keys. However, it does have a worst case: when we go from a sparse key (small common prefix) to a dense key (large common prefix), e.g. bill to billy, or erika to erin, we need more information than we have in the index. One option is indirection to look at the actual data. That's ok if it's rare, but not so good if it's common.
If we give up the fixed-length idea (which is not as important for linear scan anyway), then we can store the extra information necessary for this case (the missing prefix information):
| key | entry |
|------|------|
| bill | 0, b |
| billy | 4, illy |
| erika | 0, e |
| erin | 3, rin |
| erma | 2, m |
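In the sketch, this means tracking how many bytes of the previous key the index already knows, and storing from there up to and including the differentiating byte (again, purely illustrative):

```go
// encode stores, for each key, its common prefix length with the previous
// key, plus the bytes the index doesn't already know about that prefix,
// up to and including the first differentiating byte.
func encode(keys []string) []entry {
	entries := make([]entry, len(keys))
	prev := ""
	known := 0 // how many bytes of prev the index has recorded
	for i, key := range keys {
		n := commonPrefixLen(prev, key)
		from := n
		if known < from {
			from = known // back-fill the prefix bytes the index is missing
		}
		entries[i] = entry{npre: n, diff: key[from : n+1]}
		prev, known = key, n+1 // the index now knows key[:n+1]
	}
	return entries
}
```

Running encode on the example keys produces exactly the entries in the table above.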
With this, only one indirection is required at the end of each search to compare the key to the actual data.
In my test data I averaged less than 1.1 bytes per entry. (Although obviously it can be more, as in this example.)
To search, we iterate sequentially, maintaining the known prefix after each entry and comparing it to what we're looking for. Because we only have partial information, we may search one entry too far. For example, if we're looking for "earl" we should stop at "erika", but since all the index has for "erika" is "e", we'll look one entry past it.
| key | entry | known |
|------|------|------|
| bill | 0, b | b |
| billy | 4, illy | billy |
| erika | 0, e | e |
| erin | 3, rin | erin |
| erma | 2, m | erm |
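As a sketch (continuing the Go example), the search rebuilds the known prefix as it scans and stops at the first entry whose known prefix is greater than or equal to the target, possibly one entry past the right one:

```go
// search scans a node's entries, rebuilding the known prefix of each key,
// and returns the position of the first entry whose known prefix is >= the
// target. Because the prefixes are partial it can overshoot by one, so the
// caller does a single indirection to the actual data to adjust.
func search(entries []entry, target string) int {
	known := ""
	for i, e := range entries {
		n := e.npre
		if n > len(known) {
			n = len(known) // we only know part of the common prefix
		}
		known = known[:n] + e.diff
		if known >= target {
			return i
		}
	}
	return len(entries)
}
```

With the example entries, search(entries, "earl") stops at erin, one past erika, matching the walk-through above.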
Inserting and deleting affect the encoding of the following values, so you need to re-encode entries until you're back in sync (e.g. until you hit a 0-prefix entry). This requires looking at the actual data, but it's only for updates, and typically not that many entries.
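A sketch of the update path, assuming the actual keys are reachable through indirection (here just passed in as a slice): one way to detect "back in sync" is to stop as soon as a re-encoded entry matches what's already stored, since everything after that is unaffected:

```go
// reencodeFrom re-derives entries from position i onward, after an insert
// or delete at i. It stops as soon as a re-encoded entry matches what is
// already stored, since everything after that is back in sync.
// Assumes keys are distinct and sorted, as in encode above.
func reencodeFrom(entries []entry, keys []string, i int) {
	prev := ""
	known := 0
	if i > 0 {
		prev = keys[i-1]
		known = entries[i-1].npre + 1 // the index knows prev[:npre+1]
	}
	for ; i < len(entries); i++ {
		n := commonPrefixLen(prev, keys[i])
		from := n
		if known < from {
			from = known
		}
		e := entry{npre: n, diff: keys[i][from : n+1]}
		if e == entries[i] {
			return // back in sync; later entries are unaffected
		}
		entries[i] = e
		prev, known = keys[i], n+1
	}
}
```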
I've prototyped the design and it seems to perform reasonably, but I haven't implemented enough to actually benchmark it against Suneido's current implementation (which stores entire, uncompressed keys).
Part of what triggered this was the paper "Design continuums and the path toward self-designing key-value stores that know and learn".