Wednesday, March 26, 2014

B-tree Range Estimation

One of the trickier issues to deal with in Suneido is when the database query optimizer picks what seems like a poor strategy. One of the difficulties is that these issues are usually data dependant - so you need to do the debugging on a large database.

Suneido's query optimizer is fairly straightforward, but even on relatively simple queries there are a lot of possible strategies and it can be quite hard to understand the end results.

We had an issue recently with jSuneido where the sort could be satisfied by an index, but the strategy it chose was reading by a different index and then sorting (with a temporary index).

It wasn't hard to discover that it thought reading by the other index was enough faster to justify the temporary index. The problem was it was wrong.

It came down to the the btree range estimation. It was estimating the same key range quite differently for the two indexes, where they should have been exactly the same.

One of the important choices in query optimization is which indexes to use. To help with this, Suneido uses the index B-trees to estimate how "selective" a given range of index keys is, i.e. what fraction of the records does a range include. Because the query optimizer looks at lots of possible strategies (combinatorial explosion) you want this estimation to be fast. You don't want to read the entire index.

Strangely, I used the identical approach on both cSuneido and jSuneido, but they ended up choosing different strategies for the same data.

The code was doing a lookup on the "from" and "to" keys, estimating the position of the key from the path down the tree. Because a single path down the tree only looks at a few nodes out of a potentially large number, it has to assume that the rest of the tree is balanced and node sizes are "average".

It didn't take long to discover a rather blatent bug in both jSuneido and cSuneido. I was averaging the node sizes of the "from" search and the "to" search in order to get a better idea of the average node size on each level. But I wasn't adjusting the position within the node. For example, if the "from" search went through position 5 of a node of size 10, and the "to" search went through position 25 of a node of size 30, when I average the node sizes it now thought the "to" search went through position 25 of a node of size 20 - obviously wrong. On the positive side, if the node sizes don't vary too much then it doesn't have a big affect.

One reason why this bug hadn't surfaced yet is that the exact estimates aren't that important since the optimizer is just comparing whether different indexes are better or worse, not what the absolute numbers are.

That bug was easy enough to fix, and I came up with a much simpler way to write the code, but I still wasn't getting very accurate results. I started running the numbers through a spreadsheet to try to see what was wrong with my calculations.

After banging my head against the wall for a long time I finally realized that the problem was that the trees were actually not very balanced. A B-tree only guarantees that the tree is balanced in terms of height i.e. all the branches are the same length. It does not guarantee that each branch of the tree leads to the same number of keys. Most B-tree allow nodes to vary between half empty and full. When a node gets full it is split into two half full nodes.

However, if you add keys in order, this leads to most nodes only being half full. So Suneido uses a common optimization to split unevenly if a node gets full by adding on the end. In the "worst" case (in terms of balance) this leads to the root of the tree having a left branch that is full, and a right branch that has a single key. That is exactly what my data looked like in the "bad" case, but I hadn't realized the significance.

Side note - this optimization applies more to B-trees that use fixed node sizes, like cSuneido. jSuneido's append-only database uses variable sized nodes, so space wastage isn't an issue. However, you still want to keep the branching high to minimize the height of the tree.

My solution was for the estimation code to look at the sizes of all the nodes on the first and second level of the tree (root and one level below), rather than assume they were all the same average size. This doesn't handle the tree being unbalanced below there, but the lower levels of the tree have less of an effect on the final result. Finally I got some results that were at least in the right ballpark, and good enough so that jSuneido now chooses the "correct" strategy.

Maybe I've been spoiled by Java, but the unsafe nature of C++ scares me. I made a stupid bug in the cSuneido changes that wrote one past the end of an array. This passed all the unit tests, but gave some strange random errors later on. In this case it wasn't hard to find the problem, but it makes me wonder.

As usual the code is in version control on SourceForge.

No comments: