Saturday, August 18, 2012

Btree Range Estimation

Suneido's database query optimization needs to select which indexes to use. Part of this is estimating what portion of a table is spanned by a given range of index keys.

My assumption in the past was that a crude approximation was sufficient, since usually the best index is obvious. So the estimation would only look at one or two of the top levels of the btree.

But recently, as a result of some performance work on our applications, we've found a number of cases where this isn't true, where you need a more accurate estimate in order to choose the best index. This is usually with large tables, e.g. millions of records. With big tables like this, only looking at the top one or two levels of the tree is not sufficient to differentiate e.g. an index range of a few records versus an index range of thousands of records. In this situation, the "wrong" choice can make quite a difference to performance.

One of the sources of inaccuracy is from estimating the size of a Btree level. So my first idea to make it more accurate was to look at all the nodes on each of the top levels in order to get an exact count. I implemented this, and it was an improvement, but the downside was that it could end up reading a lot of nodes because Btrees have a large fan-out.

So I went back to my old approach of doing searches for the beginning and ending values of the range, and estimated the level sizes and therefore the portion of the tree between them. I made a better estimate of the level size by basing it on the size of the nodes passed through on the search. Since this approach only does two searches down the tree (not reading whole levels) it's reasonable to go further down the tree to get a more accurate estimate. In the end, I found some cases required searching all the way down the tree to the leaves. This is ok, since Btrees are shallow due to their large fanout. (Even going all the way to the leaves is still not 100% accurate because you're still only approximating the level sizes on the way down.)

This was better, but there was still an annoying flaw - because node sizes vary, if a search happened to pass through smaller nodes, it would estimate different level sizes than a search that happened to pass through larger nodes. Different level size estimates would result in poorer estimates. Differences in level size estimates near the top of the tree were especially bad since their effect was multiplied. For small ranges (e.g. "close" values) this wasn't a problem because the two searches would pass through the same nodes at the top levels of the tree and therefore estimate the same level sizes. But larger ranges could be less accurate.

The improvement I came up with was to use the average of the search's estimates of level sizes. Not only does this give a better estimate, but it also means that the two searches use consistent level size estimates, which is just as important as the estimates being accurate. It was a little more awkward for coding since the averaging had to be done in the middle of the searches. I originally wrote the code as recursive, but ended up with an iterative approach since it was shorter and simpler. (Not because it might be faster.)

This may be a known technique, but I haven't encountered it so I thought I'd share it. If anyone is aware of better approaches, let me know.

jSuneido btree source (scroll to the end for the rangefrac methods)

No comments: