Recently I was looking for a good way to "characterize" the values in a column of a database table, to help Suneido's query optimizer choose the best index to use. Originally I was thinking of some kind of histogram. While "discussing" alternatives with Gemini, KLL Quantile Sketches came up. Histograms are usually equi-width - each bin covers the same range of values, e.g. one unit out of a 0 to 10 range. Quantiles are usually equi-depth - each bin holds the same fraction of the values, e.g. 1%. With database data you don't usually know the range of values, and it's not easy to divide values like strings into equal ranges, so quantiles made more sense.
The name "KLL" comes from the authors (Karnin, Lang, and Liberty) of the original paper Optimal Quantile Approximation in Streams.
I have to admit I struggled with understanding KLL Sketches. At first I got Gemini to explain them to me and show me a sample implementation. I thought I understood it until I tried to implement it. Since I was off track, I quickly led Gemini astray and ended up flailing around for longer than I should have. I finally gave up on AI and went back to the original paper. The bulk of the paper is math theory proving that the algorithm worked within specific error bounds. I didn't really follow the math and I didn't really care about the proof - I was willing to take their word for it. I just wanted to know how to implement it. In the end, the key part was a couple of paragraphs of the paper. The basic idea is pretty simple.
Unfortunately, I didn't find a lot of material on KLL sketches (or its precursor MRL sketches). There isn't even a Wikipedia page! There is an Apache DataSketches library that includes KLL sketches. I didn't look at their code. Trying to figure out basic ideas from a full implementation is hard.
It's easiest to understand by working through several increasingly optimized versions. The first version is just several buffers that each hold k values. k determines the accuracy; a common value is 200, which gives about 1.65% accuracy. The buffers are known as "compactors".
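In Go, the basic structure might look something like this (a minimal, illustrative sketch using int values; the names aren't from my actual implementation):

```go
package kll

const k = 200 // buffer size; determines accuracy (~1.65% for k = 200)

// A sketch is a stack of "compactors". compactors[0] receives incoming
// values; each higher level summarizes progressively more of the data.
type sketch struct {
	compactors [][]int // int values, for simplicity
}

func newSketch() *sketch {
	return &sketch{compactors: [][]int{make([]int, 0, k)}}
}
```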
Incoming values are added to the first compactor. When it gets full, you sort the values, then take them in consecutive pairs and keep either the even-positioned or the odd-positioned ones, discarding the others. Even or odd is chosen randomly, or randomly the first time and alternating thereafter. You end up with half as many values, which you promote to the next compactor. In turn, when the next compactor fills up, you compact it and promote half its values. You add compactors as necessary. Each successive compactor represents/summarizes twice as many values, i.e. its weight doubles.
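Continuing the illustrative sketch above (and assuming "sort" and "math/rand" are imported), the add/compact cycle could look roughly like this, using the purely random even/odd choice:

```go
// Add inserts one value, compacting any level that fills up.
func (s *sketch) Add(v int) {
	s.compactors[0] = append(s.compactors[0], v)
	for level := 0; level < len(s.compactors); level++ {
		if len(s.compactors[level]) >= k {
			s.compact(level)
		}
	}
}

// compact sorts a full level, keeps either the even or the odd positions,
// and promotes the survivors to the next level, implicitly doubling their weight.
func (s *sketch) compact(level int) {
	c := s.compactors[level]
	sort.Ints(c)
	if level+1 == len(s.compactors) {
		s.compactors = append(s.compactors, make([]int, 0, k)) // add a level
	}
	offset := rand.Intn(2) // even or odd positions, chosen randomly
	for i := offset; i < len(c); i += 2 {
		s.compactors[level+1] = append(s.compactors[level+1], c[i])
	}
	s.compactors[level] = c[:0] // discard the rest
}
```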
The first optimization is that as you add levels, the levels closest to the input can be smaller without losing accuracy. The newer values are less important since they represent less of the data. This saves memory and reduces the work of compacting. The size factor is called c and the usual value is 2/3.
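The target sizes work out to something like this (a sketch of the idea, assuming c = 2/3 and a minimum size of 2, with "math" imported):

```go
// capacity returns the target size of a compactor: the highest (oldest)
// level gets the full k, and each level closer to the input is smaller
// by a factor of c, down to a minimum of 2.
func capacity(level, numLevels int) int {
	const c = 2.0 / 3.0
	size := float64(k) * math.Pow(c, float64(numLevels-1-level))
	if size < 2 {
		return 2
	}
	return int(math.Ceil(size))
}
```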
In terms of the algorithm, you add a new level of size k and shrink the existing levels. But in terms of implementation, you can't really shrink a memory allocation. You can take a sub-slice but the full size is still there behind the scenes. Instead, I added a new smaller compactor on the input end. The problem with that is that the pre-existing compactors now have the wrong weights. To fix that, I compact each of those previous levels in place, which has the effect of doubling their weight. This approach isn't in the original paper (which is light on implementation) but it seems to retain the correct behavior.
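A rough sketch of how that might look; in this scheme the growth step takes the place of the "add a new top level" part of the earlier version, and weights stay implicit in the level number:

```go
// grow prepends a new, smaller compactor at the input end.
// The existing compactors all shift up one level, so their implicit
// weight doubles; compacting each of them in place (halving their
// contents) keeps them consistent with that new weight.
func (s *sketch) grow() {
	newFirst := make([]int, 0, capacity(0, len(s.compactors)+1))
	s.compactors = append([][]int{newFirst}, s.compactors...)
	for level := 1; level < len(s.compactors); level++ {
		s.compactInPlace(level)
	}
}

// compactInPlace halves a compactor without promoting the survivors,
// leaving them at the same level.
func (s *sketch) compactInPlace(level int) {
	c := s.compactors[level]
	sort.Ints(c)
	offset := rand.Intn(2)
	kept := c[:0]
	for i := offset; i < len(c); i += 2 {
		kept = append(kept, c[i])
	}
	s.compactors[level] = kept
}
```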
As the compactors get smaller, eventually you reach the minimum size of 2. The next optimization is that all the levels of size 2 can be combined into a "sampler", which randomly selects one value out of every 2^n, where n is the number of size-2 levels replaced by the sampler.
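One way to implement the sampler is reservoir sampling with a reservoir of one over each block of 2^n inputs (again just an illustrative sketch, not the actual code):

```go
// sampler stands in for the lowest, size-2 levels: out of every block of
// `weight` consecutive inputs it keeps one value chosen uniformly at
// random, and that value carries the whole block's weight.
type sampler struct {
	weight int // 2^n, where n is the number of size-2 levels replaced
	seen   int // values seen in the current block
	held   int // current candidate from this block
}

// add feeds one value through the sampler; it returns (value, true)
// when a block completes and the chosen value should be promoted.
func (sm *sampler) add(v int) (int, bool) {
	sm.seen++
	if rand.Intn(sm.seen) == 0 { // each value in the block has an equal chance
		sm.held = v
	}
	if sm.seen == sm.weight {
		sm.seen = 0
		return sm.held, true
	}
	return 0, false
}
```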
Querying the result is straightforward - you take the values with their implicit weights and sort them. Then the cumulative weight gives you the quantiles, e.g. 50% of the total weight gives you the median.
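For the basic version, where the weight at level i is simply 2^i, a query might look something like this (illustrative only; the sampler and the in-place compaction would shift the weights):

```go
type weighted struct {
	value  int
	weight int
}

// quantile returns an estimate of the value at fraction q (0..1) of the
// data, e.g. q = 0.5 estimates the median.
func (s *sketch) quantile(q float64) int {
	var all []weighted
	total := 0
	for level, c := range s.compactors {
		w := 1 << level // implicit weight doubles at each level
		for _, v := range c {
			all = append(all, weighted{v, w})
			total += w
		}
	}
	if total == 0 {
		return 0 // empty sketch
	}
	sort.Slice(all, func(i, j int) bool { return all[i].value < all[j].value })
	target := q * float64(total)
	cum := 0.0
	for _, wv := range all {
		cum += float64(wv.weight)
		if cum >= target {
			return wv.value
		}
	}
	return all[len(all)-1].value
}
```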
There are further optimizations that I didn't implement. They seem to be more theoretical improvements than practical ones. My simple implementation (< 200 lines of Go) seems to give good performance with low memory requirements (> 100,000,000 values/second, and zero allocation once it reaches the maximum size of about 600 values).