The Software Life

Sunday, October 19, 2025

KLL Sketch Quantiles

Recently I was looking for a good way to "characterize" the values in a column of a database table, to help Suneido's query optimizer choose the best index to use. Originally I was thinking of some kind of histogram. While "discussing" alternatives with Gemini, KLL Quantile Sketches came up. Histograms are usually equi-width - each bin represents the same range of values e.g. 0 to 1 out of 10. Quantiles are usually equi-depth - each bin represents the same frequency of values e.g. 1%. With database data you don't usually know the range of values and it's not easy to divide values like strings into equal ranges so quantiles made more sense.

The name "KLL" comes from the authors (Karnin, Lang, and Liberty) of the original paper Optimal Quantile Approximation in Streams.

I have to admit I struggled with understanding KLL Sketches. At first I got Gemini to explain them to me and show me a sample implementation. I thought I understood it until I tried to implement it. Since I was off track, I quickly led Gemini astray and ended up flailing around for longer than I should have. I finally gave up on AI and went back to the original paper. The bulk of the paper is math theory proving that the algorithm works within specific error bounds. I didn't really follow the math and I didn't really care about the proof - I was willing to take their word for it. I just wanted to know how to implement it. In the end, the key part was a couple of paragraphs of the paper. The basic idea is pretty simple.

Unfortunately, I didn't find a lot of material on KLL sketches (or its precursor MRL sketches). There isn't even a Wikipedia page! There is an Apache DataSketches library that includes KLL sketches. I didn't look at their code. Trying to figure out basic ideas from a full implementation is hard.

It's easiest to understand by working through several increasingly optimized versions. The first version is just several buffers that each hold k values. k determines the accuracy; a common value is 200, which gives an error of about 1.65% in the rank of the returned values. The buffers are known as "compactors".

Incoming values are added to the first compactor. When it gets full, you sort the values, and then taking them in pairs of consecutive values you pick either the even or the odd ones and discard the others. Even or odd is chosen randomly, or randomly first time and alternating thereafter. You end up with half as many values, which you promote to the next compactor. In turn, when the next compactor fills up, you compact it and promote half the values. You add compactors as necessary. Each compactor represents/summarizes twice as many values i.e. its weight doubles.
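Here is a minimal sketch of that first, unoptimized version. It is not my actual code; the names (sketch, Insert, compact, totalWeight) are made up for illustration, it always picks even or odd randomly rather than alternating, and it assumes k is even so a full compactor always halves exactly.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// sketch is the unoptimized version: every level holds up to k values,
// and a value stored at level i stands for 2^i input values.
// k is assumed to be even so that a full level always halves exactly.
type sketch struct {
	k      int
	levels [][]float64
}

func newSketch(k int) *sketch {
	return &sketch{k: k, levels: make([][]float64, 1)}
}

// Insert adds v to level 0 and compacts any level that has filled up.
func (s *sketch) Insert(v float64) {
	s.levels[0] = append(s.levels[0], v)
	for i := 0; i < len(s.levels) && len(s.levels[i]) >= s.k; i++ {
		s.compact(i)
	}
}

// compact sorts level i, keeps every second value starting at a random
// offset, promotes the survivors to level i+1, and empties level i.
func (s *sketch) compact(i int) {
	if i+1 == len(s.levels) {
		s.levels = append(s.levels, nil) // add a compactor as necessary
	}
	vals := s.levels[i]
	sort.Float64s(vals)
	for j := rand.Intn(2); j < len(vals); j += 2 {
		s.levels[i+1] = append(s.levels[i+1], vals[j])
	}
	s.levels[i] = vals[:0]
}

// totalWeight sums the implicit weights; it should equal the number of inserts.
func (s *sketch) totalWeight() int {
	total := 0
	for i, vals := range s.levels {
		total += len(vals) << i
	}
	return total
}

func main() {
	s := newSketch(8)
	for i := 0; i < 1000; i++ {
		s.Insert(float64(i))
	}
	fmt.Println("levels:", len(s.levels), "total weight:", s.totalWeight())
}
```

The total weight is a handy sanity check: compacting swaps k values of weight w for k/2 values of weight 2w, so the sum should always match the number of values inserted.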

The first optimization is that as you add levels, the levels closest to the input can be smaller without losing accuracy. The newer values are less important since they represent less of the data. This saves memory and reduces the work of compacting. The size factor is called c and the usual value is 2/3.
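As a rough illustration of the sizes involved, here is one way to compute the capacities. This is a sketch under my own assumptions (round up, never go below 2, and the function name capacities is made up); the paper's exact formula may differ in the details.

```go
package main

import (
	"fmt"
	"math"
)

// capacities returns a capacity for each level, index 0 being the level
// closest to the input. The top level gets k, each level below it is c times
// the size of the one above, and nothing shrinks below 2.
func capacities(k int, c float64, levels int) []int {
	caps := make([]int, levels)
	for h := range caps {
		depth := levels - 1 - h // distance below the top level
		n := int(math.Ceil(float64(k) * math.Pow(c, float64(depth))))
		if n < 2 {
			n = 2
		}
		caps[h] = n
	}
	return caps
}

func main() {
	fmt.Println(capacities(200, 2.0/3, 6))
}
```

Ignoring the rounding and the minimum of 2, the sizes form a geometric series k(1 + c + c^2 + ...) which approaches k/(1-c). With k = 200 and c = 2/3 that is 600, which lines up with the maximum size of about 600 values mentioned below.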

In terms of the algorithm, you add a new level of size k and shrink the existing levels. But in terms of implementation, you can't really shrink a memory allocation. You can take a sub-slice but the full size is still there behind the scenes. Instead, I added a new smaller compactor on the input end. The problem with that is that the pre-existing compactors now have the wrong weights. To fix that, I compact each of those previous levels in place, which has the effect of doubling their weight. This approach isn't in the original paper (which is light on implementation) but it seems to retain the correct behavior.

As the compactors get smaller, eventually you reach the minimum size of 2. The next optimization is that all the levels of size 2 can be combined into a "sampler" which randomly selects one value from every 2^n where n is the number of size 2 levels replaced by the sampler.
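A sampler along those lines might look something like this. It's a sketch, not my actual code: the sampler type and its fields are made up, and it uses the standard "replace the candidate with probability 1/i" trick to pick one value uniformly from each block of 2^n without buffering the block.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampler passes along one uniformly chosen value out of every 2^n inputs.
type sampler struct {
	n    uint    // number of size-2 levels the sampler replaces
	seen int     // values seen so far in the current block
	pick float64 // current candidate for this block
}

// Add feeds one value in. Once per 2^n calls it returns the chosen value
// and true; the caller would promote that value with weight 2^n.
func (s *sampler) Add(v float64) (float64, bool) {
	s.seen++
	// The i-th value of a block replaces the candidate with probability 1/i,
	// which leaves every value in the block equally likely to be the pick.
	if rand.Intn(s.seen) == 0 {
		s.pick = v
	}
	if s.seen < 1<<s.n {
		return 0, false
	}
	s.seen = 0
	return s.pick, true
}

func main() {
	s := &sampler{n: 3} // one value out of every 8
	for i := 0; i < 32; i++ {
		if v, ok := s.Add(float64(i)); ok {
			fmt.Println("promote", v, "with weight", 1<<s.n)
		}
	}
}
```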

Querying the result is straightforward - you take the values with their implicit weights and sort them. Then the cumulative weight gives you the quantiles. e.g. 50% of the weight gives you the median.
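For example, a query could be sketched like this, assuming the values and weights have already been gathered from the compactors into one slice (the weighted type and the quantile function are names I made up for the example):

```go
package main

import (
	"fmt"
	"sort"
)

// weighted is a retained value plus the number of inputs it stands for.
type weighted struct {
	val    float64
	weight int
}

// quantile returns the first value whose cumulative weight reaches the
// fraction q (0 to 1) of the total weight. It returns false if there are
// no items.
func quantile(items []weighted, q float64) (float64, bool) {
	if len(items) == 0 {
		return 0, false
	}
	sorted := append([]weighted(nil), items...) // don't reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].val < sorted[j].val })
	total := 0
	for _, it := range sorted {
		total += it.weight
	}
	target := q * float64(total)
	cum := 0
	for _, it := range sorted {
		cum += it.weight
		if float64(cum) >= target {
			return it.val, true
		}
	}
	return sorted[len(sorted)-1].val, true
}

func main() {
	items := []weighted{{40, 4}, {10, 1}, {30, 2}, {20, 1}}
	median, _ := quantile(items, 0.5)
	fmt.Println("median:", median)
}
```

A real implementation would probably sort once and then answer many queries from the cumulative weights rather than re-sorting each time.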

There are further optimizations that I didn't implement. They seem more theoretical improvements than practical. My simple implementation (< 200 lines of Go) seems to give good performance with low memory requirements. (> 100,000,000 values/second, zero allocation once it reaches the maximum size of about 600 values)

Sunday, September 21, 2025

AI Coding Tools

A few people have asked about what tools I'm using for AI coding so I figured I'd snapshot what I'm using right now. Given how fast AI is changing it will probably be different six months from now. I haven't tested different tools extensively so don't take this as expert advice, just one data point.

For tab completion I've been using Amp Tab, the tab completion part of Amp. Amp is Sourcegraph's second generation AI coding tool (after Cody). Amp itself is a little too aggressive for my style of programming. I prefer to review changes closely before applying them. Currently Amp Tab is experimental and is still free. My understanding is that it's similar to Cursor, which I haven't tried because I prefer to use standard VSCodium. Even Amp Tab can be a little aggressive for me sometimes. I have to be careful about hitting tab to indent a line since it's liable to go and make changes to my code.

For investigating, reviewing, or writing code I've been using Cline as a VSCode extension. Cline is open source and lets you pick your AI model and provider. I've been using OpenRouter so I can try out different models. You can also use Cline as the provider which I might have done if I'd realized it before I signed up with OpenRouter.

Cline has a "planning" mode which is basically read-only, and an "act" mode where it makes code changes. Even in act mode I require approval for code changes.

As far as models, I started out with Claude Sonnet 3.5 and progressed to 3.7 and now 4. I've also tried a few others like GPT-5 and grok-code-fast-1. There isn't a huge difference. They can all do well or mess up badly, but I tend to go back to Claude Sonnet 4 even though it's one of the more expensive ones. The last few months I've been spending about $50 per month on model usage. It's worth it for me, as much for the learning experience as for the actual code produced. If you didn't want to (or couldn't afford to) spend money on it, there are usually free or cheap options.

For general research I've been using Gemini 2.5 Pro, mostly because it's included with our company Gmail/Google accounts. It works well to research algorithms or data structures.

The big question these days is whether programmers are actually more productive using AI. There have been studies that show that although programmers feel they're more productive, they're actually not. It sounds a bit like multi-tasking. I wouldn't say it's made a huge difference to my productivity. Some types of tasks go quicker, but for others AI can become a big time waster. I would say the quality of my code might be slightly higher from having more tests and more reviews.

Sunday, September 07, 2025

A Priority Queue with sync.Cond

The good news is Claude (Sonnet 4) found a bug for me.

The bad news is that Claude wrote the code that had the bug.

It's not all Claude's fault though. Garbage in, garbage out. When I searched on the internet for an example, I found a post with the same bug. I left a comment explaining the bug; hopefully it'll get corrected.

The history was somewhat amusing as well: (paraphrased)

Me: we could write a priority queue

Claude: that would be complex

Me: but we could encapsulate the complexity in the queue package

Claude: good idea

Me: write a priority queue (followed by the requirements)

Claude: no problem, here you go

The code for a concurrent producer-consumer queue is not that complicated, even with specific priority and ordering constraints.

I also got Claude to write tests and benchmarks and everything seemed to work well. The benchmarks showed it had comparable performance to a Go channel. I changed my code to use the new queue (replacing a simple Go channel) and it worked fine. It even appeared to show the hoped for performance improvements. I got Claude to write more benchmarks to measure performance better. But the benchmarks would hang sometimes. I tried to get Claude to fix it but it just thrashed around rewriting the benchmark different ways. I suspected a deadlock in the priority queue so I wrote a better test for that. It appeared to work fine, until I increased the number of threads to 16 or 32, then it would hang consistently with a deadlock. I gave the simple failing test to Claude and it immediately spotted the bug.

The "obvious" approach is to use a single condition variable (sync.Cond) for the queue. Put and Get wait on the Cond and then signal it. It seems straightforward, and under low concurrency it appears to work. But under high concurrency, it can deadlock. The problem is that Signal only wakes up one goroutine, and with a single Cond it can be the wrong kind. If the queue is full and it wakes up a producer, or the queue is empty and it wakes up a consumer, that goroutine just goes back to waiting and the wakeup is lost. Eventually every goroutine can end up waiting with no one left to signal.

One solution is to use Broadcast to wake up all waiters but that can be inefficient with many goroutines. The better solution is to use two sync.Cond, one for not-full and one for not-empty. Put waits on not-full and signals not-empty. Get waits on not-empty and signals not-full. I should have read the Wikipedia article instead of trusting AI-generated code.
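To make the two-Cond pattern concrete, here is a stripped down sketch. It is a plain bounded FIFO of ints, not the actual priority queue, and the names (Queue, NewQueue, run) are just for the example. Both Conds share the one mutex, and the waits are in loops so the condition is re-checked after every wakeup.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Queue is a bounded FIFO with separate conditions for "not full" and "not empty".
type Queue struct {
	mu       sync.Mutex
	notFull  *sync.Cond // producers wait here
	notEmpty *sync.Cond // consumers wait here
	items    []int
	capacity int
}

func NewQueue(capacity int) *Queue {
	q := &Queue{capacity: capacity}
	q.notFull = sync.NewCond(&q.mu)
	q.notEmpty = sync.NewCond(&q.mu)
	return q
}

// Put blocks while the queue is full, then wakes one waiting consumer.
func (q *Queue) Put(v int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == q.capacity {
		q.notFull.Wait()
	}
	q.items = append(q.items, v)
	q.notEmpty.Signal()
}

// Get blocks while the queue is empty, then wakes one waiting producer.
func (q *Queue) Get() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.notEmpty.Wait()
	}
	v := q.items[0]
	q.items = q.items[1:]
	q.notFull.Signal()
	return v
}

// run has each producer Put 1..perProducer and returns the sum of everything
// the consumers received. producers*perProducer must divide evenly among
// the consumers.
func run(producers, consumers, perProducer int) int64 {
	q := NewQueue(4)
	var total atomic.Int64
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 1; i <= perProducer; i++ {
				q.Put(i)
			}
		}()
	}
	perConsumer := producers * perProducer / consumers
	for c := 0; c < consumers; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perConsumer; i++ {
				total.Add(int64(q.Get()))
			}
		}()
	}
	wg.Wait()
	return total.Load()
}

func main() {
	fmt.Println("sum received:", run(16, 16, 1000))
}
```

The run function is the same kind of test that exposed the bug for me: a small capacity and plenty of goroutines on both sides. With a single Cond and Signal, a test like that is the sort of thing that hangs.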

Go's sync.Cond has a bit of a funny story in itself. There is no example for it in the Go documentation. When someone suggested adding an example they were told that sync.Cond is tricky and therefore shouldn't be used and they didn't want to add an example because that would "encourage" people to use it. So it becomes a self-fulfilling prophecy that people misuse it. At one point there was even a movement to remove sync.Cond from Go. That seems a little odd to me, since condition variables are a standard concurrency concept. I think a few examples would prevent almost all the misuse.

Here's the priority queue code if you're interested.

Friday, August 01, 2025

OverIter Cur Deleted

gSuneido has had a long-standing assertion failure that happened maybe once a month, which equates to about once per million user hours. We tried multiple times but were never able to recreate it.

I’ll try to explain the scenario. Indexes are stored in the database file as btrees. While the database is running, index updates are buffered. An Overlay consists of the base btree plus layers of updates. When a transaction commits it adds an ixbuf layer to the Overlays that it updated. Background threads merge the layers and update the btree. OverIter handles iterating over an Overlay. It merges iterators for each layer. This is fundamentally straightforward but as always, the devil is in the details. The ixbuf layers from the transactions include updates and deletes as well as additions. So the merging has to combine these. For example, combining an add followed by several updates. The final piece of the puzzle is that concurrent modifications can occur during iteration.

"OverIter Cur deleted" means that the current value of the iterator is marked as deleted. This should never happen. The iterator is designed to skip deleted records.

The error occurred again recently and I decided to see what Claude (Sonnet 4) could do with it. It kept saying "now I see the problem", but when I'd tell it to write a test to make it happen it couldn't. It became obvious it wasn't going to spot the bug by just looking at the code so I got it writing tests. It wrote a lot of them, and they all passed. That was actually kind of nice since it meant the code was fairly robust. I wouldn't have been surprised if other bugs had showed up with the intense testing.

Finally, it wrote a random stress test that caused the assertion failure. I was cautiously optimistic. Sadly, it turned out it was generating invalid test data and that was what was triggering the assertion failure. Once I corrected the data generation, then the test passed. It was possible that the bug was actually leading to bad data which was leading to the assertion failure but that seemed unlikely since bad data would cause other problems.

Back to the drawing board. I continued extending the test to cover more scenarios. Eventually I managed to recreate the error with legitimate data and actions. After that it was a matter of extracting one failing case from the random test. Claude added print statements, looked at the result, and wrote a test for that specific sequence.

Once I had a relatively simple failing test, Claude claimed to find the bug right away. I was skeptical since it had claimed to find the bug many times already. I worked on extending the test to cover more of the scenarios. Sure enough, it started failing again. Claude came up with another fix. But the test kept failing. The proposed changes would fix certain cases but break other cases. No combination of the “fixes” solved all the problems.

Eventually Claude proposed rewriting one of the key functions. I was even more skeptical but the code looked reasonable and it was simpler and clearer than the old code. It wasn't very efficient but it wasn't hard to tell Claude how to optimize it. But it still didn’t fix all the problems. I dug into the other “fix” and realized it was on the right track but wasn’t quite complete. A little back and forth came up with a solution here as well. And finally I had a version of the code that passed all the tests.

I am cautiously optimistic. By this point I think the tests are fairly comprehensive. And I understand the fixes and they make sense.

As I've come to expect, my results with Claude were mixed. It definitely did some of the grunt work of writing tests. And I have to give it credit for giving up on some of my old code and writing a simpler, more correct version. But it also came up with several incorrect, or at least incomplete, fixes.

I started this thinking I'd spend a few hours playing with AI. It ended up being my main project for a week. Even though it was rare enough that it wasn't really a problem, I'm glad I finally fixed it. (I hope!)

Wednesday, July 16, 2025

Partnering with AI

I'm not a leading edge AI user. I have used Tabnine for the last few years, but mostly as a smarter auto-complete. Handy and time saving but nothing earth shaking. Mostly I was a skeptic. But using Cline with Claude Sonnet 4 lately has definitely changed my attitude.

One of the first things I noticed was the "rubber ducky" effect. Just the effort of explaining a problem often helps you think of a solution. Explaining it to another person (or an AI) is even better because they ask questions and offer solutions. Claude often says "I see the problem" when it's totally off track. That's a little annoying, but it's no worse than what a person would do. And often those wacky ideas can spark some new ideas or avenues to explore.

A more surprising aspect is that it feels like I'm collaborating. I've been solo programming for most of my (long) career, but the last five years working remotely has exaggerated that. No one reads the majority of my code or comments on it. It's not that I'm socializing with the AI, but suddenly I have someone I can "talk" to about it. What does this do? Do we need that? What if we did xyz? Isn't that dangerous? It might not "know" the code like a true collaborator, but it's a very close approximation. And it never hurts to be exposed to other ways of doing things.

I don't vibe code with AI. I want the end code to be as good as I can make it. And with current AI that means keeping a very close watch on what it's doing. Like a beginning programmer, it seldom gets a good solution on the first try. It often requires reviewing changes closely and not being afraid to reject them. Often, you have to steer it to an elegant solution. I'm working on heavily used production code, not a throwaway toy project.

The agent approach seems to be a fundamental improvement. It's impressive to watch Claude add debugging statements, run the code, "look" at the output, and repeat until it tracks down the issue. At the same time, it can also make quite blatant mistakes so you need to be watching closely.

Honestly, I miss writing code when I'm working with AI. It's like pair programming with someone who never gives you the keyboard. Of course, there's nothing stopping me from writing some of the code myself, and I do. But when Claude can spit it out faster than I can type, it seems pointless not to let it.

One of Claude's quirks is that it has a positive tone. It's always saying "good idea" or "you're absolutely right". I found that a little goofy at first, but once I got used to it I find I like it. Who doesn't like positive feedback? I know it's meaningless, but I find when I use less positive models, I miss it.

Monday, July 07, 2025

Super Instructions

The overhead of an interpreter depends on the size of the instructions compared to the dispatching. So one of the ways to reduce the overhead is to increase the size of the instructions. One way to do that is “super instructions” - combining several instructions into one. This caught my interest and I decided to see whether it would help with Suneido. The Suneido byte code instructions themselves are mostly simple, which means a higher overhead, but most of the time is spent in built-in functions like database queries, which means a lower overhead.

The first step was to determine if there were combinations of instructions that were common enough to justify optimizing them. I added some quick temporary instrumentation to the interpreter. First I looked at sequences of two instructions. The most common accounted for roughly 7% of the instructions. The top 10 made up 40% of the instructions. That was actually higher than I expected. I also looked at sequences of 3 and 4 instructions. The most common was 3% so I decided to keep it simple and stick to 2 instruction sequences.
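The counting itself is simple. Here is roughly the idea, written against a recorded trace of opcodes rather than inside the interpreter loop (where you would just remember the previous opcode). The Op type and pairCounts are made up for the example; this is not the actual instrumentation.

```go
package main

import "fmt"

// Op is a stand-in for an interpreter opcode.
type Op string

// pairCounts tallies each adjacent pair of opcodes in an executed trace.
func pairCounts(trace []Op) map[[2]Op]int {
	counts := make(map[[2]Op]int)
	for i := 1; i < len(trace); i++ {
		counts[[2]Op{trace[i-1], trace[i]}]++
	}
	return counts
}

func main() {
	trace := []Op{"Load", "Value", "Get", "Load", "Value"}
	counts := pairCounts(trace)
	for pair, n := range counts {
		// Percentage of executed instructions that start this pair.
		pct := 100 * float64(n) / float64(len(trace))
		fmt.Printf("%.1f%%\t%s\t%s\n", pct, pair[0], pair[1])
	}
}
```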

%	op1	op2
7.0	Load	Value
6.7	Value	CallMethNoNil
5.2	Value	Get
4.8	Load	Load
4.5	This	Value
2.7	Store	Pop
2.5	This	Load
2.3	Get	Value
2.2	Pop	Load
2.2	Global	CallFuncNoNil
40.1	Total

The compiler code generator funnels through a single emit function so it was easy to modify that to detect the sequences and replace them with combined instructions. And it was easy to add the new instructions to the interpreter itself. If I’d stopped there it would have been a quick and easy one day project. But, of course, I had to benchmark it. At first that went well. Micro benchmarks showed clear improvement as I expected. But running our application test suite showed a slight slowdown of 1 or 2%. That puzzled me.
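As an illustration of detecting the sequences in emit, here is a simplified sketch. It is not the gSuneido code: the opcodes, the codegen type, and the fused table are invented for the example, and instructions are kept as structs rather than encoded bytes.

```go
package main

import "fmt"

type Op byte

const (
	Load Op = iota
	Value
	Get
	Pop
	LoadValue // hypothetical super instruction: Load followed by Value
	ValueGet  // hypothetical super instruction: Value followed by Get
)

// instr is one instruction plus its operands.
type instr struct {
	op   Op
	args []int
}

// fused maps a pair of opcodes to the super instruction that replaces it.
var fused = map[[2]Op]Op{
	[2]Op{Load, Value}: LoadValue,
	[2]Op{Value, Get}:  ValueGet,
}

type codegen struct {
	code []instr
}

// emit appends an instruction, unless it can be merged with the previous
// one, in which case the previous instruction is rewritten in place.
func (g *codegen) emit(op Op, args ...int) {
	if n := len(g.code); n > 0 {
		prev := &g.code[n-1]
		if super, ok := fused[[2]Op{prev.op, op}]; ok {
			prev.op = super
			prev.args = append(prev.args, args...)
			return
		}
	}
	// Copy args so the stored instruction doesn't alias the caller's slice.
	g.code = append(g.code, instr{op, append([]int(nil), args...)})
}

func main() {
	g := &codegen{}
	g.emit(Load, 1)
	g.emit(Value, 2)
	g.emit(Get)
	g.emit(Pop)
	fmt.Println(len(g.code), "instructions emitted")
}
```

One thing the sketch ignores is jumps. A real code generator has to avoid merging two instructions when the second one is a jump target, otherwise a branch would land in the middle of a super instruction.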

One possibility was that it slowed down the compiler. Running the test suite requires compiling all the code. The changes did slow down the compiling slightly but compiling is fast enough that it doesn’t even show up in profiling. I optimized the code slightly but that wasn’t the problem.

That left the interpreter. gSuneido uses a giant switch to dispatch instructions. Go doesn’t allow some of the other dispatching methods you could use in e.g. C(++). I had added 10 cases to the switch. Did that slow it down? How did Go implement switches? Searches didn’t turn up a lot, but it sounded like it used a binary search. Running the code in the debugger confirmed this. Maybe adding more cases had added another level to the binary search? I tried rewriting the switch as a hand written binary search using if-else, optimized for the most frequent instructions. That quickly turned into an ugly mess. More searching found Go issue 5496 to implement switches with jump tables. The issue appeared to be completed. I used Compiler Explorer to test and sure enough, Go compiled dense integer switches as jump tables. But not my interpreter switch??? I tried simplifying the switch, ordering the cases, removing fallthrough, etc. But the debugger still showed a binary search. I found the implementation in the Go compiler source code. It seemed straightforward; there were no special requirements. I tried one of my simple examples from Compiler Explorer inside my gSuneido project. That didn't show as a jump table in the debugger either. What the heck!

Back to the Go compiler source code. There was a condition on base.Flag.N, what was that? Oh crap. That's the -N command line option that disables optimization, which is used by the debugger. So whenever I was using the debugger to look at the assembly language code, I was seeing the unoptimized version. A little digging and I figured out how to use gdb to disassemble the production code and found it had been using a jump table all along. Argh!

Back to the original question - why was it slower? Maybe the extra code in the interpreter was affecting inlining? I looked at the compiler inlining logging but everything still seemed to be getting inlined correctly.

Looking at the CPU profile, the interpreter time decreased as I'd expect. The only noticeable increase was in garbage collector time. But the changes don't do any allocation. They would actually decrease the size of compiled code slightly. The allocation from the test suite was slightly higher. But the memory profile didn't show anything.

In the end, I gave up. I can't figure out why the overall test suite seems slightly slower. There doesn't seem to be any evidence that it's a result of the changes. Everything else shows an improvement so I'm going to keep it. I can only guess that the slowdown is a result of perturbing the code, affecting layout, or caching, or branch prediction, or timing. Modern CPUs are so complex that they are somewhat non-deterministic.

Code is in Github as usual.

Monday, March 24, 2025

Copy on Write

Copy on write is an interesting technique with a wide variety of applications. It's somewhat related to persistent immutable data structures, which are really "partial copy on write". Basically it's just lazy or deferred copying, but with the addition of reference counting.

It started when I happened to be looking at our memoize code. (That's the correct spelling; it's different from "memorize".) When it returns a mutable object, it makes a defensive copy. Otherwise, if the object was modified it would modify the cached value.

Defensive copies are a standard technique, but they're often inefficient because if the caller doesn't modify the object then the copy was unnecessary.

One solution is to make the cached values read-only. Then they can't be modified and you don't need a defensive copy. But this has two problems. One is that people forget to make it read-only, since it works fine without it. The other is that often you do need to modify the result and then all the callers have to copy.

My first thought was to add an explicit CopyOnWrite method. But most people wouldn't understand the difference or remember to use it. We could use it in Memoize, but that was quite limited.

Then I realized that it probably made sense to just make the existing Copy method always be copy-on-write i.e. deferred or lazy copying. That was assuming that I could implement copy-on-write with low enough overhead that the benefit would outweigh the cost.

The simplest naive approach is to mark both the original and the copy as copy-on-write. But then if you later modified them both, you'd end up making two copies, whereas with normal copying you'd only have made one copy. The solution is to keep a shared "copy count", similar to a reference count for memory management. If the copy count is zero, then you can just modify the object without copying it, since you know you won't affect any other "copies".

When you make a lazy copy, you increment the copy-count. When you do an actual copy to allow modification, you decrement the copy-count. Ideally you'd also decrement the copy-count when an object was garbage collected. (perhaps with the new runtime.AddCleanup in Go 1.24)

One catch is that the copy-count must be shared. At first I thought that meant I had to put the data and copy count in a separate object with an extra level of indirection for all references to the data. Then I realized it was only the copy count that had to be shared. So I just allocated it separately. That meant I could access it with atomic operations which have low overhead.

Luckily I had an existing test for concurrent access to objects. This failed with my changes. The race detector also found problems. Objects are locked while reading or writing. But with copy-on-write there are multiple objects referencing the same data. Locking an object isn't sufficient to protect the data. One solution would be what I previously considered - keeping the data and the copy count separately, along with a lock. But then we're back to too much overhead.

I found the problem was that I was decrementing the copy count before doing the actual copy. But as soon as the copy count went to zero, another thread could think it was ok to modify. I had to decrement the copy count after the actual copy. But that meant checking if the copy count was 0 separately from the decrement, which meant there was potential for two threads to check the copy count, both find it was 1, and both copy the object. I decided this would happen very rarely, and the only cost was an extra copy.
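Here is a much simplified sketch of the scheme, to show the shared copy count and the order of operations. It is not the Suneido code: the Object type and its fields are invented, the data is just a slice of ints, and the per-object locking is left out, so a single Object here is not safe to use from multiple goroutines.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Object shares its data with its lazy copies until one of them is modified.
type Object struct {
	data   []int
	copies *atomic.Int32 // allocated separately and shared; 0 means no other copies
}

func New(vals ...int) *Object {
	return &Object{data: append([]int(nil), vals...), copies: new(atomic.Int32)}
}

// Copy is the lazy copy: share the data and increment the shared copy count.
func (o *Object) Copy() *Object {
	o.copies.Add(1)
	return &Object{data: o.data, copies: o.copies}
}

// mustBeMutable is called by every update method before it modifies data.
func (o *Object) mustBeMutable() {
	if o.copies.Load() == 0 {
		return // nothing else refers to the data, modify in place
	}
	shared := o.copies
	o.data = append([]int(nil), o.data...) // the actual copy
	o.copies = new(atomic.Int32)           // the new data has no other copies
	// Decrement only after the copy is done. Decrementing first could let
	// another holder see zero and modify the data while it is being copied.
	shared.Add(-1)
}

func (o *Object) Set(i, v int) {
	o.mustBeMutable()
	o.data[i] = v
}

func (o *Object) Get(i int) int {
	return o.data[i]
}

func main() {
	a := New(1, 2, 3)
	b := a.Copy() // no copying yet
	b.Set(0, 99)  // b makes the one actual copy
	a.Set(1, 7)   // the count is back to zero so a is modified in place
	fmt.Println(a.Get(0), a.Get(1), b.Get(0), b.Get(1))
}
```

Because the check and the decrement are separate steps, two holders can both see a count of 1 and both copy. As described above, that only costs an extra copy.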

For once my code was structured so it was quite easy to implement this. Copying was done in a single place and update methods all called a mustBeMutable method. It only took about 40 lines of code.

And, pleasantly surprisingly, this abstraction wasn't leaky and it didn't break or affect any of our application code. Running our application tests there were roughly 500,000 deferred copies, and 250,000 eventual actual copies. So it saved half of the copying - nice!