Friday, October 07, 2011

The Grass Always Seems Greener

After tracking down the third bug to the same area of the code, I decided I had to stop adding more special cases and come up with a better solution.

The code was in the B-tree (technically B+-tree) implementation in my new append-only storage engine for jSuneido.

I'll try to explain the problem. I combine keys with a child node pointer. That leaves the question of whether a key points to the values less than it, or greater than it. I've always gone with greater than. But the problem is that when you do a lower bound binary search, the resulting position is one higher than the one you want. For example, if the tree node contains the keys 2, 4, 6, 8, the leaf containing 5 will be pointed at by the "4". But a lower bound binary search will give the position where you would insert 5, i.e. the position of the "6" key.

You can adjust for this, but it gets ugly. After the multiple bugs, I started to think that making the keys point to values less than the key would be cleaner because you wouldn't have the off by one problem.

So yesterday morning I set out to change the code. It's not a large amount of code and the changes were straightforward, but the code is fairly complex, and the "greater than" assumption was hidden in more places than I would have liked.

Eight hours later I had all my tests passing again. And it did simplify some parts of the code. But it also turned out (not surprisingly) that this approach had its own drawbacks. For example, a standard optimization is to detect when keys are being added in order (i.e. to the "end" of the B-tree) and to put only the new key in the new node, keeping the old full node intact. The way my code works, this is much easier with the "greater than" approach.

I'm sure I could have got it all worked out, but switching approaches was not the clear winner that I'd hoped.

Actual written code, with all it's messiness, always seems inferior to an imagined alternative, which is always perfectly clean. (And therefore, the grass is always greener title.) But when you implement the alternative, it often ends up just as messy as what you had before. (This relates to never rewrite from scratch)

I came up for air, ate some supper, realized I hadn't stepped foot outside the house all day, and considered my options. A glass of wine in hand, I returned to the computer. I wanted to "finish" working on it while all the details were fresh in my mind. I knew I wouldn't get a chance to work on it again for a week or so, and then I'd have to get back up to speed.

I threw out all the changes I'd made. (Except for a few additional tests I'd written.) Annoying, but at least it was only a days work.

By carefully adjusting the code I was able to keep the old approach, but remove most of the special cases that had been causing me trouble. The code is quite a bit cleaner now. I still have the off by one adjustment, but it's in a single place now.

Meanwhile, I've been reading Tokutek's blog about how COLA (cache oblivious lookahead array) style fractal trees are so much better than B-trees. That grass sure looks greener over there :-)

No comments: