Friday, March 23, 2012

I'm a little slow sometimes

A roundabout detour through bit twiddling optimizations that turn out to be ultimately unnecessary.

Originally, Suneido used 32 bit database file offsets. Remember when disks (let alone files) were limited to 4 gb? When we needed to support databases larger than 4 gb, I didn't want to switch to 64 bit offsets because of the extra space required to store them, so I "compressed" larger offsets into 32 bits (a bit like Java's -XX:+UseCompressedOops option).

The approach I used was to align and shift. If you align the data in the file on, for example, 8 byte boundaries, then the lower 3 bits of an offset will always be zero. So you can shift a 35 bit offset to the right 3 bits so it fits into 32 bits. (Who says bit twiddling is dead?) This allows a maximum file size of 32 gb instead of 4 gb. (Actually, for obscure historical reasons, even though the alignment is 8 bytes, the shift is only 2 bits, so the maximum size is 16 gb.)
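
In code, the idea is just a shift each way. Something like this (a minimal Java sketch, not the actual Suneido code):

    // A minimal sketch of align/shift compression (illustrative only).
    // With 8 byte alignment the low 3 bits of an offset are always zero,
    // so shifting them away fits a 35 bit offset into 32 bits.
    static final int SHIFT = 3; // 8 byte alignment

    static int compress(long offset) {
        assert (offset & ((1L << SHIFT) - 1)) == 0; // must be aligned
        return (int) (offset >>> SHIFT);
    }

    static long expand(int stored) {
        // treat the stored int as unsigned before shifting back
        return (stored & 0xffffffffL) << SHIFT;
    }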

The tradeoff is some wasted space. Each block of data will waste an average of 4 bytes.

The problem is, it's starting to look like we need to allow even larger databases. I could just increase the align/shift to 16 bytes/4 bits to allow 64 gb. But that would increase the average waste to 8 bytes, even though most databases aren't that big.

Ideally, you'd want small databases to use a small align/shift, and larger databases to use a larger align/shift. I could make the align/shift a command line option, but you'd have to rebuild the database to change it. And I'm not a big fan of cryptic options that most people won't understand.

My next thought was to use a variable align/shift. There are different ways you could do this, but a simple one would be to reserve two bits to specify the align/shift. This would leave 30 bits (1 gb range) for the unshifted offset. For example:

00 = 4 byte alignment = 0 to 4 gb
01 = 8 byte alignment = 4 to 12 gb
10 = 16 byte alignment = 12 to 28 gb
11 = 32 byte alignment = 28 to 60 gb
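
Decoding would pick the alignment and region from the tag bits. Roughly (a hypothetical sketch matching the table above, not real Suneido code):

    // The top 2 bits select the alignment; the low 30 bits are the
    // shifted offset within that region.
    static long expand(int stored) {
        int tag = stored >>> 30;          // 0..3 selects the alignment
        long base = stored & 0x3fffffffL; // low 30 bits
        switch (tag) {
        case 0:  return base << 2;                 // 4 byte align, 0 to 4 gb
        case 1:  return (base << 3) + (4L << 30);  // 8 byte align, from 4 gb
        case 2:  return (base << 4) + (12L << 30); // 16 byte align, from 12 gb
        default: return (base << 5) + (28L << 30); // 32 byte align, from 28 gb
        }
    }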

For databases up to 4 gb, this would use a smaller alignment than currently, and therefore less wasted space, while still allowing a much higher maximum size than now. Only large databases would pay the cost of higher wastage due to larger alignment.

This scheme would be slightly more complex to decode, but not too bad.

At first I was excited about this approach, but it doesn't scale. You can increase alignment to allow bigger files, but then you waste more and more space. At some point, you'd be further ahead to drop the alignment and just store 64 bit offsets. Except you don't want the overhead of large offsets on small databases.

So why not just use a variable length encoding like Google Protocol Buffer varint?

A variable length encoding would not be the simplest thing to manipulate in the code, but we don't have large numbers of offsets in memory, so you could use 64 bit unencoded offsets in memory, and just use the variable length encoding on disk.
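
For reference, a protobuf-style varint stores seven bits per byte, least significant group first, with the high bit flagging that more bytes follow. A minimal sketch (illustrative only):

    static int encodeVarint(long value, byte[] buf) {
        int i = 0;
        while ((value & ~0x7fL) != 0) {
            buf[i++] = (byte) ((value & 0x7f) | 0x80); // low 7 bits + continuation flag
            value >>>= 7;
        }
        buf[i++] = (byte) value; // final byte, high bit clear
        return i; // number of bytes written
    }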

At this point I whack myself on the forehead. D'oh! Sometimes I am such an idiot! (A lot of my "aha!" moments are spoiled by thinking that if I was just a little bit smarter and had thought of it a little sooner, I could have saved myself a lot of work.)

Suneido already uses a variable length encoding when storing file offsets!

The offsets are stored in the btree key records. They get encoded the same way other numeric values are packed into records, which is already a variable length encoding.

So all the messing around with align/shift is mostly wasted effort. It does reduce the magnitude of the offsets, which does mean they are encoded slightly smaller, but this is minor.

I'd be further ahead to drop the align/shift altogether, use 64 bit offsets in memory, and let the variable length encoding handle making small offsets small. This would eliminate the whole maximum database file size issue!

If I want to save space, I could use a different variable length encoding for offsets. The current encoding is designed for Suneido's decimal floating point numbers and is not ideal for simple integers.

I can't use the Protocol Buffer varint encoding because I need encoded integers to sort in the same order as the original values. Varints don't, because they are stored least significant byte first (for example, 200 encodes starting with the byte 0xC8 while 300 starts with 0xAC, so 200 would sort after 300). One simple approach would be to reserve the first two bits to specify the length of the value, for example:

00 = 4 bytes
01 = 5 bytes
10 = 6 bytes
11 = 7 bytes

This would give a maximum of 54 bits (7 * 8 - 2) or about 16,000 terabytes. By the time anyone is trying to make a Suneido database bigger than that, it should be someone else's problem :-)
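
A rough sketch of what that encoding might look like (hypothetical code and names, nothing that exists in Suneido):

    // The top 2 bits of the first byte give the total length (4 to 7
    // bytes); the rest holds the offset, most significant byte first,
    // so minimal-length encodings sort byte-wise like the raw values.
    static byte[] encode(long offset) {
        assert 0 <= offset && offset < (1L << 54);
        int extra = 0; // bytes beyond the minimum of 4
        while (offset >= (1L << (30 + 8 * extra)))
            extra++; // use the smallest length that fits
        int len = 4 + extra;
        long value = ((long) extra << (8 * len - 2)) | offset;
        byte[] buf = new byte[len];
        for (int i = len - 1; i >= 0; i--) {
            buf[i] = (byte) value;
            value >>>= 8;
        }
        return buf;
    }

    static long decode(byte[] buf) {
        int len = 4 + ((buf[0] & 0xff) >>> 6); // length from the tag bits
        long value = 0;
        for (int i = 0; i < len; i++)
            value = (value << 8) | (buf[i] & 0xff);
        return value & ((1L << (8 * len - 2)) - 1); // strip the tag bits
    }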

Of course, I'm itching to implement this, but I'm in the middle of debugging my immudb improvements so it'll have to wait.
