Saturday, April 30, 2011

Suneido's Record Structure Revisited

After my last blog post on Suneido's record data structure, the wasted byte in the header started to bug me. I know where it originally came from, I simply had a C struct like:

     char type;
     short nfields;

C/C++ automatically adds a byte of padding after the char so that the short is aligned.

Most of the time the padding was invisible and when I did think about it, I figured if the average record is, for example, 100 bytes, then it's only 1% overhead, no big deal.

After the last blog post I was wondering how I could eliminate that byte of padding. There are only three types, so it only really requires two bits. And the full range of nfields isn't really required, so you could store the type in two bits of nfields, still leaving a maximum number of fields of 16,383.

That actually cuts two bytes, not just one. But still, that's only 2%.

But then I realized that Suneido also uses this record format to store keys within btree nodes. And keys are shorter. If the average key is 10 or 20 bytes, then the overhead is 10 or 20%, not 2%. That's a little more significant.

And it's not just the memory space, it's also garbage collection overhead, and disk reads and writes. And it's also copying overhead when "updating" immutable data records and btree nodes.

One of the reasons I kept the same record format in jSuneido as in cSuneido was so that dumps from cSuneido could be directly loaded into jSuneido without converting the record format. But btrees are not moved from one version to another, so they could easily use a different, more compact format. And it wouldn't be hard to convert the format when loading cSuneido dumps.

One drawback to this approach is that when the offsets are 4 byte ints, then a 2 byte header will mean the offsets are not aligned. This might be more of a concern in cSuneido, where records are used more heavily and are accessed more directly. But I don't think it's a big issue in jSuneido.

It wasn't too hard to implement. With one table of about 1600 records that was 1 mb in size (including indexes) the database was about 5% smaller after the changes. Not bad for a few hours work and very little added complexity.

No comments: