Saturday, April 30, 2011

Suneido's Record Structure Revisited

After my last blog post on Suneido's record data structure, the wasted byte in the header started to bug me. I know where it originally came from: I simply had a C struct like:

     char type;
     short nfields;

C/C++ automatically adds a byte of padding after the char so that the short is aligned.

Most of the time the padding was invisible, and when I did think about it, I figured that if the average record is, say, 100 bytes, then the wasted byte is only 1% overhead, no big deal.

After the last blog post I was wondering how I could eliminate that byte of padding. There are only three types, so the type really requires only two bits. And the full range of nfields isn't needed, so the type could be stored in two bits of nfields, still leaving a maximum of 16,383 fields.
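The packing described above can be sketched like this. (This is just an illustration with hypothetical names, not the actual Suneido code; it assumes the type goes in the top two bits of the 16-bit value.)

```java
public class PackedHeader {
    // Pack the record type into the top two bits of the 16-bit
    // nfields value, leaving 14 bits for the field count (max 16,383).
    static short pack(int type, int nfields) {
        return (short) ((type << 14) | nfields);
    }

    static int type(short header) {
        return (header >>> 14) & 3; // mask off sign-extension bits
    }

    static int nfields(short header) {
        return header & 0x3fff;
    }

    public static void main(String[] args) {
        short h = pack(2, 16383);
        System.out.println(type(h));    // 2
        System.out.println(nfields(h)); // 16383
    }
}
```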

That actually cuts two bytes, not just one. But still, that's only 2%.

But then I realized that Suneido also uses this record format to store keys within btree nodes. And keys are shorter. If the average key is 10 or 20 bytes, then the overhead is 10 or 20%, not 2%. That's a little more significant.

And it's not just the memory space, it's also garbage collection overhead, and disk reads and writes. And it's also copying overhead when "updating" immutable data records and btree nodes.

One of the reasons I kept the same record format in jSuneido as in cSuneido was so that dumps from cSuneido could be directly loaded into jSuneido without converting the record format. But btrees are not moved from one version to another, so they could easily use a different, more compact format. And it wouldn't be hard to convert the format when loading cSuneido dumps.

One drawback to this approach is that when the offsets are 4 byte ints, then a 2 byte header will mean the offsets are not aligned. This might be more of a concern in cSuneido, where records are used more heavily and are accessed more directly. But I don't think it's a big issue in jSuneido.

It wasn't too hard to implement. With one table of about 1600 records that was 1 mb in size (including indexes), the database was about 5% smaller after the changes. Not bad for a few hours' work and very little added complexity.

Thursday, April 28, 2011

Suneido's Record Data Structure

Both cSuneido and jSuneido use the same data structure for low level record storage.

type is one of 'c' (char, 1 byte), 's' (short, 2 bytes), or 'l' (long, 4 bytes) and specifies the size of the length and offset values. (In Java, a better choice of types would have been 'b' (byte), 's' (short), and 'i' (int).) (Calling a 4 byte integer "long" is a leftover from the days when ints were 2 bytes.)

There is one byte of padding after the type byte. (Originally a result of C struct padding.) Or the type can be treated as a 2 byte little endian short integer.

nfields is a short (2 byte) integer count of the number of fields in the record.

length is the overall number of bytes in the record. When calculating field sizes as the difference between two offsets, it is handy to be able to reference the length as offset[-1].

The fields data is variable length and stored in reverse order.
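Putting the pieces above together, field sizes fall out of the offsets, with the record length standing in for offset[-1]. Here is a minimal sketch (hypothetical names, not the actual record code) of that arithmetic:

```java
public class FieldAccess {
    // Size of field i, computed as the difference between adjacent
    // offsets. Fields are stored in reverse order, so field i occupies
    // [offsets[i], offsets[i-1]), with the record length playing the
    // role of offsets[-1].
    static int fieldSize(int[] offsets, int length, int i) {
        int end = (i == 0) ? length : offsets[i - 1];
        return end - offsets[i];
    }

    public static void main(String[] args) {
        // A 20-byte record with three fields of sizes 5, 4, and 3,
        // packed at the end of the record in reverse order:
        int[] offsets = { 15, 11, 8 };
        System.out.println(fieldSize(offsets, 20, 0)); // 5
        System.out.println(fieldSize(offsets, 20, 1)); // 4
        System.out.println(fieldSize(offsets, 20, 2)); // 3
    }
}
```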

In cSuneido records are used as mutable data structures. As fields are added, the offsets and fields grow towards each other. Before storing in the database file they are copied to the exact size required. As a mutable data structure, one drawback of the design is that when the record is reallocated to a larger size, all the offsets change. If the field data was at the beginning and the offsets at the end, then reallocating would not change the offsets and they could simply be copied to the new record. The tradeoff would be that accessing the offsets would be more awkward since they would not start at a fixed location in the record.

In jSuneido records are stored in NIO ByteBuffers. They are only used as immutable data structures (at least in the new storage engine). A RecordBuilder is used to collect the data so that the record can then be created the exact size required (i.e. with no unused space between the offsets and the fields).

Allowing different sizes for the length and offsets makes the overhead low for small records (an empty record is only 5 bytes) while still allowing for large records (up to 32k fields and 2gb length).

Having an array of offsets allows random access to the fields. For example, a search on the Nth field can access that field directly, without having to scan through the record. Of course, this is only an advantage because Suneido does not convert records to an internal format to search them.

Offsets are used rather than pointers because records are intended to be stored in the database where pointers would not be usable.

In this design empty fields still take up space in the offsets array although not in the field data. (e.g. in the common short type, empty fields take 2 bytes each) This seems like a reasonable tradeoff for the ability to randomly access fields.

The C++ source code is in record.h and record.cpp and the Java code is in

Tuesday, April 26, 2011

Semi-Immutable Data Structures

In the new append-only database storage engine for jSuneido I've ended up with several data structures that are "semi-immutable" i.e. they are immutable some of the time and mutable some of the time.

(These are in-memory data structures, the database file itself is only ever appended to, existing data is always immutable i.e. never updated.)

I've flip-flopped back and forth a few times between completely immutable and partially immutable, but I seem to have settled on partially immutable.

The overall "state" of the database is kept immutably. When a database transaction starts, it "copies" this state. (The "copy" consisting of just a few pointers to persistent immutable data structures.)

"Inside" a transaction, the data structures can become mutable. The first update a transaction does will path copy nodes up to the root. This "disconnects" the data structure from the overall databases state. Nodes within the data structure can now be either shared and immutable or not shared and mutable.

(Allowing mutability within transactions is "safe" because transactions are thread-contained i.e. only one thread at a time works on a transaction.)

The main advantage of this approach is performance since nodes can be updated with less copying.

At the end of a transaction commit, the new state is stored (persisted) in the database file. This converts all the nodes back to immutable.
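The shared/not-shared distinction above can be sketched roughly like this. (A simplified illustration with hypothetical names, not the actual jSuneido node code.)

```java
import java.util.Arrays;

public class Node {
    // A node that is shared (still reachable from the persisted
    // database state) must be copied before updating; a private,
    // unshared copy can be updated in place.
    final Object[] slots;
    boolean shared;

    Node(Object[] slots, boolean shared) {
        this.slots = slots;
        this.shared = shared;
    }

    Node with(int i, Object value) {
        Node n = shared
                ? new Node(Arrays.copyOf(slots, slots.length), false)
                : this;
        n.slots[i] = value;
        return n;
    }

    public static void main(String[] args) {
        Node original = new Node(new Object[] { "a", "b" }, true);
        Node updated = original.with(0, "x");
        System.out.println(original.slots[0]); // a - shared node untouched
        System.out.println(updated.slots[0]);  // x
        System.out.println(updated.with(1, "y") == updated); // true - updated in place
    }
}
```

Copying happens at most once per node per transaction; after that the node is private and updates are cheap, which is where the performance advantage comes from.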

This semi-immutable approach is used both for the table index btrees and the hash tries used to store redirects and database info (e.g. btree roots).

Sunday, April 24, 2011

jSuneido append-only database format

The following is as much for my own review as anything.

The database file layout is quite simple - it consists of a sequence of commits.

Each commit has a header (size and timestamp) and a trailer (checksum and size). Given the leading and trailing sizes, the commits can be traversed forwards and backwards.
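Backward traversal via the trailing size can be sketched like this. (The exact field widths and layout are assumptions for illustration; the post only says each commit has leading and trailing sizes.)

```java
import java.nio.ByteBuffer;

public class CommitWalker {
    // Given the end offset of a commit, read its trailing size
    // (assumed here to be a 4-byte int just before the end) and
    // step back to the start of that commit.
    static long commitStart(ByteBuffer db, long commitEnd) {
        int size = db.getInt((int) (commitEnd - 4));
        return commitEnd - size;
    }

    public static void main(String[] args) {
        // Two fake commits of total sizes 12 and 16 bytes,
        // each bracketed by its leading and trailing size.
        ByteBuffer db = ByteBuffer.allocate(28);
        db.putInt(12).putInt(0xAAAA).putInt(12);   // commit 1: bytes 0..11
        db.putInt(16).putLong(0xBBBBL).putInt(16); // commit 2: bytes 12..27
        long start2 = commitStart(db, 28); // 12
        long start1 = commitStart(db, start2); // 0
        System.out.println(start2 + " " + start1);
    }
}
```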

The body of a commit consists of the data records (table rows), btree nodes, and hash trie nodes followed by pointers to the roots and redirects hash tries.

The roots hash trie points to the btree indexes as well as other database information. As I've talked about before, the redirects are used to avoid path copying in the btrees.

When the database is opened, the roots and redirects are accessed at a fixed offset from the end of the file. Together they provide access to the database at a certain point in time.

Read-only transactions, operating as of a certain point in time, simply use the roots and redirects as of when they started. They can operate completely independently because the data they are looking at is immutable. There are no locks or concurrency issues.

When the database is opened, the checksums of the last few commits are verified to confirm that the database was previously closed properly. (This uses the trailing size to traverse the commits from the end of the file backwards.)

If this checking fails, then crash recovery is done. The commits are traversed from the beginning of the file (using the leading sizes) and the checksums are verified. The file is then truncated after the last good commit. (This is much simpler and faster than the current system.)

Checksums are of the raw data - they do not care about the logical contents of the commits. They use Adler32 checksums for speed.
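Verifying a commit's checksum against its raw bytes, using Java's standard java.util.zip.Adler32, looks roughly like this (a sketch with hypothetical names, not the actual recovery code):

```java
import java.util.zip.Adler32;

public class Verify {
    // Compute Adler32 over the raw commit bytes and compare it to
    // the checksum stored in the commit's trailer.
    static boolean verify(byte[] commitBytes, long storedChecksum) {
        Adler32 cksum = new Adler32();
        cksum.update(commitBytes, 0, commitBytes.length);
        return cksum.getValue() == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] data = "commit body".getBytes();
        Adler32 a = new Adler32();
        a.update(data, 0, data.length);
        System.out.println(verify(data, a.getValue()));     // true
        System.out.println(verify(data, a.getValue() + 1)); // false
    }
}
```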

Note: Although it's not shown in the diagram, there may be "padding" between elements in the file. This is because the memory mapped file is mapped and extended in "chunks" of 64 mb. Elements are not allowed to straddle chunks, so padding may be required at the end of a chunk if an element won't fit. Since there is no sequential access to elements within commits, this padding has no effect, other than a very small amount of wasted space.

All of this is implemented and working well. But I've still got a fair bit of work left to integrate this new storage engine into the rest of the database code. Sadly, I don't have a nice clean storage engine interface :-(

As usual, the code is available in the Suneido project Mercurial repository on SourceForge. The new storage engine code is in the immudb package.

Thursday, April 21, 2011

NativeJ Java Launcher

If you're looking for a way to run Java programs as Windows services, or launch them from an exe, I would highly recommend Dobysoft's NativeJ

The software works well, but more importantly, they have bent over backwards to accommodate our needs. Granted we paid for it, but $150 doesn't cover a lot of support, let alone programming.

Previously we used JSL to run as a service and Launch4J for a launcher. Both are free and worked ok, but we ran into problems with Launch4J not returning the exit code (a known issue). It's open source so I guess we could have dug into it and tried to fix it, but I was looking for something that would handle both services and launcher together and NativeJ looks like a good solution.

Wednesday, April 20, 2011

PDP-11 Emulator running V6 Unix - in Javascript!


I was in high school when I got my first Unix account on a computer science system at the university. It might have been a PDP-11. I remember struggling to teach myself C from Kernighan and Ritchie when all I knew was Basic. And of course way too shy to actually ask anyone for help.

Sunday, April 17, 2011

A Fractal Tree Implementation

I couldn't resist whipping up a quick implementation of a fractal tree, even though I don't really need it for anything. I'm a sucker for a good algorithm :-)

The code can be amazingly simple. Here's my first cut at insertion:
public void add(T x) {
     Object[] tmp = new Object[] { x };
     int i = 0;
     for (; nodes[i] != null; ++i) {
          tmp = merge(nodes[i], tmp);
          nodes[i] = null;
     }
     nodes[i] = tmp;
}
merge simply merges two sorted arrays into a new larger array.
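That merge helper can be as simple as the classic two-pointer merge of sorted arrays. A sketch (assuming the elements are Comparable, which the real code may handle differently):

```java
import java.util.Arrays;

public class Merge {
    // Merge two sorted arrays into a new, larger sorted array.
    @SuppressWarnings("unchecked")
    static Object[] merge(Object[] a, Object[] b) {
        Object[] result = new Object[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length)
            result[k++] = ((Comparable<Object>) a[i]).compareTo(b[j]) <= 0
                    ? a[i++] : b[j++];
        while (i < a.length)
            result[k++] = a[i++];
        while (j < b.length)
            result[k++] = b[j++];
        return result;
    }

    public static void main(String[] args) {
        Object[] merged = merge(new Object[] { 1, 3, 5 }, new Object[] { 2, 4 });
        System.out.println(Arrays.toString(merged)); // [1, 2, 3, 4, 5]
    }
}
```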

This works fine, but it will chew through a lot of memory allocating new arrays. The space will be reclaimed by the garbage collector, but it seems a little wasteful.

So I modified it to pre-allocate the first few levels of small arrays since these are the ones that get the heavy activity. (e.g. the first four levels of size 1, 2, 4, and 8 will handle 15/16 of adds) This complicates the code a bit, but not too badly.

Iteration is quite simple as well - just merging the sorted nodes.

The full code is on SourceForge

Friday, April 15, 2011

Fractal Trees

Another interesting data structure is the Fractal Tree used by Tokutek's TokuDB (another MySQL variant). I think this is related to the stratified data structure (at least some of the description is similar).

B-trees are at one point in the range of insert speed versus lookup speed. Fractal trees are much faster at inserts, but slower at queries.

It's easy to see how inserts work, and that most of them would be very fast. But part of the reason it's fast is that the cost is amortized - there are occasional big operations (a little like having to resize a hash table). It sounds like they're doing these in a background thread to avoid delays, but that has to make querying tricky.

And querying seems tricky regardless. A naive binary search in each level is going to be slow (and potentially many disk accesses or at least cache misses). To speed this up they add pointers from each value to its position in the next level. They don't go into detail, but I have to think maintaining these pointers is going to make inserts more complex.

I'm not sure how this approach would work for an append-only index. Again, the average insert wouldn't write much, but the occasional insert that caused a lot of merging would write a lot. Maybe the amortized cost would still be reasonable.

It seems like it would make an interesting sort algorithm, but I have no idea how the performance would compare. It's a little like a merge sort, which has good performance.

Doesn't look like this is something I can use directly, but it's always good to learn new approaches.

Thursday, April 14, 2011

B-tree Arcana

I recently ran across a new article on Stratified B-trees and versioning dictionaries. I have to admit I didn't really follow their data structure or algorithm. I looked at some of the references, for example to Cache-Oblivious Streaming B-trees but the ideas seem too complicated to be directly useful to me.

However, there were some bits that were interesting to me. They talked about the space blowups and inefficient caching in path copying CoW (copy-on-write) B-trees.

each update may cause an entire new path to be written – to update a 16-byte key/value pair in a tree of depth 3 with 256K block size, one must first do 3x256K random reads and then write 768K of data (see which reported 2GB of log data per 50MB of index data)

I'm hoping to avoid this using redirection (based on ideas from RethinkDB). By redirecting instead of path copying, you only need to copy the node you're updating, not all the nodes on the path to the root. (And the nodes can be orders of magnitude smaller - see below.)
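At its core, redirection is just a map from a node's old address to the address of its updated copy; anything not in the map is accessed directly. A minimal sketch (hypothetical names, not the actual jSuneido code):

```java
import java.util.HashMap;
import java.util.Map;

public class Redirects {
    // Instead of path-copying every node up to the root, record a
    // mapping from a node's old file offset to the offset of its
    // updated copy. Readers resolve addresses through the map.
    final Map<Long, Long> redirects = new HashMap<>();

    long resolve(long adr) {
        Long r = redirects.get(adr);
        return r == null ? adr : r; // unchanged nodes: direct access
    }

    public static void main(String[] args) {
        Redirects r = new Redirects();
        r.redirects.put(100L, 900L); // node at 100 was rewritten at 900
        System.out.println(r.resolve(100L)); // 900
        System.out.println(r.resolve(200L)); // 200
    }
}
```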

Many systems use indirect access through a mapping table all the time. Instead, I use direct access whenever possible, and only indirect to updated data. That feels like it should be better, but without a lot of analysis it's hard to be sure. It does reduce space overhead somewhat.

One of the big issues with CoW B-trees is garbage collecting the old nodes. I have to admit I punt on this, requiring occasional database compaction instead. (This also removes all the redirection.) Not so good if you want to run 24x7 but it's a tricky problem and for our customers occasional maintenance downtime isn't a big deal.

Another interesting point was on SSD's (solid state drives).

SSDs handle small random reads very effectively, but perform poorly at random writes (the recent Intel X25M can perform 35,000 4KB random reads/s, but an order of magnitude fewer writes). However, they can perform very efficiently for large sequential writes.

That fits really well with CoW append-only B-trees since they do exactly that - random reads but sequential writes. And I recently got a new computer at work with an SSD :-) so I'll be able to do some testing.

Another point that caught my eye was that their arrays doubled in size between levels. Not understanding their system, I'm not completely sure what this means. But it rang a bell because I've been considering doing something similar with my B-tree nodes. Often, B-trees use large node sizes to reduce the number of random disk IO's. But large nodes don't work well with copy-on-write. And random IO's aren't as bad with SSD or just large disk caches.

I still need to do some testing to determine the best size, but I think it will be quite small. But I wonder if it would make sense to make nodes larger as you go "up" the tree towards the root. The leaf nodes are updated most often so it makes sense to make them small. But as you go up the tree the nodes are read more and updated less. Increasing (e.g. doubling) the size of the nodes at each level would decrease the number of tree levels and reduce the number of random IO's. And since larger nodes would split less often, higher tree levels would get updated even less frequently.

If I was in the academic world I'd need to figure out the theoretical performance of my approach under DAM (direct access machine) or CO (cache oblivious) models. Instead, I'll do some quick and dirty benchmarks and if there's not a big performance difference go with the simplest code.

Wednesday, April 13, 2011

Dead End

I've been plugging away at the new append-only storage engine for jSuneido. I have a decent proof of concept, but it's still a long way from production.

Recently I was working on btree iteration. Usually btrees link the leaf nodes to simplify iteration. But this linking is problematic with an append-only btree. So the iteration has to work a little differently. One of the problems is how to handle changes made during the iteration. Being in an immutable mindset, my immediate thought was to iterate through a "snapshot" of the btree. (Since snapshots are just a pointer in immutable persistent data structures.)

But then I realized I was only immutable on disk. In memory, within a transaction, I was using mutable data structures. So I jumped in and converted my btrees and the redirection hash trie map to fully immutable. After I'd pretty much finished this, I realized I should have looked closer before I leaped. But when you have a hammer, everything looks like a nail :-)

Iterating through a snapshot won't work. For example, if you update a record during iteration, the iteration would see an out of date version of the record, and might even try to update this "stale" data. Back to the drawing board.

On the positive side, I improved the code quite a bit. I've lost track of how many times I've refactored, reshuffled, rewritten this new storage engine code. It's not that big - currently about 2500 lines. You wouldn't think there'd be that many ways to arrange it!

Since I ended up not needing full immutability, I changed it back to being mutable in memory (better performance). But I couldn't just revert the code because I wanted to keep some of the improvements I'd made.

About three days work on a dead end path. Oh well, if you're trying to come up with something new you have to expect a few dead ends. Not the first time, and I'm sure it won't be the last.