Thursday, April 28, 2011

Suneido's Record Data Structure

Both cSuneido and jSuneido use the same data structure for low level record storage.


type is one of 'c' (char, 1 byte), 's' (short, 2 bytes), or 'l' (long, 4 bytes) and specifies the size of the length and offset values. (In Java, a better choice of types would have been 'b' (byte), 's' (short), and 'i' (int)). (Calling a 4 byte integer "long" is a leftover from the days when int's were 2 bytes.)

There is one byte of padding after the type byte. (Originally a result of C struct padding.) Or the type can be treated as a 2 byte little endian short integer.

nfields is a short (2 byte) integer count of the number of fields in the record.

length is the overall number of bytes in the record. When calculating field sizes as the difference between two offsets, it is handy to be able to reference the length as offset[-1].

The fields data is variable length and stored in reverse order.

In cSuneido records are used as mutable data structures. As fields are added, the offsets and fields grow towards each other. Before storing in the database file they are copied to the exact size required. As a mutable data structure, one drawback of the design is that when the record is reallocated to a larger size, all the offsets change. If the field data was at the beginning and the offsets at the end, then reallocating would not change the offsets and they could simply be copied to the new record. The tradeoff would be that accessing the offsets would be more awkward since they would not start at a fixed location in the record.

In jSuneido records are stored in NIO ByteBuffer's. They are only used as immutable data structures (at least in the new storage engine). A RecordBuilder is used to collect the data so that the record can then be created the exact size required. (i.e. with no unused space between the offsets and the fields)

Allowing different sizes for the length and offsets makes the overhead low for small records (an empty record is only 5 bytes) while still allowing for large records (up to 32k fields and 2gb length).

Having an array of offsets allows random access to the fields. For example, a search on the Nth field can access that field directly, without having to scan through the record. Of course, this is only an advantage because Suneido does not convert records to an internal format to search them.

Offsets are used rather than pointers because records are intended to be stored in the database where pointers would not be usable.

In this design empty fields still take up space in the offsets array although not in the field data. (e.g. in the common short type, empty fields take 2 bytes each) This seems like a reasonable tradeoff for the ability to randomly access fields.

The C++ source code is in record.h and record.cpp and the Java code is in Record.java

No comments: