Sunday, April 24, 2011

jSuneido append-only database fomat

The following is as much for my own review as anything.

The database file layout is quite simple - it consists of a sequence of commits.

Each commit has a header (size and timestamp) and a trailer (checksum and size). Given the leading and trailing sizes, the commits can be traversed forwards and backwards.

The body of a commit consists of the data records (table rows), btree nodes, and hash trie nodes followed by pointers to the roots and redirects hash tries.

The roots hash trie points to the btree indexes as well as other database information. As I've talked about before, the redirects are used to avoid path copying in the btrees.

When the database is opened, the roots and redirects are accessed at a fixed offset from the end of the file. Together they provide access to the database at a certain point in time.

Read-only transactions, operating as-of a certain point in time simply use the roots and redirects as of when they started. They can operate completely independently because the data they are looking at is immutable. There are no locks or concurrency issues.

When the database is opened, the checksums of the last few commits are verified to confirm that the database was previously closed properly. (This uses the trailing size to traverse the commits from the end of the file backwards.)

If this checking fails, then crash recovery is done. The commits are traversed from the beginning of the file (using the leading sizes) and the checksums are verified. The file is then truncated after the last good commit. (This is much simpler and faster than the current system.)

Checksums are of the raw data - they do not care about the logical contents of the commits. They use Adler32 checksums for speed.

Note: Although it's not shown in the diagram, there may be "padding" between elements in the file. This is because the memory mapped file is mapped and extended in "chunks" of 64 mb. Elements are not allowed to straddle chunks, so padding may be required at the end of a chunk if an element won't fit. Since there is no sequential access to elements within commits, this padding has no effect, other than a very small amount of wasted space.

All of this is implemented and working well. But I've still got a fair bit of work left to integrate this new storage engine into the rest of the database code. Sadly, I don't have a nice clean storage engine interface :-(

As usual, the code is available in the Suneido project Mercurial repository on SourceForge. The new storage engine code is in the immudb package.

No comments: