Thursday, January 19, 2012

Houston, we have a problem

I thought things were going well with my new immudb append-only storage engine. It finally passed all our application tests, and performance was looking good.

Then I noticed the amount of database growth when I ran our full application test suite:

mutable storage engine = 20 MB
immutable storage engine = 400 MB

Ouch. I always expected it to use more space - that's the cost of immutable persistent data structures. But I didn't expect it to be 20 times as much.

With immutable persistent data structures in memory, the space is reclaimed by garbage collection. But so far the only form of garbage collection for database space that Suneido has had is offline compaction.

It's a little surprising that the performance is good, even on hard disk rather than SSD. I assume that's because sequential writes are actually quite fast. And also because it's using memory mapped file access so it's not blocking.
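To illustrate why memory-mapped sequential writes can be fast: the writer just copies bytes into the mapped region and lets the OS flush dirty pages to disk in the background. This is only a minimal sketch of the idea, not Suneido's actual code (the class and method names here are hypothetical):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal append-only writer: maps a region of the file and writes
// records sequentially into it. put() copies into page cache memory,
// so it normally returns without blocking on disk I/O; the OS writes
// the dirty pages out asynchronously.
public class AppendOnlyFile {
    private final MappedByteBuffer map;

    public AppendOnlyFile(Path path, int size) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // the mapping stays valid after the channel is closed
            map = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }

    // Appends a record and returns the offset it was written at.
    public int append(byte[] record) {
        int offset = map.position();
        map.put(record);
        return offset;
    }
}
```

Since nothing is ever overwritten, writes are purely sequential, which plays to the strength of hard disks.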

Given that the performance is ok, the database growth might even be acceptable. It might just mean compacting the database nightly instead of weekly as we currently do. However, presumably the performance would be even better if it wasn't writing so much.

I think it's worth spending some time to see if I can improve it. The first step was to see what the breakdown of the space was. (Note: these figures are growth, not total size.)

96,820 update transactions
data space 11 MB
btree node space 140 MB
redirection space 89 MB
db info space 148 MB

Spread over the 96,820 transactions, that roughly 388 MB of growth works out to only about 4 KB per transaction. That doesn't sound too bad, until you have 100,000 transactions.

The btree node space could likely be decreased a lot (maybe half?) by compression of the nodes, perhaps something as simple as prefix compression.
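Prefix compression along these lines stores each key in a sorted node as the number of leading characters it shares with the previous key, plus the remaining suffix. A rough sketch of the idea (nothing to do with the actual node format):

```java
import java.util.ArrayList;
import java.util.List;

// Prefix compression for a sorted list of keys: each entry records how
// many leading characters it shares with the previous key, plus the
// suffix that differs. Sorted keys in a btree node tend to share long
// prefixes, so the suffixes are short.
public class PrefixCompress {
    record Entry(int shared, String suffix) {}

    static List<Entry> compress(List<String> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int n = 0;
            int max = Math.min(prev.length(), key.length());
            while (n < max && prev.charAt(n) == key.charAt(n))
                n++;
            out.add(new Entry(n, key.substring(n)));
            prev = key;
        }
        return out;
    }

    static List<String> decompress(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            prev = prev.substring(0, e.shared) + e.suffix;
            out.add(prev);
        }
        return out;
    }
}
```

For example, ["apple", "application", "apply"] compresses to (0, "apple"), (4, "ication"), (4, "y").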

Redirection is used to avoid path copying all the way to the root in the btrees. It's stored as an immutable persistent hash trie. Unfortunately, it does have to do path copying itself and that may be part of the problem.
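Path copying means every update to an immutable structure allocates new copies of the nodes from the changed leaf up to the root. A toy persistent binary search tree shows the pattern (purely illustrative, much simpler than the actual hash trie):

```java
// Toy persistent BST: insert returns a new root, copying only the
// nodes on the path down to the insertion point. Untouched subtrees
// are shared between the old and new versions.
public class PersistentTree {
    record Node(int key, Node left, Node right) {}

    static Node insert(Node n, int key) {
        if (n == null)
            return new Node(key, null, null);
        if (key < n.key) // copy this node, recurse left, share right
            return new Node(n.key, insert(n.left, key), n.right);
        if (key > n.key) // copy this node, share left, recurse right
            return new Node(n.key, n.left, insert(n.right, key));
        return n; // key already present, nothing to copy
    }

    static boolean contains(Node n, int key) {
        if (n == null) return false;
        if (key < n.key) return contains(n.left, key);
        if (key > n.key) return contains(n.right, key);
        return true;
    }
}
```

In memory the old copies become garbage and get collected; in an append-only database file, every copied node is permanently occupied space. That is the cost redirection is meant to avoid for the btrees.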

I also noticed that each subsequent run used more redirection space. That's a little curious because it's doing the same set of updates each time. However, as the total number of redirections grows, the path copying within the redirection trie grows along with it.

The "db info" is another hash trie containing aggregate data about each table and index. The information is used for query optimization. It could use redirection instead of path copying but currently it doesn't because it's using the same hash trie implementation as the redirections. It currently stores a "blob" of information for each table.

One improvement would be to delete redirection and db info when a table or index is destroyed. However, this would help the tests more than actual application usage, since the tests create and destroy lots of temporary tables. That also means actual application usage may not show database growth as high as the tests do.

I'm not sure what to try first. It's too bad the space is coming from three similarly sized sources. If the majority had come from one source then that would be the obvious place to start. There's always a chance that there are bugs or design flaws causing part of the problem. Another variable to try adjusting is btree node size.

* You could argue that our application test suite ideally shouldn't be hitting the database at all. But that's not relevant to this post.

** The title of this post comes from Apollo 13
