Saturday, January 21, 2012

Problem or Opportunity?

I started to get a little worried about the database growth problem with my immudb append-only storage engine for jSuneido. I could think of ways to reduce the growth a little bit, but not by much, and only at the cost of adding a bunch of complexity.

Was I going to have to throw out a year's work, when it seemed so promising? Having to throw out a few days' work is one thing, but a year? How could I have gone so far without seeing this problem? Especially since I knew perfectly well it was an issue with this approach.

I'm as much an eternal optimist as the next programmer, with a big enough ego to think I can solve most problems. But I'm also human with the standard intermittent feelings of inadequacy and stupidity. In other words, I was starting to worry. But I knew enough to give myself time to mull it over.

I was sitting eating lunch, reading a computer book, when I started to think about the problem.

What if I wrote out the redirections as just a list of values? That would be a lot more compact than saving the changes to the hash trie. That would be fine while you were running because you'd still have the hash trie in memory. But what about restarting? How would you get the in-memory hash trie back? And how would you handle crash recovery?

It seemed like the problem wasn't so much the way I was saving the information as how often I was saving it: for every transaction. I did this so the on-disk data structure matched the in-memory data structure after every transaction commit.

But was that strictly necessary? What if you relaxed that constraint? What if you only wrote out all the extra information (btree nodes, redirections, db info) periodically? That would reduce the database growth a lot. You could then consider a continuum, from saving after every commit, to the opposite extreme of only saving when you shut down the server.

I realized that what I was thinking about was basically the idea of "checkpoints" - a standard database implementation technique. You could also think of it as simply deferring index writes to disk.
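
To make the continuum concrete, the checkpoint trigger could be as simple as a commit count or a timer, whichever fires first. A rough sketch (hypothetical names, not the actual jSuneido code):

```java
// Hypothetical sketch of a checkpoint trigger policy, not the actual jSuneido
// code. A checkpoint fires after N commits or after a time interval,
// whichever comes first - two points on the continuum between "persist every
// commit" and "persist only at shutdown".
public class CheckpointPolicy {
    private final int maxCommits;          // e.g. 1000 commits
    private final long maxIntervalMillis;  // e.g. 60 seconds
    private int commitsSinceCheckpoint = 0;
    private long lastCheckpointTime = System.currentTimeMillis();

    public CheckpointPolicy(int maxCommits, long maxIntervalMillis) {
        this.maxCommits = maxCommits;
        this.maxIntervalMillis = maxIntervalMillis;
    }

    /** Called after every transaction commit. */
    public synchronized boolean shouldCheckpoint() {
        commitsSinceCheckpoint++;
        long elapsed = System.currentTimeMillis() - lastCheckpointTime;
        return commitsSinceCheckpoint >= maxCommits || elapsed >= maxIntervalMillis;
    }

    /** Called after a checkpoint completes. */
    public synchronized void checkpointed() {
        commitsSinceCheckpoint = 0;
        lastCheckpointTime = System.currentTimeMillis();
    }
}
```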

Crash recovery becomes a little harder - you'd have to start from the last good checkpoint and reprocess the data from transactions after that, similar to how other databases reprocess their logs from the last checkpoint during crash recovery. This isn't quite as elegant or simple as being able to pick up from the last good commit, but crashes don't happen that often. And it was still a lot better than having to rebuild all the indexes after a crash, as Suneido's current mutable storage engine requires.
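
Roughly, recovery would look something like this sketch (the types and methods here are hypothetical placeholders, not the actual jSuneido recovery code):

```java
import java.util.List;

// Hypothetical sketch of checkpoint-based crash recovery, not the actual
// jSuneido code. On startup after a crash, open the database as of the last
// good checkpoint, then redo every commit after it to rebuild the indexes.
public class Recovery {
    interface Commit {
        boolean verifyChecksum();                 // detects a torn/partial commit
    }
    interface Checkpoint { }
    interface Storage {
        Checkpoint lastGoodCheckpoint();          // e.g. validated by checksum
        List<Commit> commitsAfter(Checkpoint cp); // commits in file order
    }
    interface Database {
        void reapply(Commit commit);              // redo inserts/updates/deletes
    }

    static void recover(Storage stor, Database db) {
        Checkpoint cp = stor.lastGoodCheckpoint();
        for (Commit commit : stor.commitsAfter(cp)) {
            if (!commit.verifyChecksum())
                break;                            // stop at the first damaged commit
            db.reapply(commit);                   // rebuilds the in-memory btrees
        }
    }
}
```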

The beauty of this solution is that it doesn't require a lot of changes. The immutable persistent in-memory data structures and the code that uses them stays exactly the same. You just don't save it as often.

Of course, TANSTAAFL (there ain't no such thing as a free lunch). The cost of this approach would be higher memory usage. Once the index changes are saved to disk, you don't have to keep the in-memory versions; if you defer saving to disk, you do have to keep them. However, the memory cost is much less than the disk growth because the memory will be garbage collected. Memory is cheap and jSuneido assumes you have lots. (For small systems with limited memory, cSuneido is a better choice.)

Another benefit of saving after multiple transactions is that there can be considerable overlap between their updates. For example, if there are "hotspots" in the data that are frequently updated, you only save the affected btree nodes once per checkpoint instead of once per commit.

So the commits would simply save the data records (inserts, updates, and deletes) to the database file, with no other information. The btrees would be updated in-memory only. Then periodically you would save a checkpoint with the btree and db info changes.
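
In outline, something like this toy sketch (a list and a TreeMap standing in for the database file and the btrees; none of this is the actual jSuneido code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch of the commit/checkpoint split, not the actual jSuneido code.
// commit() appends only the data record to the (simulated) database file and
// updates the index in memory; checkpoint() writes the accumulated index
// state to the file in one batch.
public class Engine {
    private final List<String> file = new ArrayList<>();            // stand-in for the db file
    private final TreeMap<String, Integer> index = new TreeMap<>(); // stand-in for the btrees

    public void commit(String record) {
        file.add(record);                    // per commit: the data record only
        index.put(record, file.size() - 1);  // index change stays in memory
    }

    public void checkpoint() {
        file.add("CHECKPOINT " + index);     // per checkpoint: btrees + db info
    }

    public static void main(String[] args) {
        Engine e = new Engine();
        e.commit("alice");
        e.commit("bob");
        e.checkpoint();                      // one index write covers both commits
        System.out.println(e.file);          // [alice, bob, CHECKPOINT {alice=0, bob=1}]
    }
}
```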

Thinking about it, I wondered if I even needed the redirections with this approach. The point of the redirections is to save space by not path copying to the root. But periodic checkpoints may save enough space that redirections aren't as important. Eliminating them would definitely simplify the code - always a good thing!
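
For anyone unfamiliar with path copying: updating an immutable tree means copying every node from the changed leaf up to the root, sharing the untouched subtrees. A toy illustration with a binary search tree (jSuneido's btrees are wider, but the copying pattern is the same):

```java
// Toy illustration of path copying in an immutable binary search tree.
// Inserting a key allocates new nodes along the path from root to leaf;
// all untouched subtrees are shared with the old version of the tree.
public final class Node {
    final int key;
    final Node left, right;

    Node(int key, Node left, Node right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    /** Returns a new root; the old root still describes the old version. */
    static Node insert(Node node, int key) {
        if (node == null)
            return new Node(key, null, null);
        if (key < node.key)
            return new Node(node.key, insert(node.left, key), node.right);  // copy path node
        else if (key > node.key)
            return new Node(node.key, node.left, insert(node.right, key));  // copy path node
        else
            return node;  // key already present, share the whole subtree
    }
}
```

Each update creates new nodes all along the path to the root, and with per-commit persistence every one of those copies hits the disk. The redirections were there to avoid exactly that, which matters much less once checkpoints amortize the writes across many commits.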

I also realized that I could even write the checkpoints in a separate concurrent thread since they could use the persistent immutable data structures to view the state as of a certain point in time (just like transactions do) without any concurrency issues and without blocking ongoing database activity. The only trick would be coordinating writing to the database file. Even that coordination could be avoided if you wrote the checkpoints (i.e. the index btrees) to a separate file from the data. Although I like having everything in one file, it's certainly not essential.
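
Since a snapshot of an immutable structure is just a reference, the background writer could be as simple as this sketch (ExecutorService is standard Java; the rest is hypothetical, not the actual jSuneido code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of checkpointing in a background thread. Because the
// in-memory data structures are immutable and persistent, capturing a
// snapshot is just copying a reference; the writer thread sees a consistent
// point-in-time state without blocking ongoing commits.
public class Checkpointer {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private volatile Object state = "initial btree roots"; // stand-in for the immutable roots

    public void checkpoint() {
        Object snapshot = state;                   // consistent as-of-now view, no locking
        writer.submit(() -> writeToDisk(snapshot)); // commits can proceed meanwhile
    }

    private void writeToDisk(Object snapshot) {
        // coordinate writes to the database file here,
        // or avoid the coordination by writing to a separate index file
        System.out.println("checkpointed " + snapshot);
    }
}
```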

They say that creative problem solving is a lot about combining known, existing pieces into new combinations. I had all the pieces I needed to solve this problem; I just had to put them together. (Ironically, I talked about a lot of the same pieces almost two years ago in A Faster, Cleaner Database Design for Suneido.)

And so I go from thoughts of a depressing major failure, to feeling like a kid with a new Lego kit :-)
