Sunday, February 19, 2012

Immudb Progress

I've been plugging away at improvements to the immudb append-only storage engine for jSuneido.

To reduce database growth: (see Houston, we have a problem and Problem or Opportunity?)

- Persist btree index changes to disk periodically, instead of after every transaction.
- This makes redirection less important, so I'm removing it. (Funny how this great idea is suddenly more trouble than it's worth.)
- This also allows persisting btrees in a separate concurrent thread. To make this as independent as possible, I plan to split the database into two files. I have mixed feelings about this because I really like having everything in one file. With two files you have to make sure you always back up / restore / copy / move / rename the two files together, because if they get out of sync you're in trouble.

To reduce transaction conflicts: (see Suneido Database Transaction Puzzle)

- Instead of tracking reads and writes by btree node, track writes by record and reads by index range and predicate.
- This makes it harder to use the previous approach of merging btree versions. Instead I'm "overlaying" transaction local indexes over the "master" indexes, and then applying the index changes to the master index during commit. (Easy because commits are single-threaded.)
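To make the overlay idea concrete, here's a minimal sketch in Java. The class and method names are mine, not jSuneido's actual code, and it assumes a simple sorted-map index mapping keys to record addresses; deletes are recorded as null "tombstones" so they hide master entries until commit.

```java
import java.util.NavigableMap;
import java.util.Optional;
import java.util.TreeMap;

// Illustrative sketch: a transaction-local index "overlaid" on the
// shared master index. Reads check local changes first; at commit
// (single-threaded) the local changes are folded into the master.
class OverlayIndex {
    private final NavigableMap<String, Integer> master; // key -> record address
    private final TreeMap<String, Integer> local = new TreeMap<>();

    OverlayIndex(NavigableMap<String, Integer> master) {
        this.master = master;
    }

    void put(String key, int recordAddress) {
        local.put(key, recordAddress);
    }

    void remove(String key) {
        local.put(key, null); // tombstone hides the master entry
    }

    // Lookups see this transaction's own changes layered over the master.
    Optional<Integer> get(String key) {
        if (local.containsKey(key))
            return Optional.ofNullable(local.get(key));
        return Optional.ofNullable(master.get(key));
    }

    // Apply the accumulated local changes to the master index at commit.
    void applyToMaster() {
        local.forEach((key, addr) -> {
            if (addr == null)
                master.remove(key);
            else
                master.put(key, addr);
        });
        local.clear();
    }
}
```

The point is that the transaction never touches the master btree until commit, and because commits are single-threaded, applying the changes needs no extra locking.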

None of these changes is that big, but unfortunately, they are fairly invasive. Rather than "break" everything while the changes are in progress I've been taking a "strangler fig" approach, keeping the existing code intact and functional while at the same time building up a new version that leans heavily on the old one. Once I have the new version functional on its own I can remove the vestiges of the old code.

I should probably be doing these two sets of changes separately, but they keep getting mixed together in my head. I just know where I want to get to, not necessarily which problem each particular change is aimed at. The changes all affect the same code, so in some respects it's easier to do them together. Time will tell whether this was the right approach :-)

Another question is whether the changes I'm working on will actually succeed in reducing database growth sufficiently. I'm actually fairly confident about this because it's not a "yes or no" question. If you persist the indexes after every commit, you get the excessive growth I've already seen. At the opposite extreme, if you only persisted at shutdown, you'd get minimal growth. Presumably, the answer lies somewhere between these two extremes. Looking at it another way, if the same btree node was updated in every transaction but only saved once per 10 transactions, you'd get 1/10th the database growth. If 1/10th is still too big, then you only save every 100 transactions, giving 1/100th the database growth. Since the actual data is still being written to disk at each commit, the only cost of persisting the indexes less often is increased work for crash recovery, which (a) shouldn't happen that often, and (b) processing e.g. 100 commits worth of data to recover from a crash doesn't seem that bad.
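The "persist every Nth commit" idea boils down to a simple counter. A hypothetical sketch (the interval and names are illustrative, and a real policy might also trigger on elapsed time or accumulated changes):

```java
// Illustrative sketch of deferring index persistence to every Nth commit.
// The data itself is written durably at each commit; only the btree
// index changes are deferred, trading growth for crash-recovery work.
class PersistPolicy {
    static final int PERSIST_INTERVAL = 100; // assumed value, for illustration
    private int commitsSincePersist = 0;

    // Called at the end of each commit; true means "persist the indexes now".
    boolean shouldPersist() {
        if (++commitsSincePersist >= PERSIST_INTERVAL) {
            commitsSincePersist = 0;
            return true;
        }
        return false;
    }
}
```

With an interval of 100, a node updated in every transaction is written to disk once per 100 commits instead of 100 times, i.e. roughly 1/100th of the index-related growth.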

Another nice feature of deferring persisting indexes is that it offloads work from commits, which are a potential bottleneck since they're single-threaded. There are some issues with persisting indexes concurrently with commits, but I think I have an approach figured out. On the other hand, I'm now updating the master indexes during the commit, so it's not all gain.
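One way to sketch the concurrent-persist idea: hand an immutable snapshot of the indexes to a single background thread, so the commit path just enqueues and continues. This is my own illustrative sketch, not jSuneido's actual design, and it assumes the snapshot is immutable so commits can keep updating the live indexes while it's being written.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical: something that can write a frozen set of btree nodes to disk.
interface IndexSnapshot {
    void writeToDisk();
}

// Illustrative sketch of persisting indexes off the commit thread.
// Commits hand off an immutable snapshot and return immediately;
// a single worker thread does the actual disk writes in order.
class IndexPersister {
    private final ExecutorService persister = Executors.newSingleThreadExecutor();

    void persistAsync(IndexSnapshot snapshot) {
        persister.submit(snapshot::writeToDisk);
    }

    void shutdown() {
        persister.shutdown();
        try {
            persister.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Using a single worker thread keeps persists ordered, which matters if each persisted state builds on the previous one.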

So far so good. I have some of the pieces working and it's gradually coming together. I'm anxious to see how it turns out.