I ran a few tests for database growth on the new version of my immudb append-only database storage engine for jSuneido.
First, I tested with persisting indexes every update transaction. This is more or less what it did previously. Then I tested with persisting every so many update transactions.
every 1 = growth 5,882,408
every 10 = growth 1,322,712
every 100 = growth 551,080
every 1000 = growth 354,168
This is the kind of result I was counting on. Deferring persisting the indexes results in much less growth.
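The bookkeeping for "persist every N update transactions" is trivial. Here's a minimal sketch in Java of what I mean; the names are made up for illustration, not jSuneido's actual classes:

```java
// Sketch: persist indexes only every N update transactions.
// (Hypothetical names, not jSuneido's actual code.)
public class PersistPolicy {
    private final int interval; // e.g. 1000 update transactions
    private int commitsSincePersist = 0;

    public PersistPolicy(int interval) {
        this.interval = interval;
    }

    /** Called after each update transaction commits; returns true when
     *  the in-memory index changes should be persisted to disk. */
    public boolean afterCommit() {
        if (++commitsSincePersist >= interval) {
            commitsSincePersist = 0;
            return true;
        }
        return false;
    }
}
```

With an interval of 1, this degenerates to the old behavior of persisting after every update transaction.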
Coincidentally, the improvement shown above is about 17 times, not far from the 20 times difference between cSuneido and the first version of immudb. This is just coincidence since I'm running different tests, but it's very nice to see that the improvement may be enough to bring immudb into the same ballpark as cSuneido, which is pretty good considering cSuneido is a mutable database.
Is persisting every 1000 update transactions reasonable? I think so. These tests only take about 5 seconds to run. Remember, the actual data is being written right away, log style, so persisting less often does not increase the risk of losing data. It only increases the amount of work to be done by crash recovery. Making crash recovery re-process 5 seconds (or more) of transactions seems fine. Especially considering that cSuneido rebuilds the entire indexes after a crash, which could take a long time for a big database.
These numbers are from running the stdlib (standard library) tests. This isn't a very typical workload, so the numbers aren't going to be very accurate. But I'm just looking for some very rough feedback at this point. If I'd seen little or no improvement, I'd be seriously depressed right now and considering taking up golf instead :-)
Ideally, I need a good statistical model of a typical workload, or at least a representative one, since there's probably no such thing as typical. Another project for my "spare" time!
Hmmm... thinking about what the stdlib tests do versus what a typical workload might be, I realized that one atypical thing the stdlib tests do is to create, use, and then drop temporary database tables. But in that case, if you defer persisting till after the table is dropped, then you never need to save the indexes for that table at all. (Although the data will still be written immediately, log style.)
It turned out I wasn't handling this, which actually was good because it would have seriously skewed the above results. It was easy enough to fix.
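Roughly, the fix amounts to discarding pending index changes for any table that was dropped before the next persist. A hedged sketch of the idea, again with invented names:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: with deferred persists, index changes for a table that is
// created and dropped between persists can simply be discarded.
// (Hypothetical structure, not jSuneido's actual code.)
public class DeferredIndexes {
    private final Map<String, String> pendingIndexChanges = new HashMap<>();
    private final Set<String> droppedTables = new HashSet<>();

    public void recordIndexChange(String table, String change) {
        pendingIndexChanges.put(table, change);
    }

    public void dropTable(String table) {
        droppedTables.add(table);
    }

    /** Returns the tables whose indexes actually need to be written;
     *  dropped tables never reach the index file at all. */
    public Set<String> persist() {
        pendingIndexChanges.keySet().removeAll(droppedTables);
        Set<String> persisted = new HashSet<>(pendingIndexChanges.keySet());
        pendingIndexChanges.clear();
        droppedTables.clear();
        return persisted;
    }
}
```

Note that the table's data records are still written immediately, log style; it's only the index information that never needs to be saved.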
Rerunning the tests, I got the results I expected - much the same growth when persisting every transaction, less growth than before as the persist interval increased. Persisting every 1000 transactions resulted in a growth of only 62,424 - roughly 5 times better.
This is nice, but not really relevant to production, since a typical workload does not include creating and dropping a lot of temporary tables. It will be nice for development because running the tests won't cause the database to grow so much. (At least until I implement background compaction.)
Persisting every so many update transactions is not necessarily the best approach; I'm using it because it's the easiest. Alternatively, you could persist every so many seconds or minutes, or try to persist during "idle" times. Currently, I'm persisting synchronously, i.e. while holding the same lock that commits use, so update transactions can't commit till the persist finishes. (But read-only transactions are not affected, and even update transactions can still read and write; they just have to wait to commit.) Depending on how much index information has to be persisted, this could introduce delays in server response. To alleviate this, you could persist in a background thread. Because of the immutability, this should be relatively straightforward. The exception is updating the in-memory index information to show it has been saved - that would require some synchronization.
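Since the index trees are immutable, a background persist could work from a snapshot while commits continue, taking the commit lock only briefly to mark the snapshot as saved. A rough sketch of that shape (hypothetical names, nothing like the actual engine code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: persist a snapshot of the immutable indexes in a background
// thread so commits only block momentarily. (Hypothetical names.)
public class BackgroundPersist {
    private final Object commitLock = new Object();
    private final ExecutorService persister = Executors.newSingleThreadExecutor();
    private volatile Object currentIndexes = "indexes@0";
    private volatile Object lastPersisted = null;

    public void commit(Object newIndexes) {
        synchronized (commitLock) { // commits still serialize, but briefly
            currentIndexes = newIndexes;
        }
    }

    public void persistAsync() {
        final Object snapshot = currentIndexes; // immutable, so safe to share
        persister.submit(() -> {
            // ... write the snapshot to the index file (omitted) ...
            synchronized (commitLock) { // only marking it saved needs the lock
                lastPersisted = snapshot;
            }
        });
    }

    /** Waits for pending persists to finish and returns the last one. */
    public Object lastPersisted() {
        persister.shutdown();
        try {
            persister.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lastPersisted;
    }
}
```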
Another issue I haven't dealt with is that I may need to "flush" some of the index information out of memory. Once it's been stored, a btree node in memory consists mainly of a ByteBuffer that points to part of the memory mapped index file. Since the memory mapped space is "virtual", it will get evicted if the OS runs short on memory. But for a really big database, even the small objects referencing the virtual memory may add up to too much memory usage.
It would be easy enough for the persist process to discard the btree nodes once they have been saved. But that would affect performance, since they would have to be recreated on demand. You'd want to evict selectively, presumably on some kind of LRU (least recently used) or NFU (not frequently used) basis, which would require tracking some kind of usage data. Presumably, you'd also only want to do this when necessary, i.e. when memory is getting full. Hmmm... perhaps one approach would be to use WeakReferences and let Java worry about when and what to evict. Anyway, that's a challenge for another day.
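The WeakReference idea might look something like this sketch - hold saved nodes only weakly, let the garbage collector evict them under memory pressure, and reread them from the mapped index file on demand. (Invented names again, and just one of the options, not a decision.)

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Sketch: once a btree node is saved, keep only a WeakReference to it,
// so the GC can evict it when memory is short. (Hypothetical names.)
public class NodeCache {
    private final Map<Long, WeakReference<Object>> saved = new HashMap<>();

    /** Called by the persist process once a node has been written. */
    public void savedAt(long fileOffset, Object node) {
        saved.put(fileOffset, new WeakReference<>(node));
    }

    /** Returns the cached node, rereading it if the GC has evicted it. */
    public Object get(long fileOffset) {
        WeakReference<Object> ref = saved.get(fileOffset);
        Object node = (ref == null) ? null : ref.get();
        if (node == null) {
            node = readNodeFromFile(fileOffset); // recreate on demand
            saved.put(fileOffset, new WeakReference<>(node));
        }
        return node;
    }

    private Object readNodeFromFile(long fileOffset) {
        return "node@" + fileOffset; // stand-in for reading the mapped file
    }
}
```

The trade-off is that you give up control over eviction order - the GC knows nothing about LRU or NFU - in exchange for not having to track usage data yourself.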
And now back to testing :-)