Wednesday, January 25, 2012

On-line Database Compaction

One of the weaknesses (at least in my mind) of Suneido's database is that it requires off-line compaction, i.e. you have to periodically shut down the server and run a standalone compaction process.

Practically speaking, that hasn't been a big issue. With the mutable storage engine, database growth is relatively slow and you don't need to compact that often. Few of our customers run 24 x 7 so it's not a problem to shut down at night occasionally. Also, some of the updates we deploy have to be run single user anyway.

But with the immutable storage engine, even with my improved design, the database will grow faster, and require more frequent compaction.

It would sure be nice if you could compact on-line, in the background, while the server was running.

Thinking about my planned improvements, I realized this might now be relatively easy. (Actually, I think I could have used a similar approach before, it just became more obvious now.)

The basic idea is to use a read-only transaction, and its snapshot of the database as of a certain point in time, to compact the database up to that point. (Just like the current off-line compaction.)
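Roughly, the compaction phase might look something like this. The names (Database, ReadTransaction, Table, Record, and so on) are made up for illustration, not Suneido's actual classes.

```java
// A rough sketch of snapshot-based compaction, with made-up names
// (Database, ReadTransaction, Table, Record) standing in for the real storage engine API.
class Compact {
    // Copy the live data visible to a read-only snapshot transaction
    // into a freshly created database file, rebuilding indexes as it goes.
    static long compactTo(Database src, Database dst) {
        ReadTransaction t = src.readTransaction(); // snapshot as of "now"
        try {
            for (Table table : t.tables()) {
                dst.createTable(table.schema());
                for (Record rec : t.records(table))
                    dst.addRecord(table.name(), rec);
            }
            return t.asof(); // the point the catch-up phase resumes from
        } finally {
            t.complete();
        }
    }
}
```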

But since the database is active, there will probably be activity after that point. So then you reprocess any updates committed after your transaction and apply them to the new database. (This is similar to how crash recovery would reprocess updates that happened after the last good checkpoint.) When you "catch up", you switch over to the new database (with some synchronization required).
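The catch-up and switch-over might look something like the sketch below, again with made-up names; being able to get and replay the commits made after a given point is an assumption, not something that exists today.

```java
import java.util.List;

// A sketch of the catch-up and switch-over phase. Database, Commit, and Server
// are hypothetical stand-ins for whatever the real storage engine provides.
class CatchUp {
    // 'asof' is the commit the snapshot compaction copied up to.
    static void catchUpAndSwitch(Server server, Database src, Database dst, long asof) {
        // replay commits made while we were compacting, until we're caught up
        while (true) {
            List<Commit> updates = src.commitsAfter(asof);
            if (updates.isEmpty())
                break;
            for (Commit c : updates) {
                c.applyTo(dst);           // redo the update against the new database
                asof = c.commitNumber();
            }
        }
        // switch over: briefly block new commits, replay any stragglers,
        // then make the new database the live one
        synchronized (server.commitLock()) {
            for (Commit c : src.commitsAfter(asof)) {
                c.applyTo(dst);
                asof = c.commitNumber();
            }
            server.useDatabase(dst);
        }
    }
}
```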

There is a potential problem if the server is so busy that the compaction never catches up. In practice, the compaction could be scheduled at a slow time of the day, and most of our systems aren't continuously busy. In any case, this wouldn't "hurt" anything, other than performance. You could easily detect this situation and abort the compaction.
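For example, the catch-up loop from the previous sketch could track the backlog from pass to pass and give up if it stops shrinking (same made-up names as before):

```java
// One possible "can't catch up" check: if a full pass leaves the backlog of
// commits no smaller than before, the server is committing faster than we can
// replay, so abandon this compaction attempt.
static boolean caughtUp(Database src, Database dst, long asof) {
    int prevBacklog = Integer.MAX_VALUE;
    while (true) {
        List<Commit> updates = src.commitsAfter(asof);
        if (updates.isEmpty())
            return true;                  // caught up
        if (updates.size() >= prevBacklog)
            return false;                 // losing ground - abort the compaction
        prevBacklog = updates.size();
        for (Commit c : updates) {
            c.applyTo(dst);
            asof = c.commitNumber();
        }
    }
}
```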

Although the server can continue to operate during the compaction, performance may be affected. (As it can be with our current on-line backups.) CPU time is usually not the problem these days. A bigger issue is that reading all the data can cause other working set data to be evicted from memory. One option may be to not use virtual memory to read the data. Then you're only competing for disk cache and not virtual memory space. What we've found with our on-line backups is that as long as you have lots of memory it's not a problem.
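For example, in Java the scan could use plain positioned reads instead of memory-mapping the file, so it only goes through the OS disk cache. This is just an illustration; the file name and buffer size are arbitrary, and the record processing is left out.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Scan a database file with positioned reads rather than FileChannel.map,
// so the scan doesn't compete for the process's mapped (virtual memory) space.
public class ScanWithoutMapping {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("suneido.db");                // hypothetical file name
        ByteBuffer buf = ByteBuffer.allocate(1024 * 1024);  // reusable 1mb buffer
        try (FileChannel chan = FileChannel.open(path, StandardOpenOption.READ)) {
            long pos = 0;
            while (pos < chan.size()) {
                buf.clear();
                int n = chan.read(buf, pos); // positioned read, no mmap
                if (n < 0)
                    break;
                buf.flip();
                // ... process the records in buf ...
                pos += n;
            }
        }
    }
}
```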

One benefit of building a new compacted database file (rather than compact in place) is that it is inherently crash proof, at least in the sense that if the server crashes, the current database won't be affected. The new database being created can simply be discarded.

In garbage collector terms this is equivalent to a "copying collector", i.e. it just copies the live data, rather than scanning all the data (live and dead) as, for example, a "mark-sweep" collector does.

Suneido does support on-line backups, taking advantage of snapshot isolation to make a consistent backup as of a certain point in time. With the append-only immudb storage engine, another option would be to simply copy the bytes of the database file up to a certain point, without interpreting the structure at all. If the data and indexes are in the same file, this would copy both. If they were in separate files, you'd have the option of only copying the data (like the current on-line backup). However, this would still be somewhat larger than the current backup because it would include deleted records. On the other hand, it should be faster since it's just a bulk copy.
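The bulk copy itself could be as simple as the sketch below. The file names are made up, and it assumes a commit only becomes visible once it has been completely appended, so the file size captured at the start marks a consistent prefix.

```java
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Copy the bytes of an append-only database file up to its current size,
// without interpreting the structure at all.
public class BulkBackup {
    public static void main(String[] args) throws Exception {
        try (FileChannel src = FileChannel.open(Paths.get("suneido.db"),
                        StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(Paths.get("suneido.db.backup"),
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long limit = src.size(); // the "certain point" - the size right now
            long pos = 0;
            while (pos < limit)
                pos += src.transferTo(pos, limit - pos, dst); // bulk copy
        }
    }
}
```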

Note: These ideas, like my ideas for limiting database growth, are just ideas. I'm sure there are unseen complexities waiting to trip me up.
