Sunday, January 08, 2012

Suneido Database Transaction Puzzle

When we converted the first large customer to jSuneido we ran into a lot of transaction conflicts. A lot more than we had with cSuneido. The question was why? (Note: jSuneido is working well, without many conflicts, both in-house and at a smaller client - this problem only showed up under heavy use.)

[Disclaimer - this post is largely to clarify and review my own thinking. My apologies if it's somewhat cryptic.]

A lot of the conflicts could be traced back to application issues that we could fix, e.g. having an alert inside a transaction (so the transaction was kept open too long) or not retrying.

But there were still too many conflicts. I ended up adding an option for snapshot isolation rather than serializable snapshot isolation. That eliminated most of the conflicts, but it opens up the potential for inconsistencies and I'm not really happy with it.

My initial assumption was that jSuneido was more "strict" than cSuneido and that was why there were more conflicts. jSuneido implements the algorithm from Serializable Isolation for Snapshot Databases by Cahill, Rohm, and Fekete [CRF], whereas cSuneido does "read validation" i.e. when a transaction commits it verifies that all of its reads are still valid.

But when I thought about the two approaches, I realized that cSuneido's read validation is actually more restrictive than jSuneido's CRF algorithm. Read validation requires transactions to serialize as of their commit time, whereas CRF allows re-ordering (as long as there are no dependency cycles that would prevent it). Otherwise, they are basically doing the same thing.

So I was back to trying to figure out why jSuneido was getting more conflicts.

The CRF approach does admit some "false positives" i.e. spurious conflicts that are not really conflicts. Because it's an "incremental" approach, the conflict flags can get set by transactions that don't end up overlapping or that later abort. As the rate of conflicts rises, this effect will get worse. It's possible this accounts for some of the jSuneido conflicts but I don't think it's the main reason. (Note: cSuneido's read validation only looks at committed transactions so it doesn't degrade in this fashion.)

One major difference is that jSuneido tracks reads and writes by B-tree index node rather than by individual record. This handles phantoms and simplifies merging transaction updates, but it can lead to more conflicts. For example, if a table has only a few rows, then all updates will be in the same B-tree node and overlapping transactions will always conflict. Similarly, appending keys at the end of an index is likely to conflict due to updating the same node.
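To make that concrete, here's a minimal sketch (plain Java, not jSuneido's actual classes) of why node-level tracking produces conflicts that record-level tracking would not. The nodeFor mapping is a stand-in; the point is just that with only a few rows every key lands in the same node:

```java
// Minimal sketch: with a small table, every update maps to the same B-tree
// node, so node-level tracking flags a conflict even though the two
// transactions touched different records.
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class NodeVsRecordTracking {
    // hypothetical stand-in: map a record key to the B-tree node that holds it;
    // with only a handful of rows there is a single node, so everything maps to 0
    static int nodeFor(String key) {
        return 0;
    }

    public static void main(String[] args) {
        // two concurrent transactions updating *different* records
        Set<Integer> t1Nodes = new HashSet<>();
        Set<Integer> t2Nodes = new HashSet<>();
        Set<String> t1Records = new HashSet<>();
        Set<String> t2Records = new HashSet<>();

        t1Nodes.add(nodeFor("alpha"));   t1Records.add("alpha");
        t2Nodes.add(nodeFor("omega"));   t2Records.add("omega");

        // node-level tracking: both touched node 0 => spurious conflict
        boolean nodeConflict = !Collections.disjoint(t1Nodes, t2Nodes);
        // record-level tracking: different records => no conflict
        boolean recordConflict = !Collections.disjoint(t1Records, t2Records);

        System.out.println("node-level conflict:   " + nodeConflict);   // true
        System.out.println("record-level conflict: " + recordConflict); // false
    }
}
```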

cSuneido handles phantoms by tracking the index ranges that a transaction reads. Read validation then checks whether any concurrent writes from other transactions fell within those ranges.
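Roughly, that validation looks like the following sketch (my own simplification in Java; cSuneido is C++ and its real structures are different). Each transaction keeps the index ranges it read, and at commit those ranges are checked against the keys written by transactions that committed while it was active:

```java
// Sketch of range-based read validation: record the index ranges read,
// then at commit reject the transaction if any concurrent committed write
// falls inside one of those ranges.
import java.util.List;

class ReadRange {
    final String index;  // e.g. "customers:name"
    final String lo;     // inclusive low key
    final String hi;     // inclusive high key

    ReadRange(String index, String lo, String hi) {
        this.index = index;
        this.lo = lo;
        this.hi = hi;
    }

    boolean contains(String index, String key) {
        return this.index.equals(index)
                && lo.compareTo(key) <= 0
                && key.compareTo(hi) <= 0;
    }
}

class CommittedWrite {
    final String index;
    final String key;

    CommittedWrite(String index, String key) {
        this.index = index;
        this.key = key;
    }
}

class ReadValidation {
    // true if the transaction's reads are still valid, i.e. no concurrent
    // committed write fell inside any range it read
    static boolean validateReads(List<ReadRange> reads, List<CommittedWrite> concurrentWrites) {
        for (CommittedWrite w : concurrentWrites)
            for (ReadRange r : reads)
                if (r.contains(w.index, w.key))
                    return false; // conflict: this transaction must abort and retry
        return true;
    }
}
```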

My current thinking is that tracking reads and writes by B-tree node is what accounts for jSuneido getting more conflicts, although I don't have an easy way to "prove" this. If I'm correct, then the solution is to track index ranges rather than B-tree nodes. But, that's not so easy with CRF since it uses read locks on items - harder to do with ranges. And you have to deal with concurrent changes to B-tree nodes, which would require a complex tree merge.

So I think I'll probably go back to read validation instead of the CRF approach. It does mean more work during commits, and since commits are single-threaded, that will reduce the concurrency of commits. On the other hand, CRF requires access to shared locks, which reduces the concurrency of reads and writes. And if commits did turn out to be a bottleneck, I think it would be relatively easy to multi-thread the commit process, e.g. using one task per index. (Since indexes are independent, this should parallelize well.)
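As a rough illustration of the one-task-per-index idea, here's a sketch using a plain thread pool (the types and the way the work is partitioned are hypothetical, just to show the shape of it):

```java
// Sketch of parallelizing commit-time validation by index: each task checks
// (or applies) the updates for one index; since indexes are independent,
// the tasks don't share mutable state.
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCommit {
    static boolean validate(List<Callable<Boolean>> perIndexChecks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Boolean>> results = pool.invokeAll(perIndexChecks);
            for (Future<Boolean> f : results)
                if (!f.get())
                    return false; // some index failed validation => abort the commit
            return true;
        } finally {
            pool.shutdown();
        }
    }
}
```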

But it still leaves the question of how to handle concurrent B-tree updates if I'm no longer "locking" nodes. One approach I've been considering is to delay updating the actual "master" B-tree indexes until a transaction commits. Then the updates are single-threaded and there are no concurrent updates to deal with. Again, this puts more work into the commit, but I think this can be addressed as above.

(cSuneido handles this issue by keeping multiple versions of keys in the B-tree nodes. But this is tricky and doesn't fit well with the append-only approach that I'm concentrating on.)

However, transactions still need to "see" their own updates, so you'd have to have some kind of additional per-transaction indexes that you would transparently "merge" with the master ones. These could be B-tree indexes, but since they are in-memory, there are probably better data structures. One possibility is a "merge tree". I called this a "fractal tree" in a couple of blog posts (here and here) but the simple structure I'm talking about is not really the same as Tokutek's fractal tree. Since those blog posts I wrote an improved version (MergeTree.java) which I'm now using for temporary indexes required by queries.

A merge tree will be fast for inserts (faster than a B-tree) but slower for lookups. Accumulating updates in a per-transaction merge tree and then applying them to the master B-tree during commit will also have the side benefit of applying the updates to the B-tree in sorted order, which is optimal for B-trees.
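Something like the following sketch, using ordinary TreeMaps as stand-ins for both the master B-tree and the per-transaction merge tree (so not the real MergeTree or index classes), shows the overlay idea and the sorted apply at commit:

```java
// Sketch of a per-transaction overlay index: the transaction writes only its
// own in-memory updates, reads consult those updates first and then the
// master index, and commit applies the updates to the master in key order.
import java.util.Map;
import java.util.TreeMap;

class OverlayIndex {
    final TreeMap<String, String> master;                       // stand-in for the master B-tree
    final TreeMap<String, String> txUpdates = new TreeMap<>();  // per-transaction overlay

    OverlayIndex(TreeMap<String, String> master) {
        this.master = master;
    }

    void put(String key, String value) {
        txUpdates.put(key, value);  // the transaction never touches the master directly
    }

    String get(String key) {
        // the transaction "sees" its own updates first, then the master index
        String v = txUpdates.get(key);
        return v != null ? v : master.get(key);
    }

    void commit() {
        // single-threaded apply; the overlay iterates in key order, so the
        // master B-tree receives the updates in sorted (optimal) order
        for (Map.Entry<String, String> e : txUpdates.entrySet())
            master.put(e.getKey(), e.getValue());
    }
}
```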

Thinking about this, I have an idea for another potentially large optimization. Serializability really only cares about the records that the application sees. This is not the same as the records that the low level query execution reads. For example, if you have an index on fields a,b,c and you query on b, the low level code will scan the entire index, perhaps only returning a few records (possibly none). If you just track the index range that was read by the low level code, then it will appear that all the records were read. This will mean a much greater chance of conflicts than if you tracked the records that the application actually received.

However, I don't want to mix levels and have the query execution code worry about transaction read tracking. I think what I can do is have the query code pass a "black box" predicate to the low level database code. Then the transaction read validation can use the predicate when validating reads. i.e. writes by other transactions that do not match the predicate cannot have been seen and can be ignored.
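A sketch of that (simplified to leave out the range check, and with a hypothetical Record type standing in for real rows) might look like this, using java.util.function.Predicate as the "black box":

```java
// Sketch of predicate-based read validation: the query layer hands the
// storage layer a predicate describing what the application could actually
// have seen, and commit-time validation ignores concurrent writes that the
// predicate rejects.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class PredicateValidation {
    // hypothetical stand-in for a database record
    static class Record {
        final Map<String, Object> fields = new HashMap<>();
        Object get(String field) { return fields.get(field); }
    }

    // each tracked read carries the predicate the query applied
    static class TrackedRead {
        final Predicate<Record> predicate;
        TrackedRead(Predicate<Record> predicate) { this.predicate = predicate; }
    }

    static boolean validate(List<TrackedRead> reads, List<Record> concurrentWrites) {
        for (Record w : concurrentWrites)
            for (TrackedRead r : reads)
                if (r.predicate.test(w))
                    return false; // this write could have changed what we saw
        return true; // writes that don't match any predicate can't have been seen
    }
}
```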

One remaining issue is "bulk" updates since none of these approaches are great for updating a huge number of records. It may be reasonable to allow the application code to "lock" a table. This would prevent concurrent updates, and would allow updating the master B-trees directly, without accumulating updates in memory.
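Sketching that as an application-level lock per table (the lockTable/unlockTable names here are hypothetical, not an existing Suneido API):

```java
// Sketch of a per-table lock for bulk updates: while a transaction holds the
// lock there are no concurrent updates to that table, so it could write the
// master B-trees directly instead of accumulating updates in memory.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class TableLocks {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    void lockTable(String table) {
        locks.computeIfAbsent(table, t -> new ReentrantLock()).lock();
    }

    void unlockTable(String table) {
        ReentrantLock lock = locks.get(table);
        if (lock != null && lock.isHeldByCurrentThread())
            lock.unlock();
    }
}
```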

The problem with all of this is that it relies on an embarrassing amount of guesswork and seat-of-the-pants intuition. (Based on a fair bit of experience, but even experienced intuition is probably wrong as often as it's right.) Ideally, you'd test out these ideas before jumping into implementing them. But it's just not feasible. Any one of these issues would make a good research project. And even then, you'd only have analyzed a single issue - not how it relates to all the other issues.

The solution space is just too large. Which, of course, is also what makes it fun. It would be pretty boring if it was simple enough to analyze exhaustively, with no questions about the best approach. The chances of me stumbling on a solution that is better than what's already out there are slim, but at least the chance exists.

The timing actually isn't too bad. I'm just finishing up debugging the append-only storage engine. (So far, the performance looks promising, perhaps as much as twice as fast.) Once I finish that I can take a look at these transaction ideas and see what I can do.
