Saturday, September 02, 2023

A gSuneido Database Issue

We've been gradually converting our customers' servers from jSuneido to gSuneido. Recently we converted a larger customer and they started to get "too many outstanding transactions" errors. Unfortunately, we'd changed multiple things at once with this customer, so it wasn't immediately clear whether this issue was because of the switch to gSuneido, but I suspected it was.

I had set the limit in gSuneido to the same value as jSuneido - 200 outstanding transactions. "Outstanding" in this case means either still in progress, or completed but overlapping with one that is still in progress.
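To make that definition concrete, here is a minimal sketch of counting outstanding transactions - the type and function names are mine for illustration, not gSuneido's actual code:

```go
package main

import "fmt"

// tran is a stand-in for a transaction's lifetime,
// using logical timestamps; end == 0 means still in progress.
type tran struct {
	start, end int
}

// outstanding counts in-progress transactions, plus completed
// transactions that overlap the oldest in-progress one.
func outstanding(trans []tran) int {
	n := 0
	oldest := -1 // start time of the oldest in-progress transaction
	for _, t := range trans {
		if t.end == 0 {
			n++
			if oldest == -1 || t.start < oldest {
				oldest = t.start
			}
		}
	}
	if oldest != -1 {
		for _, t := range trans {
			if t.end != 0 && t.end > oldest {
				n++ // completed, but overlaps an in-progress transaction
			}
		}
	}
	return n
}

func main() {
	trans := []tran{
		{start: 0, end: 1}, // completed before any in-progress one began
		{start: 2, end: 0}, // in progress (oldest)
		{start: 3, end: 4}, // completed, but overlaps the in-progress one
		{start: 6, end: 0}, // in progress
	}
	fmt.Println(outstanding(trans)) // 3
}
```

The point is that fast, already-committed transactions still count against the limit for as long as a slow transaction they overlap remains open.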

It seemed a little counterintuitive. gSuneido is faster than jSuneido, so why should it run into this more? My theory is that because it's faster, more transactions get processed, and if some of them are slow, more of the fast ones will overlap them - so the outstanding count climbs even though most transactions complete quickly.

One of my first thoughts was to look at optimizing the gSuneido code. But if my theory was right, this could actually make the problem worse rather than better.

We temporarily solved the problem by adding slight delays to code that was using a lot of transactions, i.e. we throttled or rate-limited them. This worked, but it seemed a little ugly to slow things down all the time just to avoid a problem that only occurred occasionally.

That led me to add throttling to gSuneido, but only when the number of outstanding transactions got high (e.g. 100). Throttling would prevent hitting the limit (200). I wrote a stress test, and this did solve the problem. And it was better than adding delays in the application code. But the problem was that once it throttled, it got really slow. If you throttled enough to prevent hitting the limit, e.g. to 10 transactions per second, then 1000 transactions would go from maybe 0.1 seconds to 100 seconds - ouch! And because throttling slowed things down so much, it increased the chances of transactions hitting the 20 second time limit. The cure seemed worse than the disease; it won the battle but lost the war. (Sorry for the proverbs.)
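The throttling idea, and the arithmetic that made it unattractive, can be sketched like this - the constant names and numbers are illustrative, not gSuneido's actual code:

```go
package main

import (
	"fmt"
	"time"
)

const (
	softLimit = 100                    // start throttling here
	hardLimit = 200                    // the "too many outstanding" limit
	delay     = 100 * time.Millisecond // ~10 transactions/sec when throttled
)

// maybeThrottle delays the start of a new transaction once the
// outstanding count passes the soft limit, so it never reaches
// the hard limit.
func maybeThrottle(outstanding int) {
	if outstanding > softLimit {
		time.Sleep(delay)
	}
}

func main() {
	// The cost: 1000 transactions at the throttled rate of
	// 10/sec take 100 seconds, versus perhaps 0.1 seconds
	// unthrottled - a ~1000x slowdown.
	fmt.Println(1000/10, "seconds when throttled")
}
```

That slowdown is what pushed transactions toward the 20 second time limit and made the cure worse than the disease.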

I started to wonder what the right limit was. 200 was an arbitrary value. The reason to have a limit was to prevent it from "falling over" under heavy load. But at what point would that happen? So I removed the limit and tested under different loads.

threads   time     tran/sec   conflicts   max trans
      1    2 sec      5000         100          70
      2    2 sec     10000         150         110
      3    3 sec     10000         200         150
      4    4 sec     10000         200         180
      5    4 sec     12500         200         200
      6    5 sec     12000         300         230
      8    8 sec     10000         300         270
     10   10 sec     10000         300         300
     16   20 sec      8000         400         450
     20   30 sec      7000         500         490
     30   50 sec      7000         600         580

I was quite happy with the results. Performance did drop off with more than 5 or 6 threads, but not that badly. The maximum outstanding got well above the old limit of 200, but that didn't seem to be a major problem. The number of transactions aborting due to conflicts increased with load, but that's expected with that much concurrent activity. To some extent it seemed to be self-limiting, i.e. the load itself slowed things down and effectively did its own throttling.

Of course, this is completely dependent on what my test is doing. Is it "typical"? No, probably not. For starters, it's doing continuous activity on multiple transactions as fast as it can, with no other work. Real application code is not going to be doing that. Our applications never do 10,000 transactions per second for long periods. The test does produce a mix of shorter and longer transactions, but the longest are only around 1 second, whereas in production we do see transactions taking much longer and sometimes hitting the 20 second limit (usually as a result of external factors). And my workstation only has 8 cores / 16 threads, so 30 test threads aren't truly running in parallel.
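For a sense of the shape of such a test, here is a simplified harness - the real stress test exercises actual database transactions, whereas here an atomic counter stands in for the work:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stress runs `threads` worker goroutines, each performing
// `perThread` "transactions" as fast as it can, and returns
// the total completed.
func stress(threads, perThread int) int64 {
	var done int64
	var wg sync.WaitGroup
	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perThread; j++ {
				// begin transaction, do reads/writes, commit...
				atomic.AddInt64(&done, 1)
			}
		}()
	}
	wg.Wait()
	return done
}

func main() {
	fmt.Println(stress(10, 1000)) // 10000
}
```

Timing runs of this form at different thread counts is what produced the table above; it is a worst case by design, hammering the database with no application work in between.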

Out of curiosity, I removed the limit from jSuneido to see how it would perform. 10 threads took 15 seconds, or about 50% slower. That wasn't bad. But roughly half the transactions (5000) failed due to conflicts. Ouch! That wasn't so good. Even with only one thread, roughly 10% (1000) conflicted, as opposed to 1% (100) with gSuneido. The usual method of dealing with conflicts is to retry, but that would just increase the load and make things worse.
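The usual retry pattern looks something like the sketch below (illustrative, not Suneido's code). Each retry re-runs the whole transaction, which is why a high conflict rate feeds on itself: at 50% conflicts, half the work is repeated, and under contention the retries themselves conflict.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("transaction conflict")

// withRetry runs txn, retrying up to `attempts` times if it
// fails with a conflict. Every retry adds a full transaction's
// worth of load, worsening the contention that caused the
// conflict in the first place.
func withRetry(attempts int, txn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = txn(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	tries := 0
	err := withRetry(3, func() error {
		tries++
		if tries < 3 {
			return errConflict // simulate two conflicts, then success
		}
		return nil
	})
	fmt.Println(tries, err) // 3 <nil>
}
```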

So what is the correct limit for gSuneido? Or should there be no limit? Should it ever throttle, and if so, at what point? Honestly, I still don't know the answers to those questions. But it does appear the limit could be considerably higher e.g. 500 instead of 200, which would likely eliminate the problem, other than rare extreme cases.
