Friday, January 12, 2024

Fuzzing - the gift that keeps on giving

It's been almost a year since I started fuzz testing Suneido database queries and it's still finding bugs. I assumed it would find bugs at first, and then peter out. But we set them up to run for an hour every night to catch any regressions and that has turned out to be very worthwhile. Of course, I'm sure I've fixed the last bug now :-)

I have a bunch of Go tests for the query code, and they have decent coverage. But they barely scratch the surface in terms of combinations of query operations. Passing these tests gives me little confidence that changes are correct.

We also have lots of application tests that do a lot of queries. But all the queries are written by humans and follow similar styles and approaches. The bugs found by fuzzing were not detected by the application tests. That's probably partly because I used those application tests to initially develop and debug the gSuneido query implementation. So those types of queries have been well tested and debugged.

It might seem obvious, but one thing the fuzz testing has hammered home is that making something work for "normal" cases can be quite different from making it work for every bizarre unreasonable case. I look at the failing queries from fuzzing and I think "no one would ever write a query like that". And yet, they are bugs nonetheless.

In most cases when the nightly fuzzing would find an error (a difference between the results from gSuneido and jSuneido) I could easily reproduce it. But occasionally it would depend on the randomly generated data. (e.g. that a certain value did or didn't exist) That could make it hard to reproduce the error to debug it. We tried saving the data from each run, but that was awkward. We ended up saving the seed for generating the random data. We could include that with the results and use it to recreate the data.

The data dependency made me realize that we weren't testing many sets of data, just one per night. Rather than generate more sets of data (slow relative to running queries) I changed the fuzzing to try several variations of the constant values in each query. That had a similar effect to trying different sets of data.

It wasn't entirely surprising, but it was interesting that the fuzzing didn't find any errors in the jSuneido query implementation. The fuzzing only detects differences between the implementations, it didn't tell me which was right and which was wrong. Several times I convinced myself that it was jSuneido that was wrong, but it always turned out to be gSuneido. The gSuneido query implementation started out very similar to jSuneido. But over the years it got refactored and improved and optimized. gSuneido's query optimization is now quite a bit more sophisticated, which means more complex, which means more bugs unfortunately. jSuneido has also been around longer so maybe that helped work out the bugs. It's also much more stable - I've barely touched the code for several years whereas the gSuneido code has been changing constantly.

Another aspect that was surprising was how "rare" some of these failing queries were. We are running about a million queries per night. I would have thought that would explore the possibilities fairly well. But some of the bugs were only caught once after weeks of running. In hindsight that shouldn't be surprising - there are an infinite number of possible queries, and when you add in the data dependencies, the failing cases are even more rare.

Like other fuzz testing systems, I accumulate the failing queries in a "corpus" and run them every time along with new random queries. Quite often these are the ones that fail again. They obviously are "tricky" queries as far as my query implementation goes.

One concern I had was whether my ad hoc random query generation would generate "all" the possible queries. There could be bugs that could not be "reached" by the possible range of queries I was generating. Obviously, I don't know if there are bugs it's missing. But judging by the bugs it has found, it seems to have quite good coverage. In theory it should probably be based on the query grammar (the reverse of parsing). I ended up using a random sequence of additions and substitutions. One of the challenges is to ensure that the query is still valid e.g. that the columns referenced actually exist in the sub-query.

When I get a failure, the first thing I try to do is simplify the query. Most of the time there are parts of the query that can be removed and it will still fail. It's a lot easier to debug if e.g. there is only one join in the query. This is a fairly mechanical process and could probably be automated. (More easily from the parse tree than from the query text.) But it only takes a few minutes to do manually so it's not worth the effort to automate. (As long as there aren't too many bugs!)

One issue with my current approach is that it depends on using jSuneido for comparison. But we have replaced jSuneido with gSuneido and we're not using it any more. It's also awkward to have to run two processes and compare the results. I decided to write a "simplest possible query implementation" within gSuneido. I procrastinated on this for a while because I thought it would be a lot of work. But it only took me a few hours. I have to remember that the complexity is not inherent in the operations, it's almost all from the optimization. It still uses the same parser, but that seems ok because that's not where the problems are. It doesn't use any of the normal query optimization or execution. And in a way it's better than using jSuneido since it's not at all the same approach. And it's easier to be sure the simple implementation is correct. The alternate implementation is quite slow and isn't suitable for large results but it's sufficient for the fuzzing. And to get around the slower speed, with it all being within one gSuneido process, it's easy to run multiple threads.

Friday, September 22, 2023

Poisoning Workers

As in worker threads, not people.

gSuneido uses an unbounded thread pool to handle client requests to the server. Most Go thread pools are intended to limit the number of goroutines e.g. to match the number of hardware threads. But in this case I want all the requests to execute in parallel, even if it wasn't the most efficient, so that all the requests made progress. Normally in Go you don't need an unbounded thread pool, goroutines are usually fast enough that you can just start a goroutine for each request. However, if the goroutine will need to allocate memory, either for the stack or for other things, then creating new ones every time becomes more expensive. One option is to use a sync.Pool to avoid the allocation. But that doesn't solve the cost of stack growth. (e.g. issue 1838)

In the old version of my code, each worker was responsible for terminating itself if it was idle. This was implemented with a timer that was reset by each request.

This seemed straightforward. I tested it and it worked fine. But then recently I added some metrics, including the number of workers, and I noticed the number of workers rarely went down. Definitely not dependably after the 5 second timeout. I puzzled over this for a while before I realized what was happening. If multiple worker goroutines are waiting to receive from the same channel, it's unpredictable which worker will receive the message. This doesn't seem to be defined in the language spec so I'm not sure if it's random (as the select statement is defined to be) or if the receivers are queued and processed first-in-first-out. It doesn't really matter for my issue. The bottom line is that tasks end up distributed over workers. So even with a low level of activity, workers tend to get at least one task per 5 seconds, and therefore they don't terminate.

I could dream up lots of complicated approaches but it seemed like there should be a simple solution. I could probably get away with just not ever killing workers. But then if a spike in load creates a bunch of workers, they will hang around forever, which isn't the best behavior. Searching the web, I found a lot of introductory material about goroutines and channels, and a lot of bounded thread pools, but not much on unbounded ones. I found lnquy/elastic-worker-pool, but that seemed more complicated than I needed.

Eventually I realized I could make the pool a "leaky bucket" i.e. just periodically kill a worker. That would ensure the pool would eventually shrink. This is very simple and doesn't require anything complex like trying to measure the load. The downside is that even if you're in a steady state with the correct number of workers, it's still going to kill workers and they'll end up being recreated. But a worker pool is really just a cache and a cache doesn't need to have 100% hit rate to be effective. I made a slight improvement by preventing killing too soon after creating a new worker goroutine.

I implemented this by having a killer goroutine that sends a "poison pill" to the same channel as the tasks. One of the workers would receive this and terminate itself.

The killClock and nworkers are atomic since they are accessed from multiple threads. The workers just needed to recognize the poison pill and terminate. The submit code just need to reset the killClock to zero whenever it created a new worker goroutine.

For example, if the interval is 1 second and the delay 10 seconds, in a steady state a worker would be killed and then recreated every 10 seconds. Given that creating a worker is almost fast enough to do on every task, that's very minor overhead.

Hopefully I haven't overlooked any flaws in this solution. It seems simple and fast.

As usual the code is on Github

Sunday, September 10, 2023

A gSuneido Database Optimization

While working on the recent gSuneido database issue I had an idea for optimizing the transaction conflict checking code.

The current code kept a list of outstanding transactions and for each a list of tables and their reads and writes. To check a new read or write meant a linear search through all the transactions for other activity on that table. If the data was organized by table rather than by transaction, it would eliminate the linear search. An action on a "rare" table wouldn't have to look at so much data. Some tables would be common to many transactions, but still normally not all of them. And even in a situation where all the outstanding transactions included a particular table it wouldn't be any slower than before.

At a high level it was a small change, but I wasn't sure how it would fit with the code. I knew I'd still need to keep a list of transactions and the tables they had accessed. It turned out to be relatively straightforward to modify the code. (A few hours work.) One of the questions was where to embed data and where to use pointers to separately allocated data. Embedding tends to be faster, but embedding in maps or slices can use more memory because empty slots are larger.

Before starting the changes I wrote a Go benchmark so I could see what effect the changes had on performance. As usual, it was hard to know what a "typical" pattern of activity was. For this benchmark the new version was over twice as fast. That was a bigger difference than I'd expected. Sadly, but not surprisingly, the change made no difference to the speed of our application test suite. It doesn't do many concurrent transactions so it wouldn't benefit. However, the stress test I'd used to investigate the "too many outstanding transactions" issue did show big improvements. When I collected the same measurements as before and graphed them, it looked even better. Up to three threads it didn't make much difference but after that it was substantially better. And performance didn't drop off under load like it had before. (Or at least, not until much higher load.)

The conventional wisdom for optimization would be to profile the code and find the hot spots. That wouldn't have helped in this case. The impact of the data structure was spread over the code. Nor would profiling have given any insight into a better data structure. As I commonly find, what is required is to have good mental models of the problem, the code, the programming language, and the hardware. Then you need to find an approach that is a good fit between all of these. Of course, the hardware performance model is a moving target. These days it's mostly about cache misses and branch prediction.

I'm always happy to improve performance, especially by significant amounts. But in the back of my mind I can't help thinking I should have written it better the first time around. Of course, that doesn't make sense because you can't write the "perfect" version. There's always going to be room for improvement.

Saturday, September 02, 2023

A gSuneido Database Issue

We've been gradually converting our customers' servers from jSuneido to gSuneido. Recently we converted a larger customer and they started to get "too many outstanding transactions" errors. Unfortunately, we'd changed multiple things at once with this customer, so it wasn't immediately clear whether this issue was because of the switch to gSuneido, but I suspected it was.

I had set the limit in gSuneido to the same value as jSuneido - 200 outstanding transactions. "outstanding" in this case either means uncompleted in progress, or completed but overlapping with an uncompleted one.

It seemed a little counterintuitive. gSuneido is faster than jSuneido. So why should it run into this more? My theory is that because it's faster, more transactions get processed and if some of them are slow this will mean more overlapping transactions, even though they completed quickly.

One of my first thoughts was to look at optimizing the gSuneido code. But if my theory was right, this could actually make the problem worse rather than better.

We temporarily solved the problem by adding slight delays to code that was using a lot of transactions. (i.e. we throttled or rate limited them). This worked but it seemed a little ugly to slow it down all the time, just to avoid a problem that only occurred occasionally.

That led me to add throttling to gSuneido, but only when the outstanding transactions got high (e.g. 100). Throttling would prevent hitting the limit (200). I wrote a stress test and this did solve the problem. And it was better than adding delays in the application code. But the problem was that once it throttled, it got really slow. If you throttled enough to prevent hitting the limit e.g to 10 transactions per second, then 1000 transactions would go from maybe .1 seconds to 100 seconds - ouch! And because throttling slowed things down so much, it increased the chances of transactions hitting the 20 second time limit. The cure seemed worse than the disease, it won the battle but lost the war. (Sorry for the proverbs.)

I started to wonder what the right limit was. 200 was an arbitrary value. The reason to have a limit was to prevent it from "falling over" under heavy load. But at what point would that happen? So I removed the limit and tested under different loads.

threads time tran/sec conflicts max trans
1 2 sec 5000 100 70
2 2 sec 10000 150 110
3 3 sec 10000 200 150
4 4 sec 10000 200 180
5 4 sec 12500 200 200
6 5 sec 12000 300 230
8 8 sec 10000 300 270
10 10 sec 10000 300 300
16 20 sec 8000 400 450
20 30 sec 7000 500 490
30 50 sec 7000 600 580

I was quite happy with the results. Performance did drop off with more than 5 or 6 threads, but not that badly. The maximum outstanding got well above the old limit of 200, but that didn't seem to be a major problem. The number of transactions aborting due to conflicts increased with load, but that's expected with that much concurrent activity. To some extent it seemed to be self limiting. i.e. the load itself slowed things down and effectively did its own throttling.

Of course, this is completely dependent on what my test is doing. Is it "typical"? No, probably not. For starters, it's doing continuous activity on multiple transactions as fast as it can, with no other work. Real application code is not going to be doing that. Our applications never do 10,000 transactions per second for long periods. The test does result in shorter and longer transactions, but the longest are only around 1 second, whereas in production we do see transactions taking much longer and sometimes hitting the 20 second limit (usually as a result of external factors). And my workstation only has 8 cores / 16 threads, so it's not really 30 threads.

Out of curiosity, I removed the limit from jSuneido to see how it would perform. 10 threads took 15 seconds or about 50% slower. That wasn't bad. But roughly half the transactions (5000) failed due to conflicts. Ouch! That wasn't so good. Even with only one thread, roughly 10% (1000) conflicted. As opposed to 1% (100) with gSuneido. The usual method of dealing with conflicts is to retry, but that would just increase the load and make things worse.

So what is the correct limit for gSuneido? Or should there be no limit? Should it ever throttle, and if so, at what point? Honestly, I still don't know the answers to those questions. But it does appear the limit could be considerably higher e.g. 500 instead of 200, which would likely eliminate the problem, other than rare extreme cases.

Thursday, July 20, 2023

Keyword Recognition

Daniel Lemire's recent blog post Recognizing string prefixes with SIMD instructions and one he linked to Modern perfect hashing for strings prompted me to revisit how I was recognizing keywords in gSuneido. Although I find it interesting how you can use advanced processor instructions, I didn't really want to get into that. But like most things, even without looking at my code, I figured it could probably be improved.

I found that I was recognizing language keywords (28) with a linear search, sorted by frequency of occurrence. For query keywords I was using a Go map. Both were quite fast, certainly not a bottleneck.

The reason I hadn't used a Go map for language keywords was that I wanted to "intern" the strings. And there's no way to retrieve the key of a Go map (unless you loop through all of them). Normally it doesn't matter since you need the key to do a lookup in the first place. But the key I'm doing a lookup with is a slice from the source code. If I hold onto that key, it will keep the entire source code in memory (because the reference prevents garbage collection).

One of the first things I tried (Compile-time switches from the Modern article) was one of the fastest (4x faster than the old linear search) - switch on the length, and then switch on the i'th character (chosen for each length). For example:

switch len(s) {
case 2:
    switch s[1] {
    case 'f':
        if s == "if" {
            return tok.If, "if"
        }
    case 's':
        if s == "is" {
            return tok.Is, "is"
        }

I find it a little surprising that this turned out to be the fastest since it has quite a few branches - two switches plus a string comparison. Using Compiler Explorer to look at the assembler, I found the code was quite good. I was expecting a call to a string comparison function, but it is unrolled inline. It may actually be beneficial that the switches are explicit so each branch can get its own prediction.

Based on the assembler, I simplified the code. This was just as fast, smaller source code, and smaller machine code.

switch len(s) {
case 2:
    if s == "if" {
        return tok.If, "if"
    }
    if s == "is" {
        return tok.Is, "is"
    }

Because it was comparing constant strings and it knew the length matched, it compared them as multi-byte integers instead of strings. 2, 4, and 8 bytes matched int16, int32, and int64. For lengths of 3, 5, 6, and 7 it used a combination e.g. 6 characters was an int32 and an int16.

I did try various other approaches.

I tried finding "perfect" hash functions and table sizes, which turned out to be trickier than I thought. It was almost as fast as the switch version.

I tried another hash table implementation, but it was slower than the Go map.

I tried switching on the length and then doing a linear search.

In the end I went with the switch version since it was fastest. It was a little more verbose, but Tabnine handled most of the code generation.

Did this micro-optimization make any difference to the overall speed? No, not that I saw from a quick test. But it was an interesting investigation. And yes, premature optimization can be a mistake. But ignoring speed is also a mistake.

Sunday, April 23, 2023

New Regular Expression Implementation for gSuneido

Suneido has its own regular expression implementation. 25 years ago, when I wrote the original C++ cSuneido implementation, there weren't a lot of regular expression libraries, and I already had an implementation from a previous product, so that's what I used. With the Java version, I initially tried to use the Java standard regular expressions. But it was too hard to get exactly the same behavior, so I ended up using my own implementation. And I did the same thing with the Go version of Suneido.

But I haven't been totally happy with my implementation. It uses backtracking (like many other implementations) which means it has some bad worst cases. I had to put limits on the backtracking, which meant we couldn't use regular expressions in some places, which meant hand writing code to replace them. To make it worse, there was no easy explanation of what kinds of regular expressions were ok.

Recently I re-encountered this series of articles about implementing regular expressions.

Regular Expression Matching Can Be Simple And Fast
Regular Expression Matching: the Virtual Machine Approach
Regular Expression Matching in the Wild

I'd read them before but at that point I'd been satisfied with what I had. The articles made it seem easy to implement a non-backtracking approach to avoid the worst case behavior. Of course, it didn't turn out to be quite that easy to write a full implementation. Naturally, the articles didn't cover all the details.

It started out as a weekend project, but it ended up taking me longer than that, partly because I implemented optimizations for literal strings and single pass expressions.

Because Suneido is ASCII only (due to its age), the code is a little simpler and faster than the Go standard library version which handles UTF-8.

The Go regular expression package compiles to linked structs that are relatively large. I chose to compile to byte code stored in a string. I like to use strings because they are one of the few constant immutable types in Go. And string references are only 16 bytes compared to 24 bytes for byte slices. There's probably a bit more overhead processing the byte code, but it's a more compact chunk of memory so it'll be more cache friendly.

Humorously (to me) when I was fuzz testing my implementation against the Go standard library implementation I discovered a bug in the Go code. As expected, one of the initial responses was denial, blaming it on my lack of understanding. Thankfully, someone else (not on the Go team) confirmed the problem and even located a possible location of the bug. I thought a bug in something like regular expressions would be treated fairly seriously, but over a month later it's still just labeled as "NeedsInvestigation" rather than "NeedsFix". It was put under the Go1.21 milestone at least.

The new implementation is about 2000 lines of Go code. Probably half of that I was able to reuse from the previous version. As usual the code is on GitHub

Saturday, April 08, 2023

Full Text Search

Suneido uses full text search for its help and wiki. jSuneido used Lucene, a natural fit for Java. I needed something different for gSuneido. I looked at Bleve which seems to be the most common Go full text search. But it was big and slow, and the indexes were large. I had used Lunr.js on a previous project and it had worked well. It's small and fast and easy to use. It's JavaScript, but we display our help and wiki in a browser window so that seemed ok.

We (thanks Jatin) got it working, but it was a little ugly. Our customer systems have different combinations of applications and options so we built custom indexes on each system. But for Lunr we were using Node.js to build the indexes and we didn't want to install Node on all our customer systems. So we had to build an index containing "everything", ship that to our customers (every time we updated the help), and then filter the search results based on their configuration.

We only use our Wiki in-house, but with it there was another issue. People edit the wiki all the time. With Lucene, we could update the index with changes, but Lunr.js doesn't support that, you have to rebuild the whole index.

Eventually I got frustrated with the "friction" of using Lunr and decided to continue my long tradition of re-implementing the wheel. I'd considered writing my own full text index/search previously but had resisted the urge. But I had a weekend, and nothing urgent to do, so I decided to give it a try. I started with Let's build a Full-Text Search engine, which makes it sound easy, but mostly because the simple version it describes isn't complete.

Tokenizing is easy, although there are questions about exactly what constitutes a token. Obviously, sequences of letters, but what length? I decided to ignore single letters and also ignore anything over a certain length (e.g. 32 characters). Especially in our wiki we also wanted to search on numbers. There's also the issue of punctuation, which I'm currently ignoring.

I used the Go Snowball (Porter2) stemmer mentioned in the article. A stemmer simplifies words. For example, fishing, fished and fisher are reduced to the base form (stem) fish. That reduces the number of terms in the index and makes searching more forgiving.

I got sidetracked into looking at bitmap data structures like roaring bitmaps (as suggested in the article), but then I started looking at how to implement search scoring and found that bitmaps weren't sufficient, I needed counts per term per document, not just true/false.

I decided to use BM25 scoring like Lucene and Lunr. I found a good article that explained it and even gave some examples I could check my results against.

It came together surprisingly easily and by the end of the weekend I had about 1000 lines of Go code that could create indexes and search them. It was fast and the indexes were small (compared to Lunr). The results seemed comparable. I felt a little guilty because it meant throwing out all the work that went into trying to use Lunr. For a change I felt like I should have written my own version sooner.