Friday, January 12, 2024

Fuzzing - the gift that keeps on giving

It's been almost a year since I started fuzz testing Suneido database queries and it's still finding bugs. I assumed it would find bugs at first, and then peter out. But we set them up to run for an hour every night to catch any regressions and that has turned out to be very worthwhile. Of course, I'm sure I've fixed the last bug now :-)

I have a bunch of Go tests for the query code, and they have decent coverage. But they barely scratch the surface in terms of combinations of query operations. Passing these tests gives me little confidence that changes are correct.

We also have lots of application tests that do a lot of queries. But all the queries are written by humans and follow similar styles and approaches. The bugs found by fuzzing were not detected by the application tests. That's probably partly because I used those application tests to initially develop and debug the gSuneido query implementation. So those types of queries have been well tested and debugged.

It might seem obvious, but one thing the fuzz testing has hammered home is that making something work for "normal" cases can be quite different from making it work for every bizarre unreasonable case. I look at the failing queries from fuzzing and I think "no one would ever write a query like that". And yet, they are bugs nonetheless.

In most cases when the nightly fuzzing would find an error (a difference between the results from gSuneido and jSuneido) I could easily reproduce it. But occasionally it would depend on the randomly generated data. (e.g. that a certain value did or didn't exist) That could make it hard to reproduce the error to debug it. We tried saving the data from each run, but that was awkward. We ended up saving the seed for generating the random data. We could include that with the results and use it to recreate the data.

The data dependency made me realize that we weren't testing many sets of data, just one per night. Rather than generate more sets of data (slow relative to running queries) I changed the fuzzing to try several variations of the constant values in each query. That had a similar effect to trying different sets of data.

It wasn't entirely surprising, but it was interesting that the fuzzing didn't find any errors in the jSuneido query implementation. The fuzzing only detects differences between the implementations, it didn't tell me which was right and which was wrong. Several times I convinced myself that it was jSuneido that was wrong, but it always turned out to be gSuneido. The gSuneido query implementation started out very similar to jSuneido. But over the years it got refactored and improved and optimized. gSuneido's query optimization is now quite a bit more sophisticated, which means more complex, which means more bugs unfortunately. jSuneido has also been around longer so maybe that helped work out the bugs. It's also much more stable - I've barely touched the code for several years whereas the gSuneido code has been changing constantly.

Another aspect that was surprising was how "rare" some of these failing queries were. We are running about a million queries per night. I would have thought that would explore the possibilities fairly well. But some of the bugs were only caught once after weeks of running. In hindsight that shouldn't be surprising - there are an infinite number of possible queries, and when you add in the data dependencies, the failing cases are even more rare.

Like other fuzz testing systems, I accumulate the failing queries in a "corpus" and run them every time along with new random queries. Quite often these are the ones that fail again. They obviously are "tricky" queries as far as my query implementation goes.

One concern I had was whether my ad hoc random query generation would generate "all" the possible queries. There could be bugs that could not be "reached" by the possible range of queries I was generating. Obviously, I don't know if there are bugs it's missing. But judging by the bugs it has found, it seems to have quite good coverage. In theory it should probably be based on the query grammar (the reverse of parsing). I ended up using a random sequence of additions and substitutions. One of the challenges is to ensure that the query is still valid e.g. that the columns referenced actually exist in the sub-query.

When I get a failure, the first thing I try to do is simplify the query. Most of the time there are parts of the query that can be removed and it will still fail. It's a lot easier to debug if e.g. there is only one join in the query. This is a fairly mechanical process and could probably be automated. (More easily from the parse tree than from the query text.) But it only takes a few minutes to do manually so it's not worth the effort to automate. (As long as there aren't too many bugs!)

One issue with my current approach is that it depends on using jSuneido for comparison. But we have replaced jSuneido with gSuneido and we're not using it any more. It's also awkward to have to run two processes and compare the results. I decided to write a "simplest possible query implementation" within gSuneido. I procrastinated on this for a while because I thought it would be a lot of work. But it only took me a few hours. I have to remember that the complexity is not inherent in the operations, it's almost all from the optimization. It still uses the same parser, but that seems ok because that's not where the problems are. It doesn't use any of the normal query optimization or execution. And in a way it's better than using jSuneido since it's not at all the same approach. And it's easier to be sure the simple implementation is correct. The alternate implementation is quite slow and isn't suitable for large results but it's sufficient for the fuzzing. And to get around the slower speed, with it all being within one gSuneido process, it's easy to run multiple threads.