Sunday, November 03, 2024

Go range over integer gotcha

Go version 1.22 introduced the ability to do a for loop ranging over an integer. For example:

for i := range 5

This is normally explained as equivalent to:

for i := 0; i < 5; i++

Recently I decided to go through my code and update to the new style. It was mostly a fairly mechanical search and replace. But when I finished and ran the tests, there were multiple failures. I figured I'd just made a typo somewhere in my editing, but that wasn't the case. The failing places were the ones that modified the loop variable. For example:

for i := 0; i < 5; i++ {
    fmt.Println(i)
    i++
}
=> 0, 2, 4

for i := range 5 {
    fmt.Println(i)
    i++
}
=> 0, 1, 2, 3, 4

I don't see anything in the language spec about this.

The accepted proposal says:

If n is an integer type, then for x := range n { ... } would be completely equivalent to for x := T(0); x < n; x++ { ... }, where T is the type of n (assuming x is not modified in the loop body).

There might be some discussion of this but I didn't find it. It may relate to how closures capture the loop variable. In Go 1.22 this was changed so that each iteration has its own iteration variable(s). If each iteration gets its own variable, that somewhat explains why modifying it has no effect on subsequent iterations.

But even if you declare the variable outside the for statement, it still behaves the same:

var i int
for i = range 5 {
    fmt.Println(i)
    i++
}
=> 0, 1, 2, 3, 4
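
One way to think about it (just my mental model, not anything from the spec or the compiler): range over an integer keeps its own hidden counter and assigns its value to the loop variable at the top of each iteration, so any changes you make to the variable are simply overwritten:

// roughly equivalent behavior (a hypothetical desugaring, not the actual generated code)
var i int
for hidden := 0; hidden < 5; hidden++ {
    i = hidden      // i is reassigned at the top of every iteration
    fmt.Println(i)
    i++             // only changes i, not the hidden counter
}
=> 0, 1, 2, 3, 4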

It's not really a major issue; it just means that if you want to modify the loop variable you need to use the old style of for loop. The unfortunate part is that you can't mechanically replace one style with the other - you have to check whether the body of the loop modifies the variable. I don't modify the loop variable very often, but occasionally it can be useful.

Thursday, October 03, 2024

String Allocation Optimization

This post follows on from the previous one (Go Interface Allocation). After realizing how much allocation came from assigning strings to interfaces (common in gSuneido), I started thinking about how this could be improved. I had a few ideas:

  • predefine all 256 one byte strings (common when looping over a string by character)
  • implement a "small string" type (i.e. a one-byte size and a 15-byte array in 16 bytes, the same space as a Go string header: pointer + length)
  • intern certain strings with the new Go unique package (unique.Make returns a Handle which is a pointer so it could be stored in an interface without allocation)

Unfortunately, the last two require introducing new string types. That's feasible (gSuneido already has an additional SuConcat string type to optimize concatenation) but it would take some work.
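
For reference, using the unique package (added in Go 1.23) would look something like this - a hedged sketch, not gSuneido code:

// import "unique"
func internExample() {
    h := unique.Make("hello") // interns the string; returns a unique.Handle[string]
    var v any = h             // the Handle is pointer-shaped, so this should not allocate
    s := h.Value()            // the original string can be recovered from the handle
    _, _ = v, s
}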

Luckily the first idea was easy and didn't require a new string type.

// SuStr1 is a drop-in replacement for SuStr that returns a shared,
// preallocated Value for one-byte strings, avoiding an allocation.
func SuStr1(s string) Value {
    if len(s) == 1 {
        return SuStr1s[s[0]]
    }
    return SuStr(s)
}

var SuStr1s = [256]Value{
    SuStr("\x00"), SuStr("\x01"), SuStr("\x02"), SuStr("\x03"),
    ...
    SuStr("\xfc"), SuStr("\xfd"), SuStr("\xfe"), SuStr("\xff"),
}
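
A typical use is looping over a string by character, something like this (an illustrative sketch, not actual gSuneido code):

func eachChar(s string, f func(Value)) {
    for i := 0; i < len(s); i++ {
        f(SuStr1(s[i : i+1])) // one-byte strings come from the prebuilt table, no allocation
    }
}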

It didn't make much difference overall but it was easy to see the improvements on specific types of code. For a simple change I think it was worth it. (Thanks to "AI" for auto-completing most of that 256 element array.)

See the commit on GitHub

Wednesday, October 02, 2024

Go Interface Allocation

I was running some benchmarks recently and I noticed the memory allocation was higher than I expected. After some investigation, I found it was an issue I was aware of, but that was easily overlooked.

The simplest Go interface is "any", a.k.a. "interface{}". This is a bit like void* in C or C++, except that it carries a runtime type. An interface is a fat pointer consisting of a data pointer plus the runtime type (another pointer). You can store a pointer in an interface without any allocation. But if you store a value that isn't a pointer, e.g. a struct, then the value has to be on the heap, which usually means an allocation.

The catch is that Go has quite a few fat pointers of its own; they're a distinctive feature of the language. For example, a string value consists of a pointer and a length, and a slice consists of a pointer, a length, and a capacity. So even though you might mentally think of strings and slices as pointers, they don't fit in an interface. Storing a string in an interface therefore allocates a copy of the string header (16 bytes), which in turn points to the actual data. (Also not ideal for locality.)

For example:
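
Something like this (a sketch; everything except X is a made-up name):

var X any // changing the type of X to string eliminates the allocation

var strs = [2]string{"hello", "world"}

func BenchmarkAssignToAny(b *testing.B) {
    for i := 0; i < b.N; i++ {
        X = strs[i%2] // boxing the string copies its 16-byte header to the heap
    }
}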

Even though this benchmark doesn't create any new values, it still allocates. If the type of X is changed from any to string, then there is no allocation.

There's not much you can do about this. In some cases you can use generics to avoid interfaces. But interfaces are such a basic feature of Go that it's hard to avoid them entirely. My point is just that if you're trying to write high-performance code that minimizes allocation, watch out for interfaces.
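
For example, a container that's generic over its element type keeps the values unboxed, whereas one built on any boxes every non-pointer element (a hedged illustration, not gSuneido code):

// Stack[string] stores string headers directly - no boxing.
type Stack[T any] struct{ items []T }

func (s *Stack[T]) Push(v T) { s.items = append(s.items, v) }

// AnyStack boxes each string (or other non-pointer value) it is given.
type AnyStack struct{ items []any }

func (s *AnyStack) Push(v any) { s.items = append(s.items, v) }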

See also: Go Interfaces and Immediate Integers

Friday, May 17, 2024

Attaching Extra Data to Trees

This is nothing original, just something I've been working with and thought I'd document.

Say you have a tree made up of nodes. In my case I'm working with abstract syntax trees. And in some cases you need to attach extra information to the nodes. For example, gSuneido uses the same AST for several purposes - to compile the code, to do static checks, to do syntax based searches in the IDE, and to do source code formatting. Those purposes all require slightly different information.

Here's an example node struct.
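
Something along these lines (a sketch with made-up names, not the actual gSuneido code):

type Node struct {
    Token    string // e.g. an operator or identifier
    Children []*Node
}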

The simplest approach is just to add the extra information to the node struct:
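
Again as a sketch:

type Node struct {
    Token    string
    Children []*Node
    Pos      int32 // extra: source position, present on every node
}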

The advantage of this approach is that it's simple and has low overhead. The disadvantage is that you always have that extra space whether you need it or not. I'm using this approach for basic source position information that is needed for compiler error messages and debug information.

A variation of this approach would be to make "extra" a pointer to a separate struct. If the data was large and you only needed it on some nodes, this could save space.
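
Something like this (sketch):

type Node struct {
    Token    string
    Children []*Node
    Extra    *Extra // nil on most nodes
}

type Extra struct {
    Comments []string
    EndPos   int32
}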

Another approach is to insert additional "wrapper" nodes.
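
Assuming the tree nodes share a common interface, a wrapper might look like this (a sketch; the interface and names are made up):

type AstNode interface {
    Children() []AstNode
}

// Wrapper attaches extra information to a single node without changing
// any of the existing node types. Embedding AstNode means Wrapper
// satisfies AstNode itself, so it can be inserted anywhere in the tree.
type Wrapper struct {
    AstNode
    Pos int32 // the extra information
}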

The advantages of this approach are that it can be applied anywhere and it can be applied selectively. The disadvantage is that code that processes the tree has to deal with these extra nodes. I'm using this approach for some additional source position information that is needed for syntax based searches in the IDE.

Another approach is to store the information "off to the side".
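
For example, a map keyed by node pointer, kept alongside the tree (again a made-up sketch, using the Node struct from the earlier sketches):

// comments holds extra information for only the nodes that need it;
// the tree itself is untouched.
var comments = map[*Node][]string{}

func addComment(n *Node, c string) {
    comments[n] = append(comments[n], c)
}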

This also has the advantage that it can be applied to any node selectively, and code traversing the tree doesn't have to worry about the extra information. The disadvantage is that you have another data structure to deal with. I'm using this approach to store comment and newline position information for source code formatting.

Friday, January 12, 2024

Fuzzing - the gift that keeps on giving

It's been almost a year since I started fuzz testing Suneido database queries and it's still finding bugs. I assumed it would find bugs at first and then peter out. But we set it up to run for an hour every night to catch any regressions, and that has turned out to be very worthwhile. Of course, I'm sure I've fixed the last bug now :-)

I have a bunch of Go tests for the query code, and they have decent coverage. But they barely scratch the surface in terms of combinations of query operations. Passing these tests gives me little confidence that changes are correct.

We also have lots of application tests that do a lot of queries. But all the queries are written by humans and follow similar styles and approaches. The bugs found by fuzzing were not detected by the application tests. That's probably partly because I used those application tests to initially develop and debug the gSuneido query implementation. So those types of queries have been well tested and debugged.

It might seem obvious, but one thing the fuzz testing has hammered home is that making something work for "normal" cases can be quite different from making it work for every bizarre unreasonable case. I look at the failing queries from fuzzing and I think "no one would ever write a query like that". And yet, they are bugs nonetheless.

In most cases, when the nightly fuzzing found an error (a difference between the results from gSuneido and jSuneido) I could easily reproduce it. But occasionally it would depend on the randomly generated data (e.g. that a certain value did or didn't exist), which could make it hard to reproduce the error in order to debug it. We tried saving the data from each run, but that was awkward. We ended up saving the seed used to generate the random data; we could include that with the results and use it to recreate the data.
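
The idea is simply that the same seed regenerates the same data. Roughly (an illustrative sketch, not the actual Suneido data generator):

// makeTestData regenerates identical random data from a saved seed,
// so a data-dependent failure can be reproduced later.
func makeTestData(seed int64) []int {
    rng := rand.New(rand.NewSource(seed)) // math/rand
    data := make([]int, 1000)
    for i := range data {
        data[i] = rng.Intn(100)
    }
    return data
}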

The data dependency made me realize that we weren't testing many sets of data, just one per night. Rather than generate more sets of data (slow relative to running queries) I changed the fuzzing to try several variations of the constant values in each query. That had a similar effect to trying different sets of data.

It wasn't entirely surprising, but it was interesting, that the fuzzing didn't find any errors in the jSuneido query implementation. The fuzzing only detects differences between the implementations; it doesn't tell me which one is right and which is wrong. Several times I convinced myself that it was jSuneido that was wrong, but it always turned out to be gSuneido. The gSuneido query implementation started out very similar to jSuneido's, but over the years it got refactored and improved and optimized. gSuneido's query optimization is now quite a bit more sophisticated, which means more complex, which means more bugs, unfortunately. jSuneido has also been around longer, so maybe that helped work out the bugs. It's also much more stable - I've barely touched the code for several years, whereas the gSuneido code has been changing constantly.

Another aspect that was surprising was how "rare" some of these failing queries were. We are running about a million queries per night. I would have thought that would explore the possibilities fairly well. But some of the bugs were only caught once after weeks of running. In hindsight that shouldn't be surprising - there are an infinite number of possible queries, and when you add in the data dependencies, the failing cases are even more rare.

Like other fuzz testing systems, I accumulate the failing queries in a "corpus" and run them every time along with new random queries. Quite often these are the ones that fail again. They obviously are "tricky" queries as far as my query implementation goes.

One concern I had was whether my ad hoc random query generation would generate "all" the possible queries. There could be bugs that could not be "reached" by the possible range of queries I was generating. Obviously, I don't know if there are bugs it's missing. But judging by the bugs it has found, it seems to have quite good coverage. In theory it should probably be based on the query grammar (the reverse of parsing). I ended up using a random sequence of additions and substitutions. One of the challenges is to ensure that the query is still valid e.g. that the columns referenced actually exist in the sub-query.

When I get a failure, the first thing I try to do is simplify the query. Most of the time there are parts of the query that can be removed and it will still fail. It's a lot easier to debug if e.g. there is only one join in the query. This is a fairly mechanical process and could probably be automated. (More easily from the parse tree than from the query text.) But it only takes a few minutes to do manually so it's not worth the effort to automate. (As long as there aren't too many bugs!)

One issue with my current approach is that it depends on using jSuneido for comparison. But we have replaced jSuneido with gSuneido and we're not using it any more. It's also awkward to have to run two processes and compare the results. So I decided to write a "simplest possible query implementation" within gSuneido. I procrastinated on this for a while because I thought it would be a lot of work, but it only took me a few hours. I have to remember that the complexity is not inherent in the operations; it's almost all from the optimization. It still uses the same parser, but that seems ok because that's not where the problems are. It doesn't use any of the normal query optimization or execution. And in a way it's better than using jSuneido, since it's not at all the same approach, and it's easier to be sure the simple implementation is correct. The alternate implementation is quite slow and isn't suitable for large results, but it's sufficient for the fuzzing. And to make up for the slower speed, since it all runs within one gSuneido process, it's easy to run multiple threads.

Friday, September 22, 2023

Poisoning Workers

As in worker threads, not people.

gSuneido uses an unbounded thread pool to handle client requests to the server. Most Go thread pools are intended to limit the number of goroutines, e.g. to match the number of hardware threads. But in this case I want all the requests to execute in parallel, even if that isn't the most efficient, so that every request makes progress. Normally in Go you don't need an unbounded thread pool; goroutines are usually cheap enough that you can just start one for each request. However, if the goroutine needs to allocate memory, either for its stack or for other things, then creating new ones every time becomes more expensive. One option is to use a sync.Pool to avoid the allocation, but that doesn't solve the cost of stack growth. (e.g. issue 1838)

In the old version of my code, each worker was responsible for terminating itself if it was idle. This was implemented with a timer that was reset by each request.
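
Roughly like this (a simplified sketch, not the actual gSuneido code; task and handle are stand-ins):

func worker(tasks chan *task) {
    for {
        select {
        case t := <-tasks:
            handle(t)
        case <-time.After(5 * time.Second):
            return // idle for 5 seconds, so this worker terminates
        }
    }
}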

This seemed straightforward. I tested it and it worked fine. But then recently I added some metrics, including the number of workers, and I noticed that the number of workers rarely went down - definitely not dependably after the 5-second timeout. I puzzled over this for a while before I realized what was happening. If multiple worker goroutines are waiting to receive from the same channel, it's unpredictable which worker will receive the message. This doesn't seem to be defined in the language spec, so I'm not sure if it's random (as the select statement is defined to be) or if the receivers are queued and processed first-in-first-out. It doesn't really matter for my issue. The bottom line is that tasks end up distributed across the workers. So even with a low level of activity, workers tend to get at least one task every 5 seconds, and therefore they don't terminate.

I could dream up lots of complicated approaches, but it seemed like there should be a simple solution. I could probably get away with never killing workers at all. But then if a spike in load creates a bunch of workers, they would hang around forever, which isn't the best behavior. Searching the web, I found a lot of introductory material about goroutines and channels, and a lot of bounded thread pools, but not much on unbounded ones. I found lnquy/elastic-worker-pool, but that seemed more complicated than I needed.

Eventually I realized I could make the pool a "leaky bucket" i.e. just periodically kill a worker. That would ensure the pool would eventually shrink. This is very simple and doesn't require anything complex like trying to measure the load. The downside is that even if you're in a steady state with the correct number of workers, it's still going to kill workers and they'll end up being recreated. But a worker pool is really just a cache and a cache doesn't need to have 100% hit rate to be effective. I made a slight improvement by preventing killing too soon after creating a new worker goroutine.

I implemented this by having a killer goroutine that sends a "poison pill" to the same channel as the tasks. One of the workers would receive this and terminate itself.
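
The shape of it is roughly this (a hedged sketch; the names, the task type, and the numbers are stand-ins):

var (
    killClock atomic.Int64 // ticks since a worker was last created; reset by submit
    nworkers  atomic.Int32 // current number of worker goroutines
)

var poisonPill = &task{} // a sentinel sent on the same channel as real tasks

// killer periodically "leaks" one worker so the pool eventually shrinks.
func killer(tasks chan *task) {
    for range time.Tick(1 * time.Second) {
        if killClock.Add(1) >= 10 && nworkers.Load() > 0 {
            tasks <- poisonPill // whichever worker receives this will exit
        }
    }
}

func worker(tasks chan *task) {
    for t := range tasks {
        if t == poisonPill {
            nworkers.Add(-1)
            return
        }
        handle(t)
    }
}

// submit hands a task to an idle worker if one is ready,
// otherwise it starts a new worker goroutine.
func submit(tasks chan *task, t *task) {
    select {
    case tasks <- t:
    default:
        nworkers.Add(1)
        killClock.Store(0) // don't kill again too soon after creating a worker
        go worker(tasks)
        tasks <- t
    }
}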

The killClock and nworkers are atomic since they are accessed from multiple threads. The workers just need to recognize the poison pill and terminate. The submit code just needs to reset the killClock to zero whenever it creates a new worker goroutine.

For example, if the interval is 1 second and the delay 10 seconds, in a steady state a worker would be killed and then recreated every 10 seconds. Given that creating a worker is almost fast enough to do on every task, that's very minor overhead.

Hopefully I haven't overlooked any flaws in this solution. It seems simple and fast.

As usual the code is on GitHub