Friday, September 22, 2023

Poisoning Workers

As in worker threads, not people.

gSuneido uses an unbounded thread pool to handle client requests to the server. Most Go thread pools are intended to limit the number of goroutines, e.g. to match the number of hardware threads. But in this case I want all the requests to execute in parallel, even if that isn't the most efficient, so that every request makes progress. Normally in Go you don't need an unbounded thread pool; goroutines are usually cheap enough that you can just start one for each request. However, if the goroutine needs to allocate memory, either for its stack or for other things, then creating new ones every time becomes more expensive. One option is to use a sync.Pool to avoid the allocations. But that doesn't solve the cost of stack growth. (e.g. issue 1838)
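To illustrate the sync.Pool option, here is a minimal sketch (not the gSuneido code) of reusing per-request buffers. It avoids repeated heap allocations, but it does nothing about the cost of growing each new goroutine's stack:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses per-request buffers so each request doesn't allocate a new one.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func handleRequest(req string) {
	buf := bufPool.Get().(*bytes.Buffer) // reuse a buffer if one is available
	buf.Reset()
	defer bufPool.Put(buf)
	buf.WriteString(req)
	fmt.Println(buf.Len())
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) { // a goroutine per request; the buffer is pooled, the stack isn't
			defer wg.Done()
			handleRequest(fmt.Sprintf("request %d", i))
		}(i)
	}
	wg.Wait()
}
```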

In the old version of my code, each worker was responsible for terminating itself if it had been idle for too long (five seconds). This was implemented with a timer that was reset by each request.
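Roughly, that scheme looks something like the following simplified sketch (not the actual gSuneido code; the names tasks and idleTimeout are made up, and real code would likely reuse a single time.Timer rather than calling time.After each iteration):

```go
package main

import (
	"fmt"
	"time"
)

const idleTimeout = 5 * time.Second

// worker handles tasks until it has been idle for idleTimeout, then exits.
func worker(tasks chan func()) {
	for {
		select {
		case task := <-tasks:
			task() // handle a request; the next loop iteration restarts the timeout
		case <-time.After(idleTimeout):
			return // idle too long, terminate this worker
		}
	}
}

func main() {
	tasks := make(chan func())
	go worker(tasks)
	tasks <- func() { fmt.Println("handled a request") }
	time.Sleep(100 * time.Millisecond) // crude wait so the output appears
}
```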

This seemed straightforward. I tested it and it worked fine. But recently I added some metrics, including the number of workers, and I noticed the number of workers rarely went down, and definitely not dependably after the 5 second idle timeout. I puzzled over this for a while before I realized what was happening. If multiple worker goroutines are waiting to receive from the same channel, it's unpredictable which worker will receive a given message. This doesn't seem to be defined in the language spec, so I'm not sure whether it's random (as the select statement is defined to be) or whether the receivers are queued and served first-in-first-out. It doesn't really matter for my issue. The bottom line is that tasks end up spread across the workers. So even with a low level of activity, each worker tends to get at least one task every 5 seconds, and therefore they don't terminate.
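A small demonstration (not gSuneido code) of why the idle timeout rarely fired: several goroutines receiving from one channel each tend to get some of the messages, so no single worker stays idle for long.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	tasks := make(chan int)
	var wg sync.WaitGroup
	for w := 1; w <= 3; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for t := range tasks {
				fmt.Printf("worker %d got task %d\n", w, t)
			}
		}(w)
	}
	for t := 0; t < 9; t++ {
		tasks <- t
		time.Sleep(10 * time.Millisecond) // a low rate of requests
	}
	close(tasks)
	wg.Wait() // typically all three workers will have received some tasks
}
```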

I could dream up lots of complicated approaches, but it seemed like there should be a simple solution. I could probably get away with never killing workers at all. But then if a spike in load created a bunch of workers, they would hang around forever, which isn't the best behavior. Searching the web, I found a lot of introductory material about goroutines and channels, and a lot of bounded thread pools, but not much on unbounded ones. I did find lnquy/elastic-worker-pool, but that seemed more complicated than I needed.

Eventually I realized I could make the pool a "leaky bucket", i.e. just periodically kill a worker. That ensures the pool will eventually shrink. It's very simple and doesn't require anything complex like trying to measure the load. The downside is that even in a steady state with the correct number of workers, it will still kill workers that then end up being recreated. But a worker pool is really just a cache, and a cache doesn't need a 100% hit rate to be effective. I made a slight improvement by not killing a worker too soon after a new worker goroutine has been created.

I implemented this by having a killer goroutine that periodically sends a "poison pill" to the same channel as the tasks. Whichever worker receives it terminates itself.

The killClock and nworkers are atomic since they are accessed concurrently from multiple goroutines. The workers just need to recognize the poison pill and terminate. The submit code just needs to reset the killClock to zero whenever it creates a new worker goroutine.
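Here is a rough sketch of the shape of this. It is not the actual gSuneido code; the names (pool, submit, worker, killer, poisonPill) and the details of submit are assumptions for illustration.

```go
package pool

import (
	"sync/atomic"
	"time"
)

type task func()

var poisonPill task = nil // a nil task is used as the poison pill in this sketch

type pool struct {
	tasks     chan task
	nworkers  atomic.Int32 // current number of workers
	killClock atomic.Int32 // ticks since a worker was last created
}

func newPool() *pool {
	p := &pool{tasks: make(chan task)}
	go p.killer(time.Second, 10) // check every second, kill after 10 ticks
	return p
}

// submit hands a task to an idle worker if one is ready,
// otherwise it starts a new worker goroutine.
func (p *pool) submit(t task) {
	select {
	case p.tasks <- t:
	default:
		p.killClock.Store(0) // don't kill right after creating a worker
		p.nworkers.Add(1)
		go p.worker()
		p.tasks <- t
	}
}

func (p *pool) worker() {
	for t := range p.tasks {
		if t == nil { // poison pill: terminate this worker
			p.nworkers.Add(-1)
			return
		}
		t()
	}
}

// killer is the "leaky bucket": every interval it bumps killClock, and once
// enough ticks have passed since a worker was created, it sends a poison pill
// so that one (arbitrary) idle worker terminates.
func (p *pool) killer(interval time.Duration, delay int32) {
	for range time.Tick(interval) {
		if p.killClock.Add(1) >= delay && p.nworkers.Load() > 0 {
			select {
			case p.tasks <- poisonPill:
			default: // no idle worker to kill right now
			}
		}
	}
}
```

Using nil as the poison pill keeps the channel type simple, though any sentinel value the workers can recognize would work just as well.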

For example, if the interval is 1 second and the delay is 10 seconds, then in a steady state a worker will be killed and then recreated every 10 seconds. Given that creating a worker is almost fast enough to do on every task, that's very minor overhead.

Hopefully I haven't overlooked any flaws in this solution. It seems simple and fast.

As usual, the code is on GitHub.
