Showing posts with label suneido.

Sunday, October 19, 2025

KLL Sketch Quantiles

Recently I was looking for a good way to "characterize" the values in a column of a database table, to help Suneido's query optimizer choose the best index to use. Originally I was thinking of some kind of histogram. While "discussing" alternatives with Gemini, KLL Quantile Sketches came up. Histograms are usually equi-width - each bin represents the same range of values e.g. 0 to 1 out of 10. Quantiles are usually equi-depth - each bin represents the same frequency of values e.g. 1%. With database data you don't usually know the range of values and it's not easy to divide values like strings into equal ranges so quantiles made more sense.

The name "KLL" comes from the authors (Karnin, Lang, and Liberty) of the original paper Optimal Quantile Approximation in Streams.

I have to admit I struggled with understanding KLL sketches. At first I got Gemini to explain them to me and show me a sample implementation. I thought I understood it until I tried to implement it. Since I was off track, I quickly led Gemini astray and ended up flailing around for longer than I should have. I finally gave up on AI and went back to the original paper. The bulk of the paper is math theory proving that the algorithm works within specific error bounds. I didn't really follow the math and I didn't really care about the proof - I was willing to take their word for it. I just wanted to know how to implement it. In the end, the key part was a couple of paragraphs of the paper. The basic idea is pretty simple.

Unfortunately, I didn't find a lot of material on KLL sketches (or its precursor MRL sketches). There isn't even a Wikipedia page! There is an Apache DataSketches library that includes KLL sketches. I didn't look at their code. Trying to figure out basic ideas from a full implementation is hard.

It's easiest to understand by working through several increasingly optimized versions. The first version is just several buffers that hold k values. k determines the accuracy and a common value is 200 which gives 1.65% accuracy. The buffers are known as "compactors".

Incoming values are added to the first compactor. When it gets full, you sort the values, then take them in consecutive pairs and keep either the even- or the odd-indexed ones, discarding the others. Even or odd is chosen randomly, or randomly the first time and alternating thereafter. You end up with half as many values, which you promote to the next compactor. In turn, when the next compactor fills up, you compact it and promote half the values. You add compactors as necessary. Each compactor represents/summarizes twice as many values as the one before it, i.e. its weight doubles.
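The core compaction step can be sketched in a few lines of Go (the names here are mine, not from the paper or my actual implementation):

```go
package main

import (
	"math/rand"
	"sort"
)

// compact sorts a full buffer, then keeps either the even- or the
// odd-indexed values of the sorted sequence (chosen randomly) and
// discards the rest. The surviving half is what the caller promotes
// to the next compactor, where each value carries double the weight.
func compact(buf []float64) []float64 {
	sort.Float64s(buf)
	offset := rand.Intn(2) // randomly keep evens or odds
	kept := make([]float64, 0, len(buf)/2)
	for i := offset; i < len(buf); i += 2 {
		kept = append(kept, buf[i])
	}
	return kept
}
```

Because the buffer is sorted before picking, the kept values remain evenly spread through the discarded ones, which is what bounds the error.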

The first optimization is that as you add levels, the levels closest to the input can be smaller without losing accuracy. The newer values are less important since they represent less of the data. This saves memory and reduces the work of compacting. The size factor is called c and the usual value is 2/3.

In terms of the algorithm, you add a new level of size k and shrink the existing levels. But in terms of implementation, you can't really shrink a memory allocation. You can take a sub-slice but the full size is still there behind the scenes. Instead, I added a new smaller compactor on the input end. The problem with that is that the pre-existing compactors now have the wrong weights. To fix that, I compact each of those previous levels in place, which has the effect of doubling their weight. This approach isn't in the original paper (which is light on implementation) but it seems to retain the correct behavior.

As the compactors get smaller, eventually you reach the minimum size of 2. The next optimization is that all the levels of size 2 can be combined into a "sampler" which randomly selects one value from every 2^n where n is the number of size 2 levels replaced by the sampler.

Querying the result is straightforward - you take the values with their implicit weights and sort them. Then the cumulative weight gives you the quantiles. e.g. 50% of the weight gives you the median.
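A toy version of the query step, again with my own names (a real implementation would gather values from every compactor with weight 2^level):

```go
package main

import "sort"

// weighted pairs a retained value with its implicit weight
// (how many original values it stands in for).
type weighted struct {
	value  float64
	weight int
}

// quantile returns the smallest value whose cumulative weight
// reaches fraction q of the total weight. q = 0.5 gives the median.
func quantile(items []weighted, q float64) float64 {
	sort.Slice(items, func(i, j int) bool {
		return items[i].value < items[j].value
	})
	total := 0
	for _, it := range items {
		total += it.weight
	}
	target := q * float64(total)
	cum := 0
	for _, it := range items {
		cum += it.weight
		if float64(cum) >= target {
			return it.value
		}
	}
	return items[len(items)-1].value
}
```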

There are further optimizations that I didn't implement. They seem to be more theoretical improvements than practical ones. My simple implementation (< 200 lines of Go) seems to give good performance with low memory requirements. (> 100,000,000 values/second, zero allocation once it reaches the maximum size of about 600 values)

Monday, March 24, 2025

Copy on Write

Copy on write is an interesting technique with a wide variety of applications. It's somewhat related to persistent immutable data structures, which are really "partial copy on write". Basically it's just lazy or deferred but with the addition of reference counting.

It started when I happened to be looking at our memoize code. (That's the correct spelling; it's different from "memorize".) When it returns a mutable object, it makes a defensive copy. Otherwise, if the object was modified it would modify the cached value.

Defensive copies are a standard technique, but they're often inefficient because if the caller doesn't modify the object then the copy was unnecessary.

One solution is to make the cached values read-only. Then they can't be modified and you don't need a defensive copy. But this has two problems. One is that people forget to make it read-only, since it works fine without it. The other is that often you do need to modify the result and then all the callers have to copy.

My first thought was to add an explicit CopyOnWrite method. But most people wouldn't understand the difference or remember to use it. We could use it in Memoize, but that was quite limited.

Then I realized that it probably made sense to just make the existing Copy method always be copy-on-write i.e. deferred or lazy copying. That was assuming that I could implement copy-on-write with low enough overhead that the benefit would outweigh the cost.

The simplest naive approach is to mark both the original and the copy as copy-on-write. But then if you later modified them both, you'd end up making two copies, whereas with normal copying you'd only have made one copy. The solution is to keep a shared "copy count", similar to a reference count for memory management. If the copy count is zero, then you can just modify the object without copying it, since you know you won't affect any other "copies".

When you make a lazy copy, you increment the copy-count. When you do an actual copy to allow modification, you decrement the copy-count. Ideally you'd also decrement the copy-count when an object was garbage collected. (perhaps with the new runtime.AddCleanup in Go 1.24)

One catch is that the copy-count must be shared. At first I thought that meant I had to put the data and copy count in a separate object with an extra level of indirection for all references to the data. Then I realized it was only the copy count that had to be shared. So I just allocated it separately. That meant I could access it with atomic operations which have low overhead.

Luckily I had an existing test for concurrent access to objects. This failed with my changes. The race detector also found problems. Objects are locked while reading or writing. But with copy-on-write there are multiple objects referencing the same data, so locking an object isn't sufficient to protect the data. One solution would be what I previously considered - keeping the data and the copy count together in a separate object, along with a lock. But then we're back to too much overhead.

I found the problem was that I was decrementing the copy count before doing the actual copy. But as soon as the copy count went to zero, another thread could think it was ok to modify. I had to decrement the copy count after the actual copy. But that meant checking if the copy count was 0 separately from the decrement, which meant there was potential for two threads to check the copy count, both find it was 1, and both copy the object. I decided this would happen very rarely, and the only cost was an extra copy.

For once my code was structured so it was quite easy to implement this. Copying was done in a single place and update methods all called a mustBeMutable method. It only took about 40 lines of code.

Pleasantly surprising, this abstraction wasn't leaky: it didn't break or affect any of our application code. Running our application tests, there were roughly 500,000 deferred copies and 250,000 eventual actual copies. So it saved half of the copying - nice!

Monday, March 03, 2025

return throw

In a programming language with exceptions (like Suneido), one of the API design questions is when to return an error, and when to throw it.

It's helpful to distinguish "queries" and "commands". (Command-query separation) A query returns a result. A command performs an action, it may or may not have a return value.

For dynamically typed languages like Suneido, query functions can easily return an error value like false or a string. If you forget to check for errors, it will usually cause a runtime error. But if you forget to check for errors with a command function, they'll be lost. That might cause a problem later, but one that can be hard to debug.

From the perspective of using the API, both approaches have pros and cons.

  • checking a return value is generally faster and less code than try-catch
  • it's too easy to forget to check a command return value, especially if failure is rare
  • you can't forget to check an exception (unless you have a try-catch somewhere that deliberately ignores exceptions)
  • return values have to be checked on every call, exceptions can be handled for larger scopes

C++ and C have nodiscard, and Java has the JSR-305 @CheckReturnValue annotation, but these are static compile-time checks, not a good fit for a dynamic language like Suneido.

I came up with the idea of "return throw" (avoiding the need for a new keyword). It returns a value like a regular return. But if the return value is not used (discarded) then it throws an exception.

As I started to use this, I realized it could be more dynamic. A successful return could be just a regular (ignorable) return, whereas error returns could be return throw.

if error
    return throw false
return true

That got a little repetitive so I changed return throw to automatically treat true and "" as success. i.e. a return throw of true or "" would be treated as a normal return and the result could be ignored. But a return throw of any other value e.g. false or "invalid foo" would throw if the result was ignored.
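The runtime side of that rule might be sketched like this in Go (entirely hypothetical names - the real mechanism lives in the Suneido compiler, which emits this check only when it can see the caller discards the result):

```go
package main

// checkDiscarded is what a discarded "return throw" result might
// invoke: true and "" count as success and may be ignored; any
// other value throws. panic stands in for a Suneido exception.
func checkDiscarded(result any) {
	if result == true || result == "" {
		return // success values may be silently discarded
	}
	panic(result)
}
```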

Another issue was if F() did return throw, and G() did return F() that shouldn't be treated as "using" the result. Luckily that turned out to be relatively easy to handle.

I made a bunch of the built-in functions return throw and that has helped track down a few issues. Otherwise it isn't too popular yet. Hopefully it will prove worthwhile in the long run.

Tuesday, February 18, 2025

Go, Cgo, and Syscall

I recently overhauled the Windows interface for gSuneido. The Go documentation for this is sparse and scattered and sometimes hard to interpret, so here are some notes on my current understanding of how Cgo and Syscall should be used.

The big issue is interfacing between typed, garbage collected Go and external code. Go doesn't "see" values in external code. If values are garbage collected by Go while still in use externally, it can cause corruption and crashes. This is hard to debug: failures happen randomly, often rarely, and are hard to reproduce.

The other issue with Go is that thread stacks can expand. This means the stack moves, which means the addresses of values on the stack change. Unlike C or C++, in Go you don't control which values are on the stack and which are on the heap. This leads to the need to "pin" values, so they don't move while being referenced by external code. (There is also the potential that the garbage collector might move values in a future version.)

Go pointers passed as function arguments to C functions have the memory they point to implicitly pinned for the duration of the call. link

Note the "Go pointers" part and remember that uintptr is not a pointer. If you want to pass "untyped" pointer values to C functions, you can use void* which has the nice feature that unsafe.Pointer converts automatically to void* without requiring an explicit cast.

SyscallN takes its arguments as uintptr. This is problematic because uintptr is just a pointer sized integer. It does not protect the value from being garbage collected. So SyscallN has special behavior built-in to the compiler. If you convert a pointer to uintptr in the argument to SyscallN it will keep the pointer alive during the call. It is unclear to me whether this applies to Cgo calls. Do they qualify as "implemented in assembly"? Strangely, SyscallN has no documentation. And the one example uses the deprecated Syscall instead of SyscallN. And you won't even find SyscallN unless you add "?GOOS=windows" to the url.

If a pointer argument must be converted to uintptr for use as an argument, that conversion must appear in the call expression itself. The compiler handles a Pointer converted to a uintptr in the argument list of a call to a function implemented in assembly by arranging that the referenced allocated object, if any, is retained and not moved until the call completes, even though from the types alone it would appear that the object is no longer needed during the call. link

The examples for this generally show both the uintptr conversion and the unsafe.Pointer conversion in the argument, for example:

syscall.SyscallN(address, uintptr(unsafe.Pointer(&foo)))

My assumption (?) is that the key part is the uintptr cast and that it's ok to do the unsafe.Pointer outside of the argument, for example:

p := unsafe.Pointer(&foo)
syscall.SyscallN(address, uintptr(p))

What you do NOT want to do, is:

p := uintptr(unsafe.Pointer(&foo))
// WRONG: foo is not referenced or pinned at this point
syscall.SyscallN(address, p)

To prevent something from being garbage collected prematurely you can use runtime.KeepAlive which is simply a way to add an explicit reference to a value, to prevent it from being garbage collected before that point in the code. KeepAlive is mostly described as a way to prevent finalizers from running, but it also affects garbage collection.

KeepAlive marks its argument as currently reachable. This ensures that the object is not freed, and its finalizer is not run, before the point in the program where KeepAlive is called. link

However, it does not prevent stack values from moving. For that you can use runtime.Pinner

Pin pins a Go object, preventing it from being moved or freed by the garbage collector until the Pinner.Unpin method has been called. A pointer to a pinned object can be directly stored in C memory or can be contained in Go memory passed to C functions. If the pinned object itself contains pointers to Go objects, these objects must be pinned separately if they are going to be accessed from C code. link

Passing Go strings to C functions is awkward. Cgo has C.CString but it malloc's so you have to make sure you free. Another option is:

buf := make([]byte, len(s)+1) // +1 for nul terminator
copy(buf, s)
fn((*C.char)(unsafe.Pointer(&buf[0])))

If you don't need to add a nul terminator, you can use:

buf := []byte(s)
fn((*C.char)(unsafe.Pointer(&buf[0])))

Or if you are sure the external function doesn't modify the string, and you don't need a nul terminator, you can live dangerously and pass a direct pointer to the Go string with:

fn((*C.char)(unsafe.Pointer(unsafe.StringData(s))))

One thing to watch out for is that &buf[0] panics at runtime if len(buf) == 0. If it's possible for the length to be zero, you can use unsafe.SliceData(buf) instead.
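A quick illustration of the difference (the helper name is mine):

```go
package main

import "unsafe"

// firstElemPtr returns a pointer to the slice's backing array.
// Unlike &buf[0], it does not panic on a zero-length slice;
// it returns nil for a nil slice.
func firstElemPtr(buf []byte) *byte {
	return unsafe.SliceData(buf)
}
```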

There is a certain amount of run time checking for some of this.

The checking is controlled by the cgocheck setting of the GODEBUG environment variable. The default setting is GODEBUG=cgocheck=1, which implements reasonably cheap dynamic checks. These checks may be disabled entirely using GODEBUG=cgocheck=0. Complete checking of pointer handling, at some cost in run time, is available by setting GOEXPERIMENT=cgocheck2 at build time. link

Sorry for the somewhat disorganized post. Hopefully if someone stumbles on this it might help. Or maybe it'll just be gobbled up and regurgitated by AI.

See also:
Cgo documentation
Go Wiki: cgo
Addressing CGO pains, one at a time

Thursday, October 03, 2024

String Allocation Optimization

This post follows on from the previous one. (Go Interface Allocation) After realizing the allocation from assigning strings to interfaces (common in gSuneido) I started thinking about how this could be improved. I had a few ideas:

  • predefine all 256 one byte strings (common when looping over a string by character)
  • implement a "small string" type (i.e. a one byte size and 15 byte array in 16 bytes, the same space as a Go string pointer + size)
  • intern certain strings with the new Go unique package (unique.Make returns a Handle which is a pointer so it could be stored in an interface without allocation)

Unfortunately, the last two require introducing new string types. That's feasible (gSuneido already has an additional SuConcat string type to optimize concatenation) but it would take some work.

Luckily the first idea was easy and didn't require a new string type.

func SuStr1(s string) Value {
    if len(s) == 1 {
        return SuStr1s[s[0]]
    }
    return SuStr(s)
}

var SuStr1s = [256]Value{
    SuStr("\x00"), SuStr("\x01"), SuStr("\x02"), SuStr("\x03"),
    ...
    SuStr("\xfc"), SuStr("\xfd"), SuStr("\xfe"), SuStr("\xff"),
}

It didn't make much difference overall but it was easy to see the improvements on specific types of code. For a simple change I think it was worth it. (Thanks to "AI" for auto-completing most of that 256 element array.)

See the commit on GitHub

Friday, January 12, 2024

Fuzzing - the gift that keeps on giving

It's been almost a year since I started fuzz testing Suneido database queries and it's still finding bugs. I assumed it would find bugs at first, and then peter out. But we set them up to run for an hour every night to catch any regressions and that has turned out to be very worthwhile. Of course, I'm sure I've fixed the last bug now :-)

I have a bunch of Go tests for the query code, and they have decent coverage. But they barely scratch the surface in terms of combinations of query operations. Passing these tests gives me little confidence that changes are correct.

We also have lots of application tests that do a lot of queries. But all the queries are written by humans and follow similar styles and approaches. The bugs found by fuzzing were not detected by the application tests. That's probably partly because I used those application tests to initially develop and debug the gSuneido query implementation. So those types of queries have been well tested and debugged.

It might seem obvious, but one thing the fuzz testing has hammered home is that making something work for "normal" cases can be quite different from making it work for every bizarre unreasonable case. I look at the failing queries from fuzzing and I think "no one would ever write a query like that". And yet, they are bugs nonetheless.

In most cases when the nightly fuzzing would find an error (a difference between the results from gSuneido and jSuneido) I could easily reproduce it. But occasionally it would depend on the randomly generated data (e.g. that a certain value did or didn't exist). That could make it hard to reproduce the error to debug it. We tried saving the data from each run, but that was awkward. We ended up saving the seed for generating the random data. We could include that with the results and use it to recreate the data.

The data dependency made me realize that we weren't testing many sets of data, just one per night. Rather than generate more sets of data (slow relative to running queries) I changed the fuzzing to try several variations of the constant values in each query. That had a similar effect to trying different sets of data.

It wasn't entirely surprising, but it was interesting that the fuzzing didn't find any errors in the jSuneido query implementation. The fuzzing only detects differences between the implementations, it didn't tell me which was right and which was wrong. Several times I convinced myself that it was jSuneido that was wrong, but it always turned out to be gSuneido. The gSuneido query implementation started out very similar to jSuneido. But over the years it got refactored and improved and optimized. gSuneido's query optimization is now quite a bit more sophisticated, which means more complex, which means more bugs unfortunately. jSuneido has also been around longer so maybe that helped work out the bugs. It's also much more stable - I've barely touched the code for several years whereas the gSuneido code has been changing constantly.

Another aspect that was surprising was how "rare" some of these failing queries were. We are running about a million queries per night. I would have thought that would explore the possibilities fairly well. But some of the bugs were only caught once after weeks of running. In hindsight that shouldn't be surprising - there are an infinite number of possible queries, and when you add in the data dependencies, the failing cases are even more rare.

Like other fuzz testing systems, I accumulate the failing queries in a "corpus" and run them every time along with new random queries. Quite often these are the ones that fail again. They obviously are "tricky" queries as far as my query implementation goes.

One concern I had was whether my ad hoc random query generation would generate "all" the possible queries. There could be bugs that could not be "reached" by the possible range of queries I was generating. Obviously, I don't know if there are bugs it's missing. But judging by the bugs it has found, it seems to have quite good coverage. In theory it should probably be based on the query grammar (the reverse of parsing). I ended up using a random sequence of additions and substitutions. One of the challenges is to ensure that the query is still valid e.g. that the columns referenced actually exist in the sub-query.

When I get a failure, the first thing I try to do is simplify the query. Most of the time there are parts of the query that can be removed and it will still fail. It's a lot easier to debug if e.g. there is only one join in the query. This is a fairly mechanical process and could probably be automated. (More easily from the parse tree than from the query text.) But it only takes a few minutes to do manually so it's not worth the effort to automate. (As long as there aren't too many bugs!)

One issue with my current approach is that it depends on using jSuneido for comparison. But we have replaced jSuneido with gSuneido and we're not using it any more. It's also awkward to have to run two processes and compare the results.

I decided to write a "simplest possible query implementation" within gSuneido. I procrastinated on this for a while because I thought it would be a lot of work. But it only took me a few hours. I have to remember that the complexity is not inherent in the operations, it's almost all from the optimization. It still uses the same parser, but that seems ok because that's not where the problems are. It doesn't use any of the normal query optimization or execution. And in a way it's better than using jSuneido since it's not at all the same approach. And it's easier to be sure the simple implementation is correct.

The alternate implementation is quite slow and isn't suitable for large results, but it's sufficient for the fuzzing. And to get around the slower speed, with it all being within one gSuneido process, it's easy to run multiple threads.

Friday, September 22, 2023

Poisoning Workers

As in worker threads, not people.

gSuneido uses an unbounded thread pool to handle client requests to the server. Most Go thread pools are intended to limit the number of goroutines e.g. to match the number of hardware threads. But in this case I want all the requests to execute in parallel, even if it wasn't the most efficient, so that all the requests made progress. Normally in Go you don't need an unbounded thread pool, goroutines are usually fast enough that you can just start a goroutine for each request. However, if the goroutine will need to allocate memory, either for the stack or for other things, then creating new ones every time becomes more expensive. One option is to use a sync.Pool to avoid the allocation. But that doesn't solve the cost of stack growth. (e.g. issue 1838)

In the old version of my code, each worker was responsible for terminating itself if it was idle. This was implemented with a timer that was reset by each request.

This seemed straightforward. I tested it and it worked fine. But then recently I added some metrics, including the number of workers, and I noticed the number of workers rarely went down. Definitely not dependably after the 5 second timeout. I puzzled over this for a while before I realized what was happening. If multiple worker goroutines are waiting to receive from the same channel, it's unpredictable which worker will receive the message. This doesn't seem to be defined in the language spec so I'm not sure if it's random (as the select statement is defined to be) or if the receivers are queued and processed first-in-first-out. It doesn't really matter for my issue. The bottom line is that tasks end up distributed over workers. So even with a low level of activity, workers tend to get at least one task per 5 seconds, and therefore they don't terminate.

I could dream up lots of complicated approaches but it seemed like there should be a simple solution. I could probably get away with just not ever killing workers. But then if a spike in load creates a bunch of workers, they will hang around forever, which isn't the best behavior. Searching the web, I found a lot of introductory material about goroutines and channels, and a lot of bounded thread pools, but not much on unbounded ones. I found lnquy/elastic-worker-pool, but that seemed more complicated than I needed.

Eventually I realized I could make the pool a "leaky bucket" i.e. just periodically kill a worker. That would ensure the pool would eventually shrink. This is very simple and doesn't require anything complex like trying to measure the load. The downside is that even if you're in a steady state with the correct number of workers, it's still going to kill workers and they'll end up being recreated. But a worker pool is really just a cache and a cache doesn't need to have 100% hit rate to be effective. I made a slight improvement by preventing killing too soon after creating a new worker goroutine.

I implemented this by having a killer goroutine that sends a "poison pill" to the same channel as the tasks. One of the workers would receive this and terminate itself.

The killClock and nworkers are atomic since they are accessed from multiple threads. The workers just need to recognize the poison pill and terminate. The submit code just needs to reset the killClock to zero whenever it creates a new worker goroutine.
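The overall shape is roughly this (a simplified sketch with my own names - the real gSuneido code differs, e.g. in how the kill timing is tracked):

```go
package main

import (
	"sync/atomic"
	"time"
)

type task func()

var poison task // a nil task is the poison pill

type pool struct {
	work     chan task // unbuffered: a send succeeds only if a worker is idle
	nworkers atomic.Int32
}

func (p *pool) worker() {
	for t := range p.work {
		if t == nil { // poison pill: this worker terminates itself
			p.nworkers.Add(-1)
			return
		}
		t()
	}
}

func (p *pool) submit(t task) {
	select {
	case p.work <- t: // an idle worker took it
	default: // no idle worker: start a new one
		p.nworkers.Add(1)
		go p.worker()
		p.work <- t
	}
}

// killer periodically sends a poison pill to the same channel as the
// tasks, so the pool slowly "leaks" workers and eventually shrinks.
func (p *pool) killer(interval time.Duration) {
	for range time.Tick(interval) {
		if p.nworkers.Load() > 0 {
			p.work <- poison
		}
	}
}
```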

For example, if the interval is 1 second and the delay 10 seconds, in a steady state a worker would be killed and then recreated every 10 seconds. Given that creating a worker is almost fast enough to do on every task, that's very minor overhead.

Hopefully I haven't overlooked any flaws in this solution. It seems simple and fast.

As usual the code is on GitHub

Sunday, September 10, 2023

A gSuneido Database Optimization

While working on the recent gSuneido database issue I had an idea for optimizing the transaction conflict checking code.

The current code kept a list of outstanding transactions and for each a list of tables and their reads and writes. To check a new read or write meant a linear search through all the transactions for other activity on that table. If the data was organized by table rather than by transaction, it would eliminate the linear search. An action on a "rare" table wouldn't have to look at so much data. Some tables would be common to many transactions, but still normally not all of them. And even in a situation where all the outstanding transactions included a particular table it wouldn't be any slower than before.
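The change in data layout might look roughly like this (illustrative types of my own, not the actual gSuneido ones):

```go
package main

// Before: activity stored per transaction. Checking one table meant
// a linear search through every outstanding transaction.
type tranOld struct {
	id     int
	tables map[string]*actions
}

// After: activity indexed by table first. Checking a table is a
// single map lookup that finds only the transactions touching it.
type checkerNew struct {
	tables map[string]map[int]*actions // table -> transaction id -> activity
}

// actions stands in for the real per-table read/write sets.
type actions struct {
	reads, writes int
}
```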

At a high level it was a small change, but I wasn't sure how it would fit with the code. I knew I'd still need to keep a list of transactions and the tables they had accessed. It turned out to be relatively straightforward to modify the code. (A few hours work.) One of the questions was where to embed data and where to use pointers to separately allocated data. Embedding tends to be faster, but embedding in maps or slices can use more memory because empty slots are larger.

Before starting the changes I wrote a Go benchmark so I could see what effect the changes had on performance. As usual, it was hard to know what a "typical" pattern of activity was. For this benchmark the new version was over twice as fast. That was a bigger difference than I'd expected. Sadly, but not surprisingly, the change made no difference to the speed of our application test suite. It doesn't do many concurrent transactions so it wouldn't benefit. However, the stress test I'd used to investigate the "too many outstanding transactions" issue did show big improvements. When I collected the same measurements as before and graphed them, it looked even better. Up to three threads it didn't make much difference but after that it was substantially better. And performance didn't drop off under load like it had before. (Or at least, not until much higher load.)

The conventional wisdom for optimization would be to profile the code and find the hot spots. That wouldn't have helped in this case. The impact of the data structure was spread over the code. Nor would profiling have given any insight into a better data structure. As I commonly find, what is required is to have good mental models of the problem, the code, the programming language, and the hardware. Then you need to find an approach that is a good fit between all of these. Of course, the hardware performance model is a moving target. These days it's mostly about cache misses and branch prediction.

I'm always happy to improve performance, especially by significant amounts. But in the back of my mind I can't help thinking I should have written it better the first time around. Of course, that doesn't make sense because you can't write the "perfect" version. There's always going to be room for improvement.

Saturday, September 02, 2023

A gSuneido Database Issue

We've been gradually converting our customers' servers from jSuneido to gSuneido. Recently we converted a larger customer and they started to get "too many outstanding transactions" errors. Unfortunately, we'd changed multiple things at once with this customer, so it wasn't immediately clear whether this issue was because of the switch to gSuneido, but I suspected it was.

I had set the limit in gSuneido to the same value as jSuneido - 200 outstanding transactions. "Outstanding" in this case means either uncompleted (in progress), or completed but overlapping with an uncompleted one.

It seemed a little counterintuitive. gSuneido is faster than jSuneido. So why should it run into this more? My theory is that because it's faster, more transactions get processed and if some of them are slow this will mean more overlapping transactions, even though they completed quickly.

One of my first thoughts was to look at optimizing the gSuneido code. But if my theory was right, this could actually make the problem worse rather than better.

We temporarily solved the problem by adding slight delays to code that was using a lot of transactions. (i.e. we throttled or rate limited them). This worked but it seemed a little ugly to slow it down all the time, just to avoid a problem that only occurred occasionally.

That led me to add throttling to gSuneido, but only when the outstanding transactions got high (e.g. 100). Throttling would prevent hitting the limit (200). I wrote a stress test and this did solve the problem. And it was better than adding delays in the application code. But the problem was that once it throttled, it got really slow. If you throttled enough to prevent hitting the limit, e.g. to 10 transactions per second, then 1000 transactions would go from maybe .1 seconds to 100 seconds - ouch! And because throttling slowed things down so much, it increased the chances of transactions hitting the 20 second time limit. The cure seemed worse than the disease; it won the battle but lost the war. (Sorry for the proverbs.)

I started to wonder what the right limit was. 200 was an arbitrary value. The reason to have a limit was to prevent it from "falling over" under heavy load. But at what point would that happen? So I removed the limit and tested under different loads.

threads   time     tran/sec   conflicts   max trans
   1       2 sec     5000        100          70
   2       2 sec    10000        150         110
   3       3 sec    10000        200         150
   4       4 sec    10000        200         180
   5       4 sec    12500        200         200
   6       5 sec    12000        300         230
   8       8 sec    10000        300         270
  10      10 sec    10000        300         300
  16      20 sec     8000        400         450
  20      30 sec     7000        500         490
  30      50 sec     7000        600         580

I was quite happy with the results. Performance did drop off with more than 5 or 6 threads, but not that badly. The maximum outstanding got well above the old limit of 200, but that didn't seem to be a major problem. The number of transactions aborting due to conflicts increased with load, but that's expected with that much concurrent activity. To some extent it seemed to be self limiting. i.e. the load itself slowed things down and effectively did its own throttling.

Of course, this is completely dependent on what my test is doing. Is it "typical"? No, probably not. For starters, it's doing continuous activity on multiple transactions as fast as it can, with no other work. Real application code is not going to be doing that. Our applications never do 10,000 transactions per second for long periods. The test does result in shorter and longer transactions, but the longest are only around 1 second, whereas in production we do see transactions taking much longer and sometimes hitting the 20 second limit (usually as a result of external factors). And my workstation only has 8 cores / 16 threads, so it's not really 30 threads.

Out of curiosity, I removed the limit from jSuneido to see how it would perform. 10 threads took 15 seconds or about 50% slower. That wasn't bad. But roughly half the transactions (5000) failed due to conflicts. Ouch! That wasn't so good. Even with only one thread, roughly 10% (1000) conflicted. As opposed to 1% (100) with gSuneido. The usual method of dealing with conflicts is to retry, but that would just increase the load and make things worse.

So what is the correct limit for gSuneido? Or should there be no limit? Should it ever throttle, and if so, at what point? Honestly, I still don't know the answers to those questions. But it does appear the limit could be considerably higher e.g. 500 instead of 200, which would likely eliminate the problem, other than rare extreme cases.

Thursday, July 20, 2023

Keyword Recognition

Daniel Lemire's recent blog post Recognizing string prefixes with SIMD instructions, and one he linked to, Modern perfect hashing for strings, prompted me to revisit how I was recognizing keywords in gSuneido. Although I find it interesting how you can use advanced processor instructions, I didn't really want to get into that. But like most things, even without looking at my code, I figured it could probably be improved.

I found that I was recognizing language keywords (28) with a linear search, sorted by frequency of occurrence. For query keywords I was using a Go map. Both were quite fast, certainly not a bottleneck.

The reason I hadn't used a Go map for language keywords was that I wanted to "intern" the strings. And there's no way to retrieve the key of a Go map (unless you loop through all of them). Normally it doesn't matter since you need the key to do a lookup in the first place. But the key I'm doing a lookup with is a slice from the source code. If I hold onto that key, it will keep the entire source code in memory (because the reference prevents garbage collection).
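The retention problem can be sketched like this (a simplified illustration, not gSuneido's actual keyword table - `keyword` and its three-entry table are made up for the example). Returning the constant string, rather than the slice that was looked up, is what makes interning work:

```go
package main

import (
	"fmt"
	"strings"
)

// keyword returns the interned constant for s, or "" if s is not a keyword.
// Returning the constant (not s itself) means the caller never holds onto
// a slice of the source code, so the source can be garbage collected.
func keyword(s string) string {
	switch s {
	case "if":
		return "if"
	case "is":
		return "is"
	case "while":
		return "while"
	}
	return ""
}

func main() {
	src := strings.Repeat("x", 1000) + " if " // stands in for a large source file
	tok := src[1001:1003]                     // a slice still backed by src
	interned := keyword(tok)                  // a constant, independent of src
	fmt.Println(interned == "if")
}
```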

One of the first things I tried (Compile-time switches from the Modern article) was one of the fastest (4x faster than the old linear search) - switch on the length, and then switch on the i'th character (chosen for each length). For example:

switch len(s) {
case 2:
    switch s[1] {
    case 'f':
        if s == "if" {
            return tok.If, "if"
        }
    case 's':
        if s == "is" {
            return tok.Is, "is"
        }
    }
// ... and so on for the other lengths
}

I find it a little surprising that this turned out to be the fastest since it has quite a few branches - two switches plus a string comparison. Using Compiler Explorer to look at the assembler, I found the code was quite good. I was expecting a call to a string comparison function, but it is unrolled inline. It may actually be beneficial that the switches are explicit so each branch can get its own prediction.

Based on the assembler, I simplified the code. This was just as fast, smaller source code, and smaller machine code.

switch len(s) {
case 2:
    if s == "if" {
        return tok.If, "if"
    }
    if s == "is" {
        return tok.Is, "is"
    }
// ... and so on for the other lengths
}

Because it was comparing constant strings and it knew the length matched, it compared them as multi-byte integers instead of strings. 2, 4, and 8 bytes matched int16, int32, and int64. For lengths of 3, 5, 6, and 7 it used a combination e.g. 6 characters was an int32 and an int16.

I did try various other approaches:

  • Finding "perfect" hash functions and table sizes, which turned out to be trickier than I thought. It was almost as fast as the switch version.
  • Another hash table implementation, but it was slower than the Go map.
  • Switching on the length and then doing a linear search.

In the end I went with the switch version since it was fastest. It was a little more verbose, but Tabnine handled most of the code generation.

Did this micro-optimization make any difference to the overall speed? No, not that I saw from a quick test. But it was an interesting investigation. And yes, premature optimization can be a mistake. But ignoring speed is also a mistake.

Sunday, April 23, 2023

New Regular Expression Implementation for gSuneido

Suneido has its own regular expression implementation. 25 years ago, when I wrote the original C++ cSuneido implementation, there weren't a lot of regular expression libraries, and I already had an implementation from a previous product, so that's what I used. With the Java version, I initially tried to use the Java standard regular expressions. But it was too hard to get exactly the same behavior, so I ended up using my own implementation. And I did the same thing with the Go version of Suneido.

But I haven't been totally happy with my implementation. It uses backtracking (like many other implementations) which means it has some bad worst cases. I had to put limits on the backtracking, which meant we couldn't use regular expressions in some places, which meant hand writing code to replace them. To make it worse, there was no easy explanation of what kinds of regular expressions were ok.

Recently I re-encountered this series of articles about implementing regular expressions.

Regular Expression Matching Can Be Simple And Fast
Regular Expression Matching: the Virtual Machine Approach
Regular Expression Matching in the Wild

I'd read them before but at that point I'd been satisfied with what I had. The articles made it seem easy to implement a non-backtracking approach to avoid the worst case behavior. Of course, it didn't turn out to be quite that easy to write a full implementation. Naturally, the articles didn't cover all the details.

It started out as a weekend project, but it ended up taking me longer than that, partly because I implemented optimizations for literal strings and single pass expressions.

Because Suneido is ASCII only (due to its age), the code is a little simpler and faster than the Go standard library version which handles UTF-8.

The Go regular expression package compiles to linked structs that are relatively large. I chose to compile to byte code stored in a string. I like to use strings because they are one of the few constant immutable types in Go. And string references are only 16 bytes compared to 24 bytes for byte slices. There's probably a bit more overhead processing the byte code, but it's a more compact chunk of memory so it'll be more cache friendly.
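A toy illustration of the idea (the opcodes here are invented for the example; Suneido's actual byte code is different): a pattern is compiled into a string of opcodes and interpreted by a small loop, and the compiled program is immutable and compact.

```go
package main

import "fmt"

// Invented opcodes for the sketch: a "compiled pattern" stored in a string.
// Strings are immutable in Go, and a string reference is only 16 bytes,
// vs 24 bytes for a byte slice.
const (
	opChar  = 'c' // followed by one literal byte to match
	opMatch = 'm' // successful end of pattern
)

// compileLiteral compiles a literal pattern to byte code in a string.
func compileLiteral(lit string) string {
	var b []byte
	for i := 0; i < len(lit); i++ {
		b = append(b, opChar, lit[i])
	}
	return string(append(b, opMatch))
}

// run interprets the byte code against s, requiring a full match.
func run(prog, s string) bool {
	pc, sp := 0, 0 // program counter, string position
	for {
		switch prog[pc] {
		case opChar:
			if sp >= len(s) || s[sp] != prog[pc+1] {
				return false
			}
			pc += 2
			sp++
		case opMatch:
			return sp == len(s)
		}
	}
}

func main() {
	prog := compileLiteral("abc")
	fmt.Println(run(prog, "abc"), run(prog, "abd"))
}
```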

Humorously (to me) when I was fuzz testing my implementation against the Go standard library implementation I discovered a bug in the Go code. As expected, one of the initial responses was denial, blaming it on my lack of understanding. Thankfully, someone else (not on the Go team) confirmed the problem and even located a possible location of the bug. I thought a bug in something like regular expressions would be treated fairly seriously, but over a month later it's still just labeled as "NeedsInvestigation" rather than "NeedsFix". It was put under the Go1.21 milestone at least.

The new implementation is about 2000 lines of Go code. Probably half of that I was able to reuse from the previous version. As usual, the code is on GitHub.

Saturday, April 08, 2023

Full Text Search

Suneido uses full text search for its help and wiki. jSuneido used Lucene, a natural fit for Java. I needed something different for gSuneido. I looked at Bleve which seems to be the most common Go full text search. But it was big and slow, and the indexes were large. I had used Lunr.js on a previous project and it had worked well. It's small and fast and easy to use. It's JavaScript, but we display our help and wiki in a browser window so that seemed ok.

We (thanks Jatin) got it working, but it was a little ugly. Our customer systems have different combinations of applications and options so we built custom indexes on each system. But for Lunr we were using Node.js to build the indexes and we didn't want to install Node on all our customer systems. So we had to build an index containing "everything", ship that to our customers (every time we updated the help), and then filter the search results based on their configuration.

We only use our Wiki in-house, but with it there was another issue. People edit the wiki all the time. With Lucene, we could update the index with changes, but Lunr.js doesn't support that; you have to rebuild the whole index.

Eventually I got frustrated with the "friction" of using Lunr and decided to continue my long tradition of re-implementing the wheel. I'd considered writing my own full text index/search previously but had resisted the urge. But I had a weekend, and nothing urgent to do, so I decided to give it a try. I started with Let's build a Full-Text Search engine, which makes it sound easy, but mostly because the simple version it describes isn't complete.

Tokenizing is easy, although there are questions about exactly what constitutes a token. Obviously, sequences of letters, but what length? I decided to ignore single letters and also ignore anything over a certain length (e.g. 32 characters). Especially in our wiki we also wanted to search on numbers. There's also the issue of punctuation, which I'm currently ignoring.
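The rules above can be sketched in a few lines (a minimal illustration; `tokenize` is not the actual gSuneido code): runs of letters or digits, dropping single characters and anything over 32 characters, with punctuation acting as a separator.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize splits text into lowercase terms: runs of letters or digits,
// ignoring single characters and anything longer than 32 characters.
// Punctuation is treated purely as a separator.
func tokenize(text string) []string {
	words := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var toks []string
	for _, w := range words {
		if len(w) > 1 && len(w) <= 32 {
			toks = append(toks, strings.ToLower(w))
		}
	}
	return toks
}

func main() {
	fmt.Println(tokenize("Fishing in 2023, a how-to!"))
}
```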

I used the Go Snowball (Porter2) stemmer mentioned in the article. A stemmer simplifies words. For example, fishing, fished and fisher are reduced to the base form (stem) fish. That reduces the number of terms in the index and makes searching more forgiving.

I got sidetracked into looking at bitmap data structures like roaring bitmaps (as suggested in the article), but then I started looking at how to implement search scoring and found that bitmaps weren't sufficient: I needed counts per term per document, not just true/false.

I decided to use BM25 scoring like Lucene and Lunr. I found a good article that explained it and even gave some examples I could check my results against.
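For reference, here is a minimal sketch of BM25 scoring for a single term, using the standard parameters k1 = 1.2 and b = 0.75 (real implementations like Lucene add refinements):

```go
package main

import (
	"fmt"
	"math"
)

// bm25 scores one term for one document.
//   tf     - occurrences of the term in the document
//   docLen - length of the document (in terms)
//   avgLen - average document length in the corpus
//   nDocs  - total number of documents
//   df     - number of documents containing the term
func bm25(tf, docLen, avgLen, nDocs, df float64) float64 {
	const k1, b = 1.2, 0.75
	// idf rewards rare terms; the length term penalizes long documents
	idf := math.Log(1 + (nDocs-df+0.5)/(df+0.5))
	return idf * tf * (k1 + 1) / (tf + k1*(1-b+b*docLen/avgLen))
}

func main() {
	// A rare term (df=5) scores higher than a common one (df=500),
	// all else being equal.
	fmt.Println(bm25(3, 100, 120, 1000, 5) > bm25(3, 100, 120, 1000, 500))
}
```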

It came together surprisingly easily and by the end of the weekend I had about 1000 lines of Go code that could create indexes and search them. It was fast and the indexes were small (compared to Lunr). The results seemed comparable. I felt a little guilty because it meant throwing out all the work that went into trying to use Lunr. For a change I felt like I should have written my own version sooner.

Wednesday, February 01, 2023

Fuzz Testing Database Queries

The gSuneido database was ready for production, as far as I could tell. All our tests passed.

But there was a steady trickle of bugs showing up. Sure, they were obscure cases, but they were showing up in actual application code, so they weren't that obscure.

Every time I'd think it was ready to deploy further, another bug would show up.

I finally decided I had to do something to flush out these obscure bugs. Waiting for them to show up in production was not the way to go.

I had thought about trying to fuzz test database queries before, but it always seemed too hard. Even if I could figure out a way to generate random but legal queries, how would I check the results? Then I realized I could compare jSuneido and gSuneido. It wasn't a perfect test since they were roughly the same design and therefore could have matching bugs. But I had made enough changes and improvements to gSuneido that they were now significantly different. And even if it wasn't a perfect test, it was a heck of a lot better than nothing.

I puzzled over how to generate valid random queries. And what kind of data to use. I started with a simple set of four tables with obvious joins between them. The joins seemed to be the most constrained element, so I started with simply picking randomly from a fixed list of joins.

e.g. cus join ivc

I represented the query as a tree of nested objects and wrote a function to randomly pick a sub-expression branch.

unions are added by randomly picking a branch and replacing it with: branch union branch

e.g. (cus join ivc) union (cus join ivc)

leftjoins are added by randomly replacing some of the joins.

Then where, rename, extend, and remove (project) are added at random spots.

(((cus where ck = 12) join ivc) union ((cus leftjoin ivc) extend x1)) remove c1

Finally it optionally adds a sort.

The resulting queries aren't necessarily realistic, but they seemed to cover most of the variations.

It's a little more ad-hoc than I originally hoped. You could generate random queries from the grammar, and they would be valid syntax, but queries also have to have valid semantics, and that isn't represented in the grammar.

First I just parsed and optimized the queries, no execution. This soon triggered some assertion failures which uncovered a few bugs.

The next step was to compare the actual results between gSuneido and jSuneido. I decided the simplest approach was to calculate a checksum of the result and output queries and their result checksums to a text file. Then I could re-run those queries on jSuneido and compare the checksums.
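The comparison step could look something like this (a sketch; CRC32 is my assumption here - any checksum over the ordered result rows would do, and `resultChecksum` is illustrative, not the actual code):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// resultChecksum computes a checksum over a query's result rows so the
// same query can be re-run on the other implementation and compared.
func resultChecksum(rows []string) uint32 {
	h := crc32.NewIEEE()
	for _, row := range rows {
		h.Write([]byte(row))
		h.Write([]byte{'\n'}) // separator so row boundaries matter
	}
	return h.Sum32()
}

func main() {
	// Emit "query <tab> checksum" lines to a text file for re-running.
	rows := []string{"ck=1 name=ann", "ck=2 name=bob"} // stand-in result set
	fmt.Printf("cus join ivc\t%08x\n", resultChecksum(rows))
}
```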

In the end I found about a dozen bugs. Several of them were in jSuneido, which is a little surprising since it has been in production with thousands of users for about 10 years.

The problem with finding a bug through fuzzing is that the inputs it finds are often messy. Some fuzzing systems will try to simplify the inputs for you. I just did it manually, starting the debugging of each failure by simplifying the query as much as possible while still retaining the failure. Fuzzing can make finding bugs easier, but it doesn't really help with fixing them. And the more you fuzz, the more obscure the bugs get. On the other hand, by the end I got pretty good at tracking them down, inserting prints or assertions or using the debugger.

I added all the failed queries to a "corpus" that gets run every time, along with new random queries as a defense against regressions. In theory they would get generated randomly again, but that could potentially take a long time.

I called it success when I ran about 500,000 queries (roughly two hours of processing) with no differences. The plan is to add this to our continuous test system and fuzz for an hour or so per night when we have spare compute power. That should prevent regressions and possibly find even more obscure bugs.

I'm pretty happy with how this turned out. It only took me a few days work to write the fuzzing system (not counting the time to actually fix the bugs), and it flushed out a bunch of bugs that I'm sure would have haunted me for a long time otherwise.

Of course, in hindsight I should have done this 20 years ago, or at least 10 years ago when I wrote a second implementation (jSuneido). Sometimes I'm a little slow! But better now than never.

Maybe the counterpart to Eric Raymond's "given enough eyeballs, all bugs are shallow" is "given enough fuzzing, all bugs are visible". (Of course, it's never "all", but you get the idea.)

Wednesday, January 04, 2023

Three Weeks for Three Tweaks

It started with reports of slow database queries. At first I didn't pay too much attention, after all, some queries are slow. We're dealing with quite large data (in a small business sense, not in Google terms) and customers are choosing what to select on and how to sort.

It progressed to the same query being slow sometimes and fast other times. That seemed a little more suspicious, but there were still a lot of factors that might have explained it.

Finally someone came up with a simple example where the same query, run on the same data, was 1000 times slower when you ran it slightly differently. That was bad news for me, since it definitely meant there was a problem with the query optimization. Why was it picking such a slow strategy sometimes, when there was an obviously better one? Especially when it was such an extreme difference. The optimization shouldn't have to be very accurate when there is a 1000 times difference!

It didn't turn out to be a bug, the code was working as designed. It was just a particular scenario that wasn't handled well by the current design.

After 20 years, the easy improvements have been made and I'm into the realm of diminishing returns. I ended up needing to "tweak" three aspects of the code in order to fix the issue. And all three were required before I could see if it was going to work. TL;DR - it did solve the problem.

One way of looking at the issue was that the optimization was getting caught in a local minimum. It wasn't smart enough to use a sub-optimal strategy in one spot in order to allow a better strategy overall. (It doesn't explore every possible strategy; for a complex query that would be far too slow.)

None of the changes were difficult, but they were all somewhat invasive, requiring changes to all the query operations.

Background

In case you actually want to try to follow what I'm talking about, here's a little background on how Suneido implements queries.

A query like:

(table1 join table2) union table3

gets parsed into a tree like:

where the leaves are database tables and the other nodes are operations. Each node has one or more strategies for execution, plus some parameters like choice of index.

Optimization starts with calling the optimize method on the root operation (union in this case) which then calls optimize on each of its sub-queries.

Execution works similarly by calling the get method on the root operation, which then "pulls" data from its sub-queries.

Tweak #1

Let's say we have a query like:

table1 join table2 where id=1

During optimization, query operations can ask their children for their estimated number of rows. In this case, join would get the table size from table1, e.g. 1000, and one (1) from the where, since it's selecting on a unique id. But that information isn't sufficient to estimate the fan-out of the join.

So the first change I made was to return the "population" count in addition to the row count. The where could return 1 for the result count and e.g. 10,000 for the population i.e. table2 size. This allows the join to estimate a more realistic fan-out of 1:10.

Tweak #2

The next problem was that when a temporary index is required, there are two costs - the "fixed" cost of building the index, and the "variable" cost of reading from it. But until now these had been combined into a single "cost". Other operations usually have only a variable cost.

Continuing the example from Tweak #1: now that the join can estimate the fan-out is 1:10, it can estimate it's only going to read 10 records from table2. So the variable cost will be low, but we'll still incur the entire fixed cost. So in this case we want to minimize the fixed cost. But to do that, I needed to separate the cost into fixed and variable parts everywhere.

Tweak #3

The remaining problem was that join didn't have any way to tell the sub-query that only a small part of the result would be read. To address this, I added a "fraction" argument to allow queries to tell their sub-queries how much of the result they estimated they would read.

In the running example, 10 rows from 10,000 would be a fraction of .001. Where before it would choose a strategy with e.g. a fixed cost of 1000 and a variable cost of 1000 (total 2000) over one with a fixed cost of 0 and a variable cost of 5000 (total 5000), now the 5000 would be multiplied by .001, giving 5, which is obviously preferable to 2000.
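The resulting cost comparison is simple to sketch (illustrative numbers from the example above; the names are mine, not gSuneido's):

```go
package main

import "fmt"

// Each strategy has a fixed cost (e.g. building a temporary index)
// and a variable cost (reading the result).
type strategy struct {
	name     string
	fixed    float64
	variable float64
}

// cost applies the fraction the parent expects to read
// to the variable part only; the fixed cost is always incurred.
func cost(s strategy, fraction float64) float64 {
	return s.fixed + fraction*s.variable
}

func main() {
	a := strategy{"temp index", 1000, 1000}
	b := strategy{"no index", 0, 5000}
	// Reading everything: the temp index wins (2000 vs 5000).
	fmt.Println(cost(a, 1), cost(b, 1))
	// Reading only .001 of the result: avoiding the fixed cost wins (1001 vs 5).
	fmt.Println(cost(a, .001), cost(b, .001))
}
```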

This allowed the optimization to avoid the local minimum.

Conclusion

I'm not sure how common the problem scenario is. It was only because we ran into such an extreme case (1000 times slower) that I got motivated to investigate. Perhaps less extreme cases also occur and will be improved by these changes. As far as I can reason, the changes should not make any cases worse.

Sunday, December 04, 2022

Blogging, Suneido, and Programming Languages

Another blogger I follow has announced they are stopping blogging. It's sad (to me) how the world has shifted away from blogging. Of course, there are still active blogs, but far fewer than there were. I thought maybe the pandemic would give people time to do more blogging but it didn't seem to work that way. Of course, it always surprised me that smart productive busy people would take the time to write good blog posts.

I'm no exception. I post less and less often, other than photographs on my non-work blog. But even that is a shift from my past longer text posts. Readers have dwindled too. (Not that I ever had a big audience.)

But looking back through some of my blog posts, I'm glad I wrote them even if hardly anyone read them. It's a bit like a sporadic diary, but more thoughtfully written than a personal diary would be. I started blogging in 2005. It would have been interesting to go back farther but even in 2005 the tech world was a different place than today.

Thinking about this led me back here. I was writing a document for my staff, and some of it seemed of slightly more general interest so I thought I'd write about it.

Here's roughly when the versions of Suneido went into "production" (used by customers)

  • 2000 cSuneido (C++)
  • 2010 jSuneido (Java)
  • 2020 gSuneido client (Go)
  • 2022 gSuneido server
  • 2022 suneido.js (browser client)

Two data points don't mean much, but it's interesting that it was roughly 10 years between the major versions. The concurrent release of suneido.js is because another programmer did the development. (I started it, but developing both gSuneido and suneido.js at the same time was too much.)

All three implementation languages (C++, Java, Go) were in their early stages when I started using them. I was lucky that all went on to become mainstream. Some of the other languages I considered, such as C#/.Net and Kotlin have also been successful. Scala and D not as much, although they're still around.

C++ is a fascinating language. But it's too complex and it's unsafe. Life is too short. (gSuneido does have some unsafe code to interface with Windows and it has been the source of the hardest bugs.)

Java has a world class runtime with a great garbage collector and JIT compiler. But the language itself is not really a good fit for low level systems programming. gSuneido, with a simple byte code interpreter and no JIT still manages to outperform jSuneido. If Project Valhalla ever gets released that will help. But it's been "coming soon" since 2014.

Go felt like a bit of a gamble at the time. The garbage collector wasn't very good and neither was performance. But the language was a good fit for implementing Suneido. And it seemed like Google was committed to it. Although I wish it was a little less of a one-company language.

One of the things that bugged me about Go was its lack of generics. But now that it has them, I don't really use them that much. The people that resisted generics would say "I told you so". But it did simplify certain kinds of code and I'm still glad to have it when I need it. 

I wish Go had better support for immutability. C++ and Java aren't much better but at least they have a little bit. Immutability (and pure functions) are one of the best tools to handle concurrency, and concurrency is supposed to be Go's strong point. I think there's potential for a conventional mainstream language with strong immutability. I realize there are functional languages with strong immutability, but so far they're not really mainstream.

The big thing these days is Rust, but Suneido is garbage collected, and unless you want to write your own garbage collector (no thanks, been there, done that) the implementation language also needs to be garbage collected.

As far as the Suneido language itself, I think it has stood the test of time reasonably well. It was never intended to be especially novel or original, although at the time object-oriented programming and dynamic languages were leading edge. There have been a few minor adjustments to the language, a bit of syntactic sugar here and there, but for the most part code written 20 years ago would still run today. There are a few things I'd change but overall, when I program in Suneido, there's not a lot I wish was different.

Tuesday, February 01, 2022

A Better LRU Cache for gSuneido

When I was working on gSuneido performance I discovered that the LRU cache in the standard library was one of the hot spots.

It was a simple design - a hash map of keys to values, plus an array of keys to track which was least recently used. When I wrote it, many years ago, it was just a quick addition. I had no idea it would be a heavily used bottleneck. Usage increased a lot when we added the easy ability to memoize a function.

I actually had to work backwards to find that LruCache was a bottleneck. What I first discovered (with the Go memory profiler) was that deepEqual was doing the majority of allocation. It was being called by SuObject Equal. The allocation was to track which comparisons were in progress, in order to handle self-referential data. I was able to tweak deepEqual to greatly reduce the allocation. That removed the majority of the bottleneck.

But why were there so many object comparisons? I added debugging to print the Suneido call stack every 100 comparisons. It was coming from object.Remove, which was being called by LruCache. To maintain its LRU list, it would move items to the most-recently-used end of the list by removing them and re-adding them. Since this was just an array, finding the item to remove meant a linear search, i.e. a lot of comparisons. Originally, caches had mostly been small, as you could tell from the default size of 10. But now the most common size was 200. 10 is reasonable for linear search; 200 is not so good.

In addition, originally keys were simple values like strings. But keys often ended up being several values in an object. The combination of linear searches through long arrays of composite values is what led to the bottleneck.

I decided to make LruCache built-in, i.e. written in Go. But I also needed a new design that would avoid the linear searches. The normal way to track LRU is with a doubly linked list. I started in that direction, but linked lists are not a good fit for modern CPUs. Because the entries are individually allocated, they are scattered in memory, which leads to CPU cache misses. Contiguous arrays are much better because memory is optimized for sequential access. You could store the linked list entries in a contiguous block, but you still have the overhead of pointers (or indexes) and you still don't have locality or sequential access.

I ended up with the following design.

  • an unordered array of key, value pairs (entries)
  • a hash map of keys to entry indexes (hmap)
  • an array of entry indexes in LRU order

A successful lookup becomes slightly more complex - look in the hash table for the entry index, and then get the value from the entries array. Finally, we have to find that entry index in the lru array (a linear search but of small integers) and move it to the most recently used end.

The hash table lookup will still require comparing multiple argument objects, but much fewer than a linear search.

To replace the oldest item, we take the entry index from the least recently used end of the lru array. From that entry we get the key to remove it from the hash table. And we reuse that slot in the entries array to store the new key and value.
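Here is a sketch of that design in Go (simplified: string keys and values, and for brevity Put assumes the key isn't already present - but the same three structures):

```go
package main

import "fmt"

type entry struct {
	key, val string
}

// LruCache: entries in a contiguous array, a map from key to entry index,
// and a small slice of entry indexes in LRU order.
type LruCache struct {
	entries []entry
	index   map[string]int // key -> position in entries
	lru     []int          // entry indexes, least recently used first
	size    int
}

func NewLruCache(size int) *LruCache {
	return &LruCache{index: make(map[string]int), size: size}
}

func (c *LruCache) Get(key string) (string, bool) {
	ei, ok := c.index[key]
	if !ok {
		return "", false
	}
	c.touch(ei)
	return c.entries[ei].val, true
}

// touch moves ei to the most recently used end of lru -
// a linear search, but over small integers, not composite keys.
func (c *LruCache) touch(ei int) {
	for i, e := range c.lru {
		if e == ei {
			c.lru = append(append(c.lru[:i:i], c.lru[i+1:]...), ei)
			return
		}
	}
}

// Put adds a new key, evicting the least recently used entry when full
// and reusing its slot in the entries array.
func (c *LruCache) Put(key, val string) {
	if len(c.entries) < c.size {
		c.entries = append(c.entries, entry{key, val})
		ei := len(c.entries) - 1
		c.index[key] = ei
		c.lru = append(c.lru, ei)
		return
	}
	ei := c.lru[0] // least recently used
	c.lru = append(c.lru[1:], ei)
	delete(c.index, c.entries[ei].key)
	c.entries[ei] = entry{key, val}
	c.index[key] = ei
}

func main() {
	c := NewLruCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Get("a")      // "a" is now most recently used
	c.Put("c", "3") // evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok)
}
```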

The new built-in LruCache was about 100 times faster than the old one. Overall, it improved the speed of our application test suite by about 5%. That’s pretty good for a couple days work.

ASIDE: There are two main aspects to a cache. The focus is usually on which items to evict i.e. least recently used or least frequently used. The other aspect is which items to add. The normal assumption is that you add everything. But actually, you really only want to add items you’re going to be asked for again. Otherwise you’re evicting potentially useful items. We can’t predict the future, but we can assume that items we see twice are more likely to be seen again. One approach is to use a bloom filter to track which keys you’ve seen and only add to the cache ones you’ve seen before.

Saturday, January 29, 2022

Checking Immutable Data

The database implementation in the Go version of Suneido relies a lot on immutable data that is shared between threads without locking.

The problem is that Go has no support for immutability. It doesn’t have const like C++ or even final like Java. (Go has const but only for simple numbers and strings.)

I was seeing some sporadic crashes and corruption. One possible cause was something accidentally modifying the theoretically immutable data. The immutable database state is a large complex set of linked structures processed by a lot of complex code. As with a lot of immutable persistent data structures you don’t copy the entire structure to make a new version, you basically path copy from each spot you want to change up the tree to the root. It's easy to miss copying something and end up mistakenly modifying the original.

I had recently been using checksums (hashes) to verify that the data read from the database file was identical to the data before it was written. (To verify the reading and writing code, not to check for hardware errors.)

I realized I could use the same checksums to detect erroneous changes to the immutable state. I ran a background goroutine that looped forever: it fetched the state (atomically), computed its checksum, waited a short time (e.g. 100 ms), and then computed the checksum again (on the same state). If something had incorrectly mutated the state, the checksums would be different.
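A sketch of that checker (assuming the state is published through an atomic pointer; the names and the CRC32 checksum are illustrative, not the actual gSuneido code):

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sync/atomic"
	"time"
)

// state stands in for the real linked database structures.
type state struct {
	data []byte
}

var current atomic.Pointer[state]

func checksum(s *state) uint32 { return crc32.ChecksumIEEE(s.data) }

// watch hashes the current state, waits, and hashes the SAME state again.
// Any difference means something mutated supposedly immutable data.
func watch(report func(string)) {
	for {
		s := current.Load()
		before := checksum(s)
		time.Sleep(100 * time.Millisecond)
		if checksum(s) != before {
			report("immutable state was mutated!")
		}
	}
}

func main() {
	current.Store(&state{data: []byte("immutable")})
	go watch(func(msg string) { fmt.Println(msg) })
	time.Sleep(250 * time.Millisecond) // let the checker run a couple of cycles
}
```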

And I actually found a few bugs that way :-)

Although unfortunately they were not the ones that were causing the crashes :-(

Friday, May 14, 2021

Another gSuneido Milestone

This screenshot probably doesn't look too significant - just the Suneido IDE. The only noticeable difference is down in the bottom left corner. Normally it would show something like:

Instead it shows:

That means gSuneido is running "standalone", i.e. using its own database instead of connecting as a client to a jSuneido database. While the surface difference is tiny, internally this is a huge jump.

I've been working away on the gSuneido database over the last year at the same time that we've been debugging and then rolling out the gSuneido client.

If I had just ported the jSuneido database implementation it would have been much easier, but what would be the fun in that? I kept the query implementation but redesigned the storage engine and transaction handling. I'd call it second system effect, but it's more like a third system since I also redesigned this for jSuneido.

I still have lots to do. Although the IDE starts up, it's quite shaky and easily crashed. Many of the tests fail. But even to get to this point a huge number of pieces have to work correctly together. It's a bit like building a mechanical clock and reaching the point where it first starts to tick.

Sunday, March 14, 2021

Twenty Years of cSuneido

Last week was a milestone. We (my company) finished converting all our customers from cSuneido (the original C++ implementation) to gSuneido (the most recent implementation, in Go).

That means I no longer have to maintain cSuneido and I no longer have to deal with C++. (sigh of relief)

cSuneido has had a good run. We first deployed it to the first customer in 2000, so it's been in continuous production use for over 20 years. It's served us well.

When I first started developing Suneido, in the late 1990s, C++ was relatively new on PCs. I started with Walter Bright's Zortech C++ which became Symantec C++ (and later Digital Mars C++). Later I moved to Microsoft C++ and MinGW.

Suneido, like most dynamic languages, is garbage collected. But C++ is not. I implemented a series of my own increasingly sophisticated conservative garbage collectors. But eventually I admitted my time was better spent on other things and I switched to using the Boehm-Demers-Weiser conservative garbage collector. Under normal use conservative garbage collection works well. But there are cases where memory is not recycled and eventually you run out. That was somewhat tolerable on the client side, but it wasn't so good on the server side. (That was one of the factors that prompted the development of jSuneido, the Java version that we use on the server side. Another factor was the lack of concurrency support in C++ at that time.) It seemed for a while that the pendulum was swinging towards garbage collection. But Rust has given new life to manual memory management.

Honestly, I won't be sorry to leave C++ behind. It has grown to be extremely complex, and while you can avoid much of that complexity, it's hard to not be affected by it. I've also had my fill of unsafe languages. Even after 20 years of fixing bugs, there are very likely still things like potential buffer overflows in cSuneido. (Ironically, one of the things that added a lot of complexity to C++ was template generics. Meanwhile I am anxiously waiting for the upcoming addition of generics in Go. However, Go's generics will be much simpler than C++'s Turing complete template programming.)

While it might seem crazy to re-implement the same program (Suneido) three times, it's been an interesting exercise. I've learned things each time, and made improvements each time. It's been extra work to maintain multiple versions, but it's also caught bugs that would have been missed if I'd only had a single implementation. Doing it in three quite different languages - C++, Java, and Go, has also been enlightening. And having the hard constraint of needing to flawlessly run a large existing code base (about a million lines of Suneido code) means I've avoided most of the dangers of "second system" effects.

So far I've only implemented the client side of gSuneido. We are still using jSuneido (the Java version) for the server side. I'm currently working on implementing the database/server for gSuneido (in Go). Once that's finished I intend to retire jSuneido as well and be back to a single implementation to maintain, like the good old days :-) And given where I'm at in my career gSuneido will almost certainly be the last implementation I do. I wonder if it will last as long as cSuneido did?

Monday, December 14, 2020

Coverage for Suneido

Since the early days of Suneido I've thought that it would be good to have some kind of coverage tool. But for some reason, I never got around to implementing it.

Code coverage is usually associated with testing, as in "test coverage". Lately I've been seeing it in connection to coverage based fuzz testing.

But coverage can also be a useful debugging or code comprehension tool. When you're trying to figure out how some code works, it's often helpful to see which parts are executed for given inputs. You can also determine that by stepping through the code in a debugger, but if the code is large or complex, that can be tedious, and doesn't leave a record that you can study.

For some reason I started thinking about it again and wondering how hard it would be to implement in gSuneido.

One of my concerns was performance. If coverage is too slow, it won't be used. And obviously, I didn't want to slow down normal, non-coverage execution.

While simple coverage is good, I find statement execution counts more useful. Statement counts verge on profiling, although profiling is generally more concerned with measuring time (or memory).

That got me wondering about approaches to counters. One interesting technique I came across is Morris approximate counters. That would allow an effectively unlimited count in a single byte. But I decided that the nearest power of two is a little too crude. Although it's usually not critical that large counts are exact, it's often helpful to see that some code is being executed N times or N+1 times relative to other code or to inputs.

16 bit counts (up to 65,535) are probably sufficient but I didn't want wrap-around overflow. I knew arithmetic that doesn't overflow was a standard thing but I couldn't remember the name. Eventually I found it's called saturation arithmetic. Sometimes people talk about "clamping" values to maximums or minimums (from electronics).
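A saturating 16-bit increment is a one-liner in Go — clamp at the maximum instead of wrapping back to zero:

```go
package main

import "fmt"

// satInc increments a 16-bit counter but saturates at the maximum
// (65535) instead of wrapping around to zero.
func satInc(n uint16) uint16 {
	if n == ^uint16(0) { // already at 65535
		return n
	}
	return n + 1
}

func main() {
	fmt.Println(satInc(41))    // 42
	fmt.Println(satInc(65534)) // 65535
	fmt.Println(satInc(65535)) // 65535 - saturated, no wrap-around
}
```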

Often, to minimize coverage work, tracking is done on basic blocks. Normally that's part of the compiler, but I didn't really want to mess with the compiler. It's complex already and I didn't want to obscure its primary function. I realized that instead, I could get equivalent functionality based on the branches in the byte code. Obviously if you branch, that's the end of a basic block. And where you branch to is the start of a basic block. So I only need to instrument the byte code interpreter in the branch instructions. Basically the interpreter marks the start of executed blocks (branch destinations) and the disassembler identifies the end of blocks (branch origins).
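The branch-based idea can be sketched with a toy interpreter. The opcodes below are invented for illustration (they are not gSuneido's instruction set); the point is only that the branch instruction is the single place that records block starts:

```go
package main

import "fmt"

// A toy byte code interpreter showing coverage tracked only at
// branches: every branch destination is recorded as the start of an
// executed basic block. (Block ends would be recovered separately
// from the branch origins, as the post describes.)
const (
	opNop  = iota // do nothing
	opJump        // unconditional branch; next slot holds target pc
	opHalt
)

// run executes code and returns the set of basic-block start pcs
// that were actually reached.
func run(code []int) map[int]bool {
	blockStarted := map[int]bool{0: true} // pc 0 starts the first block
	pc := 0
	for {
		switch code[pc] {
		case opNop:
			pc++
		case opJump:
			target := code[pc+1]
			blockStarted[target] = true // branch destination = block start
			pc = target
		case opHalt:
			return blockStarted
		}
	}
}

func main() {
	// pc 0: nop, pc 1-2: jump to 4, pc 3: nop (never reached), pc 4: halt
	code := []int{opNop, opJump, 4, opNop, opHalt}
	fmt.Println(run(code)) // map[0:true 4:true] - pc 3's block never ran
}
```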

I decided this would be a fun weekend project and a break from working on the database code. That didn't work out so well. At first I made rapid progress and I had something working quite quickly. Then things went downhill.

If I was tracking at the byte code level, I needed to connect that back to the source level. I had a disassembler that could output mixed byte code and source code, so that seemed like the obvious place to do it. Then I found that the disassembler didn't actually work that well. I'd only ever used it as a tool to debug the compiler. I spent a bunch of time trying to make the disassembler cover all the cases.

Meanwhile, I was starting to get depressed that it was turning into such a mess. The more I worked on it, the uglier it got. I don't usually mind when I take a wrong turn or decide to throw out code. That's just the way it goes sometimes. But when I can't find a good solution, and things keep getting worse, then I'm not a happy camper.

In the end, I had something that mostly worked. I checked it into version control so I'd have a record of it, and then I deleted it. My idea of using branches to identify basic blocks was valid, but the actual implementation (that I came up with) was far from elegant.

I maybe should have given up at this point. The weekend was over, I had more important things to work on. But I still thought it would be a worthwhile feature. And if I came back to it later I'd just have to figure it out all over again.

Once I abandoned the ugly code I felt much better. I decided to take a simpler approach. I'd add an option to the code generation to insert "cover" instructions (instrumentation) at the beginning of each statement. That was just a few lines of code. Then I just needed to implement that instruction in the byte code interpreter - a handful more lines of code. The overhead was relatively small, somewhere in the neighborhood of 5 to 10 percent.
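A sketch of that simpler scheme, with invented opcodes: a "cover" instruction carrying a statement index is emitted at the start of each statement, and the interpreter bumps a saturating per-statement counter when it executes one.

```go
package main

import "fmt"

// Toy illustration of per-statement coverage instrumentation.
// Code generation (not shown) would emit opCover with a statement
// index before each statement; the interpreter's only extra work is
// the opCover case. Opcodes here are invented for the example.
const (
	opCover = iota // next slot holds the statement index
	opWork         // stands in for any real instruction
	opHalt
)

// run executes code and returns saturating execution counts,
// one per statement.
func run(code []int, nstmts int) []uint16 {
	counts := make([]uint16, nstmts)
	for pc := 0; ; {
		switch code[pc] {
		case opCover:
			i := code[pc+1]
			if counts[i] < ^uint16(0) { // saturate at 65535
				counts[i]++
			}
			pc += 2
		case opWork:
			pc++
		case opHalt:
			return counts
		}
	}
}

func main() {
	// two "statements", each executed once
	code := []int{opCover, 0, opWork, opCover, 1, opWork, opHalt}
	fmt.Println(run(code, 2)) // [1 1]
}
```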

And that was the core of it. A bit more code to turn it on and off, and get the results. Way easier, simpler, and cleaner than my first "clever" approach. Hopefully it will prove to be a useful feature.