Monday, December 14, 2020

Coverage for Suneido

Since the early days of Suneido I've thought that it would be good to have some kind of coverage tool. But for some reason, I never got around to implementing it.

Code coverage is usually associated with testing, as in "test coverage". Lately I've been seeing it in connection with coverage-based fuzz testing.

But coverage can also be a useful debugging or code comprehension tool. When you're trying to figure out how some code works, it's often helpful to see which parts are executed for given inputs. You can also determine that by stepping through the code in a debugger, but if the code is large or complex, that can be tedious, and doesn't leave a record that you can study.

For some reason I started thinking about it again and wondering how hard it would be to implement in gSuneido.

One of my concerns was performance. If coverage is too slow, it won't be used. And obviously, I didn't want to slow down normal, non-coverage execution.

While simple coverage is good, I find statement execution counts more useful. Statement counts verge on profiling, although profiling is generally more concerned with measuring time (or memory).

That got me wondering about approaches to counters. One interesting technique I came across is Morris approximate counters. That would allow an effectively unlimited count in a single byte. But I decided that the nearest power of two is a little too crude. Although it's usually not critical that large counts are exact, it's often helpful to see that some code is being executed N times or N+1 times relative to other code or to inputs.
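
For reference, a Morris counter keeps only an exponent and increments it probabilistically, so a single byte can represent counts far beyond 255. A small sketch (my own illustration, not code I used):

import (
	"math"
	"math/rand"
)

// Morris approximate counter: the stored byte c represents a count of
// roughly 2^c - 1, and is incremented with probability 1/2^c
func morrisIncrement(c uint8) uint8 {
	if c < 255 && rand.Float64() < 1/math.Pow(2, float64(c)) {
		c++
	}
	return c
}

func morrisEstimate(c uint8) float64 {
	return math.Pow(2, float64(c)) - 1
}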

16 bit counts (up to 65,535) are probably sufficient but I didn't want wrap-around overflow. I knew arithmetic that doesn't overflow was a standard thing but I couldn't remember the name. Eventually I found it's called saturation arithmetic. Sometimes people talk about "clamping" values to maximums or minimums (from electronics).
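
A saturating increment is just a guarded increment; for example (an illustration, not the actual code):

import "math"

// saturating increment: stick at the maximum instead of wrapping to zero
func satInc(n uint16) uint16 {
	if n < math.MaxUint16 {
		n++
	}
	return n
}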

Often, to minimize coverage work, tracking is done on basic blocks. Normally that's part of the compiler, but I didn't really want to mess with the compiler. It's complex already and I didn't want to obscure its primary function. I realized that instead, I could get equivalent functionality based on the branches in the byte code. Obviously if you branch, that's the end of a basic block. And where you branch to is the start of a basic block. So I only need to instrument the byte code interpreter in the branch instructions. Basically the interpreter marks the start of executed blocks (branch destinations) and the disassembler identifies the end of blocks (branch origins).

I decided this would be a fun weekend project and a break from working on the database code. That didn't work out so well. At first I made rapid progress and I had something working quite quickly. Then things went downhill.

Since I was tracking at the byte code level, I needed to connect that back to the source level. I had a disassembler that could output mixed byte code and source code, so that seemed like the obvious place to do it. Then I found that the disassembler didn't actually work that well. I'd only ever used it as a tool to debug the compiler. I spent a bunch of time trying to make the disassembler cover all the cases.

Meanwhile, I was starting to get depressed that it was turning into such a mess. The more I worked on it, the uglier it got. I don't usually mind when I take a wrong turn or decide to throw out code. That's just the way it goes sometimes. But when I can't find a good solution, and things keep getting worse, then I'm not a happy camper.

In the end, I had something that mostly worked. I checked it into version control so I'd have a record of it, and then I deleted it. My idea of using branches to identify basic blocks was valid, but the actual implementation (that I came up with) was far from elegant.

Maybe I should have given up at this point. The weekend was over, and I had more important things to work on. But I still thought it would be a worthwhile feature. And if I came back to it later I'd just have to figure it out all over again.

Once I abandoned the ugly code I felt much better. I decided to take a simpler approach. I'd add an option to the code generation to insert "cover" instructions (instrumentation) at the beginning of each statement. That was just a few lines of code. Then I just needed to implement that instruction in the byte code interpreter - a handful more lines of code. The overhead was relatively small, somewhere in the neighborhood of 5 to 10 percent.

And that was the core of it. A bit more code to turn it on and off, and get the results. Way easier, simpler, and cleaner than my first "clever" approach. Hopefully it will prove to be a useful feature.

Tuesday, December 01, 2020

Leaving HEY

TL;DR - I used HEY for five months, gave it a fair trial I think, had no huge complaints, but I've gone back to Gmail. For the details, read on.

I was excited when Basecamp (formerly 37 Signals) announced their HEY email. I asked for and eventually received an invitation. I’ve been a long time Gmail user, from back when you needed an invitation. Back in the days when Google’s motto was “do no evil”. They’ve since replaced that motto with “get all the money”. Which doesn’t make me comfortable giving them all my email.

Basecamp has a good record of supporting old products. (I still hold a grudge over Google Reader.) And they are more privacy minded than Google. I didn't have a problem paying for the service. Paying customers are often treated better than the users of "free" services.

I like a lot of the HEY features - screening emails, spy blocking, reply later, set aside, paper trail, feed, etc.

One thing that bothered me was that it was a closed walled garden. Traditionally, email has been built on standards (like SMTP, POP, and IMAP). You can use Thunderbird to access your Gmail, use Apple Mail to access your Hotmail, etc. HEY lets you forward mail in or out, but that's as open as it gets. You can't access your HEY email from another client, and you can't use the HEY client to access other email accounts. Their explanation for this is that their special features aren't interoperable. I'm not sure I totally buy that. It seems like a more believable reason is that it simplifies what is undoubtedly a large challenge. And of course, the HEY software itself is not open source. I prefer open solutions, but I use other walled gardens, like the Apple ecosystem.

It was easy to redirect my Gmail and start using HEY. It was easy to learn and use. I was quite happy with it, and I must admit, a bit excited to be in on the beginning of something. Like when I first started using Gmail. But it wasn't long before I started running into annoyances.

On the positive side, it was nice being able to report an issue and actually get a response from what appeared to be a human. (Hard to tell with some of the bots these days.) Good luck with trying to report a problem to Gmail.

One of my big frustrations was not being able to view attachments. You could argue that's not really the responsibility of an email client. But I was accustomed to being able to view pdf's (e.g. resumes on job applications) with a single click. That single click in HEY just took me to a file save dialog. So I could download it (cluttering up my download folder and taking up storage), then find the downloaded file, and then open it in a separate application. No more quick glance at the contents of an attachment. That was using the standalone Mac app. If I accessed HEY from my browser it was a little better (if I could convince Firefox I didn't want to download it). The funny part was that HEY displays a thumbnail, and on iOS you can zoom in and read it. So obviously they were already interpreting the content, they weren't just treating attachments as opaque blobs. I told myself this was a minor issue but it continued to bug me.

There were quite a lot of bugs at first. In some ways that's not surprising for a new ambitious project. But I have to admit I was initially a little disappointed. I guess I've drunk a little too much of the Basecamp/David/Jason kool-aid and had high expectations. I told myself they would fix them, give them time. And some did get fixed. But others didn't. For example, the touted Feed uses infinite scrolling, except when it needs to load more content there's a noticeable pause and afterwards the view is messed up. You lose your scroll position and all the items are doubled. Not exactly a great experience. I can imagine most of the testing happened without enough data to hit that. They even mentioned it in a job posting as the kind of thing you might work on.

Then I listened to a podcast with David where he talked about how hard they'd worked to fix bugs after the release. But that they'd had to put that aside to work on their promised HEY for Work. Great, too busy adding features to fix bugs. Heard that story before. Then he went on to talk about how bugs are overrated, they're not really a big deal. You shouldn't feel bad about your bugs, they're "normal". They should have been playing "Don't worry, be happy". I'm exaggerating, I understand where he's coming from. And I agree there's a big difference between cosmetic bugs and functional bugs. And bugs that affect a minority of users versus ones that affect the majority. But it's a slippery slope. Where does that Feed issue fit? Is that a "cosmetic" issue to be shelved? To me it was a major problem, but I realize that's a judgement call.

To me, telling programmers not to worry about bugs is just asking for a bug filled product. And once you have a lot of bugs, it starts to feel pointless to fix them. Personally, I'm ok with feeling a little bad about my bugs. Not to the point of flagellation, but enough to make me e.g. write more tests next time.

I also found I'd (not surprisingly) grown accustomed to, if not dependent on, a whole assortment of Gmail features. I screwed up and sent an email to the wrong recipient, which I would have caught with Gmail's undo feature. I was used to typing a few characters of a contact and having Gmail suggest the right person, whereas HEY constantly suggested contacts I never used. The Feed is a nice idea, but it's (currently) a pretty minimal feed reader. It doesn't keep track of what you've read, and if you get interrupted, there's no way to pick up where you left off. You have to scroll down and try to remember what you've read. I've switched to using Gmail filters to forward feed type material to my Feedly. Filters are another feature missing (or omitted) from HEY.

I'm not writing HEY off. I have my account and I don't mind having paid for it. I think they're trying to do something worthwhile and I don't mind supporting that. I'll keep an eye on it for potential future use.

I'm not completely happy going back to Gmail. I don't have anything particular to hide, but I'm not a fan of surveillance capitalism - of companies like Google making their profits from selling my private information, or the ugly things done with that information by the companies that buy it.

Wednesday, September 09, 2020

Checksums

I want to have checksums on parts of the database in gSuneido. In cSuneido I used Adler32, which is available in Go, but was it the best option? (Checksums are hash functions but designed for error detection, as opposed to hash functions designed for cryptography or for hash tables.)

I didn't find a lot of clear recommendations on the web. Adler32 is a variant of Fletcher and there was some question over which was better.

A Cyclic Redundancy Check (CRC) should be better at error detection than Adler or Fletcher. But most of what I could find talks about single bit errors. I'm more concerned with chunks of data not getting written to disk or getting overwritten. It's unclear how that relates to single bit errors.

I was also wondering if I could get away with a 16 bit (2 byte) checksum. But the Go standard library doesn't have any 16 bit checksums. I could use half the bits of a 32 bit checksum. In theory a good checksum should spread the information over all the bits, so it seemed reasonable. But how would that affect error detection?

One drawback to CRC's is that they don't handle zero bytes very well. And unfortunately, memory mapped files can easily have extra zero bytes (their default value) if data doesn't get written. But in that case the zero bytes would be there in place of actual data, which should result in different checksums.

My other concern was speed. Adler and Fletcher are simpler algorithms than CRC so they should be faster. Adler uses a theoretically slower division than Fletcher, but the Go compiler converts division by constants into multiplication by the reciprocal so that shouldn't matter.

I wrote a quick benchmark (using Go's handy benchmark facility) and I was quite surprised to see that crc32 was much faster than adler32. Of course, the results vary depending on the details. This was for 1023 bytes of random data.
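
The benchmarks were along these lines (a sketch rather than the exact code, using Go's standard hash/adler32 and hash/crc32 packages):

import (
	"hash/adler32"
	"hash/crc32"
	"math/rand"
	"testing"
)

var data = make([]byte, 1023)
var castagnoli = crc32.MakeTable(crc32.Castagnoli)
var sum uint32 // sink so the compiler can't optimize the calls away

func init() { rand.Read(data) }

func BenchmarkAdler32(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sum = adler32.Checksum(data)
	}
}

func BenchmarkCrc32IEEE(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sum = crc32.ChecksumIEEE(data)
	}
}

func BenchmarkCrc32Cast(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sum = crc32.Checksum(data, castagnoli)
	}
}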

BenchmarkAdler32-24              3124998               332 ns/op
BenchmarkCrc32IEEE-24           11993930                91 ns/op
BenchmarkCrc32Cast-24           24993260                47 ns/op

I assume the speed difference is due to hardware (instruction set) support for CRC.

I found third party crc16 implementations, but I don't think they are using hardware support so I assume it will be faster to use half of crc32.

I also found that IEEE slowed down with random block sizes, e.g. a checksum of 1023 bytes was a lot slower than a checksum of 1024 bytes. Castagnoli was better in this respect, and supposedly it's also better at error detection.

I also did a bunch of manual experimenting with error detection. I found using 16 bits of crc32 worked quite well (it missed only one or two more errors out of 100,000 cases). It didn't seem to matter which 16 bits. (Dropping to 8 bits made a bigger difference, not surprisingly.)

So for now I'm going with 16 bits from crc32 Castagnoli.
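
In code that amounts to something like this (a sketch; any 16 of the 32 bits should do):

import "hash/crc32"

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// checksum16 returns a 16 bit checksum taken from crc32 Castagnoli
func checksum16(data []byte) uint16 {
	return uint16(crc32.Checksum(data, castagnoli))
}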

Wednesday, August 12, 2020

A New Assert Package for Go

 

Back when I first started using Go I wrote an assert package similar to Hamcrest. You used it like:

Assert(t).That(x, Equals(y))

That has worked ok, but it has a few weaknesses. I blame it partly on being relatively new with Go when I wrote it.

  • It requires a dot import, which brings all its public names into the current package scope. That's not recommended because it pollutes the name space and can easily lead to name clashes. (Technically it doesn't "require" a dot import, but it would be very awkward to use otherwise.) The other problem with a dot import is that the code doesn't mention the package, so automatic imports don't work, so I end up copying the import from another file.
  • The way it works isn't really amenable to e.g. Intellisense in VS Code. You have to know the names of the functions (like Equals).
  • It's more complex than it really needs to be - matchers return closures, which is clever, but complicates the code and makes it harder to understand.
  • It's not amenable to use outside of tests, so I wrote another tiny verify package, which was similar but different.

I finally got irritated by it enough to write a new version that addresses these issues.

Here's how you use the new package:

assert.That(a || b)
assert.This(x).Is(y)
assert.T(t).This(x).Like(y)
assert.Msg("first time").That(a || b)
assert.T(t).Msg("second").This(fn).Panics("illegal")

If you're doing multiple assertions and you don't want to specify .T(t) each time, you can do:

assert := assert.T(t)

The testing.T is optional (if it's not supplied errors just panic) so you can use asserts in regular code. (And I don't need my redundant verify package.)
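
The implementation doesn't need to be complicated; here's a rough sketch of the shape (my illustration of the style, not the actual package):

import (
	"fmt"
	"reflect"
	"testing"
)

// asserter carries the optional testing.T and message through the chain
type asserter struct {
	t   *testing.T
	msg string
}

func T(t *testing.T) asserter            { return asserter{t: t} }
func Msg(s string) asserter              { return asserter{msg: s} }
func (a asserter) Msg(s string) asserter { a.msg = s; return a }

func That(cond bool) { asserter{}.That(cond) }
func (a asserter) That(cond bool) {
	if !cond {
		a.fail("assertion failed")
	}
}

// value adds the actual value so Is (and friends) can compare against it
type value struct {
	a asserter
	x interface{}
}

func This(x interface{}) value              { return value{x: x} }
func (a asserter) This(x interface{}) value { return value{a: a, x: x} }

func (v value) Is(expected interface{}) {
	if !reflect.DeepEqual(v.x, expected) {
		v.a.fail(fmt.Sprintf("expected %v but got %v", expected, v.x))
	}
}

func (a asserter) fail(s string) {
	if a.msg != "" {
		s = a.msg + ": " + s
	}
	if a.t == nil {
		panic(s) // no testing.T supplied - usable outside of tests
	}
	a.t.Helper()
	a.t.Error(s)
}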

Now VS Code Intellisense works much better. If you type assert. it will suggest appropriate functions like That, This, T, Msg. If you type assert.This(x). it will suggest Is, Isnt, Like, Panics.

One advantage of my old style is that you could write additional matchers in a separate package. But I never actually did this. And it's my code so if I need another matcher, I can just add it to the assert package.

One of the popular Go testing packages, Testify, uses a different style:

assert.Equals(t, x, y, "message")

There's nothing wrong with that style, and in some respects it's simpler. But I've never liked it, partly because it's unclear which is the actual value and which is the expected value.


Tuesday, August 04, 2020

Sorted List

Let's say you want to add a large number (e.g. up to millions) of values to a list, one at a time, in no particular order, and at the end you want to iterate through them in sorted order. There are a number of ways you can accomplish this.

Most languages have some sort of list data structure. In C++ you could use a vector, in Java an ArrayList, in Go a slice. Most of these grow by doubling (or possibly some other factor) in size as required. At the end you can simply sort the list. There are several drawbacks to this approach. One is that you have wasted space. Assuming doubling, the list is between 50% and 100% full, so you waste 25% on average. You also allocate roughly double the amount of space you end up with. (e.g. to get 8 elements you allocate 1, 2, 4, and then 8) In a garbage collected language this puts extra pressure on the garbage collector. And each time the size doubles you have to copy all the values to the new space.

One way around the space, allocation, and copying overhead is to split the list up into fixed size blocks and grow the list by adding more blocks.
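
A block list is simple; a sketch (made-up, with a fixed block size):

const blockSize = 4096

// blockList grows by whole blocks, avoiding the repeated copying
// and wasted space of a doubling slice
type blockList struct {
	blocks [][]int
}

func (b *blockList) Add(x int) {
	n := len(b.blocks)
	if n == 0 || len(b.blocks[n-1]) == blockSize {
		b.blocks = append(b.blocks, make([]int, 0, blockSize))
		n++
	}
	b.blocks[n-1] = append(b.blocks[n-1], x)
}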

You could use an ordered set data structure (assuming the values are unique) like a Java SortedSet or a C++ set. These are often implemented with something like a red-black tree which is not very efficient space-wise since each value has one or more pointers associated with it.

Log structured merge (LSM) trees are another option. If you just use lists for each level, the space overhead is minimal. Like the simple list, you can split the levels into blocks to reduce space and allocation overhead.

But we don't really need an ordered data structure since we don't need the order till all the values have been added. So an ordered set or LSM tree is doing more work than necessary.

My current implementation in jSuneido evolved from an LSM tree. It uses a list of blocks and at the end it sorts each block individually and does a series of two-way merges to combine them. Deciding what to use for gSuneido, it struck me that it would probably be more efficient to just sort the list as a whole, since the two-way merges were merging (compare & copy) each value log2(n) times. It was relatively easy to modify my jSuneido implementation to try that. Curiously, it was slower. That seems a bit odd since the overall sort should do fewer compares and copies. I have to assume that once again modern CPU design is making intuition invalid. I would guess that sorting the blocks individually is fast because they fit into memory cache. And merging is fast because it's reading and writing sequentially which is optimal for cache usage.

You could merge all the blocks at once with an N-way merge, rather than a sequence of two-way merges. This would mean fewer comparisons and copies, but it doubles the space usage during the merge. A nice aspect of a block list and two-way merges is that you can recycle the blocks. As soon as you finish merging a block, you can put it on a free list to reuse. This means the space overhead for two-way merging is only a couple of blocks instead of double the size. (which reduces allocation, which reduces GC)
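
A two-way merge of sorted blocks is straightforward; a simplified sketch (merging into a single destination slice here; in the real thing the output is also blocks, so a source block can go on the free list as soon as it's consumed):

// merge combines two sorted blocks into dst, which should have
// sufficient capacity to avoid reallocation
func merge(a, b, dst []int) []int {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if a[i] <= b[j] {
			dst = append(dst, a[i])
			i++
		} else {
			dst = append(dst, b[j])
			j++
		}
	}
	dst = append(dst, a[i:]...)
	dst = append(dst, b[j:]...)
	return dst
}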

One interesting aspect of this approach is that the sorting and merging can be done incrementally. As soon as you fill up a block you can sort it. And as soon as you have two (or power of two) blocks you can merge them. This doesn't change the amount of work required, it just redistributes it over time. Spreading work more evenly and making it less bursty can be beneficial but it's not a clear win.

But then I realized you could do the sorting and merging concurrently with building the list. Again, it doesn't change the amount of work, it just allows some of it to be done in parallel. There are various ways to do this. I created a worker goroutine and triggered it via a channel. You could just start a new goroutine each time, but that's a bit more overhead. In Java I'd be thinking about a thread pool, but goroutines are lightweight enough that it's probably not necessary.
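
The concurrent version just needs a worker fed by a channel; a simplified sketch of the idea (not the actual code):

import "sort"

// startSorter starts a worker goroutine that sorts each block as it is
// sent, concurrently with the caller adding more values
func startSorter() (chan<- []int, <-chan struct{}) {
	work := make(chan []int)
	done := make(chan struct{})
	go func() {
		for block := range work {
			sort.Ints(block) // could also merge previously sorted blocks here
		}
		close(done)
	}()
	return work, done
}

// usage: send each block on work as soon as it fills up; when finished
// adding, close(work) and then <-done before merging/iterating in order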

I ended up writing multiple Go versions so I could benchmark them. The results were pretty much what I expected. Simple slice and sort is fast, but lots of allocation. Block list is much less allocation but slightly slower overall. Merging is slightly faster than a complete sort. Block list with concurrent sort/merge was the winner. It had the lower memory usage of the block list, and was 10 to 20% faster due to the concurrency.

BenchmarkSimple  30 35164847 ns/op 664934 B/op 23 allocs/op
BenchmarkBlocks  26 43607473 ns/op 131139 B/op  6 allocs/op
BenchmarkMerge   27 37056507 ns/op 197306 B/op 31 allocs/op
BenchmarkConcur  37 28624720 ns/op 197768 B/op 34 allocs/op


This was with a block size of 4096 and adding 16,384 items.

The initial code is on GitHub (this points to the original commit so the link doesn't break; in the future the latest version is probably a better bet)

Thursday, July 23, 2020

Hash Array Mapped Trie in Go

When you want an immutable persistent (in the functional sense) hash table, a Hash Array Mapped Trie is a common choice.

I recently implemented one to use in the gSuneido database. I procrastinated because I had a vague memory that it had been tricky to implement in jSuneido. But when I finally got around to doing it, it turned out to be relatively easy.

A Hash Array Mapped Trie (HAMT) is a tree of nodes. Each node has an integer bit map (commonly 32 bits) that specifies which slots are occupied, and a variable size array of items. The bitmap means you don't waste space on empty slots like you do in most hash tables.

Each level of the tree consumes 5 bits (for a bitmap of 32 bits) of the hash value. In most implementations, the items are pointers, and can either point to a value, or a child node. The items could be void* in C++ or Object in Java. The equivalent in Go would be interface{} which can hold a value of any type. But Go interface{} is double the size of a simple pointer. And on top of that, it requires casting to get the proper types back out. This approach also doesn't allow embedding small values, they have to be referenced by pointer, so it's not as cache friendly. (An interface can reference a non-pointer type like a struct but internally it's still using a pointer.)

I decided to take a slightly different approach and use two bitmaps and two slices (Go's variable sized arrays). One bitmap and slice for values, and one bitmap and slice for child pointers. That way the slices could be the proper types, no casting was required, and small values could be embedded rather than referenced by pointer. The tradeoff was that it uses slightly more space.
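
The node layout looks roughly like this (a sketch with made-up names; the real code is generated for specific key/value types):

import "math/bits"

// one entry per set bit in the corresponding bitmap; the slices are
// kept dense so no space is wasted on empty slots
type node struct {
	valBitmap uint32
	vals      []item // values stored directly, not behind interface{}
	ptrBitmap uint32
	children  []*node
}

type item struct { // placeholder for the generated key/value type
	key uint32
	val string
}

// index of a slot's entry within its slice: count the set bits below it
func slotIndex(bitmap uint32, slot uint) int {
	return bits.OnesCount32(bitmap & (1<<slot - 1))
}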

After finishing an initial implementation, I realized that this design allowed a slight improvement over the usual approach.

At a given node, if you have a value in a slot, and you want to insert another value with the same 5 bits of hash (a collision), you replace the value with a pointer to a child node and put the two values in the new node. If your hash is bad, they can collide again forcing yet more child nodes, poor space usage, and longer searches. (When you run out of hash bits you either have to rehash or use overflow nodes.)

But with separate slices for child pointers and values, you can have both for a given slot. So when you get a collision, you still create a child node, but you don't need to move the old value. So the new node only has one element and there cannot be any further collisions from this insertion, even if the hash values are identical. And it actually simplifies the code and reduces the average search length slightly.

The other variation in my implementation is that the data structure operates in two modes, either immutable (read-only) and safely sharable between threads, or mutable (updatable) and not sharable. Like most persistent immutable tree data structures, making a change requires path copying to the root. That's a lot of overhead to do for every insert or delete. My code tends to operate in two modes, either a single thread is updating, or multiple threads are sharing immutably. By having a mutable mode, the path copying is shared/reduced. e.g. every insert/delete doesn't have to make a new root node. I use a "generation" integer value to know whether a node is mutable. If the node's generation doesn't match the current generation, then we make a copy of the node, and set its generation to the current one so it will be mutable from now on. To switch everything back to immutable (freeze it) we simply increment the current generation. (Don't have to touch the nodes.) 
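
The generation check itself is small; continuing the sketch above (and assuming each node also carries a generation field):

type Hamt struct {
	root       *node
	generation uint32 // bumped by Freeze; older nodes are immutable
}

// mutable returns a node that is safe to update in place,
// copying it first if it belongs to an older (frozen) generation
func (h *Hamt) mutable(n *node) *node {
	if n.generation != h.generation {
		c := *n // copy-on-write, at most once per generation
		c.generation = h.generation
		return &c
	}
	return n
}

// Freeze makes the whole tree immutable without touching any nodes
func (h *Hamt) Freeze() {
	h.generation++
}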

Until Go gets generics, I'm using the Genny code generator to create typed versions. It's not perfect, but it works well enough.

The code is in Github as usual.

Update 2020-09-15

I discovered a bug in my implementation. If you insert two colliding items, the second will go into a child node. If you then delete the one from the parent node, you end up with a situation my code didn't handle. It assumed that if the value slot of a node was empty, then that hash didn't exist. One solution would be to check child nodes. But it made more sense to put the burden on the less common delete operation. Now when you delete a value from a node, it "pulls up" a value from a child node if there is one. This maintains the invariant that if a value slot of a node is empty, then there are no child nodes of that slot. And by pulling up from the lowest possible child, it also keeps the tree as shallow as possible.

Wednesday, July 22, 2020

Immutable B-tree Redirects

Regular update-in-place B-trees suffer from write amplification. To update a single key you usually end up (re)writing a whole node. In addition, SSD storage also suffers from write amplification because it can't update in place, it has to write new copies to an erased block.

Immutable B-trees are worse. Not only do you have to write a whole node, you have to allocate new space for that node. Plus you need to update the tree path leading to that node, which means allocating and writing several more nodes. (This "path copying" is common in immutable persistent data structures.) This can make index file space grow excessively, and require more frequent compaction or garbage collection. (On the positive side the writing is sequential which is more efficient.)

I ran into this problem when I wrote the jSuneido immutable B-tree implementation. To reduce it to a manageable level, I ended up persisting indexes to disk periodically (once per minute) rather than after each transaction. This reduces both the write and space amplification.

There are various other approaches to reducing the amplification. For example, instead of writing out a copy of an entire node, you can write only the delta (differences). The downside is that to get the current version of the node, you need to read possibly multiple deltas plus the original node.

With the new version of the database that I'm working on for gSuneido, I took the same approach of periodically writing index updates. I knew from jSuneido that this approach was workable but it still bugged me. After all, the whole point of a new design was to make improvements.

To reduce memory allocation (and subsequent garbage collection) I use a hash map of "redirects" and defer path copying till writing to disk. An obvious approach is to write the redirects to disk instead of doing the path copying. But it seemed like that had the same problem as writing node deltas (redirects are really just a kind of delta) - they improve writing at the cost of slowing down reading.

My aha moment was when I realized that I already have those redirects in memory in a fast hash table for use during transactions. They would only need to be read from disk when the database starts up. The only additional cost would be larger redirect hash tables, but this has minimal effect on speed.
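
Conceptually the redirects are just a map consulted when following node offsets; a simplified sketch:

// redirects map old node offsets to their updated copies, deferring
// the path copying; following a node offset checks the map first
type redirects map[uint64]uint64

func (r redirects) resolve(off uint64) uint64 {
	if newOff, ok := r[off]; ok {
		return newOff
	}
	return off
}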

Of course, at some point too many redirects will accumulate and it makes sense to "flatten" them, i.e. rewrite the nodes and eliminate the redirects.

Thinking through the ramifications of this design (long runs are good for that), I realized that one of the tricky parts was going to be "flattening" the redirects. If all you store are the original and updated offsets for the redirect, then you'd need to search the whole B-tree to find the spot to update. My first thought was that I just needed to also store the node the redirection applied to. But that's not sufficient because you have to update its parents all the way up the tree. Obviously, to path copy, you need the path. But to store the offsets of all the nodes on the path from the root to the leaf would double or triple the size of the redirections. This would somewhat defeat the purpose given that we're storing the redirections to save space.

Instead of storing the path to each redirection, I ended up storing a separate set of tree nodes on the path to redirects. This is smaller because paths share a lot of tree nodes e.g. the root node is on every path. When traversing the tree to find nodes to flatten, only tree nodes in the set need to be followed.

I have this implemented and it seems to work well. I don't have a complete enough system to benchmark yet. It'll be interesting to see what overall impact it has.


Wednesday, May 06, 2020

Eclipse Bug

This one had me puzzled for a while. I went to run some jSuneido code in Eclipse and I got a weird VerifyError, i.e. something wrong with the byte code. Probably just a glitch, a corrupted file or something. I recompiled the code. Same error. Meanwhile, all the JUnit tests (running all the time through Infinitest) were passing fine. And when I built the final jar, it worked fine as well.

Maybe my Eclipse or my configuration was messed up. I connected to my office machine and tried it there. And got the same error.

The problem was, I couldn't remember when this last worked. I've been mostly working in Go on gSuneido. I've made the occasional minor changes in jSuneido but probably didn't try to run anything inside Eclipse.

Maybe it was a Java 14 issue? Eclipse isn't very good at supporting the latest version of Java. Currently you have to install a patch to get Java 14 support. So I installed Java 13 and tried again with that. Same error.

I tried reinstalling Eclipse (2020-03). Didn't help. I tried installing the previous version of Eclipse (2019-12). Still no good.

I have to admit I was getting quite frustrated by this point. I'd had several computer issues in the last few days and I didn't need yet another one. I even considered switching IDE's. I installed JetBrains IntelliJ (I have a license for all their products) and it had no problem running the code. But I struggled just to get that far with IntelliJ. It's quite an investment to learn an IDE and I didn't really want to start over. Especially when I'm not doing much Java development these days.

I narrowed it down to a problem with the Eclipse Java compiler (they have their own, separate from the JDK). If I compiled with the JDK it was fine, but if I compiled with Eclipse, it wouldn't work inside Eclipse or outside.

I vaguely remembered that I had fallen a bit behind with Eclipse and had jumped over the previous version (2019-12). So I installed 2019-09. And it worked! Finally!

By this point I was fairly sure it was an Eclipse Java compiler bug. I searched the Eclipse bug database (and the web) but couldn't find anything related.

I figured I should report a bug, but was, as usual, hesitant. There are lots of Eclipse users, why was I the only one to run into this? Was it something I was doing wrong? (Although in that case, why did the JDK work fine?) It sounds silly, but I was afraid of looking stupid in public.

I decided if I could create a simple example that reproduced the issue, then I'd file a bug. Luckily, by then I had a good idea of the issue and it wasn't hard to make the example.

Basically the bug came from the equivalent of:

MyFunc(condition ? derived1() : derived2())

derived1 and derived2 returned private classes with a common public parent (all from another package).

The type of the ?: expression should be the common public parent. But in the Eclipse bytecode it was Object (basically an unknown type). For some reason it wasn't looking past the private classes to see the common public parent. Curiously, if I assigned the ?: expression to a variable of the parent type, it handled that fine. Or if I rewrote the ?: as an if-else it was also fine. Using javap to look at the bytecode I could see the JDK stack map showed the correct parent type, but the Eclipse stack map showed Object.

I posted the bug and the next morning I saw:

I will investigate. Reproduced with your example.
Thanks! Great test case. This is really helpful

It was good to see it was valid. Later comments confirmed the bug and that the fix was trivial and would go into future versions.

Friday, April 17, 2020

Constant Propagation and Folding

Constant folding and constant propagation are related compiler optimizations.


Folding is the evaluation of constant expressions at compile time. For example:

60 * 60

=> 3600

Propagation is the inlining of variables with constant values. For example:

minutesPerHour = 60
minutes = hours * minutesPerHour

=> minutes = hours * 60

The cSuneido compiler is single pass i.e. it generates byte code while parsing, without generating an intermediate AST. It does constant folding during byte code emission, a bit like a peephole optimizer. For example, if it sees a sequence like Push 5, Push 6, Add, it replaces it with Push 11.

jSuneido does have an intermediate AST. It does constant folding during code generation. This is integrated into the process. i.e. at the top level it "folds" an AST into a function constant.

gSuneido took yet another approach. It did constant folding during AST generation via a decorator around the AST builder. This has the nice characteristics that it is separate from parsing and code generation, and it avoids building constant expression parts of the AST.
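
The decorator amounts to wrapping the AST builder; a rough sketch with made-up node types (not the actual gSuneido compiler code, and only handling one case, but it shows the shape):

type Node interface{}

type Constant struct{ Value int }

type Binary struct {
	Op       string
	Lhs, Rhs Node
}

// Builder is what the parser calls to construct AST nodes
type Builder interface {
	Binary(op string, lhs, rhs Node) Node
}

type plainBuilder struct{}

func (plainBuilder) Binary(op string, lhs, rhs Node) Node {
	return &Binary{Op: op, Lhs: lhs, Rhs: rhs}
}

// foldingBuilder wraps another Builder and folds constant expressions
// as the AST is built, so 60 * 60 never even becomes a Binary node
type foldingBuilder struct{ Builder }

func (f foldingBuilder) Binary(op string, lhs, rhs Node) Node {
	if l, ok := lhs.(*Constant); ok {
		if r, ok := rhs.(*Constant); ok && op == "*" {
			return &Constant{Value: l.Value * r.Value} // fold 60 * 60 => 3600
		}
	}
	return f.Builder.Binary(op, lhs, rhs)
}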

You might wonder how useful constant folding is. It's likely not critical to performance (unless it happens to be in a hot loop). But it actually occurs relatively frequently. For example, long strings:

s = "..." $
    "..." $
    "..."

Or calculations with units:

seconds = days * 24 * 60 * 60

Knowing that you have constant folding helps avoid people doing e.g. seconds = days * 86400 for performance.

Neither cSuneido nor jSuneido do constant propagation. Again, it's likely not critical to performance. But it would be nice to have, for similar reasons as folding.

For example, if you wrote the above example like this, it would not get folded unless you do constant propagation:

hoursPerDay = 24
minutesPerHour = 60
minutesPerDay = hoursPerDay * minutesPerHour
minutes = days * minutesPerDay

Ideally, you want this to get compiled to simply: minutes = days * 1440
(admittedly somewhat contrived but you get the idea)

When constant folding and constant propagation are done as separate processes, you need to do them iteratively (repeatedly) until there's nothing more to be done. e.g. with the example above, first you'd propagate hoursPerDay and minutesPerHour, then you'd fold the multiplication, and finally you'd propagate minutesPerDay. But if the two are combined, it can be done in a single pass.

I mostly worked on this as a fun break from less interesting work. So I wasn't too concerned if the payoffs weren't huge. It was an interesting problem. But I was concerned about not adding too much overhead. Suneido loads and compiles code on demand, from source. So you want compiling to be fast. Computers have thankfully gotten much faster over the years, but that free lunch has almost run out. We do have more cores, but I'm not quite ready to do parallel tiered compilation.

It started off quite well. It didn't take much code to get the basics working. I was trying to do it in a single pass (traversal) of the AST. That's easy in the simple cases like the examples above. But then I started running into awkward cases. For example, as soon as you have loops you don't have linear execution. You can't inline a variable when you encounter it because it might get modified later in the loop. Or if a variable you thought was constant is modified, then you can no longer inline it. And you couldn't remove a constant assignment until you determined if it was ever modified. The code got more and more complicated until I finally admitted to myself that this approach didn't make sense.

I thought it would be much simpler if I made two passes. The first pass would find "final" variables (using Java terminology) that are only assigned once and never changed. This makes the next pass quite simple. That made the code much simpler, but the overhead was larger than I liked - about 20% on compiling all of stdlib.

An obvious optimization was to merge the first pass (determining final variables) into the parser. It was nicer to have it separate, but I didn't want the overhead. That brought the overhead down to about 10%.

But I still had two complete traversals over the AST. The new one to do propagation and folding, and another one that determined which blocks needed to be closures. The propagation and folding could probably be merged into the code generation, but it needed to be done before the blocks, and the blocks couldn't be merged into code generation. And neither could be merged into the parsing because of needing the results of the final variables.

It seemed I was stuck, but then I realized that these optimizations weren't applicable to all code. If there were no blocks (which I could easily detect during parsing) then you could skip that pass. And if there were no final variables (also determined during parsing), then you didn't need to do propagation. However, I still wanted to do folding, so I put back the original folding during AST generation (sharing the same code). This is somewhat redundant, but the folding during AST generation adds very little overhead.

That brought the overhead down to a few percent, almost lost in the noise. I was happy with that.

I didn't do any performance measurements. I could easily come up with microbenchmarks showing it was a big benefit. At the same time, I'm also sure it doesn't make a big difference overall. It's nice to know we have it. I like that we can write code more clearly without worrying about performance.

As usual, the code is on GitHub. The main commits are:

Friday, April 10, 2020

Sometimes Bugs Are Easy

Finding a bug can be a long frustrating process. Those are the ones I tend to write about. But occasionally the stars (and the tools) align and it goes quickly and easily. I figured I'd write about one of those for a change.

I released (internally) a new version of gSuneido. Users soon found a bug that hadn't been in the previous version. It had been about two weeks and roughly 40 commits since the last version.

Thank goodness for version control. It's long ago, thankfully, but there was a time when we programmed without any version control. The bad old days. Of course, my memory was better back then.

The bug was easy to recreate, which made it easy to manually bisect my Git versions and find the commit that introduced the bug. A good reminder of the efficiency of binary search.

The offending commit only changed a single file, so that simplified things. There were a lot of changes within that file, but they were independent. So I could again manually bisect the changes and quickly find the one with the bug.

And to top it off, the fix was easy.

Thursday, April 09, 2020

Editor Color Schemes

Have you noticed that editor color schemes tend to treat comments one of two ways?
  • downplay the comments e.g. low contrast gray text
  • emphasize the comments e.g. high contrast colored text
I suspect the tendency to downplay comments comes from coding styles that have a lot of "mandatory" comments, big blocks of boilerplate that isn't very relevant, or comments on almost every line. In this case the comments can obscure the actual code, so you want to downplay their visibility.

Whereas if the style is to have only a few, relatively important comments, then you want them to stand out.

Documentation comments are somewhere in between. Some editor/language combinations let you color documentation comments differently. That's easier with e.g. Javadoc where you use special /** commenting but harder with e.g. Go that just uses regular comments in specific contexts.

Top level documentation comments aren't usually a big issue because they aren't mixed with the code and don't obscure it too much. Given a choice I would make these "normal", neither important enough to highlight, nor distracting enough to downplay.

I can't find the reference now, but I saw something recently about an editor/ide (Jetbrains?) that allowed you to toggle between displaying doc comments either as formatted preview style, or as editable "raw" text. That seems like a good option. Most of the time you just want to read the doc comments and you don't want to see e.g. the Javadoc HTML tags. Go avoids this issue by using more or less plain text (a bit like Markdown) for their comments. A formatted preview style would differentiate documentation from code without downplaying it.

What's your style? Highlight comments or downplay comments?

Personally, I don't write a lot of comments and I like them to stand out.


Wednesday, March 11, 2020

Go Bug

Go 1.14 came out recently. I updated without really thinking about it, expecting a painless experience like all the other versions. I rebuilt gSuneido (the Go version of Suneido), ran it, and ... it crashed, with an unfamiliar internal Go error.

At first I just thought the build hadn't worked properly or I had a mix of versions. Or maybe Go hadn't installed quite right.

But re-installing and re-building didn't solve the problem. It still crashed. Annoyingly, all my tests passed. On the positive side (at least in terms of debugging) it crashed consistently, every time, the same way. And if I built exactly the same source code with the previous version (1.13.8) it worked fine.

The crash call stack had some references to defer. (Go's equivalent to try-finally.) And one of the big improvements in Go 1.14 was to ... defer. That was very suspicious, but I wasn't quite ready to blame the crash on Go yet.

One big question was "why me?". (In this kind of situation it is actually a valid question.) If it was a bug in Go why weren't other people running into it? I'm not that much of a unique snowflake. Yet, despite searching the Go issues and the web in general, I couldn't find anyone reporting a similar issue. However, it was possible that I was (ab)using panic/defer/recover in a way other people aren't. For better or worse, the Suneido language has exceptions (like Java and C++). And the easiest way for me to implement exceptions was to use panic/defer/recover. Go strongly discourages this kind of thing. I'm pretty sure Go true believers would regard what I was doing as an abomination. (Another reason I was reluctant to submit a bug report.)
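
The general shape of what I'm doing is straightforward; a simplified sketch (not the actual interpreter code) of using panic/defer/recover for language-level exceptions:

// a Suneido exception is carried as a panic with a distinguishing type
type suneidoError struct{ value string }

func throw(msg string) {
	panic(suneidoError{msg})
}

// catch runs block, recovers a Suneido exception, and passes it to
// handler, which may throw again (the re-panic case)
func catch(block func(), handler func(string)) {
	defer func() {
		if e := recover(); e != nil {
			if se, ok := e.(suneidoError); ok {
				handler(se.value)
				return
			}
			panic(e) // not one of ours - re-panic
		}
	}()
	block()
}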

The win32 interface in gSuneido uses unsafe and C code, so that was the first thing to be suspicious of. It was quite possible there was a bug in that code that hadn't surfaced before. Or perhaps an unintentional dependency on some internal feature that had inadvertently changed in 1.14.

Thankfully, it didn't take long to recreate the problem without running any of the win32 interface. Which also allowed me to recreate it on Mac OS X.

The issue seemed to be related to defer, recover, and re-panic'ing (equivalent to catching and re-throwing an exception in other languages). But strangely it seemed to require one sequence of defer/recover/panic first, and then a second sequence would crash. It was as if the first sequence was corrupting something or leaving something behind that caused the second sequence to crash.

At this point, after several days of debugging, I was starting to lean towards a bug in Go 1.14. I was hesitant to submit a bug report because I didn't have a small example to reproduce the problem. So based on what I'd discovered so far I tried to create a standalone test case. In retrospect, I think I was close, but none of my test programs would crash.

I decided to submit the bug anyway. At least if anyone else ran into it, there would be something to find when they searched. And maybe someone would have some suggestions on how to debug it further.

crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

I got some responses right away which was nice. One of the first suggestions was to try building with -gcflags=-N which turns off various optimizations, including the new defer handling. Confusingly, it still crashed. Which seemed to indicate it wasn't the defer changes. Luckily at that point the author of the defer changes, Dan Scales, offered to help debug it. I was able to come up with a package of files to allow him to recreate the crash. It wasn't a small example, it was the whole Suneido source code, but at least he could reproduce the crash. It turned out that -gcflags=-N did actually fix the problem. The catch was that setting that flag didn't force recompilation, so I was still running old code. I should have used -gcflags="all=-N". (These "internal" options aren't well documented.)

So Dan confirmed that the bug was in his changes, and he quickly came up with a fix. He also suggested a workaround. The defer optimizations didn't kick in if the defer was in a loop. So I could put a one iteration loop around the relevant defer's. (That seems a little odd - I would have thought the compiler would optimize away one iteration loops. Maybe that happens in a later stage.) The question was which defer's. At first I only added the workaround to defer's that re-panic'ed. But it still crashed sometimes (albeit less often). I ended up adding the workaround to all the defer's with recovers and that seemed to fix the problem.
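
The workaround itself is tiny; roughly (a sketch - the one iteration loop exists only to keep Go 1.14 from open-coding the defer):

func interp() {
	for i := 0; i < 1; i++ { // defeat the new defer optimization
		defer func() {
			if e := recover(); e != nil {
				// handle the exception, possibly re-panic
			}
		}()
	}
	// ... interpreter body that may panic ...
}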

One of the problems is that this is happening in gSuneido's byte code interpreter. So it's impossible to statically determine which code might call which code, which is what's critical to this bug.

I was hoping that the fix would be cherry picked and back ported to 1.14.1 rather than having to wait for 1.15 which is six months away. Thankfully, it looks like that is going to happen.

I was impressed by the response of the Go team/community. They took my bug report seriously and gave me good feedback. No one hassled me about my lack of a small example or my abuse of panic/defer/recover. And the problem was rapidly tracked down and fixed. You can't ask for much more than that. (I also had a good experience with the other bug I reported, which was also fixed - 35492)

Tuesday, January 07, 2020

Recurring Nightmares

gSuneido has been coming along nicely. It's solid enough that it has replaced cSuneido for my personal use. Of course, there are still issues but I'm gradually cleaning them up.

cSuneido has a Windows COM/OLE interface that we primarily use for the Windows browser control. I used go-ole to implement this in gSuneido. I wasn't completely happy with taking a dependency on a small project that wasn't very active, but the code was fairly straightforward and I knew I could write my own if I needed to.

go-ole worked well and I thought that area was complete. But when I got other people testing gSuneido more issues came up. Ctrl+C (copy) in a browser window crashed. Some links crashed. Passing an empty string caused a later crash.

I spent several days trying to debug the problems. The crashes were access violations inside Windows which meant you don't have a lot of insight into what's happening. Windows is a very large black box.

Working on a previous issue I had wondered if it would be better to isolate the COM/OLE interface the same way I had isolated the DLL interface. (see gSuneido Roller Coaster) I'd actually started implementing this but abandoned it when I solved the bug another way.

These COM/OLE bugs were somewhat different from the DLL issues - more reproducible and less random. But they also had a lot of similarities. I suspected there were some threading issues but I don't understand COM/OLE threading well enough to be sure.

Searching on the web I found more issues with interfacing between Go and Windows. Often they caused crashes. Go's threading model, garbage collection, and stack movement do not play well with Windows. As I've come to know, it's a dangerous area to work in. Note: This is not really a flaw with Go itself. If you stick to straight Go you won't run into these problems. It's only a problem when you have "unsafe" code calling external Windows stuff. Unfortunately, that's what I'm doing.

I decided to finish isolating the COM/OLE interface. It was a bit of a gamble since I had no idea if it would solve the problems. But it didn't seem like it would take too long and I didn't have a lot of other ideas.

It only took me a day or two to replace go-ole with my own interface. But ... it didn't work. I kept getting DISP_E_TYPEMISMATCH and CTL_E_INVALIDPROPERTYVALUE. I had basically ported my cSuneido implementation which I knew worked. cSuneido is 32 bit whereas gSuneido is 64 bit, but that didn't seem to be the problem. I also had the go-ole code as an example. I spent a whole day trying to figure out why it didn't work. I added logging to all three implementations and they looked identical.

Finally I realized that the IDispatch Invoke arguments need to be in reverse order. The way I'd added the logging had hidden this and I'd only been looking at the values of the arguments. A day's worth of time wasted!

Once I made that trivial but crucial fix, it all started working.

And even better, the original weird bugs and crashes were gone :-)

So my hunch was right and my gamble paid off (this time). And as a side bonus, I no longer have a dependency on an external library.

Tuesday, October 22, 2019

gSuneido Roller Coaster

One of the questions with the Go version of Suneido (gSuneido) was the Windows UI interface. In the long run I'm hoping to switch to a browser based front end, but that's a long term project and I would like to replace cSuneido sooner than that. One option is to implement cSuneido's general purpose DLL interface. But this involved some ugly low level code (even some assembler) that I didn't really want to port. Another option was to use Go's syscall facilities to write the specific Win32 functions that we needed. I did a quick check on how many functions we needed to open the IDE. It was about 100. I knew the final number would be higher, but it seemed reasonable to hand write that many. One advantage of doing this is that it moved the "unsafe" code from Suneido to Go, which seemed safer.

Of course, it took much longer than I expected to write (and debug) the interface functions. There were about 300 in the end. (It's not just the functions, it's converting to and from all the structs involved as well.) I definitely had a few doubts whether this was the right approach. In retrospect, I'm pretty sure I could have ported the general purpose interface in less time. It was also a bit depressing knowing that if/when we got the browser interface finished, then this code would all be thrown out. If I was to do it again, I might write a tool that would take the cSuneido definitions and generate Go language definitions.

But finally I got near to complete and I could actually run the Suneido GUI. That felt pretty good.

Except there were intermittent crashes. I didn't worry about it too much at first. This kind of low level unsafe code can easily crash if you have bugs. But even after I cleaned up the majority of the bugs, it was still crashing. It worked most of the time, so I didn't think it was blatant errors in the code. I spent days trying to track it down. Of course, like many elusive bugs, it almost never crashed in the debugger.

My first thought was something related to garbage collection. But disabling garbage collection didn't help.

Another possibility was stack movement. Because Go supports high numbers of goroutines, they start out with relatively small stacks, which grow as necessary. This is a tricky process because you have local variables on the stack, and pointers between them, all of which have to be adjusted when the stack is moved. I couldn't figure out how to debug this. There was no way to turn it off, and no way to trace when it happened. I tried to detect it by storing the address of a stack value in an integer where it wouldn't get adjusted. But this didn't seem to work and I never did figure out why. (Part of the problem is that unlike C(++), in Go you don't control whether things are on the stack or the heap. That's all handled automatically by the compiler. It's quite possible that my value was ending up on the heap and therefore useless for detecting stack movement.)
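
For what it's worth, the check I tried looked something like this (a sketch, and as noted it may be defeated by the value escaping to the heap):

import "unsafe"

// record a stack variable's address as a plain integer (which the
// runtime won't adjust) and see if the variable is still there later
func stackMoved() bool {
	var x int
	before := uintptr(unsafe.Pointer(&x))
	// ... do work that could grow and move the goroutine stack ...
	return uintptr(unsafe.Pointer(&x)) != before
}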

Needless to say, this was another low point. For starters, all that work for nothing - it was useless if it crashed randomly. And secondly, what now? If I couldn't reliably interface with the Windows GUI from Go, then the only other option I could see was to write the GUI interface in a separate C(++) program and use some kind of interprocess communication. Presumably that would work, but it would be ugly.

It might seem odd that something so objective and abstract would involve so much emotion like elation and depression. But software development (at least so far) is a human activity, and humans are nothing if not subjective and emotional.

One thing that puzzled me was that I knew there were several other Go projects that interfaced with the Windows GUI. (Like https://github.com/lxn/win and walk) Why didn't they have this problem? I started searching and soon found this issue:

don't use Go heap for MSG
It's not entirely clear to me what's going on here, and I'm only
half-certain that this fixes the issue, due to it being somewhat hard to
reproduce ... While this might point to deeper Go runtime issues, we
just work around this here by allocating the variable using GlobalAlloc.
At the very least, I haven't been able to reproduce the bug after
applying this patch.

Following the links and looking at the related issues it sounded very much like what I was seeing. I applied their fix/workaround and sure enough the crashes went away, accompanied by feelings of relief and joy.

Part of the problem/complexity is that the way the Windows GUI works, there are nested callbacks and DLL/syscalls. i.e. Windows calls application code which calls Windows code which calls application code, and so on. On top of that, Go syscalls and callbacks do a complicated dance to switch between Go stacks and syscall stacks, and attempt to protect non-Go code from garbage collection and stack movement. It's not at all surprising that there would be ugly corner cases.

Well, my happiness lasted all of a day before I had another crash. Hmmm... was that the old problem still there, or something else? I crossed my fingers that it was something unrelated. There's a fine line between wishful thinking and having a positive attitude. Sadly, the crashes kept happening and it became obvious the problem wasn't solved.

But the lxn/walk issue had seemed so similar. And the fix had certainly reduced the number of crashes. If the issue was passing pointers to stack values, then I was doing this in a lot more places than just the message loop MSG. Whereas the lxn win/walk code always took the pointers as arguments rather than declaring them locally. (Of course, I didn't figure this out quite as step by step as it sounds. There was a lot of frowning and frustration first.)

I tried changing a few of the common crash sites to avoid anything Go might put on the stack. That reduced crashes even further. Of course, once the crashes get rare, it becomes very difficult to know whether you've fixed anything or not. It always seems to work at first. Just when you're ready to declare success, then it crashes again.

It seemed like I was on the right track, so I took the plunge and changed all the functions to allocate temporary values on my own "heap/stack". This was a bit of a gamble since it was a lot of changes (several days work) with no guarantee it would help.

But the gamble seemed to pay off. No more crashes! I was a happy camper.

I soon had more or less the whole IDE working. There were numerous little glitches but they were mostly easy to fix. Until one glitch that was more problematic. I happened to try dragging some text within an editor window, and it didn't work. At first I thought the drop was failing since the text didn't appear at the destination. But I soon realized it was getting inserted at the very beginning or end of the text, instead of where I was trying to drop it. I couldn't remember how drag and drop worked so I had to do a little digging. At this point I was still assuming it was just some minor error in one of the functions. But it wasn't anything obvious. I narrowed it down to a couple of notification callbacks (EN_CHANGE or SCEN_MODIFIED). If those callbacks did anything significant then the drop would go to the wrong place. It didn't seem to matter what they did. I never did narrow it down to whether it was call stack or heap usage.

The problem was, this was a bit of a dead end. Inserting seemingly harmless, irrelevant code in between Windows and Scintilla (the editor control) caused drag and drop to fail. I looked at the Scintilla code but it didn't seem to be doing anything unusual or suspicious.

I spent another couple of days on this issue until I admitted defeat. Again, there seemed to be something going wrong in the depths of Go's complex syscall / concurrent garbage collection / stack movement. In theory I should submit a Go bug report. But, understandably, they'd want a small example that consistently reproduced the problem. And I don't have anything close to that. I have a large system that crashes rarely.

Another low point. Now what? Such a small symptom, but there's no such thing as minor corruption - it's a bit like being a little bit pregnant.

Back to my previous idea of a separate GUI process written in C. Probably the fastest interprocess communication would be shared memory. But you'd still need some kind of synchronization. It still seemed ugly.

Did I really need a separate process? I want some isolation, but maybe I can do that within a single process. Go has a facility for incorporating C code in your Go program (Cgo). What if the C GUI code ran in a separate OS thread, but still within the same process? That would separate the Windows GUI code from all the Go issues of concurrent garbage collection and stack movement. But you wouldn't have the overhead and complexity of a separate process and interprocess communications. And you could, of course, share memory. (carefully!)

But you still need synchronization. When the Go side made a call to the C side, it would need to wait for the result. Similarly when the C side made a callback to the Go side, it would need to wait for the result. One option was to use a Windows synchronization facility. That would mean a syscall on the Go side. The other option was to use a Go synchronization facility, which would mean a callback from the C side. I decided it made more sense to use Go synchronization because I suspected that would play better with the Go scheduler and concurrency.

One way to handle the synchronization was to use Go channels. If it was pure Go code I might have gone that way. But with C involved it looked like sync.Cond would be simpler. When I searched for how to use sync.Cond, what I found was an issue where someone suggested the documentation needed to be improved. I wouldn't have thought that would be at all contentious. But it seems some of the Go team don't like sync.Cond and don't think people should use it, and therefore don't want the documentation improved. Personally, I question the logic of that. It seems like poor documentation just increases the chances of people misusing it. I appreciate the Go team and I'm grateful for all they've done, but sometimes there seems to be a bit of a "we know better" attitude. (Of course, most of the time, I have no doubt they do indeed know better.) Another example is the refusal to supply a goroutine ID or any kind of goroutine local storage, again because "it might be abused". And yet internally, they use goroutine IDs. I'm just an ignorant bystander, but it seems like a bit of a double standard to me.

Despite the poor documentation (and with the help of Stack Overflow) I figured out how to use sync.Cond. Whether I used it appropriately is a question someone else can debate. No doubt some people would say I should be using channels instead.
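The basic call-and-wait handoff with sync.Cond looks roughly like this (my own simplified illustration, not the actual gSuneido code): the calling side posts a request and sleeps on the condition variable until the servicing side signals that the result is ready.

package handoff // hypothetical package, for illustration only

import "sync"

type call struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending bool // a request is waiting to be serviced
	done    bool // the result is ready
	arg     uintptr
	result  uintptr
}

func newCall() *call {
	c := &call{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Invoke is used by the calling side: publish the argument, wake the
// servicer, then wait until the result is marked done.
func (c *call) Invoke(arg uintptr) uintptr {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.arg, c.pending, c.done = arg, true, false
	c.cond.Broadcast()
	for !c.done {
		c.cond.Wait()
	}
	return c.result
}

// Service is used by the other side: wait for a request, compute the
// result, and wake the caller.
func (c *call) Service(f func(uintptr) uintptr) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for !c.pending {
		c.cond.Wait()
	}
	c.pending = false
	c.result = f(c.arg)
	c.done = true
	c.cond.Broadcast()
}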

Using Cgo turned out to be fairly straightforward. It requires GCC so I installed Mingw64. I decided to stick to C rather than use C++. For the small amount of code I had to write, it seemed simpler and therefore safer. It took a few tries to get the synchronization working properly. The code wasn't complex, but as with all concurrency, the devil is in the details.
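The overall shape of the Cgo side is something like this (a minimal sketch with assumed names; the callbacks from C back into Go are omitted): the C preamble defines a thread that owns the Windows message loop, and Go just starts it.

package gui // hypothetical package, for illustration only

/*
#include <windows.h>

// This thread has an ordinary fixed C stack; Go's garbage collector and
// stack movement never touch it.
static DWORD WINAPI guiThread(LPVOID arg) {
	MSG msg;
	while (GetMessageW(&msg, NULL, 0, 0) > 0) {
		TranslateMessage(&msg);
		DispatchMessageW(&msg);
	}
	return 0;
}

static void startGuiThread(void) {
	CreateThread(NULL, 0, guiThread, NULL, 0, NULL);
}
*/
import "C"

// Start launches the dedicated GUI thread. All Windows calls are then
// marshalled to that thread instead of being made directly from goroutines.
func Start() {
	C.startGuiThread()
}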

Thankfully, it didn't require a lot of changes to my existing Go code. I basically just swapped out the syscall and callback facilities.

Of course, it was another gamble to implement a whole approach on the basis of not much more than a gut feeling it should work. And gut feelings are notoriously unreliable in software development. But it only took a day or two to implement, and I was back to running the Suneido IDE. And the moment of truth - would drag and drop work now? It did! By this point my reaction was more a sigh of relief than elation.

Are there other bugs lurking? Almost certainly. But I'm cautiously optimistic that I've solved the crashing and corruption. The only thing talking to Windows is now a single C thread (with a regular dedicated fixed stack and no garbage collection or even heap allocation), which is a very tried-and-true setup. You can't get much more basic than that. The only interaction is C calling a Go function to signal and wait. No stack or heap pointers are shared between the two sides. Go's concurrent garbage collection and stack movement can work totally independently of the C thread. And as a bonus, it's more efficient to run the message loop in C, with no Go syscall overhead.

I was all ready to post this, but I was a little nervous because I really hadn't tested much. I didn't want to declare victory and then find another problem. Of course, there were always going to be "other problems".

So I did some more testing and it seemed a little sluggish. At first I didn't pay much attention - there's so much background stuff going on in our computers these days that can affect speed. But it didn't go away. I did a quick benchmark of a simple Windows call and got 70 us per call. That seemed high but I wasn't sure. I checked cSuneido and it was 2 us. Ouch. Maybe it was just Go syscall overhead? (i.e. can I blame someone else?) No, directly from Go it was 0.3 us. My guess is that there is some kind of bad interaction with sync.Cond, but I don't really know.

Another downer, just when I thought I'd won. C'est la vie. The obvious alternative (that I had considered earlier) was to use Windows synchronization. Luckily Windows condition variables and critical sections were similar enough to Go sync.Cond that it didn't take long to switch.
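The replacement is the same wait/signal handoff, but built on the Windows primitives, roughly like this (a sketch with invented helper names, not the actual code): a critical section plus condition variable in the C preamble, with thin Go wrappers.

package wsync // hypothetical package, for illustration only

/*
#include <windows.h>

static CRITICAL_SECTION cs;
static CONDITION_VARIABLE cv;
static int done;

static void sync_init(void) {
	InitializeCriticalSection(&cs);
	InitializeConditionVariable(&cv);
}

// wait_done blocks the calling thread until signal_done is called.
static void wait_done(void) {
	EnterCriticalSection(&cs);
	while (!done)
		SleepConditionVariableCS(&cv, &cs, INFINITE);
	done = 0;
	LeaveCriticalSection(&cs);
}

static void signal_done(void) {
	EnterCriticalSection(&cs);
	done = 1;
	LeaveCriticalSection(&cs);
	WakeConditionVariable(&cv);
}
*/
import "C"

func Init()   { C.sync_init() }
func Wait()   { C.wait_done() }
func Signal() { C.signal_done() }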

I didn't actually hold my breath when I ran the benchmark, but I mentally had my fingers crossed. And it worked - now it was 1 us per call, better than cSuneido. And the IDE no longer felt sluggish.

I've been running this version for a couple of days, and it seems that this chapter of the saga is over, so I will post this and move on to other bugs.

Wednesday, October 09, 2019

Thread-Safe Tree Comparison

When I implemented jSuneido (Java) I ignored a lot of potential concurrency issues. (That's not as bad as it sounds - there is almost no concurrency in our application code.) But Go (gSuneido) is a lot stricter. I was surprised to get fatal potential data race errors even without the race detector.

So I was more or less forced to handle some of these issues. As usual, concurrency is trickier than you think, even if you think it's tricky.

One issue I ran into was comparing containers. The obvious approach is to compare recursively (which is what cSuneido and jSuneido did). My first attempt at making it thread-safe was to lock each container as you recursed into it. But then I realized that this would acquire multiple locks, which has the potential to deadlock. (I found sasha-s/go-deadlock useful.) The normal solution is to make sure you lock/unlock in a consistent order. But I did not have control over the order of recursion.

I searched the web for algorithms but I didn't find much. (Which was part of what prompted me to write this up.)

One option is to just use thread-safe low level operations e.g. to get a child. But locking at a low level means a lot of lock overhead. (Uncontended locks are fast these days, but they can still have performance implications.)

The only other decent solution I could come up with was to lock the object, copy the list of children, and then unlock it.

But I didn't want to add a bunch of heap allocation to copy the children. So I switched to an explicit stack and iteration instead of recursion. This may still allocate when it grows the stack, but after that it can reuse the space (which should be in cache). And it's much less than if you allocated a list for each child.

I implemented "equals" first. When I went to implement "compare" (less or greater than) it was a little trickier because I needed lexicographic order, whereas my stack was LIFO. That turned out to be relatively easy to handle by pushing the children onto the stack in reverse order.
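A rough sketch of the approach (simplified, invented types - not the actual gSuneido code): an iterative comparison using an explicit, reused stack, where each container is locked only long enough to copy its children onto the stack, so at most one lock is held at a time. For "compare", the children would be pushed in reverse order as described above; detection of recursive data, mentioned below, is omitted.

package compare // hypothetical package, for illustration only

import "sync"

type Container struct {
	mu       sync.Mutex
	value    int // stand-in for the scalar part of a container
	children []*Container
}

type pair struct{ x, y *Container }

// Equal compares two trees without recursion and without ever holding more
// than one lock at a time. The stack slice is reused, so once it reaches its
// maximum size there is no further allocation.
func Equal(x, y *Container) bool {
	stack := make([]pair, 0, 16)
	stack = append(stack, pair{x, y})
	for len(stack) > 0 {
		p := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if p.x.value != p.y.value {
			return false
		}
		mark := len(stack)
		p.x.mu.Lock()
		for _, c := range p.x.children {
			stack = append(stack, pair{x: c}) // y side filled in below
		}
		p.x.mu.Unlock()
		p.y.mu.Lock()
		if len(p.y.children) != len(stack)-mark {
			p.y.mu.Unlock()
			return false
		}
		for i, c := range p.y.children {
			stack[mark+i].y = c
		}
		p.y.mu.Unlock()
	}
	return true
}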

(This code is further complicated because you have to detect recursive data. So you have to track the comparisons in progress.)

This approach is thread "safe" in terms of not crashing or getting random data. It's not necessarily "correct". But if you are making changes to containers at the same time as comparing them, I'm not sure what "correct" means. If you had immutable persistent data structures then you could grab a "snapshot" of the two containers. But even then, how do you get the two snapshots "at the same time"? Yet another lock?

Here's a link to the main part of the code as of this blog post. If you actually wanted to use this code it would probably be a good idea to look at the latest version in case I fix any bugs or come up with a better approach.

Tuesday, September 24, 2019

gSuneido Milestone


It's nowhere near finished but as of today gSuneido actually runs the Suneido WorkSpace.

There's a big pyramid of code (and hours) under this simple window. And it's a pyramid where one wrong brick at the bottom makes the whole thing come crashing down.

When you build bottom up, you spend a long time (years!) painstakingly laying foundations with not much to show for it.

Not surprisingly, it feels good to see some real results.

Thursday, August 08, 2019

gSuneido Progress

gSuneido (the Go language implementation of Suneido) finally passes all the standard library (stdlib) tests. I'm sure there are still bugs to find and issues to deal with, but it's a significant milestone.

Even better, it runs the tests faster than cSuneido (the original C++ implementation). jSuneido is slightly faster once it warms up, but that's not surprising since it compiles to Java byte code which is then JIT compiled by the JRE, whereas gSuneido always interprets byte code.

Unfortunately, this milestone isn't very useful. gSuneido can't access Windows API's yet so it can't run the front end user interface. And I haven't implemented the database, so it can't be used as the back end server either. In the long run, we're working on a browser based user interface but that's a ways off so I'll probably work on a Windows interface next because that will let us replace the aging cSuneido for the front end client. Also, I intend to overhaul the database rather than just port it, so that'll be a bigger job.

I'm still happy with the Go language. Occasionally I miss generics, but most of the time it's not a big deal. It's definitely my favorite language these days. When I started in 2014, Go was not as popular or as polished as it is now. For example, the garbage collector was fairly primitive and not really ready for what I needed. Luckily I picked a good horse in the language race and Go is certainly production ready now.

The first commit to the Github repo was in April of 2014. I've made 603 commits since then, most of them in the last year or so. It's currently about 22,000 lines of code. In comparison, cSuneido is 45,000 and jSuneido is 58,000. (via cloc) In modern terms those are tiny amounts of code. But it's plenty for me to develop and maintain.


Thursday, May 02, 2019

Merge Tree versus Block List

Looking for something else in the code, I happened to notice that temporary indexes in jSuneido were using a merge tree. Was that really the best data structure to use? Presumably it was better than what I used prior to that, but it didn't seem optimal.

A merge tree (at least my flavor) uses levels that double in size. Some levels may be empty. Each level is kept independently sorted. Items are inserted at the root and when the first levels fill up, they are merged into the next bigger level.
Merge Tree
Merge trees have a lot of good features:
  • Fast (amortized) insertion
  • Cache oblivious (i.e. don't need to be tuned for specific hardware)
  • Cache friendly - mostly sequential access
  • Optimal memory space (e.g. as opposed to array size doubling) 

But they keep the data ordered all the time. Which is necessary if you are interleaving reads and writes, but for my usage, I insert all the data before reading any of it. And in order to keep the data sorted, they do quite a bit of extra work. (e.g. a million items = 20 levels, which means on average each item is copied/merged about 19 times) And merge trees are write optimized; they're not as good at lookups or iteration.

Maybe a simple list with an explicit sort step would be better in this situation? But the usual array-doubling list does a lot of allocation and copying, and wastes 25% of the space on average. And allocating large contiguous chunks of memory can be hard on memory management. It seemed like a better alternative would be a list of fixed size blocks. (In the diagram the blocks are only two elements; in practice, 4096 elements per block was a good size.) Iteration is simple, and lookups can do a simple binary search.

Block List (after sort)
This has most of the features of a merge tree - fast insertion, cache oblivious/friendly, and low space overhead. In addition, it does less copying, never allocates huge blocks, and has faster iteration and lookups. And to top it off, the code is simpler :-)

As usual, I debated whether to go off on this tangent. But I was in between tasks, and I figured it wouldn't take long to implement enough to benchmark and decide whether to proceed.

To sort my block list I decided to use a regular sort on each block, and then merge the blocks. By using a series of two way merges and recycling blocks I could limit the extra space needed for merging to two blocks.

Although the percentage space overhead is high for small lists, it's small in absolute terms. And for small lists we have a single block and a simple sort with no merging required (and therefore no extra temporary space for merging).
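Here's a rough sketch of the idea (an illustration only; the real version keeps fixed-size blocks during the merge and recycles them, as described above, while this sketch just merges into new slices):

package blocklist // hypothetical package, for illustration only

import "sort"

const blockSize = 4096

type BlockList struct {
	blocks [][]int // each block holds up to blockSize items
}

// Add appends a value. (In this sketch, Add is only valid before Sort.)
func (b *BlockList) Add(x int) {
	n := len(b.blocks)
	if n == 0 || len(b.blocks[n-1]) == blockSize {
		b.blocks = append(b.blocks, make([]int, 0, blockSize))
		n++
	}
	b.blocks[n-1] = append(b.blocks[n-1], x)
}

// Sort sorts each block individually, then repeatedly merges pairs of
// sorted runs until a single sorted run remains.
func (b *BlockList) Sort() {
	for _, blk := range b.blocks {
		sort.Ints(blk)
	}
	for len(b.blocks) > 1 {
		var merged [][]int
		for i := 0; i < len(b.blocks); i += 2 {
			if i+1 == len(b.blocks) {
				merged = append(merged, b.blocks[i])
			} else {
				merged = append(merged, merge(b.blocks[i], b.blocks[i+1]))
			}
		}
		b.blocks = merged
	}
}

// merge does a standard two-way merge of two sorted slices.
func merge(a, b []int) []int {
	out := make([]int, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if a[0] <= b[0] {
			out = append(out, a[0])
			a = a[1:]
		} else {
			out = append(out, b[0])
			b = b[1:]
		}
	}
	return append(append(out, a...), b...)
}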

I implemented this in jSuneido since that's what we're using in production. But Java doesn't have an integer array sort with a custom comparator. (There are ways to do it, but not efficiently.) I ended up borrowing the sort from fastutil (which appears to have similar origins to the Java sort). I've been mostly programming in Go lately so it was a bit of a shift to be back in Java.

My initial intuition proved correct and overall it was about twice as fast. Of course, this is just one small part of a large system so I'm not sure how much speedup we'll see in actual use. Still, I figure it was probably worth the day and a half it took me. 

Tuesday, April 16, 2019

Regular Expression Implementation

This detour started with an out-of-memory error, which I tracked down to a runaway regular expression, which was a result of old C++ code expecting a nul terminated string, even though it was taking a pointer and a length.

It was easy enough to fix, but it got me looking at my regular expression implementation. Suneido has its own regular expression implementation both for historical reasons, and for portability. When I wrote jSuneido I tried to use Java regular expressions, but it proved too hard to make it compatible with cSuneido. And by now we have a lot of application code that depends on the particular flavor of regular expressions that I originally implemented.

Although it's a fairly clean object-oriented design, there was one part where it was switching on types rather than calling a virtual method. That was because it needed to update some state. No big deal, I’d just put the state on an object and pass it around.

I decided to work on this in the Go version, and Go made it very easy to write a quick benchmark to make sure my changes didn’t slow it down too much.

What I found was that it was already quite slow, and even slower with the changes. Suneido applications don’t tend to use a lot of regular expressions, so the speed wasn’t a huge issue, but it bothered me nonetheless.

The implementation compiles regular expressions to an intermediate form which is then interpreted. The speed of an interpreter depends primarily on how “large” the instructions are. In this case the instructions are very small, e.g. matching a single character. And my fancy object-oriented design added even more overhead in terms of indirection and virtual (interface in Go) calls.

What if I took a different approach and used the classic big-switch-in-a-loop style of interpreter with all the code inlined? 
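The shape is roughly this (a toy sketch with a made-up two-instruction pattern language, not Suneido's actual instruction set): the compiled pattern is a flat slice of small instructions, and the matcher is a single loop around a switch.

package rx // hypothetical package, for illustration only

const (
	opChar = iota // match the literal byte that follows
	opAny         // match any single byte
	opEnd         // end of pattern - success
)

// match reports whether the compiled program matches s starting at position i.
func match(prog []byte, s string, i int) bool {
	pc := 0
	for {
		switch prog[pc] {
		case opChar:
			if i >= len(s) || s[i] != prog[pc+1] {
				return false
			}
			i++
			pc += 2
		case opAny:
			if i >= len(s) {
				return false
			}
			i++
			pc++
		case opEnd:
			return true
		}
	}
}

For example, match([]byte{opChar, 'a', opAny, opEnd}, "ab", 0) returns true.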

I made a quick prototype and benchmarked it as 300 times faster. I was skeptical of that result, but it seemed worth investigating. As the code grew from a trivial prototype, the results came down to earth, ending up about 3 or 4 times faster. That was more believable, and still a good improvement. I suspect my first prototype was a little too simple and the compiler optimization eliminated most of it. That’s a common problem with micro-benchmarks.

Without all the object-oriented wrapping overhead, the inline interpreter wasn’t even that big, around a hundred lines of code.

Of course, while I was in there I couldn’t resist a few other optimizations. Suneido’s regular expression implementation uses fairly naive backtracking, which has some ugly worst cases. It’s a bit ad hoc, but I added specific handling for some of those bad cases. That’s where I saw up to 1000 times speed improvement, although only for those specific cases.

So I was happy with the speed improvement from this rewrite, but was the code harder to follow? I’m not sure. The interpreter is more monolithic, but it’s still reasonably small. And a whole bunch of interconnected small objects is not always easy to follow either. For me, it was a reasonable trade off.

Code is on GitHub, as usual.

Wednesday, March 06, 2019

Go small structs and interfaces

The normal rule of thumb is that if you have small structs, you should pass them by value, rather than by pointer.

So in gSuneido (the Go implementation of Suneido I'm working on) that's what I'm doing.

But I was thinking about the way interfaces work in Go. An interface can only hold a pointer. So if you do:

var i interface{} = x

and x is not a pointer, then x will be copied to the heap so that i can point to it.

Some simple cases avoid the allocation. For example, true, false, and empty string ("") can be assigned to an interface without causing allocation. Presumably because the runtime has a shared value it can point to.

Suneido is dynamically typed, so values are stored and passed around as Value interfaces. (The equivalent of an abstract base class in C++ or Java.)

Given this, does it still make sense to "pass" small structs by value? Or is that just going to cause heap allocation?

I made a quick benchmark.
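The benchmark itself isn't reproduced here, but it was presumably something along these lines (my reconstruction; the 16-byte struct matches the 16 B/op in the results below). It goes in a _test.go file:

package bench // hypothetical package, for illustration only

import "testing"

type small struct{ a, b int64 } // 16 bytes

var sink interface{}

func BenchmarkSmallByValue(b *testing.B) {
	v := small{1, 2}
	for i := 0; i < b.N; i++ {
		sink = v // copying the struct into the interface allocates
	}
}

func BenchmarkSmallByPointer(b *testing.B) {
	v := &small{1, 2}
	for i := 0; i < b.N; i++ {
		sink = v // storing an existing pointer does not allocate
	}
}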


And the results were:
BenchmarkSmallByValue     20.7 ns/op    16 B/op    1 allocs/op
BenchmarkSmallByPointer    2.09 ns/op    0 B/op    0 allocs/op

As expected, when we pass by value, we get a heap allocation every time, making it about 10 times slower. That's a bit misleading because the benchmark doesn't do any real work. e.g. if it was doing 500ns of work, then the comparison would be 520ns versus 502ns which is much less significant.

However, the allocation is more of a concern. Heap allocation has bigger, more widespread effects like garbage collection and cache pressure and memory bandwidth. These aren't always evident from a benchmark.

With these kinds of micro-benchmarks I'm always afraid the compiler has optimized away key parts of the code. But I used Compiler Explorer to look at the code generated, and it all seems to be there.

Somewhat surprisingly, I got the same results for a simple string rather than a struct - it's faster to pass a pointer to a string. Coming from C++ or Java, you think of a string as already being a pointer. But in Go it's more like a slice, i.e. it's actually a small struct containing a pointer and a length. Because it's a small struct, you normally pass it by value. But when you use interfaces, that means heap allocation (of the slice struct as well as the actual bytes of the content).

Unless I'm missing something, it seems like I should change gSuneido to use pointers for all values, even strings.