Monday, January 04, 2021

Unix: A History and a Memoir

I just finished reading Unix: A History and a Memoir by Brian Kernighan. I'm not much of a history buff, but I enjoyed it, mostly because it brought back memories.

By the time I was in high school I was obsessed with computers. My father somehow arranged for me to get a Unix account in the university computer science department. I'm not sure of the details - my father worked on campus, but wasn't part of the university, he was an entomologist (studied insects) with Canada Agriculture.

My father also arranged permission for me to sit in on some computer science classes. But my high school principal refused to let me take a few hours a week for it. I can't recall why, something about disrupting my school work. Which is totally ridiculous given that I was at the top of my classes. You wonder what goes through the minds of petty bureaucrats.

I remember being quite lost at first. I was entirely self taught, which left plenty of gaps in my knowledge. I was intimidated by the university and too shy to ask anyone for help, but I muddled along, teaching myself Unix and C. This would have been in 1977 or 1978, so the early days of Unix. Of course, it was in the universities first.

I recall being baffled that C had no way to input numbers. For some reason I either didn't discover scanf or didn't realize it was what I was looking for. It wasn't really a problem; I just figured out how to write my own string to integer conversions. When The C Programming Language (by Kernighan and Ritchie) came out in 1978, that helped a lot.
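Writing your own conversion is a nice rite of passage. Here's a sketch of that kind of hand-rolled string to integer conversion, in Go rather than the C of the day (parseInt is just my name for it):

```go
package main

import "fmt"

// parseInt converts a decimal string to an int by hand,
// the way you might before discovering atoi or scanf.
func parseInt(s string) (int, bool) {
	n := 0
	neg := false
	i := 0
	if i < len(s) && (s[i] == '-' || s[i] == '+') {
		neg = s[i] == '-'
		i++
	}
	if i == len(s) {
		return 0, false // no digits
	}
	for ; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			return 0, false // not a digit
		}
		n = n*10 + int(c-'0')
	}
	if neg {
		n = -n
	}
	return n, true
}

func main() {
	n, ok := parseInt("-42")
	fmt.Println(n, ok) // -42 true
}
```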

Software Tools (by Kernighan and Plauger) was probably the book I studied the most. I implemented a kind of Ratfor on top of TRS-80 Basic, and implemented most of the tools multiple times over the years. I still have my original copy of the book - dog eared, coffee stained, and falling apart. A few years later, I reverse engineered and implemented my own Lex and then Yacc. Yacc was a challenge because I didn't know the compiler-compiler theory it was based on. Nowadays there are open source versions of all this stuff, but not back then.

I read many more Kernighan books over the years, The Elements of Programming Style (with Plauger), The Unix Programming Environment (with Pike), The Practice of Programming (with Pike), and more recently, The Go Programming Language (with Donovan).

Nowadays, with the internet and Stack Overflow, it's hard to remember how much harder it was to access information in those days (especially if you didn't talk to other people). I had one meeting with someone from the computer science department.  (Again, arranged by my father, probably because I was so full of questions.) Looking back it was likely a grad student (he was young but had an office). I questioned him about binding time. He didn't know what I was talking about. I don't remember why I was fixated on that question. I must have seen some mention of it in a book. Me and unsolved questions/problems, like a dog and a bone.

Presumably the Unix man pages were my main resource. But if you didn't know where to start it was tough. I remember someone gave me a 20 or 30 page photocopied introduction to Unix and C. That was my main resource when I started out. Nowadays, I'd be hard pressed to make it through a day of programming without the internet.

Monday, December 14, 2020

Coverage for Suneido

Since the early days of Suneido I've thought that it would be good to have some kind of coverage tool. But for some reason, I never got around to implementing it.

Code coverage is usually associated with testing, as in "test coverage". Lately I've been seeing it in connection with coverage based fuzz testing.

But coverage can also be a useful debugging or code comprehension tool. When you're trying to figure out how some code works, it's often helpful to see which parts are executed for given inputs. You can also determine that by stepping through the code in a debugger, but if the code is large or complex, that can be tedious, and doesn't leave a record that you can study.

For some reason I started thinking about it again and wondering how hard it would be to implement in gSuneido.

One of my concerns was performance. If coverage is too slow, it won't be used. And obviously, I didn't want to slow down normal, non-coverage execution.

While simple coverage is good, I find statement execution counts more useful. Statement counts verge on profiling, although profiling is generally more concerned with measuring time (or memory).

That got me wondering about approaches to counters. One interesting technique I came across is Morris approximate counters. That would allow an effectively unlimited count in a single byte. But I decided that the nearest power of two is a little too crude. Although it's usually not critical that large counts are exact, it's often helpful to see that some code is being executed N times or N+1 times relative to other code or to inputs.
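For illustration (this isn't code I ended up using), a Morris counter stores just an exponent and increments it probabilistically; here's a minimal sketch in Go:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// morrisIncrement bumps an 8-bit Morris counter. The byte holds an
// exponent c, and we increment it with probability 2^-c, so a single
// byte can represent (approximately) a huge range of counts.
func morrisIncrement(c uint8) uint8 {
	if c < 255 && rand.Float64() < math.Pow(2, -float64(c)) {
		c++
	}
	return c
}

// morrisEstimate converts the stored exponent back to an approximate
// count; the expected value of 2^c - 1 equals the true count.
func morrisEstimate(c uint8) float64 {
	return math.Pow(2, float64(c)) - 1
}

func main() {
	var c uint8
	for i := 0; i < 1000; i++ {
		c = morrisIncrement(c)
	}
	// prints a power-of-two-ish estimate in the general vicinity of 1000
	fmt.Println(morrisEstimate(c))
}
```

The estimate lands on (roughly) the nearest power of two, which is exactly the crudeness that put me off it.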

16 bit counts (up to 65,535) are probably sufficient but I didn't want wrap-around overflow. I knew arithmetic that doesn't overflow was a standard thing but I couldn't remember the name. Eventually I found it's called saturation arithmetic. Sometimes people talk about "clamping" values to maximums or minimums (from electronics).
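A saturating 16-bit increment is tiny; a sketch in Go:

```go
package main

import "fmt"

// satInc16 increments a 16-bit counter, sticking at the
// maximum instead of wrapping around to zero.
func satInc16(n uint16) uint16 {
	if n < 65535 {
		n++
	}
	return n
}

func main() {
	fmt.Println(satInc16(100))   // 101
	fmt.Println(satInc16(65535)) // 65535 (clamped at the maximum)
}
```

The branch is cheap and well predicted, since the counter is almost never at the maximum.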

Often, to minimize coverage overhead, tracking is done on basic blocks. Normally that's part of the compiler, but I didn't really want to mess with the compiler. It's complex already and I didn't want to obscure its primary function. I realized that instead, I could get equivalent functionality based on the branches in the byte code. Obviously if you branch, that's the end of a basic block. And where you branch to is the start of a basic block. So I only needed to instrument the branch instructions in the byte code interpreter. Basically, the interpreter marks the start of executed blocks (branch destinations) and the disassembler identifies the end of blocks (branch origins).

I decided this would be a fun weekend project and a break from working on the database code. That didn't work out so well. At first I made rapid progress and I had something working quite quickly. Then things went downhill.

Since I was tracking at the byte code level, I needed to connect that back to the source level. I had a disassembler that could output mixed byte code and source code, so that seemed like the obvious place to do it. Then I found that the disassembler didn't actually work that well. I'd only ever used it as a tool to debug the compiler. I spent a bunch of time trying to make the disassembler cover all the cases.

Meanwhile, I was starting to get depressed that it was turning into such a mess. The more I worked on it, the uglier it got. I don't usually mind when I take a wrong turn or decide to throw out code. That's just the way it goes sometimes. But when I can't find a good solution, and things keep getting worse, then I'm not a happy camper.

In the end, I had something that mostly worked. I checked it into version control so I'd have a record of it, and then I deleted it. My idea of using branches to identify basic blocks was valid, but the actual implementation (that I came up with) was far from elegant.

I maybe should have given up at this point. The weekend was over, I had more important things to work on. But I still thought it would be a worthwhile feature. And if I came back to it later I'd just have to figure it out all over again.

Once I abandoned the ugly code I felt much better. I decided to take a simpler approach. I'd add an option to the code generation to insert "cover" instructions (instrumentation) at the beginning of each statement. That was just a few lines of code. Then I just needed to implement that instruction in the byte code interpreter - a handful more lines of code. The overhead was relatively small, somewhere in the neighborhood of 5 to 10 percent.
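To illustrate the idea (opcode names and layout are invented here, not gSuneido's actual instruction set), the interpreter side might look something like:

```go
package main

import "fmt"

// A minimal sketch of a "cover" instruction in a byte code
// interpreter: each statement's generated code begins with
// opCover followed by a statement index.
const (
	opCover = iota // operand: statement index
	opDone
)

// run interprets a tiny program, bumping a saturating 16-bit
// counter each time a cover instruction executes.
func run(code []int, counts []uint16) {
	for pc := 0; pc < len(code); pc++ {
		switch code[pc] {
		case opCover:
			pc++
			si := code[pc]
			if counts[si] < 65535 { // saturate, don't wrap
				counts[si]++
			}
		case opDone:
			return
		}
	}
}

func main() {
	counts := make([]uint16, 2)
	// statement 0 executes once, statement 1 twice
	code := []int{opCover, 0, opCover, 1, opCover, 1, opDone}
	run(code, counts)
	fmt.Println(counts) // [1 2]
}
```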

And that was the core of it. A bit more code to turn it on and off, and get the results. Way easier, simpler, and cleaner than my first "clever" approach. Hopefully it will prove to be a useful feature.

Tuesday, December 01, 2020

Leaving HEY

TL;DR - I used HEY for five months, gave it a fair trial I think, had no huge complaints, but I've gone back to Gmail. For the details, read on.

I was excited when Basecamp (formerly 37 Signals) announced their HEY email. I asked for and eventually received an invitation. I’ve been a long time Gmail user, from back when you needed an invitation, back in the days when Google’s motto was “Don’t be evil”. They’ve since replaced that motto with “get all the money”, which doesn’t make me comfortable giving them all my email.

Basecamp has a good record of supporting old products. (I still hold a grudge over Google Reader.) And they are more privacy minded than Google. I didn't have a problem paying for the service. Paying customers are often treated better than the users of "free" services.

I like a lot of the HEY features - screening emails, spy blocking, reply later, set aside, paper trail, feed, etc.

One thing that bothered me was that it was a closed walled garden. Traditionally, email has been built on standards (like SMTP, POP, and IMAP). You can use Thunderbird to access your Gmail, use Apple Mail to access your Hotmail, etc. HEY lets you forward mail in or out, but that's as open as it gets. You can't access your HEY email from another client, and you can't use the HEY client to access other email accounts. Their explanation for this is that their special features aren't interoperable. I'm not sure I totally buy that. It seems like a more believable reason is that it simplifies what is undoubtedly a large challenge. And of course, the HEY software itself is not open source. I prefer open solutions, but I use other walled gardens, like the Apple ecosystem.

It was easy to redirect my Gmail and start using HEY. It was easy to learn and use. I was quite happy with it, and I must admit, a bit excited to be in on the beginning of something. Like when I first started using Gmail. But it wasn't long before I started running into annoyances.

On the positive side, it was nice being able to report an issue and actually get a response from what appeared to be a human. (Hard to tell with some of the bots these days.) Good luck with trying to report a problem to Gmail.

One of my big frustrations was not being able to view attachments. You could argue that's not really the responsibility of an email client. But I was accustomed to being able to view pdf's (e.g. resumes on job applications) with a single click. That single click in HEY just took me to a file save dialog. So I could download it (cluttering up my download folder and taking up storage), then find the downloaded file, and then open it in a separate application. No more quick glance at the contents of an attachment. That was using the standalone Mac app. If I accessed HEY from my browser it was a little better (if I could convince Firefox I didn't want to download it). The funny part was that HEY displays a thumbnail, and on iOS you can zoom in and read it. So obviously they were already interpreting the content, they weren't just treating attachments as opaque blobs. I told myself this was a minor issue, but it continued to bug me.

There were quite a lot of bugs at first. In some ways that's not surprising for a new ambitious project. But I have to admit I was initially a little disappointed. I guess I've drunk a little too much of the Basecamp/David/Jason kool-aid and had high expectations. I told myself they would fix them, give them time. And some did get fixed. But others didn't. For example, the touted Feed uses infinite scrolling, except when it needs to load more content there's a noticeable pause and afterwards the view is messed up. You lose your scroll position and all the items are doubled. Not exactly a great experience. I can imagine most of the testing happened without enough data to hit that. They even mentioned it in a job posting as the kind of thing you might work on.

Then I listened to a podcast with David where he talked about how hard they'd worked to fix bugs after the release. But that they'd had to put that aside to work on their promised HEY for Work. Great, too busy adding features to fix bugs. Heard that story before. Then he went on to talk about how bugs are overrated, they're not really a big deal. You shouldn't feel bad about your bugs, they're "normal". They should have been playing "Don't worry, be happy". I'm exaggerating, I understand where he's coming from. And I agree there's a big difference between cosmetic bugs and functional bugs. And bugs that affect a minority of users versus ones that affect the majority. But it's a slippery slope. Where does that Feed issue fit? Is that a "cosmetic" issue to be shelved? To me it was a major problem, but I realize that's a judgement call.

To me, telling programmers not to worry about bugs is just asking for a bug filled product. And once you have a lot of bugs, it starts to feel pointless to fix them. Personally, I'm ok with feeling a little bad about my bugs. Not to the point of flagellation, but enough to make me e.g. write more tests next time.

I also found I'd (not surprisingly) grown accustomed to, if not dependent on, a whole assortment of Gmail features. I screwed up and sent an email to the wrong recipient, which I would have caught with Gmail's undo feature. I was used to typing a few characters of a contact and having Gmail suggest the right person, whereas HEY constantly suggested contacts I never used. The Feed is a nice idea, but it's (currently) a pretty minimal feed reader. It doesn't keep track of what you've read, and if you get interrupted, there's no way to pick up where you left off. You have to scroll down and try to remember what you've read. I've switched to using Gmail filters to forward feed type material to my Feedly. Filters are another feature missing (or omitted) from HEY.

I'm not writing HEY off. I have my account and I don't mind having paid for it. I think they're trying to do something worthwhile and I don't mind supporting that. I'll keep an eye on it for potential future use.

I'm not completely happy going back to Gmail. I don't have anything particular to hide, but I'm not a fan of surveillance capitalism - of companies like Google making their profits from selling my private information, or the ugly things done with that information by the companies that buy it.

Wednesday, September 09, 2020

Checksums

I want to have checksums on parts of the database in gSuneido. In cSuneido I used Adler32, which is available in Go, but was it the best option? (Checksums are hash functions but designed for error detection, as opposed to hash functions designed for cryptography or for hash tables.)

I didn't find a lot of clear recommendations on the web. Adler32 is a variant of Fletcher and there was some question over which was better.

A Cyclic Redundancy Check (CRC) should be better at error detection than Adler or Fletcher. But most of what I could find talks about single bit errors. I'm more concerned with chunks of data not getting written to disk or getting overwritten. It's unclear how that relates to single bit errors.

I was also wondering if I could get away with a 16 bit (2 byte) checksum. But the Go standard library doesn't have any 16 bit checksums. I could use half the bits of a 32 bit checksum. In theory a good checksum should spread the information over all the bits, so it seemed reasonable. But how would that affect error detection?

One drawback to CRC's is that they don't handle zero bytes very well. And unfortunately, memory mapped files can easily have extra zero bytes (their default value) if data doesn't get written. But in that case the zero bytes would replace actual data, which should result in a different checksum.

My other concern was speed. Adler and Fletcher are simpler algorithms than CRC so they should be faster. Adler uses a theoretically slower division than Fletcher, but the Go compiler converts division by constants into multiplication by the reciprocal so that shouldn't matter.

I wrote a quick benchmark (using Go's handy benchmark facility) and I was quite surprised to see that crc32 was much faster than adler32. Of course, the results vary depending on the details. This was for 1023 bytes of random data.

BenchmarkAdler32-24              3124998               332 ns/op
BenchmarkCrc32IEEE-24           11993930                91 ns/op
BenchmarkCrc32Cast-24           24993260                47 ns/op

I assume the speed difference is due to hardware (instruction set) support for CRC.

I found third party crc16 implementations, but I don't think they use hardware support, so I assume it would be faster to use half of crc32.

I also found that IEEE slowed down with random block sizes, e.g. a checksum of 1023 bytes was a lot slower than a checksum of 1024 bytes. Castagnoli was better in this respect, and supposedly it's also better at error detection.

I also did a bunch of manual experimenting with error detection. I found using 16 bits of crc32 worked quite well; it missed only one or two more errors out of 100,000 cases. It didn't seem to matter which 16 bits. (Dropping to 8 bits made a bigger difference, not surprisingly.)

So for now I'm going with 16 bits from crc32 Castagnoli.
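The resulting function is tiny. A sketch (checksum16 is a name I'm using here, not necessarily what's in gSuneido):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// The Castagnoli polynomial (CRC-32C) gets hardware support
// on most modern CPUs via Go's hash/crc32.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// checksum16 computes a 16-bit checksum by taking the low half
// of a CRC-32C; which 16 bits you keep doesn't seem to matter much.
func checksum16(data []byte) uint16 {
	return uint16(crc32.Checksum(data, castagnoli))
}

func main() {
	fmt.Printf("%04x\n", checksum16([]byte("hello world")))
}
```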

Wednesday, August 12, 2020

A New Assert Package for Go


Back when I first started using Go I wrote an assert package similar to Hamcrest. You used it like:

Assert(t).That(x, Equals(y))

That has worked ok, but it has a few weaknesses. I blame it partly on being relatively new with Go when I wrote it.

  • It requires a dot import, which brings all its public names into the current package scope. That's not recommended because it pollutes the name space and can easily lead to name clashes. (Technically it doesn't "require" a dot import, but it would be very awkward to use otherwise.) The other problem with a dot import is that the code doesn't mention the package, so automatic imports don't work, and I end up copying the import from another file.
  • The way it works isn't really amenable to e.g. Intellisense in VS Code. You have to know the names of the functions (like Equals).
  • It's more complex than it really needs to be - matchers return closures, which is clever, but complicates the code and makes it harder to understand.
  • It's not amenable to use outside of tests, so I wrote another tiny verify package, which was similar but different.

I finally got irritated by it enough to write a new version that addresses these issues.

Here's how you use the new package:

assert.That(a || b)
assert.Msg("first time").That(a || b)

If you're doing multiple assertions and you don't want to specify .T(t) each time, you can do:

assert := assert.T(t)

The testing.T is optional (if it's not supplied errors just panic) so you can use asserts in regular code. (And I don't need my redundant verify package.)

Now VS Code Intellisense works much better. If you type assert. it will suggest appropriate functions like That, This, T, Msg. If you type assert.This(x). it will suggest Is, Isnt, Like, Panics.
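To give the flavor of the API, here's a drastically simplified sketch (not the real package; it omits the testing.T integration and most of the matchers). The core is just methods that panic on failure, which is also what lets it work outside of tests:

```go
package main

import "fmt"

// asserter carries an optional message prefix, set via Msg.
type asserter struct {
	msg string
}

// Msg attaches a message to be included in any failure.
func Msg(msg string) asserter { return asserter{msg: msg} }

// That is the package-level form: assert.That(cond)
func That(cond bool) { asserter{}.That(cond) }

func (a asserter) That(cond bool) {
	if !cond {
		panic("assert failed " + a.msg)
	}
}

// thiser holds the actual value, so the roles of actual and
// expected are unambiguous: This(actual).Is(expected)
type thiser struct {
	actual interface{}
}

func This(actual interface{}) thiser { return thiser{actual} }

func (t thiser) Is(expected interface{}) {
	if t.actual != expected {
		panic(fmt.Sprintf("assert: got %v, expected %v", t.actual, expected))
	}
}

func main() {
	That(1+1 == 2)
	Msg("arithmetic").That(2+2 == 4)
	This(2 + 2).Is(4)
	fmt.Println("all assertions passed")
}
```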

One advantage of my old style is that you could write additional matchers in a separate package. But I never actually did this. And it's my code so if I need another matcher, I can just add it to the assert package.

One of the popular Go testing packages, Testify, uses a different style:

assert.Equals(t, x, y, "message")

There's nothing wrong with that style, and in some respects it's simpler. But I've never liked it, partly because it's unclear which is the actual value and which is the expected value.

Tuesday, August 04, 2020

Sorted List

Let's say you want to add a large number (e.g. up to millions) of values to a list, one at a time, in no particular order, and at the end you want to iterate through them in sorted order. There are a number of ways you can accomplish this.

Most languages have some sort of list data structure. In C++ you could use a vector, in Java an ArrayList, in Go a slice. Most of these grow by doubling (or possibly some other factor) in size as required. At the end you can simply sort the list. There are several drawbacks to this approach. One is that you have wasted space. Assuming doubling, the list is between 50% and 100% full, so you waste 25% on average. You also allocate roughly double the amount of space you end up with. (e.g. to get 8 elements you allocate 1, 2, 4, and then 8) In a garbage collected language this puts extra pressure on the garbage collector. And each time the size doubles you have to copy all the values to the new space.

One way around the space, allocation, and copying overhead is to split the list up into fixed size blocks and grow the list by adding more blocks.

You could use an ordered set data structure (assuming the values are unique) like a Java SortedSet or a C++ set. These are often implemented with something like a red-black tree which is not very efficient space-wise since each value has one or more pointers associated with it.

Log structured merge (LSM) trees are another option. If you just use lists for each level, the space overhead is minimal. Like the simple list, you can split the levels into blocks to reduce space and allocation overhead.

But we don't really need an ordered data structure since we don't need the order till all the values have been added. So an ordered set or LSM tree is doing more work than necessary.

My current implementation in jSuneido evolved from an LSM tree. It uses a list of blocks and at the end it sorts each block individually and does a series of two-way merges to combine them. Deciding what to use for gSuneido, it struck me that it would probably be more efficient to just sort the list as a whole, since the two-way merges were merging (compare & copy) each value log2(n) times. It was relatively easy to modify my jSuneido implementation to try that. Curiously, it was slower. That seems a bit odd since the overall sort should do fewer compares and copies. I have to assume that once again modern cpu design is making intuition invalid. I would guess that sorting the blocks individually is fast because they fit into memory cache. And merging is fast because it's reading and writing sequentially, which is optimal for cache usage.

You could merge all the blocks at once with an N-way merge, rather than a sequence of two-way merges. This would mean fewer comparisons and copies, but it doubles the space usage during the merge. A nice aspect of a block list and two-way merges is that you can recycle the blocks. As soon as you finish merging a block, you can put it on a free list to reuse. This means the space overhead for two-way merging is only a couple of blocks instead of double the size. (which reduces allocation, which reduces GC)
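The core of the pairwise scheme is an ordinary merge of two sorted slices; a sketch:

```go
package main

import (
	"fmt"
	"sort"
)

// mergeTwo merges two sorted slices into a new slice. In a real
// implementation the destination could come from a free list of
// recycled blocks rather than a fresh allocation.
func mergeTwo(a, b []int) []int {
	dst := make([]int, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if a[0] <= b[0] {
			dst = append(dst, a[0])
			a = a[1:]
		} else {
			dst = append(dst, b[0])
			b = b[1:]
		}
	}
	// one of these is empty; append whatever remains
	dst = append(dst, a...)
	dst = append(dst, b...)
	return dst
}

func main() {
	x := []int{4, 8, 2}
	y := []int{7, 1, 9}
	sort.Ints(x) // sort each block individually...
	sort.Ints(y)
	fmt.Println(mergeTwo(x, y)) // ...then merge: [1 2 4 7 8 9]
}
```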

One interesting aspect of this approach is that the sorting and merging can be done incrementally. As soon as you fill up a block you can sort it. And as soon as you have two (or power of two) blocks you can merge them. This doesn't change the amount of work required, it just redistributes it over time. Spreading work more evenly and making it less bursty can be beneficial but it's not a clear win.

But then I realized you could do the sorting and merging concurrently with building the list. Again, it doesn't change the amount of work, it just allows some of it to be done in parallel. There are various ways to do this. I created a worker goroutine and triggered it via a channel. You could just start a new goroutine each time, but that's a bit more overhead. In Java I'd be thinking about a thread pool, but goroutines are lightweight enough that it's probably not necessary.

I ended up writing multiple Go versions so I could benchmark them. The results were pretty much what I expected. Simple slice and sort is fast, but lots of allocation. Block list is much less allocation but slightly slower overall. Merging is slightly faster than a complete sort. Block list with concurrent sort/merge was the winner. It had the lower memory usage of the block list, and was 10 to 20% faster due to the concurrency.

BenchmarkSimple  30 35164847 ns/op 664934 B/op 23 allocs/op
BenchmarkBlocks  26 43607473 ns/op 131139 B/op  6 allocs/op
BenchmarkMerge   27 37056507 ns/op 197306 B/op 31 allocs/op
BenchmarkConcur  37 28624720 ns/op 197768 B/op 34 allocs/op

This was with a block size of 4096 and adding 16,384 items.

The initial code is on GitHub (this links to the original commit so the link doesn't break; in the future the latest version is probably a better bet)

Thursday, July 23, 2020

Hash Array Mapped Trie in Go

When you want an immutable persistent (in the functional sense) hash table, a Hash Array Mapped Trie is a common choice.

I recently implemented one to use in the gSuneido database. I procrastinated because I had a vague memory that it had been tricky to implement in jSuneido. But when I finally got around to doing it, it turned out to be relatively easy.

A Hash Array Mapped Trie (HAMT) is a tree of nodes. Each node has an integer bit map (commonly 32 bits) that specifies which slots are occupied, and a variable size array of items. The bitmap means you don't waste space on empty slots like you do in most hash tables.

Each level of the tree consumes 5 bits (for a bitmap of 32 bits) of the hash value. In most implementations, the items are pointers, and can either point to a value, or a child node. The items could be void* in C++ or Object in Java. The equivalent in Go would be interface{} which can hold a value of any type. But Go interface{} is double the size of a simple pointer. And on top of that, it requires casting to get the proper types back out. This approach also doesn't allow embedding small values; they have to be referenced by pointer, so it's not as cache friendly. (An interface can reference a non-pointer type like a struct, but internally it’s still using a pointer.)

I decided to take a slightly different approach and use two bitmaps and two slices (Go's variable sized arrays). One bitmap and slice for values, and one bitmap and slice for child pointers. That way the slices could be the proper types, no casting was required, and small values could be embedded rather than referenced by pointer. The tradeoff was that it uses slightly more space.
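A sketch of that node layout (field and type names are mine for illustration, with int standing in for the value type):

```go
package main

import (
	"fmt"
	"math/bits"
)

// node uses two bitmaps and two slices: one pair for values and
// one for child pointers. Each slice has a concrete type, so no
// interface{} boxing, and small values are embedded directly.
type node struct {
	valBits uint32
	ptrBits uint32
	vals    []int // concrete value type, embedded in the slice
	ptrs    []*node
}

// value looks up the value stored at the given 5-bit hash slot.
// A popcount of the bits below the slot gives the slice index,
// so no space is wasted on empty slots.
func (nd *node) value(slot uint32) (int, bool) {
	bit := uint32(1) << slot
	if nd.valBits&bit == 0 {
		return 0, false // slot is empty
	}
	i := bits.OnesCount32(nd.valBits & (bit - 1))
	return nd.vals[i], true
}

func main() {
	// values in slots 3 and 7
	nd := &node{valBits: 1<<3 | 1<<7, vals: []int{30, 70}}
	v, ok := nd.value(7)
	fmt.Println(v, ok) // 70 true
}
```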

After finishing an initial implementation, I realized that this design allowed a slight improvement over the usual approach.

At a given node, if you have a value in a slot, and you want to insert another value with the same 5 bits of hash (a collision), you replace the value with a pointer to a child node and put the two values in the new node. If your hash is bad, they can collide again forcing yet more child nodes, poor space usage, and longer searches. (When you run out of hash bits you either have to rehash or use overflow nodes.)

But with separate slices for child pointers and values, you can have both for a given slot. So when you get a collision, you still create a child node, but you don't need to move the old value. So the new node only has one element and there cannot be any further collisions from this insertion, even if the hash values are identical. And it actually simplifies the code and reduces the average search length slightly.

The other variation in my implementation is that the data structure operates in two modes, either immutable (read-only) and safely sharable between threads, or mutable (updatable) and not sharable. Like most persistent immutable tree data structures, making a change requires path copying to the root. That's a lot of overhead to do for every insert or delete. My code tends to operate in two modes, either a single thread is updating, or multiple threads are sharing immutably. By having a mutable mode, the path copying is shared/reduced. e.g. every insert/delete doesn't have to make a new root node. I use a "generation" integer value to know whether a node is mutable. If the node's generation doesn't match the current generation, then we make a copy of the node, and set its generation to the current one so it will be mutable from now on. To switch everything back to immutable (freeze it) we simply increment the current generation. (Don't have to touch the nodes.) 
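The generation mechanism can be sketched like this (heavily simplified to a single node type holding just values):

```go
package main

import "fmt"

// gnode is a node stamped with the generation it was created in.
type gnode struct {
	gen  int
	vals []int
}

type trie struct {
	curGen int
	root   *gnode
}

// mutable returns a node that is safe to modify in the current
// generation, copying it if it belongs to an older (frozen) one.
// This shares the path-copying cost across many updates.
func (tr *trie) mutable(nd *gnode) *gnode {
	if nd.gen == tr.curGen {
		return nd // already private to this generation
	}
	return &gnode{gen: tr.curGen, vals: append([]int{}, nd.vals...)}
}

// freeze makes everything immutable without touching any nodes:
// just bump the current generation.
func (tr *trie) freeze() { tr.curGen++ }

func main() {
	tr := &trie{curGen: 1}
	tr.root = &gnode{gen: 1, vals: []int{1}}
	tr.root = tr.mutable(tr.root) // same generation: no copy
	tr.freeze()
	old := tr.root
	tr.root = tr.mutable(tr.root) // stale generation: forces a copy
	tr.root.vals = append(tr.root.vals, 2)
	fmt.Println(old.vals, tr.root.vals) // [1] [1 2]
}
```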

Until Go gets generics, I'm using the Genny code generator to create typed versions. It's not perfect, but it works well enough.

The code is on GitHub as usual.

Update 2020-09-15

I discovered a bug in my implementation. If you insert two colliding items, the second will go into a child node. If you then delete the one from the parent node, you end up with a situation my code didn't handle. It assumed that if the value slot of a node was empty, then that hash didn't exist. One solution would be to check child nodes. But it made more sense to put the burden on the less common delete operation. Now when you delete a value from a node, it "pulls up" a value from a child node if there is one. This maintains the invariant that if a value slot of a node is empty, then there are no child nodes of that slot. And by pulling up from the lowest possible child, it also keeps the tree as shallow as possible.