The Software Life: March 2014

Monday, March 31, 2014

The Go Programming Language

Go has been around for a while. I looked at it when it first came out but didn't get too excited, partly because there weren't any good books about it, and that's how I like to investigate a language.

Recently I've been looking at it again. I've read two books - Programming in Go by Mark Summerfield and The Go Programming Language Phrasebook by David Chisnall. An Introduction to Programming in Go by Caleb Doxsey is available for free. These are decent, but so far I haven't found a Go book that I'd call great. There's a lot of material on the web which is useful, but I still prefer a good book. Also, most of the material is introductory - I haven't found much expert level material.

Here are my thoughts on Go (1.2), in no particular order. These are personal opinions and biased by thinking in terms of implementing Suneido, because that's the best basis for comparison that I have.

+ safe
Pointers, but no pointer arithmetic. Array bounds are checked. In theory, crash proof. (Like Java, unlike C/C++) But if you need it, there's the unsafe package.

+ garbage collection
Not as mature or high performance as Java, but steadily improving. And because not everything has to be on the heap (as in Java) there is less pressure on GC.

+ available on main platforms (Linux, Mac, Windows)

+ optional semicolons
Minor, but it's one less thing to type and to clutter up the code, and definitely my preference. I was always disappointed that D chose to keep semicolons.

+ capitalization for public / private
I'm biased since I use a similar approach in Suneido. The only (minor) thing I don't like about it is that you can't use capitalization to differentiate type names as is normal in C++ or Java.

+ goroutines and channels
An attractive alternative to threads and locks.

– no generics
Not so critical for application code, but this makes it really hard for library writers to provide things like alternate data structures. There's a built-in map that is generic, but if you want an ordered map, or a concurrent map or an immutable map you can't make it generic. Obviously, this is not an easy thing to add and it adds a lot of complexity to the language, but to me it's a drawback.

? built-in maps (hash tables)
Maps are obviously a good thing to have, but making them built-in to the language seems like an admission that the language itself is not sufficient to write such things (due to lack of generics and operator overloading)

+ fast compiles
This was one of the explicit goals of Go.

+ simple, standard build system
No complicated make or ant files.

+ native executables (no runtime required like Java or .NET)
I don't mind having a VM like Java or .NET, but it is an extra hassle. Even with .NET, which is part of Windows, you can run into versioning issues. And having to distribute a 90mb JRE is a pain.

+ variable types can be inferred from initialization
e.g. var x = 123 or just x := 123
This avoids having to duplicate long types e.g. m := map[long_type]another_long_type

+ multiple return values, multiple assignment
e.g. a,b = fn() or x,y = y,x
This means you don't have to create special data structures (and in Java, allocate them on the heap) just to return multiple values.

– no ? : ternary operator
In conjunction with mandatory braces, this turns thing like "return x < y ? x : y" into 5 lines of code. Supposedly ?: is hard to understand, personally I've never found it a problem. If you don't want the extra operators you can make if-then-else an expression like Scala and write "return if (x < y) x else y". However you write it, it's still a branch you have to understand.

+ standalone functions
Not just methods as in Java. (Java 8 now has lambdas, but still doesn't allow top-level functions.)

+ functions are first class values
You can pass around references to functions, put them in variables or data structures, etc.

+ closures
Function literals are automatically closures that can reference variables in their environment, even if the function outlasts the environment.

+ slices
A slice is a safe reference to a portion of an array. Unlike C/C++ pointers, a slice includes length and capacity. This is a big improvement over bare pointers. (D has similar slices.) See Arrays, slices (and strings)

– no immutable or read-only or const (other than const scalars)
I'm not surprised at this, but it's too bad. Immutability is very useful, especially with concurrency. Interestingly, Go strings are immutable (so at some level people see the benefit), but nothing else. One unfortunate result of this is that conversions between strings and arrays of bytes require copying, even though the underlying data is identical. This is another issue, like generics, that can add considerable complexity to the language and type system. And even where it is attempted, like C++ const, it often isn't ideal. The D language does have immutable and pure. But I would love to see immutable data and pure functions. (You can write your own immutable data structures, but again the lack of generics and operator overloading makes them clumsy.)

– no concurrent containers
I realize the Go way is to use channels and goroutines for concurrency, but I think there are still going to be times when a concurrent container would be useful and give better performance. And without generics, it's hard for libraries to provide this.

+ not everything on the heap (as opposed to Java)
You can embed one struct inside another without a pointer and a separate heap allocation. And you can pass or return a struct or put it in a variable by value, again meaning it doesn't have to be on the heap.

? no classes or inheritance
This is one of the more unconventional aspects of Go. Not having written much Go code it's hard to judge this. My feeling is that Go provides good alternatives. The only drawback may be porting existing code that uses classes and inheritance.

+ no separate primitives and boxed versions
This is a pain point in Java and a definite performance issue when implementing something like Suneido.

+ type declarations
Similar to a C/C++ typedef except that a Go type declaration introduces a new type. This is useful to create short name for a complex type, or to prevent mixing incompatible units e.g. celsius and fahrenheit.

+ can define methods on scalars (e.g. int or string), not just on classes
For example, you could declare Celsius as int, and then define a ToFahrenheit method on it. (This is a bit like the Common Lisp Object System)

+ "duck" typed interfaces
Types satisfy interfaces implicitly (by having the required methods), they do not have to explicitly declare what interfaces they satisfy. Very nice.

? no function overloading
Minor. It means more name variations, but keeps the language simpler.

? no operator overloading
Mostly not an issue. Makes it more awkward to use library provided data types.

+ flexible switch statement
Not limited to integer constants like C.

? no assert
This is explained in the FAQ but I'm not sure I agree. But it's minor because it's easy enough to write your own.

? error return values instead of exceptions, no try-catch-finally
This is different than what I'm used to but I haven't written enough Go code to really evaluate it. Go's panic and recover are sufficient to implement Suneido's exceptions. See the FAQ, Error handling and Go, and Defer, Panic, and Recover

+ can integrate with C

+ strings are arrays of bytes (generally but not necessarily UTF8), not wide characters
The D language also takes this approach. It makes sense to me, and also fits with how Suneido works. One of the advantages is that it reduces conversions when reading UTF8 files. It also reduces the size in memory when dealing with mostly ASCII. See Strings, bytes, runes and characters in Go

+ decent standard libraries and an increasing number of third party ones

+ standard testing framework
Basic but reasonable.

+ good standard tools
Nice to have standard tools for building, formatting, embedded documentation, profiling, race detection, etc.

– limited IDE and refactoring support
Although there are Eclipse and IntelliJ plugins available, they are fairly basic and don't include much refactoring support. I've been using the Sublime Text plugin. It sounds like the primary Go developers aren't IDE users so this area has lagged a bit. There is some limited refactoring ability in the go fmt tool. This area will likely improve over time.

+ standard formatting
Minor, but it's nice to sidestep any formatting debates. I like that they chose tabs for indenting, since that's always been my preference. But I'm sure it bugs people who prefer spaces.

Wednesday, March 26, 2014

B-tree Range Estimation

One of the trickier issues to deal with in Suneido is when the database query optimizer picks what seems like a poor strategy. One of the difficulties is that these issues are usually data dependant - so you need to do the debugging on a large database.

Suneido's query optimizer is fairly straightforward, but even on relatively simple queries there are a lot of possible strategies and it can be quite hard to understand the end results.

We had an issue recently with jSuneido where the sort could be satisfied by an index, but the strategy it chose was reading by a different index and then sorting (with a temporary index).

It wasn't hard to discover that it thought reading by the other index was enough faster to justify the temporary index. The problem was it was wrong.

It came down to the the btree range estimation. It was estimating the same key range quite differently for the two indexes, where they should have been exactly the same.

One of the important choices in query optimization is which indexes to use. To help with this, Suneido uses the index B-trees to estimate how "selective" a given range of index keys is, i.e. what fraction of the records does a range include. Because the query optimizer looks at lots of possible strategies (combinatorial explosion) you want this estimation to be fast. You don't want to read the entire index.

Strangely, I used the identical approach on both cSuneido and jSuneido, but they ended up choosing different strategies for the same data.

The code was doing a lookup on the "from" and "to" keys, estimating the position of the key from the path down the tree. Because a single path down the tree only looks at a few nodes out of a potentially large number, it has to assume that the rest of the tree is balanced and node sizes are "average".

It didn't take long to discover a rather blatent bug in both jSuneido and cSuneido. I was averaging the node sizes of the "from" search and the "to" search in order to get a better idea of the average node size on each level. But I wasn't adjusting the position within the node. For example, if the "from" search went through position 5 of a node of size 10, and the "to" search went through position 25 of a node of size 30, when I average the node sizes it now thought the "to" search went through position 25 of a node of size 20 - obviously wrong. On the positive side, if the node sizes don't vary too much then it doesn't have a big affect.

One reason why this bug hadn't surfaced yet is that the exact estimates aren't that important since the optimizer is just comparing whether different indexes are better or worse, not what the absolute numbers are.

That bug was easy enough to fix, and I came up with a much simpler way to write the code, but I still wasn't getting very accurate results. I started running the numbers through a spreadsheet to try to see what was wrong with my calculations.

After banging my head against the wall for a long time I finally realized that the problem was that the trees were actually not very balanced. A B-tree only guarantees that the tree is balanced in terms of height i.e. all the branches are the same length. It does not guarantee that each branch of the tree leads to the same number of keys. Most B-tree allow nodes to vary between half empty and full. When a node gets full it is split into two half full nodes.

However, if you add keys in order, this leads to most nodes only being half full. So Suneido uses a common optimization to split unevenly if a node gets full by adding on the end. In the "worst" case (in terms of balance) this leads to the root of the tree having a left branch that is full, and a right branch that has a single key. That is exactly what my data looked like in the "bad" case, but I hadn't realized the significance.

Side note - this optimization applies more to B-trees that use fixed node sizes, like cSuneido. jSuneido's append-only database uses variable sized nodes, so space wastage isn't an issue. However, you still want to keep the branching high to minimize the height of the tree.

My solution was for the estimation code to look at the sizes of all the nodes on the first and second level of the tree (root and one level below), rather than assume they were all the same average size. This doesn't handle the tree being unbalanced below there, but the lower levels of the tree have less of an effect on the final result. Finally I got some results that were at least in the right ballpark, and good enough so that jSuneido now chooses the "correct" strategy.

Maybe I've been spoiled by Java, but the unsafe nature of C++ scares me. I made a stupid bug in the cSuneido changes that wrote one past the end of an array. This passed all the unit tests, but gave some strange random errors later on. In this case it wasn't hard to find the problem, but it makes me wonder.

As usual the code is in version control on SourceForge.

Thursday, March 20, 2014

A Curious Error

I got this curious error from Evernote on the Mac:

I'm pretty sure that small amount of text doesn't exceed 100mb :-)

Maybe it thought I was trying to insert something huge, if so, it wasn't anything I did intentionally. (I think I hit the TAB key, but when I tried that again it didn't give any errors.)

Tuesday, March 18, 2014

Eclipse Crashing

I got back from six weeks travel and fired up Eclipse (4.3 Kepler on Mac OS X). Not surprisingly, there were updates so I said ok to install them and restarted when prompted. Except instead of restarting it crashed :-( I tried restarting several more times but it kept crashing.

Rather than waste a lot of time figuring out what went wrong it was quicker to rename my old copy of Eclipse, download a new copy, and then import my plugins from the old install. (One of the nice things about Eclipse is that it's just a folder, and you can trivially have multiple installs.)

The only other thing I had to do was check which Java Eclipse was using (Eclipse > Preferences > Java > Installed JREs). I found it only listed an old version so I used the Search button to find the new one, made it the default, quit from Eclipse, and deleted the old version of Java to ensure nothing would use it.

You'd think that leaving a computer turned off would ensure that everything would work when you got back (barring hardware failures). But interrupting the modern day firehose of updates unfortunately often leads to problems. (Although in theory it shouldn't.)