Tuesday, May 27, 2014

Java 8 Performance

I was just looking at some stats on the average time our customers' servers take to run our application test suite.

I noticed on a particular day the times dropped from an average of 240 seconds to an average of 200 seconds. (These are averages from about 240 customer sites.) The numbers are generally quite stable so I was curious what changed.

I discovered that was the day we updated everyone to the Java 8 JRE so it looks like that's the reason for the improvement. Assuming that's the correct explanation that's a pretty nice upgrade!

It made sense that it was Java related since customers running the older cSuneido did not show any improvement that day.

Note: jSuneido is still compiled for Java 7, this improvement would just be from the runtime.

Monday, May 19, 2014

Portable Tests

With two implementations of Suneido (the original C++ cSuneido and the newer Java jSuneido) I've ended up with three sets of overlapping tests - in each of the implementations plus in the Suneido standard library. And as I play with implementing Suneido in Go I find myself creating yet another set of tests.

Obviously this is not ideal. Apart from the duplication, each version of the tests has better or worse coverage of different areas depending on where I had issues with the implementation. Ideally, I'd like to run the same complete set of tests everywhere, and if I added a test case it would be included everywhere, not just in one of the versions.

One option would be to use something like Fit or Fitnesse. But that would still require writing code (for fixtures and "slim" interfaces) and it would mean accepting a third party dependency which in turn depends on Java.

I figured the simplest thing would be to have the test cases in text files and to write a test runner for each of the versions.

But what format should I use for the test cases? I realized that I could use a format that was easy to read with the Suneido lexical scanner. Any implementation of Suneido has to have this, and it's generally one of the first things I implement. Using the scanner made it easy to handle quoted strings and to ignore comments and whitespace.

I implemented a test runner in Suneido code first, and designed the format to keep the parsing simple. Here is an example:

@add

1, 1, 2 // i.e. assert 1 + 1 == 2

@regex_match

"abc" "b"
"abc", "x", false // this should not match

"foo
bar", 
"^bar" // ^ should match after a newline

An '@' followed by a name precedes a list of test cases for the named test "fixture". Each version has to implement each of the fixtures, but these are simple and I already have equivalent code in the existing tests.

Normally each line of values is a test case. Commas between values are optional, but newlines are ignored after a comma to allow splitting a test case over several lines.

After the Suneido code version it was straightforward to implement a version in Go. Java and C++ should also be simple.

I still want to run these tests as part of the existing automated testing, but that's easy to do by writing a test (e.g. in JUnit) that calls the test runner for the portable tests.

A remaining question is where to put the test files. Expecting them to be in the current directory or a subdirectory is easiest, but then each version will have its own copy and I'd have to keep them in sync. It makes more sense to have a single copy somewhere, but then I need some way to point each version at that central location. One option would be an environment variable but that can be problematic. Instead I decided I'd put a text file at the root of each project that would contain the directory path to the tests. (And if that doesn't fit, each implementation is free to handle this differently.)

My main concern with this was tests with a lot of different cases, where you'd use data driven tests. (like regular expressions) In other areas what I probably should have is more of a BDD (Behavior-driven development) style of tests that would form a kind of specification for Suneido. To keep this portable it would make sense to use the JBehave style that separates the specification from the implementation.

Wednesday, May 07, 2014

Go: When nil isn't nil

Go is a fairly simple language. But it's still complex enough to have some oddities. I recently ran into one of them.

Lets say you have a nil pointer and you assign it to an interface:

var p *T = nil
intfc interface{} = p

I expected intfc to now be nil, but it's not.

The explanation (it's even in the FAQ) is in terms of the implementation. An interface contains a pointer and a type. When we assign a nil pointer the interface type is set. And an interface is only nil if both the pointer and the type are "empty".

I understand the explanation and it's not hard to work around once you're aware of it. If you want to return an interface that compares equal to nil you just have to make sure you don't assign a nil pointer to it.

But I don't find the explanation very satisfying. 

First, justifying some external behavior by explaining how you happened to implement it seems wrong. (Although to be fair, maybe the explanation is intended to be in terms of what an interface means, not really how it's implemented.)

Second, the explanation doesn't explain why they chose to make it work this way. Granted, it's simple because the nil check is just for a zeroed value. But it doesn't seem like it would be much harder for the compiler to just check for a zeroed pointer and ignore the type. It seems like this would avoid the unexpected behavior with no real loss of functionality.

I did some searching, but the only justification I could find is that a nil pointer in an interface can still satisfy the interface and you can still call methods on it. Which is fine, but what does that have to do with whether the interface value compares equal to nil? I guess it would introduce a new oddity in that you'd have two kinds of nil interfaces only one of which you could call methods on.

The other issue is that (afaik) you can't determine if an interface holds a nil pointer without using reflection.

It's not a big deal, and I'm not hung up on it, it just seems odd. As far as languages are concerned, I would say Go has relatively few sharp corners to get caught on.

Sunday, May 04, 2014

Go Suneido Spike

In software, a spike is some programming to investigate options or answer questions or to test design ideas. The term came from XP (extreme programming) and is used in agile and Scrum.

As a way to learn Go I've been investigating what it would be like to use it to implement Suneido. I've implemented various bits and pieces like lexical scanning, memory mapped file access, decimal floating point numbers, string concatenation, and hash tables.

In the Java implementation of Suneido I was able to leverage the Java virtual machine. In Go (as in the original C++ version of Suneido) I would need to write my own bytecode interpreter. To investigate this, I did an end to end spike from lexing to parsing to code generation to byte code interpreter. I even implemented a basic REPL (Read, Eval, Print Loop). All it currently handles are simple expressions like 100 + 50 - 25 or "hello" $ "world". But it's implemented with the right structure to flesh out into the full language. (The code is on Github if you're interested.)

I've written about 4000 lines of Go code so far. Not enough to be an expert by any means, but enough that I don't feel like a newbie anymore. It's been mostly low level code, I haven't done anything with goroutines and channels yet.

It's been remarkably painless. I think it helps to have a C/C++ background. Of all the languages I've dabbled in in recent years, Go has been the easiest to pick up and be productive in. That's partly due to the small, simple language, and partly due to the good default tools. The Java language wasn't hard to pick up, but the libraries and tools are definitely more complex.

The fast compiles are a key feature. After working in Java I think it would be hard to give this up. Considering the classic compile and link approach, running Go code seems just as fast as running Java code.

I almost find Go's simplicity a little disappointing. I love reading about complex languages like C++ and Scala and all the wild and wonderful things you can do with them. What software geek wouldn't love turing complete templates and type systems! Go doesn't have those kinds of things, and therefore doesn't have intricate books about them.

But as far as something I actually want to use, and not just read about - in that respect Go is great.

Saturday, May 03, 2014

A Go Hash Map

I was a little disappointed to discover that the built in Go map doesn't allow interfaces as keys. Considering that interfaces are the way to do anything dynamic in Go, and that dynamic stuff often uses maps, it seems a little odd.

To act as a hash table key, you need equals and hash code methods. But it's easy to define an interface for that and require that keys implement it, similar to how Sort requires a container to implement Len, Less, and Swap.

I looked around to see what's out there and found a few options, but none really excited me. And I'm still learning Go, so it made more sense to implement it myself.

My first thought was to port my hash map code from cSuneido. But I wondered what Go's map was like. The code is straightforward but it uses an interesting approach that I haven't encountered before. It's a variant of separate chaining with each slot in the hash table being a bucket that can hold a small number of entries (e.g. 8). Additional overflow buckets can be chained together. In many hash table designs, collisions are a nuisance to be tolerated, but this design almost embraces them, by making the table 8 times smaller you assume collisions.

Buckets holding a number of entries are also better for cache locality than a linked list.

Another interesting feature is that the buckets have an additional byte per entry that holds the high byte of the hash code of the key. This helps in searching because if this piece of the hash code doesn't match then you can avoid comparing keys (which is cache unfriendly and also slow if keys are large or complex).

This design also works well for small tables since you can use a single bucket, which basically reduces it to a small array with linear searching, which is what you want for a few entries.

So I implemented this design in Go, following the C code fairly closely, except that I didn't implement the incremental resizing. It might be worthwhile in some situations, but it makes the code more complex (especially iteration) and probably makes the resizing slightly slower in total, albeit amortized. The lack of incremental resizing hasn't been a noticeable issue in cSuneido.

Have a look, it's about 200 lines of Go.

The next issue was what to use for hashing strings. Go has standard packages for hashing but they require converting to a byte array which requires allocation and copying. (Go 1.3 has an optimization for this, but only for the built in map.) So again, I wrote my own version, following the approach in hash/fnv.

It seems reasonable. The main drawback comes from Go's lack of generics - it has to work in terms of interface{} (the equivalent of Java Object or C/C++ void*) so you have to cast everything that comes out of it, reminiscent of Java prior to generics. Another minor awkwardness is that you can't use tbl[key] syntax like built in maps.

Another hash table approach which would be a natural fit with Go would be to use a growable slice for each entry in the table (rather than a list of buckets). This would avoid the space overhead from chain links and from partially full buckets, at the cost of the slice itself (a pointer and two int's), plus more individual allocations.

Related interesting reading:

Saturday, April 19, 2014

CharMatcher in Go

The Guava library for Java has a CharMatcher that provides a way of composing character matching predicates plus functions that use those predicates.

For example, AnyOf("\r\n").Negate() creates a CharMatcher that matches any character except return or newline. You can then do things like cm.IndexIn(str) or cm.CountIn(str)

Some of this you can do directly with Go libraries. The unicode package provides some standard "matchers" like IsDigit and IsLetter. And the strings package has functions like IndexFunc and TrimFunc that take predicates.

But they don't do everything that CharMatcher does, so as an exercise I thought I'd try implementing something like CharMatcher in Go.

My first approach was basically an object-oriented style like I'd use in Java with CharMatch as an interface.

But when I started adding more matchers it seemed excessive to have to define three pieces for each - a struct, a match method for the struct, and a function to construct the struct.

My next thought was to get rid of the interface and have a generic struct containing a matching function as a member. This uses closures to store the matcher parameters rather than structs.

I was stuck on the idea of a struct so that I could define methods like Negate and IndexIn on it. Then I realized that in Go I could make CharMatch just a function, and still define methods on it. That led to this version:

I used InRange for DIGIT and AnyOf for SPACE as examples, these could also use the unicode package equivalents.

IndexIn is an example of a method that just wraps a strings package function, whereas CountIn has no strings equivalent.

The tests give some examples of how it's used.

One potential drawback of this approach is that the matcher parameters are "buried" in closures. This makes it impossible to do any processing or optimization (like the Guava CharMatcher precomputed method). For example, Is('a').Or(Is('b')) could be folded into AnyOf('ab'). If you wanted to do this, I think you'd have to go back to using structs (like my first approach).

Friday, April 18, 2014

More Concatenation

I did some quick benchmarks in Go. (I really like how that ability is part of the standard tools.) Here are some results. As with any benchmark, don't take them as exact. Changing the parameters of the benchmarks gives varying numbers but with the same overall result.

Buffer37322 ns/op51104 B/op10 allocs/op
Array49456 ns/op53632 B/op17 allocs/op
Linked122558 ns/op54047 B/op1010 allocs/op
Merge311005 ns/op323552 B/op1998 allocs/op
Naive2225408 ns/op5371680 B/op999 allocs/op


An "op" in this case was appending 10 characters, 1000 times.

Some observations:
  • Naive concatenation is indeed bad, both in speed and memory
  • Almost anything is much better than the naive approach
  • As expected, a buffer is the best in both speed and memory
  • An array of substrings (without any merging) does surprisingly well
  • For an immutable option, a linked list isn't too bad
  • A merge tree was not such a good idea