The Software Life: March 2009

Tuesday, March 31, 2009

Testing Techniques

I've written quite a few tests that look like:

String[][] cases = new String[][] {
    { "something", "expected result" },
    ...
};
for (String[] c : cases)
    assertEquals(c[0], c[1], process(c[0]));

This works well enough, but one drawback is that when one case fails, there's no easy way to "go" to that case to edit it.

The testing people would probably say each case should be a separate test method, but that's a lot of extra declarations of methods.

So lately I've been writing tests like:

    test("something", "expected result");
    ...

void test(String input, String expected) {
    assertEquals(input, expected, process(input));
}

Written this way, when there's a failure Eclipse will take me straight to the failing case.

Another Technique

Sometimes it's useful to do some manual "exploratory" testing. (I know you should program test first, but I'm afraid I don't always manage it.)

For example, working on jSuneido's expression code generation I wrote a simple read-eval-print-loop so I could type in expressions and have them compile and run and print out the result.

One of the dangers is that you don't write so many automated tests because you're doing manual testing. To get around this I had my program write the cases to a text file in the correct format to paste straight into my tests.

For example, this session:

> 123 + 456
 => 579
> f = function (x, y) { x + y}; f(123, 456)
 => 579
> "hello world".Size()
 => 11
> q
bye

Added this to the log file:

test("123 + 456", "579");
test("f = function (x, y) { x + y}; f(123, 456)", "579");
test("'hello world'.Size()", "11");

Scala Attractions

Here are some of the things I find attractive about Scala. I still haven't written a line of code, though, so don't take it too seriously. Just first impressions of a language connoisseur :-)

user defined operators
In C++ you'd call this operator overloading, but Scala doesn't have a reserved set of operators. Even "+" is just a method. Obviously, this has potential to be abused, like any other feature. But it's a lot nicer to be able to say x + y for a new type of numbers, rather than x.plus(y)

traits
Scala traits are like Java interfaces, but they can include method implementations. They're similar to mix-ins. I struggled to do similar things in C++ so I can see the value.

implicit conversions
Ruby and other dynamic languages have "open classes" where you can modify existing classes at any time. They call it "monkey patching". It can be really useful, but I also find it really scary. Maybe I'm just a wimp, but I don't want standard classes changing under me. If I use some third party library or plug-in, I'm not sure I want the standard string or list class to change as a result.

Scala takes a different approach - you define implicit conversions to your own class. So instead of adding methods to String, you define an implicit conversion to MyString, which then has the extra methods you want. Scala provides a number of "rich wrappers" with implicit conversions for things like RichString and RichInt.

This is a good example of why you need to focus on the end, not the means. If what you want is to be able to write str.myOperation() there's more than one way to do that, and it doesn't necessarily require a dynamic language.

Of course, implicit conversions can cause problems as I know well from C++. But Scala's approach looks like it avoids most of the problems.

everything's an object
In other words, no more "primitive types", like in Java, that have to be handled specially. And no more boxing and unboxing. It's too bad Java didn't take this approach. There are performance issues, but with todays compiler technology they can be overcome.

first class functions
Meaning you can have unnamed function literals and pass functions around as values. To me, this is huge. Sure, you don't need this all the time, but to write higher level code it's invaluable.

lazy
This is a little thing, but I think it's cool that Scala provides a built-in way to do lazy initialization. As opposed to Java, where Effective Java has a whole section on how to do (and not do) lazy initialization.

type inference
Scala seems to combine the best of both worlds - the benefits of static typing, with almost the brevity of dynamic languages. Would you rather write:

val a = Array("one", "two", "three")
val b = new Map[String,String]

or:

String[] a = new String[] { "one", "two", "three" };
Map<int,string> b = new HashMap<string,string>();

I've talked about this before. C++ is just as bad as Java, although they are proposing an improvement in C++0X.

== is equals
I think Java made the wrong choice in making == compare pointers and equals compare values. I've talked about this before. It's great to see Scala break with Java and normally make == the same as equals.

actor model for concurrency
This is actually a Scala library rather than part of the Scala language. But it's a standard library. The actor model is what has made Erlang known for writing concurrent applications. I'm not sure how Scala's implementation compares to Erlang's, but it's nice to see tools for concurrency. Especially coming from C and C++ that basically said "not our problem". I wonder if I can make use of this when I get to making jSuneido's database concurrent?

lists and tuples
These are classic functional data structures. Although I wouldn't say I "know" functional programming, I'm certainly not ignorant of the benefits of imutable data structures and the dangers of side effects. My programmers will tell you I've been know to harp on the evils of side effects. cSuneido makes good use of lists and other imutable data structures.

class parameters
This looks like a great shortcut for writing simple classes. Would you rather write:

class Data(count: Int, name: String)

or:

class Data {
    final public int count;
    final public String name;
    Data(int count, String name) {
        this.count = count;
        this.name = name;
    }
}

optional semicolons
If nothing else, this will save me from the annoyance of typing semicolons in Suneido where I don't need them, and then turning around and omitting them from Java where I do! The joys of switching languages every other day!

Sunday, March 29, 2009

Mozilla Labs - Bespin

Here's another interesting Mozilla Labs project - a web based programmer code editor, with some interesting UI innovations.

Mozilla Labs - Bespin

Introducing Bespin from Dion Almaer on Vimeo.

Mozilla Labs - Ubiquity

This isn't new, but I just came across a reference and looked it up. Sounds pretty interesting.

Mozilla Labs - Ubiquity

Ubiquity for Firefox from Aza Raskin on Vimeo.

Thursday, March 19, 2009

Not Your Father's Spreadsheet

Google Spreadsheets has some interesting features like Google Lookup and Auto-Fill.

Wednesday, March 18, 2009

Scala Gets My Vote

Of all the languages I've read about recently, Scala strikes me as the one I'd most like to use.

I've always been interested in (computer) languages. I haven't actually seriously used many different languages - mainly C, C++, and recently Java. But I've avidly read about many.

I have books on Scheme, Smalltalk, Perl, Python, Icon, Dylan, C, C++, C#, Tcl, Ruby, Java, Lua, JavaScript, Erlang, Groovy, Scala, and probably more that I've forgotten.

I'm just finishing reading Programming in Scala by Martin Odersky (the creator of the language), Lex Spoon, and Bill Venners. Part of what prompted me to read it was a podcast with Bill Venners talking about why he likes Scala.

I liked C++. People criticize it because it was too complex, but I liked it's powerful features.

I can't get excited about Java. It's ok, and it has great IDE support, but it's a little boring. I like the garbage collection and the safety of no pointers, but I miss C++'s tricks with things like templates and operator overloading.

For some reason I haven't got too excited about Ruby either. I haven't written much code in it, but I've read a fair bit, gone to conferences, and managed a project written with Rails.

Ironically, Java compared to C++ is a bit like Suneido compared to Ruby. Both Java and Suneido try to avoid features that, although powerful, allow you to "shoot yourself in the foot" (if not in the head!)

Scala feels more like C++, lots of power and flexibility and tricks, enough to probably make your head hurt.

Both Scala and C++ try to support writing powerful, natural libraries and extensions, which I like.

For "system" programming (as opposed to "application" programming) I still lean towards static typing. But I do get tired of the verbosity of type declarations in C++ and Java. Scala is statically typed, but it also does type inference which greatly reduces the verbosity.

And it's a plus for me that Scala runs on the JVM and can be easily integrated with Java. (There's a .Net version as well, although not as well supported.) It seems like I could write parts of jSuneido in Scala if I wanted. I suspect you still might want to implement performance bottlenecks in Java, though. Or at least be careful which Scala features you used, since power often comes at a price.

Scala supports functional programming, but in combination with more conventional object-oriented programming. I haven't spent the time to wrap my head around "pure" functional languages like Haskell or (O(Ca))ML. But I could see using some functional programming in Scala.

I thought the Programming Scala book itself was good. It's not small - 2.6 lbs and 776 pages, but I had no trouble getting through it. I see it won a Jolt Award. Apress also has a Scala book, and both Pragmatic and O'Reilly have Scala books coming.

Tech Social Skills

Our internet access quit working the other night. I tried the usual things, restarting computers, routers, cable boxes, etc. But no luck.

So I called tech support. After working my way through the voice menus I finally reached a recording - which said more or less:

"We are currently experiencing problems in the City Park area.
We are working on it.
If you are calling from this area, hang up now."

Perfectly logical, but it sounds like Spock from Star Trek. No apology, no explanation, no asking for patience. Basically just "get lost, don't bug us".

Update: The next day (yesterday) the internet was working during the day but sometime around 4 or 5 pm it died. I would speculate that everyone got home from work and rushed to see if it was working and the load crashed their system and it stayed off all evening.

Side Note: Why do you have to work through Shaw Cable's entire voice menu system to get to internet support? I got the number from a sticker on the cable modem. Couldn't they have a separate phone number? I guess that wouldn't be as convenient for them.

Sunday, March 15, 2009

"Getting" Git

The Git version control system, originally developed by Linus Torvalds for the Linux kernel, has been getting pretty popular lately, especially in the Ruby on Rails crowd.

There's a preliminary Eclipse plugin and even a TortoiseGit project. SourceForge is even offering it already, whereas it took them forever to get Subversion.

At RailsConf last year I went to a session on Git Internals by Scott Chacon and was quite intrigued. You can see the slides with voice over.

Suneido has it's own version control system that has served us well, but it's pretty basic. Git's distributed style and easy branching would be nice to have.

At OSCON 2006 I had gone to a tutorial on Subversion API's with the idea that maybe we could replace Suneido's version control with an interface to Subversion. I was disappointed to find that the API's are all file oriented. I even talked to the presenter after the session to see if there was any way around this, but he couldn't suggest anything. As far as I can tell, Git is similarly file based. Probably a big reason for this is that these types of systems usually evolve from shell scripts that, understandably, work with files.

My interest in Git was raised by reading Pragmatic Version Control Using Git. On top of that, we've recently had some problems with Suneido's version control. Nothing major but still impetus to think about something better. I started to think about what it would take to implement Git like version control in Suneido. Having enjoyed his talk, I also bought Scott Chacon's PeepCode pdf book on Git Internals which expands on the same material.

The basic ideas behind Git are pretty straightforward, although quite different from most other version control systems. The way it works is attractively elegant and simple. Yet it supports powerful uses.

One area where Suneido's version control has problems (as do other version control systems) is with moving and renaming. Git supposedly handled this, but I couldn't figure out how. There didn't seem to be anything in the data structures to track this. Strangely, neither the Pragmatic book or the Git Internals pdf talked much about this.

But it didn't take long on the internet to find the answer. The Wikipedia article was surprisingly useful. It turned out to be an issue that had seen some discussion. For example this mailing list message and this one.

I was right, the data structures don't record any information about moves or renames. Instead, that information is heuristically determined after the fact, by looking at file contents.

One of the common uses of Suneido's version control is to check the history of a record. (Although we haven't gone as far as to implement a "blame" utility yet.) But the way Git works, this is not a simple task. You have to work back through each version of the source tree looking for changes to the record. But no doubt there are ways to speed this up, maybe by caching which records were involved in each commit.

I worked up enough interest and understanding to take a stab at implementing something this weekend. I have a prototype (a few hundred lines of code) that can save a source tree in a repository and retrieve it again.

Performance is a bit of an issue. Git works by comparing source trees. We have about 15,000 records (think source files) in our main application. Building a Git style tree of the working copy takes some time. But it turns out most of the time is not reading the records, it's calculating the hashes. It shouldn't be too hard to make a cache to speed that up. (You could store the hashes in the libraries but that would be dangerous since something might update content without updating the hash.)

There's obviously still a ways to go before it would be able to replace Suneido's existing version control, but it's enough to convince me that it would be feasible. Just what I needed, another project :-)

Leaky Abstractions

Originally, Suneido memory mapped the whole database file at once. This is a simple, fast way to access a file. (Note: This doesn't load the whole file into memory. It uses the virtual memory system to load pages of the file on demand.)

Unfortunately, 32 bit cpu's have a limited amount of address space. Windows only allows a maximum of 2 gb and often you have to share that space with other things.

To allow databases bigger than the address space, I changed Suneido to memory map the database file in sections. Once you've filled up the address space, if you want to map another section, you have to unmap one of the previous ones.

This is a lot like how virtual memory works, and you have the same question of which section to unmap. What you really want is to unmap the section that you're not going to need for the longest time into the future. Unfortunately, there's no way to predict the future. Instead, various heuristics are used, a common one is to unmap the least recently used (LRU) section. In other words, the mapped sections form an LRU cache into the database file.

I used an efficient approximation of LRU called the clock algorithm. We've been using this for a long time and it seemed to work reasonably well.

But recently, we started to get occasional errors from loading large databases. Often this kind of thing is caused by memory or disk errors or corrupted data. But the strange thing was that the error was consistently reproducible. We could bring the dump file back to our office and try to load it and get the same error. And the database wasn't corrupt as far as we could tell - later dumps would load fine.

I started to suspect something to do with the memory mapping. One clue was that we only had this problem on large databases.

Eventually, I discovered the problem. The rest of the database doesn't do anything to "lock" sections in memory while it works on them. Instead, it simply relies on LRU not unmapping anything that's been recently used.

But I was using an approximation of LRU. In certain rare cases the clock algorithm can end up unmapping a recently used page. (That's why it's only an approximation of LRU.) Normally that's not a problem. But because I was depending on LRU, when this rare case happened, it resulted in the error we were getting.

To fix it, I simply implemented a "true" LRU algorithm. It'll be a little slower than the clock algorithm, but given how it's used I don't think this will have a noticeable affect. And the code is actually simpler, which is always nice.

The moral of the story is I got burnt by what Joel Spolsky calls a leaky abstraction.

PS. With jSuneido, I'm considering going back to the old system of mapping the whole database file at once. This would limit the size of the database on 32 bit systems, but it's probably not unreasonable, especially in the future, to require a 64 bit system if you want a big database. This would simplify the code and eliminate bugs like the one above.

Friday, March 13, 2009

String Concatenation

Working on jSuneido, I noticed that I hadn't implemented Suneido's optimizations for concatenating strings.

In most languages this kind of code is grossly inefficient:

s = ""
for each line
s = s + line // concatenate on the end

That's because each concatenation is allocating a new string, a little bit longer each time. The normal solution is to use a different mechanism, e.g. a StringBuffer in Java.

Suneido optimizes this case by constructing a linked list instead of allocating a new string each time. Later operations then "flatten" the list into a simple string.

The downside is that if you concatenate a lot of little strings the linked list representation uses a lot more memory than a simple string. i.e. we've traded memory for speed

Reading the 2nd edition of Programming Lua, I came across the usual discussion of why it's bad to concatenate like this. But they recommend a different solution - to keep a stack/list of pieces with the invariant that you can't put a bigger piece on top of a smaller piece. Instead you combine them, and if the newly combined piece is bigger than the one below it, you combine them, and so on.

It took me a minute to think this through. If you're adding random sized pieces, then 50% of the time you'll be adding one bigger than the last one, and therefore combining them. The end result, if I'm thinking this through properly, is that, on the average, each piece in the stack/list will be half as big as the one before. So a list of N entries will hold roughly 2 ^ N characters. i.e. you can build a big string with a short stack/list.

The Lua book is suggesting doing this in your application code, but my thought is to replace Suneido's current algorithm with this one. I'd be interested to see how it compares. It will do more concatenation and allocation, but will use less memory when dealing with a lot of pieces. It also does the concatenation more "incrementally" rather than one big concatenation to flatten the list like cSuneido currently does.

High-Tech Storytelling

This is a cool video. It's about 10 min and best viewed full screen (button just left of the "vimeo" at the bottom).

World Builder from Bruce Branit on Vimeo.

Tuesday, March 10, 2009

Digital Photography

One of the interesting trends in digital photography is combining multiple images to overcome limitations of sensors and lenses. This is primarily a software trend, that seems to be gradually moving from desktop software to in-camera features. Of course, the desktop software isn't standing still either - the latest Photoshop has some amazing features, and special purpose software does even more.

One of the first uses of multiple shots was for simple panoramas. But people are also using this to get high resolution images from lower resolution cameras. For a while, cameras have had modes to help with this, showing you the previous shot while you line up the next one (and keeping the exposure the same). Now the new Sony HX1 has a "sweep" mode where you just hold down the shutter button and "wave" the camera. It decides what individual shots to take. Cool. (It also has a 20x zoom lens and shoots 1080p hi-definition video!)

Multi-shot techniques are also being used for high dynamic range (HDR), where you take multiple shots at different exposures and blend them together. (examples) You can do this with software like Photoshop, but it's also moving into cameras like the Ricoh CX1.

And multiple shots can be used to overcome limited depth of focus. (You can also go the other direction and simulate a smaller depth of focus.)

The Sony HX1 will even take multiple shots and keep the one where the person's eyes are open.

None of this would be practical without a variety of technological advances - faster image capture, bigger/cheaper memory, more processing power in camera and in computer.

These techniques are sometimes called "computational photography".

There are so many cameras out these days that it's impossible to keep track. A couple that have caught my eye lately are the Canon PowerShot SX200 IS which is a descendant of the S3 I have but much smaller and with 12 mp and 720p video. It looks perfect for travel. And the Olympus Stylus Tough-8000 which is waterproof (to 10m/33ft), shockproof, freezeproof, and crushproof. Can't beat that for adventure sports.

I used to shake my head at my father's collection of cameras, but if I'm not careful I'll end up with my own. The sad difference is that his film cameras stayed relevant a lot longer than any camera you'll buy today, which will be out of date before you get it home.

Sunday, March 08, 2009

jSuneido Language Implementation

As the saying goes, "be careful what you wish for".

I was getting tired of the somewhat mechanical porting of code from C to Java, and I was looking forward to the language implementation because I knew it would involve some thinking.

Of course, now that I'm in the middle of thrashing around trying to figure out the best way to implement things, I'm thinking maybe there's something to be said for having a clear path!

"Mapping" one complex object (Suneido's language) onto another (the JVM) is complicated. There are a lot of choices, each of them with their pro's and con's. And you never have complete information either about how things work, or what their performance characteristics are.

Some of the choices won't be easy to reverse if they turn out to be wrong. Often, you're not even sure they were "wrong". They might turn out to have awkward parts, but then again, any design choice is likely to have awkward aspects.

One of the things to remember is that the Java language and the Java Virtual Machine (JVM) are different. Some features are implemented at the Java compiler level, and so they're not directly available at the JVM level. You have to implement them in your own compiler if you want them.

I'm also having trouble fighting premature optimization. On one hand, I want a fast implementation. On the other hand, it's better to get something simple working and then optimize it.

For example, I'd like to map Suneido local variables to Java local variables. But to implement Suneido's blocks (think closures/lexical scoping) you need a way to "share" local variables between functions. The easiest way to do that is to use an array to store local variables. But that's slower and bigger code. One option is to only use an array when you've got blocks, and use Java local variables the rest of the time. Or even more fine grained, put variables referenced by blocks in an array, and the rest in Java local variables. While faster and smaller, this is also more complicated to implement. To make it worse, when the compiler first sees a variable, it doesn't know if later it will be used in a block. So you either have to parse twice, or parse to an abstract syntax tree first. (Rather than the single pass code generation during parse that cSuneido does.)

One of the things that complicates the language implementation is that Suneido allows you to Eval (call) a method or function in the "context" of another object, "as if" it were a method of that object. This is relatively easy if you're explicitly managing "this", but it's trickier if you're trying to use the Java "this".

I'd prefer to map "directly" to JVM features like local variables and inheritance because the resulting code will be smaller and presumably faster (since JVM features will have had a lot of optimization). And presumably the closer the match, the easier it will be to interface Suneido code with Java code.

Despite the initial thrashing, I think I'm settling into a set of decisions that should work out. I already have code that will compile a minimal Suneido function and output java bytecode (in a .class file).

So far, ASM has been working well. Another useful tool has been the Java Decompiler so I can turn my generated bytecode back to Java to check it.

Friday, March 06, 2009

Computer Vision

Knowing in theory that things are possible is a far cry from actually seeing them.

I'm not sure what the practical applications of this technology are, but it's sure cool!

Wednesday, March 04, 2009

jSuneido Progress

Yesterday I worked on re-implementing the database request and query parsers.

I got the request one done and a good portion of the query one. I'd hoped to finish it in one day, but I didn't quite make it. Probably another half day or so.

It slowed me down a bit trying to share code with the language lexer and parser, yet keep the code independent.

The only difference in the lexers is in the keywords. Unfortunately, there's no way (that I know of) to extend enums. Separate Token classes would have meant duplicating the common stuff - yuck!

Instead, I decided to make the lexer include both sets of keywords. The extra coupling is unfortunate, but I figure it's better than duplication.

Luckily, the way the parsers work, keywords are not "reserved". So it doesn't hurt either parser to have extra keywords. (It wouldn't have been hard to "enable" only ones needed in a particular case, but it would have been one more complication.)

I also ran into complications with generics. The parsers take a type parameter so they can be reused with different "generators" e.g. to generate code, or execute directly, or build a syntax tree. There's a lot of actions shared between the different parsers (e.g. for expressions). The parsers also call each other (e.g. queries call the expression parser) so the generators have to be type compatible. I ended up with code like:

But compared to some of the twisted C++ template code, it doesn't seem too bad.

(I ended up having to use an image for that line of code because Blogger kept getting messed up by the angle brackets, even if I entered them as & lt ; in Edit Html mode.)

Java Enums

When Java (1.)5 came out with the new enum facilities, I thought it was cool, but I wasn't sure how practical it was. Allowing members and methods on enums? It seemed weird. Of course, this was coming from a C(++) background where an enum is just a thinly disguised integer.

But now that I'm programming in Java I'm actually finding these facilities quite useful. For example, in the lexical scanner Token class I attach various information to the tokens:

the string "name" in the case of keywords
the "matching" token for parenthesis, braces, and brackets
whether it's an infix or assignment operator

I use a separate TokenFeature enum for the infix and assignment features, and store them using an EnumSet, which represents them as bits in an integer (for small sets).

The only awkward part is having to define a bunch of different constructors.

Having this information attached to the tokens simplifies the code quite a bit, avoids duplicating the information in several places, and makes it easy to maintain.

PS. Looking back at the code, I think it could be simplified. All assignment operators are infix (at least for my purposes) so I don't really need a set of features, the infix method could just check for INFIX or ASSIGN. Sigh ... sometimes it seems like the refactoring never ends!

Robots

Check out these photographs:

http://www.boston.com/bigpicture/2009/03/robots.html

Monday, March 02, 2009

How FriendFeed uses MySQL to store schema-less data

Here's an interesting database approach that deliberately does not use conventional database indexes.

How FriendFeed uses MySQL to store schema-less data

Sunday, March 01, 2009

cSuneido Query Improvement

Recently we've been working on fixing some of the "slow" queries in our applications. Usually this means adding appropriate composite (multi-field) indexes.

But in one case, that didn't completely fix the problem. We added the index necessary to select the right data but then we summarizing by different fields and that was requiring a temporary index.

Thinking about this, I realized that it doesn't make much sense to create a temporary index so summarize can read the data in the right order to process it sequentially. Better to just read the data in any order and accumulate the results. That means you have to process all the data before you can produce any output, but that's no different than having to create the temporary index first. And (in most cases) the accumulated results will be smaller that the temporary index, since you're summarizing.

It didn't seem like this should be too hard to add. (typical programmer optimism)

But then I started looking at the code and I realized that although the query operations choose strategies, they weren't actually using the strategy pattern. Instead they just have a bunch of if's to alter their behavior based on the chosen strategy.

It's funny that hasn't struck me before. It seems so obvious. And it's not like I just learned about the strategy pattern. I read Design Patterns when it first came out in 1994, 15 years ago. I'd made each query operation a separate class, but for some reason I'd stopped there.

So I decided to refactor the summarize code to use separate strategy classes. Which again turned out to be a bigger job than I thought.

I started out with just using my text editor (Scite) but I've been spoilt with all my IDE usage. So I tried using NetBeans (6.5). But that didn't work too well. I could build from NetBeans but I couldn't double click on the errors to go to the offending source. (Even with Scite I can do that.) And I couldn't seem to configure it to my indenting style (Whitesmith's).

So I switched to Eclipse. It's not as easy to import a project into Eclipse as it is in NetBeans, but after a few false starts I get it going. It does support Whitesmith's style. And you can double click on the errors. But C++ support in NetBeans or Eclipse isn't anywhere close to as nice as Java support.

One problem with splitting things up after the fact is that the code tends to have developed interconnections. Luckily it wasn't too hard to pass the necessary information.