Sunday, March 15, 2009

"Getting" Git

The Git version control system, originally developed by Linus Torvalds for the Linux kernel, has been getting pretty popular lately, especially in the Ruby on Rails crowd.

There's a preliminary Eclipse plugin and even a TortoiseGit project. SourceForge already offers it, whereas it took them forever to add Subversion.

At RailsConf last year I went to a session on Git Internals by Scott Chacon and was quite intrigued. You can see the slides with voice over.

Suneido has its own version control system that has served us well, but it's pretty basic. Git's distributed style and easy branching would be nice to have.

At OSCON 2006 I had gone to a tutorial on the Subversion APIs with the idea that maybe we could replace Suneido's version control with an interface to Subversion. I was disappointed to find that the APIs are all file oriented. I even talked to the presenter after the session to see if there was any way around this, but he couldn't suggest anything. As far as I can tell, Git is similarly file based. Probably a big reason for this is that these types of systems usually evolve from shell scripts that, understandably, work with files.

My interest in Git was raised by reading Pragmatic Version Control Using Git. On top of that, we've recently had some problems with Suneido's version control. Nothing major but still impetus to think about something better. I started to think about what it would take to implement Git like version control in Suneido. Having enjoyed his talk, I also bought Scott Chacon's PeepCode pdf book on Git Internals which expands on the same material.

The basic ideas behind Git are pretty straightforward, although quite different from most other version control systems. The way it works is attractively elegant and simple. Yet it supports powerful uses.

One area where Suneido's version control has problems (as do other version control systems) is with moving and renaming. Git supposedly handles this, but I couldn't figure out how. There didn't seem to be anything in the data structures to track it. Strangely, neither the Pragmatic book nor the Git Internals pdf talked much about this.

But it didn't take long on the internet to find the answer. The Wikipedia article was surprisingly useful. It turned out to be an issue that had seen some discussion. For example this mailing list message and this one.

I was right, the data structures don't record any information about moves or renames. Instead, that information is heuristically determined after the fact, by looking at file contents.

One of the common uses of Suneido's version control is to check the history of a record. (Although we haven't gone as far as to implement a "blame" utility yet.) But the way Git works, this is not a simple task. You have to work back through each version of the source tree looking for changes to the record. But no doubt there are ways to speed this up, maybe by caching which records were involved in each commit.
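
Just to make the caching idea concrete, something along these lines could work - a purely hypothetical sketch in Java (the names and structure are mine, not Git's or Suneido's):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps each record name to the commits that touched it, so finding the
// history of a record doesn't require walking back through every tree.
class CommitIndex {
    private final Map<String, List<String>> touched =
            new HashMap<String, List<String>>();

    void add(String commitId, Iterable<String> changedRecords) {
        for (String name : changedRecords) {
            List<String> commits = touched.get(name);
            if (commits == null) {
                commits = new ArrayList<String>();
                touched.put(name, commits);
            }
            commits.add(commitId);
        }
    }

    List<String> historyOf(String name) {
        List<String> commits = touched.get(name);
        return commits == null ? new ArrayList<String>() : commits;
    }
}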

I worked up enough interest and understanding to take a stab at implementing something this weekend. I have a prototype (a few hundred lines of code) that can save a source tree in a repository and retrieve it again.

Performance is a bit of an issue. Git works by comparing source trees. We have about 15,000 records (think source files) in our main application. Building a Git style tree of the working copy takes some time. But it turns out most of the time is not reading the records, it's calculating the hashes. It shouldn't be too hard to make a cache to speed that up. (You could store the hashes in the libraries but that would be dangerous since something might update content without updating the hash.)
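
For the hash cache, I'm picturing something roughly like this - a minimal sketch, assuming records can be keyed by a name plus a reliable modification timestamp (that interface is my assumption, not Suneido's actual API):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Caches content hashes keyed by record name plus modification time,
// so unchanged records aren't re-hashed every time a tree is built.
class HashCache {
    private final Map<String, byte[]> cache = new HashMap<String, byte[]>();

    byte[] hash(String name, long modified, byte[] content) {
        String key = name + "@" + modified;
        byte[] h = cache.get(key);
        if (h == null) {
            h = sha1(content);
            cache.put(key, h);
        }
        return h;
    }

    private static byte[] sha1(byte[] content) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-1 is always available
        }
    }
}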

There's obviously still a ways to go before it would be able to replace Suneido's existing version control, but it's enough to convince me that it would be feasible. Just what I needed, another project :-)

Leaky Abstractions

Originally, Suneido memory mapped the whole database file at once. This is a simple, fast way to access a file. (Note: This doesn't load the whole file into memory. It uses the virtual memory system to load pages of the file on demand.)

Unfortunately, 32 bit cpus have a limited amount of address space. Windows only allows a maximum of 2 gb, and often you have to share that space with other things.

To allow databases bigger than the address space, I changed Suneido to memory map the database file in sections. Once you've filled up the address space, if you want to map another section, you have to unmap one of the previous ones.

This is a lot like how virtual memory works, and you have the same question of which section to unmap. What you really want is to unmap the section that you're not going to need for the longest time into the future. Unfortunately, there's no way to predict the future. Instead, various heuristics are used, a common one is to unmap the least recently used (LRU) section. In other words, the mapped sections form an LRU cache into the database file.

I used an efficient approximation of LRU called the clock algorithm. We've been using this for a long time and it seemed to work reasonably well.
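
Roughly, the clock algorithm keeps a "used" bit for each mapped section and sweeps a hand around, giving each section a second chance before unmapping it. A simplified sketch (in Java rather than the actual C++, with made-up names):

// Each slot has a "used" bit that is set when the section is accessed;
// the hand clears bits until it finds one that is already clear.
class Clock {
    private final boolean[] used;
    private int hand = 0;

    Clock(int slots) {
        used = new boolean[slots];
    }

    void touch(int slot) {         // called whenever a section is accessed
        used[slot] = true;
    }

    int pickVictim() {             // choose a section to unmap
        while (used[hand]) {
            used[hand] = false;    // give it a second chance
            hand = (hand + 1) % used.length;
        }
        int victim = hand;
        hand = (hand + 1) % used.length;
        return victim;
    }
}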

But recently, we started to get occasional errors from loading large databases. Often this kind of thing is caused by memory or disk errors or corrupted data. But the strange thing was that the error was consistently reproducible. We could bring the dump file back to our office and try to load it and get the same error. And the database wasn't corrupt as far as we could tell - later dumps would load fine.

I started to suspect something to do with the memory mapping. One clue was that we only had this problem on large databases.

Eventually, I discovered the problem. The rest of the database code doesn't do anything to "lock" sections in memory while it works on them. Instead, it simply relies on LRU not unmapping anything that's been recently used.

But I was using an approximation of LRU. In certain rare cases the clock algorithm can end up unmapping a recently used page. (That's why it's only an approximation of LRU.) Normally that's not a problem. But because I was depending on LRU, when this rare case happened, it resulted in the error we were getting.

To fix it, I simply implemented a "true" LRU algorithm. It'll be a little slower than the clock algorithm, but given how it's used I don't think this will have a noticeable effect. And the code is actually simpler, which is always nice.
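
The fix is in the C++ code, but just to illustrate how simple true LRU can be - in Java, a LinkedHashMap in access order does almost all the work:

import java.util.LinkedHashMap;
import java.util.Map;

// A true LRU cache of mapped sections: LinkedHashMap tracks access order,
// and the eldest (least recently used) entry is evicted over capacity.
class LruSections<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruSections(int capacity) {
        super(capacity, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // unmap the section held in eldest.getValue()
    }
}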

The moral of the story is I got burnt by what Joel Spolsky calls a leaky abstraction.

PS. With jSuneido, I'm considering going back to the old system of mapping the whole database file at once. This would limit the size of the database on 32 bit systems, but it's probably not unreasonable, especially in the future, to require a 64 bit system if you want a big database. This would simplify the code and eliminate bugs like the one above.

Friday, March 13, 2009

String Concatenation

Working on jSuneido, I noticed that I hadn't implemented Suneido's optimizations for concatenating strings.

In most languages this kind of code is grossly inefficient:

s = ""
for each line
  s = s + line // concatenate on the end

That's because each concatenation is allocating a new string, a little bit longer each time. The normal solution is to use a different mechanism, e.g. a StringBuffer in Java.

Suneido optimizes this case by constructing a linked list instead of allocating a new string each time. Later operations then "flatten" the list into a simple string.

The downside is that if you concatenate a lot of little strings, the linked list representation uses a lot more memory than a simple string, i.e. we've traded memory for speed.

Reading the 2nd edition of Programming Lua, I came across the usual discussion of why it's bad to concatenate like this. But they recommend a different solution - to keep a stack/list of pieces with the invariant that you can't put a bigger piece on top of a smaller piece. Instead you combine them, and if the newly combined piece is bigger than the one below it, you combine those as well, and so on.

It took me a minute to think this through. If you're adding random sized pieces, then 50% of the time you'll be adding one bigger than the last one, and therefore combining them. The end result, if I'm thinking this through properly, is that, on the average, each piece in the stack/list will be half as big as the one before. So a list of N entries will hold roughly 2 ^ N characters. i.e. you can build a big string with a short stack/list.
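
A minimal sketch of that stack-with-invariant approach (in Java, since that's what I'm working in; the Lua book does it in Lua, of course):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Invariant: each piece on the stack is strictly smaller than the one
// below it, so the stack stays short.
class ConcatStack {
    private final Deque<String> stack = new ArrayDeque<String>();

    void add(String piece) {
        // merge upward until the piece on top is bigger than the new one
        while (!stack.isEmpty() && piece.length() >= stack.peek().length())
            piece = stack.pop() + piece; // the older piece goes in front
        stack.push(piece);
    }

    String value() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = stack.descendingIterator(); it.hasNext(); )
            sb.append(it.next()); // bottom (oldest) to top (newest)
        return sb.toString();
    }
}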

The Lua book is suggesting doing this in your application code, but my thought is to replace Suneido's current algorithm with this one. I'd be interested to see how it compares. It will do more concatenation and allocation, but will use less memory when dealing with a lot of pieces. It also does the concatenation more "incrementally" rather than one big concatenation to flatten the list like cSuneido currently does.

High-Tech Storytelling

This is a cool video. It's about 10 min and best viewed full screen (button just left of the "vimeo" at the bottom).


World Builder from Bruce Branit on Vimeo.

Tuesday, March 10, 2009

Digital Photography

One of the interesting trends in digital photography is combining multiple images to overcome limitations of sensors and lenses. This is primarily a software trend, that seems to be gradually moving from desktop software to in-camera features. Of course, the desktop software isn't standing still either - the latest Photoshop has some amazing features, and special purpose software does even more.

One of the first uses of multiple shots was for simple panoramas. But people are also using this to get high resolution images from lower resolution cameras. For a while, cameras have had modes to help with this, showing you the previous shot while you line up the next one (and keeping the exposure the same). Now the new Sony HX1 has a "sweep" mode where you just hold down the shutter button and "wave" the camera. It decides what individual shots to take. Cool. (It also has a 20x zoom lens and shoots 1080p hi-definition video!)

Multi-shot techniques are also being used for high dynamic range (HDR), where you take multiple shots at different exposures and blend them together. (examples) You can do this with software like Photoshop, but it's also moving into cameras like the Ricoh CX1.

And multiple shots can be used to overcome limited depth of focus. (You can also go the other direction and simulate a smaller depth of focus.)

The Sony HX1 will even take multiple shots and keep the one where the person's eyes are open.

None of this would be practical without a variety of technological advances - faster image capture, bigger/cheaper memory, more processing power in camera and in computer.

These techniques are sometimes called "computational photography".

There are so many cameras out these days that it's impossible to keep track. A couple that have caught my eye lately are the Canon PowerShot SX200 IS which is a descendant of the S3 I have but much smaller and with 12 mp and 720p video. It looks perfect for travel. And the Olympus Stylus Tough-8000 which is waterproof (to 10m/33ft), shockproof, freezeproof, and crushproof. Can't beat that for adventure sports.

I used to shake my head at my father's collection of cameras, but if I'm not careful I'll end up with my own. The sad difference is that his film cameras stayed relevant a lot longer than any camera you'll buy today, which will be out of date before you get it home.

Sunday, March 08, 2009

jSuneido Language Implementation

As the saying goes, "be careful what you wish for".

I was getting tired of the somewhat mechanical porting of code from C to Java, and I was looking forward to the language implementation because I knew it would involve some thinking.

Of course, now that I'm in the middle of thrashing around trying to figure out the best way to implement things, I'm thinking maybe there's something to be said for having a clear path!

"Mapping" one complex object (Suneido's language) onto another (the JVM) is complicated. There are a lot of choices, each of them with their pro's and con's. And you never have complete information either about how things work, or what their performance characteristics are.

Some of the choices won't be easy to reverse if they turn out to be wrong. Often, you're not even sure they were "wrong". They might turn out to have awkward parts, but then again, any design choice is likely to have awkward aspects.

One of the things to remember is that the Java language and the Java Virtual Machine (JVM) are different. Some features are implemented at the Java compiler level, and so they're not directly available at the JVM level. You have to implement them in your own compiler if you want them.

I'm also having trouble fighting premature optimization. On one hand, I want a fast implementation. On the other hand, it's better to get something simple working and then optimize it.

For example, I'd like to map Suneido local variables to Java local variables. But to implement Suneido's blocks (think closures/lexical scoping) you need a way to "share" local variables between functions. The easiest way to do that is to use an array to store local variables. But that's slower and bigger code. One option is to only use an array when you've got blocks, and use Java local variables the rest of the time. Or even more fine grained, put variables referenced by blocks in an array, and the rest in Java local variables. While faster and smaller, this is also more complicated to implement. To make it worse, when the compiler first sees a variable, it doesn't know if later it will be used in a block. So you either have to parse twice, or parse to an abstract syntax tree first. (Rather than the single pass code generation during parse that cSuneido does.)
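
To make the array approach concrete, here's a hand-written Java stand-in for what the generated code would amount to (the names and the Block interface are made up for illustration):

// Variables referenced by a block live in a shared array; variables not
// referenced by any block could stay as ordinary Java locals.
class SharedVarsExample {
    interface Block {
        Object call();
    }

    static Object outer() {
        final Object[] vars = new Object[1]; // slot 0 holds "x", shared with the block
        vars[0] = 42;                        // x = 42

        Block b = new Block() {
            public Object call() {
                vars[0] = (Integer) vars[0] + 1; // the block reads and updates "x"
                return vars[0];
            }
        };
        b.call();
        return vars[0]; // 43 - the block's change is visible to the enclosing function
    }
}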

One of the things that complicates the language implementation is that Suneido allows you to Eval (call) a method or function in the "context" of another object, "as if" it were a method of that object. This is relatively easy if you're explicitly managing "this", but it's trickier if you're trying to use the Java "this".

I'd prefer to map "directly" to JVM features like local variables and inheritance because the resulting code will be smaller and presumably faster (since JVM features will have had a lot of optimization). And presumably the closer the match, the easier it will be to interface Suneido code with Java code.

Despite the initial thrashing, I think I'm settling into a set of decisions that should work out. I already have code that will compile a minimal Suneido function and output java bytecode (in a .class file).

So far, ASM has been working well. Another useful tool has been the Java Decompiler so I can turn my generated bytecode back to Java to check it.
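
For anyone curious what ASM code looks like, generating a trivial class is along these lines (a bare-bones sketch, not jSuneido's actual code generator):

import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

// Generates the bytes of a class roughly equivalent to:
//   public class Sample { public static int answer() { return 42; } }
class GenSample {
    static byte[] generate() {
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS);
        cw.visit(V1_5, ACC_PUBLIC, "Sample", null, "java/lang/Object", null);

        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC + ACC_STATIC, "answer", "()I", null, null);
        mv.visitCode();
        mv.visitIntInsn(BIPUSH, 42); // push the constant 42
        mv.visitInsn(IRETURN);       // and return it
        mv.visitMaxs(0, 0);          // actual sizes computed by COMPUTE_MAXS
        mv.visitEnd();

        cw.visitEnd();
        return cw.toByteArray();     // the contents of the .class file
    }
}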

Friday, March 06, 2009

Computer Vision

Knowing in theory that things are possible is a far cry from actually seeing them.

I'm not sure what the practical applications of this technology are, but it's sure cool!

Wednesday, March 04, 2009

jSuneido Progress

Yesterday I worked on re-implementing the database request and query parsers.

I got the request one done and a good portion of the query one. I'd hoped to finish it in one day, but I didn't quite make it. Probably another half day or so.

It slowed me down a bit trying to share code with the language lexer and parser, yet keep the code independent.

The only difference in the lexers is in the keywords. Unfortunately, there's no way (that I know of) to extend enums. Separate Token classes would have meant duplicating the common stuff - yuck!

Instead, I decided to make the lexer include both sets of keywords. The extra coupling is unfortunate, but I figure it's better than duplication.

Luckily, the way the parsers work, keywords are not "reserved". So it doesn't hurt either parser to have extra keywords. (It wouldn't have been hard to "enable" only ones needed in a particular case, but it would have been one more complication.)

I also ran into complications with generics. The parsers take a type parameter so they can be reused with different "generators" e.g. to generate code, or execute directly, or build a syntax tree. There are a lot of actions shared between the different parsers (e.g. for expressions). The parsers also call each other (e.g. queries call the expression parser) so the generators have to be type compatible. I ended up with code like:

But compared to some of the twisted C++ template code, it doesn't seem too bad.

(I ended up having to use an image for that line of code because Blogger kept getting messed up by the angle brackets, even if I entered them as & lt ; in Edit Html mode.)

Java Enums

When Java (1.)5 came out with the new enum facilities, I thought it was cool, but I wasn't sure how practical it was. Allowing members and methods on enums? It seemed weird. Of course, this was coming from a C(++) background where an enum is just a thinly disguised integer.

But now that I'm programming in Java I'm actually finding these facilities quite useful. For example, in the lexical scanner Token class I attach various information to the tokens:
  • the string "name" in the case of keywords
  • the "matching" token for parenthesis, braces, and brackets
  • whether it's an infix or assignment operator
I use a separate TokenFeature enum for the infix and assignment features, and store them using an EnumSet, which represents them as bits in an integer (for small sets).
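
Something along these lines (a simplified sketch, not the actual jSuneido Token class):

import java.util.EnumSet;

enum TokenFeature { INFIX, ASSIGN }

// Keyword name, matching bracket, and features are all attached
// directly to the enum constants.
enum Token {
    IF("if"),
    L_PAREN, R_PAREN,
    ADD(TokenFeature.INFIX),
    ADDEQ(TokenFeature.INFIX, TokenFeature.ASSIGN);

    final String keyword;
    final EnumSet<TokenFeature> features = EnumSet.noneOf(TokenFeature.class);
    Token other; // the matching parenthesis/brace/bracket, if any

    Token() {
        this((String) null);
    }
    Token(TokenFeature... features) {
        this((String) null, features);
    }
    Token(String keyword, TokenFeature... features) {
        this.keyword = keyword;
        for (TokenFeature f : features)
            this.features.add(f);
    }

    boolean infix() {
        return features.contains(TokenFeature.INFIX);
    }

    static {
        L_PAREN.other = R_PAREN;
        R_PAREN.other = L_PAREN;
    }
}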

The only awkward part is having to define a bunch of different constructors.

Having this information attached to the tokens simplifies the code quite a bit, avoids duplicating the information in several places, and makes it easy to maintain.

PS. Looking back at the code, I think it could be simplified. All assignment operators are infix (at least for my purposes) so I don't really need a set of features, the infix method could just check for INFIX or ASSIGN. Sigh ... sometimes it seems like the refactoring never ends!

Robots

Check out these photographs:

http://www.boston.com/bigpicture/2009/03/robots.html

Monday, March 02, 2009

Sunday, March 01, 2009

cSuneido Query Improvement

Recently we've been working on fixing some of the "slow" queries in our applications. Usually this means adding appropriate composite (multi-field) indexes.

But in one case, that didn't completely fix the problem. We added the index necessary to select the right data, but then we were summarizing by different fields, and that was requiring a temporary index.

Thinking about this, I realized that it doesn't make much sense to create a temporary index so summarize can read the data in the right order to process it sequentially. Better to just read the data in any order and accumulate the results. That means you have to process all the data before you can produce any output, but that's no different than having to create the temporary index first. And (in most cases) the accumulated results will be smaller than the temporary index, since you're summarizing.

It didn't seem like this should be too hard to add. (typical programmer optimism)

But then I started looking at the code and I realized that although the query operations choose strategies, they weren't actually using the strategy pattern. Instead they just have a bunch of if's to alter their behavior based on the chosen strategy.

It's funny that hasn't struck me before. It seems so obvious. And it's not like I just learned about the strategy pattern. I read Design Patterns when it first came out in 1994, 15 years ago. I'd made each query operation a separate class, but for some reason I'd stopped there.

So I decided to refactor the summarize code to use separate strategy classes. Which again turned out to be a bigger job than I thought.
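
Roughly the shape I'm after (a hypothetical sketch with made-up names, in Java rather than the actual C++):

import java.util.HashMap;
import java.util.Map;

// Each way of executing summarize becomes its own strategy class,
// instead of if's scattered through one big Summarize class.
abstract class SummarizeStrategy {
    abstract void add(String key, double value);
    abstract Map<String, Double> results();
}

// Reads rows in any order and accumulates totals per key -
// no temporary index required.
class SummarizeMap extends SummarizeStrategy {
    private final Map<String, Double> totals = new HashMap<String, Double>();

    void add(String key, double value) {
        Double total = totals.get(key);
        totals.put(key, total == null ? value : total + value);
    }

    Map<String, Double> results() {
        return totals;
    }
}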

I started out with just using my text editor (Scite) but I've been spoilt with all my IDE usage. So I tried using NetBeans (6.5). But that didn't work too well. I could build from NetBeans but I couldn't double click on the errors to go to the offending source. (Even with Scite I can do that.) And I couldn't seem to configure it to my indenting style (Whitesmith's).

So I switched to Eclipse. It's not as easy to import a project into Eclipse as it is in NetBeans, but after a few false starts I got it going. It does support Whitesmith's style. And you can double click on the errors. But C++ support in NetBeans or Eclipse isn't anywhere close to as nice as Java support.

One problem with splitting things up after the fact is that the code tends to have developed interconnections. Luckily it wasn't too hard to pass the necessary information.

Friday, February 27, 2009

Shortcomings of Test Coverage

It didn't surprise me, but some people might wonder how I could have a bunch of bugs left in the parsing code, when I had pretty complete test coverage.

It wasn't because the bugs were in the few lines that weren't executed by the tests.

The problem is that just executing every line doesn't mean you've tested all the possible "paths" through the code. And it certainly doesn't mean you've tested every possible input.

For something like the parser, with many branches (if's and loops) and deep recursion, it's pretty much impossible to test every possible "path".
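
A toy example of the difference: two tests can execute every line of this method and still miss the path that blows up.

class Coverage {
    static int f(boolean a, boolean b) {
        int x = 0;
        if (a)
            x = 1;      // executed by test 1: f(true, false)
        if (b)
            x = 10 / x; // executed by test 2: f(true, true)
        return x;
    }
    // Those two tests execute every line, but the path f(false, true)
    // divides by zero and is never exercised.
}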

I still think it's worthwhile aiming for good test coverage. If you don't have good coverage that means there are parts of your code that haven't ever been run by your tests. And we all hate the bugs that seem so blatant that people say "didn't you run this?".

So just don't let good test coverage give you a false sense of security. There can still be lots of bugs hiding in those untested execution paths.

Thursday, February 26, 2009

Another jSuneido Milestone

jSuneido now parses Suneido code, at least the 11 mb of source code that I threw at it.

Note: I'm only checking that it parses successfully - I'm not checking the results of the parse so there could still be problems. For example, in ambiguous cases it might be picking the wrong choice. But I don't think it's too likely because the wrong choices almost always lead to later syntax errors.

It didn't take long to set up the framework to read cSuneido dump files since I already had done this for loading stdlib into the database.

I initially expected a lot of failures, but I was still a little surprised when it couldn't parse a single record from stdlib! But I was still fresh and the sun was shining (although it was -30!) so I could laugh.

I soon found it was because all the stdlib records start with a copyright comment. Of course, none of my tests had an initial comment. This was easy to fix.

After that it was just slogging, fixing errors one by one. Sometimes programming seems like mountain climbing. Finishing a project and getting to the top are great, but you'd better enjoy the process as well, since that's where you spend most of your time.

I was really hoping to get the parsing working by the end of the day, but by late afternoon it seemed like I might not make it. Then I fixed one more bug and ... no more errors! It threw me off for a minute. I'd grown so accustomed to the error output that it took me a minute to realize it had succeeded.

I wish I could say I was done with parsing, but I still have to go back and rewrite the query parsing to not use Antlr. Oh well, that shouldn't take too long.

Then on to actually generating Java byte code! I'm going to try using ASM to start. It looks promising and even has an Eclipse plugin, but if it doesn't work out, there are other alternatives.

Wednesday, February 25, 2009

jSuneido Progress

I pretty much have the parsing port finished. I think of Suneido as a "small" language, but it's big enough when you're trying to parse it! I have over 90% test coverage (as reported by EclEmma). The only parts that aren't well tested are the error handling. I should probably add a few more tests for that.

The tests are more "functional" than "unit" tests. If something broke they wouldn't be much help locating the exact source of the problem. But I think it would be hard to write true unit tests for something like a recursive descent parser. I'm happy enough with the functional tests.

There are probably a few things I've missed, especially to do with optional semicolons. Next I plan to try parsing all our Suneido source code. That should flush out any remaining issues.

I also want to go back and rewrite the query parsing so I can eliminate the remaining dependencies on Antlr. This shouldn't be too hard. The lexical scanning is the same as the language, just with a different set of keywords, so I don't have to write that.

Then I'll be on to the fun challenge of actually generating Java byte code.

In the long run I'd prefer not to support two implementations of Suneido. Assuming jSuneido works out for the server, I'd like to move on to figuring out how to implement the client side. The hard part here is the UI. Currently, Suneido's UI calls the Win32 API directly.

One short term solution might be to look at whether Java can call the Win32 API. If so, this would allow a quick port, although it would limit the client to Windows.

Another obvious, but longer term, approach would be to rewrite the UI in Swing or SWT.

A less obvious, but interesting approach would be to go with a browser type approach using HTML, CSS, Javascript. This is a really messy, ugly approach, but it has some interesting properties. For example, it's the approach being taken by the new Palm webOS.

In any case, this is all speculation. I've still got plenty to do to get the server part of jSuneido working.

Tuesday, February 24, 2009

Eclipse Tip for Windows and Mac Users

Switching back and forth between Windows and Mac, my fingers find it hard to remember whether copy/cut/paste/undo/redo are with the control key or the command key.

In Eclipse you can configure the keyboard so you can add the control versions of these (as well as the command versions).

I also added home and end (Line Start/End).

One nice feature of Windows under Parallels on the Mac is that it lets you use either command or control.

Sunday, February 22, 2009

There's Always a Better Way

I'm working on the parser for jSuneido and it's going well.

In a few places the parser needs to "look ahead" at future tokens.

In cSuneido I did this by looking ahead directly in the source string.

When I was porting this code it seemed a little ugly. For example, it handled skipping whitespace but not comments. (Meaning you couldn't put a comment in certain places in the code.)

My next thought was to save the state, scan ahead, then restore the state. Easy enough, especially since it didn't have to be nested.

But it still seemed a little ugly.

Then I realized, I could just clone the lexer, and read ahead in the clone without disturbing the original. (I didn't use the Java clone stuff since it's a little twisted. I just made a "copy" constructor that takes another lexer.)
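
The idea is just this (a sketch, not the actual jSuneido lexer):

// A copy constructor lets the parser scan ahead in a throwaway copy
// without disturbing the original lexer's position.
class Lexer {
    private final String source;
    private int position;

    Lexer(String source) {
        this.source = source;
        this.position = 0;
    }

    Lexer(Lexer other) { // the "copy" constructor
        this.source = other.source;
        this.position = other.position;
    }

    // ... nextToken() advances position ...
}

Look-ahead then means making a copy, reading tokens from the copy, and throwing it away - the original never moves.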

I'm a lot happier with this solution. I bet there's an even better solution out there, but this one will do for now.

I think one of the big weaknesses in beginning programmers is that they tend to settle for the first thing that works. But there are always other solutions and the first one you come up with is unlikely to be the best.

Of course, this is also what refactoring is about - improving your code, not settling for the first draft.

The hard part in this porting is having to read all the old C++ code and see all kinds of ways it could be improved. But I have to resist refactoring the C++ code or I'll never make any progress on jSuneido. One thing at a time. It's enough of a challenge to make the Java code as clean as I can.

PS. I'm back to using Eclipse. I could be happy with NetBeans but I find Eclipse a little better. Of course, part of that is just that I've spent more time with Eclipse.

Thursday, February 19, 2009

NetBeans

In a totally unwarranted diversion from the job at hand I decided to have a "quick" look at NetBeans.

Download and install were easy and quick.

A minor quibble was that it asked me to register and I would have, but it just took me to a web site to create an account. No thanks, I've got better things to spend my time on than creating yet another account.

NetBeans can import Eclipse projects so I had my project open in no time.

I ran all my tests and ... one failed! It complained that I was using the wrong assertEquals to compare floating point numbers. I didn't actually want to compare floating point numbers - I had used Math.signum instead of Integer.signum. Easily fixed. Strange that Eclipse didn't catch that. Maybe a newer version of JUnit.

This was a good opportunity to use NetBeans' out-of-the-box Subversion. Or not. It told me my Subversion client was too old and refused to do anything. So much for out-of-the-box.

As with Eclipse, there appear to be different ways to connect to Subversion. One of them is with JavaHL - part of Subclipse. Another is through the command line client.

I updated my plugins but that didn't help.

I tested the command line client and it was indeed old. (Apple seems to like to stick to old versions.) One recommendation was to install a newer version from Collabnet. Which meant I had to sign up for a Collabnet account.

Then I had to give NetBeans the path to it. Now my Subversion appears to work. But again, like with Subclipse, it was definitely not as easy as it should be. Presumably every Mac + NetBeans + Subversion user (that hasn't manually upgraded their SVN client) is going to have the same problem.

Another minor quibble is that I didn't like the font in NetBeans (Monospaced) as much as the one in Eclipse (Monaco) but that's easily adjusted in the Preferences. Probably just a matter of what I'm used to.

So I start programming in NetBeans. I add a class, oops, meant to add a JUnit test. Try to delete the class (that I just created) and it won't let me. The error message is ultra helpful: "cannot delete file". But I can delete it fine from the Finder, and then it disappears from NetBeans. Very strange but I'm not going to worry about it. (until the next time!)

One feature that I'm missing already is that NetBeans doesn't seem to offer to create missing methods like Eclipse does. This is great for "programming by intention" where you just write your code, calling functions that you haven't written yet, and then go back and create them. (NetBeans will automatically create unimplemented methods from interfaces.)

I do like how Find is a bar at the bottom of the editor, like Firefox, and what I recently added to Suneido. But Replace is still a dialog :-( as it is currently in Suneido (but I am planning to change Suneido's Replace to a bar as well).

Wow! I just realized that the HOME and END keys are working "properly" (i.e. beginning and end of line like Windows) in NetBeans! To get these working in Eclipse I had to customize the keyboard mapping. On the Mac HOME and END normally scroll to the top and bottom, but for programming or even just writing, I more often want to go to the beginning or end of the line.

Despite the messing around I managed to get a fair bit of the parser ported. I'm glad I'm separating the parsing from the code generation so I can work on (and test) one thing at a time. It's also turned out to be quite easy to split the parser into several parts (constants, expressions, statements). This should let me re-use the expression parser in queries. (Suneido's query expressions have always been a subset of the language expressions because I never got around to implementing everything in both places.)

I looked for a metrics plugin for NetBeans so I could see how many lines of code I'd written today (roughly 300 plus tests) but the only one I found needed to be built from source and I couldn't be bothered right now. There do seem to be more plugins for Eclipse than for NetBeans.

Eclipse + Subversion

One of the problems with third party add-on/plug-in architectures, like Eclipse or Firefox is that installing the software isn't enough, you also have to reinstall all your add-on's.

You would think that since this is such a common issue that there would be tools to ease this process but I haven't encountered any. You should at least be able to export a list of add-on's from one installation and import them into another. You can sync your bookmarks and passwords in Firefox (with add-ons) but you can't sync your add-on's.

After reinstalling Eclipse (eclipse-java-ganymede-SR1-macosx-carbon.tar), one of the first things I needed was access to Subversion. Eclipse comes with support for CVS but not Subversion, even though most people moved from CVS to SVN long ago, and are now moving from SVN to Git.

I've been using Subclipse but I thought I'd check if that was still the best option. I found there was also Subversive. It came from Polarion but it's now part of the Eclipse project, so at first I thought that would be better. But a little research gave mixed results. There's obviously some squabbling between the two projects. Apparently Subclipse didn't join the Eclipse project due to licensing issues. Subversive got around the licensing issues by not providing all the components, requiring you to separately install some parts.

The overall feeling I got from my quick research was that both plug-ins provide much the same functionality, but people seemed to have fewer problems with Subclipse. There are enough problems with both that some people recommend just using TortoiseSVN outside Eclipse. But TortoiseSVN is only available on Windows (where I do use it).

So I tried to install Subclipse. And got errors about dependencies. Sometimes I just have to shake my head at the state of our technology. This is a fresh install of Eclipse and a fresh install of Subclipse. You can't get much simpler than that. Wouldn't that be the first, most obvious test case? To be fair, maybe it's something to do with being on Mac OS X or something specific to my setup. You never know. But more research (what did we do before the internet and search engines?) showed that I wasn't the only one with this problem.

One suggested fix was to uncheck some of the optional components - that got rid of the errors.

When I had added Subclipse, it had automatically added SVNkit (from Eclipse itself). One of the optional components in Subclipse is the SVNkit adaptor. When you uncheck this, it unchecks the whole SVNkit plug-in. It appears that it's the SVNkit that "causes" the errors, not the optional Subclipse components.

The "best" part, is that when you proceed to the install, it appears that it is installing SVNkit after all - despite it being unchecked. I'm not sure what's going on, but it seems to work so I'm not going to waste any more time on it.

Here's a screen shot to illustrate:


Maybe I should be trying NetBeans. It doesn't have Antlr support, but now that I'm not using Antlr, that's not an issue. And it does have official support for Subversion!

Monday, February 16, 2009

Computer Frustrations

Sometimes computers really piss me off.

I thought I was finally figuring out Antlr. Then my grammar worked in AntlrWorks, but not in Eclipse. AntlrWorks is using Antlr 3.1 but Eclipse was using Antlr 3.0. In the Eclipse preferences I see I can choose 3.1 but when I do, Antlr quits working. Ok, I guess I need to upgrade. But I expected this would be a problem because I haven't been able to update in Eclipse for quite a while. I try to update anyway, but it just tells me I can't, for some cryptic reason that doesn't help me at all.

This roadblock makes me think about whether I should be using Antlr. I've already spent quite a few days struggling with it and progress is very slow. I'm pretty sure I could have ported the C++ code to Java by now. One of the big reasons to use a tool like Antlr is to work at a higher level that's easier to modify. But the more I add to my grammar, the more brittle it seems to get. Small changes tend to break the whole thing with a cascade of vague messages.

In a way, it's understandable. I'm trying to port a grammar from a hand written recursive descent parser. In a hand written parser you can do all kinds of things that aren't possible in a more structured system like Antlr. And things like optional semicolons are not the easiest to handle with tools like Antlr.

The ironic part is that Antlr works great to do simple stuff (like Suneido's query language) but for more complex stuff it becomes increasingly difficult, at least if you're not an expert with it. Of course, that makes it almost useless - complex stuff is where you need it. If you were developing a little language from scratch it would be fine because you could just tailor the language to what Antlr handled easily.

I hate to give up, especially on an interesting challenge like this, but I have to remind myself that the goal is to port Suneido, not to spend endless hours struggling to learn and use some tool. And even if I got it to work, I'm then stuck with a dependency on Antlr. Which is already causing me problems and will just make it harder for anyone else who might want to build or modify jSuneido. And heaven forbid if they want to change the grammar - if they touch it, they're more than likely to break it, with no idea how to fix it.

To make a long story short, I'm giving up on Antlr and I'm going to have to start over with a fresh install of Eclipse in order to be able to update.

But that's just the beginning of my current computer frustrations. My irritation level is already quite high because of way too much obligatory family stuff this weekend. But before I started back on jSuneido I figured I'd better fulfil one last duty and download the photos Shelley took during the family stuff. Easy, just fire up Lightroom, import the photos, and copy them over to the file server so Shelley can look at them.

Wrong. Lightroom hangs up part way through the import. And not only hangs up Lightroom, but the whole of OS X. I end up having to force the power off and reboot. I thought it was because I was messing around while I was waiting for the import to finish. So I try again, this time not touching anything while it's working. It gets a little farther, but again it locks up the whole computer. WTF! I force the power off again and reboot. But this time it corrupted the Lightroom catalog and it can't repair it. Grrrrrr.

I have backups, but they don't include my last session in Lightroom because Lightroom backs up at the beginning of a session. That seems totally backwards to me. First, I want to back up what I do in a session. I may go weeks before I use it again, during which time my last session hasn't been backed up. Second, checking the catalog and backing up takes quite a while. Having to wait for that when I go to do something with Lightroom is really annoying, which means half the time I skip it, meaning my backups are liable to be even more out of date. I'd much rather it did the backup when I quit, since it wouldn't be holding me up and I wouldn't be so tempted to skip it.

And there's yet another frustration in this. My first instinct was to recover the file with Time Machine, since it's saved me a few times. I don't use it very often so I thought I was doing something wrong when I couldn't get it to work. Then I realized the Lightroom folder is inside the Pictures folder, and I don't back up the Pictures folder since it's so big. And I keep a mirror of it on the file server anyway so Shelley can access it from her Windows machine.

There also used to be problems with Time Machine having trouble backing up the Lightroom catalog when it was open, but I think/hope that's fixed. (???)

I start a Time Machine backup to get the Lightroom folder backed up. It takes forever to "prepare the backup" and then it says it's backing up 16 gb. WTF again. There's no friggin way I have added / modified 16 gb of stuff since the last Time Machine backup. Then I realize that the Lightroom folder also contains the preview cache, which I see has grown to be 13 gb! Ok, stop the backup (which also takes forever, for some reason), exclude the previews, restart the backup, wait for it to "prepare" ...

So I move my Lightroom folder out of the Pictures folder so it will get backed up. I start up Lightroom, point it at the new catalog location, and ... it crashes. Now I'm getting really pissed off. I mess around with various ways of starting Lightroom, with various catalogs and eventually I get it running.

Meanwhile, half my morning is gone and I haven't accomplished a damn thing other than to raise my blood pressure. Sometimes you have to wonder about these "time saving" machines.

I still need to import the photos but I'm a little gun shy - I don't really want to crash my whole computer again. I was trying to import directly from the memory cards (the way I always do). Maybe something in the low level memory card reader drivers was the problem, so I copy the files on to the hard drive and try importing from there. Success!

Then I notice that a bunch of the file names have "-1" on the end. Argh! That's because I already imported a bunch of them. They weren't in the Lightroom catalog because I restored to an old version, but the actual files were still there. So now I have a bunch of duplicate files. (Lightroom normally prevents this, but only if they're in the catalog.)

Lightroom has a function to synchronize the catalog with a folder, so I use that. And ... it hangs. This is not my day. Lightroom normally runs very smoothly. Of course, I'm off the "happy path", and I'm thrashing around, which is always dangerous.

Maybe the Time Machine backup I've got running is interfering with Lightroom. It still hasn't finished "preparing" but at least it stops fairly promptly when I tell it to.

I go to force quit Lightroom and a dialog shows up. ARGH! again. It wasn't hung up, it was just asking what I wanted to do with the file differences. Being reminded of my own stupidity doesn't help my mood. Besides, I need to synch after I clean up the duplicate files, not before.

I use a shell to rm *-1*. No easy way to do that from the Finder or from Lightroom. If you didn't know how to use the shell you'd be stuck with dragging files to the trash can one by one. GUI's aren't always so great.

I synch again, this time waiting for the dialog to pop up. Finally, I have the photos imported.

I start up ChronoSync and update the mirror of my photos on the file server. (I don't use its scheduler to do this automatically because it always seemed to want to run at the worst possible times and most of the runs did nothing because I hadn't imported anything new.)

Time for coffee. Then maybe I can actually do something productive. If you can call it "productive" to abandon multiple days of work with a sophisticated tool and go back to implementing by hand.

Saturday, February 14, 2009

New Add-On System for Suneido

I just added an article on the Suneido web site about a new add-on system.

I'm currently reading Clean Code (recommended) so I was pleased with the small methods in the add-on code. However, I was half way through writing the code before I realized I wasn't writing tests, let alone tests first. Oops.

When the methods get so small I start to resent all the lines with just curly braces and start to wonder if a Python indentation based style might not be better suited.

Thursday, February 12, 2009

jSuneido Antlr Parsing

More Antlr grammar fun.

I'm still struggling with optional semicolons. I thought I had it figured out but I had overlooked some cases. You can "view" semicolons three ways:

separators, as in "x ; y ; z"

terminators, as in "x; y; z;"

empty statements, as in "while (x) ;"

I had them working as separators, but that didn't handle "x;".

That was easy enough to fix, but then it didn't handle "while (x) ;". And it treated "x;" as two statements, "x" followed by an empty statement, which isn't what I want. (Although it might not matter for code generation.)

Taking the terminator approach, I want statements to end with either a newline or a semicolon, unless they are followed by a "}" or an else, or several other cases. Yuck.

Also, reading the Antlr book more, it seems that just turning on backtracking globally may not be the best approach since you no longer get any warnings and it can potentially make the parser very slow. The alternative is to rewrite the grammar to avoid the ambiguities, turn on backtracking selectively on certain rules, or add your own syntactic predicates (instead of letting backtracking add them automatically).

To further confuse things, the book recommends ignoring some warnings (such as the classic if-then-else ambiguity) rather than fixing them with syntactic predicates since the default resolution is the correct one and the fix is less efficient. But that goes against the normal recommendation to eliminate all warnings, otherwise you may miss valid warnings mixed in with the ones you're supposed to be ignoring. The other problem is that I'm not sure I know how to figure out whether the Antlr default is what I want.

I removed the backtrack option and got rid of all but one ambiguity by tweaking the grammar. It was complaining that conditional expressions were ambiguous. At first I thought it was the if-then-else problem, but because the ":" part isn't optional like "else", that's not the problem. I found if I removed my assignment rule it solved the problem. Aha, the ambiguity is probably between "(x = y) ? ..." and "x = (y ? ...". You wouldn't have this problem if assignments were statements rather than expressions.

But it still seems a little odd, because that seems like a precedence issue, which in top-down grammars is handled by the order of nesting of rules. Wait a minute, assignment has the lowest precedence, so it should be first. But I had it last because that's how cSuneido had it. Now my grammar compiles cleanly with no warnings.

[I hadn't used the AntlrWorks IDE for a while, but I started using it again to experiment.]

It appears I have an approach that will handle the optional semicolon issue. [source] Before I complete the grammar, there are a few other questions I'd like to answer.

cSuneido emits byte code directly during the parse; it doesn't build a tree first. I think I should be able to do the same thing with the Antlr parser. The advantage of this approach is performance. The disadvantage is that you don't have a tree to use for other purposes. It's also hard to test since the only output is byte code which isn't human readable. You have to translate it to a form of assembly language to get something readable.

I'd like to decouple the parser from the code generator in a way that would allow me to attach a tree builder instead of the code generator. That will keep the performance advantages of direct code generation but still let me get a tree. For testing I could write something that will just produce strings.

This sounds simple, but it's somewhat complicated because generating code requires actions at intermediate points within the rules. e.g. "if" requires outputting a conditional branch between the expression and the statement. On the other hand, building a tree requires returning partial tree results from rules.

Antlr has a built in system for building abstract syntax trees. It would be great if I could use this since it would simplify the code a lot. But when I'm generating byte code I don't want to actually generate a tree. I wonder if I could write my own Tree class that would generate byte code but not actually build a tree. I'd then have to augment this with the extra intermediate actions. Unfortunately, the Tree interface is somewhat complicated. The BaseTree that is described as having no "payload" still keeps a list of its children, which I want to avoid when generating byte code.

Despite the extra work, I think it may be simpler to roll my own rather than try to make the tree stuff work in ways it wasn't intended to.

I modified Language.g to call a Builder interface and then implemented a StringBuilder used for ParseTest. So far so good. I still have to add the intermediate actions, but that shouldn't be too hard. Then I can start to implement a Builder that does code generation.
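
The Builder interface ends up being something like this (a rough sketch with hypothetical method names; the real one has many more methods):

// The parser calls these as it recognizes constructs; one implementation
// can emit byte code, another can build strings (or a tree) for testing.
interface Builder {
    Object identifier(String name);
    Object constant(Object value);
    Object binaryOp(String op, Object left, Object right);
    Object ifStatement(Object condition, Object thenPart, Object elsePart);
    Object statementList(Object list, Object statement);
    Object function(Object params, Object body);
}

A string-producing implementation just formats its arguments, so the parser tests can compare readable text instead of byte code.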

ARGH! Rereading what I had written, I realize I didn't solve the empty statement issue. For example, it can't parse "if (x) ;" I guess that's what I'll have to work on next!

Tuesday, February 10, 2009

jSuneido, Java, and Antlr

Back to working on jSuneido. I'm writing the Antlr lexer/parser for the Suneido language. In the process I'm discovering a few issues.

The return value from my language parser was wrapped in another class (constant_return). But that's not the case in my query parser - it returns the result value directly. Obviously the grammars are different but I can't see what would cause this difference. I dug around trying to find an answer but I gave up. If I wanted to spend the time on it I could start cutting down the grammars until I found what was doing it by a process of elimination. But it's easier to just extract the value out of the class. [Later: the problem went away when I added more rules.]

I wasn't handling explicit signs on numeric constants properly. I had an optional minus sign on the lexical rule for numbers, but that only worked if there was appropriate whitespace. "a - 5" was ok, but "a-5" got interpreted as "a -5" (no operator) and gave an error.

The fix led to ambiguities between whether "-5" was a numeric constant or an expression with a unary minus operator. I wanted it to be a numeric constant so I had to change my expression rules a bit.

I notice I don't seem to allow a plus sign on numeric constants in cSuneido. It's redundant, but still seems to be an oversight.

I wasn't handling errors in the lexers. I'd put in the recommended code to abort parsing on the first error, but I didn't realize I had to do something similar for the lexer. Eventually I found How can I make the lexer exit upon first lexical error? which seems overly twisted, but does the trick. I needed this in the query grammar as well.

SuDecimal.toString wasn't behaving like cSuneido. e.g. 1000 would come out as 1e3. BigDecimal.toPlainString works better in this case, giving "1000", but it doesn't switch to exponential notation for big exponents like cSuneido does. cSuneido switches to exponential notation for exponents bigger than 19. It's probably not critical to match this exactly but it should still switch to exponential notation for big exponents.
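
For example, the underlying BigDecimal behaves like this:

import java.math.BigDecimal;

class DecimalDemo {
    public static void main(String[] args) {
        BigDecimal n = BigDecimal.valueOf(1, -3); // unscaled value 1, scale -3, i.e. 1000
        System.out.println(n.toString());         // 1E+3 - scientific notation
        System.out.println(n.toPlainString());    // 1000
    }
}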

The parsers were accepting "extra" stuff on the end. You have to explicitly match EOF. You would think that would be the default. Or at least mentioned prominently. Yet EOF isn't even in the Antlr book index, and it's not in the examples.

One of the big issues with converting Suneido's language parser to Antlr is optional semicolons. I figure I'd better tackle this early in case it's a show stopper. Or if not a show stopper, something that will require a specific approach to writing the grammar.

Thankfully JavaScript (ECMAScript) also has optional semicolons so I could look at examples for it.

You can use Antlr's automatic backtracking, as in one example. But this example works by treating newlines as significant and optionally matching them everywhere applicable - a lot of places!

Backtracking is not used in another example. Instead it uses some tricky code to "promote" applicable newlines to be significant.

The first approach seems simpler, albeit more verbose.

One ambiguity is the return statement. "return \n 123" can be either "return; 123" or "return 123;". Strangely, Antlr doesn't complain about this issue. It takes "return \n 123" as a single statement, whereas cSuneido would take it as "return ; 123".

Of course, Suneido's handling of optional semicolons is not exactly the same as JavaScript. That would be too easy :-) I'm not even sure cSuneido is totally "correct". For example, cSuneido takes "5 \n -4" as "(5 - 4)" because it looks ahead to see if following lines start with a binary operator. But to me it seems like it "looks" more like "5; -4"

In certain contexts, newlines are not significant. For example, if you're inside a parenthesized expression. And "f() g()" should be legal, i.e. two statements. But "(f() g())" should be illegal. This seems to be handled by Antlr with backtracking.

I've always felt a little guilty that I wrote some of the tests for cSuneido in stdlib rather than in C++ code. But it was a lot easier to write certain tests that way. Now that I'm writing jSuneido I find there is an advantage to that approach. To keep the C++ tests I have to rewrite them in Java. But the ones that are in stdlib work as is. It's making me think that it would be nice to have a "complete" set of tests for the executable in stdlib, much like Rubinius initiated an RSpec test suite that "defines" Ruby.

Of course, the reason I'm thinking of this now is that I don't have a clear definition of how optional semicolons are supposed to work in Suneido, other than the cSuneido implementation. A set of tests would be nice! The downside of tests in stdlib is that Suneido has to be fairly complete to run them. And my language implementation is barely started right now.

I had to temporarily kluge some string results from my rules in order to test the parsing, but with not too much messing around it seems to be working. There are some subtle parts: "return expr?" is different than "return | return expr". The first seems to be what I want in order to interpret "return expr" as one statement instead of two.

So far so good. It looks like it may not be too hard to get Antlr to do what I want.

Monday, February 09, 2009

Drag and Drop Files into Gmail with Firefox

A post by Tim Bray led me to a post by Mrgan whose main complaint is that you can't drag and drop files onto Gmail attachments.

I agree this is a shortcoming. But if you use Firefox you can fix it with the dragdropupload add-on.

And the advantage is that this add-on works with any web site/app, not just Gmail. Whereas, if one desktop app implemented drag and drop, that wouldn't help any other desktop app.

Mrgan's other main comment is that off-line already works with desktop mail apps. That's true, but I still think Gmail's new offline mode is better - you can use it on multiple machines, and it has a flaky connection mode. And unlike POP (IMAP is better), it keeps your email on the server so you don't have to worry about backing it up.

Even Tim Bray seems to think that email is better as a desktop app. Maybe that's because he uses a single machine (a laptop) for work, home, and travelling. For me, using multiple computers (and operating systems), I think Gmail is a lot better. I think I'd prefer it even if I used a single computer, but the advantage might not be as clear.

And, unlike Tim, I do think that Gmail's "conversations" are a big improvement. They take a little getting used to, so anyone who doesn't use Gmail won't necessarily see the advantage. But I find I really miss Gmail's conversations when I use other email apps.

Another thing I think Gmail should be applauded for is that it doesn't add any "junk" to my messages. I think it's pretty ugly when I get professional business communications with Microsoft or Yahoo ads (spam!) on the end of their emails.

Saturday, February 07, 2009

Forking Ruby

Dave Thomas gives an interesting keynote at RubyConf on why it might make sense to fork Ruby.

It's interesting how the new Ruby syntax for hashes is much closer to Suneido's, as is Dave's proposed syntax for blocks/closures.

Thursday, January 22, 2009

More Antlr Fun

When I started on the Antlr grammar for the Suneido language I realized I hadn't handled escape sequences in string literals in queries.

I thought this would be easy, but I thrashed around for quite a while on it.

There are lots of examples like:
STRING_LITERAL
: '"' (EscapeSequence | ~('"'))* '"'
;
fragment EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'v'|'\"'|'\''|'\\')
| '\\' 'x' HexDigit+
| OctalEscape
;
fragment OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
Looks straightforward. But ... there aren't many examples that show how to convert those escape sequences to the proper characters. There are a few like:
EscapeSequence
:'\\'!
( 'n' {$setText("\n");}
| 'r' {$setText("\r");}
| 't' {$setText("\t");}
Maybe that worked in Antlr 2, but it doesn't work in Antlr 3. The nearest thing to an explanation I could find was a cryptic response to a mailing list question. It seems to imply there is a way to do it, but if so, I couldn't figure it out. (I hope my responses to Suneido questions aren't as cryptic!)

I found a few references to this being an area where Antlr is weak. The book has very little material on lexical analysis.

I tried various combinations and permutations but I couldn't find any way to make it work. In the end I just handled it outside Antlr by writing a function that would convert the escape sequences in a string. So my grammar just has:
STRING
: '"' ( ESCAPE | ~('"'|'\\') )* '"'
| '\'' ( ESCAPE | ~('\''|'\\') )* '\''
;
fragment
ESCAPE : '\\' . ;
I guess I should have taken this approach from the start, but the examples seem to show handling more in the lexer. And like my last Antlr issue, it seems a little ugly to be able to recognize things in the grammar, but then have to reprocess them yourself later.
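
For what it's worth, here's a minimal sketch of the kind of conversion function I mean (hypothetical names, only a handful of escapes, and not the actual jSuneido code):

// Minimal sketch of an escape-sequence converter: handles \n \r \t and
// \xNN hex escapes, and passes any other escaped character (\\ \" \' etc.)
// through unchanged.
public class StringEscapes {
    public static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ++i) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                sb.append(c);
                continue;
            }
            char e = s.charAt(++i);
            switch (e) {
            case 'n': sb.append('\n'); break;
            case 'r': sb.append('\r'); break;
            case 't': sb.append('\t'); break;
            case 'x':
                int val = 0, n = 0;
                // consume up to two hex digits
                while (n < 2 && i + 1 < s.length()
                        && Character.digit(s.charAt(i + 1), 16) >= 0) {
                    val = 16 * val + Character.digit(s.charAt(++i), 16);
                    ++n;
                }
                sb.append((char) val);
                break;
            default: sb.append(e); break;
            }
        }
        return sb.toString();
    }
}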

jSuneido Milestone

I just fixed the last two failing stdlib tests. Now all the tests pass except, of course, for the ones that require the language (rules and triggers).

One of the bugs was a couple of lines of code I'd somehow omitted during porting.

The other was using 0x2f as a bit mask instead of 0x3f. Oops. The "funny" part was that this was in the code for handling minutes and seconds in date/times so it would cause intermittent failures depending on the exact time when you ran the test. It got down to the point where I was getting 7 and I should have been getting 23. The difference is 16, one bit. Aha! The C++ code was correct so I could have found it by comparison, but for some reason my eyes didn't pick up on the one character difference.
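
To see how that one missing bit turns 23 into 7, here's a tiny standalone illustration (made up for this note, not the actual date handling code):

// 0x3f = 0b111111 - a full 6 bit mask (enough for 0-59)
// 0x2f = 0b101111 - missing bit 4, i.e. the value 16
public class MaskBug {
    public static void main(String[] args) {
        int seconds = 23;                   // 0b010111
        System.out.println(seconds & 0x3f); // 23 - correct
        System.out.println(seconds & 0x2f); // 7  - the 16 bit silently dropped
    }
}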

I guess this is a pretty big milestone for this project. For some reason it doesn't feel like it. Too gradual, too much more to do.

I hadn't done any speed benchmarks up till now. For a quick comparison, it took 11 seconds to run the stdlib tests with jSuneido and 5 seconds with cSuneido. That's not a scientific or fair test, but I'm happy enough with the result. Considering I haven't done any optimization, or even worried about speed at all, it doesn't seem too bad. And this is just running jSuneido from within Eclipse. Presumably the server Java VM would be better since it does more optimization. And it's also crossing from Windows to OS X through Parallels which may add some overhead.

I've always been ready to accept a speed penalty for the Java version, as long as it was more than made up for by multi-threading on multiple cpus/cores. e.g. if it's 25% slower (i.e. 0.8 times the speed), but you can use 4 cores, you could still end up roughly 3.2 times faster overall. Of course, you don't achieve a full N times speedup from N cpus because of locking etc.

It would have been nice if there was an in-house app that I could switch to jSuneido. But our main apps require rules and triggers. Maybe our version control server - I don't think it uses any rules or triggers. That's a little scary, though. It's a pretty critical component. Especially since we generate the automatic nightly updates for our clients directly from version control. Might be wise to do a bit more testing first :-) I'm sure my staff would agree!

Tuesday, January 20, 2009

jSuneido Hung up on Antlr

Views were one of the last things to implement in the straight database server part of jSuneido.

This meant extending the Antlr grammars for requests (to define views) and queries (to expand views). I thought defining views would be easy and expanding them would be tricky. It turned out to be the opposite.

The trick with defining views is that I didn't want to parse the view definition, especially since the view definition follows the query grammar, not the request grammar. I just wanted to slurp up the remainder of the input as the definition. This sounds simple, but I thrashed for 3 or 4 hours on it.

I had something like:
request : VIEW id '=' VIEWDEF ;

VIEWDEF : .* ;
But you can't do that. What makes it trickier is that although this looks like a single grammar it actually produces separate lexer and parser classes. (lower case rules are for parsing, upper case for lexing.) The lexer operates independently so you can't define VIEWDEF as .* since that'll match anything/everything.

I tried predicates based on some example I found for implementing C macros:
request : VIEW id '=' {viewdef = true; } VIEWDEF ;

VIEWDEF : {viewdef}? .* ;
But this doesn't work either, probably for a variety of reasons, but one main one is that the parser and lexer are separate so the parser can't control the lexer (at least not easily).

So I decided to do the work all in the lexer:
request : VIEWDEF ;

VIEWDEF : VIEW ID '=' .* ;
But then I got "alternatives are unreachable". Eventually I found an explanation that .* is not greedy. That works well for something like '/*' .* '*/' but not so well for what I wanted. The suggested solution was to use ('\u0000'..'\uFFFE')* instead. Here's the end result:
request : VIEWDEF
{
String s = $VIEWDEF.text;
int i = s.indexOf('=');
iRequest.view(s.substring(0, i), s.substring(i + 1).trim());
} ;

VIEWDEF : VIEW WHITE+ NAME WHITE* '=' ANY EOF
{ setText($NAME.text + "=" + $ANY.text); };

fragment
ANY : ('\u0000'..'\uFFFE')+ ;
This seems a little ugly, but appears to work. It's a little roundabout to lexically split the view definition, then concatenate it back together, only to split it apart in the parser.

There could be a better way to do this but I couldn't find it. This will do for now.

I did find a bug in the standard library code while I was working on the views. The QueryCount function was trying to optimize the case where the query was just a table name. But something that looks like just a table name could actually be a view, in which case the optimization wouldn't work. I removed the optimization. It should really have been in the database anyway.

I thought this would fix the last of the test failures, but when I ran them all I found there were still a couple more failing. Hopefully one more day will clean them up.

As powerful as Antlr is, it's not the easiest system to work with. One of the next things I want to work on is compiling the Suneido language. I have a feeling I'll be putting in many more hours getting Antlr to do what I want. I could always go back to porting the hand-written C++ recursive descent parser, but I think Antlr is better if it's not too painful to get working.

Monday, January 19, 2009

Trace Based JIT Compiling

Mozilla recently announced their new TraceMonkey Javascript implementation using trace based JIT compiling.

It's pretty cool stuff. For more information you can download Andreas Gal's thesis, or read HotpathVM: An Effective JIT Compiler for Resource-constrained Devices or Incremental Dynamic Code Generation with Trace Trees.

I almost wish I was still working on Suneido's virtual machine :-)

TDM MinGW GCC 4

For some reason the MinGW project has been very slow to release a GCC 4 version. At first they said they were waiting for 4.1 but that arrived and still nothing. They eventually did release an experimental build but there hasn't been much action on this.

Checking the status recently, I searched for MinGW 4 and discovered the TDM versions of MinGW, including 4.3, which seem to be actively maintained.

In a way this is a "fork" of the MinGW project - one of the benefits of open source. But it's not a fork in terms of going in a different direction, only in terms of releasing sooner. I really wonder why the main MinGW project does not want to release version 4; maybe they have a good reason, but if so they don't seem to be communicating it very well. Obviously they could still maintain version 3 if they feel that is a better choice for some things.

I built Suneido with the latest TDM version. GCC 4 has a lot of improvements to optimization, but sadly it's still slower than Visual C++ (about 10% slower than VC9, which in turn is about 10% slower than VC7).

New Computer


As I mentioned in Double Speed, I ordered a new Acer Veriton L460-ED8400 for work. As you can see, it's certainly tiny (about the same size as the book included for comparison). I ordered from Frontier PC, who were very helpful figuring out what memory and hard drive I needed to go with it.

The only minor complaint about the Acer is that the connector between the optical drive and the motherboard is very flimsy. When we opened it up to put in the extra memory and the bigger hard drive a tiny fragment of plastic broke off the clip that holds the cable. We ended up having to replace it with a binder clip!

Otherwise, the Acer works great - it's small, fast, quiet. Vista seems to work well, including all the Aero effects.

I've gotten quite fond of the current Mac keyboard that I use at home. Compared to it, PC keyboards are incredibly clunky. They all seem to think the more bells and whistles the better. Personally, I want a minimum of bells and whistles. I wondered if a Mac keyboard would work on a Windows PC. I took my Mac keyboard to work and it functioned perfectly. So I went and bought another Mac keyboard. I use the wired version since I find the wireless one too small.


On the other hand, I'm not really fond of the Mac mouse. I'm not sure if it's the hardware or the driver, but right-clicking is painfully unreliable. It's almost as if they finally gave in to having a right click, but made it really crappy because they still didn't think you should be using it. It's especially painful running Parallels because Windows relies on right-clicking so much. So I bought a Logitech LX8 mouse. I liked the cordless receiver, especially since I could plug it into the front of the Acer. (It also would have been harder to use a Mac wireless mouse, since those are Bluetooth and most PC's, including the Acer, don't have Bluetooth built in. Macs do.)

I kept the same monitor I've had for a while - a 24" Samsung SyncMaster 245BW that I'm quite happy with. I haven't gone to dual monitors, preferring a single large monitor. Partly maybe because I prefer to focus on one thing at a time - the same reason I don't keep my email running all the time. If I was buying a new monitor I think I'd be tempted by the new Apple 24-inch LED Cinema Display.

I'd say it was the easiest switch-over I've had. Despite moving more and more to on-line apps, I still install and use quite a lot of software. I wish there was a way to migrate Windows apps to a new computer. On the other hand, it's also nice to start fresh with clean installs.

Here's the main stuff:

- Kaspersky (we have been using AVG but we're thinking of switching)
- Firefox
- Thunderbird (for our separate internal email system)
- Open Office
- Scite
- Winmerge
- Gmail Notifier (so email links go to gmail, but not running all the time)
- Snagit
- PDFCreator
- KeePass
- LogMeIn
- TortoiseSVN
- MinGW G++
- Visual C++ 2003 (vc7)
- Visual C++ 2008 express (vc9)
- 7Zip
- GnuWin32 - grep, findutils, diffutils, fileutils/coreutils
- Google Earth
- Google Picasa
- Canvas

Plus all my Firefox add-ons:

- Adblock Plus
- Delicious Bookmarks (classic mode)
- Distrust
- dragdropupload
- Firebug
- FireFTP
- Foxmarks
- Google Gears
- Google Notebook
- IE View
- S3 Firefox Organizer
- Web Developer
- Yslow

I used Foxmarks to sync my bookmarks and passwords (more for the passwords since most of my bookmarks are in Delicious). Too bad you can't sync the add-ons themselves.

Friday, January 16, 2009

jSuneido Progress

I had a good day working on jSuneido yesterday - got a bunch more tests passing. Several of the bugs fixed multiple tests which was nice. None of the bugs required a lot of effort to track down. A number of them were simply things I hadn't implemented yet, either deliberately or accidentally.

Only two tests are left failing (of the ones that don't involve rules or triggers) and both of them involve views, which I just haven't implemented yet. Views are handled by simple text substitution - replacing view names with their definitions. cSuneido did this "on the fly" during parsing but I'm not sure whether I'll be able to do that using Antlr. (No doubt it's theoretically possible, but I don't want to spend too much time becoming an Antlr expert!)
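
In case "text substitution" isn't clear, here's a rough standalone sketch of the idea (made-up names, and deliberately naive - it ignores problems like view names inside string literals or views that refer to other views):

import java.util.*;

public class ViewExpand {
    // replace each view name in the query with its parenthesized definition
    public static String expand(String query, Map<String, String> views) {
        for (Map.Entry<String, String> e : views.entrySet())
            query = query.replaceAll("\\b" + e.getKey() + "\\b",
                    "(" + e.getValue() + ")");
        return query;
    }

    public static void main(String[] args) {
        Map<String, String> views = new HashMap<String, String>();
        views.put("curr_orders", "orders where status = \"open\"");
        System.out.println(expand("curr_orders join customers", views));
        // prints: (orders where status = "open") join customers
    }
}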

Another benefit is that as I fix the bugs, the IDE (Windows cSuneido client connected to jSuneido server) has, not surprisingly, gotten a lot more stable.

It's nice to see "the light at the end of the tunnel" i.e. the end of this stage. I'm looking forward to starting the next stage (probably compiling the language to Java byte code).

Wednesday, January 14, 2009

Stupid Programmer Tricks

In this case, the stupid programmer is me :-(

For the last week or so I've been working (off and on) on tracking down the difference in query optimization between jSuneido and cSuneido that I mentioned earlier.

Along the way I found and fixed a few minor bugs in cSuneido, which was good. One of them was caused by re-using an instance variable for two different purposes. Of course, this eventually caused a problem. I can't remember when I did this, but it seems incredibly short sighted. It was probably just the "easiest" thing at the time. Sigh.

You'd think by now we would have run into any bugs in cSuneido. But when the only effect is a slight difference in optimization, it's not at all obvious.

To find the optimization difference between cSuneido and jSuneido I ended up having to insert debugging into both cSuneido and jSuneido and compare internal intermediate results until I found where the difference was coming from.

As usual, it turned out to be something stupid. When I was porting the code from C to Java I had made a "minor" improvement. I don't even remember doing it. But that minor change was enough to throw things off. It wouldn't have been so bad if I'd made the same change in the C code, but I didn't. When will I learn that I shouldn't change things during porting!

Oh well, one more down, a bunch more to go.

Tuesday, January 13, 2009

Parallels Clipboard Problem

I don't often copy and paste between Windows under Parallels and OS X. But it seems like every time I try, it doesn't work.

I've discovered one fix is to restart Windows. (Just suspending and resuming is not sufficient.)

I'm guessing the problem is that the Parallels Tools that run within Windows are getting messed up.

This is annoying because restarting Windows is not the speediest process.

This is on Parallels 4.0.3540 (November 22, 2008) running Vista.

Another Query Optimization

Database query optimization in Suneido is a collection of heuristics. Most of them have come from finding queries that aren't executed efficiently and then figuring out ways to improve them.

The latest example was a query like:
(table1 leftjoin table2) where table2_field = "stuff"
Joins are implemented by reading sequentially through the left hand table (or sub-query) and doing lookups on the right hand side. This means you can read through the left hand side in any order. The right hand side requires an index on the join fields (to do the lookups efficiently).

Normal joins are symmetrical, but the implementation isn't, so part of the optimization process is deciding whether to reverse the order of a join.

But leftjoins preserve all the records from the left hand side even where there are no matching records on the right. This means leftjoins are not reversible, which limits optimization and sometimes leads to inefficient execution.

The trick is that this case:
table1 leftjoin table2 where table2_field = "stuff"
is actually equivalent to:
table1 join table2 where table2_field = "stuff"
because table2_field can only be "stuff" if there are matching records.

Once we see that it's equivalent to a normal join, then it's reversible, and there are more opportunities for optimization.
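
If it helps, here's a self-contained toy demonstration of that equivalence (made-up data, nothing to do with the actual query implementation):

import java.util.*;

// (t1 leftjoin t2) where t2_field = "stuff" gives the same rows as
// (t1 join t2) where t2_field = "stuff", because the extra rows the
// leftjoin keeps have empty right-hand fields and get filtered out.
public class LeftJoinDemo {
    public static void main(String[] args) {
        List<String> t1 = Arrays.asList("a", "b", "c");          // left-hand keys
        Map<String, String> t2 = new HashMap<String, String>();  // key -> t2_field
        t2.put("a", "stuff");
        t2.put("b", "other"); // "c" has no match in t2

        List<String> leftjoinThenWhere = new ArrayList<String>();
        List<String> joinThenWhere = new ArrayList<String>();
        for (String key : t1) {
            // leftjoin keeps "c" with an empty field; the where discards it
            String field = t2.containsKey(key) ? t2.get(key) : "";
            if (field.equals("stuff"))
                leftjoinThenWhere.add(key);
            // join drops "c" up front; the where gives the same final result
            if (t2.containsKey(key) && t2.get(key).equals("stuff"))
                joinThenWhere.add(key);
        }
        System.out.println(leftjoinThenWhere.equals(joinThenWhere)); // true
    }
}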

You might wonder why the programmer wouldn't just see this and write the query with a join instead of a leftjoin. In this example, the leftjoin was in a view definition, with the where added later. So the programmer only saw:
myview where table2_field = "stuff"
Besides, programmers don't always catch these things. It's a lot better if the system handles it.

Suneido's query optimization has two main phases - transform, which moves and merges operations, and optimize, which chooses strategies and index use.

This new optimization will be in the transform phase. Suneido already tracks "fixed" fields that are restricted by where's to a constant value. So if we see a fixed (and non-empty) field on the right hand side of a leftjoin, we can convert to a regular join.

Note: Suneido already moves where's as close to the tables as possible, so even if the query starts out as:
(table1 leftjoin table2) where table2_field = "stuff"
it will be transformed to:
table1 leftjoin (table2 where table2_field = "stuff")
This makes it easier to "see" the fixed field on the right hand side.

This shouldn't be too hard to implement, especially since I've been working on the join code, fixing a bug I found recently (while working on jSuneido).

[I'm not sure how much sense this will make if you're not familiar with this stuff. I wrote it for my own notes as much as anything.]

Friday, January 09, 2009

Misc. News

Google Books Settlement Agreement
This sounds like good news - better accessibility for out of print books, and the ability to "buy" books to read on-line.
iTunes 8 "drops" DRM
I prefer to buy music digitally rather than physical cd's. But I've stayed away from DRM protected music. Not because I want to pirate it (I'm trying to pay for it!), but because it's hard enough to switch players and computers and operating systems etc. without having to deal with DRM issues.

But I still ended up buying cd's because iTunes didn't have the iTunes Plus DRM free version. Amazon has been selling DRM free music in the US and had promised to make this available in Canada in 2008 but that didn't happen.

I'm still running into some music that's only available in DRM versions on iTunes, but according to Wikipedia they plan to be 100% DRM free by April.
Picasa for Mac
I used to use Picasa on Windows to manage my photos. When I switched to using a Mac at home I really wished they had a Mac version. They've finally released it, but meanwhile I've got hooked on Lightroom so I probably won't use it much.

I still recommend Picasa to Windows users looking for something simple to manage their photos. It's easy to use and does a good job. And with version 3 it's got even more tools for tweaking your photos.

It's hard to say how it'll do against iPhoto on the Mac.
Bug Labs Announces Five New Modules
These guys have a cool product and these new modules make it even better. I'm itching to get one to play with, but I know I don't really have time for it, so I've resisted so far.

Tuesday, January 06, 2009

FindBugs Finds Bugs

I installed the Eclipse plugin for FindBugs without any problems.

I ran it and got about 80 warnings/errors.

It found one more place where I needed .equals - I'd missed it because I didn't think to look for != as well as == (oops!)

It found a bunch of places where I had compareTo without equals or equals without hashCode.
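
For reference, this is the kind of pairing FindBugs wants to see - a generic toy class, not anything from jSuneido:

// equals and hashCode defined together, and compareTo consistent with equals
public class Version implements Comparable<Version> {
    private final int number;

    public Version(int number) {
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Version && ((Version) other).number == number;
    }

    @Override
    public int hashCode() {
        return number; // must agree with equals
    }

    public int compareTo(Version other) {
        // returns 0 exactly when equals returns true
        return number < other.number ? -1 : number > other.number ? 1 : 0;
    }
}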

I fixed about half. The remainder are either minor or can wait. Ideally I'd like to clean them all up so I don't have to keep looking for real problems among the ones I'm ignoring.

The minimal effort to install and run FindBugs was definitely worth it.

Then I moved on to a test failure where jSuneido is optimizing a query slightly differently than cSuneido. Both ways work and there isn't an obvious "better" way. I could just ignore it. But in theory they're running the same algorithm so they should be coming up with the same answer. In my experience these kinds of anomalies usually lead to actual problems. (Where there's smoke there's fire.)

Along the way I have found a few minor bugs, even a couple that also exist in cSuneido. But so far I can't figure out the reason for the difference in behavior. It's frustrating, but I'll get it eventually.

Java Equals Issues

I tracked down a bug in jSuneido to a string comparison that was failing because I was using == instead of .equals

From The Java Programming Language:
The equality operators (==, !=) test for reference identity, not object equivalence. Two references are identical if they refer to the same object; two objects are equivalent if they logically have the same value.
and:
If you use == to compare string objects, you are actually comparing to see if they refer to the same object, not testing if the strings have the same contents.
I know this, but it's obviously not as ingrained into my thinking as it needs to be. When I see str1 == str2 I don't immediately see a problem. (It worked fine in C++.)

To complicate matters, it's ok to use == for primitive characters and numbers, for enums, and for comparing to null. It's even ok for strings if they're interned or if they came from literals (since literals are interned).

After finding this bug, I reviewed every == in my code. I found a half dozen more that should have been .equals (and got sore eyes).

If I had been writing the code from scratch I probably would not have had as much of a problem with this. But porting the code from C++ was worse because in C++ you handle this by redefining operator==. (And I didn't know enough to review every ==.)

To add insult to injury, switching to .equals opens up the potential for another error - null values. If you do x.equals(y) and x happens to be null, then you'll get a run-time exception. (Which you wouldn't have gotten with x == y.) This is why you should write "string".equals(x) instead of x.equals("string") because "string" can't be null.
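
A small standalone example (not from jSuneido) to make the pitfalls concrete:

public class EqualsDemo {
    public static void main(String[] args) {
        String a = "hello";
        String b = new String("hello"); // same contents, different object

        System.out.println(a == b);      // false - compares references
        System.out.println(a.equals(b)); // true  - compares contents

        String c = null;
        // c.equals("hello") would throw a NullPointerException,
        // but putting the literal first is null-safe:
        System.out.println("hello".equals(c)); // false, no exception
    }
}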

(Interesting that it's called a NullPointerException. I thought Java didn't have pointers :-) Another leaky abstraction.)

So potentially I've introduced new bugs in the process of fixing the ones caused by == instead of .equals

It looks like FindBugs can catch this problem, I'll have to give it a try.

Maybe I'll learn to like it, but right now this part of the Java language seems pretty ugly.

Monday, January 05, 2009

Apture

I've been trying out Apture on my blogs.

I first noticed it on the O'Reilly Radar blog on a post about Wendell Berry. Notice the little book icons next to some of the links. If you hover your mouse over one of these links, the Wikipedia article pops up. I thought that was pretty cool so I followed the "Get Apture for Your Website" link at the bottom of one of the pop ups.

It was easy enough to install - just add a widget to your Blogger layout.

It took me a while to "get" how it works. Their introductory video was good, and explained how to use it, but not how it works.

I thought you'd use it when you were writing a post, but you actually use it when you're just viewing. It must be storing the enhancements on Apture's servers and dynamically merging them into the page when you view it.

This is a bit like the services that let you add "comments" to third party web sites. The difference is that for those services, to see the comments, you had to be running some special software.

For Apture, it's the widget you add to your blog that merges the extra content. So viewers don't need any special software.

It seems a little like magic. And it's almost a little frightening since this gives Apture the ability to rewrite my blog entries. In theory they could add ads or security exploits or other nasty stuff. Not that it would make sense for them to do that, but the possibility exists. And I wonder if some third party couldn't take control of this "back door".

One big weakness is that this magic only works when you go to the blog web site. If you use a feed reader like Google Reader it probably won't work. On the positive side, a lot of traffic to blogs comes from search engines, which will take people to the web pages.

I can't see spending much time adding content with Apture. However, some of the things it does automatically, like adding pop ups for Wikipedia, don't require any extra work, and do enhance the experience if you do go to the web site.

For example, here's a link to the Suneido Wikipedia article. In a reader it's probably just going to be a regular link. But if you go to the web page for this post, you should see the little icon and get the pop up.

I may or may not keep using Apture, but it's an interesting technology.

Sunday, January 04, 2009

Suneido Design Flaws

These are things where, in retrospect, I think I made the wrong choices designing Suneido.

The challenge with changing any of these is not breaking existing code, or at least doing it in such a way that the broken code is obvious.

One approach is to support both the old and the new behavior during a "transition" period. The old way would be "deprecated", ideally in a way that can be automatically detected. Even if it can't be automatically detected, the programmer can still gradually switch code to the new way and then run tests to hopefully detect any problems.

Objects and Records Should Not Be Separate

I've talked about this before.

One of the reasons they're separate is that I didn't want the overhead of rules on every object. But the overhead is very small and this was likely premature optimization.

One problem with combining them is the record default value (see below).

Another problem is the global attachment of rules (see below). This could potentially "break" existing code.

Class Instances and Objects Should Not Be Combined

I think I did this mostly for ease of implementation. I may have also thought it reduced the number of concepts in the language.

However, class instances still have different behavior from regular objects so it didn't really simplify the language.

This means that class instances have all the same functionality as objects, like for-each and Map! and Sort!. But most of these aren't, or at least shouldn't be, applied to class instances. If anything, it tempts programmers to "take advantage" of this functionality, leading to unclear code.

On top of this, certain object methods are "basic" and can't be redefined/overridden. This was done both for performance (probably minor) and so that people reading the code would know the meaning of the basic methods without having to determine if they had been redefined.

One resulting drawback is that a class can't define a method with the same name as a basic method. e.g. a Stack class can't define a Size method. You can use a different name, e.g. Tally, but that means less consistent method names. It also means someone can mistakenly call e.g. Size instead of Tally and not realize the mistake, since they won't get an error.

Objects also support vector/array data, which isn't needed for classes.

Internally, class instances could still be objects, but it probably makes more sense to separate them since you wouldn't end up using much of the object code.

Changing class instances to not have all the same functionality as objects shouldn't break too much existing code.

Records Should Not Have a Default Value

record.foobar will have a value of "" even if foobar is not a member of the record, similar to an object where you had Set_default("")

One of the reasons for this is that database records may be physically "missing" fields. One way this can happen is if you create new fields on a table. Existing records are not updated, they are just missing these fields. Having a default value handles this.

It is also a small memory gain when you have many empty ("") fields because they don't need to be stored in the object.

The downside is that if you use the wrong field name or make a mistake in the field name, you don't get an error and the mistake can go unnoticed.

One way to keep the advantages and fix the downside would be to have records "know" their valid fields. For these fields, they would still return "", but invalid fields could throw an exception (like objects throw "member not found").
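
Here's a rough sketch of that idea (made-up names, written in Java just because that's what I'm working in - not actual Suneido internals):

import java.util.*;

// A record that knows its valid fields: valid but unset fields still
// default to "", while unknown field names throw, like "member not found".
public class KnownFieldsRecord {
    private final Set<String> validFields;
    private final Map<String, Object> values = new HashMap<String, Object>();

    public KnownFieldsRecord(Set<String> validFields) {
        this.validFields = validFields;
    }

    public void put(String field, Object value) {
        checkValid(field);
        values.put(field, value);
    }

    public Object get(String field) {
        checkValid(field);
        Object value = values.get(field);
        return value == null ? "" : value; // valid but missing => default ""
    }

    private void checkValid(String field) {
        if (!validFields.contains(field))
            throw new RuntimeException("member not found: " + field);
    }
}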

This would break existing code that depends on this default value.

Rules Should Not Be Global

When you create a rule, e.g. Rule_foobar, it will be used anywhere that you reference record.foobar

Like anything "global" this means that changes in one place, e.g. adding a rule, could break code in another place.

It also means that the only way to have a different rule applied, or to apply a rule only some of the time, is to use different field names. That can be confusing, and can lead to extra complexity in other rules that have to deal with different field names at different times.

You should be able to attach/detach rules on a record.

For rules you want "all the time" you could attach them in the table schema so whenever you read a record from that table it would have the rule attached.

This would break existing code that creates a record (not from the database) and expects rules to apply.

Charlie Rose - An hour with Bill Gates

I just got around to watching Charlie Rose interview Bill Gates. It's an interesting show and although I'm not always a big Microsoft fan, I can't help but be impressed by what Bill has done in the past, what he's doing currently, and what he plans to do in the future.

I was led to this show from Dale Dougherty's post on O'Reilly Radar - Admiring Bill Gates.

Bill and I started software companies at roughly the same time and we're similar ages, so I feel a certain connection, although obviously Bill is in a whole different league.