Thursday, January 22, 2009

More Antlr Fun

When I started on the Antlr grammar for the Suneido language, I realized I hadn't handled escape sequences in string literals in queries.

I thought this would be easy, but I thrashed around for quite a while on it.

There are lots of examples like:
STRING_LITERAL
: '"' (EscapeSequence | ~('"'|'\\'))* '"'
;
fragment EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'v'|'\"'|'\''|'\\')
| '\\' 'x' HexDigit+
| OctalEscape
;
fragment OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
Looks straightforward. But ... there aren't many examples that show how to convert those escape sequences to the proper characters. There are a few like:
EscapeSequence
:'\\'!
( 'n' {$setText("\n");}
| 'r' {$setText("\r");}
| 't' {$setText("\t");}
Maybe that worked in Antlr 2, but it doesn't work in Antlr 3. The nearest thing to an explanation I could find was a cryptic response to a mailing list question. It seems to imply there is a way to do it, but if so, I couldn't figure it out. (I hope my responses to Suneido questions aren't as cryptic!)

I found a few references to this being an area where Antlr is weak. The Antlr book (The Definitive ANTLR Reference) has very little material on lexical analysis.

I tried various combinations and permutations but I couldn't find any way to make it work. In the end I just handled it outside Antlr by writing a function that would convert the escape sequences in a string. So my grammar just has:
STRING
: '"' ( ESCAPE | ~('"'|'\\') )* '"'
| '\'' ( ESCAPE | ~('\''|'\\') )* '\''
;
fragment
ESCAPE : '\\' . ;
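Then a small function converts the escapes after the fact. Something like this minimal sketch (hypothetical code, not the actual jSuneido function - it skips octal escapes and assumes the lexer only passed through well-formed literals):

static String convertEscapes(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ++i) {
        char c = s.charAt(i);
        if (c != '\\') {
            sb.append(c);
            continue;
        }
        c = s.charAt(++i); // the ESCAPE rule guarantees a char follows
        switch (c) {
        case 'n': sb.append('\n'); break;
        case 'r': sb.append('\r'); break;
        case 't': sb.append('\t'); break;
        case 'x': // assumes exactly two hex digits follow
            sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
            i += 2;
            break;
        default:  sb.append(c); break; // \" \' \\ etc.
        }
    }
    return sb.toString();
}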
I guess I should have taken this approach from the start, but the examples seem to show handling more in the lexer. And like my last Antlr issue, it seems a little ugly to be able to recognize things in the grammar, but then have to reprocess them again yourself later.

jSuneido Milestone

I just fixed the last two failing stdlib tests. Now all the tests pass except, of course, for the ones that require the language (rules and triggers).

One of the bugs was a couple of lines of code I'd somehow omitted during porting.

The other was using 0x2f as a bit mask instead of 0x3f. Oops. The "funny" part was that this was in the code for handling minutes and seconds in date/times, so it caused intermittent failures depending on the exact time you ran the test. It got down to the point where I was getting 7 when I should have been getting 23. The difference is 16 - one bit. Aha! The C++ code was correct, so I could have found it by comparison, but for some reason my eyes didn't pick up on the one-character difference.
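
Written out in binary, the bug is obvious - 0x2f is missing the 16s bit:

int seconds = 23;                   // binary 010111
System.out.println(seconds & 0x3f); // 0x3f = 111111 -> 23, correct
System.out.println(seconds & 0x2f); // 0x2f = 101111 -> 7, bit 4 (value 16) lost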

I guess this is a pretty big milestone for this project. For some reason it doesn't feel like it. Too gradual, too much more to do.

I hadn't done any speed benchmarks up till now. For a quick comparison, it took 11 seconds to run the stdlib tests with jSuneido and 5 seconds with cSuneido. That's not a scientific or fair test, but I'm happy enough with the result. Considering I haven't done any optimization, or even worried about speed at all, it doesn't seem too bad. And this is just running jSuneido from within Eclipse. Presumably the server Java VM would be better since it does more optimization. And it's also crossing from Windows to OS X through Parallels, which may add some overhead.

I've always been ready to accept a speed penalty for the Java version, as long as it was more than made up for by multi-threading on multiple cpus/cores. e.g. if it's 25% slower (0.8 times the speed), but you can use 4 cores, you could still end up about 3.2 times faster overall. Of course, you don't achieve an N times speedup from N cpus because of locking etc.

It would have been nice if there was an in-house app that I could switch to jSuneido. But our main apps require rules and triggers. Maybe our version control server - I don't think it uses any rules or triggers. That's a little scary, though. It's a pretty critical component. Especially since we generate the automatic nightly updates for our clients directly from version control. Might be wise to do a bit more testing first :-) I'm sure my staff would agree!

Tuesday, January 20, 2009

jSuneido Hung up on Antlr

Views were one of the last things to implement in the straight database server part of jSuneido.

This meant extending the Antlr grammars for requests (to define views) and queries (to expand views). I thought defining views would be easy and expanding them would be tricky. It turned out to be the opposite.

The trick with defining views is that I didn't want to parse the view definition, especially since the view definition follows the query grammar, not the request grammar. I just wanted to slurp up the remainder of the input as the definition. This sounds simple, but I thrashed for 3 or 4 hours on it.

I had something like:
request : VIEW id '=' VIEWDEF ;

VIEWDEF : .* ;
But you can't do that. What makes it trickier is that although this looks like a single grammar, it actually produces separate lexer and parser classes. (Lower case rules are for parsing, upper case for lexing.) The lexer operates independently, so you can't define VIEWDEF as .* since that'll match anything/everything.

I tried predicates based on some example I found for implementing C macros:
request : VIEW id '=' {viewdef = true; } VIEWDEF ;

VIEWDEF : {viewdef}? .* ;
But this doesn't work either, probably for a variety of reasons, but one main one is that the parser and lexer are separate so the parser can't control the lexer (at least not easily).

So I decided to do the work all in the lexer:
request : VIEWDEF ;

VIEWDEF : VIEW ID '=' .* ;
But then I got "alternatives are unreachable". Eventually I found an explanation that .* is not greedy. That works well for something like '/*' .* '*/' but not so well for what I wanted. The suggested solution was to use ('\u0000'..'\uFFFE')* instead. Here's the end result:
request : VIEWDEF
{
String s = $VIEWDEF.text;
int i = s.indexOf('=');
iRequest.view(s.substring(0, i), s.substring(i + 1).trim());
} ;

VIEWDEF : VIEW WHITE+ NAME WHITE* '=' ANY EOF
{ setText($NAME.text + "=" + $ANY.text); };

fragment
ANY : ('\u0000'..'\uFFFE')+ ;
This seems a little ugly, but appears to work. It's roundabout to lexically split the view definition, concatenate it back together, and then split it apart again in the parser.

There could be a better way to do this but I couldn't find it. This will do for now.

I did find a bug in the standard library code while I was working on the views. The QueryCount function was trying to optimize the case where the query was just a table name. But something that looks like just a table name could actually be a view, in which case the optimization wouldn't work. I removed the optimization. It should really have been in the database anyway.

I thought this would fix the last of the test failures, but when I ran them all I found there were still a couple more failing. Hopefully one more day will clean them up.

As powerful as Antlr is, it's not the easiest system to work with. One of the next things I want to work on is compiling the Suneido language. I have a feeling I'll be putting in many more hours getting Antlr to do what I want. I could always go back to porting the hand-written C++ recursive descent parser, but I think Antlr is better if it's not too painful to get working.

Monday, January 19, 2009

Trace Based JIT Compiling

Mozilla recently announced their new TraceMonkey Javascript implementation using trace based JIT compiling.

It's pretty cool stuff. For more information you can download Andreas Gal's thesis, or read HotpathVM: An Effective JIT Compiler for Resource-constrained Devices or Incremental Dynamic Code Generation with Trace Trees.

I almost wish I was still working on Suneido's virtual machine :-)

TDM MinGW GCC 4

For some reason the MinGW project has been very slow to release a GCC 4 version. At first they said they were waiting for 4.1, but that arrived and still nothing. They eventually did release an experimental build, but there hasn't been much action on it since.

Checking the status recently, I searched for MinGW 4 and discovered the TDM versions of MinGW, including 4.3, and they seem to be active with it.

In a way this is a "fork" of the MinGW project - one of the benefits of open source. But it's not a fork in terms of going in a different direction, only in terms of releasing sooner. I really wonder why the main MinGW project doesn't want to release version 4. Maybe they have a good reason, but if so, they haven't communicated it very well. Obviously they could still maintain version 3 if they feel that's a better choice for some things.

I built Suneido with the latest TDM version. GCC 4 has a lot of improvements to optimization, but sadly it's still slower than Visual C++ (about 10% slower than VC9, which in turn is about 10% slower than VC7).

New Computer


As I mentioned in Double Speed, I ordered a new Acer Veriton L460-ED8400 for work. As you can see, it's certainly tiny (about the same size as the book included for comparison). I ordered from Frontier PC, who were very helpful figuring out what memory and hard drive I needed to go with it.

The only minor complaint about the Acer is that the connector between the optical drive and the motherboard is very flimsy. When we opened it up to put in the extra memory and the bigger hard drive a tiny fragment of plastic broke off the clip that holds the cable. We ended up having to replace it with a binder clip!

Otherwise, the Acer works great - it's small, fast, quiet. Vista seems to work well, including all the Aero effects.

I've gotten quite fond of the current Mac keyboard that I use at home. Compared to it, PC keyboards are incredibly clunky. They all seem to think the more bells and whistles the better. Personally, I want a minimum of bells and whistles. I wondered if a Mac keyboard would work on a Windows PC. I took my Mac keyboard to work and it functioned perfectly. So I went and bought another Mac keyboard. I use the wired version since I find the wireless one too small.


On the other hand, I'm not really fond of the Mac mouse. I'm not sure if it's the hardware or the driver, but right-clicking is painfully unreliable. It's almost as if they finally gave in to having a right click, but made it really crappy because they still didn't think you should be using it. It's especially painful running Parallels because Windows relies so much on right-clicking. So I bought a Logitech LX8 mouse. I liked the cordless receiver, especially since I could plug it into the front of the Acer. (It also would have been harder to use a Mac wireless mouse since they're Bluetooth, and most PCs, including the Acer, don't have Bluetooth built in. Macs do.)

I kept the same monitor I've had for a while - a 24" Samsung SyncMaster 245BW that I'm quite happy with. I haven't gone to dual monitors, preferring a single large monitor. Partly maybe because I prefer to focus on one thing at a time - the same reason I don't keep my email running all the time. If I was buying a new monitor I think I'd be tempted by the new Apple 24-inch LED Cinema Display.

I'd say it was the easiest switch-over I've had. Despite moving more and more to on-line apps, I still install and use quite a lot of software. I wish there was a way to migrate Windows apps to a new computer. On the other hand, it's also nice to start fresh with clean installs.

Here's the main stuff:

- Kaspersky (we have been using AVG but we're thinking of switching)
- Firefox
- Thunderbird (for our separate internal email system)
- Open Office
- Scite
- Winmerge
- Gmail Notifier (so email links go to gmail, but not running all the time)
- Snagit
- PDFCreator
- KeePass
- LogMeIn
- TortoiseSVN
- MinGW G++
- Visual C++ 2003 (vc7)
- Visual C++ 2008 express (vc9)
- 7Zip
- GnuWin32 - grep, findutils, diffutils, fileutils/coreutils
- Google Earth
- Google Picasa
- Canvas

Plus all my Firefox add-ons:

- Adblock Plus
- Delicious Bookmarks (classic mode)
- Distrust
- dragdropupload
- Firebug
- FireFTP
- Foxmarks
- Google Gears
- Google Notebook
- IE View
- S3 Firefox Organizer
- Web Developer
- Yslow

I used Foxmarks to sync my bookmarks and passwords (more for the passwords since most of my bookmarks are in Delicious). Too bad you can't sync the add-ons themselves.

Friday, January 16, 2009

jSuneido Progress

I had a good day working on jSuneido yesterday - got a bunch more tests passing. Several of the bugs fixed multiple tests which was nice. None of the bugs required a lot of effort to track down. A number of them were simply things I hadn't implemented yet, either deliberately or accidentally.

Only two tests are left failing (of the ones that don't involve rules or triggers) and both of them involve views, which I just haven't implemented yet. Views are handled by simple text substitution - replacing view names with their definitions. cSuneido did this "on the fly" during parsing but I'm not sure whether I'll be able to do that using Antlr. (No doubt it's theoretically possible, but I don't want to spend too much time becoming an Antlr expert!)
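
Conceptually the substitution is simple. Something like this sketch (hypothetical code with made-up names, not how cSuneido actually does it):

import java.util.Map;
import java.util.regex.Matcher;

// Expand views by textual substitution, repeating since views can
// refer to other views. Assumes view definitions aren't recursive.
// Note that a naive replace like this would also (wrongly) substitute
// inside string literals - part of why doing it during parsing is
// attractive.
static String expandViews(String query, Map<String, String> views) {
    boolean changed;
    do {
        changed = false;
        for (Map.Entry<String, String> e : views.entrySet()) {
            String expanded = query.replaceAll("\\b" + e.getKey() + "\\b",
                    Matcher.quoteReplacement("(" + e.getValue() + ")"));
            if (!expanded.equals(query)) {
                query = expanded;
                changed = true;
            }
        }
    } while (changed);
    return query;
}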

Another benefit is that as I fix the bugs, the IDE (Windows cSuneido client connected to jSuneido server) has, not surprisingly, gotten a lot more stable.

It's nice to see "the light at the end of the tunnel" i.e. the end of this stage. I'm looking forward to starting the next stage (probably compiling the language to Java byte code).

Wednesday, January 14, 2009

Stupid Programmer Tricks

In this case, the stupid programmer is me :-(

For the last week or so I've been working (off and on) on tracking down the difference in query optimization between jSuneido and cSuneido that I mentioned earlier.

Along the way I found and fixed a few minor bugs in cSuneido, which was good. One of them was caused by re-using an instance variable for two different purposes. Of course, this eventually caused a problem. I can't remember when I did this, but it seems incredibly short-sighted. It was probably just the "easiest" thing at the time. Sigh.

You'd think by now we would have run into any bugs in cSuneido. But when the only effect is a slight difference in optimization, it's not at all obvious.

To find the optimization difference between cSuneido and jSuneido I ended up having to insert debugging into both cSuneido and jSuneido and compare internal intermediate results until I found where the difference was coming from.

As usual, it turned out to be something stupid. When I was porting the code from C to Java I had made a "minor" improvement. I don't even remember doing it. But that minor change was enough to throw things off. It wouldn't have been so bad if I'd made the same change in the C code, but I didn't. When will I learn that I shouldn't change things during porting!

Oh well, one more down, a bunch more to go.

Tuesday, January 13, 2009

Parallels Clipboard Problem

I don't often copy and paste between Windows under Parallels and OS X. But it seems like every time I try, it doesn't work.

I've discovered one fix is to restart Windows. (Just suspending and resuming is not sufficient.)

I'm guessing the problem is that the Parallels Tools that run within Windows are getting messed up.

This is annoying because restarting Windows is not the speediest process.

This is on Parallels 4.0.3540 (November 22, 2008) running Vista.

Another Query Optimization

Database query optimization in Suneido is a collection of heuristics. Most of them have come from finding queries that weren't executed efficiently and then figuring out ways to improve them.

The latest example was a query like:
(table1 leftjoin table2) where table2_field = "stuff"
Joins are implemented by reading sequentially through the left hand table (or sub-query) and doing lookups on the right hand side. This means you can read through the left hand side in any order. The right hand side requires an index on the join fields (to do the lookups efficiently).

Normal joins are symmetrical, but the implementation isn't, so part of the optimization process is deciding whether to reverse the order of a join.

But leftjoins preserve all the records from the left hand side, even where there are no matching records on the right. This means leftjoins are not reversible, which limits optimization and sometimes leads to inefficient execution.
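
In rough pseudo-Java, the two strategies look something like this (hypothetical names - just a sketch to show the asymmetry):

// join - only matching pairs are output, so in principle
// either side could play the "lookup" role
for (Row left : leftSide)
    for (Row right : rightIndex.lookup(left.joinKey()))
        emit(merge(left, right));

// leftjoin - left rows with no match are kept (their right hand
// fields default to ""), so the two sides can't swap roles
for (Row left : leftSide) {
    boolean matched = false;
    for (Row right : rightIndex.lookup(left.joinKey())) {
        emit(merge(left, right));
        matched = true;
    }
    if (!matched)
        emit(merge(left, emptyRow()));
}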

The trick is that this case:
table1 leftjoin table2 where table2_field = "stuff"
is actually equivalent to:
table1 join table2 where table2_field = "stuff"
because table2_field can only be "stuff" if there are matching records.

Once we see that it's equivalent to a normal join, then it's reversible, and there are more opportunities for optimization.

You might wonder why the programmer wouldn't just see this and write the query with a join instead of a leftjoin. In this example, the leftjoin was in a view definition, with the where added later. So the programmer only saw:
myview where table2_field = "stuff"
Besides, programmers don't always catch these things. It's a lot better if the system handles it.

Suneido's query optimization has two main phases - transform, which moves and merges operations, and optimize, which chooses strategies and index use.

This new optimization will be in the transform phase. Suneido already tracks "fixed" fields that are restricted by where's to a constant value. So if we see a fixed (and non-empty) field on the right hand side of a leftjoin, we can convert to a regular join.
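
The check itself would amount to something like this sketch (again hypothetical names, not the actual code):

import java.util.List;
import java.util.Map;

// If some field on the right hand side is fixed to a single non-empty
// constant, unmatched left rows (whose right hand fields would all be
// "") could never satisfy it, so the leftjoin is equivalent to a join.
static boolean leftjoinIsReallyJoin(Map<String, List<String>> rightFixed) {
    for (List<String> values : rightFixed.values())
        if (values.size() == 1 && !values.contains(""))
            return true;
    return false;
}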

Note: Suneido already moves where's as close to the tables as possible, so even if the query starts out as:
(table1 leftjoin table2) where table2_field = "stuff"
it will be transformed to:
table1 leftjoin (table2 where table2_field = "stuff")
This makes it easier to "see" the fixed field on the right hand side.

This shouldn't be too hard to implement. Especially since I've been working on the join code, fixing a bug I found recently (while working on jSuneido).

[I'm not sure how much sense this will make if you're not familiar with this stuff. I wrote it for my own notes as much as anything.]

Friday, January 09, 2009

Misc. News

Google Books Settlement Agreement
This sounds like good news - better accessibility for out of print books, and the ability to "buy" books to read on-line.
iTunes 8 "drops" DRM
I prefer to buy music digitally rather than on physical CDs. But I've stayed away from DRM-protected music. Not because I want to pirate it (I'm trying to pay for it!), but because it's hard enough to switch players and computers and operating systems etc. without having to deal with DRM issues.

But I still ended up buying CDs because iTunes didn't have the iTunes Plus DRM-free version. Amazon has been selling DRM-free music in the US and had promised to make it available in Canada in 2008, but that didn't happen.

I'm still running into some music that's only available in DRM versions on iTunes, but according to Wikipedia they plan to be 100% DRM-free by April.
Picasa for Mac
I used to use Picasa on Windows to manage my photos. When I switched to using a Mac at home I really wished they had a Mac version. They've finally released it, but meanwhile I've got hooked on Lightroom so I probably won't use it much.

I still recommend Picasa to Windows users looking for something simple to manage their photos. It's easy to use and does a good job. And with version 3 it's got even more tools for tweaking your photos.

It's hard to say how it'll do against iPhoto on the Mac.
Bug Labs Announces Five New Modules
These guys have a cool product, and these new modules make it even better. I'm itching to get one to play with, but I know I don't really have time for it, so I've resisted so far.

Tuesday, January 06, 2009

FindBugs Finds Bugs

I installed the Eclipse plugin for FindBugs without any problems.

I ran it and got about 80 warnings/errors.

It found one more place where I needed .equals - I'd missed it because I didn't think to look for != as well as == (oops!).

It found a bunch of places where I had compareTo without equals or equals without hashCode.
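
That pairing matters because of the contract: if two objects are equal they must have the same hash code, or hash-based containers quietly misbehave. A generic example (not from jSuneido):

class Point {
    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point))
            return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    // without this, two equal Points can land in different HashMap or
    // HashSet buckets and look like distinct keys
    @Override
    public int hashCode() {
        return 31 * x + y;
    }
}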

I fixed about half. The remainder are either minor or can wait. Ideally I'd like to clean them all up so I don't have to keep looking for real problems among the ones I'm ignoring.

The minimal effort to install and run FindBugs was definitely worth it.

Then I moved on to a test failure where jSuneido is optimizing a query slightly differently than cSuneido. Both ways work and there isn't an obvious "better" way. I could just ignore it. But in theory they're running the same algorithm so they should be coming up with the same answer. In my experience these kinds of anomalies usually lead to actual problems. (Where there's smoke there's fire.)

Along the way I have found a few minor bugs, even a couple that also exist in cSuneido. But so far I can't figure out why the difference in behavior. It's frustrating but I'll get it eventually.

Java Equals Issues

I tracked down a bug in jSuneido to a string comparison that was failing because I was using == instead of .equals.

From The Java Programming Language:
The equality operators (==, !=) test for reference identity, not object equivalence. Two references are identical if they refer to the same object; two objects are equivalent if they logically have the same value.
and:
If you use == to compare string objects, you are actually comparing to see if they refer to the same object, not testing if the strings have the same contents.
I know this, but it's obviously not as ingrained into my thinking as it needs to be. When I see str1 == str2 I don't immediately see a problem. (It worked fine in C++.)

To complicate matters, it's ok to use == for characters, numbers, enums, and for comparing to null. It's even ok for strings if they're interned or if they came from literals (since they're interned).
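
A few lines show both sides of it:

String a = "hello";
String b = "hello";
String c = new String("hello");
System.out.println(a == b);      // true - both refer to the interned literal
System.out.println(a == c);      // false - same contents, different object
System.out.println(a.equals(c)); // true - compares contents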

After finding this bug, I reviewed every == in my code. I found a half dozen more that should have been .equals (and got sore eyes).

If I had been writing the code from scratch I probably would not have had as much of a problem with this. But porting the code from C++ was worse because in C++ you handle this by redefining operator==. (And I didn't know enough to review every ==.)

To add insult to injury, switching to .equals opens up the potential for another error - null values. If you do x.equals(y) and x happens to be null, then you'll get a run-time exception. (Which you wouldn't have gotten with x == y.) This is why you should write "string".equals(x) instead of x.equals("string") because "string" can't be null.

(Interesting that it's called a NullPointerException. I thought Java didn't have pointers :-) Another leaky abstraction.)

So potentially I've introduced new bugs in the process of fixing the ones caused by == instead of .equals.

It looks like FindBugs can catch this problem; I'll have to give it a try.

Maybe I'll learn to like it, but right now this part of the Java language seems pretty ugly.

Monday, January 05, 2009

Apture

I've been trying out Apture on my blogs.

I first noticed it on the O'Reilly Radar blog in a post about Wendell Berry. Notice the little book icons next to some of the links. If you hover your mouse over one of these links, the Wikipedia article pops up. I thought that was pretty cool, so I followed the "Get Apture for Your Website" link at the bottom of one of the pop ups.

It was easy enough to install - just add a widget to your Blogger layout.

It took me a while to "get" how it works. Their introductory video was good, and explained how to use it, but not how it works.

I thought you'd use it when you were writing a post, but you actually use it when you're just viewing. It must be storing the enhancements on Apture's servers and dynamically merging them into the page when you view it.

This is a bit like the services that let you add "comments" to third party web sites. The difference is that for those services, to see the comments, you had to be running some special software.

For Apture, it's the widget you add to your blog that merges the extra content. So viewers don't need any special software.

It seems a little like magic. And it's almost a little frightening since this gives Apture the ability to rewrite my blog entries. In theory they could add ads or security exploits or other nasty stuff. Not that it would make sense for them to do that, but the possibility exists. And I wonder if some third party couldn't take control of this "back door".

One big weakness is that this magic only works when you go to the blog web site. If you use a feed reader like Google Reader it probably won't work. On the positive side, a lot of traffic to blogs comes from search engines, which will take people to the web pages.

I can't see spending much time adding content with Apture. However, some of the things it does automatically, like adding pop ups for Wikipedia, don't require any extra work, and do enhance the experience if you do go to the web site.

For example, here's a link to the Suneido Wikipedia article. In a reader it's probably just going to be a regular link. But if you go to the web page for this post, you should see the little icon and get the pop up.

I may or may not keep using Apture, but it's an interesting technology.

Sunday, January 04, 2009

Suneido Design Flaws

These are things where, in retrospect, I think I made the wrong choices designing Suneido.

The challenge with changing any of these is to avoid breaking existing code, or at least to break it in ways that are obvious.

One approach is to support both the old and the new behavior during a "transition" period. The old way would be "deprecated", ideally in a way that can be automatically detected. Even if it can't be automatically detected, the programmer can still gradually switch code to the new way and then run tests to hopefully detect any problems.

Objects and Records Should Not Be Separate

I've talked about this before.

One of the reasons they're separate is that I didn't want the overhead of rules on every object. But the overhead is very small and this was likely premature optimization.

One problem with combining them is the record default value (see below).

Another problem is the global attachment of rules (see below). This could potentially "break" existing code.

Class Instances and Objects Should Not Be Combined

I think I did this mostly for ease of implementation. I may have also thought it reduced the number of concepts in the language.

However, class instances still have different behavior from regular objects so it didn't really simplify the language.

This means that class instances have all the same functionality as objects, like for-each and Map! and Sort!. But most of these don't apply, or at least shouldn't be applied, to class instances. If anything, this tempts programmers to "take advantage" of it, leading to unclear code.

On top of this, certain object methods are "basic" and can't be redefined/overridden. This was done both for performance (probably minor) and so that people reading the code would know the meaning of the basic methods without having to determine if they had been redefined.

One resulting drawback is that a class can't define a method with the same name as a basic method. e.g. a Stack class can't define a Size method. You can use a different name, e.g. Tally, but that means less consistent method names. It also means someone can mistakenly call e.g. Size instead of Tally and not realize the mistake, since they won't get an error.

Objects also support vector/array data, which isn't needed for classes.

Internally, class instances could still be objects, but it probably makes more sense to separate them since you wouldn't end up using much of the object code.

Changing class instances to not have all the same functionality as objects shouldn't break too much existing code.

Records Should Not Have a Default Value

record.foobar will have a value of "" even if foobar is not a member of the record, similar to an object on which you had done Set_default("").

One of the reasons for this is that database records may be physically "missing" fields. One way this can happen is if you create new fields on a table. Existing records are not updated, they are just missing these fields. Having a default value handles this.

It is also a small memory gain when you have many empty ("") fields because they don't need to be stored in the object.

The downside is that if you use the wrong field name or make a mistake in the field name, you don't get an error and the mistake can go unnoticed.

One way to keep the advantages and fix the downside would be to have records "know" their valid fields. For these fields, they would still return "", but invalid fields could throw an exception (like objects throw "member not found").
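
i.e. something like this sketch (hypothetical code - the real version would get the valid fields from the table schema):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class Record {
    private final Set<String> validFields; // from the table schema
    private final Map<String, Object> data = new HashMap<String, Object>();

    Record(Set<String> validFields) {
        this.validFields = validFields;
    }

    Object get(String field) {
        if (data.containsKey(field))
            return data.get(field);
        if (validFields.contains(field))
            return ""; // valid but physically missing - default to ""
        throw new RuntimeException("member not found: " + field);
    }
}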

This would break existing code that depends on this default value.

Rules Should Not be Global

When you create a rule, e.g. Rule_foobar, it will be used anywhere you reference record.foobar.

Like anything "global" this means that changes in one place, e.g. adding a rule, could break code in another place.

It also means that the only way to have a different rule applied, or to apply a rule only some of the time, is to use different field names. Which can be confusing, and can lead to extra complexity in other rules that have to deal with different field names at different times.

You should be able to attach/detach rules on a record.

For rules you want "all the time" you could attach them in the table schema so whenever you read a record from that table it would have the rule attached.

This would break existing code that creates a record (not from the database) and expects rules to apply.

Charlie Rose - An hour with Bill Gates

I just got around to watching Charlie Rose interview Bill Gates. It's an interesting show and although I'm not always a big Microsoft fan, I can't help but be impressed by what Bill has done in the past, what he's doing currently, and what he plans to do in the future.

I was led to this show from Dale Dougherty's post on O'Reilly Radar - Admiring Bill Gates.

Bill and I started software companies at roughly the same time and we're similar ages, so I feel a certain connection, although obviously Bill is in a whole different league.