Monday, February 16, 2009

Computer Frustrations

Sometimes computers really piss me off.

I thought I was finally figuring out Antlr. Then my grammar worked in AntlrWorks, but not in Eclipse. It turns out AntlrWorks is using Antlr 3.1 but Eclipse was using Antlr 3.0. In the Eclipse preferences I see I can choose 3.1, but when I do, Antlr quits working. Ok, I guess I need to upgrade. But I expected this would be a problem because I haven't been able to update in Eclipse for quite a while. I try to update anyway, but it just tells me I can't, for some cryptic reason that doesn't help me at all.

This roadblock makes me think about whether I should be using Antlr. I've already spent quite a few days struggling with it and progress is very slow. I'm pretty sure I could have ported the C++ code to Java by now. One of the big reasons to use a tool like Antlr is to work at a higher level that's easier to modify. But the more I add to my grammar, the more brittle it seems to get. Small changes tend to break the whole thing with a cascade of vague messages.

In a way, it's understandable. I'm trying to port a grammar from a hand written recursive descent parser. In a hand written parser you can do all kinds of things that aren't possible in a more structured system like Antlr. And things like optional semicolons are not the easiest to handle with tools like Antlr.

The ironic part is that Antlr works great to do simple stuff (like Suneido's query language) but for more complex stuff it becomes increasingly difficult, at least if you're not an expert with it. Of course, that makes it almost useless - complex stuff is where you need it. If you were developing a little language from scratch it would be fine because you could just tailor the language to what Antlr handled easily.

I hate to give up, especially on an interesting challenge like this, but I have to remind myself that the goal is to port Suneido, not to spend endless hours struggling to learn and use some tool. And even if I got it to work, I'm then stuck with a dependency on Antlr. Which is already causing me problems and will just make it harder for anyone else who might want to build or modify jSuneido. And heaven forbid if they want to change the grammar - if they touch it, they're more than likely to break it, with no idea how to fix it.

To make a long story short, I'm giving up on Antlr and I'm going to have to start over with a fresh install of Eclipse in order to be able to update.

But that's just the beginning of my current computer frustrations. My irritation level is already quite high because of way too much obligatory family stuff this weekend. But before I started back on jSuneido I figured I'd better fulfil one last duty and download the photos Shelley took during the family stuff. Easy, just fire up Lightroom, import the photos, and copy them over to the file server so Shelley can look at them.

Wrong. Lightroom hangs up part way through the import. And it hangs up not only Lightroom but the whole of OS X. I end up having to force the power off and reboot. I thought it was because I was messing around while I was waiting for the import to finish. So I try again, this time not touching anything while it's working. It gets a little farther, but again it locks up the whole computer. WTF! I force the power off again and reboot. But this time it corrupted the Lightroom catalog and it can't repair it. Grrrrrr.

I have backups, but they don't include my last session in Lightroom because Lightroom backs up at the beginning of a session. That seems totally backwards to me. First, I want to backup what I do in a session. I may go weeks before I use it again, during which time my last session hasn't been backed up. Second, checking the catalog and backing up takes quite a while. Having to wait for that when I go to do something with Lightroom is really annoying, which means half the time I skip it, meaning my backups are liable to be even more out of date. I'd much rather it did the backup when I quit, since it wouldn't be holding me up and I wouldn't be so tempted to skip it.

And there's yet another frustration in this. My first instinct was to recover the file with Time Machine, since it's saved me a few times. I don't use it very often so I thought I was doing something wrong when I couldn't get it to work. Then I realized the Lightroom folder is inside the Pictures folder, and I don't back up the Pictures folder since it's so big. And I keep a mirror of it on the file server anyway so Shelley can access it from her Windows machine.

There also used to be problems with Time Machine having trouble backing up the Lightroom catalog when it was open, but I think/hope that's fixed. (???)

I start a Time Machine backup to get the Lightroom folder backed up. It takes forever to "prepare the backup" and then it says it's backing up 16 GB. WTF again. There's no friggin way I have added or modified 16 GB of stuff since the last Time Machine backup. Then I realize that the Lightroom folder also contains the preview cache, which I see has grown to 13 GB! Ok, stop the backup (which also takes forever, for some reason), exclude the previews, restart the backup, wait for it to "prepare" again ...

So I move my Lightroom folder out of the Pictures folder so it will get backed up. I start up Lightroom, point it at the new catalog location, and ... it crashes. Now I'm getting really pissed off. I mess around with various ways of starting Lightroom, with various catalogs and eventually I get it running.

Meanwhile, half my morning is gone and I haven't accomplished a damn thing other than to raise my blood pressure. Sometimes you have to wonder about these "time saving" machines.

I still need to import the photos but I'm a little gun-shy - I don't really want to crash my whole computer again. I was trying to import directly from the memory cards (the way I always do). Maybe something in the low level memory card reader drivers was the problem, so I copy the files onto the hard drive and try importing from there. Success!

Then I notice that a bunch of the file names have "-1" on the end. Argh! That's because I already imported a bunch of them. They weren't in the Lightroom catalog because I restored to an old version, but the actual files were still there. So now I have a bunch of duplicate files. (Lightroom normally prevents this, but only if they're in the catalog.)

Lightroom has a function to synchronize the catalog with a folder, so I use that. And ... it hangs. This is not my day. Lightroom normally runs very smoothly. Of course, I'm off the "happy path", and I'm thrashing around, which is always dangerous.

Maybe the Time Machine backup I've got running is interfering with Lightroom. It still hasn't finished "preparing" but at least it stops fairly promptly when I tell it to.

I go to force quit Lightroom and a dialog shows up. ARGH! again. It wasn't hung up, it was just asking what I wanted to do with the file differences. Being reminded of my own stupidity doesn't help my mood. Besides, I need to synch after I clean up the duplicate files, not before.

I use a shell to rm *-1*. There's no easy way to do that from the Finder or from Lightroom. If you didn't know how to use the shell you'd be stuck with dragging files to the trash can one by one. GUIs aren't always so great.

I synch again, this time waiting for the dialog to pop up. Finally, I have the photos imported.

I start up ChronoSync and update the mirror of my photos on the file server. (I don't use its scheduler to do this automatically because it always seemed to want to run at the worst possible times and most of the runs did nothing because I hadn't imported anything new.)

Time for coffee. Then maybe I can actually do something productive. If you can call it "productive" to abandon multiple days of work with a sophisticated tool and go back to implementing by hand.

Thursday, February 12, 2009

jSuneido Antlr Parsing

More Antlr grammar fun.

I'm still struggling with optional semicolons. I thought I had it figured out but I had overlooked some cases. You can "view" semicolons three ways:

separators, as in "x ; y ; z"

terminators, as in "x; y; z;"

empty statements, as in "while (x) ;"

I had them working as separators, but that didn't handle "x;".

That was easy enough to fix, but then it didn't handle "while (x) ;". And it treated "x;" as two statements, "x" followed by an empty statement, which isn't what I want. (Although it might not matter for code generation.)

Taking the terminator approach, I want statements to end with either a newline or a semicolon, unless they are followed by a "}" or an else, or several other cases. Yuck.

Also, reading the Antlr book more, it seems that just turning on backtracking globally may not be the best approach, since you no longer get any warnings and it can potentially make the parser very slow. The alternatives are to rewrite the grammar to avoid the ambiguities, turn on backtracking selectively on certain rules, or add your own syntactic predicates (instead of letting backtracking add them automatically).

To further confuse things, the book recommends ignoring some warnings (such as the classic if-then-else ambiguity) rather than fixing them with syntactic predicates since the default resolution is the correct one and the fix is less efficient. But that goes against the normal recommendation to eliminate all warnings, otherwise you may miss valid warnings mixed in with the ones you're supposed to be ignoring. The other problem is that I'm not sure I know how to figure out whether the Antlr default is what I want.

I removed the backtrack option and got rid of all but one ambiguity by tweaking the grammar. It was complaining that conditional expressions were ambiguous. At first I thought it was the if-then-else problem, but because the ":" part isn't optional like "else", that's not the problem. I found if I removed my assignment rule it solved the problem. Aha, the ambiguity is probably between "(x = y) ? ..." and "x = (y ? ...". You wouldn't have this problem if assignments were statements rather than expressions.

But it still seems a little odd, because that seems like a precedence issue, which in top-down grammars is handled by the order of nesting of rules. Wait a minute, assignment has the lowest precedence, so it should be first. But I had it last because that's how cSuneido had it. Now my grammar compiles cleanly with no warnings.

[I hadn't used the AntlrWorks IDE for a while, but I started using it again to experiment.]

It appears I have an approach that will handle the optional semicolon issue. [source] Before I complete the grammar, there are a few other questions I'd like to answer.

cSuneido emits byte code directly during the parse; it doesn't build a tree first. I think I should be able to do the same thing with the Antlr parser. The advantage of this approach is performance. The disadvantage is that you don't have a tree to use for other purposes. It's also hard to test since the only output is byte code which isn't human readable. You have to translate it to a form of assembly language to get something readable.

I'd like to decouple the parser from the code generator in a way that would allow me to attach a tree builder instead of the code generator. That will keep the performance advantages of direct code generation but still let me get a tree. For testing I could write something that will just produce strings.

This sounds simple, but it's somewhat complicated because generating code requires actions at intermediate points within the rules. e.g. "if" requires outputting a conditional branch between the expression and the statement. On the other hand, building a tree requires returning partial tree results from rules.
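To make that concrete, here's a rough sketch of the kind of Builder interface I have in mind (the method names are just hypothetical placeholders, not a final design):

public interface Builder {
	// intermediate hooks, called part way through a rule,
	// e.g. so code generation can emit a conditional branch
	// between the "if" expression and the statement
	void afterIfExpr(Object expr);
	void afterThenStmt(Object stmt);

	// end-of-rule methods that return a partial result
	// (a tree node, a string, or just a marker when emitting byte code)
	Object ifStatement(Object expr, Object thenStmt, Object elseStmt);
	Object binaryOp(String op, Object left, Object right);
	Object returnStatement(Object expr); // expr is null for a bare "return"
}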

Antlr has a built in system for building abstract syntax trees. It would be great if I could use this since it would simplify the code a lot. But when I'm generating byte code I don't want to actually generate a tree. I wonder if I could write my own Tree class that would generate byte code but not actually build a tree. I'd then have to augment this with the extra intermediate actions. Unfortunately, the Tree interface is somewhat complicated. The BaseTree that is described as having no "payload" still keeps a list of its children, which I want to avoid when generating byte code.

Despite the extra work, I think it may be simpler to roll my own rather than try to make the tree stuff work in ways it wasn't intended to.

I modified Language.g to call a Builder interface and then implemented a StringBuilder used for ParseTest. So far so good. I still have to add the intermediate actions but that shouldn't be too hard. Then I can start to implement a Builder that does code generation.
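The test implementation just builds up strings, roughly along these lines (a sketch with made-up names, not the actual ParseTest code):

public class TestStringBuilder implements Builder {
	// nothing to do at the intermediate points when only building strings
	public void afterIfExpr(Object expr) { }
	public void afterThenStmt(Object stmt) { }

	public Object ifStatement(Object expr, Object thenStmt, Object elseStmt) {
		return "If(" + expr + " " + thenStmt +
			(elseStmt == null ? "" : " " + elseStmt) + ")";
	}
	public Object binaryOp(String op, Object left, Object right) {
		return "(" + left + " " + op + " " + right + ")";
	}
	public Object returnStatement(Object expr) {
		return expr == null ? "Return" : "Return(" + expr + ")";
	}
}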

ARGH! Rereading what I had written, I realize I didn't solve the empty statement issue. For example, it can't parse "if (x) ;". I guess that's what I'll have to work on next!

Tuesday, February 10, 2009

jSuneido, Java, and Antlr

Back to working on jSuneido. I'm writing the Antlr lexer/parser for the Suneido language. In the process I'm discovering a few issues.

The return value from my language parser was wrapped in another class (constant_return). But that's not the case in my query parser - it returns the result value directly. Obviously the grammars are different but I can't see what would cause this difference. I dug around trying to find an answer but I gave up. If I wanted to spend the time on it I could start cutting down the grammars until I found what was doing it by a process of elimination. But it's easier to just extract the value out of the class. [Later: the problem went away when I added more rules.]

I wasn't handling explicit signs on numeric constants properly. I had an optional minus sign on the lexical rule for numbers, but that only worked if there was appropriate whitespace. "a - 5" was ok, but "a-5" got interpreted as "a -5" (no operator) and gave an error.

The fix led to ambiguities between whether "-5" was a numeric constant or an expression with a unary minus operator. I wanted it to be a numeric constant so I had to change my expression rules a bit.

I notice I don't seem to allow a plus sign on numeric constants in cSuneido. A plus sign is redundant, but not allowing it still seems like an oversight.

I wasn't handling errors in the lexers. I'd put in the recommended code to abort parsing on the first error, but I didn't realize I had to do something similar for the lexer. Eventually I found "How can I make the lexer exit upon first lexical error?" which seems overly twisted, but does the trick. I needed this in the query grammar as well.

SuDecimal.toString wasn't behaving like cSuneido. e.g. 1000 would come out as 1e3. BigDecimal.toPlainString works better in this case, giving "1000", but it doesn't switch to exponential notation for big exponents like cSuneido does. cSuneido switches to exponential notation for exponents bigger than 19. It's probably not critical to match this exactly but it should still switch to exponential notation for big exponents.
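To show what I mean, here's a sketch (the format method and the cutoff of 19 are my assumptions based on cSuneido's behaviour, not actual jSuneido code):

import java.math.BigDecimal;

public class NumFormat {
	// plain notation for small exponents, scientific for big ones,
	// roughly mimicking cSuneido's cutoff of 19
	static String format(BigDecimal n) {
		int exp = n.precision() - n.scale() - 1; // adjusted exponent
		if (Math.abs(exp) > 19)
			return n.toString(); // e.g. 1E+25 stays in exponential notation
		return n.toPlainString(); // e.g. 1000 comes out as "1000", not "1E+3"
	}

	public static void main(String[] args) {
		System.out.println(new BigDecimal("1E+3").toString());      // 1E+3
		System.out.println(new BigDecimal("1E+3").toPlainString()); // 1000
		System.out.println(format(new BigDecimal("1E+3")));         // 1000
		System.out.println(format(new BigDecimal("1E+25")));        // 1E+25
	}
}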

The parsers were accepting "extra" stuff on the end. You have to explicitly match EOF. You would think that would be the default. Or at least mentioned prominently. Yet EOF isn't even in the Antlr book index, and it's not in the examples.

One of the big issues with converting Suneido's language parser to Antlr is optional semicolons. I figure I'd better tackle this early in case it's a show stopper. Or if not a show stopper, something that will require a specific approach to writing the grammar.

Thankfully JavaScript (ECMAScript) also has optional semicolons so I could look at examples for it.

You can use Antlr's automatic backtracking, as in one example. But this example works by treating newlines as significant and optionally matching them everywhere applicable - a lot of places!

Backtracking is not used in another example. Instead it uses some tricky code to "promote" applicable newlines to be significant.

The first approach seems simpler, albeit more verbose.

One ambiguity is the return statement. "return 123" can be either "return; 123" or "return 123;". Strangely, Antlr doesn't complain about this issue. It takes "return \n 123" as a single statement, whereas cSuneido would take it as "return ; 123".

Of course, Suneido's handling of optional semicolons is not exactly the same as JavaScript's. That would be too easy :-) I'm not even sure cSuneido is totally "correct". For example, cSuneido takes "5 \n -4" as "(5 - 4)" because it looks ahead to see if following lines start with a binary operator. But to me it seems like it "looks" more like "5; -4".

In certain contexts, newlines are not significant. For example, if you're inside a parenthesized expression. And "f() g()" should be legal i.e. two statements. But "(f() g())" should be illegal. This seems to be handled by Antlr with backtracking.

I've always felt a little guilty that I wrote some of the tests for cSuneido in stdlib rather than in C++ code. But it was a lot easier to write certain tests that way. Now that I'm writing jSuneido I find there is an advantage to that approach. To keep the C++ tests I have to rewrite them in Java. But the ones that are in stdlib work as is. It's making me think that it would be nice to have a "complete" set of tests for the executable in stdlib, much like Rubinius initiated an RSpec test suite that "defines" Ruby.

Of course, the reason I'm thinking of this now is that I don't have a clear definition of how optional semicolons are supposed to work in Suneido, other than the cSuneido implementation. A set of tests would be nice! The downside of tests in stdlib is that Suneido has to be fairly complete to run them. And my language implementation is barely started right now.

I had to temporarily kluge some string results from my rules in order to test the parsing, but with not too much messing around it seems to be working. There are some subtle parts: "return expr?" is different than "return | return expr". The first seems to be what I want in order to interpret "return expr" as one statement instead of two.

So far so good. It looks like it may not be too hard to get Antlr to do what I want.

Thursday, January 22, 2009

More Antlr Fun

When I started on the Antlr grammar for the Suneido language I realized I hadn't handled escape sequences in string literals in queries.

I thought this would be easy, but I thrashed around for quite a while on it.

There are lots of examples like:
STRING_LITERAL
: '"' (EscapeSequence | ~('"'))* '"'
;
fragment EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'v'|'\"'|'\''|'\\')
| '\\' 'x' HexDigit+
| OctalEscape
;
fragment OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
Looks straightforward. But ... there aren't many examples that show how to convert those escape sequences to the proper characters. There are a few like:
EscapeSequence
:'\\'!
( 'n' {$setText("\n");}
| 'r' {$setText("\r");}
| 't' {$setText("\t");}
Maybe that worked in Antlr 2, but it doesn't work in Antlr 3. The nearest thing to an explanation I could find was a cryptic response to a mailing list question. It seems to imply there is a way to do it, but if so, I couldn't figure it out. (I hope my responses to Suneido questions aren't as cryptic!)

I found a few references to this being an area where Antlr is weak. The book has very little material on lexical analysis.

I tried various combinations and permutations but I couldn't find any way to make it work. In the end I just handled it outside Antlr by writing a function that would convert the escape sequences in a string. So my grammar just has:
STRING
: '"' ( ESCAPE | ~('"'|'\\') )* '"'
| '\'' ( ESCAPE | ~('\''|'\\') )* '\''
;
fragment
ESCAPE : '\\' . ;
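The conversion function itself is nothing fancy, roughly something like this (a simplified sketch, not the actual code; it skips octal escapes and error handling):

static String unescape(String s) {
	StringBuilder sb = new StringBuilder(s.length());
	for (int i = 0; i < s.length(); ++i) {
		char c = s.charAt(i);
		if (c != '\\' || i + 1 >= s.length()) {
			sb.append(c);
			continue;
		}
		c = s.charAt(++i);
		switch (c) {
		case 'n': sb.append('\n'); break;
		case 'r': sb.append('\r'); break;
		case 't': sb.append('\t'); break;
		case 'x': // assumes two hex digits follow, e.g. \x41
			sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
			i += 2;
			break;
		default: sb.append(c); // \" \' \\ and anything else: the character itself
		}
	}
	return sb.toString();
}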
I guess I should have taken this approach from the start, but the examples seem to show handling more in the lexer. And like my last Antlr issue, it seems a little ugly to be able to recognize things in the grammar, but then have to reprocess them again yourself later.

Tuesday, January 20, 2009

jSuneido Hung up on Antlr

Views were one of the last things to implement in the straight database server part of jSuneido.

This meant extending the Antlr grammars for requests (to define views) and queries (to expand views). I thought defining views would be easy and expanding them would be tricky. It turned out to be the opposite.

The trick with defining views is that I didn't want to parse the view definition, especially since the view definition follows the query grammar, not the request grammar. I just wanted to slurp up the remainder of the input as the definition. This sounds simple, but I thrashed for 3 or 4 hours on it.

I had something like:
request : VIEW id '=' VIEWDEF ;

VIEWDEF : .* ;
But you can't do that. What makes it trickier is that although this looks like a single grammar it actually produces separate lexer and parser classes. (lower case rules are for parsing, upper case for lexing.) The lexer operates independently so you can't define VIEWDEF as .* since that'll match anything/everything.

I tried predicates based on some example I found for implementing C macros:
request : VIEW id '=' {viewdef = true; } VIEWDEF ;

VIEWDEF : {viewdef}? .* ;
But this doesn't work either, probably for a variety of reasons, but one main one is that the parser and lexer are separate so the parser can't control the lexer (at least not easily).

So I decided to do the work all in the lexer:
request : VIEWDEF ;

VIEWDEF : VIEW ID '=' .* ;
But then I got "alternatives are unreachable". Eventually I found an explanation that .* is not greedy. That works well for something like '/*' .* '*/' but not so well for what I wanted. The suggested solution was to use ('\u0000'..'\uFFFE')* instead. Here's the end result:
request : VIEWDEF
{
String s = $VIEWDEF.text;
int i = s.indexOf('=');
iRequest.view(s.substring(0, i), s.substring(i + 1).trim());
}
;

VIEWDEF : VIEW WHITE+ NAME WHITE* '=' ANY EOF
{ setText($NAME.text + "=" + $ANY.text); };

fragment
ANY : ('\u0000'..'\uFFFE')+ ;
This seems a little ugly, but appears to work. It's a little roundabout to lexically split the view definition, then concatenate it back together, only to split it apart in the parser.

There could be a better way to do this but I couldn't find it. This will do for now.

I did find a bug in the standard library code while I was working on the views. The QueryCount function was trying to optimize the case where the query was just a table name. But something that looks like just a table name could actually be a view, in which case the optimization wouldn't work. I removed the optimization. It should really have been in the database anyway.

I thought this would fix the last of the test failures, but when I ran them all I found there were still a couple more failing. Hopefully one more day will clean them up.

As powerful as Antlr is, it's not the easiest system to work with. One of the next things I want to work on is compiling the Suneido language. I have a feeling I'll be putting in many more hours getting Antlr to do what I want. I could always go back to porting the hand written C++ recursive descent parser, but I think Antlr is better if it's not too painful to get working.