Tuesday, February 10, 2009

jSuneido, Java, and Antlr

Back to working on jSuneido. I'm writing the Antlr lexer/parser for the Suneido language. In the process I'm discovering a few issues.

The return value from my language parser was wrapped in another class (constant_return). But that's not the case in my query parser - it returns the result value directly. Obviously the grammars are different but I can't see what would cause this difference. I dug around trying to find an answer but I gave up. If I wanted to spend the time on it I could start cutting down the grammars until I found what was doing it by a process of elimination. But it's easier to just extract the value out of the class. [Later: the problem went away when I added more rules.]

I wasn't handling explicit signs on numeric constants properly. I had an optional minus sign on the lexical rule for numbers, but that only worked if there was appropriate whitespace. "a - 5" was ok, but "a-5" got interpreted as "a -5" (no operator) and gave an error.

The fix led to ambiguities between whether "-5" was a numeric constant or an expression with a unary minus operator. I wanted it to be a numeric constant so I had to change my expression rules a bit.

I notice I don't seem to allow a plus sign on numeric constants in cSuneido. It's redundant, but still seems to be an oversight.

I wasn't handling errors in the lexers. I'd put in the recommended code to abort parsing on the first error, but I didn't realize I had to do something similar for the lexer. Eventually I found How can I make the lexer exit upon first lexical error? which seems overly twisted, but does the trick. I needed this in the query grammar as well.

SuDecimal.toString wasn't behaving like cSuneido. e.g. 1000 would come out as 1e3. BigDecimal.toPlainString works better in this case, giving "1000", but it doesn't switch to exponential notation for big exponents like cSuneido does. cSuneido switches to exponential notation for exponents bigger than 19. It's probably not critical to match this exactly but it should still switch to exponential notation for big exponents.

The parsers were accepting "extra" stuff on the end. You have to explicitly match EOF. You would think that would be the default. Or at least mentioned prominently. Yet EOF isn't even in the Antlr book index, and it's not in the examples.

One of the big issues with converting Suneido's language parser to Antlr is optional semicolons. I figure I'd better tackle this early in case it's a show stopper. Or if not a show stopper, something that will require a specific approach to writing the grammar.

Thankfully JavaScript (ECMAScript) also has optional semicolons so I could look at examples for it.

You can use Antlr's automatic backtracking, as in one example. But this example works by treating newlines as significant and optionally matching them everywhere applicable - a lot of places!

Backtracking is not used in another example. Instead it uses some tricky code to "promote" applicable newlines to be significant.

The first approach seems simpler, albeit more verbose.

One ambiguity is the return statement. "return 123" can be either "return; 123" or "return 123;". Strangely, Antlr doesn't complain about this issue. It takes "return \n 123" as a single statement, whereas cSuneido would take it as "return ; 123".

Of course, Suneido's handling of optional semicolons is not exactly the same as JavaScript. That would be too easy :-) I'm not even sure cSuneido is totally "correct". For example, cSuneido takes "5 \n -4" as "(5 - 4)" because it looks ahead to see if following lines start with a binary operator. But to me it seems like it "looks" more like "5; -4"

In certain contexts, newlines are not significant. For example, if you're inside a parenthesized expression. And "f() g()" should be legal i.e. two statements. But "(f() g()) should be illegal. This seems to be handled by Antlr with backtracking.

I've always felt a little guilty that I wrote some of the tests for cSuneido in stdlib rather than in C++ code. But it was a lot easier to write certain tests that way. Now that I'm writing jSuneido I find there is an advantage to that approach. To keep the C++ tests I have to rewrite them in Java. But the ones that are in stdlib work as is. It's making me think that it would be nice to have a "complete" set of tests for the executable in stdlib, much like Rubinius initiated an Rspec test suite that "defines" Ruby.

Of course, the reason I'm thinking of this now is that I don't have a clear definition of how optional semicolons are supposed to work in Suneido, other than the cSuneido implementation. A set of tests would be nice! The downside of tests in stdlib is that Suneido has to be fairly complete to run them. And my language implementation is barely started right now.

I had to temporarily kluge some string results from my rules in order to test the parsing, but with not too much messing around it seems to be working. There are some subtle parts: "return expr?" is different than "return | return expr". The first seems to be what I want in order to interpret "return expr" as one statement instead of two.

So far so good. It looks like it may not be too hard to get Antlr to do what I want.

No comments: