Thursday, February 12, 2009

jSuneido Antlr Parsing

More Antlr grammar fun.

I'm still struggling with optional semicolons. I thought I had it figured out but I had overlooked some cases. You can "view" semicolons three ways:

separators, as in "x ; y ; z"

terminators, as in "x; y; z;"

empty statements, as in "while (x) ;"

I had them working as separators, but that didn't handle "x;".

That was easy enough to fix, but then it didn't handle "while (x) ;" And it treated "x;" as two statements "x" followed by an empty statement, which isn't what I want. (Although it might not matter for code generation.)

Taking the terminator approach, I want statements to end with either a newline or a semicolon, unless they are followed by a "}" or an else, or several other cases. Yuck.

Also, reading the Antlr book more, it seems that just turning on backtracking globally may not be the best approach since you no longer get any warnings and it can potentially make the parser very slow. The alternative is to rewrite the grammar to avoid the ambiguities, turn on backtracking selectively on certain rules, or add your own syntactic predicates (instead of letting backtracking add them automatically).

To further confuse things, the book recommends ignoring some warnings (such as the classic if-then-else ambiguity) rather than fixing them with syntactic predicates since the default resolution is the correct one and the fix is less efficient. But that goes against the normal recommendation to eliminate all warnings, otherwise you may miss valid warnings mixed in with the ones you're supposed to be ignoring. The other problem is that I'm not sure I know how to figure out whether the Antlr default is what I want.

I removed the backtrack option and got rid of all but one ambiguity by tweaking the grammar. It was complaining that conditional expressions were ambiguous. At first I thought it was the if-then-else problem, but because the ":" part isn't optional like "else", that's not the problem. I found if I removed my assignment rule it solved the problem. Aha, the ambiguity is probably between "(x = y) ? ..." and "x = (y ? ...". You wouldn't have this problem if assignments were statements rather than expressions.

But it still seems a little odd, because that seems like a precedence issue, which in top-down grammars is handled by the order of nesting of rules. Wait a minute, assignment has the lowest precedence, so it should be first. But I had it last because that's how cSuneido had it. Now my grammar compiles cleanly with no warnings.

[I hadn't used the AntlrWorks IDE for a while, but I started using it again to experiment.]

It appears I have an approach that will handle the optional semicolon issue. [source] Before I complete the grammar, there are a few other questions I'd like to answer.

cSuneido emits byte code directly during the parse; it doesn't build a tree first. I think I should be able to do the same thing with the Antlr parser. The advantage of this approach is performance. The disadvantage is that you don't have a tree to use for other purposes. It's also hard to test since the only output is byte code which isn't human readable. You have to translate it to a form of assembly language to get something readable.

I'd like to decouple the parser from the code generator in a way that would allow me to attach a tree builder instead of the code generator. That will keep the performance advantages of direct code generation but still let me get a tree. For testing I could write something that will just produce strings.

This sounds simple, but it's somewhat complicated because generating code requires actions at intermediate points within the rules. e.g. "if" requires outputting a conditional branch between the expression and the statement. On the other hand, building a tree requires returning partial tree results from rules.

Antlr has a built in system for building abstract syntax trees. It would be great if I could use this since it would simplify the code a lot. But when I'm generating byte code I don't want to actually generate a tree. I wonder if I could write my own Tree class that would generate byte code but not actually build a tree. I'd then have to augment this with the extra intermediate actions. Unfortunately, the Tree interface is somewhat complicated. The BaseTree that is described as having no "payload" still keeps a list of its children, which I want to avoid when generating byte code.

Despite the extra work, I think it may be simpler to roll my own rather than try to make the tree stuff work in ways it wasn't intended to.

I modified Language.g to call a Builder interface and then implemented a StringBuilder used for ParseTest. So far so good. I still have to add the intermediate actions but then shouldn't be too hard. Then I can start to implement a Builder that does code generation.

ARGH! Rereading what I had written, I realize I didn't solve the empty statement issue. For example, it can't parse "if (x) ;" I guess that's what I'll have to work on next!

No comments: