Wednesday, June 18, 2008

Antlr and Eclipse

I just finished reading The Definitive Antlr Reference published by Pragmatic Programmers. Antlr is a parser (& lexer) generator. It's written in Java but it supports other target languages like C++, Ruby, and Python. It looks like a pretty cool system.

I have an interest in this area that goes a long way back. In the early 80's I reverse engineered Yacc to create my own look-alike. (Since there was nothing available on MS-DOS.) I didn't have the source code so it was an interesting exercise to feed different inputs to a "black box" and see what the outputs were. I only had a crude understanding of how Yacc worked but I was successful enough that I was able to use my program to generate parsers for several applications. (Later, I switched to the version of Yacc that came with the MKS Toolkit.)

Although I'd seen some mentions of Antlr, I probably wouldn't have got very interested if the book hadn't come out. Although there's information on the web, I still find a good book the best way to learn something new. And a book adds a kind of "legitimacy" to things.

When I wrote Suneido I chose to hand write a recursive descent parser. Coincidentally, unlike Yacc (and the similar GNU Bison), Antlr generates ... recursive descent parsers.

I wondered if I could use Antlr to replace Suneido's hand written parser. I installed the AntlrWorks gui tool and quite quickly managed to create a grammar for Suneido's database language.

Next, I wanted to try to incorporate it into my Eclipse project. I found a plugin and installed it and added my grammar to the project ... and got a bunch of weird errors. Argh!

After a bunch of messing around I realized that the Eclipse plugin I was using was for the old version of Antlr. The next problem was how to uninstall the plugin. This isn't as obvious as you'd think. It's under Help > Software Updates > Manage Configuration. That gives you options for updating, disabling, and showing the properties, but not uninstall. After some further poking around I discovered Uninstall is on the right click context menu on the tree. I'm not sure why it's not listed under Available Tasks. Maybe because it doesn't work very well. I chose to Uninstall the Antlr plugin. But I couldn't get rid of the grammar errors. And when I looked, there were still Antlr related files hanging around. I cleaned up manually as best I could.

I found a new plugin and tried to install it. It had dependencies on another package - the Eclipse Dynamic Language Toolkit. I couldn't find an update site for the specific version it needed so I had to install manually. Then there was a dependency on the Graphical Editing Framework (GEF) so I had to install that. Then I could finally install the new Antlr plugin. And finally get my grammar to compile.

Even better, if I want this on my other computers, I have to go through the same process to install all the dependencies. Yuck.

All in all, it was a pretty frustrating process and it soured me a bit on Antlr, although it's really nothing to do with Antlr itself. I can blame some of it on Eclipse, but it's more just a result of a rich ecosystem of third party plugins. I guess you have to expect some problems.

The real test would be to implement the main Suneido language grammar in Antlr. I think this could be a little tricky because Suneido has optional semicolons, which make some newlines significant (but the rest of the time they're ignored). It was tricky enough kludging this in the hand written parser. The Antlr book does give an example of how to parse Python-like languages where indenting whitespace is significant, although that's not quite the same problem. There's also a simplified Ruby grammar on the Antlr website that might help. (Since Ruby has similar optional semicolons.)

I'm still not convinced that using Antlr is the way to go. Since it generates recursive descent grammars it's much easier to convert the grammar from the existing parser (than it would be to use Yacc or Bison). But it also wouldn't be hard to port the existing parser directly to Java. An explicit grammar is certainly easier to modify, but the language grammar doesn't change a lot.

No comments: