Thursday, January 22, 2009

More Antlr Fun

When I started on the Antlr grammar for the Suneido language, I realized I hadn't handled escape sequences in string literals in queries.

I thought this would be easy, but I thrashed around for quite a while on it.

There are lots of examples like:
: '"' (EscapeSequence | ~('"'))* '"'
fragment EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'v'|'\"'|'\''|'\\')
| '\\' 'x' HexDigit+
| OctalEscape
fragment OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
Looks straightforward. But ... there aren't many examples that show how to convert those escape sequences to the proper characters. There are a few like:
( 'n' {$setText("\n");}
| 'r' {$setText("\r");}
| 't' {$setText("\t");}
Maybe that worked in Antlr 2, but it doesn't work in Antlr 3. The nearest thing to an explanation I could find was a cryptic response to a mailing list question. It seems to imply there is a way to do it, but if so, I couldn't figure it out. (I hope my responses to Suneido questions aren't as cryptic!)

I found a few references to this being an area where Antlr is weak. The book has very little material on lexical analysis.

I tried various combinations and permutations but I couldn't find any way to make it work. In the end I just handled it outside Antlr by writing a function that would convert the escape sequences in a string. So my grammar just has:
: '"' ( ESCAPE | ~('"'|'\\') )* '"'
| '\'' ( ESCAPE | ~('\''|'\\') )* '\''
ESCAPE : '\\' . ;
I guess I should have taken this approach from the start, but the examples seem to show more of the handling being done in the lexer. And like my last Antlr issue, it seems a little ugly to be able to recognize things in the grammar, but then have to reprocess them yourself later.
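For illustration, here is a rough sketch of what such a conversion function could look like in Java. This is not the actual Suneido code; the class name StringUnescaper and the method name unescape are made up for the example. It assumes the surrounding quotes have already been stripped, that \x takes at most two hex digits, that octal escapes follow the one-to-three digit pattern from the examples above, and that any other escaped character just stands for itself.

```java
// Hypothetical helper (names invented for this sketch, not from Suneido).
final class StringUnescaper {
    private StringUnescaper() {}

    /** Converts backslash escapes in the body of a string literal (quotes already removed). */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\' || i == s.length()) {
                out.append(c); // ordinary character, or a lone trailing backslash
                continue;
            }
            char e = s.charAt(i++); // the character following the backslash
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case 'v': out.append((char) 0x0B); break; // vertical tab
                case 'x': {
                    // Assumption: at most two hex digits after \x.
                    int start = i;
                    int value = 0;
                    while (i < s.length() && i - start < 2
                            && Character.digit(s.charAt(i), 16) >= 0) {
                        value = value * 16 + Character.digit(s.charAt(i++), 16);
                    }
                    if (i == start) {
                        out.append('x'); // no digits followed: keep the 'x' as-is
                    } else {
                        out.append((char) value);
                    }
                    break;
                }
                default:
                    if (e >= '0' && e <= '7') {
                        // Octal: three digits only when the first is 0..3, else two.
                        int maxDigits = (e <= '3') ? 3 : 2;
                        int value = e - '0';
                        int digits = 1;
                        while (i < s.length() && digits < maxDigits
                                && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                            value = value * 8 + (s.charAt(i++) - '0');
                            digits++;
                        }
                        out.append((char) value);
                    } else {
                        out.append(e); // covers \" \' \\ and any unknown escape
                    }
            }
        }
        return out.toString();
    }
}
```

The idea would be to call something like this on the token text, minus the quotes, after the lexer has matched the whole string literal.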

1 comment:

Anonymous said...

In ANTLR 3 this construction seems to work instead of $setText:
: '\\'
( c='"' { $c.setText("\""); }
| c='\\' { $c.setText("\\"); }