Thursday, July 09, 2009

Java Regular Expression Issue

I'm still grinding away on getting all the standard library tests to succeed on jSuneido.

I just ran into a problem because "^\s*$" doesn't match an empty string!?

Nor does "^$"

Nor does "^" (although just "$" does).

I find that if I don't enable multi-line mode, all of those match, as I'd expect.

Pattern.compile("^").matcher("").find() => true

Pattern.compile("^", Pattern.MULTILINE).matcher("").find() => false


But I need multi-line mode to make it work the same as cSuneido.

I've tried to find anything in the documentation or on the web to explain this, but haven't had any luck. It doesn't make much sense to me. The documentation says:
By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
The only thing I can think of is that it's applying "except at the end of input" even when there is no line terminator. I guess it depends whether you parse it as

(matches at the beginning of input) and (after any line terminator except at the end of input)

or

(matches at the beginning of input and after any line terminator) except at the end of input

To me, the first makes more sense, but it appears to be working like the second.
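One way to see the second reading in action is to count where "^" matches in a few inputs. This is only a small sketch; the class and method names (CaretDemo, countCaretMatches) are made up for illustration, and the counts in the comments are what I'd expect from the behavior described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaretDemo {
    // Counts how many positions "^" matches at in the input, using the given flags.
    static int countCaretMatches(String input, int flags) {
        Matcher m = Pattern.compile("^", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Start of input, plus after the newline: 2 matches.
        System.out.println(countCaretMatches("a\nb", Pattern.MULTILINE));
        // The position after the trailing newline is the end of input, so only 1 match.
        System.out.println(countCaretMatches("a\n", Pattern.MULTILINE));
        // Position 0 is also the end of input, so 0 matches.
        System.out.println(countCaretMatches("", Pattern.MULTILINE));
        // Without MULTILINE the empty string matches once.
        System.out.println(countCaretMatches("", 0));
    }
}
```

So "except at the end of input" seems to apply to every candidate position, including the very beginning when the input is empty.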

So far I've been able to handle the differences between Suneido regular expressions and Java regular expressions by translating and escaping the expressions. But that's tricky for this problem. I guess I could turn off multi-line mode if the string being matched doesn't have any newlines. Except I'm caching the compiled regular expressions so I'd have to cache two versions. And it also means an extra search of the string on every match. Yuck.
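For what it's worth, the "cache two versions" idea might look roughly like the sketch below. This is not jSuneido's actual code; DualModeRegex and its cache are hypothetical, and I'm assuming that only "\n" counts as a line terminator (hence UNIX_LINES), which may or may not be what cSuneido does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

public class DualModeRegex {
    // Hypothetical cache: each regex maps to { multi-line, single-line } compiled patterns.
    private static final Map<String, Pattern[]> CACHE = new ConcurrentHashMap<>();

    private static Pattern[] compileBoth(String regex) {
        return new Pattern[] {
            // Assumption: UNIX_LINES so that only '\n' is treated as a line terminator.
            Pattern.compile(regex, Pattern.MULTILINE | Pattern.UNIX_LINES),
            Pattern.compile(regex, Pattern.UNIX_LINES)
        };
    }

    // Uses the multi-line pattern only when the input actually contains a newline.
    public static boolean find(String regex, String input) {
        Pattern[] both = CACHE.computeIfAbsent(regex, DualModeRegex::compileBoth);
        boolean hasNewline = input.indexOf('\n') >= 0; // the extra scan on every match
        return both[hasNewline ? 0 : 1].matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(find("^\\s*$", ""));  // single-line pattern is used
        System.out.println(find("^b", "a\nb"));  // multi-line pattern is used
    }
}
```

It does show both costs: two compiled patterns per expression, and an indexOf over the string before every match.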

Of course, my other option is to port cSuneido's regular expression code, rather than using Java's. Ugh.

Backwards compatibility is really a pain!

1 comment:

Unknown said...

Give .NET/Mono a chance, seriously. I am currently focused on the Python/Django area, but everything the Microsoft development tools department does IS definitely very good. Look at C# 4.0 with integrated DLR features (search for Anders Hejlsberg's presentation about this, it's awesome imho). I trust 'em a lot.