Saturday, April 19, 2014

CharMatcher in Go

The Guava library for Java has a CharMatcher that provides a way of composing character matching predicates plus functions that use those predicates.

For example, AnyOf("\r\n").Negate() creates a CharMatcher that matches any character except return or newline. You can then do things like cm.IndexIn(str) or cm.CountIn(str).

Some of this you can do directly with Go libraries. The unicode package provides some standard "matchers" like IsDigit and IsLetter. And the strings package has functions like IndexFunc and TrimFunc that take predicates.
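As a small illustration of what the standard library already covers, here is a sketch that passes unicode predicates to strings functions. The helper names firstDigit and trimLetters are made up for this example.

```go
package main

import (
	"strings"
	"unicode"
)

// firstDigit returns the byte index of the first Unicode digit in s,
// or -1 if there is none. (Hypothetical helper, for illustration only.)
func firstDigit(s string) int {
	return strings.IndexFunc(s, unicode.IsDigit)
}

// trimLetters strips leading and trailing letters from s.
// (Hypothetical helper, for illustration only.)
func trimLetters(s string) string {
	return strings.TrimFunc(s, unicode.IsLetter)
}
```

What is missing is a way to build new predicates out of existing ones (negation, union and so on) and helpers such as counting matches.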

But they don't do everything that CharMatcher does, so as an exercise I thought I'd try implementing something like CharMatcher in Go.

My first approach was basically an object-oriented style like I'd use in Java with CharMatch as an interface.
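A minimal sketch of what an interface-based version might look like follows. This is a reconstruction under assumptions, not the original code: the struct names isMatch and anyOfMatch and the method name Match are guesses.

```go
package main

import "strings"

// CharMatch is the interface every matcher implements.
type CharMatch interface {
	Match(c rune) bool
}

// isMatch stores the single rune to compare against.
type isMatch struct{ c rune }

func (m isMatch) Match(c rune) bool { return c == m.c }

// Is constructs a matcher for exactly one rune.
func Is(c rune) CharMatch { return isMatch{c} }

// anyOfMatch stores the set of acceptable runes as a string.
type anyOfMatch struct{ chars string }

func (m anyOfMatch) Match(c rune) bool { return strings.ContainsRune(m.chars, c) }

// AnyOf constructs a matcher for any rune contained in chars.
func AnyOf(chars string) CharMatch { return anyOfMatch{chars} }
```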

But when I started adding more matchers it seemed excessive to have to define three pieces for each: a struct, a match method for the struct, and a function to construct the struct.

My next thought was to get rid of the interface and have a generic struct containing a matching function as a member. This uses closures to store the matcher parameters rather than structs.
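Roughly, the struct-with-a-function idea could look like the sketch below. The field name match and the Match wrapper are assumptions made for illustration.

```go
package main

import "strings"

// CharMatch wraps a predicate. The matcher parameters live in the
// closure rather than in struct fields.
type CharMatch struct {
	match func(c rune) bool
}

// Is captures c1 in a closure instead of a dedicated struct.
func Is(c1 rune) CharMatch {
	return CharMatch{match: func(c2 rune) bool { return c1 == c2 }}
}

// AnyOf captures the set of acceptable runes in a closure.
func AnyOf(chars string) CharMatch {
	return CharMatch{match: func(c rune) bool { return strings.ContainsRune(chars, c) }}
}

// Match reports whether c satisfies the wrapped predicate.
func (cm CharMatch) Match(c rune) bool { return cm.match(c) }

// Negate returns a matcher that accepts exactly what cm rejects.
func (cm CharMatch) Negate() CharMatch {
	return CharMatch{match: func(c rune) bool { return !cm.match(c) }}
}
```

Each new matcher is now a single constructor function instead of three separate pieces.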

I was stuck on the idea of a struct so that I could define methods like Negate and IndexIn on it. Then I realized that in Go I could make CharMatch just a function, and still define methods on it. That led to this version:
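The listing is not reproduced here, so the following is only a sketch of the function-type idea. The constructor and method names follow the ones mentioned in this post, but the bodies are assumptions.

```go
package main

import "strings"

// CharMatch is just a predicate on runes. Because it is a named type,
// methods can still be declared on it.
type CharMatch func(c rune) bool

// Is matches exactly one rune.
func Is(c1 rune) CharMatch {
	return func(c2 rune) bool { return c1 == c2 }
}

// AnyOf matches any rune contained in chars.
func AnyOf(chars string) CharMatch {
	return func(c rune) bool { return strings.ContainsRune(chars, c) }
}

// InRange matches runes between from and to, inclusive.
func InRange(from, to rune) CharMatch {
	return func(c rune) bool { return from <= c && c <= to }
}

// Negate returns a matcher that accepts exactly what cm rejects.
func (cm CharMatch) Negate() CharMatch {
	return func(c rune) bool { return !cm(c) }
}

// Or returns a matcher that accepts what either cm or other accepts.
func (cm CharMatch) Or(other CharMatch) CharMatch {
	return func(c rune) bool { return cm(c) || other(c) }
}
```

A matcher is applied by calling it directly, as in AnyOf("\r\n").Negate()('x').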

I used InRange for DIGIT and AnyOf for SPACE as examples; these could also use the unicode package equivalents.
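To make the two options concrete, here is a sketch of both. The exact characters in SPACE and the names UnicodeDigit and UnicodeSpace are assumptions made for this example.

```go
package main

import (
	"strings"
	"unicode"
)

type CharMatch func(c rune) bool

// InRange matches runes between from and to, inclusive.
func InRange(from, to rune) CharMatch {
	return func(c rune) bool { return from <= c && c <= to }
}

// AnyOf matches any rune contained in chars.
func AnyOf(chars string) CharMatch {
	return func(c rune) bool { return strings.ContainsRune(chars, c) }
}

var (
	// ASCII-only definitions built from the generic constructors.
	DIGIT = InRange('0', '9')
	SPACE = AnyOf(" \t\r\n") // assumed character set

	// Unicode-aware alternatives: the unicode predicates already have
	// the right signature, so a plain type conversion is enough.
	UnicodeDigit = CharMatch(unicode.IsDigit)
	UnicodeSpace = CharMatch(unicode.IsSpace)
)
```

The two are not interchangeable: the unicode versions also accept non-ASCII digits and spaces such as U+0663 or U+00A0.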

IndexIn is an example of a method that just wraps a strings package function, whereas CountIn has no strings equivalent.
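A sketch of those two methods, assuming the function-type CharMatch: IndexIn can hand the matcher straight to strings.IndexFunc, while CountIn needs its own loop.

```go
package main

import "strings"

type CharMatch func(c rune) bool

// IndexIn returns the byte index of the first matching rune in s, or -1.
// A CharMatch value is assignable to the func(rune) bool parameter.
func (cm CharMatch) IndexIn(s string) int {
	return strings.IndexFunc(s, cm)
}

// CountIn returns the number of runes in s that match.
func (cm CharMatch) CountIn(s string) int {
	n := 0
	for _, c := range s {
		if cm(c) {
			n++
		}
	}
	return n
}
```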

The tests give some examples of how it's used.

One potential drawback of this approach is that the matcher parameters are "buried" in closures. This makes it impossible to do any processing or optimization (like the Guava CharMatcher precomputed method). For example, Is('a').Or(Is('b')) could be folded into AnyOf("ab"). If you wanted to do this, I think you'd have to go back to using structs (like my first approach).
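As a rough sketch of how structs make that kind of folding possible (the type names anyOf and either, and putting Or on the interface, are assumptions made for illustration): when both operands expose their character sets, Or can merge them instead of wrapping two matchers.

```go
package main

import "strings"

type CharMatch interface {
	Match(c rune) bool
	Or(other CharMatch) CharMatch
}

// anyOf keeps its parameters visible in a field, so they can be inspected.
type anyOf struct{ chars string }

func (m anyOf) Match(c rune) bool { return strings.ContainsRune(m.chars, c) }

// Or folds two anyOf matchers into one; anything else falls back to either.
func (m anyOf) Or(other CharMatch) CharMatch {
	if o, ok := other.(anyOf); ok {
		return anyOf{m.chars + o.chars}
	}
	return either{m, other}
}

// either is the generic, unoptimized union of two matchers.
type either struct{ a, b CharMatch }

func (m either) Match(c rune) bool            { return m.a.Match(c) || m.b.Match(c) }
func (m either) Or(other CharMatch) CharMatch { return either{m, other} }

// Is is represented as a one-character anyOf so that it can be folded.
func Is(c rune) CharMatch { return anyOf{string(c)} }

// AnyOf matches any rune contained in chars.
func AnyOf(chars string) CharMatch { return anyOf{chars} }
```

With this, Is('a').Or(Is('b')) yields the same value as AnyOf("ab"), at the cost of bringing back the struct, method and constructor for every matcher.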
