Saturday, April 19, 2014

CharMatcher in Go

The Guava library for Java has a CharMatcher that provides a way of composing character matching predicates plus functions that use those predicates.

For example, AnyOf("\r\n").Negate() creates a CharMatcher that matches any character except return or newline. You can then do things like cm.IndexIn(str) or cm.CountIn(str)

Some of this you can do directly with Go libraries. The unicode package provides some standard "matchers" like IsDigit and IsLetter. And the strings package has functions like IndexFunc and TrimFunc that take predicates.

But they don't do everything that CharMatcher does, so as an exercise I thought I'd try implementing something like CharMatcher in Go.

My first approach was basically an object-oriented style like I'd use in Java with CharMatch as an interface.

type CharMatch interface {
Match(c rune) bool
}
type is_cm struct {
c rune
}
func (cm is_cm) Match(c rune) bool {
return c == cm.c
}
func Is(c rune) CharMatch {
return is_cm{c}
}
view raw charmatch0.go hosted with ❤ by GitHub
But when I started adding more matchers it seemed excessive to have to define three pieces for each - a struct, a match method for the struct, and a function to construct the struct.

My next thought was to get rid of the interface and have a generic struct containing a matching function as a member. This uses closures to store the matcher parameters rather than structs.

type CharMatch struct {
fn func(rune) bool
}
func (cm CharMatch) Match(c rune) bool {
return cm.fn(c)
}
func Is(c rune) CharMatch {
return CharMatch{ func (c2 rune) bool { return c == c2 }}
}
view raw charmatch1.go hosted with ❤ by GitHub
I was stuck on the idea of a struct so that I could define methods like Negate and IndexIn on it. Then I realized that in Go I could make CharMatch just a function, and still define methods on it. That led to this version:

type CharMatch func(rune) bool
// Predefined CharMatch's
var (
SPACE CharMatch = AnyOf(" \t\r\n")
DIGIT CharMatch = InRange('0', '9')
LETTER CharMatch = unicode.IsLetter
LOWER CharMatch = unicode.IsLower
UPPER CharMatch = unicode.IsUpper
)
// Match returns true if the character matches, otherwise false
func (cm CharMatch) Match(c rune) bool {
return cm(c)
}
// Is returns a CharMatch that matches a specific character
func Is(c1 rune) CharMatch {
return func(c2 rune) bool { return c1 == c2 }
}
// AnyOf returns a CharMatch that matches any character in a string
func AnyOf(s string) CharMatch {
return func(c rune) bool { return strings.ContainsRune(s, c) }
}
// InRange returns a CharMatch that matches any character in the range (inclusive)
func InRange(from rune, to rune) CharMatch {
return func(c rune) bool { return from <= c && c <= to }
}
// Negate returns a CharMatch that matches any character not matched by this one
func (cm CharMatch) Negate() CharMatch {
return func(c rune) bool { return !cm.Match(c) }
}
// Or returns a CharMatch that matches any character that matches either CharMatch
func (cm1 CharMatch) Or(cm2 CharMatch) CharMatch {
return func(c rune) bool { return cm1.Match(c) || cm2.Match(c) }
}
// CountIn returns the number of characters in the string that match
func (cm CharMatch) CountIn(s string) int {
n := 0
for _, c := range s {
if cm.Match(c) {
n++
}
}
return n
}
// IndexIn returns the index of the first character that matches
// or -1 if no match is found
func (cm CharMatch) IndexIn(s string) int {
return strings.IndexFunc(s, cm)
}
view raw charmatch2.go hosted with ❤ by GitHub
I used InRange for DIGIT and AnyOf for SPACE as examples, these could also use the unicode package equivalents.

IndexIn is an example of a method that just wraps a strings package function, whereas CountIn has no strings equivalent.

The tests give some examples of how it's used.

One potential drawback of this approach is that the matcher parameters are "buried" in closures. This makes it impossible to do any processing or optimization (like the Guava CharMatcher precomputed method). For example, Is('a').Or(Is('b')) could be folded into AnyOf('ab'). If you wanted to do this, I think you'd have to go back to using structs (like my first approach).

No comments: