Sunday, April 23, 2023

New Regular Expression Implementation for gSuneido

Suneido has its own regular expression implementation. 25 years ago, when I wrote the original C++ cSuneido implementation, there weren't a lot of regular expression libraries, and I already had an implementation from a previous product, so that's what I used. With the Java version, I initially tried to use the Java standard regular expressions. But it was too hard to get exactly the same behavior, so I ended up using my own implementation. And I did the same thing with the Go version of Suneido.

But I haven't been totally happy with my implementation. It uses backtracking (like many other implementations) which means it has some bad worst cases. I had to put limits on the backtracking, which meant we couldn't use regular expressions in some places, which meant hand writing code to replace them. To make it worse, there was no easy explanation of what kinds of regular expressions were ok.

Recently I re-encountered this series of articles about implementing regular expressions.

Regular Expression Matching Can Be Simple And Fast
Regular Expression Matching: the Virtual Machine Approach
Regular Expression Matching in the Wild

I'd read them before but at that point I'd been satisfied with what I had. The articles made it seem easy to implement a non-backtracking approach to avoid the worst case behavior. Of course, it didn't turn out to be quite that easy to write a full implementation. Naturally, the articles didn't cover all the details.

It started out as a weekend project, but it ended up taking me longer than that, partly because I implemented optimizations for literal strings and single pass expressions.

Because Suneido is ASCII only (due to its age), the code is a little simpler and faster than the Go standard library version which handles UTF-8.

The Go regular expression package compiles to linked structs that are relatively large. I chose to compile to byte code stored in a string. I like to use strings because they are one of the few constant immutable types in Go. And string references are only 16 bytes compared to 24 bytes for byte slices. There's probably a bit more overhead processing the byte code, but it's a more compact chunk of memory so it'll be more cache friendly.

Humorously (to me) when I was fuzz testing my implementation against the Go standard library implementation I discovered a bug in the Go code. As expected, one of the initial responses was denial, blaming it on my lack of understanding. Thankfully, someone else (not on the Go team) confirmed the problem and even located a possible location of the bug. I thought a bug in something like regular expressions would be treated fairly seriously, but over a month later it's still just labeled as "NeedsInvestigation" rather than "NeedsFix". It was put under the Go1.21 milestone at least.

The new implementation is about 2000 lines of Go code. Probably half of that I was able to reuse from the previous version. As usual the code is on GitHub

No comments: