Sunday, April 23, 2023

New Regular Expression Implementation for gSuneido

Suneido has its own regular expression implementation. 25 years ago, when I wrote the original C++ cSuneido implementation, there weren't a lot of regular expression libraries, and I already had an implementation from a previous product, so that's what I used. With the Java version, I initially tried to use the Java standard regular expressions. But it was too hard to get exactly the same behavior, so I ended up using my own implementation. And I did the same thing with the Go version of Suneido.

But I haven't been totally happy with my implementation. It uses backtracking (like many other implementations), which means it has some bad worst cases. I had to put limits on the backtracking, which meant we couldn't use regular expressions in some places and had to hand-write code to replace them. To make it worse, there was no easy way to explain which kinds of regular expressions were ok.
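
To see what "bad worst cases" means, here's a small demonstration (my own example, not from gSuneido). The pattern below is the classic pathological case: a backtracking matcher tries roughly 2^n ways to split the a's between the nested loops before failing, while a non-backtracking engine like Go's answers in linear time.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

func main() {
	// Nested repetition plus a forced failure at the end:
	// a backtracking matcher explores an exponential number
	// of ways to divide the a's between the two + loops.
	re := regexp.MustCompile(`(a+)+$`)
	input := strings.Repeat("a", 40) + "b"

	start := time.Now()
	fmt.Println(re.MatchString(input), time.Since(start))
	// Go's non-backtracking engine prints "false" in microseconds.
}
```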

Recently I re-encountered Russ Cox's series of articles about implementing regular expressions.

Regular Expression Matching Can Be Simple And Fast
Regular Expression Matching: the Virtual Machine Approach
Regular Expression Matching in the Wild

I'd read them before, but at that point I'd been satisfied with what I had. The articles made it seem easy to implement a non-backtracking approach that avoids the worst-case behavior. Of course, it didn't turn out to be quite that easy to write a full implementation. Naturally, the articles didn't cover all the details.

It started out as a weekend project, but it ended up taking longer than that, partly because I implemented optimizations for literal strings and single-pass expressions.
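
The literal-string idea is roughly this (a sketch with hypothetical names, not gSuneido's actual code): if a pattern contains no metacharacters, skip the engine entirely and use a plain substring search, which the standard library already optimizes well. Go's regexp package stands in for the full engine here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// isLiteral reports whether pat contains no regex metacharacters,
// in which case it can be matched with a plain substring search.
func isLiteral(pat string) bool {
	return !strings.ContainsAny(pat, `\^$.|?*+()[]{}`)
}

// Match takes the fast path for literal patterns and falls back
// to the full engine for everything else.
func Match(pat, s string) bool {
	if isLiteral(pat) {
		return strings.Contains(s, pat)
	}
	return regexp.MustCompile(pat).MatchString(s) // stand-in for the real engine
}

func main() {
	fmt.Println(Match("hello", "say hello world")) // fast path: true
	fmt.Println(Match("h.*o", "say hello world"))  // full engine: true
}
```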

Because Suneido is ASCII-only (due to its age), the code is a little simpler and faster than the Go standard library version, which handles UTF-8.

The Go regular expression package compiles to linked structs that are relatively large. I chose to compile to byte code stored in a string. I like to use strings because they are one of the few immutable types in Go, and a string reference is only 16 bytes compared to 24 bytes for a byte slice. There's probably a bit more overhead processing the byte code, but it's a more compact chunk of memory, so it'll be more cache-friendly.
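
To make the byte-code-in-a-string idea concrete, here's a toy compiler for just literal characters and "." (the opcodes and encoding are invented for this sketch; gSuneido's real instruction set is different):

```go
package regex

// Toy opcodes for the sketch.
const (
	opChar  byte = iota // match the literal byte that follows
	opAny               // match any single byte
	opMatch             // report success
)

// compile turns a pattern of literal bytes and '.' into byte code.
// Returning a string makes the program immutable, and the reference
// is 16 bytes instead of the 24 a []byte slice would take.
func compile(pat string) string {
	code := make([]byte, 0, 2*len(pat)+1)
	for i := 0; i < len(pat); i++ {
		if pat[i] == '.' {
			code = append(code, opAny)
		} else {
			code = append(code, opChar, pat[i])
		}
	}
	return string(append(code, opMatch))
}
```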

Humorously (to me), when I was fuzz testing my implementation against the Go standard library implementation, I discovered a bug in the Go code. As expected, one of the initial responses was denial, blaming it on my lack of understanding. Thankfully, someone else (not on the Go team) confirmed the problem and even located a possible source of the bug. I thought a bug in something like regular expressions would be treated fairly seriously, but over a month later it's still labeled "NeedsInvestigation" rather than "NeedsFix". At least it was put under the Go1.21 milestone.
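
As an aside, this kind of cross-checking fits naturally into Go's built-in fuzzing. A rough sketch of the harness (Match is a hypothetical stand-in for the gSuneido entry point, and a real harness also has to skip patterns where the two dialects legitimately differ):

```go
package regex

import (
	"regexp"
	"testing"
)

func FuzzVsStdlib(f *testing.F) {
	f.Add(`a(b|c)*d`, "abccbd")
	f.Fuzz(func(t *testing.T, pat, input string) {
		re, err := regexp.Compile(pat)
		if err != nil {
			t.Skip() // not a pattern the stdlib accepts
		}
		want := re.MatchString(input)
		if got := Match(pat, input); got != want { // Match: hypothetical entry point
			t.Errorf("pattern %q, input %q: got %v, want %v",
				pat, input, got, want)
		}
	})
}
```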

The new implementation is about 2000 lines of Go code, probably half of which I was able to reuse from the previous version. As usual, the code is on GitHub.

Saturday, April 08, 2023

Full Text Search

Suneido uses full text search for its help and wiki. jSuneido used Lucene, a natural fit for Java. I needed something different for gSuneido. I looked at Bleve, which seems to be the most common Go full text search library. But it was big and slow, and the indexes were large. I had used Lunr.js on a previous project and it had worked well. It's small, fast, and easy to use. It's JavaScript, but we display our help and wiki in a browser window, so that seemed ok.

We (thanks Jatin) got it working, but it was a little ugly. Our customer systems have different combinations of applications and options, so we built custom indexes on each system. But for Lunr we were using Node.js to build the indexes, and we didn't want to install Node on all our customer systems. So we had to build an index containing "everything", ship that to our customers (every time we updated the help), and then filter the search results based on their configuration.

We only use our wiki in-house, but it raised another issue: people edit the wiki all the time. With Lucene, we could update the index with changes, but Lunr.js doesn't support that; you have to rebuild the whole index.

Eventually I got frustrated with the "friction" of using Lunr and decided to continue my long tradition of re-implementing the wheel. I'd considered writing my own full text index/search previously but had resisted the urge. But I had a weekend, and nothing urgent to do, so I decided to give it a try. I started with Let's build a Full-Text Search engine, which makes it sound easy, but mostly because the simple version it describes isn't complete.

Tokenizing is easy, although there are questions about exactly what constitutes a token. Obviously sequences of letters, but what length? I decided to ignore single letters and also to ignore anything over a certain length (e.g. 32 characters). Especially in our wiki, we also wanted to search on numbers. There's also the issue of punctuation, which I'm currently ignoring.
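
Here's roughly what those rules look like in code (a sketch; the details are my own rendering of them, not gSuneido's source):

```go
package fts

import "unicode"

// tokenize splits text into lowercase runs of letters and digits,
// dropping single-character tokens and anything over 32 characters.
func tokenize(text string) []string {
	var tokens []string
	var cur []rune
	flush := func() {
		if len(cur) > 1 && len(cur) <= 32 {
			tokens = append(tokens, string(cur))
		}
		cur = cur[:0]
	}
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			cur = append(cur, unicode.ToLower(r))
		} else {
			flush() // punctuation and whitespace end a token
		}
	}
	flush()
	return tokens
}
```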

I used the Go Snowball (Porter2) stemmer mentioned in the article. A stemmer simplifies words. For example, fishing, fished and fisher are reduced to the base form (stem) fish. That reduces the number of terms in the index and makes searching more forgiving.
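
Using it looks something like this (I believe the package from the article is github.com/kljensen/snowball, but treat the import path and signature as my assumption):

```go
package main

import (
	"fmt"

	"github.com/kljensen/snowball/english"
)

func main() {
	for _, w := range []string{"fishing", "fished", "fisher"} {
		// Stem applies the Porter2 rules; the second argument
		// says whether to stem stop words as well.
		fmt.Println(w, "->", english.Stem(w, false))
	}
}
```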

I got sidetracked into looking at bitmap data structures like roaring bitmaps (as suggested in the article), but then I started looking at how to implement search scoring and found that bitmaps weren't sufficient: I needed counts per term per document, not just true/false.
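
So instead of a bitmap per term, the postings need a count per document. The simplest shape is nested maps (a sketch; a real index would use something more compact):

```go
package fts

// postings maps term -> document ID -> occurrence count.
// A bitmap could only record membership; scoring needs the counts.
type postings map[string]map[int]int

func (p postings) add(docID int, tokens []string) {
	for _, t := range tokens {
		m := p[t]
		if m == nil {
			m = make(map[int]int)
			p[t] = m
		}
		m[docID]++
	}
}
```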

I decided to use BM25 scoring, like Lucene and Lunr. I found a good article that explained it and even gave some examples I could check my results against.
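
For reference, here's the textbook BM25 formula in code, with the usual constants k1 = 1.2 and b = 0.75 (this is the standard formulation, not necessarily gSuneido's exact code):

```go
package fts

import "math"

// bm25 scores one query term against one document:
//   tf     - occurrences of the term in the document
//   docLen - length of the document in tokens
//   avgLen - average document length in the corpus
//   n      - number of documents containing the term
//   total  - total number of documents
// A document's score for a query is the sum over the query's terms.
func bm25(tf, docLen, avgLen float64, n, total int) float64 {
	const k1, b = 1.2, 0.75
	idf := math.Log(1 + (float64(total)-float64(n)+0.5)/(float64(n)+0.5))
	return idf * tf * (k1 + 1) / (tf + k1*(1-b+b*docLen/avgLen))
}
```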

It came together surprisingly easily, and by the end of the weekend I had about 1000 lines of Go code that could create indexes and search them. It was fast, and the indexes were small (compared to Lunr). The results seemed comparable. I felt a little guilty because it meant throwing out all the work that went into trying to use Lunr. For a change, I felt like I should have written my own version sooner.