Saturday, April 08, 2023

Full Text Search

Suneido uses full text search for its help and wiki. jSuneido used Lucene, a natural fit for Java. I needed something different for gSuneido. I looked at Bleve, which seems to be the most common full text search library for Go, but it was big and slow, and the indexes were large. I had used Lunr.js on a previous project and it had worked well. It's small, fast, and easy to use. It's JavaScript, but we display our help and wiki in a browser window, so that seemed ok.

We (thanks Jatin) got it working, but it was a little ugly. Our customer systems have different combinations of applications and options, so we built custom indexes on each system. But for Lunr we were using Node.js to build the indexes, and we didn't want to install Node on all our customer systems. So we had to build an index containing "everything", ship it to our customers (every time we updated the help), and then filter the search results based on their configuration.

We only use our wiki in-house, but it raised another issue: people edit the wiki all the time. With Lucene we could update the index incrementally, but Lunr.js doesn't support that; you have to rebuild the whole index.

Eventually I got frustrated with the "friction" of using Lunr and decided to continue my long tradition of re-implementing the wheel. I'd considered writing my own full text index/search before but had resisted the urge. But I had a weekend and nothing urgent to do, so I decided to give it a try. I started with "Let's build a Full-Text Search engine", which makes it sound easy, but mostly because the simple version it describes isn't complete.

Tokenizing is easy, although there are questions about exactly what constitutes a token. Obviously sequences of letters qualify, but what lengths? I decided to ignore single letters and anything over a certain length (e.g. 32 characters). In our wiki especially, we also wanted to search on numbers. There's also the question of punctuation, which I'm currently ignoring.
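A sketch of that kind of tokenizer in Go; the length limits come from the rules above, but the function name and the exact details (lowercasing, treating digits like letters) are my assumptions, not necessarily what gSuneido does:

```go
import "unicode"

const maxToken = 32

// Tokenize splits text into lowercase tokens of letters or digits.
// Single-character tokens and tokens longer than maxToken are dropped,
// and punctuation just ends the current token.
func Tokenize(text string) []string {
	var tokens []string
	var cur []rune
	emit := func() {
		if len(cur) > 1 && len(cur) <= maxToken {
			tokens = append(tokens, string(cur))
		}
		cur = cur[:0]
	}
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			cur = append(cur, unicode.ToLower(r))
		} else {
			emit() // punctuation and whitespace end the current token
		}
	}
	emit() // flush any trailing token
	return tokens
}
```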

I used the Go Snowball (Porter2) stemmer mentioned in the article. A stemmer simplifies words. For example, fishing, fished and fisher are reduced to the base form (stem) fish. That reduces the number of terms in the index and makes searching more forgiving.
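For illustration, here's what that looks like with one Go port of the Snowball stemmer; github.com/kljensen/snowball is the package used in the article I started from, but whether gSuneido uses that exact package is my assumption:

```go
package main

import (
	"fmt"

	"github.com/kljensen/snowball"
)

func main() {
	for _, w := range []string{"fishing", "fished", "fishes"} {
		// Stem applies the Snowball (Porter2) algorithm for the given language.
		stem, err := snowball.Stem(w, "english", false)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", w, stem) // all three stem to "fish"
	}
}
```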

I got sidetracked into looking at bitmap data structures like roaring bitmaps (as suggested in the article), but then I started looking at how to implement search scoring and found that bitmaps weren't sufficient: I needed counts per term per document, not just true/false.
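So instead of a bit set per term, the index keeps a count per term per document, plus each document's length, which the scoring step needs. A minimal sketch of that structure (the names are mine, not gSuneido's):

```go
// Index maps each stemmed term to its postings: the documents
// containing it and how often it occurs in each.
type Index struct {
	postings map[string]map[int]int // term -> docID -> count
	docLen   map[int]int            // docID -> length in tokens (for scoring)
}

func NewIndex() *Index {
	return &Index{
		postings: make(map[string]map[int]int),
		docLen:   make(map[int]int),
	}
}

// Add indexes one document's (already tokenized and stemmed) terms.
func (ix *Index) Add(docID int, terms []string) {
	ix.docLen[docID] = len(terms)
	for _, t := range terms {
		m := ix.postings[t]
		if m == nil {
			m = make(map[int]int)
			ix.postings[t] = m
		}
		m[docID]++
	}
}
```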

I decided to use BM25 scoring, like Lucene and Lunr. I found a good article that explained it and even gave some examples I could check my results against.
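The formula itself is compact. Here is a sketch in Go of scoring one term for one document, using the IDF variant and the common defaults k1 = 1.2 and b = 0.75 that Lucene uses; the function name and constants are my assumptions, not necessarily what gSuneido does:

```go
import "math"

// BM25 scores one term for one document.
// tf       - occurrences of the term in the document
// docLen   - length of the document in tokens
// avgLen   - average document length across the index
// nDocs    - total number of documents
// nMatches - number of documents containing the term
func BM25(tf, docLen int, avgLen float64, nDocs, nMatches int) float64 {
	const k1, b = 1.2, 0.75
	// Inverse document frequency: rarer terms score higher.
	idf := math.Log(1 + (float64(nDocs)-float64(nMatches)+0.5)/(float64(nMatches)+0.5))
	// Term frequency, saturated by k1 and normalized by document length.
	f := float64(tf)
	return idf * f * (k1 + 1) / (f + k1*(1-b+b*float64(docLen)/avgLen))
}
```

A document's score for a query is then just the sum of this over the query's (stemmed) terms, and the results are sorted by score.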

It came together surprisingly easily, and by the end of the weekend I had about 1000 lines of Go code that could create indexes and search them. It was fast, and the indexes were small compared to Lunr's. The results seemed comparable. I felt a little guilty because it meant throwing out all the work that went into trying to use Lunr. For a change, I felt like I should have written my own version sooner.
