Tuesday, February 01, 2022

A Better LRU Cache for gSuneido

When I was working on gSuneido performance I discovered that the LRU cache in the standard library was one of the hot spots.

It was a simple design - a hash map of keys to values, plus an array of keys to track which was least recently used. When I wrote it, many years ago, it was just a quick addition. I had no idea it would become so heavily used, or a bottleneck. Usage increased a lot when we made it easy to memoize a function.

I actually had to work backwards to find that LruCache was a bottleneck. What I first discovered (with the Go memory profiler) was that deepEqual was responsible for the majority of allocation. It was being called by SuObject Equal. The allocation was to track which comparisons were in progress, in order to handle self-referential data. I was able to tweak deepEqual to greatly reduce the allocation. That removed the majority of the bottleneck.

But why were there so many object comparisons? I added debugging to print the Suneido call stack every 100 comparisons. They were coming from object.Remove, which was being called by LruCache. To maintain its LRU list, it would move items to the most-recently-used end of the list by removing them and re-adding them. Since the list was just an array, finding the item to remove meant a linear search, i.e. a lot of comparisons. Originally, caches had mostly been small, as you could tell from the default size of 10. But now the most common size was 200. Linear search is reasonable for 10 items; for 200 it's not so good.

In addition, keys were originally simple values like strings, but they often ended up being several values combined in an object. The combination of linear searches through long arrays of composite keys is what led to the bottleneck.

I decided to make LruCache built-in, i.e. written in Go. But I also needed a new design that would avoid the linear searches. The normal way to track LRU order is with a doubly linked list. I started in that direction, but linked lists are not a good fit for modern CPUs. Because the entries are individually allocated, they are scattered in memory, which leads to CPU cache misses. Contiguous arrays are much better because memory is optimized for sequential access. You could store the linked list entries in a contiguous block, but you still have the overhead of pointers (or indexes), and you still don't have locality or sequential access.

I ended up with the following design (roughly sketched in Go after the list).

  • an unordered array of key, value pairs (entries)
  • a hash map of keys to entry indexes (hmap)
  • an array of entry indexes in LRU order
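To make that concrete, here's a rough sketch of the layout in Go. This is not the actual gSuneido code - the real cache stores Suneido values, and the names here are just illustrative.

// entry is a key,value pair stored in the unordered entries array.
// (Illustrative only - string keys and interface{} values stand in for Suneido values.)
type entry struct {
    key   string
    value interface{}
}

type LruCache struct {
    entries []entry        // unordered key,value pairs
    hmap    map[string]int // key => index into entries
    lru     []int          // entry indexes, least recently used first
    size    int            // maximum number of entries
}

func NewLruCache(size int) *LruCache {
    return &LruCache{
        entries: make([]entry, 0, size),
        hmap:    make(map[string]int, size),
        lru:     make([]int, 0, size),
        size:    size,
    }
}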

A successful lookup becomes slightly more complex - look in the hash table for the entry index, and then get the value from the entries array. Finally, we have to find that entry index in the lru array (a linear search but of small integers) and move it to the most recently used end.
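Continuing the sketch above (again, illustrative names, not the real gSuneido code), a lookup might look something like this:

// Get returns the cached value for key and whether it was found.
// On a hit, the entry's index is moved to the most recently used end of lru.
func (c *LruCache) Get(key string) (interface{}, bool) {
    ei, ok := c.hmap[key]
    if !ok {
        return nil, false
    }
    // find the entry index in the lru array - a linear search,
    // but over small integers, not composite keys
    for i, x := range c.lru {
        if x == ei {
            copy(c.lru[i:], c.lru[i+1:]) // shift the newer ones down
            c.lru[len(c.lru)-1] = ei     // put it at the MRU end
            break
        }
    }
    return c.entries[ei].value, true
}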

The hash table lookup still requires comparing argument objects, but far fewer of them than a linear search would.

To replace the oldest item, we take the entry index from the least recently used end of the lru array. From that entry we get the key, which lets us remove it from the hash table. And we reuse that slot in the entries array to store the new key and value.
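A sketch of that replacement logic, again with illustrative names, and assuming the caller has already checked that the key isn't present:

// Put adds key,value, reusing the least recently used slot when the cache is full.
func (c *LruCache) Put(key string, value interface{}) {
    if len(c.entries) < c.size {
        // still room - append a new entry
        c.entries = append(c.entries, entry{key, value})
        ei := len(c.entries) - 1
        c.hmap[key] = ei
        c.lru = append(c.lru, ei)
        return
    }
    // full - evict the least recently used entry and reuse its slot
    ei := c.lru[0]
    delete(c.hmap, c.entries[ei].key)
    c.entries[ei] = entry{key, value}
    c.hmap[key] = ei
    copy(c.lru, c.lru[1:])   // shift everything toward the LRU end
    c.lru[len(c.lru)-1] = ei // the reused slot is now most recently used
}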

The new built-in LruCache was about 100 times faster than the old one. Overall, it improved the speed of our application test suite by about 5%. That’s pretty good for a couple days work.

ASIDE: There are two main aspects to a cache. The focus is usually on which items to evict, i.e. least recently used or least frequently used. The other aspect is which items to add. The normal assumption is that you add everything. But really, you only want to add items you're going to be asked for again; otherwise you're evicting potentially useful items. We can't predict the future, but we can assume that items we see twice are more likely to be seen again. One approach is to use a Bloom filter to track which keys you've seen, and only add to the cache keys you've seen before.
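For example, here's a minimal sketch of that idea (not something gSuneido does - the Bloom filter here is deliberately simplistic):

import "hash/fnv"

// admitFilter is a tiny Bloom-filter-style "seen before" check.
type admitFilter struct {
    bits  []uint64
    nbits uint32
}

func newAdmitFilter(nbits uint32) *admitFilter {
    return &admitFilter{bits: make([]uint64, (nbits+63)/64), nbits: nbits}
}

func (f *admitFilter) pos(key string, seed byte) uint32 {
    h := fnv.New32a()
    h.Write([]byte{seed})
    h.Write([]byte(key))
    return h.Sum32() % f.nbits
}

// seen records key and reports whether it was (probably) seen before.
func (f *admitFilter) seen(key string) bool {
    was := true
    for seed := byte(0); seed < 3; seed++ {
        p := f.pos(key, seed)
        if f.bits[p/64]&(1<<(p%64)) == 0 {
            was = false
            f.bits[p/64] |= 1 << (p % 64)
        }
    }
    return was
}

On a cache miss, you'd compute the value either way, but only Put it if filter.seen(key) returns true, so one-shot keys never displace entries that might be reused.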