Tuesday, January 16, 2018

Small Size Data Structure Optimization

The usual approach for an expandable "vector" data structure is to use a dynamically allocated array that doubles in size as necessary. That's how C++ std::vector, Java ArrayList, and Go slice/append work.
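
For example, Go's append shows this growth directly (a quick illustration of mine; the exact growth factor is an implementation detail and tapers off for large slices):

    package main

    import "fmt"

    func main() {
        var s []int
        prev := 0
        for i := 0; i < 1000; i++ {
            s = append(s, i)
            if cap(s) != prev { // print only when the backing array was reallocated
                fmt.Printf("len=%d cap=%d\n", len(s), cap(s))
                prev = cap(s)
            }
        }
    }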

But that means the time and space overhead of a heap allocation. And it means indirection and poor locality for CPU caches. (The array itself is contiguous, but you have to chase a pointer to get to it.)

If you have a lot of small lists, one possible optimization is to "embed" a small array inside the list structure. That's not really feasible in Java, but easily done in C++ or Go. One example of this is the small string optimization (SSO) in C++.
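
In Go the embedding might look something like this (a minimal sketch; the type name and the size 4 are arbitrary choices of mine):

    // SmallVec stores its first few elements inline, so short lists
    // need no heap allocation at all.
    type SmallVec struct {
        embed [4]int // small embedded array, part of the struct itself
        n     int    // total number of elements
        spill []int  // dynamically allocated array, nil until needed
    }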

But what happens when you grow beyond that small embedded array? The simplest solution is to just stop using it. But that seems wasteful, both in space and in locality.

Another option is to treat the embedded array as the start of the larger overall array, at the cost of complicating the indexing a little. That fixes the space wastage, but we still don't get much benefit from locality since the additions to a vector go at the end.
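
With that layout, indexing might look like this (again a hypothetical sketch, continuing with int elements):

    // PrefixVec: the embedded array holds elements 0..3 and the
    // slice holds element 4 onward, so every index must be split.
    type PrefixVec struct {
        embed [4]int
        spill []int
        n     int // total number of elements
    }

    func (v *PrefixVec) At(i int) int {
        if i < len(v.embed) {
            return v.embed[i] // the first few elements live inline
        }
        return v.spill[i-len(v.embed)] // the rest live on the heap
    }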

What if we treat the embedded array as following the used elements of the larger dynamically allocated array? Additions go into the embedded array until it fills up, and then we spill the contents into the larger dynamically allocated array.

If the embedded array holds, say, 4 elements, then 3/4 of the adds touch only the embedded array, with good locality. Every 4th addition block-moves the entire contents of the embedded array (4 elements) into the larger dynamic array (which occasionally doubles in size). So the embedded array acts as a "buffer".
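
A sketch of that buffering scheme in Go (hypothetical names; real code would also want bounds checks, iteration, and so on):

    // BufVec: spill holds the oldest elements; embed buffers the
    // most recent adds and is flushed into spill when it fills.
    type BufVec struct {
        embed [4]int // write buffer with good locality
        nbuf  int    // number of elements currently buffered
        spill []int  // everything added before the buffered elements
    }

    func (v *BufVec) Add(x int) {
        if v.nbuf == len(v.embed) {
            // buffer full: block-move it onto the end of the big array,
            // which append grows (roughly doubling) as needed
            v.spill = append(v.spill, v.embed[:]...)
            v.nbuf = 0
        }
        v.embed[v.nbuf] = x
        v.nbuf++
    }

    func (v *BufVec) At(i int) int {
        if i < len(v.spill) {
            return v.spill[i]
        }
        return v.embed[i-len(v.spill)] // buffered elements follow spill
    }

    func (v *BufVec) Len() int {
        return len(v.spill) + v.nbuf
    }

Note that with lazy spilling like this, a vector of 4 or fewer elements never touches the heap, since the spill slice stays nil until the buffer first overflows.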

Merge trees and buffered B-trees have a somewhat similar pattern, in that most of the activity takes place in a small area which periodically spills into the rest of the data structure.

I did some quick benchmarking using Go. My initial result was that the buffered version was slightly slower, but did about half as many allocations (because of the embedded small array). The speed was a little disappointing, but I was comparing against a simple Go slice append, so my code was admittedly more complex.
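
The comparison can be written with Go's built-in benchmarking, something along these lines (a sketch reusing the hypothetical BufVec type from above; ReportAllocs shows allocations per operation alongside timing):

    package vec

    import "testing"

    func BenchmarkBufVec(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            var v BufVec
            for j := 0; j < 100; j++ {
                v.Add(j)
            }
        }
    }

    func BenchmarkSliceAppend(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            var s []int
            for j := 0; j < 100; j++ {
                s = append(s, j)
            }
        }
    }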

Of course, the locality part of it is harder to measure. A small test that fits entirely in cache will not benefit much from locality. Sure enough, if I made the test create many more vectors, then the new version beat a simple Go slice append by about 30%.

The allocation advantage is entirely dependent on how big the vectors are. If they are small enough to fit in the embedded array, then there are zero allocations. If the vectors are very large then the number of allocations approaches the same as the slice append. Of course, the whole reason for the embedded array is that you have a lot of little objects.

One drawback of the embedded array is that it makes the data structure larger. A vector can be as small as a pointer and two integers (or pointers), which is reasonable to pass by value. Once you add an embedded array, you probably don't want to pass it by value or store it by value in other vectors or maps. If that means you have to allocate it on the heap, then you won't get as much advantage. However, if you're using it locally, or embedding it in a larger data structure, then it's fine.
