Tuesday, August 04, 2020

Sorted List

Let's say you want to add a large number (e.g. up to millions) of values to a list, one at a time, in no particular order, and at the end you want to iterate through them in sorted order. There are a number of ways you can accomplish this.

Most languages have some sort of list data structure. In C++ you could use a vector, in Java an ArrayList, in Go a slice. Most of these grow by doubling (or possibly some other factor) in size as required. At the end you can simply sort the list. There are several drawbacks to this approach. One is wasted space. Assuming doubling, the list is between 50% and 100% full, so on average you waste 25%. You also allocate roughly double the amount of space you end up with (e.g. to end up with 8 elements you allocate 1, 2, 4, and then 8). In a garbage collected language this puts extra pressure on the garbage collector. And each time the size doubles you have to copy all the existing values to the new space.

One way around the space, allocation, and copying overhead is to split the list up into fixed-size blocks and grow the list by adding more blocks.
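For example, a block list might look something like this. (A rough sketch in Go, not the actual Suneido code; the names, the int element type, and the block size are just placeholders.)

package sortlist

const blockSize = 4096

type block [blockSize]int

// List stores values in fixed-size blocks. Growing only allocates
// another block; existing values are never copied (only the small
// slice of block pointers grows).
type List struct {
	blocks []*block
	n      int // number of values used in the last block
}

func (ls *List) Add(x int) {
	if len(ls.blocks) == 0 || ls.n == blockSize {
		ls.blocks = append(ls.blocks, new(block))
		ls.n = 0
	}
	ls.blocks[len(ls.blocks)-1][ls.n] = x
	ls.n++
}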

You could use an ordered set data structure (assuming the values are unique) like a Java SortedSet or a C++ set. These are often implemented with something like a red-black tree, which is not very efficient space-wise since each value has one or more pointers associated with it.

Log structured merge (LSM) trees are another option. If you just use lists for each level, the space overhead is minimal. Like the simple list, you can split the levels into blocks to reduce space and allocation overhead.

But we don't really need an ordered data structure since we don't need the order till all the values have been added. So an ordered set or LSM tree is doing more work than necessary.

My current implementation in jSuneido evolved from an LSM tree. It uses a list of blocks, and at the end it sorts each block individually and does a series of two-way merges to combine them. When deciding what to use for gSuneido, it struck me that it would probably be more efficient to just sort the list as a whole, since the two-way merges merge (compare and copy) each value log2(n) times. It was relatively easy to modify my jSuneido implementation to try that. Curiously, it was slower. That seems a bit odd, since the overall sort should do fewer compares and copies. I have to assume that once again modern CPU design is making intuition invalid. I would guess that sorting the blocks individually is fast because they fit in the cache, and that merging is fast because it reads and writes sequentially, which is optimal for cache usage.
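In rough terms, the finishing pass looks something like this. (Continuing the block list sketch above, with "sort" imported; this is not the actual jSuneido or gSuneido code, just the shape of it.)

// Sorted sorts each block individually, then does a series of
// two-way merges of the sorted runs until a single run remains.
func (ls *List) Sorted() []int {
	runs := make([][]int, 0, len(ls.blocks))
	for i, b := range ls.blocks {
		n := blockSize
		if i == len(ls.blocks)-1 {
			n = ls.n
		}
		sort.Ints(b[:n]) // sort each block in place
		runs = append(runs, b[:n])
	}
	// each pass halves the number of runs, so each value is
	// merged about log2(number of blocks) times
	for len(runs) > 1 {
		merged := make([][]int, 0, (len(runs)+1)/2)
		for i := 0; i+1 < len(runs); i += 2 {
			merged = append(merged, merge(runs[i], runs[i+1]))
		}
		if len(runs)%2 == 1 {
			merged = append(merged, runs[len(runs)-1])
		}
		runs = merged
	}
	if len(runs) == 0 {
		return nil
	}
	return runs[0]
}

// merge combines two sorted slices into one sorted slice.
func merge(a, b []int) []int {
	result := make([]int, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if a[0] <= b[0] {
			result = append(result, a[0])
			a = a[1:]
		} else {
			result = append(result, b[0])
			b = b[1:]
		}
	}
	result = append(result, a...)
	return append(result, b...)
}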

You could merge all the blocks at once with an N-way merge, rather than a sequence of two-way merges. This would mean fewer comparisons and copies, but it doubles the space usage during the merge. A nice aspect of a block list and two-way merges is that you can recycle the blocks. As soon as you finish merging a block, you can put it on a free list to reuse. This means the space overhead for two-way merging is only a couple of blocks instead of double the size (which reduces allocation, which in turn reduces GC pressure).
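The free list itself can be trivial, something like the following. (Again just a sketch, continuing the code above; the merge that uses it isn't shown.)

// freeList recycles blocks during merging. A block-at-a-time merge
// draws its output blocks from get() and calls put() on each input
// block as soon as its last value has been copied out, so the extra
// space during a merge stays at a couple of blocks.
type freeList []*block

func (f *freeList) get() *block {
	if n := len(*f); n > 0 {
		b := (*f)[n-1]
		*f = (*f)[:n-1]
		return b
	}
	return new(block)
}

func (f *freeList) put(b *block) {
	*f = append(*f, b)
}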

One interesting aspect of this approach is that the sorting and merging can be done incrementally. As soon as you fill up a block you can sort it. And as soon as you have two (or a power of two) sorted blocks you can merge them. This doesn't change the amount of work required, it just redistributes it over time. Spreading the work more evenly and making it less bursty can be beneficial, but it's not a clear win.

But then I realized you could do the sorting and merging concurrently with building the list. Again, it doesn't change the amount of work, it just allows some of it to be done in parallel. There are various ways to do this. I created a worker goroutine and triggered it via a channel. You could just start a new goroutine each time, but that's a bit more overhead. In Java I'd be thinking about a thread pool, but goroutines are lightweight enough that it's probably not necessary.
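As a sketch of that shape (again not the actual gSuneido code; here the worker just sorts full blocks as they arrive, but it could equally well do the merging):

// ConcurList sorts blocks in the background while values are still
// being added. A single worker goroutine receives full blocks over
// a channel and sorts them.
type ConcurList struct {
	List
	work chan []int
	done chan struct{}
}

func NewConcurList() *ConcurList {
	cl := &ConcurList{work: make(chan []int, 1), done: make(chan struct{})}
	go func() {
		for b := range cl.work {
			sort.Ints(b) // sort (and potentially merge) in the background
		}
		close(cl.done)
	}()
	return cl
}

func (cl *ConcurList) Add(x int) {
	if len(cl.blocks) > 0 && cl.n == blockSize {
		// the current block is full and won't be written again,
		// so it's safe to hand it to the worker
		cl.work <- cl.blocks[len(cl.blocks)-1][:]
	}
	cl.List.Add(x)
}

// Finish sends the final, partially filled block, then waits
// for the worker to drain its queue.
func (cl *ConcurList) Finish() {
	if len(cl.blocks) > 0 {
		cl.work <- cl.blocks[len(cl.blocks)-1][:cl.n]
	}
	close(cl.work)
	<-cl.done
}

The buffered channel gives a little slack, so Add only has to wait if the worker falls behind.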

I ended up writing multiple Go versions so I could benchmark them. The results were pretty much what I expected. The simple slice-and-sort is fast, but does lots of allocation. The block list allocates much less but is slightly slower overall. Merging is slightly faster than a complete sort. The block list with concurrent sort/merge was the winner. It had the low memory usage of the block list and was 10 to 20% faster due to the concurrency.

BenchmarkSimple  30 35164847 ns/op 664934 B/op 23 allocs/op
BenchmarkBlocks  26 43607473 ns/op 131139 B/op  6 allocs/op
BenchmarkMerge   27 37056507 ns/op 197306 B/op 31 allocs/op
BenchmarkConcur  37 28624720 ns/op 197768 B/op 34 allocs/op


This was with a block size of 4096 and adding 16,384 items.

The initial code is on GitHub. (This points to the original commit so the link doesn't break; in the future the latest version is probably a better bet.)
