Sunday, March 31, 2013

String Building Internals

Recently we ran into a problem where a service using cSuneido was running out of memory. I confidently switched the service to use jSuneido, and got stack overflows. So much for my confidence!

We tracked it down to calling Base64.Encode on big strings (e.g. 1mb) (This came from sending email with large attachments.)

Base64.Encode was building the result string a character at a time by concatenating. In many languages this would be a bad idea for big strings. (e.g. In Java you would use StringBuilder instead.) But Suneido does not require you to switch to a different approach for big strings, the implementation is intended to handle it automatically.

Up till now, Suneido has handled this by deferring concatenation of larger strings. If the length of the result is greater than a threshold (64 bytes in cSuneido, 256 bytes in jSuneido) then the result is a "Concat" that simply points to the two pieces. This defers allocating a large result string and copying into it until some operation requires the value of the result. Basically, Suneido makes a linked list of the pieces.

For example, if you built a 1000 character string by adding one character at a time, Suneido would only allocate and copy the result once, not 1000 times. (It still has to create 1000 Concat instances but these are small.)

In practice, this has worked well for many years.

I thought I had implemented the same string handling in jSuneido, but actually I'd taken a shortcut, and that was what caused the stack overflows.

The obvious way to convert the tree of Concat's to a regular string is a recursive approach, first copying the left side of each concat, and then copying the right side, each of which could be either a simple string or another Concat. The problem is that the tree is not balanced. The most common case is appending on the end, which leads to a tree that looks like:

But with many more levels. If your tree has a million levels, then recursing on each side (in particular the left side) will cause stack overflow. In cSuneido I had carefully iterated on the left side and recursed on the right. (Basically doing manual tail-call elimination.) But when I ported the code to Java I had unknowingly simplified to recurse on both sides.

Once I knew what the problem was, it was easy enough to fix, although a little trickier due to using StringBuilder.

Thinking about the problem I realized in this case, there was a better way to write Base64.Encode. I could use the form of string.Replace that takes a function to apply to each match, and simply replace every three characters with the corresponding four characters in Base64. Since string.Replace uses a buffer that doubles in size as required (via StringBuilder in jSuneido), this is quite efficient. (In the process I discovered string.Replace in cSuneido choked on nul's due to remnants of nul terminated string handling. Easy enough to fix.)

But that left the problem of why cSuneido ran out of memory. I soon realized that the whole Concat approach was hugely wasteful of memory, especially when concatenating very small  strings. Each added piece requires one memory block for the string, and another for the Concat. Most heaps have a minimum size for a heap block e.g. 32 bytes. So in the case of appending a single character, it was using 64 bytes. So building a 1 mb string a character at a time would use something like 64 mb. Ouch!

In theory, cSuneido could still handle that, but because it uses conservative garbage collection, spurious references into the tree could end up with large amounts of garbage accumulating.

Maybe my whole Concat scheme wasn't ideal. To be fair, it has worked well up till now, and still works well for moderate string sizes. It just wasn't designed for working with strings that are millions of characters long.

I dreamed up various complicated alternative approaches but I wasn't excited about implementing them. It seemed like there had to be a simpler solution.

Finally, I realized that if you only worried about the most common case of appending on the end, you could use a doubling buffer approach (like StringBuilder). The result of a larger concatenation would be a StrBuf (instead of a concat). Because strings are immutable, when you appended a string onto a StrBuf, it would add to the existing buffer (sharing it) if there was room, or else allocate a new buffer (twice as big).

Theoretically, this is slightly slower than the Concat approach because it has to copy the right hand string. But in the common case of repeatedly appending small strings, it's not bad.

This approach doesn't appear to  be noticeably faster or slower, either in large scale (our entire application test suite) or in micro-benchmarks (creating a large string by repeatedly appending one character). But, as expected, it does appear to use significantly less memory (roughly half as much heap space).

One disadvantage of this approach is that it doesn't help with building strings in reverse order. (i.e. by repeatedly adding to the beginning of a string) That's unfortunate, but in practice I don't think it's too common.

A nice side benefit is that this approach is actually simpler to implement than the old approach. There is no tree, no recursion, and no need for manual tail-call elimination.

Unfortunately, in Java (where I've been working on this) you still have to convert the StringBuilder to a string when required, which makes yet another copy. In cSuneido I should be able to simply create a string wrapping the existing buffer.

Thinking about it some more, I wasn't happy with the extra copying - a lot more than the previous approach. I realized I could combine the two approaches - use a doubling array, but storing references to the pieces rather than the actual characters. This eliminates the copying while still reducing memory usage.

Note: Concats and Pieces cannot be combined because Pieces is shared - appending to a Concats will result in a new Concats that (usually) shares the same Pieces. Concats is effectively immutable, Pieces is mutable and synchronized. (It would be nice if Java let you combine a class and an array to make variable sized objects, but it doesn't.)

However, in the worst case of building a large string one character at a time, this would still end up requiring a huge array, plus all the one character strings. But given the array of strings, it's relatively easy to merge small strings and "compact" the array, reducing the size of the array and the overhead of small strings.

I didn't want to compact too often as this would add more processing overhead. I realized that the obvious time to compact was when the array was full. If there were contiguous small strings that could be merged, then you could avoid growing the array.

I was disappointed to discover that this new approach was about 1/2 as fast as the old approach (when building a 1mb string one character at a time). On the positive side, it resulted in a heap about 1/4 the size (~200mb versus ~800mb). It was a classic space/time tradeoff - I could make it faster by doing less merging, but then it used more memory.

Interestingly, the test suite (a more typical workload) ran about 10% faster. I can't explain that, but I'm not complaining!

It still nagged me that the new approach was slower building huge strings. I tried running it longer (more repetitions) to see if garbage collection would have more of an effect, but that didn't make much difference. Nor did building a 100kb string instead of 1mb (and again doing more repetitions). Then I thought that since my new approach was specifically aimed at building huge strings, maybe I should try increasing the size. Sure enough, building a 10mb string was 4 times faster with the new approach. Finally, the results I was hoping for! Of course, in practice, it's rare to build strings that big.

I initially wrote special code for adding to the beginning of a concatenation and for joining two concatenations. But then I added some instrumentation and discovered that (at least in our test suite) these cases are rare enough (~1%) that they're not worth optimizing. So I removed the extra code - simpler is better.

There are a number of "thresholds" in the code. I played around a bit with different settings but performance was not particularly sensitive to their values.

I probably spent more time on this than was really justified. But it was an interesting problem!

As usual, the code is in Mercurial on SourceForge. Or go directly to this version of Concats.