Thursday, April 17, 2014

Overlooking the Simple Solution

For no particular reason I've been thinking about concatenating strings. (I know, get a life, but this is at least part of my life.) It was partly prompted by thinking about Go and how it might work to implement different facets of Suneido. 

Most languages warn you about the poor performance of building a large string by repeated concatenation. They recommend using a StringBuilder (Java) or its equivalent. 

But Suneido explicitly optimizes repeated concatenation internally so the programmer doesn't have to worry about when to concatenate and when to switch to "building". 

In cSuneido (the original C++ implementation) concatenation just creates a linked list of substrings, deferring the actual allocation and copying. 

Originally, I ported that approach to jSuneido (the Java version). But I cut a few corners that I thought were safe to cut. That came back to haunt me. Rather than fix the problems I looked for a better solution. (There are some issues when the linked list gets too big.) I considered some kind of merge tree but decided that was more complex than necessary.

Instead of a linked list I used an array of pieces which was expanded as needed. If the array got big it would merge small pieces. That has been working fine. 

Analyzing the problem from a more theoretical basis I figured the worst approach is repeated naive concatenation and the best (?) is something like StringBuilder that uses a buffer that expands e.g. by doubling in size. My array approach is somewhere in between, as is a merge tree. 

At that point it struck me that I could just use a StringBuilder rather than my array approach. Duh!

That eliminated about a hundred lines of code and ran about 20% faster. 

I feel stupid for not thinking of this sooner. It seems so obvious (now!) But I was stuck on the idea of deferring concatenating. 

And now I feel even more stupid because I just searched my own blog and found that I did consider a buffer approach but decided it required too much copying. (String Building Internals) Using a StringBuilder does mean copying into it and then eventually copying the result back out. But considering that Java compiles string concatenation into using StringBuilder, the overhead can't be too big. (Go, on the other hand, compiles string concatenation into calls to a runtime function that allocates a new string and copies directly into it, without any intermediate buffer.)
One advantage of my original linked list approach is that everything is immutable and therefore threadsafe without locking. That's attractive. Both the array and the StringBuilder are mutable and require locking. That's not a big deal on Java since the locks will almost always be uncontested and therefore very fast. And the locking is at the leaves of the call tree so they should be safe from issues like deadlock. 

But in Go, locking is less acceptable and an immutable solution would be nice. I have an idea for an immutable merge tree approach - stay tuned :-)

No comments: