Friday, March 13, 2009

String Concatenation

Working on jSuneido, I noticed that I hadn't implemented Suneido's optimizations for concatenating strings.

In most languages this kind of code is grossly inefficient:

s = ""
for each line
  s = s + line // concatenate on the end

That's because each concatenation is allocating a new string, a little bit longer each time. The normal solution is to use a different mechanism, e.g. a StringBuffer in Java.

Suneido optimizes this case by constructing a linked list instead of allocating a new string each time. Later operations then "flatten" the list into a simple string.

The downside is that if you concatenate a lot of little strings the linked list representation uses a lot more memory than a simple string. i.e. we've traded memory for speed

Reading the 2nd edition of Programming Lua, I came across the usual discussion of why it's bad to concatenate like this. But they recommend a different solution - to keep a stack/list of pieces with the invariant that you can't put a bigger piece on top of a smaller piece. Instead you combine them, and if the newly combined piece is bigger than the one below it, you combine them, and so on.

It took me a minute to think this through. If you're adding random sized pieces, then 50% of the time you'll be adding one bigger than the last one, and therefore combining them. The end result, if I'm thinking this through properly, is that, on the average, each piece in the stack/list will be half as big as the one before. So a list of N entries will hold roughly 2 ^ N characters. i.e. you can build a big string with a short stack/list.

The Lua book is suggesting doing this in your application code, but my thought is to replace Suneido's current algorithm with this one. I'd be interested to see how it compares. It will do more concatenation and allocation, but will use less memory when dealing with a lot of pieces. It also does the concatenation more "incrementally" rather than one big concatenation to flatten the list like cSuneido currently does.

No comments: