Sunday, May 05, 2013

Optimizing Tr

Suneido has a string.Tr function, similar to the Unix tr command. Recently, I was looking at the stdlib Base64 code. Decode was using string.Tr to strip any newlines. However, there may not be any newlines. Which made me wonder what Tr did in this case - did it still make a copy of the string? Sure enough, it did.

My first reaction was to add a guard:

if s.Has?('\n')
    s = s.Tr('\n')

Then I started to wonder where else we should be doing this. But that was ugly. It made more sense to build it into Tr.

But implementing it as above would mean doing an extra scan of the source string. Since the code was already scanning, it would be more efficient to just defer copying until found somewhere where you needed to make a change.

I implemented this in the C++ code. It complicated it a little, but not too bad.

Then I tried to implement the same thing in the Java code. Not being able to "cheat" with macros like in the C++ version meant I had to create an instance to share the variables. This defeated part of the purpose (to avoid allocation) and seemed ugly.

Instead I decided to do an initial scan to find the first character to be changed, and then either return or continue from that point. This still avoided redundant scanning, without complicating the main part of the code.

In the process, I discovered there were a few other cases I could short circuit - if the source string is empty, or if the "from" character set is empty then you can just return the source string unchanged. Also, if there are no ranges in the from set or the to set, then you don't need to expand the sets, avoiding more allocation and copying, albeit only on the sets. I also simplified the code a little. (It was a good thing I had tests since I "simplified" a little too aggressively a few times!) See the code.

I also decided to add a cache for expanding sets with ranges. I'm not sure that's justified, but it's similar to what I do with regular expressions. The regular expression code had been using a home-brew LruCache, but I switched to Google Guava's caching facilities. That also allowed time based expiry so I could make the cache bigger without wasting space if it wasn't needed.

Then I went back and revised the C++ tr code using the same approach as the Java code.  For some reason I had originally use std::vector for the sets. I switched to making them gcstring's to avoid making a new set if there are no ranges to expand. I also added a cache, using the existing CacheMap that was being used for regular expressions.

Although it sometimes bugs me to have to update two implementations of Suneido, it does have its benefits. It means if I want to use the same approach I can't take too much advantage of specific language features (e.g. macros). This might mean I can't use some fancy feature, but in many cases that's not the wisest anyway. By the time I've written the code in two different languages, I think the end result is usually better than if I'd just written it once.

Again, I'm guilty of optimizing without real proof of whether it's justified. Tr is fairly heavily used in some areas, and the changes will probably speed up those areas. But whether that makes much difference in the bigger picture is questionable.
An amusing historical aside - I originally ported this code from Software Tools by Kernighan and Plauger, which uses Ratfor (rational Fortran). The code's origins are still visible in some of the names and structure. This book was an inspiration in my early programming days. I still have my original copy, although it's falling apart. It's pretty cool that it's still in print 37 years later.

See also previous posts: String Building Internals and An Amusing Bug

No comments: