Sunday, May 05, 2013

Optimizing Tr

Suneido has a string.Tr function, similar to the Unix tr command. Recently, I was looking at the stdlib Base64 code. Decode was using string.Tr to strip any newlines. However, there may not be any newlines, which made me wonder what Tr did in that case - did it still make a copy of the string? Sure enough, it did.

My first reaction was to add a guard:

if s.Has?('\n')
    s = s.Tr('\n')

Then I started to wonder where else we should be adding guards like this. But that was ugly. It made more sense to build the optimization into Tr itself.

But implementing it that way inside Tr would mean doing an extra scan of the source string. Since the code was already scanning, it would be more efficient to defer copying until the scan actually found a character that needed to change.

I implemented this in the C++ code. It complicated the code a little, but not too badly.

Then I tried to implement the same thing in the Java code. Not being able to "cheat" with macros like in the C++ version meant I had to create an instance to share the variables. This defeated part of the purpose (to avoid allocation) and seemed ugly.

Instead I decided to do an initial scan to find the first character to be changed, and then either return or continue from that point. This still avoided redundant scanning, without complicating the main part of the code.
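Roughly, the approach looks like this - a minimal Java sketch of the idea, not the actual jSuneido Tr code (the names are illustrative, the sets are assumed to be already expanded, and "all but" sets are ignored):

// simplified sketch of the "scan first, copy lazily" idea
static String tr(String src, String fromset, String toset) {
    // initial scan: find the first character that would actually change
    int i = 0;
    while (i < src.length() && fromset.indexOf(src.charAt(i)) == -1)
        ++i;
    if (i == src.length())
        return src; // nothing to translate - return the original, no copy

    // only now allocate: copy the untouched prefix and carry on from there
    StringBuilder sb = new StringBuilder(src.length());
    sb.append(src, 0, i);
    for (; i < src.length(); ++i) {
        char c = src.charAt(i);
        int p = fromset.indexOf(c);
        if (p == -1)
            sb.append(c); // not in the from set - copy unchanged
        else if (p < toset.length())
            sb.append(toset.charAt(p)); // translate
        // else: in the from set but past the end of the to set - delete it
    }
    return sb.toString();
}

The initial scan costs nothing extra, since the main loop would have had to look at those characters anyway, and the common no-change case never allocates.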

In the process, I discovered there were a few other cases I could short circuit - if the source string is empty, or if the "from" character set is empty, then you can just return the source string unchanged. Also, if there are no ranges in the from set or the to set, then you don't need to expand the sets, avoiding more allocation and copying, albeit only on the sets. I also simplified the code a little. (It was a good thing I had tests since I "simplified" a little too aggressively a few times!) See the code.
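The short circuits amount to a few cheap checks at the top of the same sketch, plus only expanding a set when it actually contains a range - something along these lines (again, made-up names, not the actual code):

static String tr(String src, String fromsrc, String tosrc) {
    if (src.isEmpty() || fromsrc.isEmpty())
        return src; // empty source or empty "from" set - nothing can change
    String fromset = expandRanges(fromsrc);
    String toset = expandRanges(tosrc);
    // ... initial scan and translation as sketched above ...
    return src;
}

// only build a new set when there is actually a range (e.g. "a-z") to expand
static String expandRanges(String set) {
    if (set.indexOf('-') == -1)
        return set; // no ranges - no new allocation, no copying
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < set.length(); ++i) {
        char c = set.charAt(i);
        if (c == '-' && i > 0 && i + 1 < set.length()) {
            // the range's start character was appended on the previous pass;
            // fill in the rest of the range, then skip its end character
            for (char r = (char) (set.charAt(i - 1) + 1); r <= set.charAt(i + 1); ++r)
                sb.append(r);
            ++i;
        } else
            sb.append(c);
    }
    return sb.toString();
}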

I also decided to add a cache for expanding sets with ranges. I'm not sure that's justified, but it's similar to what I do with regular expressions. The regular expression code had been using a home-brew LruCache, but I switched to Google Guava's caching facilities. That also allowed time-based expiry, so I could make the cache bigger without wasting space if it wasn't needed.
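With Guava this only takes a few lines - something like the following sketch, where the size and expiry values are made up for illustration rather than being the actual settings:

import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

class TrCache {
    // cache expanded character sets so ranges like "a-z" aren't re-expanded
    // on every call; unused entries eventually expire
    static final LoadingCache<String, String> expanded = CacheBuilder.newBuilder()
            .maximumSize(1000)
            .expireAfterAccess(10, TimeUnit.MINUTES)
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String set) {
                    return expandRanges(set);
                }
            });

    static String expand(String set) {
        // skip the cache entirely when there are no ranges to expand
        return set.indexOf('-') == -1 ? set : expanded.getUnchecked(set);
    }

    static String expandRanges(String set) {
        return set; // placeholder - see the earlier sketch
    }
}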

Then I went back and revised the C++ tr code using the same approach as the Java code. For some reason I had originally used std::vector for the sets. I switched to making them gcstring's to avoid making a new set if there are no ranges to expand. I also added a cache, using the existing CacheMap that was being used for regular expressions.

Although it sometimes bugs me to have to update two implementations of Suneido, it does have its benefits. It means that if I want to use the same approach in both, I can't take too much advantage of language-specific features (e.g. macros). That might mean I can't use some fancy feature, but in many cases that's not the wisest choice anyway. By the time I've written the code in two different languages, I think the end result is usually better than if I'd just written it once.

Again, I'm guilty of optimizing without real proof of whether it's justified. Tr is fairly heavily used in some areas, and the changes will probably speed up those areas. But whether that makes much difference in the bigger picture is questionable.

An amusing historical aside - I originally ported this code from Software Tools by Kernighan and Plauger, which uses Ratfor (rational Fortran). The code's origins are still visible in some of the names and structure. This book was an inspiration in my early programming days. I still have my original copy, although it's falling apart. It's pretty cool that it's still in print 37 years later.

See also previous posts: String Building Internals and An Amusing Bug

Friday, May 03, 2013

An Amusing Bug

I recently realized that the block form of Suneido's string.Replace was a more efficient way to "map" over strings.

As I mentioned in my recent post on String Building Internals, I also discovered cSuneido's string.Replace didn't handle nul's. I fixed this problem and we sent out the new suneido.exe to our beta customers.

And we started to get support calls that PDF attachments weren't being sent properly. Sure enough it was the new version of Base64.Encode that used string.Replace.

But I had tests, and we had also tested manually, and it worked fine. Then we got one of the problem files from a customer and, sure enough, it failed.

Digging into it, I could see that it was only encoding part of the file. That seemed a bit like the previous nul problem, which led me in the wrong direction for a while.

More testing revealed it was encoding the first 39996 characters of the file. That seemed like an odd number. My first thought was that it was in the general vicinity of SHORT_MAX or 32767. When I first wrote the C++ version of Suneido I was still trying to use short int's when possible. This has led to a number of issues since SHORT_MAX isn't very big in modern terms. But the relevant code wasn't using any short int's.

Then I noticed a magic number of 9999 in the code. Base64 encoding outputs groups of 4 characters. 4 x 9999 = 39996. Aha!

string.Replace takes an optional argument for how many replacements to do. Usually, you either want 1 or all. When not specified, the count was defaulting to 9999. For "normal" usage, that's plenty. But when using replace to map large strings, it obviously isn't.
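To make the arithmetic concrete, the effect is something like this made-up Java illustration (not the actual cSuneido code) of a replace that stops after a fixed count:

// made-up illustration: a replace that stops after `count` replacements
static String replaceFirstN(String s, String from, String to, int count) {
    StringBuilder sb = new StringBuilder();
    int i = 0;
    for (int done = 0; done < count; ++done) {
        int j = s.indexOf(from, i);
        if (j == -1)
            break;
        sb.append(s, i, j).append(to);
        i = j + from.length();
    }
    sb.append(s, i, s.length()); // whatever is left is passed through untouched
    return sb.toString();
}

With count defaulting to 9999, a "map" over a big string quietly stops after 9999 replacements - for Base64, 4 x 9999 = 39996 characters of encoded output - whereas INT_MAX effectively means "replace them all".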

I changed it to INT_MAX and that fixed the problem. Out of curiosity, I went and checked the jSuneido code. Frustratingly, it didn't have the same issue - it already used the equivalent of INT_MAX. I don't think I foresaw this issue when I ported the code; it probably just bugged me to have a magic number.

I'm not sure why this strikes me as amusing. It just seems funny that the "bug" was not something obscure, just a badly chosen limit that no doubt seemed entirely reasonable at the time.