Monday, July 07, 2025

Super Instructions

The overhead of an interpreter depends on how much work each instruction does relative to the cost of dispatching it. So one of the ways to reduce the overhead is to make each instruction do more work. One way to do that is “super instructions”: combining several instructions into one. This caught my interest and I decided to see whether it would help with Suneido. The Suneido byte code instructions themselves are mostly simple, which means a higher relative overhead, but most of the time is spent in built-in functions like database queries, which means a lower one.
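To make that concrete, here is a minimal sketch of a superinstruction in a switch-based stack interpreter: a Load (push a local) followed by a Value (push a constant) fused into one LoadValue instruction, so the pair costs a single dispatch. The types and layout are assumptions for illustration, not how gSuneido's interpreter is actually structured.

    package main

    import "fmt"

    type opcode byte

    const (
        opLoad      opcode = iota // push local[operand]
        opValue                   // push const[operand]
        opAdd                     // pop two, push sum
        opLoadValue               // superinstruction: Load then Value
    )

    type frame struct {
        code   []byte
        locals []int
        consts []int
        stack  []int
    }

    func (f *frame) run() {
        for ip := 0; ip < len(f.code); {
            switch opcode(f.code[ip]) {
            case opLoad:
                f.stack = append(f.stack, f.locals[f.code[ip+1]])
                ip += 2
            case opValue:
                f.stack = append(f.stack, f.consts[f.code[ip+1]])
                ip += 2
            case opAdd:
                n := len(f.stack)
                f.stack[n-2] += f.stack[n-1]
                f.stack = f.stack[:n-1]
                ip++
            case opLoadValue:
                // same work as Load followed by Value, but one trip
                // around the dispatch loop instead of two
                f.stack = append(f.stack,
                    f.locals[f.code[ip+1]], f.consts[f.code[ip+2]])
                ip += 3
            }
        }
    }

    func main() {
        // x + 1 where x (local 0) is 2: LoadValue 0 0, Add
        f := &frame{
            code:   []byte{byte(opLoadValue), 0, 0, byte(opAdd)},
            locals: []int{2},
            consts: []int{1},
        }
        f.run()
        fmt.Println(f.stack) // [3]
    }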

The first step was to determine whether there were combinations of instructions common enough to justify optimizing them. I added some quick temporary instrumentation to the interpreter. First I looked at sequences of two instructions. The most common pair accounted for roughly 7% of the instructions executed, and the top 10 made up 40%, which was actually higher than I expected. I also looked at sequences of three and four instructions, but the most common of those was only about 3%, so I decided to keep it simple and stick to two-instruction sequences.

   %    op1     op2
 7.0    Load    Value
 6.7    Value   CallMethNoNil
 5.2    Value   Get
 4.8    Load    Load
 4.5    This    Value
 2.7    Store   Pop
 2.5    This    Load
 2.3    Get     Value
 2.2    Pop     Load
 2.2    Global  CallFuncNoNil
40.1    Total
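The instrumentation behind those numbers can be very simple: a counter per pair of adjacent opcodes, bumped once per dispatch, then dumped and sorted after running a representative workload. A rough sketch of that approach; the names and the opcode type are placeholders, not gSuneido's actual ones.

    package interp

    import (
        "fmt"
        "sort"
    )

    type opcode byte

    // Temporary instrumentation: count how often each adjacent pair of
    // opcodes is executed.
    var pairCounts = map[[2]opcode]int{}
    var prevOp opcode

    // countPair is called once per dispatch from the interpreter loop.
    func countPair(op opcode) {
        pairCounts[[2]opcode{prevOp, op}]++
        prevOp = op
    }

    // dumpPairs prints the pairs, most frequent first, as a percentage
    // of all dispatches.
    func dumpPairs() {
        type pc struct {
            ops [2]opcode
            n   int
        }
        var pairs []pc
        total := 0
        for ops, n := range pairCounts {
            pairs = append(pairs, pc{ops, n})
            total += n
        }
        sort.Slice(pairs, func(i, j int) bool { return pairs[i].n > pairs[j].n })
        for _, p := range pairs {
            // a real version would print opcode names rather than numbers
            fmt.Printf("%4.1f  %d %d\n",
                100*float64(p.n)/float64(total), p.ops[0], p.ops[1])
        }
    }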

The compiler code generator funnels through a single emit function, so it was easy to modify that to detect the sequences and replace them with combined instructions. And it was easy to add the new instructions to the interpreter itself. If I'd stopped there, it would have been a quick and easy one-day project. But, of course, I had to benchmark it. At first that went well: micro benchmarks showed clear improvement, as I expected. But running our application test suite showed a slight slowdown of 1 to 2%. That puzzled me.
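Roughly the shape of that change, as a sketch: a peephole check in emit that rewrites the previously emitted opcode when the new one completes a known pair. The names and byte code layout here are assumptions, and a real version also has to avoid fusing across jump targets.

    package codegen

    type opcode byte

    const (
        opLoad opcode = iota
        opValue
        opLoadValue // fused Load + Value
        // ... the other superinstructions from the table above
    )

    // fusions maps a pair of opcodes to its combined superinstruction.
    var fusions = map[[2]opcode]opcode{
        {opLoad, opValue}: opLoadValue,
    }

    type generator struct {
        code   []byte
        lastOp int // index of the most recently emitted opcode
    }

    // emit appends an opcode and its operands, fusing with the previous
    // opcode when the pair has a superinstruction.
    func (g *generator) emit(op opcode, operands ...byte) {
        if len(g.code) > 0 {
            prev := opcode(g.code[g.lastOp])
            if fused, ok := fusions[[2]opcode{prev, op}]; ok {
                // rewrite the previous opcode in place; the operands of
                // both instructions stay where they are
                g.code[g.lastOp] = byte(fused)
                g.code = append(g.code, operands...)
                return
            }
        }
        g.lastOp = len(g.code)
        g.code = append(g.code, byte(op))
        g.code = append(g.code, operands...)
    }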

One possibility was that it had slowed down the compiler, since running the test suite requires compiling all the code. The changes did slow down compiling slightly, but compiling is fast enough that it doesn't even show up in profiling. I optimized the code slightly, but that wasn't the problem.

That left the interpreter. gSuneido uses a giant switch to dispatch instructions; Go doesn't allow some of the other dispatching methods (such as computed goto) that you could use in C or C++. I had added 10 cases to the switch. Did that slow it down? How did Go implement switches? Searches didn't turn up a lot, but it sounded like it used a binary search, and running the code in the debugger confirmed this. Maybe adding more cases had added another level to the binary search? I tried rewriting the switch as a hand-written binary search using if-else, optimized for the most frequent instructions. That quickly turned into an ugly mess.

More searching found Go issue 5496 to implement switches with jump tables, and the issue appeared to be completed. I used Compiler Explorer to test and sure enough, Go compiled dense integer switches as jump tables. But not my interpreter switch??? I tried simplifying the switch, ordering the cases, removing fallthrough, etc. But the debugger still showed a binary search. I found the implementation in the Go compiler source code. It seemed straightforward; there were no special requirements. I tried one of my simple examples from Compiler Explorer inside my gSuneido project. That didn't show as a jump table in the debugger either. What the heck!
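This is the kind of function I was testing in Compiler Explorer. A dense integer switch like the one below is what issue 5496 covers, and with optimization enabled, recent Go compilers emit a jump table for it rather than a binary search. It is only an illustration, not gSuneido's actual dispatch switch.

    package vm

    // A dense integer switch of the kind the compiler can lower to a
    // jump table when optimization is enabled.
    func opName(op byte) string {
        switch op {
        case 0:
            return "Load"
        case 1:
            return "Value"
        case 2:
            return "Store"
        case 3:
            return "Pop"
        case 4:
            return "Get"
        case 5:
            return "This"
        case 6:
            return "Global"
        case 7:
            return "CallFuncNoNil"
        case 8:
            return "CallMethNoNil"
        default:
            return "unknown"
        }
    }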

Back to the Go compiler source code. There was a condition on base.Flag.N. What was that? Oh crap. That's the -N command-line option that disables optimization, which is what debugger builds use. So whenever I was using the debugger to look at the assembly language, I was seeing the unoptimized version. A little digging and I figured out how to use gdb to disassemble the production build, and found it had been using a jump table all along. Argh!

Back to the original question: why was it slower? Maybe the extra code in the interpreter was affecting inlining? I looked at the compiler's inlining diagnostics, but everything still seemed to be getting inlined correctly.

Looking at the CPU profile, the interpreter time decreased as I'd expect. The only noticeable increase was in garbage collector time. But the changes don't do any allocation; if anything they slightly decrease the size of the compiled code. The allocation from the test suite was slightly higher, but the memory profile didn't show anything.
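For context, the profiles here are Go's standard pprof profiles. A bare-bones way to capture a CPU profile around a workload looks roughly like this (the file name and runWorkload are placeholders), and the result is then examined with go tool pprof.

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // profile everything between Start and Stop
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        runWorkload() // placeholder for e.g. running the test suite
    }

    func runWorkload() {}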

In the end, I gave up. I can't figure out why the overall test suite seems slightly slower. There doesn't seem to be any evidence that it's a result of the changes. Everything else shows an improvement, so I'm going to keep it. I can only guess that the slowdown comes from perturbing the code in ways that affect layout, caching, branch prediction, or timing. Modern CPUs are so complex that they are somewhat non-deterministic.

Code is on GitHub as usual.
