Wednesday, March 11, 2020

Go Bug

Go 1.14 came out recently. I updated without really thinking about it, expecting a painless experience like all the other versions. I rebuilt gSuneido (the Go version of Suneido), ran it, and ... it crashed, with an unfamiliar internal Go error.

At first I just thought the build hadn't worked properly or I had a mix of versions. Or maybe Go hadn't installed quite right.

But re-installing and re-building didn't solve the problem. It still crashed. Annoyingly, all my tests passed. On the positive side (at least in terms of debugging) it crashed consistently, every time, the same way. And if I built exactly the same source code with the previous version (1.13.8) it worked fine.

The crash call stack had some references to defer. (Go's equivalent to try-finally.) And one of the big improvements in Go 1.14 was to ... defer. That was very suspicious, but I wasn't quite ready to blame the crash on Go yet.

One big question was "why me?". (In this kind of situation it is actually a valid question.) If it was a bug in Go why weren't other people running into it? I'm not that much of a unique snowflake. Yet, despite searching the Go issues and the web in general, I couldn't find anyone reporting a similar issue. However, it was possible that I was (ab)using panic/defer/recover in a way other people aren't. For better or worse, the Suneido language has exceptions (like Java and C++). And the easiest way for me to implement exceptions was to use panic/defer/recover. Go strongly discourages this kind of thing. I'm pretty sure Go true believers would regard what I was doing as an abomination. (Another reason I was reluctant to submit a bug report.)

The win32 interface in gSuneido uses unsafe and C code, so that was the first thing to be suspicious of. It was quite possible there was a bug in that code that hadn't surfaced before. Or perhaps an unintentional dependency on some internal feature that had inadvertently changed in 1.14.

Thankfully, it didn't take long to recreate the problem without running any of the win32 interface. Which also allowed me to recreate it on Mac OS X.

The issue seemed to be related to defer, recover, and re-panic'ing. (equivalent to catching and re-throwing an exception in other languages) But strangely it seemed to require one sequence of defer/recover/panic first, and then a second sequence would crash. It was as if the first sequence was corrupting something or leaving something behind that caused the second sequence to crash. 

At this point, after several days of debugging, I was starting to lean towards a bug in Go 1.14. I was hesitant to submit a bug report because I didn't have a small example to reproduce the problem. So based on what I'd discovered so far I tried to create a standalone test case. In retrospect, I think I was close, but none of my test programs would crash.

I decided to submit the bug anyway. At least if anyone else ran into it, there would be something to find when they searched. And maybe someone would have some suggestions on how to debug it further.

crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

I got some responses right away which was nice. One of the first suggestions was to try building with -gcflags=-N which turns off various optimizations, including the new defer handling. Confusingly, it still crashed. Which seemed to indicate it wasn't the defer changes. Luckily at that point the author of the defer changes, Dan Scales, offered to help debug it. I was able to come up with a package of files to allow him to recreate the crash. It wasn't a small example, it was the whole Suneido source code, but at least he could reproduce the crash. It turned out that -gcflags=-N did actually fix the problem. The catch was that setting that flag didn't force recompilation, so I was still running old code. I should have used -gcflags="all=-N". (These "internal" options aren't well documented.)

So Dan confirmed that the bug was in was his changes, and he quickly came up with a fix.
He also suggested a workaround. The defer optimizations didn't kick in if the defer was in a loop. So I could put a one iteration loop around the relevant defer's. (That seems a little odd - I would have thought the compiler would optimize away one iteration loops. Maybe that happens in a later stage.) The question was which defer's. At first I only added the workaround to defer's that re-panic'ed. But it still crashed sometimes (albeit less often). I ended up adding the workaround to all the defer's with recovers and that seemed to fix the problem.

One of the problems is that this is happening in gSuneido's byte code interpreter. So it's impossible to statically determine which code might call which code, which is what's critical to this bug.

I was hoping that the fix would be cherry picked and back ported to 1.14.1 rather than having to wait for 1.15 which is six months away. Thankfully, it looks like that is going to happen.

I was impressed by the response of the Go team/community. They took my bug report seriously and gave me good feedback. No one hassled me about my lack of a small example or my abuse of panic/defer/recover. And the problem was rapidly tracked down and fixed. You can't ask for much more than that. (I also had a good experience with the other bug I've reported and was fixed - 35492)