Tuesday, January 17, 2017

Bug Tale

Yesterday, towards the end of the day, one of my staff came to me with a bug in cSuneido. Actually it was the second time he'd brought it to me. The first time I sent him away with a request for more details. This time he brought me a printed screen shot of the error. Perhaps that seemed more tangible.

Sometimes I get the impression that my staff gets a certain amount of secret enjoyment out of pointing out my bugs. Some of that's probably just my imagination and frustration. But even if it's true, it's understandable. They have to put up with me critiquing their code all the time.

I wasn't happy about it. It had been a busy day with the usual parade of interruptions. I didn't feel like I'd accomplished anything, and now this steaming turd had been deposited on my desk. I started to work on the bug, although I was pretty sure there wouldn't be enough time left in the day to get anywhere on it. The only advantage of starting late was that the interruptions had died out now that most people had left for the day.

On the positive side, the bug was repeatable. On the negative side it wasn't simple to reproduce - you had to run client-server, open our application, and then run a bunch of queries for over a minute. After thinking I'd fixed it several times I realized that it also only happened the first time you did this after starting the client. Subsequent runs worked fine.

The error itself was an access violation. The more I work in "safe" languages like Java or JavaScript or Go, the more I hate unsafe languages like C++. An access violation could be anything - a dangling pointer, a garbage collection issue, an uninitialized variable, an invalidated reference ...

On top of this, the error occurred inside a background fiber. Even better, it didn't occur when I ran the debug version of cSuneido.

As I expected, I didn't get anywhere before I headed home for the day, pissed off.

After supper I debated whether to work on it some more. If I could make a little progress, even just get a clue or narrow it down, then I'd end the day feeling a lot better. On the other hand, if I just ended up banging my head on the wall, I'd probably feel even worse.

I took the gamble, of course. Programmers are nothing if not eternal optimists. They have to be. But I hedged my bet by not "getting serious" and firing up the big iMac. I just sat in the living room with my laptop. That way if I failed I could tell myself it was because I'd just been poking around.

I didn't find the bug, but I did narrow it down enough that I felt I was hot on the trail. I could recreate the problem with a much simpler test case, and I'd found that I could recreate it in the debugger as long as I used the release build. It's harder to debug in the optimized version but being able to use the debugger at all was a big help.

It turned out the only significance of the queries running for over a minute was that during that minute several timer tasks got queued and ran concurrently when the queries ended. I could get the same result by just starting two fibers "at the same time".

Thankfully, the next day I was working at home and could focus on the problem. It was quite puzzling at first. I could see (in the debugger) exactly where the problem was, but the code looked correct, and almost all the time it worked correctly. I even resorted to looking at the assembly language and register contents, something I haven't done for a long time.

Stepping through the code I found there was a fiber context switch in the middle of the problem lines. And from looking at the assembler it was pretty obvious the compiler was caching the results of some common subexpressions, which it probably wasn't doing in the debug version. But I couldn't see how that caused a problem.

With fibers being cooperative and not pre-emptive, you don't really need to worry about concurrency issues. But this turned out to be a concurrency issue after all.

The problem lines were:

tls().thedbms = dbms_remote_async(server_ip);

tls() was getting evaluated and cached. But if dbms_remote_async "blocked" waiting to connect, then another fiber would run, and if that other fiber created a new fiber, and this caused the fibers vector to grow (reallocate), then the cached value of tls() would be invalid, causing the access violation.

Sure enough, if I called reserve on the vector to pre-grow it, then the problem went away.

It only happened the first time because after that the vector wouldn't need to grow and the tls() reference would stay valid.

I was grateful that the problem was so repeatable. If this had been real threads it would have been much more erratic. One of the advantages of fibers is that they are deterministic.

One local fix was to rewrite it as:

auto dbms = dbms_remote_async(server_ip);
tls().thedbms = dbms;

But where else in the code did I have this potential problem? And what would stop me from reintroducing the problem in future code.

My next thought was that I needed to tell the compiler that tls() was "volatile", i.e. it could change concurrently. But that wasn't really the problem. Even in single threaded code inserting into a vector invalidates any references, that's part of its contract.

One option was to use Windows fiber local storage. This didn't exist back when I rolled my own.

Another option was to dynamically allocate the tls structure so it didn't reside inside the vector.

However, there could potentially be other reference into the vector. I'd had problems with this in the past and "solved" them by using indexes into the vector rather than pointers. But it was still something I had to continually be on the lookout for.

Instead, I decided to switch from a vector to a simple fixed size static array. cSuneido isn't designed for huge numbers of fibers anyway. References into a fixed size static array were safe, nothing can invalidate them.

Problem solved (fingers crossed) and my mood has distinctly improved :-)

If you're interested in the nitty gritty, the change is on Github

1 comment:

Larry Reid said...

Programming is hard! Especially concurrent programming. One can understand why Ruby MRI is single-threaded. I'm amazed at how fast you can track down problems like this, although at this point I'm not surprised anymore.