Tuesday, January 17, 2017

Bug Tale

Yesterday, towards the end of the day, one of my staff came to me with a bug in cSuneido. Actually it was the second time he'd brought it to me. The first time I sent him away with a request for more details. This time he brought me a printed screen shot of the error. Perhaps that seemed more tangible.

Sometimes I get the impression that my staff gets a certain amount of secret enjoyment out of pointing out my bugs. Some of that's probably just my imagination and frustration. But even if it's true, it's understandable. They have to put up with me critiquing their code all the time.

I wasn't happy about it. It had been a busy day with the usual parade of interruptions. I didn't feel like I'd accomplished anything, and now this steaming turd had been deposited on my desk. I started to work on the bug, although I was pretty sure there wouldn't be enough time left in the day to get anywhere on it. The only advantage of starting late was that the interruptions had died out now that most people had left for the day.

On the positive side, the bug was repeatable. On the negative side it wasn't simple to reproduce - you had to run client-server, open our application, and then run a bunch of queries for over a minute. After thinking I'd fixed it several times I realized that it also only happened the first time you did this after starting the client. Subsequent runs worked fine.

The error itself was an access violation. The more I work in "safe" languages like Java or JavaScript or Go, the more I hate unsafe languages like C++. An access violation could be anything - a dangling pointer, a garbage collection issue, an uninitialized variable, an invalidated reference ...

On top of this, the error occurred inside a background fiber. Even better, it didn't occur when I ran the debug version of cSuneido.

As I expected, I didn't get anywhere before I headed home for the day, pissed off.

After supper I debated whether to work on it some more. If I could make a little progress, even just get a clue or narrow it down, then I'd end the day feeling a lot better. On the other hand, if I just ended up banging my head on the wall, I'd probably feel even worse.

I took the gamble, of course. Programmers are nothing if not eternal optimists. They have to be. But I hedged my bet by not "getting serious" and firing up the big iMac. I just sat in the living room with my laptop. That way if I failed I could tell myself it was because I'd just been poking around.

I didn't find the bug, but I did narrow it down enough that I felt I was hot on the trail. I could recreate the problem with a much simpler test case, and I'd found that I could recreate it in the debugger as long as I used the release build. It's harder to debug in the optimized version but being able to use the debugger at all was a big help.

It turned out the only significance of the queries running for over a minute was that during that minute several timer tasks got queued and ran concurrently when the queries ended. I could get the same result by just starting two fibers "at the same time".

Thankfully, the next day I was working at home and could focus on the problem. It was quite puzzling at first. I could see (in the debugger) exactly where the problem was, but the code looked correct, and almost all the time it worked correctly. I even resorted to looking at the assembly language and register contents, something I haven't done for a long time.

Stepping through the code I found there was a fiber context switch in the middle of the problem lines. And from looking at the assembler it was pretty obvious the compiler was caching the results of some common subexpressions, which it probably wasn't doing in the debug version. But I couldn't see how that caused a problem.

With fibers being cooperative and not pre-emptive, you don't really need to worry about concurrency issues. But this turned out to be a concurrency issue after all.

The problem lines were:

tls().thedbms = dbms_remote_async(server_ip);

tls() was getting evaluated and cached. But if dbms_remote_async "blocked" waiting to connect, then another fiber would run, and if that other fiber created a new fiber, and this caused the fibers vector to grow (reallocate), then the cached value of tls() would be invalid, causing the access violation.

Sure enough, if I called reserve on the vector to pre-grow it, then the problem went away.

It only happened the first time because after that the vector wouldn't need to grow and the tls() reference would stay valid.
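
The effect is easy to see in isolation. Here is a minimal sketch, not the cSuneido code: `Fiber`, `first_element_moves`, and `first_element_moves_with_reserve` are hypothetical names made up for illustration. It records the address of the first element as an integer, so the stale pointer is never dereferenced, grows the vector past its capacity, and checks whether the element moved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-fiber state kept in the fibers vector.
struct Fiber {
    int thedbms = 0;
};

// Returns true if element 0 sits at a different address after the vector
// has grown past its current capacity. The old address is kept only as an
// integer, so no dangling pointer is ever dereferenced.
bool first_element_moves(std::vector<Fiber>& fibers) {
    assert(!fibers.empty());
    const auto before = reinterpret_cast<std::uintptr_t>(&fibers[0]);
    const std::size_t cap = fibers.capacity();
    while (fibers.size() <= cap)   // force at least one reallocation
        fibers.emplace_back();
    const auto after = reinterpret_cast<std::uintptr_t>(&fibers[0]);
    return before != after;
}

// Same experiment, but with room reserved up front so that adding
// `extra` elements never reallocates.
bool first_element_moves_with_reserve(std::size_t extra) {
    std::vector<Fiber> fibers(1);
    fibers.reserve(1 + extra);
    const auto before = reinterpret_cast<std::uintptr_t>(&fibers[0]);
    for (std::size_t i = 0; i < extra; ++i)
        fibers.emplace_back();
    const auto after = reinterpret_cast<std::uintptr_t>(&fibers[0]);
    return before != after;
}
```

Without the reserve, the first element ends up somewhere else, which is exactly what a cached tls() reference would trip over. With the reserve, it stays put for as long as the reserved capacity lasts.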

I was grateful that the problem was so repeatable. If this had been real threads it would have been much more erratic. One of the advantages of fibers is that they are deterministic.

One local fix was to rewrite it as:

auto dbms = dbms_remote_async(server_ip);
tls().thedbms = dbms;
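
To make the ordering explicit, here is a small self-contained sketch of the same pattern. Everything in it is assumed for illustration: `fibers`, `current`, `connect_and_grow` (a stand-in for dbms_remote_async that grows the vector as if another fiber had been created while this one was blocked), and the arbitrary result value 42.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Tls {
    int thedbms = 0;
};

// Hypothetical fiber table: one Tls per fiber, stored by value.
static std::vector<Tls> fibers(1);
static std::size_t current = 0;

// Returns a reference INTO the vector, so it is only valid until
// the vector next reallocates.
Tls& tls() {
    assert(current < fibers.size());
    return fibers[current];
}

// Stand-in for the blocking call: while "blocked", the vector grows
// past its capacity, which invalidates earlier tls() references.
int connect_and_grow() {
    const std::size_t cap = fibers.capacity();
    while (fibers.size() <= cap)
        fibers.emplace_back();
    return 42;  // arbitrary value standing in for the connection
}

int safe_assign() {
    auto dbms = connect_and_grow();  // anything that can grow the vector runs first
    tls().thedbms = dbms;            // tls() is evaluated only afterwards
    return tls().thedbms;
}
```

As far as I know, C++17 sequences the right-hand side of an assignment before the left-hand side, so a compiler in that mode should not evaluate tls() early in the one-line version. The two-statement form does not depend on that.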

But where else in the code did I have this potential problem? And what would stop me from reintroducing the problem in future code?

My next thought was that I needed to tell the compiler that tls() was "volatile", i.e. it could change concurrently. But that wasn't really the problem. Even in single threaded code, inserting into a vector can invalidate references to its elements (it does so whenever the vector reallocates). That's part of its contract.

One option was to use Windows fiber local storage. This didn't exist back when I rolled my own.

Another option was to dynamically allocate the tls structure so it didn't reside inside the vector.
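
A rough sketch of that option, using hypothetical names (`Fiber`, `Tls`, `tls_address_survives_growth`) rather than the real cSuneido types: the vector holds only an owning pointer per fiber, so growing it moves the pointers but not the structures they point to.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Tls {
    int thedbms = 0;
};

// Each fiber owns its Tls on the heap. Reallocating the vector moves
// the unique_ptr, not the Tls object itself.
struct Fiber {
    std::unique_ptr<Tls> tls = std::make_unique<Tls>();
};

// Returns true if a pointer to fiber 0's Tls is still valid, and still
// holds its value, after the vector has been forced to reallocate.
bool tls_address_survives_growth() {
    std::vector<Fiber> fibers(1);
    Tls* cached = fibers[0].tls.get();
    assert(cached != nullptr);
    cached->thedbms = 7;

    const std::size_t cap = fibers.capacity();
    while (fibers.size() <= cap)   // force at least one reallocation
        fibers.emplace_back();

    return cached == fibers[0].tls.get() && cached->thedbms == 7;
}
```

The cost is one extra allocation and one extra indirection per fiber, and it only protects the tls structure, not other references into the vector.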

However, there could potentially be other references into the vector. I'd had problems with this in the past and "solved" them by using indexes into the vector rather than pointers. But it was still something I had to continually be on the lookout for.

Instead, I decided to switch from a vector to a simple fixed size static array. cSuneido isn't designed for huge numbers of fibers anyway. References into a fixed size static array are safe; nothing can invalidate them.
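
For illustration only, here is roughly what a fixed size table looks like. This is a sketch under assumptions, not the actual change: `FiberTable`, `MAX_FIBERS`, and the `in_use` flag are invented names, and a `std::array` member stands in for the static array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Fiber {
    bool in_use = false;
    int thedbms = 0;
};

constexpr std::size_t MAX_FIBERS = 32;  // hypothetical limit

class FiberTable {
public:
    // Hands out a free slot, or nullptr when the table is full.
    // Slots never move, so the returned pointer stays valid until release().
    Fiber* create() {
        for (auto& f : slots_) {
            if (!f.in_use) {
                f = Fiber{};      // reset any state left by a previous fiber
                f.in_use = true;
                return &f;
            }
        }
        return nullptr;
    }

    // Marks a slot as free so a later create() can reuse it.
    void release(Fiber* f) {
        assert(f != nullptr);
        f->in_use = false;
    }

private:
    std::array<Fiber, MAX_FIBERS> slots_{};
};
```

The trade-off is a hard limit: once every slot is taken, creating another fiber has to fail (or wait) instead of growing the table.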

Problem solved (fingers crossed) and my mood has distinctly improved :-)

If you're interested in the nitty gritty, the change is on GitHub.

Tuesday, January 10, 2017

Windows Overlapped Socket IO with Completion Routines

Up till now, cSuneido has used WSAAsyncSelect to do non-blocking socket IO. But WSAAsyncSelect is deprecated and it's not the nicest approach anyway. cSuneido needs non-blocking socket IO for background fibers; the main fiber uses synchronous blocking IO. (Although that means the main fiber will block background fibers.) Note: Windows fibers are single-threaded, cooperatively scheduled coroutines. The advantage of fibers is that because they are single threaded and you control the context switches, you don't have the concurrency issues you would with "real" preemptive threads.

I thought that the WSAAsyncSelect code was the source of some failures we were seeing so I decided to rewrite it. My first rewrite used a polling approach. I know that's not scalable, but cSuneido doesn't do a lot of background processing so I figured it would be ok. Basically, I put the sockets in non-blocking mode, and whenever an operation returned WSAEWOULDBLOCK the fiber would give up the rest of its time slice (e.g. 50ms). This was quite simple to implement and seemed to work fine.

However, I soon found it was too slow for more than a few requests. For example, one background task was doing roughly 400 requests. 400 * 50 ms is 20 seconds - ouch!

Back to the drawing board. One option was to use WSAEventSelect, but it looked complicated and I wasn't quite sure how to fit it in with the GUI event message loop.

Then I saw that WSARecv and WSASend allowed completion routines, a bit like XMLHttpRequest or Node's non-blocking IO. This seemed like a simpler approach. The fiber could block (yielding to other fibers) and the completion routine could unblock it.

At first I thought I had to use WSASocket and specify overlapped, but it turned out that the regular socket function sets overlapped mode by default. That's ok because it has no effect unless you use WSARecv or WSASend in overlapped mode.

Sending was the easy part since there was no need to block the sending fiber. It could just "fire and forget". One question was whether it would always do the full transfer or whether it might just do a partial transfer and require calling WSASend again (from the completion routine) to finish the transfer. I couldn't find a definitive answer for this. I found several people saying that in practice, unless there is a major issue (like running out of memory), it will always do the full transfer. Currently I just have an assert to confirm this.

Receiving is trickier. You may need to block until the data is available. And the completion routine may get called for partial data in which case you need to call WSARecv again for the remainder. (Although this complicates the code, it's actually a good thing since it allows you to request larger amounts of data and receive it as it arrives.)
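
The bookkeeping for partial receives can be sketched without any Windows calls. In the following, `PendingRecv` and `receive_all` are hypothetical names, and the `deliver` callable merely stands in for "issue a WSARecv and wait for its completion routine to report how many bytes arrived". The real code is asynchronous; this only shows the re-issue logic.

```cpp
#include <cassert>
#include <cstddef>

// State for one logical receive that may be satisfied in several pieces.
struct PendingRecv {
    char* buf;
    std::size_t wanted;
    std::size_t got;

    PendingRecv(char* b, std::size_t n) : buf(b), wanted(n), got(0) {}
    char* next() const { return buf + got; }                 // where the next piece goes
    std::size_t remaining() const { return wanted - got; }   // bytes still missing
};

// Keeps requesting the remainder until everything has arrived.
// `deliver(dst, max)` writes up to `max` bytes to `dst` and returns the
// count; a return of 0 is treated here as "connection closed".
// Returns the number of requests that were issued.
template <typename Source>
int receive_all(PendingRecv& r, Source deliver) {
    int requests = 0;
    while (r.remaining() > 0) {
        ++requests;
        const std::size_t n = deliver(r.next(), r.remaining());
        assert(n <= r.remaining());
        if (n == 0)
            break;
        r.got += n;
    }
    return requests;
}
```

For example, asking for 10 bytes from a source that delivers at most 3 at a time takes four requests (3 + 3 + 3 + 1).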

WSASend and WSARecv can succeed immediately. However, the completion routine will still be called later. And for WSARecv at least, "success" may only be a partial transfer, in which case you still need to block waiting for the rest.

One complication to this style of overlapped IO is that completion routines are only called when you're in an "alertable" state. There are only a handful of functions that are alertable. I used MsgWaitForMultipleObjectsEx in the message loop, and SleepEx with a zero delay in a few other places. Note: although the MSDN documentation is unclear, you must specify MWMO_ALERTABLE for MsgWaitForMultipleObjectsEx to be alertable. (And it has to be the Ex version.)

Each overlapped WSASend or WSARecv is given a WSAOVERLAPPED structure and this structure must stay valid until the completion routine is called. I ran into problems because in some cases the completion routine wasn't getting called until after the socket had been closed, at which point I'd freed the WSAOVERLAPPED structure. I got around this by calling SleepEx with a zero delay so the completion routines would run.

When I looked at some debugging tracing I noticed that it seldom blocked for very long. So I added a 1ms SleepEx before blocking to see if the data would arrive, in which case it wouldn't need to block and incur a context switch. This eliminated some blocking, but sometimes it didn't seem to work. I realized it was probably because the sleep was getting ended by an unrelated completion routine (e.g. from the preceding write). So I added a loop to ensure it was waiting at least a millisecond and that fixed it. Of course, I'm testing with the client and server on the same machine so the latency is very low. Across the network it will be slower and will still need to block sometimes.
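
The "keep waiting even if the sleep ends early" loop looks roughly like this. It is a portable sketch with assumed names: `wait_up_to` is hypothetical, `alertable_sleep` stands in for an alertable SleepEx call that may return early when some unrelated completion routine runs, and `ready` stands in for "has the data I'm waiting for arrived?".

```cpp
#include <cassert>
#include <chrono>

// Waits until ready() returns true or max_wait has elapsed, whichever
// comes first. alertable_sleep(remaining) is allowed to return early,
// so the deadline is rechecked after every call.
// Returns true if ready() became true before the deadline.
template <typename SleepFn, typename ReadyFn>
bool wait_up_to(std::chrono::microseconds max_wait,
                SleepFn alertable_sleep, ReadyFn ready) {
    using clock = std::chrono::steady_clock;
    assert(max_wait.count() >= 0);
    const auto deadline = clock::now() + max_wait;
    while (!ready()) {
        const auto now = clock::now();
        if (now >= deadline)
            return false;  // give up; the caller blocks the fiber instead
        alertable_sleep(
            std::chrono::duration_cast<std::chrono::microseconds>(deadline - now));
    }
    return true;
}
```

A single sleep call gives up as soon as anything wakes it. The loop only gives up once the full interval has really passed.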

Although the code wasn't that complicated, it took me a while to get it working properly (i.e. fast). As always, the devil is in the details. But the end result looks good. Background socket IO now runs about 30% faster than the old WSAAsyncSelect version, and almost as fast as the foreground synchronous blocking IO.

Sunday, January 01, 2017

Java ByteBuffer Gotcha

I had a weird bug in some Java code in jSuneido that took me a while to figure out. I briefly thought I'd found a bug in the ByteBuffer implementation, although I realized that was a low probability. (Especially given the confusing nature of ByteBuffer.) In the end it makes sense, although it perhaps could be documented better.

Here's the scenario - you slice or duplicate a ByteBuffer. This makes a new ByteBuffer instance that shares its data with the original. Then you compact the original buffer. Note - this does not modify the data that you are interested in. However, it does move it.

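ByteBuffer buf = ByteBuffer.allocate(8192);
for (int i = 0; i <= 300; ++i)
    buf.put((byte) i);
buf.flip();                             // switch from writing to reading
for (int i = 0; i < 12; ++i)
    buf.get();                          // consume the first 12 bytes
ByteBuffer slice = buf.duplicate();     // shares the same underlying data
System.out.println(slice.get(0));       // 0
buf.compact();                          // moves the unread bytes to the start
System.out.println(slice.get(0));       // 12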

This will print 0 and then 12, i.e. the slice has changed, even though you didn't explicitly modify the buffer.

In retrospect it's obvious - compact does alter the buffer, which is shared with the slice. So the contents of the slice "change".

I'd be tempted to add a warning to the documentation for compact that it will "invalidate" any existing slices or duplicates, the same way that C++ documentation points out which operations invalidate iterators. But maybe it should be obvious.

I'm mostly posting this because my searches for the problem didn't find anything. Perhaps this will help someone in the future find the problem quicker.