Saturday, April 22, 2017

Streaming Functional

Suneido has functional aspects like map and filter. They're not a major part of the language but they're definitely useful. However, it has always bothered me that they aren't "streaming". An operation like Map or Filter always generated a new collection. For small collections and many usages that works ok. Obviously it doesn't allow you to work with infinite streams but that's not a common requirement for our applications. But it's annoyingly inefficient if the desired result isn't a collection. e.g. If you end up joining the results into a string or writing them to a file.

I've thought about how I could change Suneido's functional methods to be streaming but it alway seemed like it would be a big change. When Java 8 introduced its streaming methods (for similar reasons) it was another reminder. However I didn't really like their approach - having to start by calling .stream and end by with some kind of collector. It makes sense and works well but it's verbose. One of the things I like about functional style is that it tends to be quite succinct.

The other day I was thinking about it again and I realized I might be closer than I thought, that Suneido might already have most of what I needed.

Early on in Suneido's development I'd face a similar problem with iterating over the members of an object. The Members method returned a new collection but that was wasteful if you just wanted to iterate. So I'd introduced "sequences" which could be iterated over without instantiating a new collection. Basically they just wrapped an iterator. However, they could also be treated as collections. If you called a collection method on them, they would instantiate the collection automatically. This meant a single method like Members could be used both to iterate lazily, and to generate a new collection, automatically and invisibly.

Later I added a Sequence function so that you could write these kinds of lazy sequences in user code. For example, we used it for a string lines method. But it never got used extensively.

My insight (as usual, obvious in hindsight) was that Sequence provided almost everything I needed to make the functional methods streaming. I simply rewrote Map and Filter as iterators and wrapped them in Sequence.

I wrote them as standalone functions, but also made them methods on collections. I prefer to write list.Filter(...).Map(...) than Map(Filter(list))

But I immediately ran into a problem. list.Filter(...).Map(...) worked, but it wasn't streaming. i.e. The result of Filter was still getting instantiated. The reason was that the sequence/iterator didn't have a Map method so it would instantiate the collection which did.

That was easy to fix, although it required a change to the runtime implementation of Sequence. There was no reason that you couldn't call user defined collection methods on sequences. It was only the built-in collection methods that required instantiation.

The next problem I ran into was that people were misusing Map purely for side effects. They should have been using Each instead. That was easy to fix, but somewhat tedious to examine every use of Map.

Then I found that Join would trigger an instantiation because it was a built-in collection method. So I added a built-in Join method on sequences.

I also found that laziness (as with these streams) and mutability don't play well together. I ended up needing to explicitly instantiate sequences in a few places because otherwise things would change out from under them. This is ugly and I'm not really happy with it, although thankfully it seems quite rare. But I could see it stumping someone who wasn't aware of how sequences work.

Being explicit about when to transition between collections and streams (like Java 8) avoids many of these issues, but doing it automatically fits better with Suneido's philosophy.

This project was a bit of a roller coaster (aren't they all?) alternating between thinking it was easy and I was done, with finding "fundamental" flaws without obvious solutions.  Unusually, even quite flawed implementations would handle most (simple) scenarios. It was the corner cases that gave me grief.

After hanging Suneido a few times I ended up adding a way to mark infinite sequences and refusing to instantiate them or display their contents. For fun I wrote a Primes infinite sequence.

Overall I'm pretty happy with how it turned out. I haven't done many benchmarks. I'm sure I can come up with scenarios where it will be a big win, but I doubt it'll make much difference in the larger picture.

One weakness is that Suneido doesn't have any easy way to write sequence operations. Currently they have to be written as iterators, which is ok for simple stuff but not very nice for things like tree traversals. Ideally you'd have some kind of "generators" with "yield". A project for another time!

Thursday, March 30, 2017

Concurrency is Hard episode n+1

For almost a year a bug has been haunting me. Once every few thousand server startups, the server would hang. It wasn't totally frozen - the server monitor web page was still running and the server would still accept connections (but not service them). Simply restarting would fix it. There was no discernible pattern, customers got it at random and there didn't seem to be anything in common between them.

At first we wrote it off to bad hardware or server glitches. But it continued to happen a little too consistently for that to be the cause.

It was hard to pinpoint exactly when it started since we didn't pick up on it right away. But we knew the rough time frame. So I could look at version control and see what changes I'd made. But nothing seemed like it could cause it. I could have rolled back changes, but most of them were bug fixes so I wasn't keen to do that. And for all we knew it could have been a Windows update or a Java update or changes to the application code that triggered it.

Once we recognized we had a problem, my first thought was that it was some kind of deadlock during the startup sequence. No problem, Java has ThreadMXBean findDeadlockedThreads. It worked great when I forced deadlocks. I scheduled it to run five minutes after startup. It didn't find anything. I scheduled it to run every five minutes. Still didn't find anything. So either it wasn't a deadlock, or the deadlock detection wasn't working. In other words, no progress.

It seems terrible that we had this bug for almost a year. I did work on it, and many hours in total, but it was never really urgent. An individual customer might get it once every six months and they simply had to restart. So there wasn't a lot of external pressure to fix it. It certainly "bugged" me but with no clues it was hard to even know what to try.

I studied the code. I rewrote large sections. I ran the ThreadSafe static analysis tool on it. We added logging in numerous places.

We did stress testing to try and make it happen, although it's hard to "stress test" server startup. All we could do is repeatedly start and stop the server. Of course, it never happened when we did this.

What I really needed was to examine a hung system with a debugger (e.g. jvisualvm). But we only install a JRE on our customers servers, not the full JDK. If it happened consistently on one customer we could have installed the JDK on their server but it didn't. Towards the end I was starting to think about installing the JDK on all our customers. I would have been happy even to make the bug worse so it happened more frequently. At least then I'd have a chance of getting it under the microscope.

Looking at the Java documentation I found you could debug core dumps. So next time we got a chance to access a hung server, we forced a core dump. But we're using a launcher to run as a Windows service, and the launcher meant the debugging tools didn't recognize the core dump. So much for that idea.

One of the problems with Java's "synchronized" keyword is that you can't put any timeouts on it. So towards the end I was considering replacing all my uses of synchronized with explicit locks with timeouts. At least then deadlocks would time out and give me a clue where they were happening. In retrospect, this would have worked to locate where the problem was.

One of the problems was the slow turnaround. After making a change to jSuneido we would run it internally for at least a week. Then we'd roll it out to beta customers, and then finally to the rest of the customers. Then we had to wait for several days to see if the problem reoccured. So after each change I'd have to wait several weeks to see if it made a difference. Of course, there was nothing stopping me from continuing to work on the problem during this time, but the natural inclination was to want to wait and see if the change had worked.

One pattern we noticed was that it seemed to be related to requests to the HTTP server during startup. We weren't aware of that last time we stress tested so I thought it might be something to try. So I set up a server to restart every minute and arranged for a constant stream of HTTP requests. It ran all day without a hiccup. I normally shut down my computer at night. But that was only about 500 restarts, not really enough to have a good chance of detecting something that only happens every few thousand times. What the heck, I left it running overnight.

When I went to check it in the morning I had no expectation of finding anything other than it happily running. So it took me a minute of staring at the screen to realize it had hung! Eureka! Funny to be so excited about managing to hang your software!

I was almost afraid to touch it, for fear of disturbing the hung process. I fired up jvisualvm and did a thread dump. As I'd expected from the beginning, it was a deadlock. Two threads each owning a lock that the other wants.

And still the deadlock detection didn't show anything. I realized the reason was that one of the threads was not waiting on a lock, it was "parked" in Guava cache code. The code it was in was documented as "thread safe". As Inigo says in The Princess Bride, "You keep using that word. I do not think it means what you think it means."

I briefly entertained the (attractive) thought that the deadlock was in the Guava code and not my fault. That would have been nice, although harder to get fixed. But unfortunately it was quite obvious from the thread dump that it was my code that was at fault, despite the one thread being parked in Guava code.

The problem, and one of the lessons from this bug, is that code that calls a user supplied function is never truly thread safe because you can't control what that function does. That's also one of the problems with "functional" code, it is not amenable to static analysis.

In hindsight I realize that ThreadMXBean also has dumpAllThreads. I might have been able to write code to detect when the server hung and dumped the threads automatically. That would have given me the information I needed to solve this. Of course, I had no way to know that before now.

When I started rewriting Suneido from C++ to Java one of the big reasons was that we needed a multi-threaded server, and I thought Java had good concurrency support. But what it has is locks and shared state, which we've come to realize is not a great way to handle concurrency. Even back then I knew that, and I tried hard to minimize locking (which also helps performance) by keeping state thread contained and using immutable data. But I still used a certain amount of locking and concurrent data structures (which commonly use locking). And it's mostly worked out pretty well - jSuneido hasn't had a lot of concurrency problems. However, it's a bit like the lack of safety and garbage collection in C++, if you're careful you can get away with it, but sooner or later it's going to catch you.

Tuesday, January 17, 2017

Bug Tale

Yesterday, towards the end of the day, one of my staff came to me with a bug in cSuneido. Actually it was the second time he'd brought it to me. The first time I sent him away with a request for more details. This time he brought me a printed screen shot of the error. Perhaps that seemed more tangible.

Sometimes I get the impression that my staff gets a certain amount of secret enjoyment out of pointing out my bugs. Some of that's probably just my imagination and frustration. But even if it's true, it's understandable. They have to put up with me critiquing their code all the time.

I wasn't happy about it. It had been a busy day with the usual parade of interruptions. I didn't feel like I'd accomplished anything, and now this steaming turd had been deposited on my desk. I started to work on the bug, although I was pretty sure there wouldn't be enough time left in the day to get anywhere on it. The only advantage of starting late was that the interruptions had died out now that most people had left for the day.

On the positive side, the bug was repeatable. On the negative side it wasn't simple to reproduce - you had to run client-server, open our application, and then run a bunch of queries for over a minute. After thinking I'd fixed it several times I realized that it also only happened the first time you did this after starting the client. Subsequent runs worked fine.

The error itself was an access violation. The more I work in "safe" languages like Java or JavaScript or Go, the more I hate unsafe languages like C++. An access violation could be anything - a dangling pointer, a garbage collection issue, an uninitialized variable, an invalidated reference ...

On top of this, the error occurred inside a background fiber. Even better, it didn't occur when I ran the debug version of cSuneido.

As I expected, I didn't get anywhere before I headed home for the day, pissed off.

After supper I debated whether to work on it some more. If I could make a little progress, even just get a clue or narrow it down, then I'd end the day feeling a lot better. On the other hand, if I just ended up banging my head on the wall, I'd probably feel even worse.

I took the gamble, of course. Programmers are nothing if not eternal optimists. They have to be. But I hedged my bet by not "getting serious" and firing up the big iMac. I just sat in the living room with my laptop. That way if I failed I could tell myself it was because I'd just been poking around.

I didn't find the bug, but I did narrow it down enough that I felt I was hot on the trail. I could recreate the problem with a much simpler test case, and I'd found that I could recreate it in the debugger as long as I used the release build. It's harder to debug in the optimized version but being able to use the debugger at all was a big help.

It turned out the only significance of the queries running for over a minute was that during that minute several timer tasks got queued and ran concurrently when the queries ended. I could get the same result by just starting two fibers "at the same time".

Thankfully, the next day I was working at home and could focus on the problem. It was quite puzzling at first. I could see (in the debugger) exactly where the problem was, but the code looked correct, and almost all the time it worked correctly. I even resorted to looking at the assembly language and register contents, something I haven't done for a long time.

Stepping through the code I found there was a fiber context switch in the middle of the problem lines. And from looking at the assembler it was pretty obvious the compiler was caching the results of some common subexpressions, which it probably wasn't doing in the debug version. But I couldn't see how that caused a problem.

With fibers being cooperative and not pre-emptive, you don't really need to worry about concurrency issues. But this turned out to be a concurrency issue after all.

The problem lines were:

tls().thedbms = dbms_remote_async(server_ip);

tls() was getting evaluated and cached. But if dbms_remote_async "blocked" waiting to connect, then another fiber would run, and if that other fiber created a new fiber, and this caused the fibers vector to grow (reallocate), then the cached value of tls() would be invalid, causing the access violation.

Sure enough, if I called reserve on the vector to pre-grow it, then the problem went away.

It only happened the first time because after that the vector wouldn't need to grow and the tls() reference would stay valid.

I was grateful that the problem was so repeatable. If this had been real threads it would have been much more erratic. One of the advantages of fibers is that they are deterministic.

One local fix was to rewrite it as:

auto dbms = dbms_remote_async(server_ip);
tls().thedbms = dbms;

But where else in the code did I have this potential problem? And what would stop me from reintroducing the problem in future code.

My next thought was that I needed to tell the compiler that tls() was "volatile", i.e. it could change concurrently. But that wasn't really the problem. Even in single threaded code inserting into a vector invalidates any references, that's part of its contract.

One option was to use Windows fiber local storage. This didn't exist back when I rolled my own.

Another option was to dynamically allocate the tls structure so it didn't reside inside the vector.

However, there could potentially be other reference into the vector. I'd had problems with this in the past and "solved" them by using indexes into the vector rather than pointers. But it was still something I had to continually be on the lookout for.

Instead, I decided to switch from a vector to a simple fixed size static array. cSuneido isn't designed for huge numbers of fibers anyway. References into a fixed size static array were safe, nothing can invalidate them.

Problem solved (fingers crossed) and my mood has distinctly improved :-)

If you're interested in the nitty gritty, the change is on Github

Tuesday, January 10, 2017

Windows Overlapped Socket IO with Completion Routines

Up till now, cSuneido has used WSAAsyncSelect to do non-blocking socket IO. But WSAAsyncSelect is deprecated and it's not the nicest approach anyway. cSuneido needs non-blocking socket IO for background fibers, the main fiber uses synchronous blocking IO. (Although that means the main fiber will block background fibers.) Note: Windows fibers are single threaded, cooperative multi-tasking, coroutines. The advantage of fibers is that because they are single threaded and you control the context switches, you don't have the concurrency issues you would with "real" preemptive threads.

I thought that the WSAAsyncSelect code was the source of some failures we were seeing so I decided to rewrite it. My first rewrite used a polling approach. I know that's not scalable, but cSuneido doesn't do a lot of background processing so I figured it would be ok. Basically, I put the sockets in non-blocking mode, and whenever an operation returned WSAWOULDBLOCK the fiber would give up the rest of its time slice (e.g. 50ms) This was quite simple to implement and seemed to work fine.

However, I soon found it was too slow for more than a few requests. For example, one background task was doing roughly 400 requests. 400 * 50 ms is 20 seconds - ouch!

Back to the drawing board. One option was to use WSAEventSelect, but it looked complicated and I wasn't quite sure how to fit it in with the GUI event message loop.

Then I saw that WSARecv and WSASend allowed completion routines, a bit like XMLHttpRequest or Node's non-blocking IO. This seemed like a simpler approach. The fiber could block (yielding to other fibers) and the completion routine could unblock it.

At first I thought I had to use WSASocket and specify overlapped, but it turned out that the regular socket function sets overlapped mode by default. That's ok because it has no effect unless you use WSARecv or WSASend in overlapped mode.

Sending was the easy part since there was no need to block the sending fiber. It could just "fire and forget". One question was whether it would always do the full transfer or whether it might just do a partial transfer and require calling WSASend again (from the completion routine) to finish the transfer. I couldn't find a definitive answer for this. I found several people saying that in practice, unless there is a major issue (like running out of memory), it will always do the full transfer. Currently I just have an assert to confirm this.

Receiving is trickier. You may need to block until the data is available. And the completion routine may get called for partial data in which case you need to call WSARecv again for the remainder. (Although this complicates the code, it's actually a good thing since it allows you to request larger amounts of data and receive it as it arrives.)

WSASend and WSARecv can succeed immediately. However, the completion routine will still be called later. And for WSARecv at least, "success" may only be a partial transfer, in which case you still need to block waiting for the rest.

One complication to this style of overlapped IO is that completion routines are only called when you're in an "alertable" state. There are only a handful of functions that are alertable. I used MsgWaitForMultipleObjectsEx in the message loop, and SleepEx with a zero delay in a few other places. Note: although the MSDN documentation is unclear, you must specify MWMO_AWAITABLE for MsgWaitForMultipleObjectsEx to be alertable. (And it has to be the Ex version.)

Each overlapped WSASend or WSARecv is given an WSAOVERLAPPED structure and this structure must stay valid until the completion routine is called. I ran into problems because in some cases the completion routine wasn't getting called until after the socket had been closed, at which point I'd free'd the WSAOVERLAPPED structure. I got around this be calling SleepEx with a zero delay so the completion routines would run.

When I looked at some debugging tracing I noticed that it seldom blocked for very long. So I added a 1ms SleepEx before blocking to see if the data would arrive, in which case it wouldn't need to block and incur a context switch. This eliminated some blocking, but sometimes it didn't seem to work. I realized it was probably because the sleep was getting ended by an unrelated completion routine (e.g. from the preceding write). So I added a loop to ensure it was waiting at least a millisecond and that fixed it. Of course, I'm testing with the client and server on the same machine so the latency is very low. Across the network it will be slower and will still need to block sometimes.

Although the code wasn't that complicated, it took me a while to get it working properly (i.e. fast). As always, the devil is in the details. But the end result looks good. Background socket IO now runs about 30% faster than the old WSAAsyncSelect version, and almost as fast as the foreground synchronous blocking IO.

Sunday, January 01, 2017

Java ByteBuffer Gotcha

I had a weird bug in some Java code in jSuneido that took me a while to figure out. I briefly thought I'd found a bug in the ByteBuffer implementation, although I realized that was a low probability. (Especially given the confusing nature of ByteBuffer.) In the end it makes sense, although it perhaps could be documented better.

Here's the scenario - you slice or duplicate a ByteBuffer. This makes a new ByteBuffer instance that shares its data with the original. Then you compact the original buffer. Note - this does not modify the data that you are interested in. However, it does move it.

ByteBuffer buf = ByteBuffer.allocate(8192);
for (int i = 0; i <= 300; ++i)
    buf.put((byte) i);
for (int i = 0; i < 12; ++i)
ByteBuffer slice = buf.duplicate();

This will print 0 and then 12, i.e. the slice has changed, even though you didn't explicitly modify the buffer.

In retrospect it's obvious - compact does alter the buffer, which is shared with the slice. So the contents of the slice "changes".

I'd be tempted to add a warning to the documentation for compact that it will "invalidate" any existing slices or duplicates, the same way that C++ documentation points out which operations invalidate iterators. But maybe it should be obvious.

I'm mostly posting this because my searches for the problem didn't find anything. Perhaps this will help someone in the future find the problem quicker.