The Software Life: 2017

Tuesday, December 19, 2017

Concurrency Again

It started with long update transactions. Mostly we ignored them - some of our customers have slow servers, and sometimes they try to do transactions that are too large.

jSuneido aborts long update transactions as a defense mechanism. Its optimistic concurrency control has to track overlapping transactions, so if a transaction takes a long time, it can lead to a lot of overlaps.

But some of these long update transactions didn't make much sense. They appeared to be slow in the final commit stage. This stage just writes to the memory mapped files and normally is very fast, nowhere near the 20 second transaction limit.

I started going through the code, trying to see what might explain this. I wondered if it might have something to do with the background (once a minute) task that calls force on the MappedByteBuffers to flush them to external storage. Trying to ensure that files are durably written is tricky these days with so many levels of caching etc. but it seems worth trying.

I would have thought that force itself would be fast since most of the work would be done at the operating system level. But the documentation says that it doesn't return till it finishes writing to storage. I added some logging and found that force is usually fast (a millisecond) but occasionally slow (10's of seconds). There didn't seem to be a particular reason for the slow ones. I assume Windows is doing something that blocks.

The slow force would be enough to cause long update transactions, but force was supposed to be happening in a background thread so it shouldn't have blocked anything.

Crap! The force method was synchronized. And so was a method that transactions needed to use to commit. So the force would block the commit and could cause the problem. The problem was relatively rare because the force was only once a minute and usually fast, and it would have to hit exactly when a transaction was committing.

Did force really need to be synchronized? Did the other methods? Back to the age old problems of concurrency with shared memory and locking.

I removed the synchronized. But then I worried about accessing the array of MappedByteBuffers without any locking. I could make the array reference volatile but not the array elements. So I changed it to use AtomicReferenceArray.

At first I thought it was working ok. I ran tests with a bunch of threads hammering on the database and it seemed to work ok. But then my database got corrupted. Was my change failing? I ran the tests again and everything was fine. Hmmm. I changed the force to run every 10 seconds instead of every minute. Sure enough, now the database would get corrupted every time I ran the test.

It still didn't make sense to me that removing synchronized from force would corrupt the data. I noticed that my force method was also writing a few zeros to the end of the file to force the last modified date of the file to get updated. I could see that having concurrency issues. I changed it to rewrite the first few bytes of the file header instead. That fixed it, no more corruption. Hopefully it's smart enough to see that nothing has actually changed and not physically write to storage.

Everything looked good according to my tests, but that's generally not sufficient when it comes to concurrency. I also feel like I should convince myself that it is "safe", even if only informally.

Writing to the database (the final stage of an update transaction) was protected by a lock. So there could only be a single writer at a time. (Yes, that's a bottleneck, but it's a small part of the overall process so it doesn't reduce concurrency too much.)

But there are multiple reader threads. So the question is visibility, not exclusion. The Java Memory Model makes no guarantees of visibility without either volatile or locking. To ensure visibility I needed writers to release a lock when they were finished, and readers to acquire that same lock when they were starting.

For database transactions, that seemed to be covered. Transactions called synchronized methods (on the same object) at the beginning and end.

But there were other readers besides database transactions. The database checking process was one. It also turned out to be ok because it acquired the commit lock briefly at the start. (And writers released it at the end.)

That left the force process. Part of the problem was that it's low level code. It doesn't (and shouldn't) "know" about transactions or commit locks etc. I needed something at the same level. Luckily, writers called a particular low level method when they ended so I could synchronize that method along with the force method. Oops! Just about repeated the original mistake of force holding a lock while it was working. It didn't need to hold it since the goal wasn't exclusion. I just needed to acquire (and then release) the lock (to ensure visibility). So I made a dummy "sync" method that was synchronized and called that at the start of the force method.

I think I've convinced myself that it is "safe". I wouldn't say I'm one hundred percent sure, but reasonably confident.

Another question is fragility. Would future changes be likely to break this? Minor changes seem pretty safe. The things I'm depending on are reasonably fundamental. If I made bigger changes it could be dangerous, but presumably if I'm making bigger changes I would reconsider concurrency. And I did document it.

So if visibility and exclusion are covered, do I need to use an AtomicReferenceArray? Presumably not, so I reverted those changes and went back to a plain old array.

I did another round of testing on both OS X and Windows and didn't encounter any problems.

The problem with this kind of concurrency issue is that you can't just throw locks at it. The problems are just as likely to come from too many locks as from not enough. I wish you could just lock everything and be done with it. But if you do that, you'll probably have deadlocks, and you'll be back to single threaded performance.

Thursday, November 23, 2017

Suneido Bind

In functional programming you have the idea of "partial application" of functions, where you supply some of the arguments to a function and the result is a new function that takes the remaining arguments. This is also referred to as "binding" some of the arguments, for example C++ std::bind.

We've had a version of this in Suneido for a long time, called "Curry". This is actually not the correct name, although it's related. Technically, currying f(a,b) would give you a function you could call as g(a)(b), where g(a) would give you a function you could call as h(b).

Suneido's Curry similarly allows you to supply some of the leading arguments, creating something you can call later with the remaining arguments. For example:

f = Curry(Print, 'hello')
f('world')
=> hello world

The code is fairly simple, using Suneido's variable number of argument facilities (similar to Javascript's spread and rest)

Recently, I was thinking that it would be nice if you could simply write: (like Scala)

f = Print('hello', _)

That would require compiler changes, but I realized I could do something similar to try out the idea:

f = Bind(Print, 'hello', Bind)

I needed something to stand for unsupplied arguments and using Bind itself seemed suitable.

To make it fast, I had the idea that I could generate a custom class. That sounds expensive, but you only need a class per "shape" of bind, not the particular values. And the classes could be cached. The code is a bit more complicated but not bad.

As hoped, the functions returned by Bind were faster than the ones from Curry, about four times faster. That was nice. And although Bind didn't handle variable numbers of arguments like Curry, it didn't require the supplied arguments to be at the beginning.

Of course, the other way to do this is using Suneido blocks (like closures or lambdas in other languages but with a syntax that came from Smalltalk):

f = {|b| Print('hello', b) }

It turned out that was slightly faster than my fancy new Bind. Which means we shouldn't be using Curry either since it's even slower. (jSuneido is even smart enough to see that's not actually a closure and optimize it to just be a normal function, avoiding the overhead of capturing the environment.)

It would still be nicer to write: Print('hello', _) but that could simply be syntactic sugar for the equivalent block. That wouldn't be too hard to implement in jSuneido since it compiles via an AST but it would be harder in cSuneido which compiles in a single pass directly from source to byte code.

Of course, if you had that, then it would also be nice to be able to write e.g. Map(2 * _) as a shorthand for: Map({|x| 2 * x })

Saturday, September 02, 2017

Go Interfaces and Immediate Integers

One of the unique things about the Go language is how it handles "dynamic" references compared to a language like Java.

In Java we have:

Whereas in Go we have:

In Go an interface is a "fat pointer". Your first thought might be, six of one, half a dozen of the other. What difference does it make where you put the type. However, there are significant tradeoffs.

Next, you might think, I could have multiple references to the same value, in which case Go will use more space because it will duplicate the type. But if we have a large array of values (in which case we're unlikely to have references to all of them), then Go would save space because it wouldn't need the type on each value.

The other big saving is if you have statically (compile time) typed references in Go they will look like:

And again, you save space because there's no need to store the type at runtime.

Given that background, the problem I was thinking about was how to handle integers in a Go version of Suneido. In Java you have to used boxed (on the heap) integers for generic references.

Older versions of Go were perfect for this purpose since they stored integers right in the interface in the same place as the pointer.

But sadly (for me) when they improved the garbage collector they removed immediate values:

The implementation of interface values has been modified. In earlier releases, the interface contained a word that was either a pointer or a one-word scalar value, depending on the type of the concrete object stored. This implementation was problematical for the garbage collector, so as of 1.4 interface values always hold a pointer. In running programs, most interface values were pointers anyway, so the effect is minimal, but programs that store integers (for example) in interfaces will see more allocations.

So now if you store an integer in an interface it gets allocated on the heap, effectively "boxing" it like Java does, although with less overhead.

For something like implementing Suneido, this really sucks. It adds a huge amount of overhead (relatively) on simple integer operations.

You could make an "extra fat" value that had an interface and an integer, but that's ugly too.

In C++ you can use a union to handle immediate integers and that's what cSuneido does. But Go doesn't have unions in the C/C++ sense.

But the union in cSuneido basically reserves a portion of the address space for use by immediate integers. i.e. a pointer falling in that range of addresses is assumed to be and is treated as an integer.

So why not do something similar in Go. Here's my proof of concept:

var space [1024 * 1024 * 1024]int8
var base = uintptr(unsafe.Pointer(&space[0]))

func ExampleSmi() {
   n := 123456789

   x := &space[n]
   i := uintptr(unsafe.Pointer(x)) - base
   fmt.Println(i)
   // Output:

   // 123456789

It does use the unsafe package, but only in a read-only, relatively safe way. (IMO) Pointers will always point into valid memory so the garbage collector will be happy. And it should be fast since the unsafe functions are primarily a compile time feature, they don't do much at runtime.

This example handles 30 bit integers (1 gb). (Unsigned here, but easily modified to handle signed.) I tried to allocate a 4 gb array to handle 32 bit integers but the compiler appears to have a limit of 2 gb. (Note: 4gb is an insignificant portion of a 64 bit address space.) That's too bad because I could probably ignore overflow from math on 32 bit integers, but not 30 bit integers.

We want to declare the "space" statically so that "base" is a constant. One question I had was whether this would add 1gb to the executable. With Go 1.9 on OS X it doesn't. (It appears to go in SNOPTRBSS which I translate as "no pointer uninitialized/zero memory") My other question is whether it's smart enough to allocate zeroed virtual memory and avoid actually zeroing the memory. I'm not sure how to check this, but there doesn't seem to be any lag in startup. Also, looking at the process in the OS X Activity Monitor is shows real memory usage under 2mb. Presumably if the virtual memory is never modified it will never use any real memory. Also, importantly, the array is of type int8 so it won't need to be scanned by the garbage collector.

Even better, we can make our own type based on int8 (e.g, type Smi int8) and then define methods on this type. As long as we don't make the Smi type public and are careful to only make instances within the array then it will work like any other type. We basically have the old behavior of Go, albeit with a smaller range of integers.

It's a bit of kludge but it should work quite well and it's all standard (albeit "unsafe") Go. And it's no worse than tagged pointers or nan boxing.

Friday, August 04, 2017

[[nodiscard]]

I was reviewing some of my cSuneido C++ code and I noticed that I was ignoring the return value from a function, which meant that a variable might not be set. The function looked like:

bool int_if_num(int* pi);

I recalled reading about the new (C++17) [[nodiscard]] attribute which should catch this kind of error. I added the attribute to the function:

[[nodiscard]] bool int_if_num(int* pi);

I thought it was supported in Visual Studio but it turned out that it's "coming soon" in the next release.

But I also had Clang which does support it, and sure enough it caught the error:

error: ignoring return value of function declared with 'nodiscard' attribute

(I thought there might be more instances, but this was the only one.)

The nice thing about this kind of feature is that it's fine if it doesn't have universal support. I commonly compile with GCC and Clang in addition to Visual Studio because they each catch different issues.

[Afterward - reading the code more carefully, I found that I was initializing the variable prior to the call so it wasn't a problem after all. But that relied on the function not modifying the value unless it succeeded, which could change, and then there would be a problem. So it was still worth fixing the code.]

Tuesday, June 27, 2017

Eclipse Tip

I have become accustomed to seeing "change bars" in Visual Studio i.e. being able to see which lines have been changed but not committed to version control.

It seemed odd that Eclipse wouldn't have this feature so I went searching for it and sure enough it does, but strangely, it's disabled by default. It's also a little hard to find. It's not under Annotations or Team (version control). It's under Editor > Text Editors > Quick Diff, which makes sense, and you can get there easily by searching for "diff", as long as you know what to search for.

You want to check "Enable quick diff" and "Show differences in overview ruler" and "Git Revision" under "Use this reference source". (assuming you're using Git)

See also: stack overflow: Highlighting modified lines in Eclipse

Saturday, April 22, 2017

Streaming Functional

Suneido has functional aspects like map and filter. They're not a major part of the language but they're definitely useful. However, it has always bothered me that they aren't "streaming". An operation like Map or Filter always generated a new collection. For small collections and many usages that works ok. Obviously it doesn't allow you to work with infinite streams but that's not a common requirement for our applications. But it's annoyingly inefficient if the desired result isn't a collection. e.g. If you end up joining the results into a string or writing them to a file.

I've thought about how I could change Suneido's functional methods to be streaming but it alway seemed like it would be a big change. When Java 8 introduced its streaming methods (for similar reasons) it was another reminder. However I didn't really like their approach - having to start by calling .stream and end by with some kind of collector. It makes sense and works well but it's verbose. One of the things I like about functional style is that it tends to be quite succinct.

The other day I was thinking about it again and I realized I might be closer than I thought, that Suneido might already have most of what I needed.

Early on in Suneido's development I'd face a similar problem with iterating over the members of an object. The Members method returned a new collection but that was wasteful if you just wanted to iterate. So I'd introduced "sequences" which could be iterated over without instantiating a new collection. Basically they just wrapped an iterator. However, they could also be treated as collections. If you called a collection method on them, they would instantiate the collection automatically. This meant a single method like Members could be used both to iterate lazily, and to generate a new collection, automatically and invisibly.

Later I added a Sequence function so that you could write these kinds of lazy sequences in user code. For example, we used it for a string lines method. But it never got used extensively.

My insight (as usual, obvious in hindsight) was that Sequence provided almost everything I needed to make the functional methods streaming. I simply rewrote Map and Filter as iterators and wrapped them in Sequence.

I wrote them as standalone functions, but also made them methods on collections. I prefer to write list.Filter(...).Map(...) than Map(Filter(list))

But I immediately ran into a problem. list.Filter(...).Map(...) worked, but it wasn't streaming. i.e. The result of Filter was still getting instantiated. The reason was that the sequence/iterator didn't have a Map method so it would instantiate the collection which did.

That was easy to fix, although it required a change to the runtime implementation of Sequence. There was no reason that you couldn't call user defined collection methods on sequences. It was only the built-in collection methods that required instantiation.

The next problem I ran into was that people were misusing Map purely for side effects. They should have been using Each instead. That was easy to fix, but somewhat tedious to examine every use of Map.

Then I found that Join would trigger an instantiation because it was a built-in collection method. So I added a built-in Join method on sequences.

I also found that laziness (as with these streams) and mutability don't play well together. I ended up needing to explicitly instantiate sequences in a few places because otherwise things would change out from under them. This is ugly and I'm not really happy with it, although thankfully it seems quite rare. But I could see it stumping someone who wasn't aware of how sequences work.

Being explicit about when to transition between collections and streams (like Java 8) avoids many of these issues, but doing it automatically fits better with Suneido's philosophy.

This project was a bit of a roller coaster (aren't they all?) alternating between thinking it was easy and I was done, with finding "fundamental" flaws without obvious solutions. Unusually, even quite flawed implementations would handle most (simple) scenarios. It was the corner cases that gave me grief.

After hanging Suneido a few times I ended up adding a way to mark infinite sequences and refusing to instantiate them or display their contents. For fun I wrote a Primes infinite sequence.

Overall I'm pretty happy with how it turned out. I haven't done many benchmarks. I'm sure I can come up with scenarios where it will be a big win, but I doubt it'll make much difference in the larger picture.

One weakness is that Suneido doesn't have any easy way to write sequence operations. Currently they have to be written as iterators, which is ok for simple stuff but not very nice for things like tree traversals. Ideally you'd have some kind of "generators" with "yield". A project for another time!

Thursday, March 30, 2017

Concurrency is Hard episode n+1

For almost a year a bug has been haunting me. Once every few thousand server startups, the server would hang. It wasn't totally frozen - the server monitor web page was still running and the server would still accept connections (but not service them). Simply restarting would fix it. There was no discernible pattern, customers got it at random and there didn't seem to be anything in common between them.

At first we wrote it off to bad hardware or server glitches. But it continued to happen a little too consistently for that to be the cause.

It was hard to pinpoint exactly when it started since we didn't pick up on it right away. But we knew the rough time frame. So I could look at version control and see what changes I'd made. But nothing seemed like it could cause it. I could have rolled back changes, but most of them were bug fixes so I wasn't keen to do that. And for all we knew it could have been a Windows update or a Java update or changes to the application code that triggered it.

Once we recognized we had a problem, my first thought was that it was some kind of deadlock during the startup sequence. No problem, Java has ThreadMXBean findDeadlockedThreads. It worked great when I forced deadlocks. I scheduled it to run five minutes after startup. It didn't find anything. I scheduled it to run every five minutes. Still didn't find anything. So either it wasn't a deadlock, or the deadlock detection wasn't working. In other words, no progress.

It seems terrible that we had this bug for almost a year. I did work on it, and many hours in total, but it was never really urgent. An individual customer might get it once every six months and they simply had to restart. So there wasn't a lot of external pressure to fix it. It certainly "bugged" me but with no clues it was hard to even know what to try.

I studied the code. I rewrote large sections. I ran the ThreadSafe static analysis tool on it. We added logging in numerous places.

We did stress testing to try and make it happen, although it's hard to "stress test" server startup. All we could do is repeatedly start and stop the server. Of course, it never happened when we did this.

What I really needed was to examine a hung system with a debugger (e.g. jvisualvm). But we only install a JRE on our customers servers, not the full JDK. If it happened consistently on one customer we could have installed the JDK on their server but it didn't. Towards the end I was starting to think about installing the JDK on all our customers. I would have been happy even to make the bug worse so it happened more frequently. At least then I'd have a chance of getting it under the microscope.

Looking at the Java documentation I found you could debug core dumps. So next time we got a chance to access a hung server, we forced a core dump. But we're using a launcher to run as a Windows service, and the launcher meant the debugging tools didn't recognize the core dump. So much for that idea.

One of the problems with Java's "synchronized" keyword is that you can't put any timeouts on it. So towards the end I was considering replacing all my uses of synchronized with explicit locks with timeouts. At least then deadlocks would time out and give me a clue where they were happening. In retrospect, this would have worked to locate where the problem was.

One of the problems was the slow turnaround. After making a change to jSuneido we would run it internally for at least a week. Then we'd roll it out to beta customers, and then finally to the rest of the customers. Then we had to wait for several days to see if the problem reoccured. So after each change I'd have to wait several weeks to see if it made a difference. Of course, there was nothing stopping me from continuing to work on the problem during this time, but the natural inclination was to want to wait and see if the change had worked.

One pattern we noticed was that it seemed to be related to requests to the HTTP server during startup. We weren't aware of that last time we stress tested so I thought it might be something to try. So I set up a server to restart every minute and arranged for a constant stream of HTTP requests. It ran all day without a hiccup. I normally shut down my computer at night. But that was only about 500 restarts, not really enough to have a good chance of detecting something that only happens every few thousand times. What the heck, I left it running overnight.

When I went to check it in the morning I had no expectation of finding anything other than it happily running. So it took me a minute of staring at the screen to realize it had hung! Eureka! Funny to be so excited about managing to hang your software!

I was almost afraid to touch it, for fear of disturbing the hung process. I fired up jvisualvm and did a thread dump. As I'd expected from the beginning, it was a deadlock. Two threads each owning a lock that the other wants.

And still the deadlock detection didn't show anything. I realized the reason was that one of the threads was not waiting on a lock, it was "parked" in Guava cache code. The code it was in was documented as "thread safe". As Inigo says in The Princess Bride, "You keep using that word. I do not think it means what you think it means."

I briefly entertained the (attractive) thought that the deadlock was in the Guava code and not my fault. That would have been nice, although harder to get fixed. But unfortunately it was quite obvious from the thread dump that it was my code that was at fault, despite the one thread being parked in Guava code.

The problem, and one of the lessons from this bug, is that code that calls a user supplied function is never truly thread safe because you can't control what that function does. That's also one of the problems with "functional" code, it is not amenable to static analysis.

In hindsight I realize that ThreadMXBean also has dumpAllThreads. I might have been able to write code to detect when the server hung and dumped the threads automatically. That would have given me the information I needed to solve this. Of course, I had no way to know that before now.

When I started rewriting Suneido from C++ to Java one of the big reasons was that we needed a multi-threaded server, and I thought Java had good concurrency support. But what it has is locks and shared state, which we've come to realize is not a great way to handle concurrency. Even back then I knew that, and I tried hard to minimize locking (which also helps performance) by keeping state thread contained and using immutable data. But I still used a certain amount of locking and concurrent data structures (which commonly use locking). And it's mostly worked out pretty well - jSuneido hasn't had a lot of concurrency problems. However, it's a bit like the lack of safety and garbage collection in C++, if you're careful you can get away with it, but sooner or later it's going to catch you.

Tuesday, January 17, 2017

Bug Tale

Yesterday, towards the end of the day, one of my staff came to me with a bug in cSuneido. Actually it was the second time he'd brought it to me. The first time I sent him away with a request for more details. This time he brought me a printed screen shot of the error. Perhaps that seemed more tangible.

Sometimes I get the impression that my staff gets a certain amount of secret enjoyment out of pointing out my bugs. Some of that's probably just my imagination and frustration. But even if it's true, it's understandable. They have to put up with me critiquing their code all the time.

I wasn't happy about it. It had been a busy day with the usual parade of interruptions. I didn't feel like I'd accomplished anything, and now this steaming turd had been deposited on my desk. I started to work on the bug, although I was pretty sure there wouldn't be enough time left in the day to get anywhere on it. The only advantage of starting late was that the interruptions had died out now that most people had left for the day.

On the positive side, the bug was repeatable. On the negative side it wasn't simple to reproduce - you had to run client-server, open our application, and then run a bunch of queries for over a minute. After thinking I'd fixed it several times I realized that it also only happened the first time you did this after starting the client. Subsequent runs worked fine.

The error itself was an access violation. The more I work in "safe" languages like Java or JavaScript or Go, the more I hate unsafe languages like C++. An access violation could be anything - a dangling pointer, a garbage collection issue, an uninitialized variable, an invalidated reference ...

On top of this, the error occurred inside a background fiber. Even better, it didn't occur when I ran the debug version of cSuneido.

As I expected, I didn't get anywhere before I headed home for the day, pissed off.

After supper I debated whether to work on it some more. If I could make a little progress, even just get a clue or narrow it down, then I'd end the day feeling a lot better. On the other hand, if I just ended up banging my head on the wall, I'd probably feel even worse.

I took the gamble, of course. Programmers are nothing if not eternal optimists. They have to be. But I hedged my bet by not "getting serious" and firing up the big iMac. I just sat in the living room with my laptop. That way if I failed I could tell myself it was because I'd just been poking around.

I didn't find the bug, but I did narrow it down enough that I felt I was hot on the trail. I could recreate the problem with a much simpler test case, and I'd found that I could recreate it in the debugger as long as I used the release build. It's harder to debug in the optimized version but being able to use the debugger at all was a big help.

It turned out the only significance of the queries running for over a minute was that during that minute several timer tasks got queued and ran concurrently when the queries ended. I could get the same result by just starting two fibers "at the same time".

Thankfully, the next day I was working at home and could focus on the problem. It was quite puzzling at first. I could see (in the debugger) exactly where the problem was, but the code looked correct, and almost all the time it worked correctly. I even resorted to looking at the assembly language and register contents, something I haven't done for a long time.

Stepping through the code I found there was a fiber context switch in the middle of the problem lines. And from looking at the assembler it was pretty obvious the compiler was caching the results of some common subexpressions, which it probably wasn't doing in the debug version. But I couldn't see how that caused a problem.

With fibers being cooperative and not pre-emptive, you don't really need to worry about concurrency issues. But this turned out to be a concurrency issue after all.

The problem lines were:

tls().thedbms = dbms_remote_async(server_ip);
tls().thedbms->auth(tok);

tls() was getting evaluated and cached. But if dbms_remote_async "blocked" waiting to connect, then another fiber would run, and if that other fiber created a new fiber, and this caused the fibers vector to grow (reallocate), then the cached value of tls() would be invalid, causing the access violation.

Sure enough, if I called reserve on the vector to pre-grow it, then the problem went away.

It only happened the first time because after that the vector wouldn't need to grow and the tls() reference would stay valid.

I was grateful that the problem was so repeatable. If this had been real threads it would have been much more erratic. One of the advantages of fibers is that they are deterministic.

One local fix was to rewrite it as:

auto dbms = dbms_remote_async(server_ip);
tls().thedbms = dbms;
dbms->auth(tok);

But where else in the code did I have this potential problem? And what would stop me from reintroducing the problem in future code.

My next thought was that I needed to tell the compiler that tls() was "volatile", i.e. it could change concurrently. But that wasn't really the problem. Even in single threaded code inserting into a vector invalidates any references, that's part of its contract.

One option was to use Windows fiber local storage. This didn't exist back when I rolled my own.

Another option was to dynamically allocate the tls structure so it didn't reside inside the vector.

However, there could potentially be other reference into the vector. I'd had problems with this in the past and "solved" them by using indexes into the vector rather than pointers. But it was still something I had to continually be on the lookout for.

Instead, I decided to switch from a vector to a simple fixed size static array. cSuneido isn't designed for huge numbers of fibers anyway. References into a fixed size static array were safe, nothing can invalidate them.

Problem solved (fingers crossed) and my mood has distinctly improved :-)

If you're interested in the nitty gritty, the change is on Github

Tuesday, January 10, 2017

Windows Overlapped Socket IO with Completion Routines

Up till now, cSuneido has used WSAAsyncSelect to do non-blocking socket IO. But WSAAsyncSelect is deprecated and it's not the nicest approach anyway. cSuneido needs non-blocking socket IO for background fibers, the main fiber uses synchronous blocking IO. (Although that means the main fiber will block background fibers.) Note: Windows fibers are single threaded, cooperative multi-tasking, coroutines. The advantage of fibers is that because they are single threaded and you control the context switches, you don't have the concurrency issues you would with "real" preemptive threads.

I thought that the WSAAsyncSelect code was the source of some failures we were seeing so I decided to rewrite it. My first rewrite used a polling approach. I know that's not scalable, but cSuneido doesn't do a lot of background processing so I figured it would be ok. Basically, I put the sockets in non-blocking mode, and whenever an operation returned WSAWOULDBLOCK the fiber would give up the rest of its time slice (e.g. 50ms) This was quite simple to implement and seemed to work fine.

However, I soon found it was too slow for more than a few requests. For example, one background task was doing roughly 400 requests. 400 * 50 ms is 20 seconds - ouch!

Back to the drawing board. One option was to use WSAEventSelect, but it looked complicated and I wasn't quite sure how to fit it in with the GUI event message loop.

Then I saw that WSARecv and WSASend allowed completion routines, a bit like XMLHttpRequest or Node's non-blocking IO. This seemed like a simpler approach. The fiber could block (yielding to other fibers) and the completion routine could unblock it.

At first I thought I had to use WSASocket and specify overlapped, but it turned out that the regular socket function sets overlapped mode by default. That's ok because it has no effect unless you use WSARecv or WSASend in overlapped mode.

Sending was the easy part since there was no need to block the sending fiber. It could just "fire and forget". One question was whether it would always do the full transfer or whether it might just do a partial transfer and require calling WSASend again (from the completion routine) to finish the transfer. I couldn't find a definitive answer for this. I found several people saying that in practice, unless there is a major issue (like running out of memory), it will always do the full transfer. Currently I just have an assert to confirm this.

Receiving is trickier. You may need to block until the data is available. And the completion routine may get called for partial data in which case you need to call WSARecv again for the remainder. (Although this complicates the code, it's actually a good thing since it allows you to request larger amounts of data and receive it as it arrives.)

WSASend and WSARecv can succeed immediately. However, the completion routine will still be called later. And for WSARecv at least, "success" may only be a partial transfer, in which case you still need to block waiting for the rest.

One complication to this style of overlapped IO is that completion routines are only called when you're in an "alertable" state. There are only a handful of functions that are alertable. I used MsgWaitForMultipleObjectsEx in the message loop, and SleepEx with a zero delay in a few other places. Note: although the MSDN documentation is unclear, you must specify MWMO_AWAITABLE for MsgWaitForMultipleObjectsEx to be alertable. (And it has to be the Ex version.)

Each overlapped WSASend or WSARecv is given an WSAOVERLAPPED structure and this structure must stay valid until the completion routine is called. I ran into problems because in some cases the completion routine wasn't getting called until after the socket had been closed, at which point I'd free'd the WSAOVERLAPPED structure. I got around this be calling SleepEx with a zero delay so the completion routines would run.

When I looked at some debugging tracing I noticed that it seldom blocked for very long. So I added a 1ms SleepEx before blocking to see if the data would arrive, in which case it wouldn't need to block and incur a context switch. This eliminated some blocking, but sometimes it didn't seem to work. I realized it was probably because the sleep was getting ended by an unrelated completion routine (e.g. from the preceding write). So I added a loop to ensure it was waiting at least a millisecond and that fixed it. Of course, I'm testing with the client and server on the same machine so the latency is very low. Across the network it will be slower and will still need to block sometimes.

Although the code wasn't that complicated, it took me a while to get it working properly (i.e. fast). As always, the devil is in the details. But the end result looks good. Background socket IO now runs about 30% faster than the old WSAAsyncSelect version, and almost as fast as the foreground synchronous blocking IO.

Sunday, January 01, 2017

Java ByteBuffer Gotcha

I had a weird bug in some Java code in jSuneido that took me a while to figure out. I briefly thought I'd found a bug in the ByteBuffer implementation, although I realized that was a low probability. (Especially given the confusing nature of ByteBuffer.) In the end it makes sense, although it perhaps could be documented better.

Here's the scenario - you slice or duplicate a ByteBuffer. This makes a new ByteBuffer instance that shares its data with the original. Then you compact the original buffer. Note - this does not modify the data that you are interested in. However, it does move it.

ByteBuffer buf = ByteBuffer.allocate(8192);
for (int i = 0; i <= 300; ++i)
buf.put((byte) i);
buf.flip();
for (int i = 0; i < 12; ++i)
buf.get();
ByteBuffer slice = buf.duplicate();
System.out.println(slice.get(0));
buf.compact();
System.out.println(slice.get(0));

This will print 0 and then 12, i.e. the slice has changed, even though you didn't explicitly modify the buffer.

In retrospect it's obvious - compact does alter the buffer, which is shared with the slice. So the contents of the slice "changes".

I'd be tempted to add a warning to the documentation for compact that it will "invalidate" any existing slices or duplicates, the same way that C++ documentation points out which operations invalidate iterators. But maybe it should be obvious.

I'm mostly posting this because my searches for the problem didn't find anything. Perhaps this will help someone in the future find the problem quicker.