Thursday, January 23, 2014

jSuneido Network Bug

Our installations run a scheduler as a client. (On jSuneido this could probably just be a thread on the server, but cSuneido is single threaded.)

On Windows, when we shut down the server, the scheduler exits. On Linux, the scheduler would hang instead.

I narrowed it down to a simple test case (on OS X, which seemed to behave like Linux):
  1. start the server
  2. start a client REPL
  3. from the client, execute: ServerEval("Exit")
  4. server exits
  5. client hangs (on Linux but not on Windows)
Strangely, if you killed the server (with Ctrl+C) then the client would get an exception instead of hanging.

I assumed that the client was blocking when it tried to read the response from the ServerEval (which never came because the server had terminated), and that on Linux, for some reason, it didn't recognize that the socket was closed while it was blocked reading, although that didn't make a lot of sense.

I ran the client in the debugger and when it was hung, I paused it to see where it was. Sure enough it was in the socket read.

I searched the web trying to find anything related. There wasn't much, which was surprising. Most problems like this are documented by someone. 

But I did notice some of the code examples were checking the return value from channel.read and I wasn't. The documentation said read returns -1 when "the channel has reached end-of-stream". That didn't sound like channel closed to me, but it seemed like I should be checking it anyway.

And that was the problem. The client wasn't actually blocking on the read. The read was returning -1, but I was looping until I had read all the data, so the loop never finished, and that's what was hanging it.
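Here is a minimal sketch of the kind of check that was missing, assuming a blocking SocketChannel and a fixed-size ByteBuffer. The ReadFully class and readFully helper are hypothetical names for illustration, not the actual jSuneido code. The demo in main starts a local server that accepts the connection and closes it without sending anything, which is roughly what the client sees when the server exits.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ReadFully {
    // Hypothetical helper: fill buf from the channel, or fail if the peer
    // closes the connection before all the expected bytes arrive.
    static void readFully(SocketChannel channel, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            int n = channel.read(buf);
            if (n == -1) {
                // End-of-stream: without this check the loop spins forever,
                // because every later read also returns -1 immediately.
                throw new EOFException(
                        "connection closed with " + buf.remaining() + " bytes still expected");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // any free port
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                // Accept the connection and close it right away, sending nothing.
                server.accept().close();
                try {
                    readFully(client, ByteBuffer.allocate(4));
                    System.out.println("read complete");
                } catch (EOFException e) {
                    System.out.println("EOF: " + e.getMessage());
                }
            }
        }
    }
}
```

With the -1 check removed, the same demo never returns from readFully and pins a CPU core, since the buffer never fills and read keeps returning immediately.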

To verify, I restored the old code, made it hang, and checked the CPU usage. Sure enough, the client Java process was at 100% (because it was in a tight loop calling channel.read over and over).

I'm still not sure why killing the server behaves differently from exiting normally. I guess the socket gets closed differently. (i.e. gracefully or not)

In hindsight it seems like an obvious bug in my code (not checking the return value). I think what threw me off was that it worked fine on Windows. Java is usually pretty good at hiding operating system differences, but not in this case.
