Friday, April 30, 2010

jSuneido Concurrency Bug

I spent all day yesterday working on my concurrency bug in jSuneido. (see previous post)

It didn't seem like I made any progress, but I guess that's not true - I eliminated a few possibilities.

Originally I assumed it was a deadlock since it was "freezing". So I managed to make the problem happen (that takes a while), connected JConsole, and ran the deadlock detector. It didn't find any deadlocks. Thinking about it, that makes sense since it's really only the client that gets "stuck" - the server thinks everything is fine. There are no worker threads running for that client.
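For reference, a similar check can be run from code through the standard java.lang.management API, which is handy when attaching JConsole at the right moment is awkward. This is only a minimal sketch: the class and method names (DeadlockCheck, findDeadlocks) are made up for illustration and are not part of jSuneido.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not jSuneido code.
final class DeadlockCheck {
    /** Returns one description per deadlocked thread, or an empty array if there are none. */
    static String[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both monitor locks and java.util.concurrent locks;
        // returns null when no threads are deadlocked.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null)
            return new String[0];
        ThreadInfo[] infos = mx.getThreadInfo(ids);
        String[] result = new String[infos.length];
        for (int i = 0; i < infos.length; i++)
            result[i] = infos[i] == null
                    ? "(thread no longer alive)"
                    : infos[i].getThreadName() + " waiting on " + infos[i].getLockName();
        return result;
    }
}
```

Like the JConsole detector, this only reports threads blocked on each other's locks. A client waiting on a response that the server never sends will not show up here, which matches what I saw.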

So much for that assumption. I tried to narrow it down, without much success. I added a bunch of assertions, which failed, but I found it was my assumptions that were wrong, not the code.

The client appears to be waiting for a response to a request. But the server is not in the process of handling a request - it either didn't get the request, or it thinks it finished it. The read and write buffers for that client are empty.

Sometimes you have to run the tests for a long time before the problem shows up. So you try something, and the longer the tests run successfully, the more you think you've got it fixed. But eventually it has always shown up. I guess I should be thankful this is on the order of minutes rather than hours! Even so, it's reminiscent of the "old" days when you had to have something to read while you waited for compiles. (And no internet to browse back then.)

On the positive side, the handling for inactive transactions and clients seems to be working well - when the client gets stuck, the server eventually aborts its transactions and if it stays stuck, closes the connection.

My current hypothesis is that it's related to a potential overlap between handling requests. As an optimization, worker threads first attempt to directly write their response. This is in non-blocking mode so it may or may not write everything. Any remaining data is sent by the select thread when the socket becomes write ready. But if the worker does successfully write the entire response, and then there is a context switch back to the "select" thread, the next request can be read and another worker started, before the previous worker has a chance to finish. I'm not sure why this would be a problem, but my assumption when I was writing the code was that workers for the same client would not overlap, so I didn't worry about making the data structures thread safe. That's the next area I want to look at.
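To make the write path concrete, here is a rough sketch of the pattern described above: the worker tries a direct write, and whatever is left over is queued for the select thread. This is not the actual jSuneido code. ClientWriter, respond, flush, and the requestWriteInterest callback are hypothetical names, and the synchronized methods are just one conservative way to protect the queue that both threads touch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-client write path, not jSuneido's real classes.
final class ClientWriter {
    private final WritableByteChannel channel;   // assumed to be in non-blocking mode
    private final Runnable requestWriteInterest; // assumed to register OP_WRITE and wake the selector
    private final Deque<ByteBuffer> pending = new ArrayDeque<>();

    ClientWriter(WritableByteChannel channel, Runnable requestWriteInterest) {
        this.channel = channel;
        this.requestWriteInterest = requestWriteInterest;
    }

    /** Called by a worker thread. Returns true if the whole response went out directly. */
    synchronized boolean respond(ByteBuffer response) throws IOException {
        if (pending.isEmpty()) {
            channel.write(response); // non-blocking: may write all, some, or none
            if (!response.hasRemaining())
                return true; // the client can now send its next request
        }
        // Keep ordering: anything not written waits behind earlier leftovers.
        pending.add(response);
        requestWriteInterest.run();
        return false;
    }

    /** Called by the select thread when the socket is write ready. Returns true once drained. */
    synchronized boolean flush() throws IOException {
        while (!pending.isEmpty()) {
            ByteBuffer buf = pending.peek();
            channel.write(buf);
            if (buf.hasRemaining())
                return false; // socket is full again; wait for the next write-ready event
            pending.remove();
        }
        return true;
    }
}
```

The window I'm suspicious of is right after respond returns true: the response is fully on the wire, so the select thread can read the next request and start a second worker while the first worker is still cleaning up. The locking here only covers the write queue; any other per-client state the workers share would need its own protection.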
