Two more days working on my concurrency bug. I'm growing less and less confident that I'll "figure it out". The more I narrow it down, the less frequently it happens, which makes it more and more difficult (and time-consuming) to know whether I've "fixed" it.
I'm really reluctant to say it, but I'm starting to wonder if there's a bug in Java. I have it narrowed down to where I check the data on the server immediately before writing it and it's always ok. But every so often the client will receive garbage. (Which usually leads to a timeout because it's waiting for a newline.) I'm pretty sure it's not a problem on the client side because it only happens when the server is multi-threaded.
If this were C or C++ I'd assume I had a bad pointer and was overwriting memory, or that I was referencing memory that had been freed prematurely. But that's not supposed to be possible in Java.
There is no concurrent access in my code to the data that is being sent. I don't reuse the buffer or anything like that. I've tried using a read-only buffer, and I've also tried allocating a fresh buffer every time.
I have found a "fix". If I synchronize the channel.write on a global object, so only one thread can write at a time, then it seems to work. At least I've run it for about 10 times as long as it usually takes to get a failure.
But this "fix" makes no sense. First, channel.write is supposed to be thread safe. Second, concurrent writes (when they happen) are to separate channels; no channel is shared by multiple threads.
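To make the "fix" concrete, here is a minimal sketch of what synchronizing the write on a global object can look like. The class name `SerializedWriter` and the lock `GLOBAL_WRITE_LOCK` are made up for illustration; this is not the actual server code, just the shape of the workaround.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper: funnels every channel write through one shared lock.
final class SerializedWriter {
    // Hypothetical process-wide lock. Because it is shared by all threads,
    // at most one thread can be inside channel.write at any moment, even
    // when the threads are writing to different channels.
    private static final Object GLOBAL_WRITE_LOCK = new Object();

    private SerializedWriter() {}

    // Performs a single write while holding the global lock and returns the
    // number of bytes written. On a non-blocking channel this may be fewer
    // than buf.remaining(); the caller still has to deal with the remainder.
    static int write(WritableByteChannel channel, ByteBuffer buf) throws IOException {
        synchronized (GLOBAL_WRITE_LOCK) {
            return channel.write(buf);
        }
    }
}
```

The obvious cost is that the lock serializes all output, so one slow write holds up every other thread that wants to write.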
It could be a concurrency bug in Java or the OS, but why would I be the only one to encounter it?
Of course, this change may not be directly related to the problem. It may just be altering the timing enough to make the bug not happen, or to happen so rarely I'm not seeing it.
I've gone back and re-read all the NIO material I can find on the web. Despite a certain amount of conflicting advice I think I have a pretty good idea how it works, how to use it, and the common errors. I even read the arguments against using NIO (which I'm sympathetic to, given my problems).
Although it seems to be working, I'm not comfortable leaving it as it is. Since I don't know the source of the problem, or why/how the "fix" works, the bug could reappear at any time. But I'm also running out of things to try. It's narrowed down to only a few hundred lines of fairly straightforward code, most of which is identical to examples found on the web.
So one option is to abandon NIO and use a simple thread-per-connection model. According to some reports, this would be faster for the number of connections I expect (up to several hundred).
Or I could use someone else's framework. The contenders seem to be Apache Mina and JBoss Netty (see Netty vs Mina).
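For comparison, the thread-per-connection option needs very little code. The following is a rough sketch under some assumptions: the class name `ThreadPerConnectionServer` is hypothetical, the protocol is taken to be newline-terminated lines, and the handler simply echoes each line back as a placeholder for real request handling.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical server: one blocking thread per accepted connection.
final class ThreadPerConnectionServer implements Runnable {
    private final ServerSocket listener;

    ThreadPerConnectionServer(ServerSocket listener) {
        this.listener = listener;
    }

    @Override
    public void run() {
        try {
            while (true) {
                Socket socket = listener.accept(); // blocks until a client connects
                Thread worker = new Thread(() -> handle(socket));
                worker.setDaemon(true);
                worker.start();
            }
        } catch (IOException e) {
            // accept() throws once the listener is closed; treat that as shutdown.
        }
    }

    // Each connection is read and written by exactly one thread, so there is
    // no selector and no state shared between connections.
    private static void handle(Socket socket) {
        try (Socket s = socket;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
             BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line); // placeholder: echo the request line back
                out.write('\n');
                out.flush();
            }
        } catch (IOException e) {
            // The client went away; nothing else to clean up in this sketch.
        }
    }
}
```

Starting it is a matter of `new Thread(new ThreadPerConnectionServer(new ServerSocket(port))).start()`, where `port` is whatever the server listens on. With a few hundred connections that means a few hundred mostly idle threads, which is the trade-off against a selector-based design.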
I think I'll probably give Netty a try and see how it goes. But it's frustrating not being able to figure it out - I hate "giving up"!