Tuesday, May 04, 2010

This is what I was afraid of!

I just spent another day trying to find my jSuneido concurrency bug, with no concrete progress. That makes four days, I think, and it's definitely the trickiest bug on this project (so far). It's this kind of bug that makes me afraid of concurrency in the first place.

I made a test server and test client and at first I thought I had recreated the problem, but the fix I found only worked for the test program, not jSuneido :-( Since then my test server and client have worked perfectly, despite extending them to be more like the real jSuneido code.

I've tried synchronizing here and there and making defensive copies of this and that but the bug still happens. I've read and re-read the code. I've drawn diagrams of the sequence of events.

I'm starting to wonder if the bug is in the cSuneido client rather than the jSuneido server. But that doesn't really make sense, because we have thousands of people using the cSuneido client and we don't see this problem.

My next step is to write a simple Java client for the jSuneido server. That will eliminate the cSuneido client. I think the best bet is to try to gradually narrow down the problem and isolate it.
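As a rough idea of what such a Java test client could look like, here is a minimal sketch. It assumes a simple line-based request/response protocol, which is only a placeholder and not the actual jSuneido protocol. The names `StressClient`, `startEchoServer`, `runClient`, and `stress` are made up for illustration. Each connection sends requests tagged with its own id and counts any reply that doesn't match what it sent, which is the kind of cross-talk a concurrency bug in the server might produce. The little echo server is just a stand-in so the sketch runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class StressClient {
    /** Stand-in server (hypothetical): echoes each line back, one thread per connection. */
    static ServerSocket startEchoServer() throws IOException {
        ServerSocket server = new ServerSocket(0); // port 0 = pick any free port
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket s = server.accept();
                    Thread t = new Thread(() -> echo(s));
                    t.setDaemon(true);
                    t.start();
                }
            } catch (IOException e) {
                // server socket was closed: stop accepting
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
        return server;
    }

    private static void echo(Socket s) {
        try (Socket sock = s;
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(sock.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(sock.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null)
                out.println(line);
        } catch (IOException e) {
            // client went away
        }
    }

    /** One connection: sends tagged requests and returns the number of mismatched replies. */
    static int runClient(String host, int port, int id, int requests) throws IOException {
        int mismatches = 0;
        try (Socket sock = new Socket(host, port);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(sock.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(sock.getOutputStream(), true)) {
            for (int i = 0; i < requests; i++) {
                String msg = "client-" + id + "-req-" + i;
                out.println(msg);
                if (!msg.equals(in.readLine()))
                    mismatches++; // reply was missing or belonged to another request
            }
        }
        return mismatches;
    }

    /** Runs the given number of connections concurrently and returns the total mismatch count. */
    static int stress(String host, int port, int clients, int requests) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int c = 0; c < clients; c++) {
                final int id = c;
                results.add(pool.submit(() -> runClient(host, port, id, requests)));
            }
            int total = 0;
            for (Future<Integer> f : results)
                total += f.get(); // rethrows any failure from a client thread
            return total;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Pointing `stress` at the real server instead of the stand-in would mean replacing the echo check in `runClient` with real requests and whatever validation of the responses is possible. Having the client in Java also makes it easy to vary the number of connections and requests while trying to narrow things down.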

I don't doubt I'll find the problem eventually. The question is how many days I'll need to bang my head against the wall. Hopefully the wall crumbles before my head!

2 comments:

Larry Reid said...

I'm sure you've thought of these, but here goes:

Is it possible that there's a "bug" in the cSuneido client that would only show up with the jSuneido server?

Are there bugs in both the cSuneido client and the cSuneido server that cancel each other out (two wrongs make a right)?

Or is there a "bug" in cSuneido that only manifests when running against a server that implements real concurrency?

Andrew McKinlay said...

Good thoughts. It seems unlikely, but when you've eliminated the likely, I guess that's what you have left!

There's also potentially some difference between the Windows and OS X TCP/IP and socket implementations (like the Nagle problems I had).

Also, I haven't actually run the same test with the cSuneido server. I should try that.