Thursday, March 30, 2017

Concurrency is Hard episode n+1

For almost a year a bug has been haunting me. Once every few thousand server startups, the server would hang. It wasn't totally frozen - the server monitor web page was still running and the server would still accept connections (but not service them). Simply restarting would fix it. There was no discernible pattern, customers got it at random and there didn't seem to be anything in common between them.

At first we wrote it off as bad hardware or server glitches. But it continued to happen a little too consistently for that to be the cause.

It was hard to pinpoint exactly when it started since we didn't pick up on it right away. But we knew the rough time frame. So I could look at version control and see what changes I'd made. But nothing seemed like it could cause it. I could have rolled back changes, but most of them were bug fixes so I wasn't keen to do that. And for all we knew it could have been a Windows update or a Java update or changes to the application code that triggered it.

Once we recognized we had a problem, my first thought was that it was some kind of deadlock during the startup sequence. No problem: Java's ThreadMXBean has findDeadlockedThreads. It worked great when I forced deadlocks. I scheduled it to run five minutes after startup. It didn't find anything. I scheduled it to run every five minutes. Still didn't find anything. So either it wasn't a deadlock, or the deadlock detection wasn't working. In other words, no progress.
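For anyone who hasn't used it, the check I scheduled looked roughly like this (a minimal sketch; the class and method names are mine, only the ThreadMXBean calls are the real API):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    // Returns a description of deadlocked threads, or null if none found.
    // Note: findDeadlockedThreads only detects threads blocked on monitors
    // or ownable synchronizers -- which turns out to matter later.
    static String findDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when no deadlock
        if (ids == null)
            return null;
        StringBuilder sb = new StringBuilder("deadlock detected:\n");
        for (ThreadInfo info : bean.getThreadInfo(ids, true, true))
            sb.append(info);
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = findDeadlocks();
        System.out.println(report == null ? "no deadlock" : report);
    }
}
```

Schedule something like this with a ScheduledExecutorService and log the result, and you get free deadlock reports — when the deadlock is one it can see.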

It seems terrible that we had this bug for almost a year. I did work on it, many hours in total, but it was never really urgent. An individual customer might get it once every six months and they simply had to restart. So there wasn't a lot of external pressure to fix it. It certainly "bugged" me but with no clues it was hard to even know what to try.

I studied the code. I rewrote large sections. I ran the ThreadSafe static analysis tool on it. We added logging in numerous places.

We did stress testing to try to make it happen, although it's hard to "stress test" server startup. All we could do was repeatedly start and stop the server. Of course, it never happened when we did this.

What I really needed was to examine a hung system with a debugger (e.g. jvisualvm). But we only install a JRE on our customers' servers, not the full JDK. If it had happened consistently on one customer's server we could have installed the JDK there, but it didn't. Towards the end I was starting to think about installing the JDK on all our customers' servers. I would have been happy even to make the bug worse so it happened more frequently. At least then I'd have a chance of getting it under the microscope.

Looking at the Java documentation I found you could debug core dumps. So next time we got a chance to access a hung server, we forced a core dump. But we're using a launcher to run as a Windows service, and the launcher meant the debugging tools didn't recognize the core dump. So much for that idea.

One of the problems with Java's "synchronized" keyword is that you can't put any timeouts on it. So towards the end I was considering replacing all my uses of synchronized with explicit locks with timeouts. At least then deadlocks would time out and give me a clue where they were happening. In retrospect, this would have worked to locate where the problem was.
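The replacement I was considering would have looked something like this sketch, with ReentrantLock.tryLock standing in for a synchronized block (the timeout value and names are illustrative):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLock {
    private final ReentrantLock lock = new ReentrantLock();
    private int counter = 0;

    // Like a synchronized method, but instead of blocking forever
    // in a deadlock, it gives up after a timeout -- and the resulting
    // exception's stack trace tells you where the contention is.
    void increment() {
        try {
            if (!lock.tryLock(5, TimeUnit.SECONDS))
                throw new IllegalStateException("possible deadlock acquiring lock");
        } catch (InterruptedException e) {
            throw new IllegalStateException("interrupted acquiring lock", e);
        }
        try {
            counter++;
        } finally {
            lock.unlock();
        }
    }

    int get() {
        return counter;
    }
}
```

The cost is boilerplate: every synchronized block becomes a tryLock/try/finally dance, which is why I kept putting it off.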

One of the problems was the slow turnaround. After making a change to jSuneido we would run it internally for at least a week. Then we'd roll it out to beta customers, and then finally to the rest of the customers. Then we had to wait for several days to see if the problem recurred. So after each change I'd have to wait several weeks to see if it made a difference. Of course, there was nothing stopping me from continuing to work on the problem during this time, but the natural inclination was to want to wait and see if the change had worked.

One pattern we noticed was that it seemed to be related to requests to the HTTP server during startup. We weren't aware of that the last time we stress tested, so I thought it might be something to try. I set up a server to restart every minute and arranged for a constant stream of HTTP requests. It ran all day without a hiccup. But that was only about 500 restarts, not really enough to have a good chance of detecting something that only happens every few thousand times. I normally shut down my computer at night, but what the heck, I left it running overnight.

When I went to check it in the morning I had no expectation of finding anything other than it happily running. So it took me a minute of staring at the screen to realize it had hung! Eureka! Funny to be so excited about managing to hang your software!

I was almost afraid to touch it, for fear of disturbing the hung process. I fired up jvisualvm and did a thread dump. As I'd expected from the beginning, it was a deadlock. Two threads each owning a lock that the other wants.

And still the deadlock detection didn't show anything. I realized the reason was that one of the threads was not waiting on a lock, it was "parked" in Guava cache code. The code it was in was documented as "thread safe". As Inigo says in The Princess Bride, "You keep using that word. I do not think it means what you think it means."

I briefly entertained the (attractive) thought that the deadlock was in the Guava code and not my fault. That would have been nice, although harder to get fixed. But unfortunately it was quite obvious from the thread dump that it was my code that was at fault, despite the one thread being parked in Guava code.

The problem, and one of the lessons from this bug, is that code that calls a user-supplied function is never truly thread safe because you can't control what that function does. That's also one of the problems with "functional" code: it is not amenable to static analysis.
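To illustrate the hazard without pulling in Guava, here's a sketch using the JDK's ConcurrentHashMap.computeIfAbsent, which has the same shape: a "thread safe" structure runs a user-supplied function while holding an internal lock (the cache contents and loader here are made up):

```java
import java.util.concurrent.ConcurrentHashMap;

public class CacheHazard {
    static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    // computeIfAbsent runs the user-supplied loader while holding an
    // internal lock on part of the map. If the loader blocks on an
    // application lock held by another thread that is itself waiting
    // on this cache, both threads park forever. And because a parked
    // thread isn't "waiting on a monitor", findDeadlockedThreads
    // reports nothing.
    static String get(String key) {
        return cache.computeIfAbsent(key, k -> load(k));
    }

    static String load(String key) {
        // In the deadlock scenario, this is where the user-supplied
        // code would acquire an application lock.
        return key.toUpperCase();
    }
}
```

The benign single-threaded case works fine, which is exactly why this kind of bug only shows up once every few thousand startups.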

In hindsight I realize that ThreadMXBean also has dumpAllThreads. I might have been able to write code to detect when the server hung and dump the threads automatically. That would have given me the information I needed to solve this. Of course, I had no way to know that before now.
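The dump I wish I'd had could have been produced with something like this (a sketch; the wrapper is mine, dumpAllThreads is the real API):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    // Dumps every live thread, including locked monitors and
    // synchronizers -- enough to see a deadlock that
    // findDeadlockedThreads misses, such as a thread parked
    // inside "thread safe" library code.
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(true, true))
            sb.append(info);
        return sb.toString();
    }
}
```

Trigger it from a watchdog when, say, accepted connections stop being serviced, write it to a log file, and the next hang diagnoses itself.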

When I started rewriting Suneido from C++ to Java, one of the big reasons was that we needed a multi-threaded server, and I thought Java had good concurrency support. But what it has is locks and shared state, which we've come to realize is not a great way to handle concurrency. Even back then I knew that, and I tried hard to minimize locking (which also helps performance) by keeping state thread contained and using immutable data. But I still used a certain amount of locking and concurrent data structures (which commonly use locking). And it's mostly worked out pretty well - jSuneido hasn't had a lot of concurrency problems. However, it's a bit like the lack of safety and garbage collection in C++: if you're careful you can get away with it, but sooner or later it's going to catch you.
