I recently fixed a long standing (many years) bug in the C++ implementation of Suneido. A friend remarked how you'd wish that after this long all the bugs would have been found. Of course, it doesn't take much code to provide room for bugs to lurk.
The problem that was reported was that if you created one thread inside another that cSuneido would crash. It seemed to happen quite consistently and predictably. That was from the IDE. If you ran the same code without the IDE it worked fine. Or if you played around a bit in the IDE first, it would also work fine.
cSuneido "threads" aren't real threads. They are Windows "fibers" - more like coroutines. They don't actually run concurrently, but they allow cooperative multi-tasking. The big advantage is that since you control when the task switching happens and can do it at "safe" points in the code, you don't have to worry about low level concurrency issues. The downside is that you can't take advantage of multiple cpu's. But this was implemented at a time when no one had multiple cpu's and Moore's Law was still happily improving single cpu performance.
Suneido's C++ fiber code had a std::vector of fibers. It also had a main fiber, separate from the vector. The current fiber was a reference (pointer) to either the main fiber or an element of the vector.
Even from that minimal description you could probably guess the problem. Vector implementations normally grow by allocating a new larger array, copying over the data, and throwing out the smaller old array. So adding an element to a vector invalidates any references to its content. So the current fiber reference would be pointing to stale data. (It wouldn't actually be a dangling pointer because cSuneido uses garbage collection.) The reference to stale data could cause an "impossible" situation that would lead to a fatal error. (So the problem was nothing to do with creating one fiber inside another, it was simply that creating two fibers in that sequence happened to be one way to expose the bug.)
The problem was rare because it required a specific sequence of events. First, the vector had to grow. Which is why if you played around first (and expanded the vector) it wouldn't happen. Second, the stale reference had to be used in such a way that it caused a problem. Since the data would normally be identical the stale reference wouldn't matter. And the next fiber switch would update it to a valid value so the stale reference wouldn't hang around.
Actually, I think there was at least one more potential problem scenario. When fibers ended they were removed from the vector. This probably wouldn't cause a reallocation (many implementations never shrink the array) but it would invalidate any references after that item. You'd either end up with a reference to the wrong item or past the end of the array.
I'm a little embarrassed to discover such a long standing blatant mistake, and a newbie mistake at that. All the times I've looked at that code and I never picked up on it. Ouch.
But to me the real moral of the story is "don't use unsafe languages". Interestingly, this bug was not a memory management issue since cSuneido (unlike almost all C++ programs) uses garbage collection. It's just a result of C++ allowing unsafe raw pointers/references.
C++ fans would tell you that modern C++ has plenty of high level features that are "safe". But the point is that it still has lots of unsafe features. (And AFAIK there is no way to enforce use of a "safe" subset. And C++ continues to resist "real" garbage collection.) I would much rather work in a language like Java or Go (or others) that just don't allow unsafe code of this nature, and eliminate a whole class of problems. Figuring out my high level issues is challenging enough without worrying about unsafe low level issues.
The problem that was reported was that if you created one thread inside another that cSuneido would crash. It seemed to happen quite consistently and predictably. That was from the IDE. If you ran the same code without the IDE it worked fine. Or if you played around a bit in the IDE first, it would also work fine.
cSuneido "threads" aren't real threads. They are Windows "fibers" - more like coroutines. They don't actually run concurrently, but they allow cooperative multi-tasking. The big advantage is that since you control when the task switching happens and can do it at "safe" points in the code, you don't have to worry about low level concurrency issues. The downside is that you can't take advantage of multiple cpu's. But this was implemented at a time when no one had multiple cpu's and Moore's Law was still happily improving single cpu performance.
Suneido's C++ fiber code had a std::vector of fibers. It also had a main fiber, separate from the vector. The current fiber was a reference (pointer) to either the main fiber or an element of the vector.
Even from that minimal description you could probably guess the problem. Vector implementations normally grow by allocating a new larger array, copying over the data, and throwing out the smaller old array. So adding an element to a vector invalidates any references to its content. So the current fiber reference would be pointing to stale data. (It wouldn't actually be a dangling pointer because cSuneido uses garbage collection.) The reference to stale data could cause an "impossible" situation that would lead to a fatal error. (So the problem was nothing to do with creating one fiber inside another, it was simply that creating two fibers in that sequence happened to be one way to expose the bug.)
The problem was rare because it required a specific sequence of events. First, the vector had to grow. Which is why if you played around first (and expanded the vector) it wouldn't happen. Second, the stale reference had to be used in such a way that it caused a problem. Since the data would normally be identical the stale reference wouldn't matter. And the next fiber switch would update it to a valid value so the stale reference wouldn't hang around.
Actually, I think there was at least one more potential problem scenario. When fibers ended they were removed from the vector. This probably wouldn't cause a reallocation (many implementations never shrink the array) but it would invalidate any references after that item. You'd either end up with a reference to the wrong item or past the end of the array.
I'm a little embarrassed to discover such a long standing blatant mistake, and a newbie mistake at that. All the times I've looked at that code and I never picked up on it. Ouch.
But to me the real moral of the story is "don't use unsafe languages". Interestingly, this bug was not a memory management issue since cSuneido (unlike almost all C++ programs) uses garbage collection. It's just a result of C++ allowing unsafe raw pointers/references.
C++ fans would tell you that modern C++ has plenty of high level features that are "safe". But the point is that it still has lots of unsafe features. (And AFAIK there is no way to enforce use of a "safe" subset. And C++ continues to resist "real" garbage collection.) I would much rather work in a language like Java or Go (or others) that just don't allow unsafe code of this nature, and eliminate a whole class of problems. Figuring out my high level issues is challenging enough without worrying about unsafe low level issues.
3 comments:
I'm totally with you. I look at some open source projects that might be fun to help out on, but then I see they're C or C++ and I just can't handle the thought of messing with malloc for the rest of my life.
Did you know you can cause a memory leak in Java?
I can think of various ways to "use up" memory in Java but I'm not sure I'd call them "leaks". What are you thinking of?
I just discovered I have a book called Safe C++. Looking at it, it mostly convinces me of exactly the opposite, since it talks about all the ways you can screw up!
Post a Comment