Tuesday, September 22, 2015

A Bug Hunter's Life

I recently spent two weeks tracking down a single bug. Needless to say, it was a frustrating process. Like most such things, what made it difficult was not a single problem, it was a whole cascade of errors, including many during the debugging process. I like to think I'm fairly organized and methodical, especially when working on a tough bug. I take notes on what I'm doing, what my assumptions are, what I've tried, what the results were. And yet I still managed to make multiple major mistakes and go a long way down the wrong paths based on false assumptions I should have questioned much sooner.

It started with a customer getting a corrupted database. Most of the time this is from hardware issues or crashing. But this one appeared to be caused by a bug in the code.

It wasn't easy to look into because all I had was the corrupt database, with no clues as to how it might have gotten that way. Thankfully, jSuneido's append-only database means that the database itself forms a kind of log. Based on that I could see that two transactions had committed conflicting updates - something that the whole transaction system is designed to prevent.

It seemed a pretty safe bet that this was a concurrency related issue - my least favorite kind of bug :-( I wasn't surprised that I couldn't make it happen. With roughly 1000 systems running 24x7 this was the only case I was aware of. So my sole sources of information were that database and the code itself.

In order to detect conflicts at commit time (jSuneido uses optimistic concurrency control) you need to know the outstanding overlapping update transactions (i.e. ones that had started after this one and had not completed yet). Perhaps the bug was in determining this set of transactions? I wrote a completely new implementation of this code and used it to cross-check.

We deployed this version to our customers and pretty quickly started to get lots of errors. But only a couple were from the new cross-check. The rest were from code I hadn't touched. And yet they quite clearly started when the new version was deployed. Strange.

The errors indicated that sometimes when it went to complete a transaction it would be missing from the list of outstanding ones. Again, this shouldn't be possible. That would explain why the overlapping set would be wrong, so it made a certain amount of sense. I spent days studying the code and trying (unsuccessfully) to figure out how it was happening.

This turned out to be completely off track - it was a just problem with my new cross checking code.

The stack traces showed a transaction commit that turned into an abort. That happens if an update transaction doesn't end up doing any updates. So I wrote a bunch of tests with those kinds of transactions in the mix, but they all worked fine. If nothing else, the code was getting reviewed and tested pretty heavily.

Finally, after wasting days on this, I realized there was another way for a commit to turn into an abort. The relevant part of the code looked like:

try {
     // commit stuff
} finally {

What was happening was that an exception during the commit stuff would get overridden (and lost) if there was an error in abortIfNotComplete(). So I wasn't seeing the original error at all. And the reason that the transaction was missing from the list was that it had got far enough into the commit to remove it.

I'm not sure why I wrote it using finally. In hindsight it was obviously a bad idea. I rewrote it like:

try {
     // commit stuff
} catch (Throwable e) {
     try {
     } catch (Throwable e2) {
          throw e;

addSuppressed was added in Java 7. It is used, for example, in try-with-resources for exceptions when closing after another exception.

Part of the reason I didn't realize which path was being taken in the code was that I didn't have line numbers in the stack traces.  That was because in the Ant build task "debuglevel=vars" had been added. The intent had been to increase the amount of debugging information. But because of the way Ant and the Java -g option work, this had actually meant only variable information. I changed it to "debug=true" which becomes "-g" which means all debug information (file, line, and vars). Once this change percolated through and I started to get stack traces with line numbers, then I could see which path was being taken in the code.

One big step forward was when I was finally able to recreate the bug. Running random updates (of certain kinds) in a bunch of threads would encounter the bug within a few seconds. It wasn't the ideal recreation because the bug would occur buried within a lot of other activity, so it still wasn't possible to determine exactly what was happening.

I couldn't be 100% sure that what I was able to recreate was the same bug, but the end result was more or less identical - two transactions deleting the same record. But confusingly, the errors that I saw were about duplicate records being output, leading me to spend a bunch of time looking at the output code.

A huge mistake that I made repeatedly, without realizing it, was that I was building one variation of the jSuneido jar, and then testing with a different out of date one. So I wasn't actually testing with the changes I was making. This was because, for faster turnaround, I wasn't running a full build, I was just building the main jar that we deploy. But I was testing using the jar with Win32 support so I could use the IDE. This explained the totally baffling behavior where the debugging output that I was 100% certain should be triggered didn't appear at all. Which confused me and made me think I was totally wrong in my ideas about what was happening.

This mistake cost me about 4 days. I actually had the bug fixed, but it kept happening because I was testing with the old jar. Of course, since the bug still happened I assumed that I had not located it yet and I went looking for it in other places where, of course, I didn't find it.

Even worse, I thought my method of recreating the bug was quite "fragile". For instance, it would not happen running in the Eclipse debugger. Coming from C and C++, I didn't find that too surprising. And although Java should be more predicable, there would still be timing and threading differences which could affect concurrency. But I was totally wrong - the reason I couldn't recreate it within Eclipse was that I was running the latest code, which had the fix. Arghhh!

The actual bug and fix are obvious in hindsight. Throughout the long process I suspected that would be the likely result. But that didn't make it any easier, in fact it made it even more frustrating.

jSuneido's database uses optimistic concurrency. That means concurrent transactions proceed independently without any locking. Then when it comes time to commit, they verify that nothing has happened that would conflict. For example, they might find that another transaction has output a duplicate key. In which case the transaction will abort due to the conflict. Or it might find that another transaction has deleted a key that we also deleted. And that's where the bug was. I had made the assumption (back when I wrote this code) that it was ok if two concurrent transactions deleted the same key. After all, the end result is the same. And it's still serializable (i.e. the result is as if the transactions had happened one after another instead of concurrently).

The problem is not the deletes by themselves. If that was all the transactions were doing, my logic was correct. But transactions aren't that simple. They perform other actions based on whether the delete succeeded. For example, deleting an old version of a record and outputting a new version.  If you allow two transactions to think they deleted the old version, then they will, incorrectly, both output a new version. (leading to the duplicate outputs I was seeing)

All I had to do to fix the bug was to make duplicate deletes a conflict, like I already did with duplicate outputs. Two weeks for one line of code. How's that for productivity.

One interesting part of this whole adventure was that although it was a concurrency bug, it wasn't the kind of subtle interaction problem that I really fear. It was, in hindsight, a fairly clear, obvious issue.

Wednesday, July 15, 2015

Problem with Eclipse Mars on Windows after Java Update

I installed the new Java update on my Windows work machine, and removed the old version.

When I tried to run Eclipse (Mars) it said it couldn't find a JRE at the path of the old version of Java.

That made sense because I'd removed it. But why was it specifically looking for that path?

When I'd updated Java in the past Eclipse had found the new version fine.

I found that in the eclipse.ini file it had a -vm option set to the old version of Java. I removed this and now Eclipse starts fine.

I wonder if this was a result of installing the recently released Eclipse Mars with the Oomph installer?

Note: The Oomph installer puts Eclipse in your user folder (e.g. c:/Users/andrew in my case) rather than in Program Files. This may be a workaround to not require admin privileges to install.

Monday, July 13, 2015

Visual Studio Tip

I've been trying out Visual Studio 2015 Community RC and every time I did a build I'd get:

All packages are already installed and there is nothing to restore.
NuGet package restore finished.

It didn't seem to hurt anything, but I prefer to keep the build output clean so eventually I got tired of it and figured out you can get rid of it using Tools > Options

Of course, that's assuming you don't need to check for missing packages during builds.

Tuesday, June 09, 2015

Transpiling Suneido to JavaScript

Transpiling is source to source compiling, translating from one language to another, where both languages are at a similar level of abstraction.

For the background on why we'd want to do this, see my previous blog post on suneido.js

Transpiling is still compiling, and unless the translation is trivial, you need to parse the source language. I didn't want to write another Suneido parser, especially since I already had one in C++ in cSuneido and another in Java in jSuneido. cSuneido emits byte code as it parses, making it fast, but not very reusable. But jSuneido's parser builds an abstract syntax tree (AST) which is exactly what I needed.

One option would have been to write the transpiler in Java, as part of jSuneido. But I decided to write it in Suneido, for easier development and so it would be more accessible to Suneido programmers.

I added an AstParse function to jSuneido which takes a source string and returns an AST. Instead of converting the entire AST to Suneido objects I wrapped the internal AST and converted it lazily. [aside - I'm also hoping that we can use this in the IDE, e.g. for refactoring tools]

The big issue is deciding how to map from the Suneido language to JavaScript. Some things are easy, for example Suneido strings could be JavaScript strings. But other Suneido data types don't map directly. Even numbers are different since Suneido uses decimal floating point (for accurate business math) whereas JavaScript has binary floating point. Operations also differ so they have to be implemented as calls to a runtime support library. Suneido is also both more strict and more flexible with function arguments so that also requires runtime support.

So far I have statement and expression translation almost complete, and minimal versions of the runtime support routines (see: https://github.com/apmckinlay/suneido.js)

Here's a screenshot from a basic web page (running from a jSuneido server) that demonstrates the translation:

Sunday, June 07, 2015

Simplest CodeMirror

Maybe it's just my relative inexperience with JavaScript but I struggled a bit to get a simple example of CodeMirror to work. The examples in the documentation and the download are all partial snippets and it wasn't obvious how to use them. In hopes of saving someone else some time, here's what I came up with. (Of course, it is trivial in retrospect!)

I was originally going to share a JSFiddle, but it does so much of the boilerplate for you, that it defeats the purpose! But you can run it there.

Friday, May 29, 2015


suneido.js = Suneido + JavaScript

After getting back from my last traveling I mentioned to a friend that I was looking forward to getting back to looking into a Go version of Suneido. Not that I’d committed to it, but it seemed like a reasonable path to take since it had the potential to replace both the C++ cSuneido client and the Java jSuneido server.

But my friend pointed out that what we really needed was a web front end. Which he’s told me before, but for some reason this time I paid attention. It’s easy to get into a rut, to make decisions unconsciously based more on what is familiar than on what’s really best. Not that you ever really know what the best path is, but a Go implementation of Suneido wouldn’t really change the game. It’d just be an incremental improvement.

Currently our front end GUI is Windows specific. So to access our application software over the internet you have to use RDP (Remote Desktop Protocol), which requires a Windows server. So even though jSuneido will happily run on Linux, we can’t take advantage of that.

And although RDP works fairly well, our Windows GUI is not the look and feel of the web that people expect these days. We’re starting to get comments that our program looks dated. (Reminds me of when we moved from terminal mode MS-DOS to Windows. And yes, I have been in this business that long!)

If all we wanted was to write web software, there are plenty of languages and tools and frameworks to do that. But we have a million lines of complex Suneido code. Rewriting that in another language for another database is not something I even want to think about!

And therefore the idea of somehow running our existing code, with the least amount of changes, on a web front end.

One way to do that would be to rewrite just the front end in HTML + CSS + JavaScript. But that would mean a certain amount of duplication since some code has to run on the front end and the server. And it would mean getting all our programmers up to speed in HTML, CSS, and JavaScript as well as Suneido.

Suneido’s philosophy (right or wrong) has always been to provide an integrated “everything included” platform that shields programmers from the hassles and complexities of dealing with multiple languages and databases and frameworks. That shielding has also made Suneido a ery stable platform. I can’t think of too many languages or databases or frameworks where code you wrote 15 years ago would work unchanged today. (Java itself has been around that long, but that’s just the language piece of the puzzle.)

So what I really wanted was an approach that would encapsulate and hide the HTML, CSS, and JavaScript from application developers, like Google GWT does. (Of course, you’d need to know that stuff to work on the lower level implementation.) In my mind, that breaks down into two pieces.

The easy part is to be able to compile/translate Suneido code to JavaScript. I’ve been looking into that, and have a lot of it implemented. More on that in another blog post.

The harder part is to figure out how to map our Windows oriented GUI to the browser. At a general level that’s not that hard. Our user interface is “component” based with screens composed from a small number of basic building blocks. It shouldn’t be too hard to obtain or write equivalent “widgets” for the web. On the other hand I’m certain that some of the details will be painful.

One of the issues is that our current user interface is very chatty. It was developed on and for local area networks with low latency. So it uses a lot of lower level requests to the server. If we stayed on local area networks we could stick with the same model, but really we want to be able to run across the internet, where you have much higher latency. Bandwidth is also a factor, but even if you have really high bandwidth the latency will kill you if you’re doing too many little requests.

Talking about this with some of my programmers, one of the questions was whether we would use node.js for the server. I think they were thinking that the whole of Suneido would be re-written in JavaScript, similar to how cSuneido is written in C++ and jSuneido is written in Java. But it wouldn’t make sense to write Suneido’s database in JavaScript. The server can still run the existing jSuneido. It’s perfectly capable of acting as a web server and at the same time providing the database back end. The server back end would still be running Suneido code (compiled to JVM byte code). Only the front end user interface code would be translated to JavaScript to run on the browser.

I wouldn't say I've committed to this direction, but it seems worth looking into it. The big question is how much of our existing code we'd be able to use, and how much revising / rewriting we'd have to do.

Tuesday, May 19, 2015

Stonebraker on Databases

I recently listened to a podcast on Software Engineering Radio with database expert Michael Stonebraker. (at the recommendation of a friend - thanks Larry) It's a few years old, but still quite relevant.

As usual, I relate much of what I read and hear about to Suneido. In this case it was interesting to see how Suneido's database held up to Stonebraker's criticism of current conventional relational databases.

He talks about profiling a database and finding that 90% of the time was spent on "overhead". The time consisted of four main areas:

buffer pool management
Keeping a cache of database pages, page replacement tracking, and converting from external (disk) record format to internal (in-memory) format. Since most transactional (OLTP) databases now fit in main memory, this work is even more wasteful.

record locking
Most conventional relational databases use pessimistic row level read and write locks to ensure that transactions are atomic, consistent, and isolated (i.e. ACID). Managing locks and waiting to acquire locks can be slow.

thread locking
Most databases are multi-threaded which, in most cases, means they use locking to protect shared data structures. Locking limits parallelism.

crash recovery write-ahead logs
For durability (the last part of ACID) databases commonly use a write-ahead log where changes are written (and in theory flushed to disk) prior to updating the actual database. In case of a crash, the log can be used for recovery. But writing the log is slow.

So how does Suneido do in these areas?

Instead of a buffer pool, the database is memory mapped. This handles databases that fit into memory as well as ones that don't. When they don't the page replacement is handled by the operating system which is in a good position to do this efficiently. Modern operating systems and hardware already have so many layers of caching and buffering that it seems crazy to add yet another layer of your own!

Suneido also does as much work as possible using the external format of records, only converting to internal format when necessary. For example, most database "wheres" are done in external format. The encoding of values into external format maintains ordering, so sorting can also be done while still in external format.

Rather than pessimistic record locking, Suneido uses optimistic multi-version concurrency control in conjunction with an append-only "immutable" database. This means that read transactions do not require any locking and do not interact with update transactions. Write transactions only require locking for the actual commit.

Suneido's database server  is multi-threaded with requests handled by a pool of threads. But most of the internal data is in immutable persistent data structures, which require minimal locking. And the database itself is an immutable persistent data structure requiring minimal locking. (I minimized the locking as much to avoid bugs as for performance.)

Finally, Suneido doesn't use a write-ahead log. Instead, its immutable append-only design makes it a type of log structured database, where the database itself can act as the log.

NOTE: This is based on the newer Java implementation of Suneido which has a different database engine than the older C++ version.

I haven't benchmarked Suneido's database against other systems so I can't make any claims about speed. But in terms of avoiding most of the overheads that Stonebraker identifies, Suneido seems to hold up pretty well.