Monday, November 30, 2009
Technology Review: Software That Fixes Itself
Cool but a little scary - will the software start to evolve?
Tuesday, November 24, 2009
jSuneido Socket Server
Up till now I've been using Ron Hitchens' NIO socket server framework. It has worked pretty well, but it's primarily sample code. As far as I know it's not in production anywhere and not really maintained.
The first problem I ran into with it was that it didn't use gathering writes, so it was susceptible to Nagle's algorithm delaying small writes. I got around that with setTcpNoDelay, although that's not the ideal solution.
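In case it's useful, here's what that workaround looks like; a minimal sketch, assuming the framework hands you a standard java.nio SocketChannel for each accepted connection:

```java
import java.io.IOException;
import java.nio.channels.SocketChannel;

// Disable Nagle's algorithm on a connection so small writes are sent
// immediately instead of being held back waiting for more data.
static void disableNagle(SocketChannel channel) throws IOException {
    channel.socket().setTcpNoDelay(true);
}
```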
Another problem I ran into was that the input buffer was a fixed size. And worse, it would hang up in an infinite loop if it overflowed. To get around this I made the buffer big, but again, not an ideal solution.
And lastly, everything sent or received had to be copied into or out of buffers maintained by the framework, rather than used directly.
So I decided to bite the bullet and write my own. It took me about half a day to write and it's roughly 180 lines of code. It's not as flexible as Ron's, but it does what I need: gathering writes, unlimited input buffering, and the ability to use the buffers directly without copying. It's fairly easy to use - there's a simple echo server example at the end of the code. I wouldn't want to have to write it with just the Sun Java docs to go by, but with the examples in Ron's book, Java NIO, it's not too bad.
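That echo server isn't reproduced here, but to give a flavor of what a small NIO server involves, here's a bare-bones sketch (not the actual jSuneido code) built around a selector loop, using a gathering write to send a header and body in a single call:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

// Bare-bones NIO echo server: one selector loop handling accepts and reads.
public class EchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(1234));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        while (true) {
            selector.select(); // block until a channel is ready
            Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
            while (iter.hasNext()) {
                SelectionKey key = iter.next();
                iter.remove();
                if (key.isAcceptable()) {
                    SocketChannel channel = server.accept();
                    channel.configureBlocking(false);
                    channel.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel channel = (SocketChannel) key.channel();
                    ByteBuffer data = ByteBuffer.allocate(4096);
                    if (channel.read(data) == -1) {
                        channel.close(); // client disconnected
                        continue;
                    }
                    data.flip();
                    ByteBuffer header = ByteBuffer.wrap("ECHO: ".getBytes());
                    // gathering write: both buffers go out in one call,
                    // avoiding a separate small write for the header
                    channel.write(new ByteBuffer[] { header, data });
                }
            }
        }
    }
}
```

(A real server would also register for OP_WRITE and finish any partial writes, since a non-blocking write isn't guaranteed to send everything.)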
Of course, there may still be bugs in it, but it seems to work well so far.
Thursday, November 19, 2009
jSuneido Back on Track
After my last post I spent a full day chasing my bug with very little progress. Around 7 pm, just as I was winding down for the day, I found a small clue. It didn't seem like much, but it was nice to end the day on any sort of positive note.
This morning, using the clue, I was able to find the problem. It didn't turn out to be a low-level synchronization issue; it was a higher-level logical error, although still related to concurrency. That explained the consistency of the error. I had missed one type of transaction conflict, which meant that under certain circumstances one transaction would overwrite another. The fix was easy (two lines of code) once I figured it out.
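The post doesn't show the actual fix, but to illustrate the kind of check that was missing, here's a hypothetical sketch (all names invented, not the real jSuneido code) of write-write conflict detection: a transaction may not commit if a record it wrote was committed by another transaction after it started.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical write-write conflict check, for illustration only.
// Records are identified by long offsets; commits are ordered by a
// sequence number.
class ConflictChecker {
    // record offset -> sequence number of the last commit that wrote it
    private final Map<Long, Long> lastWrite = new HashMap<Long, Long>();

    // Returns false if committing these writes would silently overwrite
    // an update committed after this transaction started.
    synchronized boolean tryCommit(long startSeq, long commitSeq, Set<Long> writes) {
        for (Long rec : writes) {
            Long committed = lastWrite.get(rec);
            if (committed != null && committed > startSeq)
                return false; // conflict: another transaction got there first
        }
        for (Long rec : writes)
            lastWrite.put(rec, commitSeq); // record our writes
        return true;
    }
}
```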
Even with the clue, it wasn't exactly easy to track down. I ended up digging through a 100,000-line log file. Luckily I wasn't just reading through it; I was searching for particular things. It was a matter of finding the particular 50 lines where the error happened. After that it was fairly obvious.
Since fixing the bug I've run millions of iterations of a variety of scenarios for as long as 30 minutes with no problems. This evening I'll let it run for a couple of hours. I'll also think up some additional testing scenarios - there are still a few things that I'm not exercising.
Cleaning up the code before sending it to version control, I found an entire data structure (a hash map of transactions) that wasn't being used! I was carefully adding and removing from it, but I never actually used it. I must have used it at some point. So I removed it and everything worked the same. Humorous.
I don't want to be overly optimistic, I'm sure there are still bugs (there always are), but it's starting to feel like I'm over the worst of it.
Wednesday, November 18, 2009
Offsite Sync and Backup
I have a large amount of music (~30 GB) and photo files (~300 GB). I back them up to my Time Capsule but that wouldn't protect me if my house burnt down. (Photo files from my Pentax K7 are 20 MB each and I might take 10,000 in a year - that's 200 GB added per year.)
So for an off-site backup, and so I can access the files, I keep a "mirror" copy on my computer at work. Currently I update this mirror manually, by periodically copying new files to a portable hard drive and carrying that to work. But this is an awkward solution, and I don't update as often as I should.
There are a variety of backup and sync products out there, but none of them seem to handle this scenario.
I have been using Dropbox to sync my jSuneido files between home, work, and laptop, and it works really well. But their biggest account is 100 GB.
Google's storage is getting cheaper, but Picasa won't let me store my big DNG (raw) photo files.
Jungle Disk has unlimited storage, but at $0.15 per GB that's roughly $50 per month for my files, which isn't cheap.
Apart from the cost, the big problem with online storage is that uploading 300 GB takes a long time. I signed up for Jungle Disk but it estimated 60 days to upload my files! Obviously, after that I'd only have to upload new files, but even a few thousand photos from a long holiday would take days or weeks to upload. Maybe I need a faster internet connection!
CrashPlan has a really interesting approach of letting you back up to other machines, either your own or your friends'. This avoids the cost of storage, and the upload speed may be better since the machines are local and aren't servicing other users. But CrashPlan doesn't sync, so I'd have an off-site backup, but I couldn't access the files (without restoring them). Another problem with CrashPlan is that it requires both machines to be turned on at the same time, and to be environmentally friendly, I try to turn off my computers when I'm not using them.
Note: Jungle Disk only recently added sync, and from their forum it sounds like it has problems.
A Proposed Solution
Here is an idea for a new service.
I don't really need a copy of my files in the cloud. If I could sync between my home and work computers that would be sufficient. I don't really want to be paying $50 per month just to store my files in the cloud.
All I really need to store in the cloud is a "summary" of my files (e.g. file names, dates, sizes, maybe hashes) plus any new or modified files. Once the files have propagated to my computers they can be removed from the cloud. If you used a clever hash scheme you could even do partial updates of large files. (Although for music and photos this isn't that important since the files don't usually change.)
This would require far less storage than keeping a complete copy in the cloud.
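As a rough illustration of what that "summary" might look like, here's a sketch (names and format invented) that walks a directory and emits one line per file with path, size, modification time, and a SHA-256 content hash; two machines could diff these summaries to decide which files need to move:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;
import java.security.MessageDigest;

// Builds a simple per-file summary: path, size, mtime, content hash.
public class Manifest {
    public static void scan(File dir, PrintWriter out) throws Exception {
        File[] children = dir.listFiles();
        if (children == null)
            return; // not a directory, or unreadable
        for (File f : children) {
            if (f.isDirectory())
                scan(f, out);
            else
                out.printf("%s\t%d\t%d\t%s%n",
                        f.getPath(), f.length(), f.lastModified(), hash(f));
        }
    }

    static String hash(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        InputStream in = new FileInputStream(f);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1)
                md.update(buf, 0, n);
        } finally {
            in.close();
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest())
            sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }
}
```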
You'd still have the problem of the initial syncing. But that could either be done by a different method, e.g. a portable hard drive like I've been using, or by requiring both computers to be running at the same time for the initial sync. This is similar to Amazon allowing you to send them physical media to load data into S3. And if you had a big addition of files (like the photos from a long holiday) you could use an alternate method to move them around, and the sync could recognize that you already had the same files on each computer.
The businesses that make money from selling storage probably wouldn't be crazy about this idea, but it seems like a natural addition to CrashPlan since they aren't charging for storage, and charging for the sync service would be additional revenue. And presumably it could be cheap since the storage and bandwidth needs are minimal. (The actual data would be transferred peer to peer.)
You could even borrow some ideas from Git - its "tree" of hash values would work well for this, and also provides security and error checking.
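For instance, a Git-style tree hash could be added to the Manifest sketch above: a directory's hash covers its children's names and hashes, so two machines can compare whole trees top-down and only descend into subtrees whose hashes differ. Again just a sketch; hash() is the helper from the previous example:

```java
// Merkle-style directory hash, as in Git's tree objects.
static String treeHash(File dir) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    File[] children = dir.listFiles();
    java.util.Arrays.sort(children); // deterministic order
    for (File f : children) {
        String h = f.isDirectory() ? treeHash(f) : hash(f);
        md.update((f.getName() + ":" + h + "\n").getBytes());
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest())
        sb.append(String.format("%02x", b & 0xff));
    return sb.toString();
}
```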
If I had some spare time it would be a fun project. If anyone out there wants to implement it, you can count me in as your first customer :-)
Immutable and Pure
More and more I find myself wanting a programming language where I could mark classes as immutable and functions as pure (no side effects) and have this checked statically by the compiler. Being able to mark methods as read-only (like C++ const) would also be nice.
This is coming from a variety of sources:
- reading about functional languages like Haskell and Clojure
- working on concurrency in jSuneido (immutable classes and pure functions make concurrency easier)
- problems in my company's applications where side-effects have been added where they shouldn't
I have been using the javax annotation for Immutable, which in theory can be checked by tools like FindBugs; that's a step in the right direction.
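For the curious, this is the sort of thing I mean; a minimal example using the JSR-305 @Immutable annotation (assuming the jsr305 annotations jar is on the classpath). The annotation is purely advisory, but a checker can flag violations such as a non-final field:

```java
import javax.annotation.concurrent.Immutable;

// All fields final, no setters: instances can be shared freely
// between threads without synchronization.
@Immutable
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int x() { return x; }
    public int y() { return y; }

    // effectively a "pure" method: no side effects, just a new value
    public Point translated(int dx, int dy) {
        return new Point(x + dx, y + dy);
    }
}
```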
There are a lot of new languages around these days, but so far I haven't seen any with these simple features. Of course, in a "true" functional language like Haskell, "everything" is pure and immutable (except for monads), so this doesn't really apply. But I think for the foreseeable future most of us are going to be using a mixture.
Tuesday, November 17, 2009
To Laugh or To Cry?
I sat down this morning to write more concurrency tests for jSuneido, fully expecting to uncover more bugs. Amazingly, everything worked perfectly. I have to admit I was feeling pretty darn good; I was almost ready to claim victory. But as the saying goes, pride goes before a fall.
It was time for coffee so I figured I might as well let the tests run for a longer period. I came back to find ... the exact same error I've been fighting for the last week or more! I wouldn't have been surprised to uncover different bugs, but I could have sworn I had squashed this one.
It's bizarre that I keep coming back to this exact same error. I would expect concurrency errors to be more random. Even for a single bug I would expect it to show a variety of symptoms. I guess I shouldn't be complaining; consistency is often helpful when debugging.
I've obviously reduced the frequency of the error. I just hope I can get it to occur in less than 10 minutes of testing. Otherwise it's going to be a very slow debug cycle and I'll have lots of time to review the code!
So am I laughing or crying? Mostly laughing at the moment, but ask me again after I've spent a bunch more hours (or, heaven forbid, days) struggling to find the problem.
Monday, November 16, 2009
How Can This Be Acceptable?
I recently downloaded the latest version of the SciTE programming editor. Subsequently, every time I ran it I got Windows security warnings. There's a check box that implies it will let you stop these warnings, but as far as I can tell it has no effect. I have no idea why the previous version ran without any warnings.
I eventually got these instructions to work:
1. Right-click the file and select Properties.
2. Click on the Security tab.
3. Click Advanced in the lower right.
4. In the Advanced Security Settings window that pops up, click on the Owner tab.
5. Click Edit.
6. Click Other users or groups.
7. Click Advanced in the lower left corner.
8. Click Find Now.
9. Scroll through the results and double-click on your current user account.
10. Click OK to all of the remaining windows except the first Properties window.
11. Select your user account from the list up top and click Edit.
12. Select your user account from the list up top again and then, in the pane below, check Full control under Allow, or as much control as you need.
13. You'll get a security warning; click Yes.
14. On some files that are essential to Windows, you'll get an "Unable to save permission changes. Access is denied." warning, and to the best of my knowledge there's nothing you can do about it.
15. Reconsider why you're using Windows.
By my count, that's 7 levels of nested dialogs. And my name didn't show up in the list for step 12, so I had to Add "APM\andrew" (obviously, users would know to type that). Who designs this stuff? Who reviews it? Microsoft is supposed to hire all these really smart people, but they still seem to produce a lot of stupid stuff.
Sunday, November 15, 2009
jSuneido Success
As I hoped, once I had a small failing test it didn't take too long to find the problem and fix it. It didn't make me feel too stupid (at least no more than usual when you figure out a bug) since it was a fairly subtle synchronization issue. Have I ever mentioned that concurrency is hard?
The funny (in a sick way) part was that after all that, I still had the original problem. Ouch. Obviously, the problem I isolated and fixed wasn't the only one.
Pondering it more, I realized that the bugs I'd been chasing all originated from a certain aspect of the design. And I realized that even if I managed to chase them down and squash them, the design was still going to end up fragile. Some future modification was likely to end up with the same problem.
So I reversed course, deleted most of the code I wrote in the last few days, and took a simpler approach. Not quite as fast, but simplicity is worth a lot. It only took a half hour or so to make the changes.
Amazingly, all the tests now pass! It took me a minute to grasp that fact. What does that mean when there are no error messages? Oh yeah, that must mean it's working - that's weird.
I'll have to write a bunch more tests before I feel at all confident that it's functional, but this is definitely a step in the right direction. I feel a certain amount of reluctance to start writing more tests - I'd like to savor the feeling of success before I uncover a bunch more problems!
The Joy of a Small Failing Test
Up till now I could only come up with small tests that succeeded and large-scale tests that failed.
What I needed was a small test that failed. I finally have one. And even better, it actually fails in the debugger :-)
It's not so easy to come up with a small failing test, because to do that you have to narrow down which part of the code is failing. That's half the challenge; the other half is figuring out why it's failing.
At least now I feel like I have the beast cornered and it's only a matter of time before I kill it.
The test is simple enough that I look at it and think "this can't fail". But it is failing, so obviously I'm missing something. I just hope it's not something too obvious in hindsight because then I'll feel really stupid when I find it.
Saturday, November 14, 2009
jSuneido Progress
The good news is that I've fixed a number of bugs and come up with a reasonable (I think) solution for my design flaw. The solution involved the classic addition of indirection.[1] Of course, it's not the indirection itself that's the trick; it's how you use it.
The bad news is that after I'd done all this, I was still getting the original error! It only occurs about once every 200,000 transactions (with 2 threads). (Thank goodness for fast computers - 200,000 transactions only take about 5 seconds.) Frustratingly, it doesn't happen in the debugger. With this kind of problem it's not much use adding print statements because you get way too much irrelevant output. A technique I've been finding useful is to have each transaction keep a log of what it's doing. Then when I get the error I can print the log from the offending transaction. It's not perfect, because with concurrency problems you really need to see what the other thread was doing, but it's better than nothing.
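The technique is simple enough to sketch; something along these lines (a hypothetical illustration, not the actual jSuneido code):

```java
import java.util.ArrayList;
import java.util.List;

// Each transaction accumulates its own in-memory log; the log is only
// printed if that transaction hits the error, keeping output relevant.
class LoggedTransaction {
    private final int id;
    private final List<String> log = new ArrayList<String>();

    LoggedTransaction(int id) { this.id = id; }

    void log(String event) {
        log.add(event);
    }

    // called when this transaction hits the error being chased
    void dumpLog() {
        System.err.println("--- log for transaction " + id + " ---");
        for (String event : log)
            System.err.println(event);
    }
}
```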
It was also annoying because it was the end of the day so I had to leave it with a known error :-(
Thinking about it, I realized I had rushed coding some of the changes, hadn't really reviewed them, and hadn't written any tests. Not good. When I went back to it this morning, sure enough, I had made mistakes in my rush job. Obviously, that self-imposed pressure to get things resolved by the end of the day is not always a good thing.
So now I'll go back and review the code and write some tests before I worry about whether I've fixed the original problem.
1. A famous aphorism of David Wheeler goes: "All problems in computer science can be solved by another level of indirection." Kevlin Henney's corollary to this is, "...except for the problem of too many layers of indirection." - from Wikipedia
Wednesday, November 11, 2009
jSuneido Multi-Threading Issues
It didn't take much testing to find something that worked single-threaded but failed multi-threaded.
I was expecting this - I figured there'd be issues to work out.
But I was expecting them to be hard to track down and easy to fix, and it turned out to be the opposite - easy to track down but hard to fix.
The problem turned out to be more a design flaw than a bug. I've thought of a few solutions but I'm not really happy with any of them.
Oh well, I knew all along this wasn't going to be easy. It'll come.
Monday, November 02, 2009
IntelliJ IDEA Goes Open Source
I recently learned that JetBrains has released a free, open source community edition of their IntelliJ IDEA IDE.
IntelliJ is one of the main Java IDEs, along with Eclipse and NetBeans. I hadn't looked at it much because the other two are free, but it does get some good reviews. (Apparently they did offer free licenses to open source projects, but I wasn't aware of that.)
I tried downloading it and installing it and had no problems. It comes with Subversion support "out of the box" and I was easily able to check out my jSuneido project. That's more than I can say for Eclipse where it's still a painful experience to get Subversion working (at least on a Mac). IntelliJ proves that it is possible to do it smoothly.
I haven't had time to play with it much yet. My first impression was that the UI was a little "rougher" than Eclipse's. I can probably tweak the fonts to get it a bit closer. Maybe it's due to Eclipse using SWT. (I'm not sure what IntelliJ is using.)
IntelliJ is known for its strong refactoring support. To be honest, I only use a few basic refactorings in Eclipse (like rename and extract method), so I don't know if this would be a big benefit. I should probably use more...
IntelliJ is also supposed to have the best Scala plugin. I'll have to try it. I tried the Eclipse one but wasn't too impressed with where it's at so far.