Thursday, June 28, 2012

First Impressions of Xtend

I've been interested in Xtend for a while. It's an improvement over Java, without being a big jump like Scala.

I finally got around to playing with it a little. I took one of my Java classes from jSuneido and rewrote it in Xtend. But I got stuck with some weird errors around array access. I did some searching on the web and found that Xtend doesn't support array indexing, e.g. array[i]. I was surprised. You can pass arrays and do for loops over them, but not index them.

I imagine part of the reason for this is that square brackets are used for closures, although since they appear in different contexts, it seems like array indexing could still be parsed ok.

Then I saw you can write array.get(i) - more verbose, but more consistent with other containers. (Personally, I'd rather be able to use x[i] or x(i) like Scala for all the containers.)

Then I discovered that the way it handles .get(i) is to wrap the array in another class. This is similar to how Scala does implicit conversions to add methods to existing classes. There's nothing wrong with this approach, but it means allocating a new object for every array access. Considering arrays are usually used for low level efficiency, this doesn't seem ideal. Granted, in some cases the JIT compiler may be able to eliminate the allocation, and if this technique becomes common, it's more likely to be a target for optimization.
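Something like this hypothetical illustration (my own sketch to show the cost, not Xtend's actual implementation):

    // a made-up wrapper to illustrate the allocation; Xtend's real wrapping differs
    class IntArrayWrapper {
        private final int[] array;

        IntArrayWrapper(int[] array) {
            this.array = array;
        }

        int get(int i) {
            return array[i]; // the underlying indexed access
        }
    }

    // conceptually, every array.get(i) becomes something like:
    // int value = new IntArrayWrapper(array).get(i); // one allocation per access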

It seems like a better approach would be to recognize array.get(i) and compile it to array access byte codes, without any wrapping. Maybe that's hard to do in their compiler.

Another alternative is to leave array code in Java. Which is fine, but doesn't leave a good first impression.

One of the justifications given for not supporting array indexing is that you should use the higher level containers instead. I think there are times when the extra efficiency of bare arrays is justified, but for the most part I agree. And, in fact, the class that I converted could just as easily use ArrayLists rather than arrays, likely without much performance penalty.

Even after I fixed this I still had some weird errors left. It turned out that Xtend doesn't have increment and decrement operators (++ and --). Fair enough, Scala doesn't have them either. But I was more surprised to find that it didn't handle += either, although it is listed as an operator in the documentation. For a language that claims "Less Noise", it seems odd to have to write "i = i + 1" instead of "i++".

Once I got past these issues, I was pretty happy with the results. Even without using any of Xtend's more powerful features (like closures and extension methods) the code was cleaner and less verbose, although not radically so.

One of Xtend's claims is "top notch" Eclipse IDE support. Unfortunately, that doesn't mean much in the way of refactoring.

I'm still left with doubts about the future of Xtend. It doesn't have the momentum or backing of Scala. With a "smaller" project like this, it could easily stagnate, mutate, or die.

Saturday, June 23, 2012

Immudb in Production

We took the plunge and converted our (Axon's) in-house accounting and CRM system over to the new version of jSuneido with the immudb append-only storage engine. We have about 50 users so it's a decent test, although we're not as heavy users as some of our customers.

So far so good, it's been running for a couple of days with no glitches.

We'll give it a little longer and then look at converting actual customers.

Thursday, June 14, 2012

Optimizing Java Memory Usage

One of the things I've been meaning to work on for a while is the memory usage for temporary indexes in Suneido's database.

I started by measuring the current memory usage. I used two test cases, first a single table with 400,000 records sorting by a non-indexed field, and second the same table but joined with another one. I used VisualVM to watch the heap and force garbage collections to get an idea of memory usage. The first case used about 200mb and the second about 400mb. For the simple case, that's about 500 bytes per record. I was pretty sure I could improve this usage.

My assumption was that a lot of the usage was Java's per-object overhead.

For each record, the temporary index code creates a new sort key record, plus an Object[] array that stores references to each of the source data records (more than one in the case of joins). These sort key arrays are then added to a MergeTree.

My first step was to switch from allocating a separate small array for each sort key, to putting them into a big array and referencing them by offset.

I could have used a standard ArrayList, but it works by doubling the size of its internal array each time it needs to grow. At very large sizes, this is not the best strategy. For example, to grow from 100mb to 200mb you need to allocate a new 200mb array, copy the old 100mb array into it, and then the old array has to be garbage collected. On top of this, if the data is long lived, it will probably end up being moved to older generations by the garbage collector, where it won't get garbage collected as easily.

Instead, I wrote an ArraysList which uses a bunch of medium sized arrays rather than one big array. To grow, it simply allocates another medium sized array, without any copying or garbage collecting. This is not necessarily better in every situation, but for my usage it is.
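A minimal sketch of the idea (the real jSuneido class differs in the details):

    import java.util.ArrayList;

    // chunked list: grows by adding another medium sized array,
    // with no copying and no old array for the garbage collector to reclaim
    class ArraysList<T> {
        private static final int CHUNK_SIZE = 512 * 1024;
        private final ArrayList<Object[]> chunks = new ArrayList<>();
        private int size = 0;

        int add(T value) {
            if (size % CHUNK_SIZE == 0)
                chunks.add(new Object[CHUNK_SIZE]); // grow by one chunk
            chunks.get(size / CHUNK_SIZE)[size % CHUNK_SIZE] = value;
            return size++; // callers keep this int offset instead of a reference
        }

        @SuppressWarnings("unchecked")
        T get(int i) {
            return (T) chunks.get(i / CHUNK_SIZE)[i % CHUNK_SIZE];
        }
    }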

The next problem was that now I was storing an int offset (into the ArraysList) instead of an array. My MergeTree is generic, which means to use it for ints they'd have to be boxed into Integers. Then I'd be back to too much per-object overhead. So I had to make a specialized IntMergeTree. It's too bad that Java makes such a big distinction between primitives and references, and that generics only work with references. (Scala tries to reduce the distinction, and it even lets you specialize a generic class for a primitive type, but it still ends up boxing and unboxing most of the time.)

In the common case where you're just sorting a single table, there's no need for the array at all, and instead of storing the ArraysList offset, you can just store the integer "address" of the record.

The next step was to store the sort key record in larger blocks of memory rather than individually, and again reference them by offset rather than a Java reference. For this I could use the MemStorage class I had already written to create in-memory databases for tests. It inherits from the same Storage base class as my MmapFile memory mapped file access. These are based on ByteBuffers. Each sort key record used at least three memory objects (the record, its ByteBuffer, and the ByteBuffer's array) so this was a significant saving. (This doesn't require much serialization because the sort key records are already in a serialized form.)

The end result was that the memory usage was less than 1/4 of what it was before. My first case now took about 40mb instead of 200mb, and my second case took about 90mb instead of 400mb. Not bad for a couple of days work.

I wouldn't normally recommend this kind of messing around - the overhead is seldom a big issue. But in the case of a server, with large numbers of users competing for memory, I think it was worth it.

Wednesday, May 23, 2012

IllegalMonitorStateException and Stack Overflow

Just when I thought immudb was ready to deploy, I got sporadic IllegalMonitorStateException's.

I wasn't sure what that exception meant, but it sounded like concurrency, and that was not good news.

According to the documentation it comes from wait and notify. The catch is that my code doesn't directly use wait and notify. So something that my code uses is in turn using wait and notify.

I tried to catch the exception in the debugger to see where it was coming from, but of course, it never happened when I ran inside the debugger :-(

I started looking around at the docs for concurrency classes that I use and found that ReentrantReadWriteLock WriteLock unlock can throw IllegalMonitorStateException "if the current thread does not hold this lock".

Ouch. My server is NOT thread-per-connection. It uses a thread pool to service requests. So the thread that handles the request that starts a transaction (and acquires the lock) may not be the same thread that handles the request that ends the transaction (and releases the lock).
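A tiny demonstration of the failure mode (my own sketch, but consistent with the documented behavior):

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class UnlockFromWrongThread {
        public static void main(String[] args) throws InterruptedException {
            ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock();
            rwlock.writeLock().lock(); // pool thread A starts the transaction
            Thread poolThreadB = new Thread(() ->
                rwlock.writeLock().unlock()); // throws IllegalMonitorStateException
            poolThreadB.start();
            poolThreadB.join();
        }
    }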

Because I'm not testing with lots of connections, most of the time all the requests will be handled by the same thread and it will work. But once in a while an additional thread would be used AND would happen to be ending a transaction, and then I'd get this error.

And my concurrency test is equivalent to thread-per-connection so it doesn't run into the problem. That's a weakness in my test, and in this respect the bounded executor I was using before is more equivalent to the actual server.

Of course, I can't guarantee this is the only source of the error, but regardless, it was a bug I needed to fix.

I needed a read-write lock that allowed lock and unlock to be in different threads. I didn't need (or want) it to be reentrant (where the thread holding the lock can acquire it multiple times).

I searched on the web but couldn't find anything.

It looks like it could be written using AbstractQueuedSynchronizer but I'm afraid if I write it myself I'll make some subtle concurrency mistake. I could find various examples of using AbstractQueuedSynchronizer but not a ReadWriteLock.

I'm stumped so I decided to post a question on Stack Overflow. (After searching to make sure there wasn't an existing question.)

I've always been a fan of Stack Overflow. I'm not a heavy user, but I've asked and answered a few questions, and if it comes up in web searches I give it preference. But this time I got a little frustrated with the responses. No one wanted to answer the question - they just wanted to tell me what I was doing was wrong. That's a valid response to some questions, and I have to admit I hadn't really explained the context. I thought the question was specific enough to make the context unnecessary.

I kept clarifying the question and trying to convince the responders that I really did need what I was asking for. It was even more frustrating that people were voting for the responses that just told me I was wrong.

Of course, part of the frustration was that I started to doubt my own design decisions. Maybe I should just be using thread-per-connection - it would have avoided the current issue.

But in the end, Stack Overflow came through again - someone posted an answer that was exactly what I needed. Bizarrely, that answer didn't get as many votes, even though it was the "right" answer.

The answer was to use Semaphore. I hadn't even noticed this class because it's in java.util.concurrent, not in java.util.concurrent.locks where I was looking. I guess it's not a "lock" although it can be used as one.

And when I went to look at the source code for Semaphore, I found that it is implemented (at least in OpenJDK) with AbstractQueuedSynchronizer (which is in java.util.concurrent.locks).

It was simple to write my own read-write lock using Semaphore and everything seems to work fine. I wondered about performance but it seems to be roughly the same as before. I ran some tests and I didn't get any IllegalMonitorStateException's, but it was sporadic before, so that doesn't guarantee it's fixed.
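The lock ended up looking roughly like this (a sketch along the lines of the Stack Overflow answer, not my exact code):

    import java.util.concurrent.Semaphore;

    // non-reentrant read-write lock; because Semaphore doesn't track an owner,
    // a permit acquired on one thread can be released on another
    public class ThreadAgnosticRWLock {
        private static final int MAX_READERS = Integer.MAX_VALUE;
        private final Semaphore semaphore = new Semaphore(MAX_READERS, true);

        public void readLock() throws InterruptedException {
            semaphore.acquire(); // each reader holds one permit
        }

        public void readUnlock() {
            semaphore.release();
        }

        public void writeLock() throws InterruptedException {
            semaphore.acquire(MAX_READERS); // a writer takes every permit
        }

        public void writeUnlock() {
            semaphore.release(MAX_READERS);
        }
    }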

AbstractQueuedSynchronizer has both "shared" and "exclusive" features which seem to map well to read-write locking. But Semaphore doesn't use the exclusive feature. It seems like you could write a read-write lock based on AbstractQueuedSynchronizer that would be a little "cleaner" than Semaphore. But for now at least, I'm happier using something like Semaphore that is tried and tested.

Sunday, May 20, 2012

More Immudb Results

The test that I was using to measure concurrency performance was using a bounded executor - code I'd found on the web, back when I knew Java even less than I do now. I decided that it was more complex than it needed to be and I rewrote it just using a number of worker threads. Surprisingly, that seemed to eliminate the drop in performance with more threads.
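The rewritten driver amounts to something like this (the names and numbers are illustrative, not the actual test code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class LoadDriver {
        static final int N_THREADS = 4;
        static final int OPS_PER_THREAD = 100_000;

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(N_THREADS);
            for (int t = 0; t < N_THREADS; ++t)
                pool.execute(() -> {
                    for (int i = 0; i < OPS_PER_THREAD; ++i)
                        doRandomTransaction(); // each worker hammers the server
                });
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static void doRandomTransaction() {
            // stand-in for the actual database operations
        }
    }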

I also tested on my Windows machine which has a similar CPU but with an SSD (solid state drive) instead of a hard disk. Here are the results:

Now the performance seems to consistently level off as the number of client threads increases. That's less worrying than the performance dropping, but it's still a little puzzling. It still doesn't appear to be due to actual concurrency scaling issues like lock contention. Perhaps it's running into some other limit like storage or memory bandwidth?

Windows + SSD was roughly 2 times as fast as Mac + HD. I suspect that's mostly due to the SSD, although the OS and hardware could also play a part.

immudb also shows more improvement on SSD than the previous version. (blue to red versus orange to green) My guess would be that this is because immudb's large contiguous writes are optimal for SSD.

On both platforms, immudb is about 4 times as fast as the previous version.

Tuesday, May 15, 2012

Immudb Concurrent Performance Puzzle

One of the tests I hadn't run on immudb (my new append-only database engine for Suneido) was a concurrency test I'd used to debug the previous database engine.

Here are some rough results. As usual, this is not a scientific benchmark. The operations it's doing aren't necessarily representative, and I'm not running it long enough or enough times to get accurate figures. But it still gives me some information.

I'm pretty happy with the results relative to the previous version. Up to four client threads it's about four times as fast - can't complain about that. And another good result that's not shown by this chart is that immudb has a lot less transaction conflicts.

Also on the positive side, I only found a couple of bugs, and none of them took more than a few minutes to fix. This is a sharp contrast to when I was debugging the previous version. I give credit to the immutability and the resulting reduction in locking.

But I'm puzzled by the drop in performance over 4 threads. And I didn't even include 8 threads because the results were all over the place - anywhere from 3000 to 6000. There's some variation with fewer threads, but nowhere near this much.

If the performance just levelled off, that would be one thing, but I'm not happy with the performance dropping under heavy load - that's not what you want to see.

The previous version may not perform as well, but it doesn't show the puzzling drop off with more threads. Although, even with the drop, the immudb performance is still much better than the previous version.

At first I was only running 1, 2, 4, and 8 threads. My computer has 4 cores with hyper-threading for more or less 8 hardware threads so I thought maybe it was because I was using up all the hardware threads and not leaving any for other things like garbage collection and the OS. But that wouldn't explain the drop off at 5 threads.

I looked at hprof output, but it didn't give me any clues. It did highlight some areas that I could probably improve, but not related to concurrency. (they were related to one of my pet peeves - ByteBuffer)

More threads could mean more memory usage, but JConsole and VisualVM don't show a lot of time in garbage collection.

The obvious issue is some kind of contention. Immudb doesn't do much locking (because most things are immutable) but I could see contention over the commit lock. But JConsole and VisualVM don't show much waiting for locks. Thread dumps show all the threads as RUNNABLE. And I don't see underutilization of the CPUs, as I would expect with heavy contention (since they'd be waiting for locks, not executing).

Another possibility is that more concurrent transactions mean more read validation work. And since each commit has to check against all other concurrent transactions, this is O(N²), which could be bad. But again, I don't see any signs of it in my monitoring.
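To make the scaling concern concrete, the validation has roughly this shape (illustrative types and names, not the actual code):

    import java.util.List;

    // each committing transaction checks its reads against every other
    // concurrent transaction, so with N concurrent transactions the total
    // validation work grows as O(N²)
    interface IndexRange {}

    interface Transaction {
        List<IndexRange> reads();
        boolean writesOverlap(IndexRange read);
    }

    class ReadValidation {
        static void validate(Transaction committing, List<Transaction> concurrent) {
            for (Transaction other : concurrent)
                for (IndexRange read : committing.reads())
                    if (other.writesOverlap(read))
                        throw new RuntimeException("transaction conflict");
        }
    }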

And I can't think of any explanation for the excessive variation. Strangely, 16 threads shows much less variation than 8.

I'm more puzzled than worried at this point. In actual usage I don't think it will be an issue because our systems, even the largest ones, won't be applying a continuous load this heavy.

Anyone have any ideas or suggestions of things to look at or try?

Saturday, May 05, 2012

No Service

Every trip I make to the US, I end up wishing I had cell phone internet access.

I have a Canadian data plan on my iPhone. I don't use it a lot, since I can usually find wifi, but occasionally it is very handy, e.g. to use Google Maps to find an address while wandering around. I could use this plan in the US but the roaming charges are ridiculous and I refuse to pay them.

When I got my iPad, I bought the 3G model thinking that at some point I could get a US SIM card and data plan. In the winter when I was in Savannah I stopped at an AT&T store to try to set it up. The two young guys working there were friendly enough, but did not seem at all inclined to help. The first told me I couldn't do it because I had a Canadian SIM card. I told them I didn't have a SIM card. The second guy then said it wouldn't work because I didn't have a SIM card. Couldn't they give me a SIM card? Yeah, they could, but it wouldn't work. Could we try it? No, because I wouldn't have an account. Could they set up an account for me? No, because I'm Canadian. At this point I gave up.

So this trip, I thought I'd try an Apple store and see if they would be more helpful. I explained what I wanted to the greeter. He said "no problem" and gave me an AT&T SIM card and even installed it for me (between greeting each person that came in the store). All I had to do was phone AT&T and set up a plan.

Of course, that wasn't as easy as it sounded. I knew phoning would be a hassle, so first I tried to sign up on the iPad itself. That wouldn't work because it insisted on a US billing address, so my credit card wouldn't work. Then I tried on the AT&T web site, but same problem. So I gave in and phoned. It took me multiple tries to get through the computerized phone system, wait on hold, get the wrong department and get transferred, wait on hold again. Finally I got someone in the correct department who informed me it was impossible. I had to have a credit card with a US billing address. Considering this was for a pre-paid plan, I'm not sure why they care where the money comes from. But it's not like you can argue logic with a call center.

I did a little searching on the web and the suggestion I found was to use a prepaid gift credit card. I've never used one of these but it was worth a try. I picked up a Visa card from a local convenience store. But the instructions with the card said if you wanted to use it on the web you should register it with your address. I needed a US address so I used our hotel address. Finally, I could set up my AT&T plan.

It all seemed to work properly, but I could barely get service. AT&T must not have very good coverage in our area of Las Vegas. When we got to Fresno I tried again. Good service - five bars - but still no data connection. I checked all the settings and checked my account but everything looked correct.

So we stopped at the Apple store on our way out of Fresno. I talked to the greeter, but he had no idea what the problem would be and said I'd need to make an appointment for the Genius Bar. Unfortunately, the next free appointment wasn't for two hours :-(  I wandered through the store thinking what to do next. I could just buy a US iPad (the new model would be nice), but I hated to do that when I didn't know what the problem was. There was a guy behind the Genius Bar who didn't appear to be busy so I asked if he had a couple of minutes to help me. He was a little reluctant (probably not supposed to take walk ups) but agreed to take a look. He checked all the same things I had already checked. Eventually he decided it was because it was a Canadian iPad. (It always strikes me as funny how these support people come up with theories, but then present them as "fact".) But it's a US SIM, I said. He told me to wait a minute and disappeared into the back with my iPad. He returned after 10 or 15 minutes and handed me my iPad saying "here you go". I looked at the screen and there was the little "3G" icon at the top! And I was connected to the internet!

I asked what the fix was. He hesitated and then said they turned it off and back on. Shelley was standing next to me and laughed, "isn't that what you would have told me to do?" I kicked myself for not trying this. The reason I didn't is that iOS tries hard to hide the whole idea of "turning off" your device. They try to abstract away the issue of on or off, but as usual, it's a "leaky" abstraction. Of course, the other question is why it needed to be turned off and on - what did that reset?

So finally, I have my US data. Has it been useful? Not very, since we're mostly out in the boonies and there is poor coverage. It should be more useful when I'm traveling in cities. In any case, it had become a quest and the usefulness wasn't really the motivation any more!

I only signed up for the $15 for 250mb prepaid plan since I wasn't sure how it would work. I'm not sure if that'll be sufficient. If I needed more I could get 3gb for $30. Also, in theory, I think I could swap the SIM card into my iPhone since that's a little easier to carry around and use on the street. Now I just have to remember to cancel my plan when I get home, although the most they could bill was the $50 of the prepaid card.


Thursday, March 29, 2012

More positive immudb results

I'm now at the point where I can successfully run our entire application test suite, not just the stdlib ones. Here's how much the database grows in each version:

cSuneido:  20 mb

immudb1:  400 mb   :-(

immudb2:  19 mb   :-)

I'm happy that my decision to back up and try a different approach turned out to be a good one.

It's not too surprising that immudb2 gives slightly better results than cSuneido for running tests, since it will often avoid persisting indexes for temporary tables, whereas cSuneido will always write them.

NOTE: This is still not a realistic workload. I suspect immudb will end up having more database growth than cSuneido. But as long as it's not excessive (as these results would tend to indicate) then it's ok.

Tuesday, March 27, 2012

Preliminary Immudb Results

I ran a few tests for database growth on the new version of my immudb append-only database storage engine for jSuneido.

First, I tested with persisting indexes every update transaction. This is more or less what it did previously. Then I tested with persisting every so many update transactions.

every 1 = growth 5,882,408
every 10 = growth 1,322,712
every 100 = growth 551,080
every 1000 = growth 354,168

These are the kind of results I was counting on. Deferring persisting the indexes results in much less growth.

Coincidentally, the improvement shown above is about 17 times, not far from the 20 times difference between cSuneido and the first version of immudb. This is just coincidence since I'm running different tests, but it's very nice to see that the improvement may be enough to bring immudb into the same ballpark as cSuneido, which is pretty good considering cSuneido is a mutable database.

Is persisting every 1000 update transactions reasonable? I think so. These tests only take about 5 seconds to run. Remember, the actual data is being written right away, log style, so persisting less often does not increase the risk of losing data. It only increases the amount of work to be done by crash recovery. Making crash recovery re-process 5 seconds (or more) of transactions seems fine. Especially considering that cSuneido rebuilds the entire indexes after a crash, which could take a long time for a big database.
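In outline, the commit path amounts to something like this (the method names are illustrative):

    // persist the btree/db info changes only every N update transactions;
    // the data records are still written to the file at every commit
    class CommitOutline {
        private static final int PERSIST_INTERVAL = 1000;
        private int commitCount = 0;

        synchronized void commit(Object transaction) {
            writeDataRecords(transaction);      // log style, so data is never at risk
            updateInMemoryIndexes(transaction); // btrees change in memory only
            if (++commitCount % PERSIST_INTERVAL == 0)
                persistIndexes();               // the deferred index write
        }

        private void writeDataRecords(Object t) { /* stand-in */ }
        private void updateInMemoryIndexes(Object t) { /* stand-in */ }
        private void persistIndexes() { /* stand-in */ }
    }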

These numbers are from running the stdlib (standard library) tests. This isn't a very typical work load, so the numbers aren't going to be very accurate. But I'm just looking for some very rough feedback at this point. If I'd seen little or no improvement, I'd be seriously depressed right now and considering taking up golf instead :-)

Ideally, I need a good statistical model of a typical work load, or at least representative, since there's probably no such thing as typical. Another project for my "spare" time!

Hmmm... thinking about what the stdlib tests do versus what a typical workload might be, I realized that one atypical thing the stdlib tests do is to create, use, and then drop temporary database tables. But in that case, if you defer persisting till after the table is dropped, then you never need to save the indexes for that table at all. (Although the data will still be written immediately, log style.)

It turned out I wasn't handling this, which actually was good because it would have seriously skewed the above results. It was easy enough to fix.

Rerunning the tests, I got the results I expected - much the same growth when persisting every transaction, less growth than before as the persist interval increased. Persisting every 1000 transactions resulted in a growth of only 62,424 - roughly 5 times better.

This is nice, but not really relevant to production, since a typical workload does not include creating and dropping a lot of temporary tables. It will be nice for development because running the tests won't cause the database to grow so much.  (At least until I implement background compaction.)

Persisting every so many update transactions is not necessarily the best approach, I'm using it because it's the easiest. Alternatively, you could persist every so many seconds or minutes, or try to persist during "idle" times. Currently, I'm persisting synchronously, i.e. while holding the same lock that commits use. So update transactions can't commit till the persist finishes. (But read-only transactions are not affected, and even update transactions can read and write; they just have to wait to commit.) Depending on how much index information has to be persisted, this could introduce delays in server response. To alleviate this, you could persist in a background thread. Because of the immutability, this should be relatively straightforward. The exception is updating the in-memory index information to show it has been saved - this would require some synchronization.

Another issue I haven't dealt with is that I may need to "flush" some of the index information out of memory. Once it's been stored, a btree node in memory consists mainly of a ByteBuffer that points to part of the memory mapped index file. Since the memory mapped space is "virtual", it will get evicted if the OS runs short on memory. But for a really big database, even the small objects referencing the virtual memory may add up to too much memory usage.

It would be easy enough for the persist process to discard the btree nodes once they have been saved. But that would have an effect on performance since they would have to be recreated on-demand. You'd want to evict selectively, presumably on some kind of LRU or NFU basis, which would require tracking some kind of usage data. Presumably, you'd also only want to do this when necessary, i.e. when memory is getting full. Hmmm... perhaps one approach would be to use WeakReference's and let Java worry about when and what to evict. Anyway, that's a challenge for another day.
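The WeakReference idea might look something like this (the names are made up; a SoftReference might actually suit a cache better, since weak references get cleared quite eagerly):

    import java.lang.ref.WeakReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // hold saved btree nodes only weakly, letting the garbage collector
    // decide what to evict; evicted nodes are recreated from the index file
    class NodeCache {
        interface Loader {
            Object readNode(long address); // recreate a node from storage
        }

        private final Map<Long, WeakReference<Object>> cache = new ConcurrentHashMap<>();
        private final Loader loader;

        NodeCache(Loader loader) {
            this.loader = loader;
        }

        Object get(long address) {
            WeakReference<Object> ref = cache.get(address);
            Object node = (ref == null) ? null : ref.get();
            if (node == null) { // never cached, or evicted by the GC
                node = loader.readNode(address);
                cache.put(address, new WeakReference<>(node));
            }
            return node;
        }
    }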

And now back to testing :-)

Sunday, March 25, 2012

jSuneido immudb milestone

The improved version of the jSuneido append-only storage engine (immudb) now passes all the standard library tests.

Which is great, except it just means I'm right back where I was two months ago with my previous jSuneido immudb milestone. As usual, it feels like I only started on this a couple of weeks ago.

Except that hopefully I've fixed the database growth and transaction conflict issues.

I haven't implemented the idea of passing the table select predicate down to the transaction so it can use it for read validation. But even without this, the conflict detection should be at least as fine grained as cSuneido (and much better than previous versions of jSuneido).

I'm almost to the point where my strangler fig approach has taken over the old immudb code and I should go back and clean it all up so I have a single version. I can't see needing the old immudb version again (and it'll be in version control if I do).

The last problem I ran into was a tricky one. In order to defer btree updates till commit, I use transaction local btrees. Which means to access an index you have to look at both the local and the global btrees. To do this I use an iterator that merges two other iterators. Iterating through two sorted sequences in parallel is a standard easy problem. But Suneido allows you to step forwards or backwards in any order with the same iterator. And it turns out that this makes iterating in parallel a lot tougher. The trick is when you change directions. The final code is not that large (look at next and prev in MergeIndexIter) but I went through about six different versions before I finally got one that appears to work. Of course, most of my previous versions also appeared to work, until I added more test cases.
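For contrast, here's the standard easy version - a forward-only merge of two sorted iterators (my own sketch, not the MergeIndexIter code). Supporting prev() as well, with direction changes at any point, is what makes it hard:

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    class MergeIter<T extends Comparable<T>> implements Iterator<T> {
        private final Iterator<T> a, b;
        private T nextA, nextB; // one-element lookahead into each source

        MergeIter(Iterator<T> a, Iterator<T> b) {
            this.a = a;
            this.b = b;
            nextA = a.hasNext() ? a.next() : null;
            nextB = b.hasNext() ? b.next() : null;
        }

        @Override
        public boolean hasNext() {
            return nextA != null || nextB != null;
        }

        @Override
        public T next() {
            if (!hasNext())
                throw new NoSuchElementException();
            if (nextB == null || (nextA != null && nextA.compareTo(nextB) <= 0)) {
                T result = nextA;
                nextA = a.hasNext() ? a.next() : null;
                return result;
            } else {
                T result = nextB;
                nextB = b.hasNext() ? b.next() : null;
                return result;
            }
        }
    }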

A big part of finally figuring this out was coming up with the right visualization for it. Here's what I ended up with:


Notice it's paper and pencil. (It should have been on graph paper but I so seldom do stuff on paper anymore that I didn't have any handy.) I would have preferred to do it digitally, but I couldn't think of an easy way to do it. I could have done it in a drawing program or even a spreadsheet or word processor, but it would have been a hassle. Drawing programs make it possible to do just about anything, but they don't tend to make it easy and natural (at least, if you only use them occasionally like me). Maybe there's a market niche for someone to go after. Or maybe there already is a good program for doing this - anyone know one?

Given the extra complexity, I started to question whether Suneido needed to allow you to switch directions with an iterator. I can only think of a couple of places where we use this. But I'll leave that question for another day.

And now back to testing. I imagine I've still got a few bugs lurking in there. I also need to do some testing and see whether I've actually solved the database growth problem. It may be harder to know whether the transaction issue is solved until we actually get it in production.

Friday, March 23, 2012

I'm a little slow sometimes

A roundabout detour of bit twiddling optimizations that turn out to be ultimately unnecessary.

Originally, Suneido used 32 bit database file offsets. Remember when disks (let alone files) were limited to 4 gb? When we needed to support databases larger than 4 gb, I didn't want to switch to 64 bit offsets because of the extra space required to store them, so I "compressed" larger offsets into 32 bits. (a bit like the Java -XX:+UseCompressedOops option)

The approach I used was to align and shift. If you align the data in the file on, for example, 8 byte boundaries, then the lower 3 bits of an offset will always be zero. So you can shift a 35 bit offset to the right 3 bits so it fits into 32 bits. (Who says bit twiddling is dead?) This allows a maximum file size of 32 gb instead of 4 gb. (Actually, for obscure historical reasons, even though the alignment is 8 bytes, the shift is only 2 bits, so the maximum size is 16 gb.)
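In code, the scheme is just a shift in each direction (shown with 8 byte alignment and the full 3 bit shift):

    class CompressedOffsets {
        static final int SHIFT = 3; // 8 byte alignment frees the low 3 bits

        static int compress(long offset) {
            assert (offset & ((1 << SHIFT) - 1)) == 0; // must be aligned
            return (int) (offset >>> SHIFT);           // 35 bit offset into 32 bits
        }

        static long uncompress(int stored) {
            return (stored & 0xFFFFFFFFL) << SHIFT;    // max file size 32 gb
        }
    }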

The tradeoff is some wasted space. Each block of data will waste an average of 4 bytes.

The problem is, it's starting to look like we need to allow even larger databases. I could just increase the align/shift to 16 bytes/4 bits to allow 64 gb. But that would increase the average waste to 8 bytes, even though most databases aren't that big.

Ideally, you'd want small databases to use a small align/shift, and larger databases to use a larger align/shift. I could make the align/shift a command line option, but you'd have to rebuild the database to change it. And I'm not a big fan of cryptic options that most people won't understand.

My next thought was to use a variable align/shift. There are different ways you could do this, but a simple one would be to reserve two bits to specify the align/shift. This would leave 30 bits (1 gb range) for the unshifted offset. For example:

00 = 4 byte alignment = 0 to 4 gb
01 = 8 byte alignment = 4 to 12 gb
10 = 16 byte alignment = 12 to 28 gb
11 = 32 byte alignment = 28 to 60 gb

For databases up to 4 gb, this would use a smaller alignment than currently and therefore less wasted space, while still allowing a much higher maximum size than now. Only large databases would pay the cost of higher wastage due to larger alignment.

This scheme would be slightly more complex to decode, but not too bad.
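Decoding would look something like this (my own sketch of the idea):

    class VariableAlignShift {
        // the top two bits select the tier; the low 30 bits are the shifted offset
        static long decode(int stored) {
            int tier = stored >>> 30;
            long offset = stored & 0x3FFFFFFFL;
            switch (tier) {
            case 0:  return offset << 2;                 // 4 byte align, 0 to 4 gb
            case 1:  return (4L << 30) + (offset << 3);  // 8 byte align, 4 to 12 gb
            case 2:  return (12L << 30) + (offset << 4); // 16 byte align, 12 to 28 gb
            default: return (28L << 30) + (offset << 5); // 32 byte align, 28 to 60 gb
            }
        }
    }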

At first I was excited about this approach, but it doesn't scale. You can increase alignment to allow bigger files, but then you waste more and more space. At some point, you'd be further ahead to drop the alignment and just store 64 bit offsets. Except you don't want the overhead of large offsets on small databases.

So why not just use a variable length encoding like Google Protocol Buffer varint?

A variable length encoding would not be the simplest thing to manipulate in the code, but we don't have large numbers of offsets in memory, so you could use 64 bit unencoded offsets in memory, and just use the variable length encoding on disk.

At this point I whack myself on the forehead. D'oh! Sometimes I am such an idiot! (A lot of my "aha!" moments are spoiled by thinking that if I was just a little bit smarter and had thought of it a little sooner, I could have saved myself a lot of work.)

Suneido already uses a variable length encoding when storing file offsets!

The offsets are stored in the btree key records. They are getting encoded the same way other numeric values are packed into records, which is already a variable length encoding.

So all the messing around with align/shift is mostly wasted effort. It does reduce the magnitude of the offsets, which does mean they are encoded slightly smaller, but this is minor.

I'd be further ahead to drop the align/shift altogether, use 64 bit offsets in memory, and let the variable length encoding handle making small offsets small. This would eliminate the whole maximum database file size issue!

If I want to save space, I could use a different variable length encoding for offsets. The current encoding is designed for Suneido's decimal floating point numbers and is not ideal for simple integers.

I can't use the Protocol Buffer varint encoding because I need encoded integers to sort in the same order as the original values. varint's don't because they are stored least significant first. One simple approach would be to reserve the first two bits to specify the length of the value, for example:

00 = 4 bytes
01 = 5 bytes
10 = 6 bytes
11 = 7 bytes

This would give a maximum of 54 bits (7 * 8 - 2) or about 16,000 terabytes. By the time anyone is trying to make a Suneido database bigger than that, it should be someone else's problem :-)
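A sketch of such an encoding (my illustration of the idea, not Suneido code): putting the two length bits at the top of the first byte, and storing the bytes most significant first, keeps byte-wise comparison of encodings consistent with numeric order.

    class OrderedOffsetEncoding {
        // encode an offset (up to 54 bits) in 4 to 7 bytes; the top two bits
        // of the first byte give the length, so longer (larger) values
        // compare greater under lexicographic comparison
        static byte[] encode(long offset) {
            int len;
            if (offset < (1L << 30)) len = 4;
            else if (offset < (1L << 38)) len = 5;
            else if (offset < (1L << 46)) len = 6;
            else len = 7;
            long v = offset | ((long) (len - 4) << (8 * len - 2)); // length tag
            byte[] buf = new byte[len];
            for (int i = len - 1; i >= 0; --i) {
                buf[i] = (byte) v; // fill backwards, so the result is big-endian
                v >>>= 8;
            }
            return buf;
        }
    }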

Of course, I'm itching to implement this, but I'm in the middle of debugging my immudb improvements so it'll have to wait.

Monday, March 05, 2012

Piracy?

I've been a fan of Pragmatic Programmers and their books for a long time. I buy their books directly, both to maximize their revenue, and to get eBooks without DRM.

Recently I noticed this at the bottom of one of my order emails. (I'm not sure how long they've been including it.)
Please don't put these eBooks on the 'net or on file-sharing networks. Sometimes folks do this accidentally by downloading into a shared directory.
I don't have any problem with the "don't post these eBooks on the 'net" part. But I really question the file-sharing part - not the legality of it, since I'm no expert on that, but the "morality" of it. Even the 'net part needs to be more specific - I put my books on the "net" - into Dropbox so I can access them from various devices and locations, but they're not public. Of course, maybe that's not "legal" either.

If I bought a paper copy of the book, and put it on the bookshelf in our company coffee room, I don't think I'd be breaking any laws. But I'm not supposed to do the same thing with a digital copy.

Potentially, multiple people could simultaneously read the digital copy. But I have trouble getting anyone to read any books. The chances of multiple people reading the same book at the same time are slim to nil.

Note: If my staff needs regular access to a book, I have no problem with buying extra copies.

I always thought Prag Prog were more "enlightened" than this. But obviously someone there couldn't resist the urge to "wag their finger" at us.

In some ways, Amazon, despite their DRM, is more flexible. I could buy Kindle books on a company account, and set up multiple Kindle readers (either devices or software) associated with that account. I don't know if Amazon would regard that as entirely legitimate, but they allow it. (Albeit, intended for "families".)

Publishers want to have their cake and eat it too - they want the benefits of digital - no printing, no inventory, no shipping, and at the same time, they want to take away all the traditional rights we had with paper books to share and lend. Some eBooks are now more expensive than the hardcover - what's the story with that? Who are the pirates now?

Sunday, March 04, 2012

jSuneido Immudb Progress

I've made good progress on the changes to the append-only database storage engine for jSuneido. The new version now passes all the built-in tests, although I have a few things left to do to improve performance.

I'm pretty happy with how it went. No big surprises and the code turned out quite simple for the most part. It's nice that there's less shared mutable state and therefore fewer potential locking issues.

Previously, updating the shared state and persisting to disk were a combined process. It was a little tricky teasing this apart so they were independent. For example, the semi-immutable data structures were flipping to immutable when they were stored. Now they have to be flipped to immutable before they go into the shared state, separately from being stored.

Of course, the reason I did all these changes was to try to address some "production scale" issues, and I still haven't got to the point where I can really tell if I've solved them or not. So far, the signs are good.

As usual, the code is in Mercurial on SourceForge.

Related: Immudb Progress

Sunday, February 19, 2012

Immudb Progress

I've been plugging away at improvements to the immudb append-only storage engine for jSuneido.

To reduce database growth: (see Houston, we have a problem and Problem or Opportunity?)

- Change to persist btree index changes to disks periodically, instead of after every transaction.
- This makes redirection less important so I'm removing it. (Funny how this great idea is suddenly more trouble than it's worth.)
- This also allows persisting btrees in a separate concurrent thread. To make this as independent as possible I plan to split the database into two files. I have mixed feelings about this because I really like having everything in one file. With two files you have to make sure you always backup / restore / copy / move / rename the two files together because if they get out of sync you're in trouble.

To reduce transaction conflicts: (see Suneido Database Transaction Puzzle)

- Instead of tracking reads and writes by btree node, track writes by record and reads by index range and predicate.
- This makes it harder to use the previous approach of merging btree versions. Instead I'm "overlaying" transaction local indexes over the "master" indexes, and then applying the index changes to the master index during commit. (Easy because commits are single-threaded.)


None of these changes is that big, but unfortunately, they are fairly invasive. Rather than "break" everything while the changes are in progress I've been taking a "strangler fig" approach, keeping the existing code intact and functional while at the same time building up a new version that leans heavily on the old one. Once I have the new version functional on its own I can remove the vestiges of the old code.

I should probably be doing these two sets of changes separately, but they keep getting mixed together in my head. I just know where I want to get to, not necessarily which problem each particular change is aimed at. The changes all affect the same code, so in some respects it's easier to do them together. Time will tell whether this was the right approach :-)

Another question is whether the changes I'm working on will actually succeed in reducing database growth sufficiently. I'm actually fairly confident about this because it's not a "yes or no" question. If you persist the indexes after every commit you get the excessive growth I've already seen. At the opposite extreme, if you only persisted at shutdown, you'd get minimal growth. Presumably, the answer lies somewhere between these two extremes. Looking at it another way, if you updated the same btree node in each transaction, but only saved it once per 10 transactions, then you'd get 1/10th the database growth. If 1/10th is still too big, then you only save every 100 transactions, giving 1/100th the database growth. Since the actual data is still being written to disk at each commit, the only cost of persisting the indexes less often is increased work by crash recovery, which (a) shouldn't happen that often, and (b) processing e.g. 100 commits worth of data to recover from a crash doesn't seem that bad.

Another nice feature of deferring persisting indexes is that it offloads work from commits which are a potential bottleneck since they're single-threaded. There are some issues with persisting indexes concurrently with commits, but I think I have an approach figured out. On the other hand, I'm now updating the master indexes during the commit so it's not all gain.

So far so good. I have some of the pieces working and it's gradually coming together. I'm anxious to see how it turns out.

Wednesday, January 25, 2012

On-line Database Compaction

One of the weaknesses (at least in my mind) of Suneido's database is that it requires off-line compaction, i.e. you have to periodically shut down the server and run a standalone compaction process.

Practically speaking, that hasn't been a big issue. With the mutable storage engine, database growth is relatively slow and you don't need to compact that often. Few of our customers run 24 x 7 so it's not a problem to shut down at night occasionally. Also, some of the updates we deploy have to be run single user anyway.

But with the immutable storage engine, even with my improved design, the database will grow faster, and require more frequent compaction.

It would sure be nice if you could compact on-line, in the background, while the server was running.

Thinking about my planned improvements, I realized this might now be relatively easy. (Actually, I think I could have used a similar approach before, it just became more obvious now.)

The basic idea is to use a read-only transaction, and its snapshot of the data base as of a certain point in time, to compact the database up to that point. (Just like the current off-line compaction.)

But since the database is active, you will probably have activity after that point. So then you reprocess any updates after your transaction and apply them to the new database. (This is similar to how crash recovery would reprocess updates that happened after the last good checkpoint.) When you "catch up", then you switch over to the new database (with some synchronization required).
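In outline (with made-up names for the pieces):

    // on-line compaction: compact a snapshot, then reprocess the updates
    // that happened after it, and finally switch over
    class OnlineCompaction {
        interface Database {
            Transaction readonlyTransaction();
            Object commitLock();
            void switchTo(Database replacement);
        }

        interface Transaction {
            long position(); // the point in time the snapshot represents
        }

        void compact(Database db) {
            Transaction snapshot = db.readonlyTransaction();
            Database compacted = copyLiveData(snapshot); // like offline compaction
            long pos = snapshot.position();
            while (!nearlyCaughtUp(db, pos))
                pos = reapplyUpdatesSince(db, pos, compacted); // chase new activity
            synchronized (db.commitLock()) { // briefly block commits
                reapplyUpdatesSince(db, pos, compacted); // final catch-up
                db.switchTo(compacted); // swap in the new database file
            }
        }

        // stand-ins for the real operations:
        Database copyLiveData(Transaction snapshot) { return null; }
        long reapplyUpdatesSince(Database db, long pos, Database target) { return pos; }
        boolean nearlyCaughtUp(Database db, long pos) { return true; }
    }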

There is a potential problem if the server is so busy that the compaction never catches up. In practice, the compaction could be scheduled at a slow time of the day, and most of our systems aren't continuously busy. In any case, this wouldn't "hurt" anything, other than performance. You could easily detect this situation and abort the compaction.

Although the server can continue to operate during the compaction, performance may be affected. (As it can be with our current on-line backups.) CPU time is usually not the problem these days. A bigger issue is that reading all the data can cause other working set data to be evicted from memory. One option may be to not use virtual memory to read the data. Then you're only competing for disk cache and not virtual memory space. What we've found with our on-line backups is that as long as you have lots of memory it's not a problem.

One benefit of building a new compacted database file (rather than compact in place) is that it is inherently crash proof, at least in the sense that if the server crashes, the current database won't be affected. The new database being created can simply be discarded.

In garbage collector terms this is equivalent to a "copying collector" i.e. it just copies the live data, rather than scanning all the data (live and dead) as, for example, a "mark-sweep" collector does.

Suneido does support on-line backups, taking advantage of snapshot isolation to make a consistent backup as of a certain point in time. With the append only immudb storage engine, another option would be to simply copy the bytes of the database file up to a certain point without paying any attention to interpreting the structure. If the data and indexes are in the same file, this would copy both. If they were in separate files you'd have the option of only copying the data (like the current on-line backup). However, this would still be somewhat larger than the current backup because it would include deleted records. On the other hand, it should be faster since it's just a bulk copy.

Note: These ideas, like my ideas for limiting database growth, are just ideas. I'm sure there are unseen complexities waiting to trip me up.

Saturday, January 21, 2012

Problem or Opportunity?

I started to get a little worried about the database growth problem with my immudb append-only storage engine for jSuneido. I could think of ways to reduce the growth a little bit, but not by much, and only at the cost of adding a bunch of complexity.

Was I going to have to throw out a year's work, when it seemed so promising? Having to throw out a few days' work is one thing, but a year? How could I have gone so far without seeing this problem? Especially since I knew perfectly well it was an issue with this approach.

I'm as much an eternal optimist as the next programmer, with a big enough ego to think I can solve most problems. But I'm also human with the standard intermittent feelings of inadequacy and stupidity. In other words, I was starting to worry. But I knew enough to give myself time to mull it over.

I was sitting eating lunch, reading a computer book when I started to think about the problem.

What if I wrote out the redirections as just a list of values? That would be a lot more compact than saving the changes to the hash trie. That would be fine while you were running because you'd still have the hash trie in memory. But what about restarting? How would you get the in-memory hash trie back? And how would you handle crash recovery?

It seemed like the problem wasn't so much the way I was saving the information, it was saving it so often, for every transaction. I did this so the on-disk data structure matched the in-memory data structure after every transaction commit.

But was that strictly necessary? What if you relaxed that constraint? What if you only wrote out all the extra information (btree nodes, redirections, db info) periodically? That would reduce the database growth a lot. You could then consider a continuum, from saving after every commit, to the opposite extreme of only saving when you shut down the server.

I realized that what I was thinking about was basically the idea of "checkpoints" - a standard database implementation technique. You could also think of it as simply deferring index writes to disk.

Crash recovery becomes a little harder - you'd have to start from the last good checkpoint and reprocess the data from transactions after that. Similar to how other database crash recovery reprocesses logs from the last checkpoint. This isn't quite as elegant or simple as being able to simply pick up from the last good commit, but crashes don't happen that often. And it was still a lot better than having to rebuild all the indexes after a crash as Suneido requires with the current mutable storage engine.

The beauty of this solution is that it doesn't require a lot of changes. The immutable persistent in-memory data structures and the code that uses them stays exactly the same. You just don't save it as often.

Of course, tanstaafl. The cost of this approach would be higher memory usage. Once the index changes are saved to disk you don't have to keep the in-memory versions. Obviously, if you don't save to disk, you do have to keep the in-memory versions. However, the memory cost is much less than the disk growth because the memory will be garbage collected. Memory is cheap and jSuneido assumes you have lots. (For small systems with limited memory, cSuneido is a better choice.)

Another benefit from saving after multiple transactions is that there can be considerable overlap. For example, if there are "hotspots" in the data that are frequently updated, you only save them once per checkpoint instead of once per commit.

So the commits would simply save the data records (inserts, updates, and deletes) to the database file, with no other information. The btrees would be updated in-memory only. Then periodically you would save a checkpoint with the btree and db info changes.

Thinking about it, I wonder if I needed the redirections with this approach. The point of the redirections is to save space by not path copying to the root. But periodic checkpoints may save enough space that redirections aren't as important. Eliminating them would definitely simplify the code - always a good thing!

I also realized that I could even write the checkpoints in a separate concurrent thread since they could use the persistent immutable data structures to view the state as of a certain point in time (just like transactions do) without any concurrency issues and without blocking ongoing database activity. The only trick would be coordinating writing to the database file. Even that coordination could be avoided if you wrote the checkpoints (i.e. the index btrees) to a separate file from the data. Although I like having everything in one file, it's certainly not essential.

They say that creative problem solving is a lot about combining known, existing pieces into new combinations. I had all the pieces I needed to solve this problem, I just had to put them together. (Ironically, I talked about a lot of the same pieces almost two years ago in A Faster, Cleaner Database Design for Suneido)

And so I go from thoughts of a depressing major failure, to feeling like a kid with a new Lego kit :-)

Thursday, January 19, 2012

Houston, we have a problem

I thought things were going well with my new immudb append-only storage engine. It finally passed all our application tests, and performance was looking good.

Then I noticed the amount of database growth when I ran our full application test suite:

mutable storage engine = 20mb
immutable storage engine = 400mb

Ouch. I always expected it to use more space - that's the cost of immutable persistent data structures. But I didn't expect it to be 20 times as much.

With immutable persistent data structures in memory, the space is reclaimed by garbage collection. But so far the only form of garbage collection for database space that Suneido has had is offline compaction.

It's a little surprising that the performance is good, even on hard disk rather than SSD. I assume that's because sequential writes are actually quite fast. And also because it's using memory mapped file access so it's not blocking.

Given that the performance is ok, the database growth might even be acceptable. It might just mean compacting the database nightly instead of weekly as we currently do. However, presumably the performance would be even better if it wasn't writing so much.

I think it's worth spending some time to see if I can improve it. The first step was to see what the breakdown of the space was. (Note: these figures are growth, not total size.)

96820 update transactions
data space 11 mb
btree node space 140 mb
redirection space 89 mb
db info space 148 mb

If you look at the growth per transaction it's only about 4 kb. That doesn't sound too bad, until you have 100,000 transactions.

The btree node space could likely be decreased a lot (maybe half?) by compression of the nodes, perhaps something as simple as prefix compression.

Redirection is used to avoid path copying all the way to the root in the btrees. It's stored as an immutable persistent hash trie. Unfortunately, it does have to do path copying itself and that may be part of the problem.

I also noticed that each subsequent run used more redirection space. That's a little curious because it's doing the same set of updates each time. However, as the total number of redirections grows, then path copying will grow.

The "db info" is another hash trie containing aggregate data about each table and index. The information is used for query optimization. It could use redirection instead of path copying but currently it doesn't because it's using the same hash trie implementation as the redirections. It currently stores a "blob" of information for each table.

One improvement would be to delete redirection and db info when a table or index is destroyed. However, this would help the tests more than actual application usage, since the tests create and destroy lots of temporary tables. This might mean actual application usage would not show as high database growth as the tests.

I'm not sure what to try first. It's too bad the space is coming from three similarly sized sources. If the majority had come from one source then that would be the obvious place to start. There's always a chance that there are bugs or design flaws causing part of the problem. Another variable to try adjusting is btree node size.

* You could argue that our application test suite ideally shouldn't be hitting the database at all. But that's not relevant to this post.

** The title of this post comes from Apollo 13

Thursday, January 12, 2012

jSuneido immudb milestone

As of today, my new append-only storage engine for jSuneido (immudb) successfully runs all our application tests. Of course, there are probably still bugs, but it's a big milestone.

And it runs the standard library portion of the tests roughly twice as fast as with the previous storage engine. It also loads databases from a dump roughly twice as fast. And crash recovery is hugely faster.

Overall, I'm pleased with how it's worked out, other than the transactions issues I posted about recently.

As usual, I was surprised when I looked at how long it's been since I started on this - almost exactly a year! Of course, I wasn't working on it full time, and I had to do quite a lot of refactoring on jSuneido to allow pluggable storage engines, but still a lot of work.

Next, I'm planning to work on the transaction issues. Hopefully that won't take a year!

See also: all the posts related to immudb