Wednesday, January 25, 2012

On-line Database Compaction

One of the weaknesses (at least in my mind) of Suneido's database is that it requires off-line compaction, i.e. you have to periodically shut down the server and run a standalone compaction process.

Practically speaking, that hasn't been a big issue. With the mutable storage engine, database growth is relatively slow and you don't need to compact that often. Few of our customers run 24 x 7 so it's not a problem to shut down at night occasionally. Also, some of the updates we deploy have to be run single user anyway.

But with the immutable storage engine, even with my improved design, the database will grow faster, and require more frequent compaction.

It would sure be nice if you could compact on-line, in the background, while the server was running.

Thinking about my planned improvements, I realized this might now be relatively easy. (Actually, I think I could have used a similar approach before, it just became more obvious now.)

The basic idea is to use a read-only transaction, and its snapshot of the database as of a certain point in time, to compact the database up to that point. (Just like the current off-line compaction.)

But since the database is active, you will probably have activity after that point. So then you reprocess any updates after your transaction and apply them to the new database. (This is similar to how crash recovery would reprocess updates that happened after the last good checkpoint.) When you "catch up", then you switch over to the new database (with some synchronization required).
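Here's a rough sketch of that catch-up loop in Java. None of these names (Database, Update, updatesSince, etc.) are jSuneido's actual API - they're invented to make the idea concrete:

```java
import java.util.List;

// Assumed interfaces, purely for illustration - not jSuneido's API.
interface Update { long position(); }

interface Database {
    long snapshotPoint();                 // read-only view as of "now"
    Database copyLiveDataUpTo(long pos);  // same work as off-line compaction
    List<Update> updatesSince(long pos);  // commits after a given point
    void apply(Update u);
    void switchTo(Database replacement);  // swap over (needs synchronization)
}

class OnlineCompactor {
    /** Returns the compacted database, or null if it never caught up. */
    static Database compact(Database db, int maxPasses) {
        long pos = db.snapshotPoint();
        Database fresh = db.copyLiveDataUpTo(pos); // from the snapshot
        for (int pass = 0; pass < maxPasses; ++pass) {
            List<Update> updates = db.updatesSince(pos);
            if (updates.isEmpty()) { // caught up - switch to the new file
                db.switchTo(fresh);
                return fresh;
            }
            for (Update u : updates) { // reprocess, like crash recovery
                fresh.apply(u);
                pos = u.position();
            }
        }
        return null; // too busy to catch up - abort, nothing is hurt
    }
}
```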

There is a potential problem if the server is so busy that the compaction never catches up. In practice, the compaction could be scheduled at a slow time of the day, and most of our systems aren't continuously busy. In any case, this wouldn't "hurt" anything, other than performance. You could easily detect this situation and abort the compaction.

Although the server can continue to operate during the compaction, performance may be affected. (As it can be with our current on-line backups.) CPU time is usually not the problem these days. A bigger issue is that reading all the data can cause other working set data to be evicted from memory. One option may be to read the data without memory mapping it; then you're only competing for the disk cache and not for virtual memory space. What we've found with our on-line backups is that as long as you have lots of memory it's not a problem.

One benefit of building a new compacted database file (rather than compact in place) is that it is inherently crash proof, at least in the sense that if the server crashes, the current database won't be affected. The new database being created can simply be discarded.

In garbage collector terms this is equivalent to a "copying collector" i.e. it just copies the live data, rather than scanning all the data (live and dead) as, for example, a "mark-sweep" collector does.

Suneido does support on-line backups, taking advantage of snapshot isolation to make a consistent backup as of a certain point in time. With the append-only immudb storage engine, another option would be to simply copy the bytes of the database file up to a certain point, without interpreting the structure at all. If the data and indexes are in the same file, this would copy both. If they were in separate files you'd have the option of only copying the data (like the current on-line backup). However, this would still be somewhat larger than the current backup because it would include deleted records. On the other hand, it should be faster since it's just a bulk copy.
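For the bulk copy variant, the core really is just a byte copy up to a known-good offset. A minimal sketch using standard Java NIO (the offset would come from the storage engine; here it's just a parameter):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.*;

// With an append-only file, everything before a known-good offset is a
// consistent database image, so backup is a plain byte copy.
class BulkBackup {
    static void backup(Path dbFile, Path backupFile, long upTo)
            throws IOException {
        try (FileChannel src = FileChannel.open(dbFile, READ);
             FileChannel dst = FileChannel.open(backupFile, CREATE_NEW, WRITE)) {
            long pos = 0;
            while (pos < upTo) // transferTo may copy less than requested
                pos += src.transferTo(pos, upTo - pos, dst);
        }
    }
}
```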

Note: These ideas, like my ideas for limiting database growth, are just ideas. I'm sure there are unseen complexities waiting to trip me up.

Saturday, January 21, 2012

Problem or Opportunity?

I started to get a little worried about the database growth problem with my immudb append-only storage engine for jSuneido. I could think of ways to reduce the growth a little bit, but not by much, and only at the cost of adding a bunch of complexity.

Was I going to have to throw out a year's work, when it seemed so promising? Having to throw out a few days' work is one thing, but a year? How could I have gone so far without seeing this problem? Especially since I knew perfectly well it was an issue with this approach.

I'm as much an eternal optimist as the next programmer, with a big enough ego to think I can solve most problems. But I'm also human with the standard intermittent feelings of inadequacy and stupidity. In other words, I was starting to worry. But I knew enough to give myself time to mull it over.

I was sitting eating lunch, reading a computer book when I started to think about the problem.

What if I wrote out the redirections as just a list of values? That would be a lot more compact than saving the changes to the hash trie. That would be fine while you were running because you'd still have the hash trie in memory. But what about restarting? How would you get the in-memory hash trie back? And how would you handle crash recovery?

It seemed like the problem wasn't so much the way I was saving the information, it was saving it so often, for every transaction. I did this so the on-disk data structure matched the in-memory data structure after every transaction commit.

But was that strictly necessary? What if you relaxed that constraint? What if you only wrote out all the extra information (btree nodes, redirections, db info) periodically? That would reduce the database growth a lot. You could then consider a continuum, from saving after every commit, to the opposite extreme of only saving when you shut down the server.

I realized that what I was thinking about was basically the idea of "checkpoints" - a standard database implementation technique. You could also think of it as simply deferring index writes to disk.

Crash recovery becomes a little harder - you'd have to start from the last good checkpoint and reprocess the data from transactions after that, similar to how other databases reprocess their logs from the last checkpoint. This isn't quite as elegant or simple as being able to pick up from the last good commit, but crashes don't happen that often. And it's still a lot better than having to rebuild all the indexes after a crash, as Suneido requires with the current mutable storage engine.

The beauty of this solution is that it doesn't require a lot of changes. The immutable persistent in-memory data structures and the code that uses them stays exactly the same. You just don't save it as often.

Of course, TANSTAAFL (there ain't no such thing as a free lunch). The cost of this approach is higher memory usage. Once the index changes are saved to disk you don't have to keep the in-memory versions; until you save them, you do. However, the memory cost is much less than the disk growth because the memory will be garbage collected. Memory is cheap and jSuneido assumes you have lots. (For small systems with limited memory, cSuneido is a better choice.)

Another benefit from saving after multiple transactions is that there can be considerable overlap. For example, if there are "hotspots" in the data that are frequently updated, you only save them once per checkpoint instead of once per commit.

So the commits would simply save the data records (inserts, updates, and deletes) to the database file, with no other information. The btrees would be updated in-memory only. Then periodically you would save a checkpoint with the btree and db info changes.
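A toy model of that split, just to make it concrete (a TreeMap stands in for the btree and a list of strings for the database file - obviously nothing like the real implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model: commits append data records immediately; the index lives in
// memory and is only written out at checkpoints.
class CheckpointToy {
    final List<String> file = new ArrayList<>();            // append-only "storage"
    final TreeMap<String, Integer> index = new TreeMap<>(); // in-memory "btree"
    int lastCheckpoint = -1;

    void commit(String record) {                 // e.g. "name=fred"
        file.add(record);                        // data is durable per commit
        index.put(key(record), file.size() - 1); // index updated in memory only
    }

    void checkpoint() {
        file.add("CHECKPOINT " + index); // persist index state periodically
        lastCheckpoint = file.size() - 1;
    }

    // Crash recovery would reload the index from the last checkpoint, then
    // re-index the data records committed after it.

    static String key(String record) { return record.split("=")[0]; }
}
```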

Thinking about it, I wonder whether I even need the redirections with this approach. The point of the redirections is to save space by not path copying to the root. But periodic checkpoints may save enough space that redirections aren't as important. Eliminating them would definitely simplify the code - always a good thing!

I also realized that I could even write the checkpoints in a separate concurrent thread since they could use the persistent immutable data structures to view the state as of a certain point in time (just like transactions do) without any concurrency issues and without blocking ongoing database activity. The only trick would be coordinating writing to the database file. Even that coordination could be avoided if you wrote the checkpoints (i.e. the index btrees) to a separate file from the data. Although I like having everything in one file, it's certainly not essential.
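The only thing the background thread relies on is immutability: grabbing the current state reference gives a stable point-in-time view. A generic sketch (again, illustrative names, not actual jSuneido classes):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Because the state object is immutable, capturing a reference to it
// gives a consistent snapshot - no locking needed while it's written out.
class BackgroundCheckpointer<S> {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Supplier<S> currentState;  // e.g. reads a volatile field
    private final Consumer<S> writeToDisk;   // serializes btrees + db info

    BackgroundCheckpointer(Supplier<S> currentState, Consumer<S> writeToDisk) {
        this.currentState = currentState;
        this.writeToDisk = writeToDisk;
    }

    void scheduleCheckpoint() {
        S snapshot = currentState.get(); // immutable => safe to share
        writer.submit(() -> writeToDisk.accept(snapshot));
    }
}
```

Using a single-threaded executor also serializes the checkpoint writes themselves, which is most of the file coordination mentioned above.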

They say that creative problem solving is a lot about combining known, existing pieces into new combinations. I had all the pieces I needed to solve this problem, I just had to put them together. (Ironically, I talked about a lot of the same pieces almost two years ago in A Faster, Cleaner Database Design for Suneido.)

And so I go from thoughts of a depressing major failure, to feeling like a kid with a new Lego kit :-)

Thursday, January 19, 2012

Houston, we have a problem

I thought things were going well with my new immudb append-only storage engine. It finally passed all our application tests, and performance was looking good.

Then I noticed the amount of database growth when I ran our full application test suite:

mutable storage engine = 20 MB
immutable storage engine = 400 MB

Ouch. I always expected it to use more space - that's the cost of immutable persistent data structures. But I didn't expect it to be 20 times as much.

With immutable persistent data structures in memory, the space is reclaimed by garbage collection. But so far the only form of garbage collection for database space that Suneido has had is off-line compaction.

It's a little surprising that the performance is good, even on hard disk rather than SSD. I assume that's because sequential writes are actually quite fast. And also because it's using memory-mapped file access, so it's not blocking.

Given that the performance is ok, the database growth might even be acceptable. It might just mean compacting the database nightly instead of weekly as we currently do. However, presumably the performance would be even better if it wasn't writing so much.

I think it's worth spending some time to see if I can improve it. The first step was to see what the breakdown of the space was. (Note: these figures are growth, not total size.)

96,820 update transactions
data space: 11 MB
btree node space: 140 MB
redirection space: 89 MB
db info space: 148 MB

If you look at the growth per transaction it's only about 4 KB. That doesn't sound too bad, until you have 100,000 transactions.

The btree node space could likely be decreased a lot (maybe half?) by compression of the nodes, perhaps something as simple as prefix compression.
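As an illustration of how simple prefix compression can be (this is the textbook technique, not whatever format immudb would actually use): since the keys in a node are sorted, each key can be stored as the length of the prefix it shares with the previous key, plus its suffix:

```java
import java.util.ArrayList;
import java.util.List;

// Textbook prefix compression for the sorted keys in a btree node:
// each entry stores (length of prefix shared with previous key, suffix).
class PrefixCompression {
    record Entry(int shared, String suffix) {}

    static List<Entry> compress(List<String> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int n = commonPrefix(prev, key);
            out.add(new Entry(n, key.substring(n)));
            prev = key;
        }
        return out;
    }

    static List<String> decompress(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            prev = prev.substring(0, e.shared) + e.suffix;
            out.add(prev);
        }
        return out;
    }

    static int commonPrefix(String a, String b) {
        int n = 0, max = Math.min(a.length(), b.length());
        while (n < max && a.charAt(n) == b.charAt(n))
            ++n;
        return n;
    }
}
```

For runs of keys like "customer00123", "customer00124", ... most of each key compresses away, which is why halving the node space seems plausible.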

Redirection is used to avoid path copying all the way to the root in the btrees. It's stored as an immutable persistent hash trie. Unfortunately, it does have to do path copying itself and that may be part of the problem.
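To see where that space goes, here's a minimal immutable hash trie update (16-way on the hash, collision handling omitted - an illustration of the technique, not the actual immudb structure). Every node from the root down to the affected leaf gets copied, and each copy becomes a new node on disk when the trie is saved:

```java
// Minimal immutable hash trie: one update allocates a fresh copy of every
// node on the path from root to leaf (the "path copying").
final class HashTrie {
    private final Object[] slots; // each slot: null, a value, or a child trie

    HashTrie() { this(new Object[16]); }
    private HashTrie(Object[] slots) { this.slots = slots; }

    HashTrie with(int hash, Object value) { return with(hash, 0, value); }

    private HashTrie with(int hash, int shift, Object value) {
        Object[] copy = slots.clone();         // the path copy
        int i = (hash >>> shift) & 0xf;
        if (copy[i] instanceof HashTrie child)
            copy[i] = child.with(hash, shift + 4, value); // recurse, copying
        else
            copy[i] = value;                   // insert or overwrite leaf
        return new HashTrie(copy);
    }
}
```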

I also noticed that each subsequent run used more redirection space. That's a little curious because it's doing the same set of updates each time. However, as the total number of redirections grows, the path copying grows with it.

The "db info" is another hash trie containing aggregate data about each table and index. The information is used for query optimization. It could use redirection instead of path copying but currently it doesn't because it's using the same hash trie implementation as the redirections. It currently stores a "blob" of information for each table.

One improvement would be to delete redirection and db info when a table or index is destroyed. However, this would help the tests more than actual application usage, since the tests create and destroy lots of temporary tables. This might mean actual application usage would not show as high database growth as the tests.

I'm not sure what to try first. It's too bad the space is coming from three similarly sized sources. If the majority had come from one source then that would be the obvious place to start. There's always a chance that there are bugs or design flaws causing part of the problem. Another variable to try adjusting is btree node size.

* You could argue that our application test suite ideally shouldn't be hitting the database at all. But that's not relevant to this post.

** The title of this post comes from Apollo 13

Thursday, January 12, 2012

jSuneido immudb milestone

As of today, my new append-only storage engine for jSuneido (immudb) successfully runs all our application tests. Of course, there are probably still bugs, but it's a big milestone.

And it runs the standard library portion of the tests roughly twice as fast as with the previous storage engine. It also loads databases from a dump roughly twice as fast. And crash recovery is hugely faster.

Overall, I'm pleased with how it's worked out, other than the transaction issues I posted about recently.

As usual, I was surprised when I looked at how long it's been since I started on this - almost exactly a year! Of course, I wasn't working on it full time, and I had to do quite a lot of refactoring on jSuneido to allow pluggable storage engines, but it was still a lot of work.

Next, I'm planning to work on the transaction issues. Hopefully that won't take a year!

See also: all the posts related to immudb

Sunday, January 08, 2012

Suneido Database Transaction Puzzle

When we converted the first large customer to jSuneido we ran into a lot of transaction conflicts. A lot more than we had with cSuneido. The question was why? (Note: jSuneido is working well, without many conflicts, both in-house and at a smaller client - this problem only showed up under heavy use.)

[Disclaimer - this post is largely to clarify and review my own thinking. My apologies if it's somewhat cryptic.]

A lot of the conflicts could be traced back to application issues that we could fix e.g. having an alert inside a transaction (so the transaction was kept open too long) or not retrying.

But there were still too many conflicts. I ended up adding an option for snapshot isolation rather than serializable snapshot isolation. That eliminated most of the conflicts, but it opens up the potential for inconsistencies and I'm not really happy with it.

My initial assumption was that jSuneido was more "strict" than cSuneido and that was why there were more conflicts. jSuneido implements the algorithm from Serializable Isolation for Snapshot Databases by Cahill, Rohm, and Fekete [CRF], whereas cSuneido does "read validation" i.e. when a transaction commits it verifies that all of its reads are still valid.

But when I thought about the two approaches, I realized that cSuneido's read validation is actually more restrictive than jSuneido's CRF algorithm. Read validation requires transactions to serialize as of their commit time, whereas CRF allows re-ordering (as long as there are no dependency cycles that would prevent it). Otherwise, they are basically doing the same thing.

So I was back to trying to figure out why jSuneido was getting more conflicts.

The CRF approach does admit some "false positives" i.e. spurious conflicts that are not really conflicts. Because it's an "incremental" approach, the conflict flags can get set by transactions that don't end up overlapping or that later abort. As the rate of conflicts rises, this effect will get worse. It's possible this accounts for some of the jSuneido conflicts but I don't think it's the main reason. (Note: cSuneido's read validation only looks at committed transactions so it doesn't degrade in this fashion.)

One major difference is that jSuneido tracks reads and writes by B-tree index node rather than by individual record. This handles phantoms and simplifies merging transaction updates, but it can lead to more conflicts. For example, if a table has only a few rows, then all updates will be in the same B-tree node and overlapping transactions will always conflict. Similarly, appending keys at the end of an index is likely to conflict due to updating the same node.

cSuneido handles phantoms by tracking the index ranges that a transaction reads. Read validation then checks whether any concurrent writes from other transactions fell within those ranges.
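A sketch of that range-based read validation (types and names invented for illustration): each transaction records the index ranges it read, and at commit it checks whether any key written by a concurrent, committed transaction falls inside one of them:

```java
import java.util.ArrayList;
import java.util.List;

// Read validation over index ranges: a transaction conflicts if a
// concurrent committed transaction wrote a key inside a range it read.
class ReadValidation {
    record Range(String lo, String hi) {
        boolean contains(String key) {
            return lo.compareTo(key) <= 0 && key.compareTo(hi) <= 0;
        }
    }

    final List<Range> reads = new ArrayList<>(); // per-transaction read set

    /** true if this transaction may commit, given the writes of
     *  transactions that committed while it was active */
    boolean validate(List<String> concurrentWrites) {
        for (String key : concurrentWrites)
            for (Range r : reads)
                if (r.contains(key))
                    return false; // a read range was invalidated: conflict
        return true;
    }
}
```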

My current thinking is that tracking reads and writes by B-tree node is what accounts for jSuneido getting more conflicts, although I don't have an easy way to "prove" this. If I'm correct, then the solution is to track index ranges rather than B-tree nodes. But that's not so easy with CRF since it uses read locks on items - harder to do with ranges. And you have to deal with concurrent changes to B-tree nodes, which would require a complex tree merge.

So I think I'll probably go back to read validation instead of the CRF approach. It does mean more work during commits, and since commits are single-threaded, that will reduce the concurrency of commits. On the other hand, CRF required access to shared locks, which reduces the concurrency of reads and writes. And if commits did turn out to be a bottleneck, I think it would be relatively easy to multi-thread the commit process, e.g. using one task per index. (Since indexes are independent, this should parallelize well.)

But it still leaves the question of how to handle concurrent B-tree updates if I'm no longer "locking" nodes. One approach I've been considering is to delay updating the actual "master" B-tree indexes until a transaction commits. Then the updates are single-threaded and there are no concurrent updates to deal with. Again, this puts more work into the commit, but I think this can be addressed as above.

(cSuneido handles this issue by keeping multiple versions of keys in the B-tree nodes. But this is tricky and doesn't fit well with the append-only approach that I'm concentrating on.)

However, transactions still need to "see" their own updates, so you'd have to have some kind of additional per-transaction indexes that you would transparently "merge" with the master ones. These could be B-tree indexes, but since they are in-memory, there are probably better data structures. One possibility is a "merge tree". I called this a "fractal tree" in a couple of blog posts (here and here) but the simple structure I'm talking about is not really the same as Tokutek's fractal tree. Since those blog posts I wrote an improved version (MergeTree.java) which I'm now using for temporary indexes required by queries.

A merge tree will be fast for inserts (faster than a B-tree) but slower for lookups. Accumulating updates in a per-transaction merge tree and then applying them to the master B-tree during commit will also have the side benefit of applying the updates to the B-tree in sorted order, which is optimal for B-trees.
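Putting those two ideas together - per-transaction updates overlaid on the master index and applied at commit - might look roughly like this (TreeMap standing in for both the merge tree and the btree; deletes would need tombstones, omitted here):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// A per-transaction overlay index: reads consult the transaction's own
// updates first, then the shared master index; at commit the (already
// sorted) updates are applied to the master single-threaded.
class OverlayIndex {
    final NavigableMap<String, Long> master;              // shared btree stand-in
    final TreeMap<String, Long> local = new TreeMap<>();  // this txn's updates

    OverlayIndex(NavigableMap<String, Long> master) { this.master = master; }

    Long get(String key) {
        Long v = local.get(key);                // txn sees its own updates...
        return v != null ? v : master.get(key); // ...else the master's view
    }

    void put(String key, long address) { local.put(key, address); }

    void commit() {           // called single-threaded; keys arrive in
        master.putAll(local); // sorted order, which is optimal for a btree
    }
}
```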

Thinking about this, I have an idea for another potentially large optimization. Serializability really only cares about the records that the application sees. This is not the same as the records that the low level query execution reads. For example, if you have an index on fields a,b,c and you query on b, the low level code will scan the entire index, perhaps only returning a few records (possibly none). If you just track the index range that was read by the low level code, then it will appear that all the records were read. This will mean a much greater chance of conflicts than if you tracked the records that the application actually received.

However, I don't want to mix levels and have the query execution code worry about transaction read tracking. I think what I can do is have the query code pass a "black box" predicate to the low level database code. Then the transaction read validation can use the predicate when validating reads. i.e. writes by other transactions that do not match the predicate cannot have been seen and can be ignored.
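A sketch of that predicate-aware validation (again with invented names): the read set keeps both the scanned range and the opaque predicate, and a concurrent write only counts as a conflict if it falls in the range and matches the predicate:

```java
import java.util.List;
import java.util.function.Predicate;

// Validation that ignores concurrent writes the transaction could not
// have seen, using the query layer's predicate as a black box.
class PredicateValidation {
    record Row(String key) {}
    record Read(String lo, String hi, Predicate<Row> seen) {}

    static boolean validate(List<Read> reads, List<Row> concurrentWrites) {
        for (Row w : concurrentWrites)
            for (Read r : reads)
                if (inRange(r, w.key()) && r.seen().test(w))
                    return false; // write matches what we read: conflict
        return true; // writes we couldn't have seen are ignored
    }

    static boolean inRange(Read r, String key) {
        return r.lo().compareTo(key) <= 0 && key.compareTo(r.hi()) <= 0;
    }
}
```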

One remaining issue is "bulk" updates since none of these approaches are great for updating a huge number of records. It may be reasonable to allow the application code to "lock" a table. This would prevent concurrent updates, and would allow updating the master B-trees directly, without accumulating updates in memory.

The problem with all of this is that it relies on an embarrassing amount of guess work and seat of the pants intuition. (Based on a fair bit of experience, but even experienced intuition is probably wrong as often as it's right.) Ideally, you'd test out these ideas before jumping into implementing them. But it's just not feasible. Any one of these issues would make a good research project. And even then, you'd only have analyzed a single issue - not how it relates to all the other issues.

The solution space is just too large. Which, of course, is also what makes it fun. It would be pretty boring if it was simple enough to analyze exhaustively, and have no questions about the best approach. The chances of me stumbling on a solution that is better than what's already out there are slim, but at least they exist.

The timing actually isn't too bad. I'm just finishing up debugging the append-only storage engine. (So far, the performance looks promising, perhaps as much as twice as fast.) Once I finish that I can take a look at these transaction ideas and see what I can do.

Monday, January 02, 2012

Installing the Energy Detective

It's been an adventure installing the TED 5000 home energy monitor that I got to complement my new solar PV system. (See my Sustainable Adventure blog posts: Solar Powered and Rock Paper Sun.) I'm putting this post on my computer blog because it's more on the technical side. Sorry, it's a long post, I won't be offended if you don't read it. If nothing else, I'm hoping it might help other people with their installations.

I was nervous about installing it in the first place, because I'm not really comfortable messing around in the house breaker box. It didn't help that my last encounter with the house electrical system was when I tried to drill a hole through the back of a closet to run a network cable and managed to hit the electrical wiring inside the wall and short it out. (Annoying Shelley since it left her office without power.) But it's supposed to be straightforward, and I'm supposed to be a technical person. (If I'd been smarter I would have had the electrician install it when they installed the solar panels.)

You have to shut off the main breaker to the house, which means no lights. A headlamp is very handy.

I had to install two MTU's (Measuring and Transmission Units), each with two CT (Current Transformer) sensors that clamp over the power wires, plus a Gateway unit that connects to your network.

My breaker box has two sections - an upper part with the main breaker, and a lower part with all the house breakers. I assume the idea is that you shut off the main breaker and then you can work in the lower section with no danger. Except I couldn't fit the main CT clamps in the lower section - the only place I could fit them was in the upper section (where the power is still live on the external side of the main breaker).

I initially only installed the main (usage) MTU because the solar instructions showed two different configurations and I wasn't sure which one I had, which meant I wasn't sure which wires to put the solar CT clamps on, or even if I was supposed to install it in the house breaker box, or outside in the solar breaker box.

Next I had to find an empty power outlet (ideally not too distant from the breaker box) for the Gateway unit. It also had to be close enough to connect to my router. I used the same outlet that my solar panel monitoring unit was plugged into. Both it and the TED gateway use power line communications. (Filling up this outlet didn't make me popular because it was the one that Shelley had been using to plug in chargers for gadgets.)

Of course, my Apple Time Capsule router did not have any more free ports. (I used up the last one for the solar system.) So I had to run out and pick up a switch to get more ports.

Thankfully, I didn't have any trouble with the power line communications, as many people seem to. It probably helps that the office is at the same end of the house as the breaker box. I also remembered reading Jon Udell's blog post about installing a TED. The problem is that some of the TED instructions say to connect the black and the red wires, and some of the instructions say to connect just the black wire. (Not the only issue with their instructions.) Apparently, you should only connect the black one. In which case, why do they have the red one? Are there circumstances when you'd want to connect the red one?

The other thing the printed instructions don't mention is that if you only connect the black wire, then you have to specify that in the configuration. i.e. the default in the configuration doesn't match the instructions, sigh. This issue is fairly obvious because the voltage reads 60 V instead of 120 V. Still, it took me a while to discover the setting. At first I assumed I had installed something wrong.

The final step is to connect to the system from your computer. This was a little confusing. It seemed to be a browser based interface, but the instructions also told you to download and run their "installer". I'm always reluctant to install software like this unless I really have to, especially since a lot of the time it's totally unnecessary (like the software that comes with your camera). The instructions said to connect to http://ted5000 with your browser but that didn't work. So I gave in and ran their "installer", which turned out to NOT be an installer at all - it's just a utility that locates the IP address that your Gateway is on.

After all that, it seemed to be functional!

At first, I didn't think it was that critical to monitor the solar system, since it had its own monitor. But then I realized that I still wasn't really measuring the household energy usage. Instead, I was measuring the "net", i.e. the usage minus the solar generation. And to make it worse, with the default settings for the main MTU, the system assumes that usage will always be positive. But with the solar system, that's plain wrong - when the sun's shining, the net should be negative, but it was still showing as positive. The reason they do this (take the absolute value) is so it still works when people install the CT's backwards. (Of course, the instructions don't tell you any of this. I gleaned it from their forum.)

I eventually figured out that I had the first configuration in their solar instructions. I shut off the power and opened up the breaker box again. Eventually I managed to find the right wires in the rat's nest, get the clamps on them, and squeeze them into the box enough that I could get the cover back on. (If it was a suitcase, I'd be sitting on it to get it closed!)

In this configuration my main MTU is configured as "Adjusted Load". What they don't make clear is that this means it handles negative values, which means that suddenly it's critical to have the CT's the right way around. Of course, I had mine backwards. And of course, you don't discover this till you close everything back up and turn the power back on. The instructions tell you to leave the cover off till you verify the system is working, but it's such a tight squeeze getting my cover on, that I feel better doing it with the power off.

Part of the blame for getting them backwards lies in the instructions. The printed instructions that come in the box say to orient the red dots "towards the source of the power". As far as I can tell from my install, that's just plain wrong. The Installation Pictorial on the web site is better - it shows the red dots oriented the opposite way. But the text says to orient the red dots "towards the breaker". That's correct for their picture, but if you ended up putting the clamps on the other side of the main breaker (as I would have if there had been room), then the dots would not be facing the main breaker and the instructions would again be wrong. The video shows the dots the wrong way (AFAIK) and the voice over simply says "make sure you get them the right way", not very helpful.

So I had to reverse the main CT's, which meant opening both sections of the breaker box (you can't take the cover off the top part without first taking the cover off the bottom part). It's easy to put the clamps on the wires, but the main power wires are really heavy and stiff. Although there's a fair bit of room in the top section, it isn't easy to get the clamps turned so the cover will fit back on.

I'm probably overly paranoid, but I did most of this work with rubber soled shoes, gloves, and one hand in my pocket. I'm a belt and suspenders kind of a person, if you hadn't figured that out by now!

It's pretty obvious that getting the CT's backwards has been a common problem - not surprising given their instructions. You'd think by now they would at least have corrected and improved their instructions. They attempted to handle the issue by having the software ignore the polarity in the default configuration. But that just makes it worse in a more complex setup like mine. Why not just add a configuration option to reverse the polarity? I can't see why that would be difficult software-wise. And for customers it would sure be a lot simpler than opening everything back up to physically reverse them. Rather than ignore polarity, the default setup could simply detect the polarity and adjust the setting automatically. Then you wouldn't need confusing instructions about red dots at all.

Finally, I think it's all working properly! Usage shows positive, solar generation shows negative, and the net is either positive or negative depending on whether I'm generating more than I'm using.

In the end I think I turned off the power and opened up the breaker box four times. And every time I turned the power off I had to reset all the clocks in the house.

I didn't bother buying the TED display unit since I knew there were iPhone apps. I've installed the free TED-O-Meter iPhone app which is basic but functional. I bought iTED ($.99) because it had a graph, but the graph doesn't handle negative kW values. I may try some of the others.

One thing I really like about the solar panel system is that I can monitor it from anywhere through the internet. That's because the solar system sends the data to an outside server. The TED system doesn't work that way, instead the data is collected on the Gateway device, which runs a web server so you can access it. That means I can't access it from outside my local network. (Unless I want to punch holes in my firewall and deal with my dynamic IP etc.) It also means all my data is on the Gateway so I'm probably going to lose it at some point e.g. if I update the firmware.

To get around this, the TED system can post the data to an external service (commonly Google PowerMeter, until it was discontinued). Unfortunately, this didn't go smoothly. Every time I tried to activate it, it would time out connecting to the server. I tried several different services so I don't think it was a server problem. Eventually I got it to work, but unfortunately I wasn't scientific enough to isolate what the issue was. I tried various combinations of: switching browsers, trying different services, resetting the gateway, entering the url with and without "http://", and turning the OS X firewall on and off. Presumably, it's the gateway sending the data, so my computer and its firewall shouldn't be relevant, but it's possible they have client-side Javascript running in the browser that tests the connection during activation.

It's hard to tell which of these third party services is the "best". TED will only post to a single service so I can only try one at a time. I succeeded first with PlotWatt so I'll give them a try. I had emailed them about my connection problem and they responded quickly and helpfully so that's a good sign. The way their system works, they have to collect a certain amount of data before it's functional so I'll have to be patient. Of course, my initial data is a mess because half of it has the wrong polarity!

It would be nice if there were standards for all these home monitoring and automation systems. As it is every manufacturer and every type of system seems to do things differently and none of them work together.

Overall, the TED 5000 system seems OK. I definitely think they could improve their instructions and documentation. And I think they could simplify installation if they added a polarity option to the software. Personally, I'd prefer if they sent the data to an internet service, but I can see where being able to operate independently could be valuable (e.g. if you don't have an internet connection). I'm glad they use a browser based interface rather than proprietary client software. The user interface is adequate. And it's great that there are third party services and apps.

Overview of my rat's nest.

The main (usage) CT's

The solar CT's in the bottom of the box and the two MTU's mounted outside

Part of the TED display, showing usage of 178 W, solar generation of 426 W (cloudy), with the net result of -248 W; negative means I'm sending excess power to the grid.