Sunday, March 15, 2009

Leaky Abstractions

Originally, Suneido memory mapped the whole database file at once. This is a simple, fast way to access a file. (Note: This doesn't load the whole file into memory. It uses the virtual memory system to load pages of the file on demand.)

Unfortunately, 32 bit cpu's have a limited amount of address space. Windows only allows a maximum of 2 gb and often you have to share that space with other things.

To allow databases bigger than the address space, I changed Suneido to memory map the database file in sections. Once you've filled up the address space, if you want to map another section, you have to unmap one of the previous ones.

This is a lot like how virtual memory works, and you have the same question of which section to unmap. What you really want is to unmap the section that you're not going to need for the longest time into the future. Unfortunately, there's no way to predict the future. Instead, various heuristics are used, a common one is to unmap the least recently used (LRU) section. In other words, the mapped sections form an LRU cache into the database file.

I used an efficient approximation of LRU called the clock algorithm. We've been using this for a long time and it seemed to work reasonably well.

But recently, we started to get occasional errors from loading large databases. Often this kind of thing is caused by memory or disk errors or corrupted data. But the strange thing was that the error was consistently reproducible. We could bring the dump file back to our office and try to load it and get the same error. And the database wasn't corrupt as far as we could tell - later dumps would load fine.

I started to suspect something to do with the memory mapping. One clue was that we only had this problem on large databases.

Eventually, I discovered the problem. The rest of the database doesn't do anything to "lock" sections in memory while it works on them. Instead, it simply relies on LRU not unmapping anything that's been recently used.

But I was using an approximation of LRU. In certain rare cases the clock algorithm can end up unmapping a recently used page. (That's why it's only an approximation of LRU.) Normally that's not a problem. But because I was depending on LRU, when this rare case happened, it resulted in the error we were getting.

To fix it, I simply implemented a "true" LRU algorithm. It'll be a little slower than the clock algorithm, but given how it's used I don't think this will have a noticeable affect. And the code is actually simpler, which is always nice.

The moral of the story is I got burnt by what Joel Spolsky calls a leaky abstraction.

PS. With jSuneido, I'm considering going back to the old system of mapping the whole database file at once. This would limit the size of the database on 32 bit systems, but it's probably not unreasonable, especially in the future, to require a 64 bit system if you want a big database. This would simplify the code and eliminate bugs like the one above.

No comments: