Friday, June 27, 2008

Lightweight Database Replication for Suneido

Issues

Occasionally we get runaway memory usage in Suneido. This only seems to happen with lots of users, and seems to be related to running large reports. I'm assuming it's due to the inherent problems with conservative garbage collection. Unfortunately, the fix is to restart the Suneido server program, but this is ugly, especially with a lot of active users.

Suneido is single-threaded so we don't take advantage of multiple cores.

Suneido does not have an ODBC interface. One of the reasons for this is that ODBC is quite closely tied to SQL and Suneido isn't SQL. A common reason for wanting this is to use third party report generation or data analysis tools. Or sometimes to interface with other systems.

We generally set up our clients to do on-line backups twice a day. Although this does not stop them from working, it does tend to slow down the system, especially with bigger databases. But it would be nice to do backups more often, since, in the event of a hardware failure, no one wants to lose even a few hours work. (And if the hard drive fails totally, they usually have to fall back on the nightly backup we do to Amazon S3.)

It would be nice in some circumstances to run Suneido on Amazon EC2. But EC2 doesn't have persistent storage (although they're working on it). You can load your database from S3 when you start the server and save it back to S3 when you stop the server. But if you crash, you lose it.

Replication

At RailsConf several of the sessions talked about using MySQL master-slave replication, for example to run on EC2. This got me thinking about replication for Suneido. Replication would, at least partially, address all the issues above.

- it would maintain an up to date backup with (hopefully) minimal impact on performance

- we could replicate to an SQL database, allowing ODBC access etc.

- we could run reports off the replica/slave, making use of multi-core, and reducing the memory issue because you could more easily restart the slave

- we could run on EC2 with the replica/slave to protect the data in case of crashes

Initially, I thought we could get some of these benefits just running a secondary server using the backups we already make twice a day. A lot of big reports don't need up to the minute data. But the problem would be determining when to use the live database and when to use the slightly stale copy. You could let the user pick but that would be ugly. And even they might not know that someone had just updated old data.

You could actually do replication at the application level by creating triggers for every database table and sending the changes to the slave. But it would be error prone and hard to handle issues like transaction rollbacks. It would also be running in the main thread so it could have a performance impact.

My next thought was to take advantage of the way Suneido's database is implemented - when a transaction is committed successfully a "commit" record is added to the end of the database file. So you can simply monitor the data added to the end of the file and watch for new commits.

One concern, that I haven't completely addressed, is whether different "views" of the file will be "coherent". MSDN is a little unclear on this. MapViewOfFile says:
Multiple views of a file ... are coherent ... if the file views are derived from the same file mapping object.
But CreateFileMapping says:
file views derived from any file mapping object that is backed by the same file are coherent
If I had a single file mapping object it might not be too hard to share it with the replication task. But because I map large files in chunks, I have a whole bunch of file mapping objects which makes it harder to share them.

I thought I might be able to use named file mapping objects to make sharing easier but I couldn't get it to work. I'm not sure what I was doing wrong, I didn't sink a lot of time into it.

Initially I thought I'd just create a separate thread within the server process to send the updates to the replica. Simpler than running a separate process and perhaps more likely to be coherent. As long as the replication thread did not access any of the server's variables, then there shouldn't be any concurrency issues. I went a little ways down this road until I realized that my memory management was single threaded. I could try to avoid any allocation in the replication thread but it's too hard to know what might allocate, making this approach error-prone and brittle.

So my current thinking is to make the replication a separate process. For simplicity, I will probably just make this another "mode" of running the same suneido.exe (Rather than creating a separate executable.) Suneido already works this way for standalone, client, or server - they're just different modes of running the same exe. (The exe is only just over 1 mb so it's not like having multiple modes is bloating it!)

Of course, I still have to handle the coherency issue.

What about the slave/replica? I should be able to run a standard Suneido server. The replication process can connect to it as a normal client (over TCP/IP sockets as usual). The basic functioning shouldn't require any changes. I may want to add a slave mode to tweak it to give the replication connection priority and to not allow updates from other connections.

The other piece of the puzzle is allowing a Suneido client to access multiple databases. Currently a client can only be connected to a single server. This will require some changes but they should be fairly minimal.

Replicating to an SQL database will be a little more work and it's less of a priority. (Until we get a client that demands it!)

I may be overly optimistic, but I think I can set this up with minimal effort. We'll see!

No comments: