Sunday, March 15, 2009

"Getting" Git

The Git version control system, originally developed by Linus Torvalds for the Linux kernel, has been getting pretty popular lately, especially in the Ruby on Rails crowd.

There's a preliminary Eclipse plugin and even a TortoiseGit project. SourceForge already offers it, whereas it took them forever to get Subversion.

At RailsConf last year I went to a session on Git Internals by Scott Chacon and was quite intrigued. You can see the slides with voice over.

Suneido has its own version control system that has served us well, but it's pretty basic. Git's distributed style and easy branching would be nice to have.

At OSCON 2006 I went to a tutorial on Subversion APIs with the idea that maybe we could replace Suneido's version control with an interface to Subversion. I was disappointed to find that the APIs are all file-oriented. I even talked to the presenter after the session to see if there was any way around this, but he couldn't suggest anything. As far as I can tell, Git is similarly file-based. Probably a big reason for this is that these types of systems usually evolve from shell scripts that, understandably, work with files.

My interest in Git was raised by reading Pragmatic Version Control Using Git. On top of that, we've recently had some problems with Suneido's version control. Nothing major, but still an impetus to think about something better. I started to think about what it would take to implement Git-like version control in Suneido. Having enjoyed his talk, I also bought Scott Chacon's PeepCode PDF book on Git Internals, which expands on the same material.

The basic ideas behind Git are pretty straightforward, although quite different from most other version control systems. The way it works is attractively elegant and simple. Yet it supports powerful uses.
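To make the core idea concrete, here is a minimal sketch in Python (not Suneido code, and not the prototype described later in this post). Git names each piece of content by the SHA-1 hash of a short header plus the content itself, so identical content always gets the same id and is stored only once. The `ObjectStore` class and its in-memory dictionary are hypothetical names used for illustration; real Git also compresses objects and writes them to disk.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 id Git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Hypothetical in-memory content-addressed store (not Git's disk format)."""

    def __init__(self):
        self._objects = {}  # id -> content

    def put(self, data: bytes) -> str:
        oid = git_blob_id(data)
        self._objects[oid] = data  # identical content lands on the same key
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


store = ObjectStore()
first = store.put(b"function() { return 1 }")
second = store.put(b"function() { return 1 }")  # same content, same id
```

Because the id depends only on the content, two records with the same text share one stored object no matter what they are called.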

One area where Suneido's version control has problems (as do other version control systems) is with moving and renaming. Git supposedly handles this, but I couldn't figure out how. There didn't seem to be anything in the data structures to track it. Strangely, neither the Pragmatic book nor the Git Internals PDF talked much about this.

But it didn't take long on the internet to find the answer. The Wikipedia article was surprisingly useful. It turned out to be an issue that had seen some discussion. For example, this mailing list message and this one.

I was right: the data structures don't record any information about moves or renames. Instead, that information is heuristically determined after the fact, by looking at file contents.
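A rough sketch of what such an after-the-fact heuristic could look like, in Python. This is not Git's actual algorithm: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the function name `detect_renames` and the 0.5 threshold are assumptions chosen for illustration. It pairs each name that disappeared with the most similar name that appeared.

```python
from difflib import SequenceMatcher


def detect_renames(old, new, threshold=0.5):
    """Guess renames between two snapshots.

    old and new map a name to its text content.
    Returns a list of (old_name, new_name, similarity) tuples.
    """
    deleted = [name for name in old if name not in new]
    added = [name for name in new if name not in old]
    renames = []
    for old_name in deleted:
        best_name, best_score = None, threshold
        for new_name in added:
            score = SequenceMatcher(None, old[old_name], new[new_name]).ratio()
            if score >= best_score:
                best_name, best_score = new_name, score
        if best_name is not None:
            renames.append((old_name, best_name, best_score))
            added.remove(best_name)  # each added name is matched at most once
    return renames


before = {"Foo": "a\nb\nc\n", "Keep": "x"}
after = {"Bar": "a\nb\nc\n", "Keep": "x"}
renames = detect_renames(before, after)
```

Here `Foo` vanished and `Bar` appeared with identical content, so the pair is reported as a rename even though nothing in either snapshot says so explicitly.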

One of the common uses of Suneido's version control is to check the history of a record. (Although we haven't gone as far as to implement a "blame" utility yet.) But the way Git works, this is not a simple task. You have to work back through each version of the source tree looking for changes to the record. But no doubt there are ways to speed this up, maybe by caching which records were involved in each commit.
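The walk back through history might look something like this minimal Python sketch. It assumes a simple linear history (no merges) and a hypothetical `Commit` class whose tree is a plain dictionary from record name to content hash; neither is Git's real data structure. A commit counts as touching a record when the record's hash differs from the one in its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    message: str
    tree: dict  # record name -> content hash
    parent: Optional["Commit"] = None


def record_history(head, name):
    """Return messages of commits that changed `name`, newest first."""
    changes = []
    commit = head
    while commit is not None:
        parent_tree = commit.parent.tree if commit.parent else {}
        # A differing (or newly present, or newly absent) hash means a change.
        if commit.tree.get(name) != parent_tree.get(name):
            changes.append(commit.message)
        commit = commit.parent
    return changes


c1 = Commit("add Foo", {"Foo": "h1"})
c2 = Commit("add Bar", {"Foo": "h1", "Bar": "h2"}, c1)
c3 = Commit("edit Foo", {"Foo": "h3", "Bar": "h2"}, c2)
history = record_history(c3, "Foo")
```

Every commit has to be visited, which is why a per-commit cache of changed records would help.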

I worked up enough interest and understanding to take a stab at implementing something this weekend. I have a prototype (a few hundred lines of code) that can save a source tree in a repository and retrieve it again.

Performance is a bit of an issue. Git works by comparing source trees. We have about 15,000 records (think source files) in our main application. Building a Git-style tree of the working copy takes some time. But it turns out most of the time is spent not reading the records but calculating the hashes. It shouldn't be too hard to make a cache to speed that up. (You could store the hashes in the libraries, but that would be dangerous since something might update content without updating the hash.)
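One possible shape for such a cache, sketched in Python. It keeps hashes outside the libraries, keyed by record name plus a cheap change stamp, and only rehashes when the stamp differs. The `HashCache` class and the idea that each record carries a usable stamp (such as a last-modified time) are assumptions for illustration, not something Suneido is known to provide in this form.

```python
import hashlib


class HashCache:
    """Hypothetical cache of content hashes keyed by record name and stamp."""

    def __init__(self):
        self._entries = {}  # name -> (stamp, hash)
        self.misses = 0     # how many times a hash was actually computed

    def hash_for(self, name, stamp, read_content):
        """Return the hash for a record, computing it only if the stamp changed.

        read_content is a zero-argument callable returning the record's bytes.
        """
        entry = self._entries.get(name)
        if entry is not None and entry[0] == stamp:
            return entry[1]
        self.misses += 1
        data = read_content()
        # Same "blob <size>\0" + content hashing that Git uses for blobs.
        header = b"blob " + str(len(data)).encode("ascii") + b"\0"
        digest = hashlib.sha1(header + data).hexdigest()
        self._entries[name] = (stamp, digest)
        return digest


cache = HashCache()
h1 = cache.hash_for("Foo", 1, lambda: b"abc")
h2 = cache.hash_for("Foo", 1, lambda: b"abc")   # served from the cache
h3 = cache.hash_for("Foo", 2, lambda: b"abcd")  # stamp changed, rehashed
```

This still depends on the stamp changing whenever the content does, but since the cache lives outside the libraries it can be thrown away and rebuilt if it is ever in doubt.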

There's obviously still a ways to go before it would be able to replace Suneido's existing version control, but it's enough to convince me that it would be feasible. Just what I needed, another project :-)
