Choose lossless VCS tools if you have that luxury
Tuesday, June 12th, 2007One of the tough choices VCS designers make is “what do we REALLY care about”. If you can eliminate some use cases, you can make the tool better for the other use cases. So, for example, the Git guys choose not to care too much about annotate. By design, annotate is slow on Git, because by letting go of that they get it to be super-fast in the use cases they care about. And that’s a very reasonable position to take.
My focus today is lossiness, and I’m making the case for starting out a project using tools which are lossless, rather than tools which discard useful information in the name of achieving performance that’s only necessary for the very largest projects.
It’s a bit like saying “shoot your pictures in RAW format, because you can always convert to JPEG and downscale resolution for Flickr, but you can’t always get your top-quality images back from a low-res JPEG”.
When you choose a starting VCS, know that you are not making your final choice of tools. Projects who started with CVS have moved to SVN and then to Bitkeeper and then to something else. Converting is often a painful process, sometimes so painful that people opt to throw away history rather than try and convert properly. We’ll see new generations of tools over the next decade, and the capability of machines and the network will change, so of course your optimal choice of tools will change accordingly.
Initially, projects do best if they choose a tool which makes it as easy to migrate to another tool, as possible. Migrating is a little bit like converting from JPEG to PNG, or PNG to GIF. Or PNG to JPEG2000. You really want to be in the situation where your current format has as much of the detail as possible, so that your conversion can be as clean and as comprehensive as possible. Of course, that comes at a price, typically in performance. If you shoot in RAW, you get fewer frames on a memory stick. So you have to ask yourself “will this bite me?”. And it turns out, that for 99% of photographers, you can get SO MANY photos on a 1GB memory stick, even in RAW mode, that the slower performance is worth trading for the higher quality. The only professional photographers I know who shoot in JPEG are the guys who shoot 3-4000 pictures in an event, and publish them instantly to the web, with no emphasis on image quality because they are not to sort of pics anyone will blow up as a poster.
What’s the coding equivalent?
Well, you are starting a free software project. You will have somewhere between 50 and 500 files in your project initially, it will take a while before you have more than 5,000 files. During that time, you need performance to be good enough. And you want to make sure that, if you need to migrate, you have captured as much of your history in detail so that your conversion can be as easy, and as rich and complete, as possible.
I’ve watched people try to convert CVS to SVN, and it’s a nightmare, because CVS never recorded details that SVN needs, such as which file-specific changes are a consistent set. It’s all interpolation, guesswork, voodoo and ultimately painful work that results often enough in people capitulating, throwing history away and just doing a fresh start in SVN. What a shame.
The Bazaar guys, I think, thought about this a lot. It’s another reason the perfect rename tracking is so important. You can convert a Bazaar tree to Git trivially, whenever you want to, if you need to scale past 10,000 files up to 100,000 files with blazing performance. In the process, you’ll lose the renaming information. But going the other way is not so simple, because Git never recorded that information in the first place. You need interpolation and an unfortunate goat under a full moon, and even then there’s no guarantee. You chose a lossy tool, you lost the renaming data as you used it, you can’t get that data back.
Now, performance is important, but “good enough performance” is the threshold we should aim for in order to get as much out of other use cases as possible. If my tool is lossless, and still gives me a “status” in less than a heartbeat, which Bazaar does up to about 7,000 files, then I have perfectly adequate performance and perfectly lossless recording. If my project grows to the point where Bazaar’s performance is not good enough, I can convert to any of the other systems and lose ONLY the data that I choose to lose in my selection of new tool. And perhaps, by then, Git has gained perfect renaming support, so I can get perfect renaming AND blazing performance. But I made the smart choice by starting in RAW mode.
Now, there are projects out there for which the optimisations and tradeoffs made for Git are necessary. If you want to see what those tradeoffs are, watch Linus describe Git here. But the projects which immediately need to make those tradeoffs are quite unusual – they are not multiplatform, they need extraordinary performance from the beginning, and they are willing to lose renaming data and have slow annotate in order to achieve that. X, OpenSolaris, the Linux kernel… those are hardly representative of the typical free software project.
Those projects, though are also the folks who’ve spoken loudest about version control, because they have the scale and resources to do detailed assessments. But we should recognise that their findings are filtered through the unique lenses of their own constraints, and don’t let that perspective colour the decision for a project that does not operate under those constraints.
What’s good enough performance? Well, I like to think in terms of “heartbeat time”. If the major operations which I have to do regularly (several times in an hour) take less than a heartbeat, then I don’t ever feel like I’m waiting. Things which happen 3-5 times in a day can take a bit longer, up to a minute, and those fit with regular workbreaks that I would take anyhow to clear my head for the next phase of work, or rest my aching fingers.
In summary – I think new and smaller (<10,000 files) projects should care more about correctness, completeness and experience in their choice of VCS tools. Performance is important, but perfectly adequate if it takes less than a heartbeat to do the things you do regularly while working on your code. Until you really have to lose them, don’t discard the ability to work across multiple platforms (lots of free software projects have more users on Windows than on Linux), don’t discard perfect renames, don’t opt for “lossy over lossless” just because another project which might be awesomely cool but has totally different requirements from yours, did so.