Archive for June, 2007

Continuing my discussion of version control tools, I’ll focus today on the importance of the merge capability of the tool.

The “time to branch” is far less important than the “time to merge”. Why? Because merging is the act of collaboration – it’s when one developer sits down to integrate someone else’s work with their own. We must keep the cost of merging as low as possible if we want to encourage people to collaborate as much as possible. If a merge is awkward, or slow, or results in lots of conflicts, or breaks when people have renamed files and directories, then I’m likely to avoid merging early and merging often. And that just makes it even harder to merge later.

The beauty of distributed version control comes in the form of spontaneous team formation, as people with a common interest in a bug or feature start to work on it, bouncing that work between them by publishing branches and merging from one another. These teams form more easily when the cost of branching and merging is lowered, and taking this to the extreme suggests that it’s very worthwhile investing in the merge experience for developers.

In CVS and SVN, the “time to branch” is low, but merging itself is almost always a painful process. Worse, merging a second time from the same branch is even more painful, so the incentives for developers to merge regularly are exactly the wrong way around. For merge to be a smooth experience, the tools need to keep track of what has been merged before, so that you never end up re-resolving conflicts you have already solved. Bzr and Git both handle this pretty well, remembering which revisions in someone else’s branch you have already integrated into yours, and making sure that you don’t need to bother to do it again.
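
To make that concrete, here is a minimal sketch of what merge tracking looks like in practice, written as a small Python driver around the bzr command line. The branch locations and commit messages are hypothetical, and it assumes bzr is installed:

    # Minimal sketch of merge tracking with bzr (hypothetical paths; assumes
    # bzr is installed). The point: a second merge from the same branch only
    # brings in the revisions that are new since the first merge.
    import subprocess

    def bzr(args, cwd):
        print("$ bzr", " ".join(args), "  # in", cwd)
        subprocess.run(["bzr"] + args, cwd=cwd, check=True)

    bzr(["merge", "../theirs"], cwd="mine")                  # first merge
    bzr(["commit", "-m", "Merge from theirs"], cwd="mine")

    # ... time passes, "theirs" gains a few more revisions ...

    bzr(["merge", "../theirs"], cwd="mine")                  # only the new work
    bzr(["commit", "-m", "Merge from theirs again"], cwd="mine")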

When we encourage people to “do their own thing” with version control, we must also match that independence with tools to facilitate collaboration.

Now, what makes for a great merge experience?

Here are a couple of points:

  1. Speed of the merge: the time it takes to figure out what has changed, and to do a sane job of applying those changes to your working tree. Git is the undisputed champion of merge speed. Anything less than a minute is fine.
  2. Handling of renames, especially renamed directories. If you merge from someone who has modified a file, and you have renamed (and possibly modified) the same file, then you want their change to be applied to the file in your working tree under the name YOU have given it (a rough sketch of this scenario follows the list). It is particularly important, I think, to handle directory renames as a first-class operation, because this gives you complete freedom to reshape the tree without worrying about messing up other people’s merges. Bzr does this perfectly – even if you have subsequently created a file with the same name that the modified file USED to have, it will correctly apply the change to the file you moved to the new name.
  3. Quality of merge algorithm. This is the hardest thing to “benchmark” because it can be hugely subjective. Some merge algorithms take advantage of annotation data, for example, to minimise the number of conflicts generated during a merge. This is a highly subjective thing but in my experience Bzr is fantastic in merge quality, with very few cases of “stupid” conflicts even when branches are being bounced around between ad-hoc squads of developers. I don’t have enough experience of merging with tools like Darcs which have unusual characteristics and potentially higher-quality merges (albeit with lots of opportunity for unexpected outcomes).
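
Here is a rough, hedged sketch of the rename scenario from point 2, again driving bzr from Python. The branch and file names are made up; the outcome to look for is that the change made under the old name lands on the file under its new name:

    # Hypothetical rename-vs-modify merge test (assumes bzr is installed and
    # that branch-a and branch-b are two branches of the same project).
    import subprocess

    def bzr(args, cwd):
        subprocess.run(["bzr"] + args, cwd=cwd, check=True)

    # Branch A renames the file; branch B modifies it under its old name.
    bzr(["mv", "README", "docs/README.txt"], cwd="branch-a")
    bzr(["commit", "-m", "Move README into docs/"], cwd="branch-a")

    with open("branch-b/README", "a") as f:
        f.write("\nA change made under the old name.\n")
    bzr(["commit", "-m", "Tweak README"], cwd="branch-b")

    # Merging B into A should apply the tweak to docs/README.txt, the name
    # branch A chose, rather than resurrecting a stray top-level README.
    bzr(["merge", "../branch-b"], cwd="branch-a")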

I like the fact that the Bazaar developers made merging a first-class operation from the start. Rather than saying “we have a few shell scripts that will help you with that”, they focused on techniques to reduce the time that developers spend fixing up merges. A clean merge that takes 10 seconds longer to do saves me a huge amount of time compared to a dirty (conflict-ridden, or rename-busted) merge that happened a few seconds faster.

Linus is also a very strong advocate of merge quality. For projects which really want as much participation as possible, merge quality is a key part of the developer experience. You want ANYBODY to feel empowered to publish their contribution, and you want ANYBODY to be willing to pull those changes into their branches with confidence that (a) nothing will break and (b) they can revert the merge quickly, with a single command.
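
As a hedged illustration of point (b), here is roughly what a quick review-and-back-out cycle looks like with bzr, assuming the merge has not yet been committed; the branch URL is made up:

    # Pull in a contributor's work, look at it, and discard it all with a
    # single command if it doesn't convince you. Hypothetical URL; assumes bzr.
    import subprocess

    def bzr(*args):
        subprocess.run(["bzr", *args], check=True)

    bzr("merge", "http://example.com/~contributor/project/fix-branch")
    bzr("diff")    # review exactly what the merge would bring in
    bzr("revert")  # one command discards the pending merge from the tree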

No negotiations with Microsoft in progress

Saturday, June 16th, 2007

There’s a rumour circulating that Ubuntu is in discussions with Microsoft aimed at an agreement along the lines of those recently concluded with Linspire, Xandros, Novell etc. Unfortunately, some speculation in the media (thoroughly and elegantly debunked in the blogosphere but not before the damage was done) posited that “Ubuntu might be next”.

For the record, let me state my position, and I think this is also roughly the position of Canonical and the Ubuntu Community Council though I haven’t caucused with the CC on this specifically.

We have declined to discuss any agreement with Microsoft under the threat of unspecified patent infringements.

Allegations of “infringement of unspecified patents” carry no weight whatsoever. We don’t think they have any legal merit, and they are no incentive for us to work with Microsoft on any of the wonderful things we could do together. A promise by Microsoft not to sue for infringement of unspecified patents has no value at all and is not worth paying for. It does not protect users from the real risk of a patent suit from a pure-IP-holder (Microsoft itself is regularly found to violate such patents and regularly settles such suits). People who pay protection money for that promise are likely buying themselves nothing more than a false sense of security.

I welcome Microsoft’s stated commitment to interoperability between Linux and the Windows world – and believe Ubuntu will benefit fully from any investment made in that regard by Microsoft and its new partners, as that code will no doubt be free software and will no doubt be included in Ubuntu.

With regard to open standards on document formats, I have no confidence in Microsoft’s OpenXML specification to deliver a vibrant, competitive and healthy market of multiple implementations. I don’t believe that the specifications are good enough, nor that Microsoft will hold itself to the specification when it does not suit the company to do so. There is currently one implementation of the specification, and as far as I’m aware, Microsoft hasn’t even certified that their own Office12 completely implements OpenXML, or that OpenXML completely defines Office12’s behavior. The Open Document Format (ODF) specification is a much better, much cleaner and widely implemented specification that is already a global standard. I would invite Microsoft to participate in the OASIS Open Document Format working group, and to ensure that the existing import and export filters for Office12 to Open Document Format are improved and available as a standard option. Microsoft is already, I think, a member of OASIS. This would be a far more constructive open standard approach than OpenXML, which is merely a vague codification of current practice by one vendor.

In the past, we have surprised people with announcements of collaboration with companies like Sun that have, at one time or another, been hostile to free software. I do believe that companies change their position as they get new leadership and new management. And we should engage with companies that are committed to the values we hold dear, and disengage if they change their position again. While Sun has yet to fully deliver on its commitments to free software licensing for Java, I believe that commitment is still in place at the top.

I have no objections to working with Microsoft in ways that further the cause of free software, and I don’t rule out any collaboration with them, in the event that they adopt a position of constructive engagement with the free software community. It’s not useful to characterize any company as “intrinsically evil for all time”. But I don’t believe that the intent of the current round of agreements is supportive of free software, and in fact I don’t think it’s particularly in Microsoft’s interests to pursue this agenda either. In time, perhaps, they will come to see things that way too.

My goal is to carry free software forward as far as I can, and then to help others take the baton to carry it further. At Canonical, we believe that we can be successful and also make a huge contribution to that goal. In the Ubuntu community, we believe that the freedom in free software is what’s powerful, not the openness of the code. Our role is not to be the ideologues-in-chief of the movement; our role is to deliver the benefits of that freedom to the widest possible audience. We recognize the value in “good now to get perfect later” (today we require free apps, tomorrow free drivers too, and someday free firmware to be part of the default Ubuntu configuration), but we always act in support of the goals of the free software community as we perceive them. All the deals announced so far strike me as “trinkets in exchange for air kisses”. Mua mua. No thanks.

One of the tough choices VCS designers make is “what do we REALLY care about”. If you can eliminate some use cases, you can make the tool better for the other use cases. So, for example, the Git guys chose not to care too much about annotate. By design, annotate is slow on Git, because by letting go of that they get it to be super-fast in the use cases they care about. And that’s a very reasonable position to take.

My focus today is lossiness, and I’m making the case for starting out a project using tools which are lossless, rather than tools which discard useful information in the name of achieving performance that’s only necessary for the very largest projects.

It’s a bit like saying “shoot your pictures in RAW format, because you can always convert to JPEG and downscale resolution for Flickr, but you can’t always get your top-quality images back from a low-res JPEG”.

When you choose a starting VCS, know that you are not making your final choice of tools. Projects that started with CVS have moved to SVN and then to BitKeeper and then to something else. Converting is often a painful process, sometimes so painful that people opt to throw away history rather than try to convert properly. We’ll see new generations of tools over the next decade, and the capability of machines and the network will change, so of course your optimal choice of tools will change accordingly.

Initially, projects do best if they choose a tool which makes it as easy as possible to migrate to another tool later. Migrating is a little bit like converting from JPEG to PNG, or PNG to GIF. Or PNG to JPEG2000. You really want to be in the situation where your current format has as much of the detail as possible, so that your conversion can be as clean and as comprehensive as possible. Of course, that comes at a price, typically in performance. If you shoot in RAW, you get fewer frames on a memory stick. So you have to ask yourself “will this bite me?”. And it turns out that, for 99% of photographers, you can get SO MANY photos on a 1GB memory stick, even in RAW mode, that the slower performance is worth trading for the higher quality. The only professional photographers I know who shoot in JPEG are the guys who shoot 3,000-4,000 pictures at an event and publish them instantly to the web, with no emphasis on image quality because they are not the sort of pics anyone will blow up as a poster.

What’s the coding equivalent?

Well, you are starting a free software project. You will have somewhere between 50 and 500 files in your project initially, and it will take a while before you have more than 5,000 files. During that time, you need performance to be good enough. And you want to make sure that, if you need to migrate, you have captured your history in as much detail as possible, so that your conversion can be as easy, and as rich and complete, as possible.

I’ve watched people try to convert CVS to SVN, and it’s a nightmare, because CVS never recorded details that SVN needs, such as which per-file changes belong together in a single consistent changeset. It’s all interpolation, guesswork, voodoo and ultimately painful work that often enough results in people capitulating, throwing history away and just doing a fresh start in SVN. What a shame.

The Bazaar guys, I think, thought about this a lot. It’s another reason the perfect rename tracking is so important. You can convert a Bazaar tree to Git trivially, whenever you want to, if you need to scale past 10,000 files up to 100,000 files with blazing performance. In the process, you’ll lose the renaming information. But going the other way is not so simple, because Git never recorded that information in the first place. You need interpolation and an unfortunate goat under a full moon, and even then there’s no guarantee. You chose a lossy tool, you lost the renaming data as you used it, you can’t get that data back.
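
As a hedged sketch of what that one-way, lossy conversion might look like: this assumes git is installed and that a fast-export command is available for Bazaar (the bzr-fastimport plugin provides one), and the paths are made up:

    # Hypothetical one-way conversion of a Bazaar branch into a fresh Git
    # repo. Assumes git plus a "bzr fast-export" command (bzr-fastimport).
    import subprocess

    subprocess.run(["git", "init", "converted"], check=True)

    # Stream the Bazaar history into git fast-import. Rename information has
    # no representation in the stream, so it is silently dropped on the way.
    with subprocess.Popen(["bzr", "fast-export", "../project.bzr"],
                          stdout=subprocess.PIPE, cwd="converted") as export:
        subprocess.run(["git", "fast-import"], stdin=export.stdout,
                       cwd="converted", check=True)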

Now, performance is important, but “good enough performance” is the threshold we should aim for in order to get as much out of other use cases as possible. If my tool is lossless, and still gives me a “status” in less than a heartbeat, which Bazaar does up to about 7,000 files, then I have perfectly adequate performance and perfectly lossless recording. If my project grows to the point where Bazaar’s performance is not good enough, I can convert to any of the other systems and lose ONLY the data that I choose to lose in my selection of new tool. And perhaps, by then, Git will have gained perfect renaming support, so I can get perfect renaming AND blazing performance. But I made the smart choice by starting in RAW mode.

Now, there are projects out there for which the optimisations and tradeoffs made for Git are necessary. If you want to see what those tradeoffs are, watch Linus describe Git here. But the projects which immediately need to make those tradeoffs are quite unusual – they are not multiplatform, they need extraordinary performance from the beginning, and they are willing to lose renaming data and have slow annotate in order to achieve that. X, OpenSolaris, the Linux kernel… those are hardly representative of the typical free software project.

Those projects, though, are also the folks who’ve spoken loudest about version control, because they have the scale and resources to do detailed assessments. But we should recognise that their findings are filtered through the unique lenses of their own constraints, and not let that perspective colour the decision for a project that does not operate under those constraints.

What’s good enough performance? Well, I like to think in terms of “heartbeat time”. If the major operations which I have to do regularly (several times in an hour) take less than a heartbeat, then I don’t ever feel like I’m waiting. Things which happen 3-5 times in a day can take a bit longer, up to a minute, and those fit with regular work breaks that I would take anyhow to clear my head for the next phase of work, or rest my aching fingers.
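
If you want to check where your own tree stands, a quick, informal timing of the commands you run constantly is enough; this sketch assumes bzr, but any VCS command could be substituted:

    # Rough "heartbeat time" check for a routine operation (assumes bzr is
    # installed and this is run inside a working tree).
    import subprocess, time

    start = time.time()
    subprocess.run(["bzr", "status"], check=True, stdout=subprocess.DEVNULL)
    elapsed = time.time() - start

    print(f"bzr status took {elapsed:.2f}s",
          "- within heartbeat time" if elapsed < 1.0 else "- starting to drag")
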
In summary – I think new and smaller (<10,000 files) projects should care more about correctness, completeness and experience in their choice of VCS tools. Performance is important, but it is perfectly adequate if it takes less than a heartbeat to do the things you do regularly while working on your code. Until you really have to lose them, don’t discard the ability to work across multiple platforms (lots of free software projects have more users on Windows than on Linux), don’t discard perfect renames, and don’t opt for “lossy over lossless” just because another project, which might be awesomely cool but has totally different requirements from yours, did so.

Further thoughts on version control

Monday, June 11th, 2007

I’ve had quite a lot of positive email feedback on my posting on renaming as the killer app of distributed version control. So I thought it would be interesting to delve into this subject in more detail. I’ll blog over the next couple of months, starting tomorrow, about the things I think we need from this set of tools – whether they be Git, Darcs, Mercurial, Monotone or Bazaar.

First, to clear something up, Ubuntu selected Bazaar based on our assessment of what’s needed to build a great VCS for the free software community. Because of our work with Ubuntu, we know that what is important is the full spectrum of projects, not just the kernel, or X, or OpenOffice. It’s big and small projects, Linux and Windows projects, C and Python projects, Perl and Scheme projects… the best tools for us are the ones that work well across a broad range of projects, even if those are not the ones that are optimal for a particular project (in the way that Git works brilliantly for the kernel, because its optimisations suit that use case well: it’s a single-platform, single-workflow, super-optimised approach).

I’ve reviewed our choice of Bazaar in Ubuntu a couple of times, when projects like OpenSolaris and X made other choices, and in each case been satisfied that it’s still the best choice for our needs. But we’re not tied to it, and we could move to a different one. Canonical has no commercial interest in Bazaar (it’s ALL GPL software) and no cunning secret plans to launch a proprietary VCS based on it. We integrated Bazaar into Launchpad because Bazaar was our preferred VCS, but Bazaar could just as well be integrated into SourceForge and Collab since it’s free code.

So, what I’m articulating here is a set of values and principles – the things we find important and the rationale for our decisions – rather than a ra-ra for a particular tool. Bazaar itself doesn’t meet all of my requirements, but right now it’s the closest tool for the full spectrum of work we do.

Tomorrow, I’ll start with some commentary on why “lossless” tools are a better starting point than lossy tools, for projects that have that luxury.

The number one thing I want from a distributed version control system is robust renaming. Why is that? Because without a rigorous approach to renaming that guarantees perfect results, I’m nervous to merge from someone I don’t know. And merging from “people you don’t know” is the real thing that distributed version control gives you which you cannot get from centralized systems like CVS and Subversion.

Distributed version control is all about empowering your community, and the people who might join your community. You want newcomers to get stuck in and make the changes they think make sense. It’s the difference between having blessed editors for an encyclopedia (in the source code sense we call them “committers”) and the wiki approach, which welcomes new contributors who might just have a very small fix or suggestion. And perhaps more importantly, who might be willing to spend time on cleaning up and reshaping the layout of your wiki so that it’s more accessible and understandable for other hackers.

The key is to lower the barrier to entry. You don’t want to have to dump a whole lot of rules on new contributors like “never rename directories a, b and c because you will break other people and we will be upset”. You want those new contributors to have complete freedom, and then you want to be able to merge, review changes, and commit if you like them. If merging from someone might drop you into a nightmare of renaming fixups, you will be resistant to it, and your community will not be as widely empowered.

So, try this in your favorite distributed VCS (a scripted sketch of these steps follows the list):

  1. Make two branches of your favorite upstream. In Bzr, you can find some projects to branch in the project cloud.
  2. In one branch, pretend to be a new contributor, cleaning up the build system. Rearrange some directories to make better sense (and almost every large free software project can benefit from this, there’s a LOT of cruft that’s crept in over the years… the bigger the project, the bigger the need).
  3. Now, in the second branch, merge from the branch where you did that renaming. Some systems will fail, but most will actually handle this easy case cleanly.
  4. Go back to the first branch. Add a bunch of useful files to the repo in the directories you renamed. Or make a third branch, and add the files to the directories there.
  5. Now, merge in from that branch.
  6. Keep playing with this. Sooner or later, if you are not using a system like Bzr which treats renames as a first-class operation… Oops.
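
To save some typing, here is one hedged way to script the experiment with bzr. Everything concrete in it (the upstream URL, the directory and file names) is hypothetical, so substitute a real project of your own:

    # Scripted sketch of the rename-and-merge experiment above, driving bzr
    # from Python. Hypothetical upstream URL, directory and file names.
    import subprocess

    def bzr(args, cwd="."):
        """Run a bzr command in a directory, stopping if anything fails."""
        print("$ bzr", " ".join(args), "  # in", cwd)
        subprocess.run(["bzr"] + args, cwd=cwd, check=True)

    # 1. Make two branches of the upstream project.
    bzr(["branch", "http://example.com/upstream/trunk", "branch-one"])
    bzr(["branch", "http://example.com/upstream/trunk", "branch-two"])

    # 2. In one branch, play the new contributor tidying up the build system.
    bzr(["mv", "buildscripts", "build"], cwd="branch-one")
    bzr(["commit", "-m", "Rearrange build system directories"], cwd="branch-one")

    # 3. Merge the rearrangement into the second branch (the easy case).
    bzr(["merge", "../branch-one"], cwd="branch-two")
    bzr(["commit", "-m", "Merge directory rearrangement"], cwd="branch-two")

    # 4. Back in the first branch, add useful files under the renamed directory.
    with open("branch-one/build/check-deps.sh", "w") as f:
        f.write("#!/bin/sh\n")
    bzr(["add", "build/check-deps.sh"], cwd="branch-one")
    bzr(["commit", "-m", "Add a dependency checker"], cwd="branch-one")

    # 5. Merge from that branch again.
    bzr(["merge", "../branch-one"], cwd="branch-two")
    bzr(["commit", "-m", "Merge new build files"], cwd="branch-two")

    # 6. Keep layering renames, modifications and additions on both sides and
    #    watch whether the merges stay clean.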

Now, this is not a contrived example, it’s actually a perfect study of what we HOPE will happen as distributed version control is more widely adopted. If I look at the biggest free software projects, the thing they all have in common is crufty tree structures (directory layouts) and build systems. This is partly a result of never having had tools which really supported renaming, in a way which Would Not Break. And this is one of the major reasons why it takes 8 hours to build something like OpenOffice, and why so few people have the stomach to step up and contribute to a project like that.

The exact details of what it takes to break the renaming support of many DVCSs vary from implementation to implementation. But by far the most robust of them is Bzr at the moment, which is why we make such heavy use of it at Ubuntu. Many of the other systems have just waved past the renaming problem, saying it’s “not essential” and that heuristics and guesstimates are sufficient. I disagree. And I think the more projects really start to play with these tools, the more they will appreciate that renaming is the critical feature that needs to Just Work. I’ll gladly accept the extra 0.3 seconds it takes Bzr to give me a tree status in my 5,100-file project, for the security of knowing I never ever have to spend long periods of time sorting out a merge by hand when stuff got renamed. It still comes back in less than a second. Which is plenty fast enough for me. Even though I know it will get faster, that extra performance is not nearly as important to me as the overall time saved by the robustness of the tool in the face of a constant barrage of improvements by new contributors.

Fantastic science

Friday, June 1st, 2007

This sort of discovery, I guess, is why I wanted to be a physicist. Alas, after two days at the amazing CERN when I was 18 I was pretty sure I wasn’t clever enough to do that, and so pursued other interests into IT and business and space. But I still get a thrill out of living vicariously in a life that involves some sort of particle accelerator. What an incredible rush for the scientists involved.

Also very glad to have exited the “prime number products are hard to factor so it helps if you generated the beasties in the first place” business.