Continuing my discussion of version control tools, today I’ll focus on the importance of a tool’s merge capability.

The “time to branch” is far less important than the “time to merge”. Why? Because merging is the act of collaboration – it’s when one developer sits down to integrate someone else’s work with their own. We must keep the cost of merging as low as possible if we want to encourage people to collaborate as much as possible. If a merge is awkward, or slow, or results in lots of conflicts, or breaks when people have renamed files and directories, then I’m likely to avoid merging early and merging often. And that just makes it even harder to merge later.

The beauty of distributed version control comes in the form of spontaneous team formation, as people with a common interest in a bug or feature start to work on it, bouncing that work between them by publishing branches and merging from one another. These teams form more easily when the cost of branching and merging is lowered, and taking this to the extreme suggests that it’s very worthwhile investing in the merge experience for developers.

In CVS and SVN, the “time to branch” is low, but merging itself is almost always a painful process. Merging a second time from another branch is even more painful, so the incentives for developers to merge regularly are exactly the wrong way around. For merge to be a smooth experience, the tools need to keep track of what has been merged before, so that you never end up redoing work that you’ve already done. Bzr and Git both handle this pretty well, remembering which revisions in someone else’s branch you have already integrated into yours, and making sure that you don’t need to bother to do it again.
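
To make the idea of remembering what has been merged a little more concrete, here is a minimal Python sketch. It assumes a toy revision graph stored as a dict that maps each revision id to its parents; the function names `ancestors` and `unmerged` and all the revision ids are made up for illustration, and this is not how Bzr or Git actually store history. The point is only that once a merge is recorded as a revision with two parents, “what do I still need to integrate?” becomes a simple set difference.

```python
def ancestors(graph, tip):
    """Return every revision reachable from tip, including tip itself."""
    seen = set()
    stack = [tip]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(graph.get(rev, ()))
    return seen


def unmerged(graph, mine, theirs):
    """Revisions in their history that are not yet part of mine."""
    return ancestors(graph, theirs) - ancestors(graph, mine)


# Toy history (hypothetical revision ids): revision -> list of parents.
graph = {
    "base": [],
    "mine-1": ["base"],
    "theirs-1": ["base"],
    "theirs-2": ["theirs-1"],
    # A merge revision records BOTH parents, which is what lets the
    # tool remember that theirs-1 has already been integrated.
    "mine-2": ["mine-1", "theirs-1"],
}

# Before any merge, both of their revisions are new to me.
first_merge = unmerged(graph, "mine-1", "theirs-2")
# After the recorded merge, only the newer revision is left to integrate.
second_merge = unmerged(graph, "mine-2", "theirs-2")
```

A tool that does not record the second parent has no way to tell the two situations apart, which is why a repeat merge ends up redoing the first one.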

When we encourage people to “do their own thing” with version control, we must also match that independence with tools to facilitate collaboration.

Now, what makes for a great merge experience?

Here are a few points:

  1. Speed of the merge: the time it takes to figure out what’s changed and do a sane job of applying those changes to your working tree. Git is the undisputed champion of merge speed. Anything less than a minute is fine.
  2. Handling of renames, especially renamed directories. If you merge from someone who has modified a file, and you have renamed (and possibly modified) the same file, then you want their change to be applied to the file in your working tree under the name YOU have given it. It is particularly important, I think, to handle directory renames as a first class operation, because this gives you complete freedom to reshape the tree without worrying about messing up other people’s merges. Bzr does this perfectly – even if you have subsequently created a file with the same name that the modified file USED to have, it will correctly apply the change to the file you moved to the new name.
  3. Quality of merge algorithm. This is the hardest thing to “benchmark” because it can be hugely subjective. Some merge algorithms take advantage of annotation data, for example, to minimise the number of conflicts generated during a merge. In my experience Bzr is fantastic in merge quality, with very few cases of “stupid” conflicts even when branches are being bounced around between ad-hoc squads of developers. I don’t have enough experience of merging with tools like Darcs, which have unusual characteristics and potentially higher-quality merges (albeit with lots of opportunity for unexpected outcomes), to compare them.
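
The rename case in point 2 is easier to see with a small sketch. This one assumes that every file carries a stable identifier that is independent of its path, and that a change from the other branch is addressed by that identifier rather than by name. The tree layout, the function `apply_changes` and the ids are all hypothetical, and the “merge” here simply replaces content instead of doing a real three-way text merge, so treat it as an illustration of the idea rather than as Bzr’s implementation.

```python
def apply_changes(my_tree, their_changes):
    """Apply content changes keyed by file id, keeping MY paths.

    my_tree:       {file_id: (path, content)}
    their_changes: {file_id: new_content}
    """
    merged = dict(my_tree)
    for file_id, new_content in their_changes.items():
        if file_id in merged:
            path, _old_content = merged[file_id]
            # The path is whatever I renamed the file to; only content changes.
            merged[file_id] = (path, new_content)
    return merged


my_tree = {
    # I renamed the original README to docs/guide.txt ...
    "id-1": ("docs/guide.txt", "old text"),
    # ... and later created a brand new file under the old name.
    "id-2": ("README", "brand new readme"),
}

# They edited the file they still know as README, which is id-1.
their_changes = {"id-1": "old text, plus their fix"}

merged = apply_changes(my_tree, their_changes)
by_path = {path: content for path, content in merged.values()}
```

Because the change follows the identifier, it lands in `docs/guide.txt` and the new `README` is left alone. A tool that matches purely on path would have applied it to the wrong file.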

I like the fact that the Bazaar developers made merging a first-class operation from the start. Rather than saying “we have a few shell scripts that will help you with that”, they focused on techniques to reduce the time that developers spend fixing up merges. A clean merge that takes 10 seconds longer saves me a huge amount of time compared to a dirty (conflict-ridden, or rename-busted) merge that finished a few seconds faster.

Linus is also a very strong advocate of merge quality. For projects which really want as much participation as possible, merge quality is a key part of the developer experience. You want ANYBODY to feel empowered to publish their contribution, and you want ANYBODY to be willing to pull those changes into their branches with confidence that (a) nothing will break and (b) they can revert the merge quickly, with a single command.

23 comments:

  1. zimbatm says: (permalink)
    June 19th, 2007 at 12:33 pm

    Why not go further? Source code is only one part of a project. You have documentation, the website, the bug tracker… would it be acceptable for those things to also be distributed? Could we simply `bzr pull` a whole project and then `bzr serve` it on another host? With bug tracking, wiki et al.?

    Apropos Darcs, it’s exactly what you thought. I used it for some time. It was nice until I started to collaborate. The day it took me more than 2 hours to merge, I stopped using it. So much for my Darcs experience.

  2. Zeno Davatz says: (permalink)
    June 19th, 2007 at 12:47 pm

    I’m curious about what your experience will be like with Bazaar. I’m resting with GIT, firmly and surely. Keep us posted.

  3. Djordy Seelmann says: (permalink)
    June 19th, 2007 at 1:14 pm

    Merging is indeed one of the most powerful operations when it concerns open-source collaboration. But let’s consider another scenario in which RCSs are used more often: commercial software engineering firms, which are often characterized by a hierarchical, functional organizational form. Content is created as employees perform their assigned tasks. The project is led by a project manager who probably also uses project management software to assign tasks to employees and track their progress. Merging is not a first-class operation here through the initial development cycle, but starts to become more important as the product is improved and features are added.

    You could say that at first, the foundation of the software is created; a foundation which everyone will need. A centralized revision control system would be the best option here I think, as everyone is contributing to the same foundation and needs everyone else’s input as soon as possible. In large development teams, you do not want to be verbally communicating with other employees from whom you need to pull to get latest, which would be the case in a distributed system. Most likely the project manager is pulling from everyone, and others can pull from him to get latest. Kinda resembles a centralized system, and opting for a real one will make the process more efficient anyway. In continued development, and after a commercial release, milestones are set and specific tasks are designed; the innovativeness of the firm will highly depend on the goals and innovativeness of the management which assigns tasks and sets milestones. Mostly focused on the main features of the product, innovative features are left out or only partly integrated due to time and/or cost limitations. In this structure, no man is an island, and in my view a distributed revision control system clashes with the organizational structure of the firm. Mark, how do you think that a distributed RCS would integrate into large development teams operating in a commercial software engineering environment?

    Open-source projects like Ubuntu are often characterized by a matrix-project form. Small project groups are created and each one adds features on the fly. This greatly improves innovativeness, but could reduce efficiency due to communication problems. It is good to see that tools like LaunchPad take on the role of solving this problem, so that everyone is aware of who is working on what, what is already being constructed, what is being fixed, and which direction is taken. As we’ve seen in the last decade, the hierarchy presented in centralized revision systems has a negative influence on the concept of open-source collaboration.

    Now back to merging; I feel that development in a commercial software engineering environment is way more straightforward than it is in the open-source community. It is more straightforward since 1) it hardly happens that two individuals have been making changes to a file simultaneously, and 2) branching is not encouraged. Combining this with my opinion on opting for a centralized RCS in a commercial environment, I feel that merging has a lower priority, and is therefore slower in a centralized RCS than in the distributed kind, as the latter requires merging to be a first-class operation. Now I could be wrong here, and if so, please correct me.

  4. Joao says: (permalink)
    June 19th, 2007 at 2:04 pm

    I think the development of Bazaar itself, using itself to do the job of course, is proof enough that Bazaar scales well as it would in a corporate environment, as the development of Bazaar has some structure to it which should mimic in part the development of commercial projects.

    What surprises me about Bazaar, though, is how many different branches each developer keeps around, so it’s not like they shy away from branching and merging. :-)

    BTW, one of the killer features of Bazaar is the plugin support for Subversion, Bzr-Svn, which can be used for getting some work done with Bazaar accessing a Subversion repository. I still haven’t used it enough, though, as I’ve just converted my Mercurial repositories to Bazaar again and haven’t had much need to play around with Bzr-Svn yet, but as folks keep using Subversion no matter what, and as Bazaar is still on its way to replace Subversion in the future, I think Bzr-Svn is a nice-to-have that can come in handy.

    Eventually, Bazaar might become the preferred distributed client for working with centralized servers which use other RCS, as long as the APIs are well understood I think Bazaar should be able to adapt to them. :-)

    Of course, the more man-power Bazaar gets, the farther it will advance.

  5. Josh says: (permalink)
    June 19th, 2007 at 2:04 pm

    RCS-based version control is dead. Does anyone still use a typewriter here? We work with streams as first-class objects using Accurev, no file-based branch and label garbage here. I can’t imagine working for another employer using VSS, CVS, SVN, or PVCS again, I have too few hairs on my head left as it is.

  6. zimbatm says: (permalink)
    June 19th, 2007 at 2:10 pm

    @Djordy Seelmann: You would probably be better off using a small team to build the foundations of your software anyway, so DRCS is not a problem at that stage.

    Once the project is big enough to split into different teams, each one will probably get a manager. Since the manager is responsible for code quality, he/she would probably prefer to pull from his/her developers instead of looking up the changes in the main RCS. This also has the advantage that he/she can control the release of his/her “component” while having fine-grained commits. How many times did you pull the trunk to find out that another team had broken something? You then have to go over to talk to them and lose time to get a patch.

    Cheap branching and merging also has its advantages. You’ll find out that the missing features you’ve been talking about can be kept in a branch without the fear of losing them. I remember the times when I had to scrap some code because of deadlines. Now I can keep it and introduce it in the next release. For me, what stifles innovation is when you have to ask your project manager if you can make a new branch. It really doesn’t motivate you to introduce new ideas, and for me code speaks much more than a PowerPoint(tm).

    Once you have version one oh out, you create a new branch for that release. Since branching is cheap, the support team can easily create one for each reported bug and work on them simultaneously. The fixed bug can also easily be merged in the older version if you want.

    Naturally, it’s not all nice. I guess that with more freedom also comes more responsibility. If you’re paranoid, you won’t like the fact that an employee can take the source and use it in another environment, like the competition. But this is a legal matter that doesn’t arise with GPL code :-p

  7. Phil Hagelberg says: (permalink)
    June 19th, 2007 at 4:34 pm

    It’s interesting to read these debates coming from the perspective of a dynamic language developer. When I want to “branch” to make my own features in Emacs or Conkeror [1] I just do my work in a brand new file that redefines the functions I want to change or adds advice [2] or hooks to it. There’s no need for me to go in and modify the original files, I just monkeypatch [3] core functionality from within my own dotfiles. Sharing my changes is as easy as creating an Emacswiki [4] page with the new function on it. Of course, if you want those things reintegrated into the trunk so that everyone can have access to them it’s a bit more work, but the dynamic features of the language make it so that’s not necessary for code sharing. I think that encourages collaboration on a level that’s simply not possible in software written in C. Hope I don’t come off as a Smug Lisp Weenie™, but that’s my take on it.

    [1] – http://conkeror.mozdev.org
    [2] – http://www.delorie.com/gnu/docs/elisp-manual-21/elisp_212.html
    [3] – http://en.wikipedia.org/wiki/Monkeypatch
    [4] – http://emacswiki.org

  8. Daniel Watkins says: (permalink)
    June 19th, 2007 at 7:54 pm

    @Djordy Seelmann: I’ve got to wonder if a centralised workflow works best for commercial projects because commercial projects have always used a centralised workflow. Or, rather, is the reason that a DRCS doesn’t look like it would work in a commercial environment because a DRCS has never been used in that environment?

    Not having any experience with a commercial environment, I couldn’t comment. However, I’d be surprised if more flexibility is less desirable than less flexibility in any context…

  9. Road Warrior Collaboration » Merges Shouldn't be Scary says: (permalink)
    June 19th, 2007 at 8:58 pm

    [...] Shuttleworth continues his series of posts on version control tools by pointing out that it is the time to accomplish a merge that is crucial, not how easy it is to create a branch.  This is an extremely important point and in my experience a place where many tools and even [...]

  10. Julio says: (permalink)
    June 20th, 2007 at 12:16 pm

    Mark – if this is a specification for your bzr merging software – shouldn’t you be even more specific about the algorithm? Should it search for the common ancestor of the two merge candidates in the version tree and, aggregating past merges, highlight the remaining deltas? (Won’t mention name of IBM product here).

    Mark – how will your merging software handle the case where two separate branches that are to be merged have each concurrently added a file with the same name? (I.e. same name to the file system, but different identifiers to bzr.) This is a bug in a lot of commonly used version control systems today.

    Mark Shuttleworth says:

    I think Bzr will just give you a conflict, with the two files in different names, so you can rename them appropriately and thus resolve the conflict. You’ll need to pick names for each of them, and update anything that points to the wrong names. Then commit. Anybody who had one or other file in their branch and merges from you will get the other file and the two files will have the names you set. At least, that’s my recollection, but you should ask on #bzr to be sure.
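
As a rough illustration of the behaviour described in that reply, here is a small Python sketch. It assumes each branch’s tree can be reduced to a mapping from file identifier to path; `find_path_conflicts`, the ids and the file names are hypothetical, and real tools track far more state than this. It only shows why two files with different identifiers but the same name surface as a conflict that a rename resolves.

```python
from collections import defaultdict


def find_path_conflicts(tree):
    """Return {path: sorted file ids} for paths claimed by more than one id."""
    claims = defaultdict(list)
    for file_id, path in tree.items():
        claims[path].append(file_id)
    return {path: sorted(ids) for path, ids in claims.items() if len(ids) > 1}


# Each branch independently added a file called setup.py.
mine = {"id-a": "setup.py"}
theirs = {"id-b": "setup.py"}

# Merging keeps both files, because they have different identifiers.
merged = {**mine, **theirs}
conflicts = find_path_conflicts(merged)

# Resolving the conflict means picking a distinct name for one of them.
merged["id-b"] = "setup_theirs.py"
remaining = find_path_conflicts(merged)
```

Once the rename is committed, anyone who merges from that branch receives both identifiers with their chosen names, so the same conflict does not come up again.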

  11. Djordy Seelmann says: (permalink)
    June 20th, 2007 at 12:39 pm

    @zimbatm:
    Some good points raised there. Still, I think that a centralized RCS is preferred in a corporate environment because it matches its organizational structure. Secondly, it is easier for any lead developer to be updating from the root of a hosted repository (centralized) than it is to be pulling from each developer individually. Looking up the changes in the main RCS is a must for any project manager if he wants to track progress, and to me it seems easier to see this on one screen (through changelogs and revision ‘owners’) than it is to be communicating with each developer about what he or she has done. Tools like LaunchPad make a good middleman in this process, but Subversion, for instance, has already got the tools on board to aid project management, yet I feel these tools are hardly ever used since it lacks a clear mark-up. My point is that Subversion or any other centralized RCS is not used to its full potential, as it could also be used to heavily aid project management. I’ve written a short piece about how this could be established on my blog [ http://entrepreneur-y.com/?p=6 ] (apologies for the blogspam), and what I’m trying to figure out is how an identical system could work with any existing RCS of the distributed kind.

    @ Daniel Watkins:
    In my opinion, there are some reasons why a centralized workflow is preferred in corporate environments, and I’ve stated some of them above. I don’t think it’s a ‘because it’s always been like that’ case, as I feel that project managers frequently rethink their choice of RCS (even Mark has said that in an earlier post, if I recall correctly).

    Secondly, assuming that you are referring to operational flexibility: more flexibility comes with a price, deroutinization. If a firm has high flexibility while it requires less of it, it would simply mean that your firm is not routinized enough. High flexibility is only preferred in environments that ought to be highly innovative.

  12. Brandon Casey says: (permalink)
    June 20th, 2007 at 5:42 pm

    Djordy:
    I think you are taking the ‘pull from developers’ model too strictly. It is a possible workflow model but it is not mandated by a distributed RCS. Developers can also ‘push’ into a central repository, and then the lead developer or project manager could access it there. No need to poll each developer individually.

    All of the information is there to determine which developer made which modifications. One major difference in the central repository model that you are looking to make use of according to your blog, is locking. Locking in the sense that a single developer ‘locks’ a subset of the repository preventing other developers from making commits to that subset. By checking which developer is holding which locks, a project manager could determine what the developers were working on. I’m sure people can debate whether this sort of locking is beneficial or detrimental till the cows come home.

    I think the same information could be determined by looking at the recent commits of a particular developer, and at which sub-modules (top-level directories) of the project they occurred in. As far as I know, tools that turn this into a nice graphic (or table) also do not exist.

    -brandon

  13. ProjectX Blog » Blog Archive » Links says: (permalink)
    June 22nd, 2007 at 3:39 am

    [...] Merging is the key to software development collaboration [...]

  14. RomaCogitans says: (permalink)
    June 27th, 2007 at 8:42 am

    Hi,

    On my Blog I wrote an article from the perspective of a Linux end-user, hence a NON-IT expert:

    http://romacogitans.wordpress.com/2007/06/26/open-letter-to-the-open-source-community/

    I would be pleased if you and your readers could take a look at it and contribute to the discussion.

    See you there!

  15. svnmerge, a tool to manage SVN merges - Kyle Cordes says: (permalink)
    June 27th, 2007 at 9:14 pm

    [...] of merging, Mark Shuttleworth recently argued merging is the key to software developer collaboration. To me, this is obviously true, and not only for open-source projects, but for closed-source [...]

  16. Anzor Israilov says: (permalink)
    June 28th, 2007 at 10:10 am

    Rahmet (which means “thank you” in kazakh) Mark for Ubuntu!!! It is great!!!

  17. Dominique says: (permalink)
    June 28th, 2007 at 7:38 pm

    Lol* I have my own wacked out views on OS and how it is currently approached… I’m all for it, but…
    (comment added to RomaCogitans’ and my own blog.)

  18. evanc says: (permalink)
    June 30th, 2007 at 3:09 am

    Hi Mark,

    Are you coming to NY at all during your US trip? Will be nice to see you. (this does not need to be posted)

  19. evanc says: (permalink)
    June 30th, 2007 at 3:10 am

    that was quick

  20. links for 2007-07-10 : Bob Plankers, The Lone Sysadmin says: (permalink)
    July 10th, 2007 at 6:17 am

    [...] Mark Shuttleworth » Blog Archive » Merging is the key to software developer collaboration Dead on. [...]

  21. Mark Shuttleworth on renaming and merging — Version Control Blog — branching and merging, Ideas, SCM features and concepts, Bazaar/bzr, says: (permalink)
    October 8th, 2007 at 8:50 pm

    [...] “Merging is the key to software developer collaboration”; [...]

  22. Fnord. » Blog Archive » Time to git ‘er done? says: (permalink)
    October 10th, 2007 at 10:05 pm

    [...] Shuttleworth on ‘merging is key’: http://www.markshuttleworth.com/archives/126 [...]

  23. biopython: looking for a new VCS? | Bioinfo Blog! says: (permalink)
    February 16th, 2009 at 10:05 pm

    [...] Merging is the key to software developer collaboration (Mark Shuttleworth) [...]