The Thing About Git

April 08, 2008

The thing about Git is that it’s oddly liberal with how and when you use it. Version control systems have traditionally required a lot of up-front planning followed by constant interaction to get changes to the right place at the right time and in the right order. And woe unto thee if a rule is broken somewhere along the way, or you change your mind about something, or you just want to fix this one thing real quick before having to commit all the other crap in your working copy.

Git is quite different in this regard. You can work on five separate logical changes in your working copy – without interacting with the VCS at all – and then build up a series of commits in one fell swoop. Or, you can take the opposite extreme and commit really frequently and mindlessly, returning later to rearrange commits, annotate log messages, squash commits together, tease them apart, or rip stuff out completely. It’s up to you, really. Git doesn’t have an opinion on the matter.

Remember a long time ago, at the dinner table, when your kid brother mashed together a bunch of food that really should not have been mashed together – chicken, jello, gravy, condiments, corn, milk, peas, pudding, all that stuff – and proceeded to eat it? And loved it! And then your crazy uncle, having seen the look of disgust on your face, said: “it all goes to the same place!” Remember that? No? Then you were probably the one shoving nasty shit into your face, but the important thing to understand here is that your uncle is crazy. And so is Git.

I’ve personally settled into a development style where coding and interacting with version control are distinctly separate activities. I no longer find myself constantly weaving in and out due to the finicky workflow rules demanded by the VCS. When I’m coding, I’m coding. Period. Version control - out of my head. When I feel the need to organize code into logical pieces and write about it, I switch into version control mode and go at it.

I’m not saying this is the Right Way to use Git: in the end, it all goes to the same place. I’m saying that this is the way I seem naturally inclined to develop software, and Git is the first VCS I’ve used that accommodates the style.

I’d like to run through a short example – on the off chance that my extreme hyperbole has left you unconvinced – that shows how one might first stumble onto some of Git’s more advanced features, and that hopefully also brings to light how easily one could then develop a strong addiction to such features.

The Tangled Working Copy Problem

Suppose that, last night, I start work on some enhancements to the “Leave Comment” forms on this site. I figure this will take all of maybe ten minutes, so I begin pounding away in my working copy. After screwing around for an hour or so, I give up and go to bed, leaving the half-baked changes in my working copy.

The next morning, I coffee up and find my del.icio.us bookmarks not being sucked into the site properly and so I start playing with that mess (this is completely unrelated to what I was doing the night before, mind).

After working out the small problem with sucking in bookmarks, I take a peek at git status to see where my working copy is at:

$ git status
# On branch master
# Changed but not updated:
#
#     modified: models.rb
#     modified: views/entry.haml
#     modified: bin/synchronize-bookmarks
#     modified: js/tomayko.js
#     modified: stylesheets/tomayko.css

I realize, for the first time, that I have two unrelated changes in my working copy:

The experimental comment form tweaks: models.rb, entry.haml, tomayko.js, and tomayko.css. I’m not ready to push this into the live site yet so I don’t want these changes on the master branch.
Bookmark synching fixes: models.rb and synchronize-bookmarks. This needs a commit on master and should be shipped up to the live site, immediately.

The big problem here is models.rb - it’s “tangled” in the sense that it includes modifications from two different logical changes. I need to tease these changes apart into two separate commits, somehow.

This is the type of situation that occurs fairly regularly (to me, at least) and that very few VCS’s are capable of helping out with. We’ll call it “The Tangled Working Copy Problem.”

Git means never having to say, “you should have”

If you took The Tangled Working Copy Problem to the mailing lists of each of the VCS’s and solicited proposals for how best to untangle it, I think it’s safe to say that most of the solutions would be of the form: “You should have XXX before YYY.”

Subversion: You should have committed the experimental changes to a separate branch before working on the bookmark stuff.
Bazaar: You should have shelved your experimental changes before working on the bookmark stuff.
CVS: You should have RTFM before wasting everyone’s time with such a lame question.

Here’s a general principle I would like my VCS to acknowledge: moving from the present point B to some desired point C should not require a change in behavior at point A in the past. More simply, the phrase: “you should have,” ought to set off alarm bells. These are precisely the types of problems I want my VCS to solve, not throw back in my face with rules for how to structure workflow the next time.

(To be fair, Mercurial handles The Tangled Working Copy Problem without breaking a sweat and others do as well.)

Solving The Tangled Working Copy Problem When Your VCS Won’t

I run into The Tangled Working Copy Problem so often that I’ve devised a manual process for dealing with it under VCS’s that punt on the problem. For instance, if I were using Subversion, I might go at it like this:

1. Run svn diff over the files with changes I don’t want to commit (the comment-related stuff), piping the output into vim.
2. Remove hunks from the diff corresponding to those changes I want to commit (the bookmark-related hunks) and write the diff out to comment-stuff.diff.
3. Run patch -p0 -R < comment-stuff.diff. This removes the comment-related changes from my working copy (-R = “apply diff in reverse”).
4. Commit the bookmark-related fixes sitting in my working copy to the repository.
5. Run patch -p0 < comment-stuff.diff to reapply the comment-related changes to my working copy.
6. Forget to create a branch for the comment stuff, again.
7. Hack on comment stuff for a while.
8. Find more unrelated brokenness and fix it.
9. Oops! GOTO 1.
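For the curious, the patch shuffle at the heart of that process can be sketched in a scratch directory. This is only an approximation: it uses plain diff and patch rather than svn, the file names and contents are made up, the "commit" is just a file copy, and the hand-editing of the diff in vim is faked by generating the comment-only diff directly.

```shell
#!/bin/sh
# Sketch of the diff / patch -R shuffle. Everything here is hypothetical
# scratch data; plain diff and cp stand in for svn diff and svn commit.
set -e
work=$(mktemp -d)
cd "$work"

# models.rb.orig stands in for the last committed version.
seq 1 20 > models.rb.orig

# The tangled working copy: a bookmark fix (line 1) and a comment experiment (line 20).
sed -e '1s/.*/bookmark fix/' -e '20s/.*/comment experiment/' models.rb.orig > models.rb

# End up with a diff that holds only the comment hunk. By hand this means
# deleting hunks in an editor; here it is generated from a comment-only copy.
sed -e '20s/.*/comment experiment/' models.rb.orig > comment-only.rb
diff -u models.rb.orig comment-only.rb > comment-stuff.diff || true  # diff exits 1 when files differ

# Back the comment change out of the working copy (-R applies the diff in reverse).
patch -s -R models.rb < comment-stuff.diff

# "Commit" what is left, which is now only the bookmark fix.
cp models.rb committed.rb

# Reapply the comment change to the working copy.
patch -s models.rb < comment-stuff.diff
```

Even in this toy form there are three extra files to keep track of for a single two-line change, which is rather the point.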

This works well enough when there are no changes to binary files, when the diff doesn’t mind being teased apart, and when there are only two or three changes tangled up, but it raises the question: what am I paying my VCS for?

The idea of manually managing sets of patches to coerce my patch management program into managing patches is literally absurd.

Viva La Index

Git has this alien thing between the working copy and the repository called The Index. I was entirely annoyed by the concept when starting out - you have no idea why you’re forced to deal with it and you’re always dealing with it. Even after reading multiple accounts of what The Index supposedly was, I continued to be baffled by it, wondering how it could possibly serve any useful purpose at all. That is, until the first time I ran into The Tangled Working Copy Problem.

The Index is also sometimes referred to as The Staging Area, which makes for a much better conceptual label in this case. I tend to think of it as the next patch: you build it up interactively with changes from your working copy and can later review and revise it. When you’re happy with what you have lined up in the staging area, which basically amounts to a diff, you commit it. And because your commits are no longer bound directly to what’s in your working copy, you’re free to stage individual pieces on a file-by-file, hunk-by-hunk basis.

Once you’ve wrapped your head around it, this seemingly simple and poorly named layer of goo between your working copy and the next commit can have some really magnificent implications for the way you develop software.

Solving The Tangled Working Copy Problem With Git’s Index

Let’s review the status of our working copy:

$ git status
# On branch master
# Changed but not updated:
#
#     modified: models.rb
#     modified: views/entry.haml
#     modified: bin/synchronize-bookmarks
#     modified: js/tomayko.js
#     modified: stylesheets/tomayko.css

We want to commit all of the changes to synchronize-bookmarks and some of the changes to models.rb, so let’s add them to the staging area:

$ git add bin/synchronize-bookmarks
$ git add --patch models.rb
diff --git a/models.rb b/models.rb
index be4159d..3efd4ce 100644
--- a/models.rb
+++ b/models.rb
@@ -256,7 +256,7 @@
     class Bookmark < Entry
       next unless source[:shared]
       bookmark = find_or_create(:slug => source[:hash])
-      bookmark.update_attributes(
+      bookmark.attributes = {
         :url        => source[:href],
         :title      => source[:description],
         :summary    => source[:extended],
Stage this hunk [y/n/a/d/j/J/?]?

The magic is in the --patch argument to git-add(1). This instructs Git to display all changes to the files specified on a hunk-by-hunk basis and lets you choose one of the following options for each hunk:

y - stage this hunk
n - do not stage this hunk
a - stage this and all the remaining hunks in the file
d - do not stage this hunk nor any of the remaining hunks in the file
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks

In this case, I staged (y) about half of the hunks (the ones that were bookmark related) and left the other hunks unstaged (n). Now my index has all of the changes to synchronize-bookmarks plus half of the changes made to models.rb.

I like to check that the changes in the staging area match my expectations before committing:

$ git diff --cached
[diff of changes in staging area]

I also like to verify that my unstaged / working copy changes are as I expect:

$ git diff
[diff of changes in working copy that are not in the staging area]

Everything looks good, so I commit the staged changes:

$ git commit -m "fix bookmark sucking problems"

I’m left with only the experimental comment enhancements in my working copy and am free to move them onto a topic branch, or maybe I’ll just let them sit in my working copy for a while. Git doesn’t care.

Taking Control of Your Local Workflow

We’ve seen how to use git add --patch to pluck specific changes out of the working copy and stage them for the next commit, a nice feature that elegantly solves a once-tedious problem and that makes possible a previously forbidden style of development. There’s more where that came from, though. Here are some related concepts that you will want to also introduce yourself to:

git add --patch is actually a shortcut to features in git add --interactive, a powerful front-end for managing all aspects of the staging area. The git-add(1) manual page is a treasure trove of worthwhile information that’s often passed over due to the traditional semantics of VCS “add” commands. Remember that git-add(1) does a lot more than just add stuff - it’s your interface for modifying the staging area.
git commit --amend takes the changes staged in the index and squashes them into the previous commit. This lets you fix a problem with the last commit, which is almost always where you see the technique prescribed, but it also opens up the option of a commit-heavy workflow where you continuously revise and annotate whatever it is you’re working on. See the git-commit(1) manual page for more on this.
And then there’s git rebase --interactive, which is a bit like git commit --amend hopped up on acid and holding a chainsaw - completely insane and quite dangerous but capable of exposing entirely new states of mind. Here you can edit, squash, reorder, tease apart, and annotate existing commits in a way that’s easier and more intuitive than it ought to be. The “INTERACTIVE MODE” section of the git-rebase(1) manual page is instructive but Pierre Habouzit’s demonstration is what flipped the light on for me.
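Here is a rough sketch of the last two commands in a scratch repository. The file name and commit messages are invented, and setting GIT_SEQUENCE_EDITOR to a sed one-liner is just a way of editing the rebase todo list without opening an editor, so the script can run unattended. In real use you would edit that list by hand.

```shell
#!/bin/sh
# Sketch: git commit --amend, then an interactive rebase that folds two
# commits into one. File names and messages are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "one" > notes.txt
git add notes.txt
git commit -q -m "add notes"

# Forgot something: stage it and fold it into the previous commit.
echo "two" >> notes.txt
git add notes.txt
git commit -q --amend -m "add notes (with second line)"
after_amend=$(git rev-list --count HEAD)   # still a single commit

# Two small, mindless commits ...
echo "three" >> notes.txt
git commit -q -am "wip: three"
echo "four" >> notes.txt
git commit -q -am "wip: four"

# ... folded together by changing the second "pick" in the todo list to "fixup".
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2s/^pick/fixup/'" git rebase -q -i HEAD~2
after_rebase=$(git rev-list --count HEAD)
```

The fixup action keeps the first commit's message and discards the second, which is why no message editor appears. Using squash instead would stop and let you combine the two messages.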

That’s really all you need to know above and beyond Git’s fundamentals to start dominating your local workflow. From here, you may want to explore some of the various other concepts and utilities specifically designed to augment your local workflow:

People seem to get a lot of utility out of git-stash(1), which lets you move changes from your working copy into a lightweight holding area to be reintroduced some time later. I personally haven’t used it much in practice, even though I used Bazaar’s rough equivalent of git-stash(1) (bzr shelve) frequently. I find that the staging area removes the need for stashing in a bunch of cases, and when I really do need to get stuff out of my working copy and somewhere safe, I just create a topic branch.
I haven’t played with it yet but StGIT (“Stacked Git”) looks seriously interesting from the examples in the tutorial. I tend to visualize version control concepts as series of patch operations so I’d probably feel more at home with this style of front-end.
There’s a section of the Git User’s Manual called The Workflow that describes, at a fairly low level, the various interactions between the working copy, the index, and the object database.
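For completeness, the stash round trip mentioned above looks roughly like this. It is a minimal sketch in a scratch repository with made-up file names, and it covers only the simple case where the stashed change and the intervening commit touch different files.

```shell
#!/bin/sh
# Sketch: park half-finished work with git stash, commit something else,
# then bring the work back. File names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "stable" > app.txt
git add app.txt
git commit -q -m "initial import"

# Half-baked change sitting in the working copy.
echo "half-baked" >> app.txt

# Move it into the holding area; the working copy is clean again.
git stash push -q
clean=$(git status --porcelain)

# Make and commit an unrelated fix while the experiment is parked.
echo "fix" > other.txt
git add other.txt
git commit -q -m "unrelated fix"

# Reintroduce the parked change on top of the new commit.
git stash pop -q
restored=$(git diff)
```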