My adventures with git-filter-branch

composition.al 2017-01-11

Not too long ago, I found myself in a tricky situation with git. I had an enormous git repo containing a few things I wanted to keep and many, many things I didn’t, including some large files that probably never should have been checked in. I wanted to create a fresh copy of the repo that included only a small subset of the files from the original repo’s master branch (and only the history pertaining to those files).

git-subtree wasn’t the right tool for the job, because I wanted to include some but not all of the files in certain directories. I would have liked to use the BFG Repo-Cleaner, but it wasn’t the right tool either, because it has a “your current files are sacred” assumption, and I didn’t want to keep everything in HEAD. In fact, in the original repo, HEAD had about 5000 files (with many more files that weren’t in HEAD but had existed in history at various points), and there were only about 100 files that I wanted to keep. So it was time to take the ol’ git-filter-branch chainsaw out for a spin.

The plan

Since I only wanted to keep about 100 files, I decided to manually create a whitelist of the files I wanted to keep. Then, I figured, I could use the whitelist to generate a (long!) blacklist of every other file in the repo. Finally, I’d create a new branch and use git filter-branch to remove every file on the blacklist from the history of the new branch, and I could push that branch to a newly created GitHub repo.

The trickiest part was coming up with the git filter-branch command, but after staring at the docs for a while and looking at a few Stack Overflow answers, I came up with what seemed to be an appropriate incantation:

git filter-branch --index-filter "cat /tmp/files-to-remove.txt | xargs --delimiter='\n' git rm -r --ignore-unmatch" --prune-empty

I used the --index-filter option, which takes as its argument the command that you want to use to, uh, filter the index. In my case, that command was a git rm command. Since there were so many files to remove, I used xargs to feed the files from the blacklist to git rm. The --ignore-unmatch option to git rm allowed it to succeed when attempting to remove a file that doesn’t exist. The --prune-empty option to git filter-branch filtered out any commits that were left empty after all the unwanted files were removed from history. Finally, the --delimiter='\n' option to xargs ensured that file names containing spaces wouldn’t mess things up.

Why did I need the -r option to git rm? All I know is that, without it, I seemed to get an “index filter failed” error from git filter-branch after a few dozen rewrites. The fact that I needed -r will, alas, become important later.

Attempt one

The below was more or less my first attempt at a Python script to implement the above plan using the git-filter-branch incantation I came up with. See if you can spot all the bugs!

#!/usr/bin/python

import os
import subprocess

WORKING_DIR = subprocess.Popen(["pwd"], stdout=subprocess.PIPE).communicate()[0].rstrip()
REPO_ROOT = subprocess.Popen(["git", "rev-parse", "--show-toplevel"], stdout=subprocess.PIPE).communicate()[0].rstrip()

# list all files in repo
os.chdir(REPO_ROOT)
os.system("find . > /tmp/all-files.txt")

all_files_fh = open("/tmp/all-files.txt", 'r')
all_files = [f.strip() for f in all_files_fh.readlines()]

# get a whitelist of files we want to keep
files_to_keep_fh = open("files-to-keep.txt", 'r')
files_to_keep = [f.strip() for f in files_to_keep_fh.readlines()]

# create a blacklist of files to remove
files_to_remove = filter(lambda(f): f not in files_to_keep, all_files)

print "Total files:     ", len(all_files)
print "Files to remove: ", len(files_to_remove)
print "Files to keep:   ", len(files_to_keep)

assert(len(files_to_remove) + len(files_to_keep) == len(all_files))

files_to_remove_fh = open("/tmp/files-to-remove.txt", 'w')
for f in files_to_remove:
    files_to_remove_fh.write("%s\n" % f)

# remove blacklisted files with git-filter-branch
os.system("git filter-branch --index-filter \"cat /tmp/files-to-remove.txt | xargs --delimiter='\n' git rm -r --ignore-unmatch\" --prune-empty")

# go back to where we started
os.chdir(WORKING_DIR)

When I ran this script, it seemed to be doing the right thing. However, when it finished running (after about fifteen minutes or so — it takes a while to rewrite 1100-some commits), it appeared to have filtered out everything! I was flummoxed by this behavior and asked some friends for help.

It was David Turner who finally figured out what was going on. The file /tmp/files-to-remove.txt ended in a newline character. When xargs encountered the newline character, it thought another argument was coming up, and then it hit EOF. So, xargs passed the zero-length argument to git rm -r — and, helpfully, git rm -r '' has the same behavior as git rm -r ., which removes everything.
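
The same splitting behavior is easy to reproduce in plain Python: when the input ends with the delimiter, splitting on that delimiter produces an empty final item. (The file names below are made up.)

# Splitting on a delimiter when the input ends with that delimiter yields
# an empty final item: the analogue of the zero-length argument that xargs
# handed to git rm. (Hypothetical file names.)
blacklist = "big-file.bin\nold notes.txt\n"   # note the trailing newline
print(blacklist.split("\n"))
# ['big-file.bin', 'old notes.txt', '']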

I pointed out to David that git rm’s behavior in this case was annoyingly different from that of rm, and he agreed. In fact, he said, he would go so far as to describe it as a bug in git. “So,” he said, “you hit an edge case in xargs plus a bug in git. Brutal.”1

As it turns out, Emily Xie fixed the bug, as her first contribution to git, while on sabbatical at the Recurse Center in spring 2016. The patch is now in git v2.11.0, which came out about a month ago. (Having been an RC resident myself, I’m delighted that almost everyone involved with this patch has some sort of RC affiliation!) But all that happened too recently for it to have been any help to me when I was trying to fix this problem, so I got rid of the ending newline in the blacklist file by changing

for f in files_to_remove:
    files_to_remove_fh.write("%s\n" % f)

to

files_to_remove_fh.write("\n".join(files_to_remove))

which resolved the issue of everything getting filtered out.
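
If it’s not obvious why that change matters, here’s a minimal illustration of the difference between the two ways of writing the blacklist (again with made-up file names):

# The loop wrote a newline after every name, including the last one;
# join only puts newlines between names.
names = ["big-file.bin", "old notes.txt"]

looped = "".join("%s\n" % n for n in names)   # what the original loop wrote
joined = "\n".join(names)                     # what the fixed version writes

print(repr(looped))   # 'big-file.bin\nold notes.txt\n'  (trailing newline)
print(repr(joined))   # 'big-file.bin\nold notes.txt'    (no trailing newline)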

Attempt two

After that, when I ran the script, the filtered branch looked like it only had the hundred-some files in it that I wanted, or at least that’s what I saw when I looked at HEAD. But when I tried pushing that filtered branch to GitHub, I got the dreaded “this exceeds GitHub’s file size limit of 100.00 MB” message. (The original, enormous repo had never been hosted on GitHub.) I knew that none of the files I was keeping were anywhere near 100 MB, and so apparently there were large files somewhere back in history that weren’t getting filtered out.

Happily, this bug was more straightforward than the first one. The problem was that I was using find . to list all of the files in the repo — but, of course, this only listed files that were present in HEAD, and not files that weren’t in HEAD but had existed at some other point in history. At that point, I learned that there’s an easy way to list all the files that have ever existed in a git repo, and so I replaced

os.system("find . > /tmp/all-files.txt")

with

os.system("git log --pretty=format: --name-only --diff-filter=A | sort -u > /tmp/all-files.txt")

which resolved the problem of mysterious large files remaining in history.

Third time’s the charm?

At this point, I was pretty sure my script was correct. My whittled-down branch had turned out to have 350-ish commits, as opposed to the 1100-some commits on the original master branch. These 350-ish commits were the ones that pertained to the 100-ish files I was keeping. At first, it seemed a little surprising that nearly a third of the commits in the repo would have affected those 100 files I’d kept, since they represented less than 1% of the more than 12,000 files that had ever existed in the repo. But it makes sense, because, after all, those 100 files I wanted to keep were the important ones!

I used the script a few times and pushed the resulting whittled-down branch to GitHub without incident. But then, one day, I ran the script and noticed that there were some files that were mysteriously not being filtered out of HEAD, and I couldn’t for the life of me figure out why. In fact, this bug, although it turned out to be the most embarrassingly obvious of the three, actually took me the longest to figure out.

As a hint, here are the first few lines of output of the script:

$ ./filter-script.py
Total files:      12506
Files to remove:  12392
Files to keep:    114

Compare those numbers with this:

$ wc -l /tmp/files-to-remove.txt
12283 /tmp/files-to-remove.txt

We’re supposed to be removing 12,392 files, so why did the blacklist file only have 12,283 lines? Because not all 12,392 filenames were being flushed to the file! The fix for this bug was closing my open file handle before running git filter-branch.
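
In other words, the tail end of the blacklist was still sitting in Python’s write buffer when git filter-branch read the file. Here’s a stripped-down sketch of the problem, with a made-up path and file names:

# A sketch of the buffering problem (hypothetical path and names).
files_to_remove = ["big-file.bin", "old notes.txt"]

fh = open("/tmp/demo-blacklist.txt", 'w')
fh.write("\n".join(files_to_remove))
# Some (or all) of that data may still be in Python's write buffer here,
# so a subprocess reading the file at this point can see a truncated
# blacklist.
fh.close()   # close() flushes the buffer; an explicit fh.flush() would too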

Here’s the whole script as it now stands:

#!/usr/bin/python

import os
import subprocess

WORKING_DIR = subprocess.Popen(["pwd"], stdout=subprocess.PIPE).communicate()[0].rstrip()
REPO_ROOT = subprocess.Popen(["git", "rev-parse", "--show-toplevel"], stdout=subprocess.PIPE).communicate()[0].rstrip()

# list all files in repo
os.chdir(REPO_ROOT)
os.system("git log --pretty=format: --name-only --diff-filter=A | sort -u > /tmp/all-files.txt")

all_files_fh = open("/tmp/all-files.txt", 'r')
all_files = [f.strip() for f in all_files_fh.readlines()]

# get a whitelist of files we want to keep
files_to_keep_fh = open("files-to-keep.txt", 'r')
files_to_keep = [f.strip() for f in files_to_keep_fh.readlines()]

# create a blacklist of files to remove
files_to_remove = filter(lambda(f): f not in files_to_keep, all_files)

print "Total files:     ", len(all_files)
print "Files to remove: ", len(files_to_remove)
print "Files to keep:   ", len(files_to_keep)

assert(len(files_to_remove) + len(files_to_keep) == len(all_files))

files_to_remove_fh = open("/tmp/files-to-remove.txt", 'w')
files_to_remove_fh.write("\n".join(files_to_remove))
files_to_remove_fh.close() # flushes, too

# remove blacklisted files with git-filter-branch
os.system("git filter-branch --index-filter \"cat /tmp/files-to-remove.txt | xargs --delimiter='\n' git rm -r --ignore-unmatch\" --prune-empty")

# go back to where we started
os.chdir(WORKING_DIR)

This last issue, although it should have been the most obvious of the three, actually took the longest to debug. I think that the issue here was psychological. Since my code contained a big, hairy git filter-branch incantation, a part of me just assumed that the bug would be somewhere in that line, and would be due to some subtle, arcane git thing. But, no, it was just that I forgot to close the damn file handle.

As a programming languages researcher, I probably ought to be saying something here about how the existence of this bug is a really good argument for languages with linear type systems that would require file handles to be closed exactly once, and how using such a language would rule out this class of bugs. As a human being who writes code, though, let’s be honest: it’s not like I’m going to stop using Python and its ilk for little scripts like this. (It probably would have been a good idea to open the file with a with statement so it would get closed automatically, but with doesn’t feel very natural to me.)
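
For what it’s worth, the with version of the blacklist-writing fragment from the script above is barely any longer, and the handle gets closed (and therefore flushed) when the block exits, even if something inside it raises an exception:

# Same as the fragment in the script above, but the file handle is closed
# automatically at the end of the with block.
with open("/tmp/files-to-remove.txt", 'w') as files_to_remove_fh:
    files_to_remove_fh.write("\n".join(files_to_remove))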

Wouldn’t it be cool if there were some kind of tool with very low overhead from the programmer’s perspective, maybe some kind of wrapper for Python, that would translate the fragment of my code that uses file handles into a linear logic program and then attempt to give it a type, and produce a sensible error (“Perhaps you forgot to close a file handle?”) if it couldn’t? A principled linter, if you will? Maybe someone’s already doing this! If so, I would love to see if their tool would have caught my bug.

  1. Furthermore, as Benjamin Gilbert pointed out, xargs would normally ignore the trailing newline, because it’s typical for text files on Unix to end with one. But I was using the --delimiter='\n' argument to xargs in case there were any filenames in the repo containing spaces, which turns off that behavior!