Cleaning up oversized GitHub repositories for R and beyond

R-bloggers 2014-06-26

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

The version control system Git is an amazing piece of software for tracking every change that you make to a project and saving its entire history. It is incredibly useful for users of R and other programming languages, leading it to shoot from zero market share in 2005 (when it was first released) to market domination in one short decade.

However, Git can cause confusion. Even (or at times especially) when used in conjunction with a nice graphical user interface such as that provided by GitHub, the main online repository of Git projects worldwide and home to over 10 million projects, Git can cause chaos. Like Linux (the operating system was incidentally created by the same prolific person), Git assumes you know what you’re doing. If you do not, watch out!

Partly knowing what I was doing (but not fully), I set up a repository to host a tutorial on making maps in R. I was pretty relaxed about what went in there, and soon the repository grew to an unwieldy 60 MB in size, with over 20 MB just to download the automatically created zip file. (It is now a sprightly 2.6 MB zipped, wahey!) Needless to say, this did not help my aim of making R accessible to everyone, a tool for empowerment (as this inspiring article about R for blind people shows it can be).

So I decided to act to clean things up. In the hope it’ll be useful to others, what follows is a description of the main steps I took to sort things out.

[Image: cleaning in action]

Step 1: delete files in the current project

The first stage was simply to identify and delete excessively large files in the current version of the project. For this there is no better program than Baobab, which shows you where bloat exists on your system.
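For anyone working without a graphical tool like Baobab, roughly the same check can be sketched at the command line. This is only an illustration, not the method used for this post: `list_big_files` is a hypothetical helper name, and the 1024k threshold is an arbitrary assumption.

```shell
# Hypothetical helper: list files above a size threshold in the working
# tree, skipping the .git folder (which holds the history, not the
# current files).
# Usage: list_big_files DIR SIZE   (SIZE uses find syntax, e.g. +1024k)
list_big_files() {
    dir=${1:-.}
    size=${2:-+1024k}
    find "$dir" -path "$dir/.git" -prune -o -type f -size "$size" -print
}
```

Calling `list_big_files . +1024k` from the project root prints every file over about 1 MB in the current version of the project; `du -sh .git` gives the size of the history for comparison.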

That was only part of the problem though: as shown in the image of disk usage from Baobab below, most (80%, almost 50 MB) of the space was taken up by the .git folder itself. This meant that files I’d changed in the past were taking up the most space. Git is designed not to let you change the past but to save it…

[Figure: Baobab disk usage before cleaning]
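To see which past files are responsible for the bulk of the .git folder, the largest blobs in the history can be listed with standard Git plumbing commands. The following is a sketch rather than part of the original workflow; `largest_blobs` is a hypothetical helper name.

```shell
# Hypothetical helper: print the N largest blobs anywhere in the history
# as "SIZE_IN_BYTES PATH", largest first. Run it inside a repository.
largest_blobs() {
    n=${1:-10}
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        grep '^blob ' |
        cut -d' ' -f2- |
        sort -rn |
        head -n "$n"
}
```

Files that were deleted from the current version long ago still show up here, because every committed version remains in the history.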

Step 2: use the BFG

Next up is the BFG ‘repo cleaner’. This is just a small Java program that cleans up unwieldy commits using a command line interface.

In order for it to work, you need to mirror your repository, using the --mirror flag when you clone. The first step was thus:

    $ git clone --mirror git@github.com:Robinlovelace/Creating-maps-in-R.git

Next, you run this (in a Linux terminal, as illustrated by the $ sign), changing the size depending on what you want to keep:

    $ java -jar ~/programs/bfg-1.11.7.jar --strip-blobs-bigger-than 1M Creating-maps-in-R.git

This successfully cut the size of the project in half, making it far more accessible, as shown in the figure below. Note that the changes made by the BFG only translate into disk space savings after running the following commands (suggested in the BFG usage section):

    $ cd Creating-maps-in-R.git/
    $ git reflog expire --expire=now --all
    $ git gc --prune=now --aggressive

[Figure: Baobab disk usage after cleaning]
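To put a number on the saving without a disk usage viewer, the packed size of the history can be read from `git count-objects` before and after the clean-up. This is a minimal sketch; `packed_size_kib` is a hypothetical helper name and not part of Git or the BFG.

```shell
# Hypothetical helper: print the packed size of a repository's history
# in KiB, as reported on the size-pack line of `git count-objects -v`.
# Usage: packed_size_kib [REPO_DIR]
packed_size_kib() {
    git -C "${1:-.}" count-objects -v | awk '/^size-pack:/ { print $2 }'
}
```

Running `packed_size_kib Creating-maps-in-R.git` once before the BFG step and once after `git gc` gives two figures to compare. Objects that have not yet been packed are reported separately by `git count-objects`, so the figure is most meaningful straight after a `git gc`.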

One issue

The only issue I encountered was this message, which appeared when pushing the cleaned repository back to GitHub:

    ! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)

Although this was repeated several times, it didn’t seem to influence the success of the operation: I’ve halved the size of my GitHub repo and cut the zip file people need to download to run the tutorial code to roughly an eighth of its former size. The refs/pull/* refs are GitHub’s own read-only records of pull requests, so they cannot be updated from the client side. So the issue seems to be a non-issue in the grand scheme of things.

Conclusion

Ideally we’d all be like Linus Torvalds and make no mistakes. But unfortunately we are human and prone to mistakes, which are actually one of the best ways of learning. Thanks to software like BFG and many helping hands through the open source community, 99 times out of 100 these mistakes are no big deal. I hope this post will help others to shrink unwieldy Git repositories and uncrustify their lives. More importantly, I hope this leads to better design from the outset: the experience has certainly made me think carefully about project design, including saving giant .RData files externally and keeping new objects in a project to a minimum. According to Joseph Tainter, the marginal costs of added complexity now outweigh the benefits for industrial civilization. Let’s hope R users and other programmers, at the very least, can simplify our lives sufficiently to avoid collapse. Hopefully then the rest of society will follow!
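On the design point, one simple way to keep .RData files out of a repository from the outset is an ignore rule. The sketch below is an assumption about how that might be set up, not something taken from the tutorial repository; `ignore_rdata` is a hypothetical helper name.

```shell
# Hypothetical helper: add an ignore rule for .RData files to a
# repository's .gitignore, without duplicating the rule if it is
# already present.
# Usage: ignore_rdata [REPO_DIR]
ignore_rdata() {
    repo=${1:-.}
    touch "$repo/.gitignore"
    grep -qxF '*.RData' "$repo/.gitignore" ||
        printf '%s\n' '*.RData' >> "$repo/.gitignore"
}
```

An ignore rule only affects files that are not yet tracked. A .RData file that has already been committed needs `git rm --cached` to stop being tracked, and it still stays in the history until something like the BFG removes it.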

