The Problem
The project I’m working on is very large and has a repository to match, due in part to the fact that, early in the life of the project, the database and some large binary files were added to and then removed from the repo. At 800MB compressed, the repo was far too large: it caused problems when cloning into new environments, made re-indexing in IDEs time-consuming, and needed to be reduced in size.

We needed to keep much of the repository history intact. The project has had and continues to have several developers working on it, and we decided that exporting the project and creating a new repository with no history was not the best option for us. We therefore embarked on a plan that we thought would help to get the repository down to a manageable size.

The Repo
The bulk of the repo is a Drupal installation with hundreds of custom modules, contrib modules and features. In addition, we had some folders, vendor libraries and documentation on the same level as the Drupal docroot that needed to be retained.

|--docroot
|----sites          <-- we needed this primarily
|----robots.txt     <-- we also needed these
|--hooks            <-- we also needed these
|--library          <-- we also needed these
|--vendor           <-- we also needed these
|--utils            <-- we also needed these

We knew that the large file(s) were originally added to the docroot and below. At the beginning of the process, the repository with its branches, tags, backup reflogs, hooks etc. was 803M compressed.
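
If you are not sure where the weight in your own repository sits, a listing of the largest blobs in the history can help. The following is a minimal sketch rather than something we ran as part of this process; the function name largest_blobs is made up for illustration, and it assumes it is run from inside the repository.

```shell
# Sketch: list the largest blobs anywhere in the history, biggest first.
# largest_blobs is a hypothetical helper name, not a git command.
largest_blobs() {
    # $1 = number of entries to show (defaults to 10)
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |   # keep blobs only, leaving "size path"
        sort -rn |               # largest size first
        head -n "${1:-10}"
}
```

Calling largest_blobs 20 from the top of the repository prints one "size path" pair per line, with sizes in bytes, including files that were deleted from the working tree long ago.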

The Process

  1. First and foremost, we clone the repository into a new space on the working computer
    git clone <repository-url> mynewrepo
  2. Change into the mynewrepo directory; the commands we are going to use are run from there. As we were working with a Drupal repository, this would normally be the docroot
  3. We detach the repository from its remote origin. This was done to prevent any pushes going back to the original remote repository. Doing this means we are only working with the clone of the repo that is on our working machine, and any work we do won't affect the original source(s)
    git remote remove origin    # or: git remote rm origin
  4. We didn't want to bother with tags so all the tags were removed.
    git tag -l | xargs git tag -d
  5. Next we use the filter-branch command to get rid of everything but the folder we want. It should be noted that, as we have removed the link to the remote origin and we are only filtering the HEAD branch, no other branches will be left after this command is used.
    git filter-branch --prune-empty --subdirectory-filter docroot/sites HEAD

    If you wish to keep other branches, check them out as local branches before removing the remote in step 3, then use -- --all instead of HEAD

  6. Finally, to reclaim the space, we need to delete the backup refs that filter-branch leaves under refs/original, expire the reflogs and garbage-collect the repository
    git reset --hard
    git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
    git reflog expire --expire=now --all
    git gc --aggressive --prune=now
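
The six steps above can be strung together into a single function. This is a sketch under assumptions, not the exact script we used: the name shrink_repo and its three arguments (source repository, destination directory, subdirectory to keep) are hypothetical. FILTER_BRANCH_SQUELCH_WARNING is set because recent versions of git otherwise print a warning and pause before filter-branch starts.

```shell
# Sketch: clone a repository and reduce it to the history of one subdirectory.
# shrink_repo and its arguments are hypothetical names for illustration.
shrink_repo() {
    src=$1      # URL or path of the original repository
    dest=$2     # directory to clone into
    subdir=$3   # subdirectory to keep, e.g. docroot/sites

    git clone "$src" "$dest" || return 1
    (
        cd "$dest" || exit 1

        # Step 3: detach from the remote so nothing can be pushed back
        git remote remove origin || exit 1

        # Step 4: drop all tags (the loop is a no-op when there are none)
        git tag -l | while read -r tag; do git tag -d "$tag"; done

        # Step 5: keep only the history of the chosen subdirectory
        FILTER_BRANCH_SQUELCH_WARNING=1 \
            git filter-branch --prune-empty --subdirectory-filter "$subdir" HEAD || exit 1

        # Step 6: remove the backup refs, expire reflogs and repack
        git reset --hard
        git for-each-ref --format='%(refname)' refs/original/ |
            while read -r ref; do git update-ref -d "$ref"; done
        git reflog expire --expire=now --all
        git gc --aggressive --prune=now
    )
}
```

A call would look like shrink_repo <repository-url> mynewrepo docroot/sites. The work happens in a subshell, so the caller's working directory is left alone, and the original repository is never modified.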

At the end of this process, the sites folder was whittled down to 113M, a vast improvement, but we were not done there.
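
To compare sizes before and after, one option is to ask git itself how much space its objects take up. The helper below is a rough sketch with a made-up name, repo_size_kib; it adds up the loose and packed object sizes reported by git count-objects, so the figure will not match a compressed archive of the whole directory exactly.

```shell
# Sketch: print the size of a repository's object store in KiB.
# repo_size_kib is a hypothetical helper name; $1 is the repository path.
repo_size_kib() {
    git -C "$1" count-objects -v |
        awk '/^size(-pack)?:/ { total += $2 } END { print total + 0 }'
}
```

Running repo_size_kib on the original clone and again on the filtered repository gives two numbers that can be compared directly.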


Morlene Fisher is an experienced Digital Consultant based in London, UK. She assists clients with finding the best digital solutions for their organisations and brings together many expert digital consultants to provide a service you can rely on.