SVN to Git Migration with history

FancyFennec
8 min read · Feb 3, 2023

How to migrate from Subversion to Git and keep the history

I recently had to fight through a migration from SVN to Git. Since it was quite painful and I dislike the idea of others having to suffer through the same pain, I thought that it might be a good idea to share what I did.

Here are the basic steps of what I had to do:

  • Export SVN project to Git by using git-svn
  • Continue updating export until day X
  • Clean up the repository and push it to the remote

While none of the steps above seems very complicated, there are some pitfalls you can easily run into.

Exporting SVN Project to Git

Short version: You just use git-svn. The documentation is here. And here is the command that I ended up using:

git svn clone svn://svn.company.com/svnrepos --preserve-empty-dirs --prefix=svn/ --include-paths="^project1/trunk|^project2/trunk|^some_root/trunk/project3|^archive/project2/trunk|^archive/some_root/trunk/project3" --authors-file authors-transform.txt <export_name>

Now I will explain why this is what I ended up with, and I do hope that you will not have to do the same.
One more thing to mention: I tried to do this on both Linux and Windows and came to the conclusion that it is not feasible to migrate a large project on Windows, because Windows occasionally locks files. The command above took roughly a day to execute, and I ran into issues on Windows every other hour. If you can, do it on Linux. If you have to use Windows, I wish you the best of luck. Maybe you won’t have the same issues.

Export all authors from the svn project

Before we can use git-svn, we need to extract the authors from SVN. SVN only saves the username of the author who committed a change. In Git, you have to provide a name as well as an email address. Therefore, you need to provide a mapping from each SVN username to a name and email address. In other words, we have to give git-svn an authors file, which is passed through the --authors-file option.

git svn clone svn://svn.company.com/svnrepos --authors-file authors-transform.txt <export_name>

The file that we have to create looks like this:

name1 = name1 <email1@company.com>
name2 = name2 <email2@company.com>
name3 = name3 <email3@company.com>
name4 = name4 <email4@company.com>

To do that, you can first export the authors from your SVN workspace on the command line:

svn log --quiet | grep "^r" | awk '{print $3}' | sort | uniq

This just generates a list of names that we now need to expand.

name1
name2
name3
name4

We somehow need to get the email addresses of all of those people. Depending on the size of your company and how well it is organised, this will be more or less painful.
In my case, the company was not well organised, and I ended up gathering the data from different sources.
Anyway, here is what you will most likely have to do:
Export the users from Outlook (or whatever your company uses) and write a script that generates the authors-transform file. I ended up writing a Python script.
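To give an idea of what such a script can look like, here is a minimal sketch. It is not the script I used. It assumes that the user export is a CSV with the columns username, name and email (your export will probably look different), and the function name and the fallback domain are made up for illustration. Usernames that are missing from the export get a placeholder address, because git-svn aborts the clone when it meets an author that is not in the file.

```python
import csv
import io


def build_authors_file(usernames, users_csv, fallback_domain="company.com"):
    """Return the text of an authors file for git-svn.

    usernames: SVN usernames, e.g. the output of the svn log pipeline.
    users_csv: CSV text with the (assumed) columns username,name,email.
    """
    directory = {}
    for row in csv.DictReader(io.StringIO(users_csv)):
        key = row["username"].strip().lower()
        directory[key] = (row["name"].strip(), row["email"].strip())

    lines = []
    # De-duplicate and sort so the file is stable between runs.
    for user in sorted({u.strip() for u in usernames if u.strip()}):
        # Unknown users fall back to "<user> <user@fallback_domain>".
        name, email = directory.get(user.lower(), (user, f"{user}@{fallback_domain}"))
        lines.append(f"{user} = {name} <{email}>")
    return "\n".join(lines) + "\n"


# Made-up sample data, only to show the shape of the input.
users_csv = """username,name,email
name1,Ada Example,ada.example@company.com
name2,Bob Example,bob.example@company.com
"""
print(build_authors_file(["name2", "name1", "name3", "name1"], users_csv), end="")
```

The printed text can be written to authors-transform.txt as is. It is worth reviewing the placeholder entries by hand before the real export.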

Deal with empty directories

SVN keeps track of empty directories. Git does not. You can preserve them by passing --preserve-empty-dirs to git-svn, which creates a placeholder file in each empty directory (the default is a .gitignore). You will still have to figure out how to deal with the empty directories afterwards. But the fact that your project requires empty directories is probably an indication that you have a messy setup and that you should fix it.

git svn clone svn://svn.company.com/svnrepos --preserve-empty-dirs --authors-file authors-transform.txt <export_name>

Btw, you can then identify empty directories in your SVN workspace (assuming that the workspace is clean) with

find . -type d -empty
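If you would rather do this from a script, the same check can be sketched in Python. This version differs from the find command in one deliberate way, which is my own assumption about what is useful: it ignores SVN's administrative .svn directories, so a directory that contains nothing but .svn is reported as empty too. The function name is made up.

```python
import os
import tempfile


def empty_directories(root):
    """Yield directories below root that contain nothing, ignoring .svn."""
    for current, dirs, files in os.walk(root):
        # Prune SVN's administrative directory so it is neither visited nor counted.
        dirs[:] = [d for d in dirs if d != ".svn"]
        if not dirs and not files:
            yield os.path.relpath(current, root)


# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as workspace:
    os.makedirs(os.path.join(workspace, ".svn", "tmp"))
    os.makedirs(os.path.join(workspace, "src", "generated"))
    open(os.path.join(workspace, "src", "main.txt"), "w").close()
    for path in sorted(empty_directories(workspace)):
        print(path)
```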

Losing SVN history on exporting

I sadly identified an issue where git-svn did not preserve the history during the export. This occurs when certain folders in SVN have been deleted. From my understanding, git-svn tries to recreate paths, and when it is not able to, it just fails.

Disclaimer: I am not sure whether what I am going to present here is the best way to deal with git-svn not preserving the history.

The problem seems to have something to do with deleted directories, and to some degree I was able to reproduce it. Consider the following SVN project structure:

├── project1
│   └── trunk
│       └── content1
├── project2
│   └── trunk
│       └── content2
├── some_root
│   └── trunk
│       └── project3
│           ├── content3
│           └── content4
└── archive
└── archive

We wanted to have all content under project1, and we were only interested in moving its trunk to Git. Furthermore, we wanted to archive the old structure. This is what our SVN looked like after the restructuring:

├── project1
│   └── trunk
│       ├── content1
│       ├── content2
│       ├── content3
│       └── content4
└── archive
    ├── project2
    │   └── trunk
    │       └── content2
    └── some_root
        └── trunk
            └── project3
                ├── content3
                └── content4

My first attempt was to simply export project1’s trunk:

git svn clone svn://svn.company.com/svnrepos/project1/trunk --authors-file authors-transform.txt <export_name>

This export runs through just fine, but when we looked at the history, we noticed that all commits from before the restructuring were missing.
It seems as if the problem lies in git-svn not being able to reconstruct paths outside of project1, meaning that all history that lives outside project1 just gets lost.
So my first idea was to simply add the archived projects to the export. You can do so with the --include-paths parameter.
This is what my next failed attempt looked like:

git svn clone svn://svn.company.com/svnrepos --preserve-empty-dirs --prefix=svn/ --include-paths="^project1/trunk|^archive/project2/trunk|^archive/some_root/trunk/project3" --authors-file authors-transform.txt <export_name>

Sadly, this didn’t work at all!
The history still got lost, and now my project had a messed up structure.

Here is how I finally fixed it.
I just added empty directories where the projects originally were…

├── project1
│   └── trunk
│       ├── content1
│       ├── content2
│       ├── content3
│       └── content4
├── project2
│   └── trunk
├── some_root
│   └── trunk
│       └── project3
└── archive
    ├── project2
    │   └── trunk
    │       └── content2
    └── some_root
        └── trunk
            └── project3
                ├── content3
                └── content4

Then I also included those empty paths in the export, and somehow git-svn was able to reconstruct the history.

git svn clone svn://svn.company.com/svnrepos --preserve-empty-dirs --prefix=svn/ --include-paths="^project1/trunk|^project2/trunk|^some_root/trunk/project3|^archive/project2/trunk|^archive/some_root/trunk/project3" --authors-file authors-transform.txt <export_name>

And somehow the entire history is back!
It looks like adding the empty directories allows git-svn to reconstruct the paths, which ultimately fixes the problem.
Remark: I would actually be happy about a better solution. I just threw stuff against the wall and hoped that something would stick.

Remark: We checked whether there were other places with similar gaps in the history. Sadly, this really seems to occur whenever folders were moved in a similar fashion. SVN can reconstruct the history, but it gets lost when you do the export.
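Since a wrong --include-paths pattern costs you a full export, it can be worth checking the pattern against a few paths before starting. Here is a rough sketch for that. git-svn evaluates the pattern as a Perl regular expression; for a simple alternation like this one, Python's re module should behave the same way, but treat it as an approximation. I am also assuming that the paths are matched relative to the repository root without a leading slash, which is what the pattern above suggests.

```python
import re

# The pattern from the final clone command.
INCLUDE_PATHS = (
    "^project1/trunk|^project2/trunk|^some_root/trunk/project3"
    "|^archive/project2/trunk|^archive/some_root/trunk/project3"
)


def is_included(path, pattern=INCLUDE_PATHS):
    """Return True if the path matches the include pattern."""
    return re.search(pattern, path) is not None


for path in [
    "project1/trunk/content1",
    "project2/trunk/content2",
    "archive/some_root/trunk/project3/content4",
    "project1/branches/feature",
    "some_root/trunk/other",
]:
    print(f"{path}: {'included' if is_included(path) else 'skipped'}")
```

Note that the patterns are only anchored at the start, so a path such as project1/trunk2 would match as well.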

Exporting ignored files

You probably want a .gitignore. If you just want all the files that are ignored by SVN, run:

git svn show-ignore > .gitignore

Nice and easy.
But you may want to take a look at it first. Imo, the generated .gitignore is quite ugly and requires some cleanup.
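Part of that cleanup can be automated. The sketch below only does the boring bits: it drops comment lines, blank lines and duplicate patterns while keeping the original order. The sample input is made up and only approximates what show-ignore prints, so check it against your own output. Deciding which patterns still make sense is something you have to do by hand.

```python
def tidy_gitignore(text):
    """Drop comments, blank lines and duplicate patterns, keeping the order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        pattern = line.rstrip()
        if not pattern or pattern.startswith("#") or pattern in seen:
            continue
        seen.add(pattern)
        kept.append(pattern)
    return "\n".join(kept) + "\n"


# Made-up sample that imitates the generated file.
generated = """
# /
/target
/*.iml

# /module/
/module/target
/*.iml
"""
print(tidy_gitignore(generated), end="")
```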

Continue updating the export until day X

The repository that we created is still connected to SVN. That also means that you can continue to update it, while people still use SVN.
To update the repository, run:

git svn fetch
git svn rebase

Repository Cleanup

I identified two main tools that are useful for cleaning up the resulting repository: Git LFS and BFG repo cleaner.

Why do we even want to clean up the repository? In our case, the main issue was repository size. The entire SVN repository was around 25GB.
Since we only exported trunk, the export ended up being only 8GB. This is still unrealistic to work with. I tried it… A simple git log command took ~30mins to finish.

Git LFS

Git Large File Storage (LFS) is an open source extension for Git. It allows us to store files in external storage and refer to them in our repository through pointers. This makes the repository way faster to work with. To enable Git LFS, we first have to install the extension and then set it up with:

git lfs install

Then, for example, if we want LFS to track PSDs, we can run:

git lfs track "*.psd"

If we want to track multiple file types, we can do that in a single command. This runs a bit faster than adding them one at a time:

git lfs track "*.exe" "*.jar" "*.zip" "*.pdf" "*.png" "*.jpg"

This will create a .gitattributes file that we then need to commit:

git add .gitattributes && \
git commit -am "added tracked files to git lfs"

Sadly, this does not rewrite the history: files that were checked in earlier remain in the older commits as regular blobs, and only new commits store them as LFS pointers. But we can now remove them from the history with BFG repo cleaner.

BFG repo cleaner

After many years of development, there is a high chance that things got checked into version control that are not supposed to be there. One advantage that SVN has over Git is that it is less painful to work with large binaries in the repository. This also causes dumb stuff to be checked in. If you want to get a good overview of all the large files that are checked in, you can run the following command (copy pasted like a true hero from Stack Overflow):

git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
sort --numeric-sort --key=2 |
cut -c 1-12,41- |
$(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

You will get a nice human-readable list of all the sins ever committed to your repository.

In our case the funniest ones were the following:

  • Entire Node.js versions
  • Matlab
  • Hundreds of large log files
  • Entire databases
  • Thousands of jars and executables

Imo, none of these belong in version control, and they need to be removed. But sadly, some of these files were needed for the project to run. So we moved the required binaries to LFS and cleaned up the history.
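To decide which file types to hand to LFS and which to delete, it helps to total the sizes per file extension instead of looking at single files. Here is a sketch of how that could be done. It expects the output of the first two stages of the pipeline above (git rev-list piped into git cat-file --batch-check with the same format string). The function name and the sample data are made up.

```python
import os
from collections import Counter


def size_by_extension(batch_check_output):
    """Sum blob sizes per file extension, largest total first.

    Expects lines in the format
    '%(objecttype) %(objectname) %(objectsize) %(rest)'.
    """
    totals = Counter()
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[0] != "blob":
            continue  # skip commits, trees and entries without a path
        _, _, size, path = parts
        extension = os.path.splitext(path)[1].lower() or "(none)"
        totals[extension] += int(size)
    return totals.most_common()


# Made-up sample input with fake object names.
sample = """commit 1111111111111111111111111111111111111111 250
tree 2222222222222222222222222222222222222222 120 lib
blob 3333333333333333333333333333333333333333 5000000 lib/tool.jar
blob 4444444444444444444444444444444444444444 3000000 lib/other.JAR
blob 5555555555555555555555555555555555555555 1200 src/Main.java
blob 6666666666666666666666666666666666666666 700000 logs/server.log
"""
for extension, total in size_by_extension(sample):
    print(f"{extension:>8} {total:>10}")
```

The sizes are the uncompressed blob sizes, so they overstate what the files cost on disk, but the ranking is usually good enough to pick candidates.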

To clean up the repository, I used BFG repo cleaner. I only used two BFG flags, --delete-files and --delete-folders. There are many others, but in our case it was quite obvious which files needed to be deleted.
These are the two commands that I ran:

java -jar bfg.jar --delete-files "{*.exe,*.war,*.jar,*.zip,*.7z,*.dll,*.pdf,*.log,*.xlsm,*.xls,*.png,*.jpg}" <export_name>
java -jar bfg.jar --delete-folders "{matlab,node_modules}" <export_name>
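Long glob lists like these are easy to get wrong, for example with a doubled comma or a missing space. If you run the cleanup from a script anyway, you can build the argument from a list instead. A small sketch follows; the helper names are made up, and nothing is executed here, the function only returns the argument list.

```python
def bfg_glob(patterns):
    """Join patterns into one '{a,b,c}' glob, dropping empties and duplicates."""
    cleaned = []
    for pattern in patterns:
        pattern = pattern.strip()
        if pattern and pattern not in cleaned:
            cleaned.append(pattern)
    return "{" + ",".join(cleaned) + "}"


def bfg_command(flag, patterns, repo, jar="bfg.jar"):
    """Build the argument list for one BFG run."""
    return ["java", "-jar", jar, flag, bfg_glob(patterns), repo]


print(bfg_command("--delete-files", ["*.exe", "*.log", "", "*.exe"], "export"))
print(bfg_command("--delete-folders", ["matlab", "node_modules"], "export"))
```

Passing such a list to subprocess.run also avoids any surprises from shell quoting or brace expansion.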

BFG alone does not reduce the actual size of the repository much. To actually get the size down, we have to expire the reflog and run Git's garbage collection:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

After BFG and git gc, the size of the repository came down to ~800MB, with half of that in LFS. Not perfect, but you can actually work with a repository of this size.

Preparation for Day X

Essentially, I just wrote a script that automates the migration. I did not do a full SVN export over and over. I just continued updating the repository, copied the export, and ran the final migration steps on the copy. The rough outline of the script is the following:

  • Update SVN export
  • Copy SVN export to new location
  • Remove redundant folders
  • Track files with Git LFS
  • Clean up repository with BFG
  • Finishing touches
  • Push clean repository

The script ran every night (or whenever we wanted), and when we identified issues, we fixed them and pushed the fixes to SVN.
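To make the outline a bit more concrete, here is a sketch of how such a script could be organised. It is not my actual script. The directory names, the remote URL and the file patterns are placeholders, and the two project-specific steps (removing redundant folders and the finishing touches) are left out. The sketch only prints the commands as a dry run; pushing with --all is my assumption about what "push" should mean here.

```python
import shlex


def migration_steps(export_dir, work_dir, remote_url, lfs_patterns, delete_globs,
                    bfg_jar="bfg.jar"):
    """Return the nightly run as an ordered list of (description, argv) pairs."""
    git_export = ["git", "-C", export_dir]
    git_work = ["git", "-C", work_dir]
    glob = "{" + ",".join(delete_globs) + "}"
    return [
        ("Update SVN export", git_export + ["svn", "fetch"]),
        ("Update SVN export", git_export + ["svn", "rebase"]),
        # work_dir must not exist yet, otherwise cp copies into it.
        ("Copy SVN export to new location", ["cp", "-a", export_dir, work_dir]),
        # "Remove redundant folders" and "Finishing touches" are project
        # specific and therefore not part of this sketch.
        ("Track files with Git LFS", git_work + ["lfs", "track", *lfs_patterns]),
        ("Track files with Git LFS", git_work + ["add", ".gitattributes"]),
        ("Track files with Git LFS",
         git_work + ["commit", "-m", "added tracked files to git lfs"]),
        ("Clean up repository with BFG",
         ["java", "-jar", bfg_jar, "--delete-files", glob, work_dir]),
        ("Clean up repository with BFG",
         git_work + ["reflog", "expire", "--expire=now", "--all"]),
        ("Clean up repository with BFG",
         git_work + ["gc", "--prune=now", "--aggressive"]),
        ("Push clean repository", git_work + ["push", remote_url, "--all"]),
    ]


steps = migration_steps(
    export_dir="export",
    work_dir="export-clean",
    remote_url="git@git.company.com:team/project1.git",  # hypothetical remote
    lfs_patterns=["*.psd", "*.png"],
    delete_globs=["*.log", "*.zip"],
)
for description, argv in steps:
    # Dry run: print each command instead of executing it.
    print(f"{description}: {shlex.join(argv)}")
```

To actually run the steps, each argv can be passed to subprocess.run(argv, check=True), so that the script stops at the first failing command.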

Good Luck

I hope that this will help others with migrating from SVN to Git.

FancyFennec

I am a Software Developer by day and Game Developer by night.