Reproducible git bundles

December 2023

As part of a backup script for my entire system I wanted to back up all my git repositories. Some of them don't have a remote anywhere, so I cannot rely on implicit backups in The Cloud. The naive solution of simply backing up the entire file-system tree is undesirable since it would clutter the backup with useless build artifacts. One alternative is to create a fresh clone (with --mirror), but that typically consists of many small files, which isn't ideal for backups either. Conveniently, git can create a single-file archive of a repository via git bundle create, which appears to be the perfect fit.
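
For illustration, creating and later restoring such a bundle backup looks roughly like this (the paths are made up):

$ # a single file containing every ref of the repository
$ git bundle create /backup/myrepo.bundle --all
$ # a bundle can serve as a clone source, so restoring is just a clone
$ git clone /backup/myrepo.bundle myrepo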

After a few test runs of my backup I noticed that a small but fixed subset of repositories was getting backed up again despite not having changed at all. That is odd: repeatedly bundling the same repository state should surely produce the exact same bundle. However:

$ git bundle create -q /tmp/a --all
$ git bundle create -q /tmp/b --all
$ md5sum /tmp/a /tmp/b
44891a87b08b75c1b518889fcba73204  /tmp/a
a37a81710313cdbb5f065ddb2c797630  /tmp/b

Huh?

It turns out that for some repositories bundling is nondeterministic. After browsing some vaguely related Stack Overflow answers and wading through several AI hallucinations, I decided to dig into the bundles myself to see where the differences are. Bundling reorganizes all git objects into a single pack, an internal git data structure that aggregates many objects into one file and applies some optimizations like delta compression along the way.
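
If you want to dig in yourself, plain git suffices. Here is a sketch of one way to get at the pack behind a bundle (the /tmp/repo-a path is made up): cloning from the bundle materializes its pack, plus the index that git verify-pack needs, on disk.

$ # the refs recorded in the bundle
$ git bundle list-heads /tmp/a
$ # cloning from a bundle writes its pack (and an index) into the clone
$ git clone --mirror -q /tmp/a /tmp/repo-a
$ git verify-pack -v /tmp/repo-a/objects/pack/pack-*.idx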

Both bundles contained a single pack in my case, but the two packs had noticeably different structure, as revealed by git verify-pack -v! At this point, having stared at plenty of git bundle create output, I had a suspicion:

Enumerating objects: 733, done.
Counting objects: 100% (733/733), done.
Delta compression using up to 8 threads
Compressing objects: 100% (598/598), done.
Writing objects: 100% (733/733), 97.15 KiB | 6.48 MiB/s, done.
Total 733 (delta 419), reused 0 (delta 0), pack-reused 0

What caught my eye was "Delta compression using up to 8 threads": parallelism is a classic source of inherent (as opposed to accidental) nondeterminism. Testing this hypothesis should be quite simple. Having browsed my fair share of git manpages in the past I knew that many commands have a --threads option, but git bundle does not. Luckily, any git subcommand can be run with ad-hoc configuration options via -c, and the relevant option here turns out to be pack.threads.

$ for i in $(seq 1 100); do \
> git -c 'pack.threads=1' bundle create -q /tmp/bundle-$i --all; \
> done
$ md5sum /tmp/bundle-* | cut -f 1 -d ' ' | uniq -c
    100 4898971d4d3b8ddd59022d28c467ffca

Success!

Forcing git to be single-threaded makes the output deterministic. In case you are wondering: using one thread instead of eight has no discernible performance impact on my personal repositories, so I'm very happy to trade a little speed for reproducibility.
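
For reference, the backup step then boils down to something like the following sketch (the repository layout and target directory are made up); alternatively, pack.threads can be set once and for all via git config:

$ # bundle each repository into a single file, deterministically
$ for repo in ~/src/*; do \
>   git -C "$repo" -c 'pack.threads=1' \
>     bundle create -q "/backup/$(basename "$repo").bundle" --all; \
> done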

