
Incremental backups of Gmail takeouts

December 2025

In an earlier writeup I discussed how to create reproducible bundles of git repositories so that a file-based backup strategy can operate incrementally. My next target in this vein is Gmail Takeout: your Google account can be locked for arbitrary reasons, legitimate or otherwise, so it is imperative to keep a regular backup of your mail. Google's Takeout service is a straightforward way to achieve this. In my case the account holds about 20 years of mail history, going all the way back to the invite-only beta. Surprisingly, that amounts to only 5.7GiB, with attachments of course being the driving factor, all delivered in a single text-based mbox file.

This is completely fine for a one-time snapshot, but if you want to back this file up regularly with something like restic, you will quickly end up in a world of pain: since new mails are not even appended to the end of the file, each takeout-then-backup cycle essentially produces a brand-new giant file. It would be nice if incremental backups only added the delta of mails that are actually new since the last backup.

I considered (and actually implemented) several solutions to this problem. In one approach, I parsed the entire file and stripped out the attachments, storing them as separate files and leaving only a link in the corresponding mail. This works reasonably well because attachments account for the overwhelming majority of the data. However, I was not particularly happy with this solution: parsing the file correctly is not trivial and resulted in a lot of complex code. The mail format is very forgiving, so you end up with many special cases around the peculiar behavior of mail clients. To give you an idea, consider that there is no length encoding; it is all multipart boundaries, and those can be nested, too. Actual attachment data also comes in a variety of encodings, and even file names are encoded in several micro-formats. In the end my parser arrived at the correct number of mails compared to the Gmail interface (after accounting for the threaded view), but the complexity of that code felt wrong.
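In Python terms, a stripped-down sketch of that idea, using the standard mailbox and email modules and glossing over all of the edge cases just mentioned, could look like this (my actual parser had to handle far more):

    import hashlib
    import mailbox
    from pathlib import Path

    def strip_attachments(mbox_path, out_dir):
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        stripped = mailbox.mbox(str(out_dir / "stripped.mbox"))
        for message in mailbox.mbox(mbox_path):
            for part in message.walk():
                if part.get_content_disposition() != "attachment":
                    continue
                payload = part.get_payload(decode=True)  # undoes base64 etc.
                if payload is None:
                    continue
                digest = hashlib.md5(payload).hexdigest()
                (out_dir / digest).write_bytes(payload)
                # Leave only a pointer behind in the mail itself.
                part.set_payload("[attachment stored as %s]" % digest)
            stripped.add(message)
        stripped.flush()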

What I eventually settled on instead is a simple chunking heuristic based on the "From ..." line in front of every mail. The catch is that such a line can also appear in the body of a mail. This results in slight oversplitting: every mail boundary is also a chunk boundary, but not every chunk boundary is a mail boundary. In other words, one mail may be partitioned into multiple chunks. Each chunk is then saved as a file, content-addressed by its MD5 sum. Content addressing makes the approach resistant to mail reordering in the mbox file. We could have used the Gmail message ID for this purpose, but the uniform distribution of the content hash makes it easy to create well-distributed subdirectories so that no single directory contains too many files. Finally, to ensure recovery of the original mbox file, we record the sequence of chunks as encountered. With this we satisfy the requirement that new mails only add new chunks plus a new chunk sequence, the latter being fairly negligible in size.
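A condensed sketch of the scheme (the store layout and file names here are illustrative, not necessarily those of the actual implementation):

    import hashlib
    from pathlib import Path

    def chunk_mbox(mbox_path, store_dir):
        store = Path(store_dir)
        store.mkdir(parents=True, exist_ok=True)
        sequence = []   # chunk order, needed to rebuild the mbox
        pending = []    # lines of the chunk currently being collected

        def flush():
            if not pending:
                return
            data = b"".join(pending)
            pending.clear()
            digest = hashlib.md5(data).hexdigest()
            path = store / digest[:2] / digest   # fan out by hash prefix
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
            sequence.append(digest)

        with open(mbox_path, "rb") as f:
            for line in f:
                if line.startswith(b"From "):
                    flush()   # every mail boundary becomes a chunk boundary
                pending.append(line)
            flush()

        # Record the chunk order; this small file changes on every backup.
        (store / "sequence").write_text("\n".join(sequence))

    def restore_mbox(store_dir, out_path):
        store = Path(store_dir)
        with open(out_path, "wb") as out:
            for digest in (store / "sequence").read_text().splitlines():
                out.write((store / digest[:2] / digest).read_bytes())

Note that restoring the original file is a straight concatenation of the chunks in recorded order; no parsing is needed on the way back.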

With my low-traffic Gmail account, I end up with about 99.8K chunks (roughly one per mail) from 50.6K threads. This is tolerable for me, but I can see bigger accounts having 10× or 100× as many mails, at which point the number of chunks may become a concern from a file-system perspective. One mitigation would be to reduce the chunking frequency by adding an arbitrary extra condition, e.g., only splitting when the hash of the "From " line is even, as sketched below.
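In terms of the chunker above, the boundary test would gain one extra check (the even/odd choice is just an example):

    import hashlib

    # Only treat a "From " line as a chunk boundary when the first byte
    # of its hash is even, halving the expected number of chunks.
    def is_boundary(line):
        return line.startswith(b"From ") and hashlib.md5(line).digest()[0] % 2 == 0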

You can find the implementation on GitHub.

