Minimal System Backups with rdiff-backup and Yum

I'm falling in love with rdiff-backup. This tool gives you the best of both incremental and mirror backups, uses rsync/rdiff libraries to increment modifications to files, and best of all is written in Python. rdiff-backup has made backups sexy again (again?). I'm just completely all about backups now. Not so much for functional reasons—I've only had to go to the backups once or twice to restore something, and it was a beautiful experience—but because the tool just rocks and makes me want to back stuff up. If Ben Escoto (primary author, rdiff-backup) could hack together some tax return software I might not have to deal with an audit this year.

I could wax lyrical about rdiff-backup for a couple of pages, but to make a long story short, it has inspired me to believe that backups are an old concept that still has lots of room for innovation and can be fun to code for.

I'm already in a serious love affair with Yum and have been trying to contribute to the project in any way I can. This tool filled a huge gap in the Red Hat offering by providing a customizable package management system similar to the apt-get tool Debian users have been enjoying for years. Yum is extremely easy to use, to the point that you might think it's light on functionality. No, it's not like that. Yum is the Bruce Lee of system utilities: it looks like it would be no match for, say, Kareem Abdul-Jabbar, and then you get the flex, and it pulls a full distro upgrade with a single command (yum upgrade).

I should note here that apt-get can be used on RPM-based systems as well, but I am apt-get ignorant and will be referring to Yum exclusively for the rest of this entry. It is entirely possible that apt-get or even up2date could provide the functionality needed to restore a system from RPM.

Handling RPM Managed Files

More pertinent to the topic at hand is the fact that Yum takes care of pulling RPMs from remote repositories, working out dependencies, and performing installations given a list of package names. What this means to the backup artist is that in order to restore a wrecked system to its previous set of packages, you need only back up your yum config file (usually /etc/yum.conf), which contains your repository configuration, along with the names of all packages installed on your system. We can get the list of packages easily enough:

$ rpm -q --all > list-of-packages

Once we have our yum.conf and list-of-packages backed up, a fresh machine with just the minimum requirements to run Yum should be able to restore to its previous state with something like the following:

$ cat list-of-packages | xargs yum -y install
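
The save and restore steps above can be wrapped up in a rough shell sketch. The function names, the RPM, YUM and YUM_CONF variables, and the layout of the state directory are all hypothetical and exist only for illustration; the rpm and yum options themselves are real. This version records bare package names via --queryformat, so the repositories decide which versions get installed, and it skips the gpg-pubkey entries that rpm lists for imported keys, since those are not installable packages.

```shell
# Hypothetical knobs: override these to dry-run the sketch without touching a real system.
RPM=${RPM:-rpm}
YUM=${YUM:-yum}
YUM_CONF=${YUM_CONF:-/etc/yum.conf}

# Save the yum config and the names of all installed packages into directory $1.
save_package_state() {
    state_dir=$1
    mkdir -p "$state_dir" || return 1
    cp "$YUM_CONF" "$state_dir/yum.conf" || return 1
    # One bare package name per line; gpg-pubkey entries are imported keys, not packages.
    "$RPM" -qa --queryformat '%{NAME}\n' |
        grep -v -x 'gpg-pubkey' |
        sort -u > "$state_dir/list-of-packages"
}

# Reinstall everything named in $1/list-of-packages using the saved yum config.
restore_package_state() {
    state_dir=$1
    # -c selects the saved config; -y answers yum's confirmation prompt for us.
    xargs "$YUM" -y -c "$state_dir/yum.conf" install < "$state_dir/list-of-packages"
}
```

Something like save_package_state /var/backups/pkgstate before the backup run, and restore_package_state /var/backups/pkgstate on the fresh machine, would be the intended use. On a multi-arch system you would probably want '%{NAME}.%{ARCH}\n' as the query format instead.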

You have to swish the concept around in your head a little bit, but you can think of the many RPM repositories that are getting thrown up nowadays as shared backups for common stuff. This greatly reduces the number of files that need to be backed up by each person, because they can always be obtained from a publicly available source.

What about Unmanaged Files?

So we can back up every single RPM managed file (that hasn't been modified) with a very small footprint. Now we get into the trickier part, which is backing up all the stuff RPM either doesn't manage or manages but determines has been modified. We will use rdiff-backup to increment the files for all the benefits stated previously, but there are a few things missing from the current rdiff-backup/yum toolset.
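
For the managed-but-modified half, rpm's verify mode (rpm -Va) already does the detection: it prints one line per file that differs from the package database, led by a string of attribute flags in which a 5 means the file's digest no longer matches. A minimal sketch of pulling the changed paths out of that output might look like this. The function name modified_paths is hypothetical, and the pattern assumes the usual layout of flag string, optional one-letter marker (such as c for config files), then path.

```shell
# Read `rpm -Va` style output on stdin and print the paths whose contents
# differ from what the package installed (a "5" in the attribute flags).
# Lines reporting only other changes (for example mtime) or "missing" are skipped.
modified_paths() {
    sed -n 's|^[^ ]*5[^ ]*  *\([cdglr]  *\)\{0,1\}\(/.*\)$|\2|p'
}
```

Usage would be along the lines of rpm -Va | modified_paths > modified-files, with the resulting list handed to the backup tool as forced inclusions.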

Design Goals

What we need is a library that, given a directory, will tell us which files and sub-directories are unmanaged or managed-but-modified. This tool should be able to utilize some kind of database or cache of file information (possibly slocate's database or rpm's).
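
As a rough illustration of the unmanaged half of that library, here is a sketch that uses rpm's own database as the cache: dump every path rpm knows about once, then subtract that list from what find sees under a directory. The function names and the RPM override variable are hypothetical, and a real implementation would want something faster than re-sorting on every run.

```shell
# Print every path owned by an installed package, one per line, sorted for comm.
# Lines such as "(contains no files)" are dropped by keeping only absolute paths.
build_managed_list() {
    "${RPM:-rpm}" -qa --list | grep '^/' | LC_ALL=C sort -u
}

# Print regular files under directory $1 that are absent from the managed-path
# list in file $2. The list must be sorted with LC_ALL=C, as build_managed_list does.
list_unmanaged() {
    find "$1" -type f | LC_ALL=C sort | LC_ALL=C comm -23 - "$2"
}
```

For example, build_managed_list > /var/cache/managed-files followed by list_unmanaged /etc /var/cache/managed-files would list the files under /etc that no package owns. The cache path here is just a placeholder.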

Once we have a mechanism for determining what files are unmanaged, a nice framework for defining backup sets should be provided. I'm thinking something along the lines of having an /etc/backup.d directory that would have config files for each backup set. The config files would specify the root directory to back up, where the backups should be stored, and a list of additional exclusions and/or forced inclusions relative to the backup directory.

# sample backup config file
[info]
name=home directories
source=/home
dest=root@backup-host:/backups/$hostname/home
backuptype=rdiff-backup
frequency=1d                # how often should we back up?
retain=5d                   # how long should increments be
                            # kept around?

[files]
- .phoenix/**/Cache         # exclude mozilla firebird cache
- .Xauthority               # we don't need that either..
+ .Xresources               # forcibly include .Xresources

A couple of concepts to point out here. Source and dest are pretty self-explanatory. backuptype would allow other backup tools to be used in place of rdiff-backup. Maybe we just want to straight rsync the files, or maybe somebody feels that incremental tarballs are still relevant <g>. Lastly, the files section contains a list of files to include or exclude. I really like rdiff-backup's file selection syntax; it is simple and powerful.
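
To show that a config file in this shape is easy to consume, here is a small sketch of two reader functions. Nothing here is an existing tool: the names config_get and config_filelist are hypothetical, and the parsing rules (trailing "# ..." comments are stripped, only lines starting with "+ " or "- " count in the files section) are my assumptions about how the format above would behave.

```shell
# Print the value of key $2 from the [info] section of config file $1,
# with any trailing "# ..." comment and padding removed.
config_get() {
    awk -v key="$2" '
        /^\[/ { in_info = ($0 ~ /^\[info\]/); next }
        in_info && index($0, key "=") == 1 {
            value = substr($0, length(key) + 2)
            sub(/[ \t]*#.*$/, "", value)
            sub(/[ \t]+$/, "", value)
            print value
        }' "$1"
}

# Print the include/exclude lines of the [files] section, comments stripped.
config_filelist() {
    awk '
        /^\[/ { in_files = ($0 ~ /^\[files\]/); next }
        in_files && /^[+-] / {
            sub(/[ \t]*#.*$/, "")
            print
        }' "$1"
}
```

With the sample file saved as /etc/backup.d/home.conf, config_get /etc/backup.d/home.conf source would print /home. The output of config_filelist is already close to what rdiff-backup's --include-globbing-filelist option reads, although a real tool would still have to turn the relative patterns into paths under the source directory before handing them over.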

There should also be a sane default backup set containing all unmanaged/modified files on the system minus cruft like /var/{run,lock}, cache directories, and anything else that doesn't have backup value. The only aspect that should require configuration is the destination of the backup. Reasonable defaults for all other aspects should be provided.

Summary

A Minimal Backup System would provide a simple-as-in-easy-to-configure yet powerful tool that could act as an almost turnkey backup solution for most small-scale GNU/Linux installations that use RPM for package management. A positive side effect of having RPM-aware backups is that it further promotes the use of RPM to package common/unchanging files: the more that is managed by RPM, the less space is required for backups.