Backup procedures based on the tar program have many advantages: they are very simple to implement on every operating system and the single files they produce can be moved from one computer to another in lots of different ways, including IRC and email. Tar archives are perfect when you don’t care about which directories some files were in, or how much time passed between one revision and the other.
Another tool, called rsync, creates backups of a different nature. Instead of single files, with rsync you get complete, 100% faithful copies of one or more directory trees. Live snapshots of this kind are much better than tar archives when you want to be able to recover quickly from a serious disaster (be it a disk crash or files deleted by mistake), or need to track, with the smallest possible effort, the evolution over time of all your files. With rsync you get the equivalent of a time machine able to bring you back to the moment just before you erased that file by mistake, or to when all the scattered notes of your thesis were in the state that made you draw some conclusion.
There are plenty of rsync scripts available online (see below). What you’ll find here is why and when to use rsync instead of another backup method, plus some explanations to quickly understand what those scripts do and adapt them to your needs.
Main features of rsync
If all rsync were able to do was to copy files or whole folders from one place to another, on the same or on a remote computer, there would be no real reason for its existence. This software, however, has several unique features that make it worth a try. The first one is what you may call diff-only copying: only the pieces of a file that actually changed are transferred, rather than the whole file. This and other tricks like on-the-fly compression make incremental backups or mirroring much faster, while the use of ssh for encryption keeps your files private. The best part is that (just like with tar) you get all this by always using one simple command inside automatic scripts which:
work the same no matter which Linux distribution or desktop you prefer to use.
are very efficient for incremental backups on remote computers.
back up in the same way both to an external disk just plugged into your home computer and to some remote server.
Basic usage of rsync
Rsync is really easy to use. No matter how complicated an rsync command looks, it always has the same basic structure: "rsync OPTIONS SOURCE DESTINATION". That’s it, really. All the rsync scripts you’ll find online are just applications of some variation of this one command.
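As a minimal sketch of this structure, the following copies one local folder into another. The folder and file names are invented for illustration and are created in a temporary directory, so nothing on your real disk is touched.

```shell
#!/bin/sh
# Hypothetical folders, created under a temporary directory for this demo.
work=$(mktemp -d)
mkdir -p "$work/documents/thesis"
echo "chapter one" > "$work/documents/thesis/notes.txt"

# rsync OPTIONS SOURCE DESTINATION
# The trailing slash on SOURCE means "the contents of documents",
# not the documents folder itself.
rsync -a "$work/documents/" "$work/backup/"

# Show what ended up in DESTINATION.
ls -R "$work/backup"
```

Without the trailing slash on SOURCE, rsync would create a documents folder inside backup instead of copying its contents directly.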
We’ll deal with OPTIONS in a moment. SOURCE and DESTINATION can be either two folders on the local computer or a local one and a remote one. A remote SOURCE or DESTINATION has the format user@computer:/path. You can specify which protocol to use for the file transfers by using either one colon (SSH) or two colons (rsync's own protocol, served by an rsync daemon) after the remote host name or IP address. The SOURCE can also have one of these two forms, if you only want to transfer some files or directories:
"/home/marco/pictures /home/marco/documents"
find /home -name "*.t2t"
You can either list specific files or folders (first example) or give a command (second example) to run on the computer where the SOURCE is, in order to generate the list of files to be transferred before the actual transmission starts.
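The one-colon and two-colon forms described above can be sketched as follows. The account, host name, module name and folders are all hypothetical, and the commands are only printed rather than executed, since running them needs a reachable remote computer.

```shell
# Hypothetical account, host, module and folder names. The commands are only
# printed, not executed, because they need a reachable remote computer.
remote="marco@backup.example.com"
src="/home/marco/documents/"

# One colon: the transfer runs through a remote shell (ssh).
ssh_cmd="rsync -av $src ${remote}:/srv/backup/documents/"

# Two colons: the transfer talks to an rsync daemon. What follows the colons
# starts with a module name defined on the server, not an absolute path.
daemon_cmd="rsync -av $src ${remote}::backups/documents/"

# Several local sources can simply be listed before the DESTINATION.
multi_cmd="rsync -av /home/marco/pictures /home/marco/documents ${remote}:/srv/backup/"

printf '%s\n' "$ssh_cmd" "$daemon_cmd" "$multi_cmd"
```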
Most important rsync options
-a: stands for "archive mode". This option makes rsync work recursively, preserving symbolic links and most
metadata (owner, group, permissions, modification date...).
-e: specify the remote shell to use; "-e ssh" runs the transfer over an encrypted ssh connection
-n: perform a dry run, just to show what WOULD be done, without changing anything
-v: show which files are transferred or updated
-x: tells rsync to not cross partition boundaries.
-z: compress the data to use less bandwidth over the network.
--progress: display a progress meter to show how much time the backup will take.
--delete: delete any files in DESTINATION that are not in SOURCE anymore, to keep the
two directories perfectly equal
--delete-after: like --delete, but remove those files from DESTINATION only after the transfer has finished
--exclude-from="list_of_files_to_exclude": exclude from the backup all the files in
that list (you can even specify patterns like "*~" to exclude all files ending with a ~ character)
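To see how several of these options combine, here is a small sketch that runs entirely inside a temporary directory. All file names are made up for the demo; the exclude list contains the single pattern "*~" mentioned above.

```shell
# Hypothetical files, created under a temporary directory for this demo.
work=$(mktemp -d)
mkdir -p "$work/src" "$work/dest"
echo "keep me"       > "$work/src/report.txt"
echo "editor backup" > "$work/src/report.txt~"
echo "stale"         > "$work/dest/old.txt"     # exists only in DESTINATION

# One pattern per line; "*~" matches files ending with a ~ character.
printf '%s\n' '*~' > "$work/exclude.list"

# Dry run first: -n lists what WOULD happen without changing anything.
rsync -avn --delete --exclude-from="$work/exclude.list" "$work/src/" "$work/dest/"

# The same command without -n performs the backup: report.txt is copied,
# report.txt~ is skipped and old.txt is deleted from DESTINATION.
rsync -av --delete --exclude-from="$work/exclude.list" "$work/src/" "$work/dest/"
```

Running the dry run before any command that contains --delete is a cheap way to check that SOURCE and DESTINATION are not swapped.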
What are inodes and hard links?
Computer files are blobs of bytes that, in general, have no internally written name and are attached to so-called inodes of the file system. At least on Unix, file names are something external to the actual files. They are just pointers, called “hard links”, to the inode that hosts that particular blob of bytes. What is interesting here is that the same file can have many hard links, that is many names, as long as they all live on the same file system. All the hard links to the same file are functionally equivalent, regardless of their position in the filesystem tree. The contents of a file are only stored once, so you don’t use twice the space. If a file has two hard links corresponding to the filenames A and B and you change the content, owner or access permissions of A, you’re changing B in the same way, and vice versa.
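The behaviour of the two names A and B can be tried out directly with the standard ln command. This is only a small demonstration in a temporary directory, not something rsync itself requires you to do.

```shell
# Work in a throwaway directory so no real files are touched.
work=$(mktemp -d)

echo "first draft" > "$work/A"
ln "$work/A" "$work/B"            # B is a second name (hard link) for the same inode

# Both lines show the same inode number and a link count of 2.
ls -li "$work/A" "$work/B"

echo "second draft" > "$work/A"   # change the content through one name...
cat "$work/B"                     # ...and the other name shows the new content

rm "$work/A"                      # removing one name does not remove the data
cat "$work/B"                     # still prints "second draft"
```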
How rsync uses hard links
Imagine having an rsync script that automatically makes a complete backup of your whole hard disk on another drive four times a day. Such scripts are perfect for disaster recovery: if you delete a file or a whole folder by mistake, or if your motherboard fries while you’re finishing an important report (it happened to me…) you have lost, in the worst case, only the last six hours of work. The same scripts are also perfect when you need or want to know, for whatever reason, what all your files looked like last Friday at 9am.
The obvious problem with backups of this kind is that they can consume huge amounts of space and time: you’d make a new copy of every file every few hours, even of files that have not changed in months.
Rsync, however, has an option called --link-dest, created just to solve this problem. When you use it, rsync never makes extra copies of the same file. If rsync has already backed up a file, every time it finds that file unchanged and the --link-dest option is active, it only creates in the DESTINATION folder a new hard link to the first and only full copy of that file. Therefore, you’ll get a bunch of folders called something like Friday_6am, Friday_12pm, Friday_6pm… that all look as if they contain a copy of each file, but in practice consume very little disk space beyond one full backup!
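A minimal sketch of two consecutive snapshots, assuming a throwaway source folder with one file that never changes and one that is edited between runs. The folder names mirror the ones above; everything else is invented for the demo.

```shell
# Hypothetical source folder and snapshot area under a temporary directory.
work=$(mktemp -d)
mkdir -p "$work/src" "$work/snapshots"
echo "never changes" > "$work/src/stable.txt"
echo "version 1"     > "$work/src/draft.txt"

# First snapshot: a full copy.
rsync -a "$work/src/" "$work/snapshots/Friday_6am/"

# Edit one file before the next run.
echo "version 2, with more text" > "$work/src/draft.txt"

# Second snapshot: files unchanged since Friday_6am become hard links to the
# copies stored there; only draft.txt is stored again.
rsync -a --link-dest="$work/snapshots/Friday_6am" \
      "$work/src/" "$work/snapshots/Friday_12pm/"

# stable.txt shows the same inode number in both snapshots.
ls -li "$work"/snapshots/*/stable.txt
du -sh "$work"/snapshots/*
```

The path given to --link-dest is absolute here on purpose: a relative path would be interpreted relative to the DESTINATION folder, which is an easy mistake to make in scripts.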
To know more…
As I said, the main purpose of this page is to give enough information to help you quickly understand the way rsync works and when it can be useful for you. Now you should be able to adapt to your needs any of the fine scripts contained in these pages: