Check File Copies with Linux Scripts

For several years, I’ve had an original Buffalo LinkStation NAS as my main fileserver. Being on 24×7, it’s gone through several fans and at least one hard disk, but it’s now time to retire it in favour of a LinkStation Duo which will give me more space, RAID capabilities and faster transfer speeds.

Naturally, as my main fileserver, it’s backed up. However when I copy the files to the new Duo, how do I know that they’ve all copied correctly and none have been missed? There are hundreds of thousands of files and checking each one by hand would be pointless.

Linux has lots of tools that tell me how much disk space is used, such as du, df and filelight, but they don’t always report back consistently between filesystems. Mostly for reasons of speed, they report the total size of the file blocks used to store a file and as block sizes can vary between filesystems, the total number of blocks used for a set of files will be different. For example, I have two folder sets that I know are identical and du -s on one reports 210245632 and on the other 209778684.

Fortunately, there’s an extra command line flag that will change the behaviour to take longer and sum the actual bytes. In this case, du -sb will return 214813009920 bytes on both filesystems. On the whole, I can be reasonably confident that if the total number of bytes used is the same between two filesystems, then all the files have copied correctly.

But what if the total number of bytes don’t match? How can I find the missing or truncated file? After thinking and tinkering, what I want is to get a list of files with each filesize from the old and new filesystems and then compare the two. And here’s how you do it (each section here goes on one line).

find /home/old_folder -type f -printf '%s %p\n' |
sed 's/\/home\/old_folder\///' | sort > old.txt

find /home/new_folder -type f -printf '%s %p\n' |
sed 's/\/home\/new_folder\///' | sort > new.txt

diff -wy --suppress-common-lines old.txt new.txt

If you aren’t used to Linux, this can look a bit scary, but it’s not really. The first two lines create the text files with all the files and the third line compares the two. The the first two lines are much the same in that they do the same commands but on different filesystems. There are three sections, find to list all the files, sed to chop the directory path off, and sort to get all the files in some sort of order. Here are some explanations.

find – finds files
/some/folder – where to start finding files
-type f – only interested in files (not directories or links)
-printf ‘%s %p\n’ – only print the filesize (%s) and the full pathname (%p) on each line (\n)

sed – processes text
‘s/x/y/’ – means replace x with y. In our instance, it’s replacing the leading folder path with nothing. It looks worse than it is because the slashes in the path need to be escaped by a preceeding backslash, so you get these \/s everywhere.

sort – sorts text
> file.txt – copy everything into a text file.

One of the clever things about Unix-like operating systems is that you can pass information from one command to another using a pipe. That’s represented by the | symbol, so the find command gets the information on files and files sizes, passes it to sed to tidy up which then in turn passes it on to sort before sending it to a file.

After running this set of commands on the old and new filesystems, all that needs done is to compare the files. Let’s look at the third and final command.

diff – compares files line-by-line
-w – ignore whitespace (spaces and tabs)
-y – compare files side-by-side
–suppress-common-lines – ignore lines that are the same
old.txt new.txt – the two files to compare

So what might the output of the diff be? If all the files copied correctly, you’d get absolutely nothing. Other possibilities are that the file partially copied or didn’t copy at all. Here’s what the output might be like.

598 i386/compdata/epson3.txt | 500 i386/compdata/epson3.txt
598 i386/compdata/onstream.txt <

The numbers at the beginning of the entry are the number of bytes, so the first line shows that the epson3.txt is only 500 bytes long in the new file but 598 in the old. The second line shows that onstream.txt is present in the old file but not in the new, as the arrow points to the old file.

To close the story, did I find that I had lost any files? Yes, I did. I discovered a couple of small files that hadn’t copied at all because of non-standard characters in their filenames. The filenames were acceptable to Windows but not Linux and I’d used my Linux PC to do the copying. Fortunately, the files were saved and all the scripting was worth it.

Read more on Linux at Geek News Central.