
Improved Website Backup

I’d previously written about backing up this site using a couple of bash scripts, ssh and rsync. It’s actually been working just fine. But, being me, I couldn’t leave well enough alone.

You see, I’m looking at these bzipped tar files in the 130MB range accumulating in my home directory and I start thinking that it’s an awful lot of space I’m taking up. Initially, I figure I’ll just prune the directory every week or so because, really, now that I’m satisfied with the look, the site won’t change much other than additional content. So as long as I’ve got a week’s worth of data, possibly creating a monthly snapshot as well or something, my backup needs should be more or less fulfilled.

That’s when I realize that I really don’t need to back up the site every day per se, just the content. And that’s all in the database dump. So, on a daily basis, all I need to update is the database backup portion, because the site portion of the backup is fairly static. From a content perspective, the main change would be due to uploading pictures for posts.

These are the sorts of things that happen to programmers. We get an itch and we just can’t not scratch it.

The change to the script up on the server is trivial. Rather than tar’ing the database dump file onto the website archive, I just leave them as separate files.
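I didn’t post the server-side script, but the change amounts to something like the sketch below. The paths, database name, and credentials are placeholders for illustration, not the real ones:

```shell
#!/bin/bash
# Sketch of the tweaked server-side script: the site tarball and the
# database dump are now produced as two separate files rather than
# tar'ing the dump into the site archive. Names here are placeholders.

# archive the website files into one bzipped tarball
function backup_site()
{
    # $1 = directory to archive, $2 = output tarball
    tar -cjf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# dump the database to its own bzipped file, kept separate from the tarball
function backup_db()
{
    # $1 = database name, $2 = output file
    mysqldump --user=backup --password=secret "$1" | bzip2 > "$2"
}

# e.g.:
# backup_site /var/www/mysite ~/backup-website.tar.bz2
# backup_db   wordpress       ~/backup-db_dump.bz2
```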

The real action takes place in the script on my home computer. By the time I was done, I’d created a bash function, mucked with sed and was calculating MD5s of the backup files to determine whether anything had changed, so as not to waste time creating redundant backups! My bash scripting fu is now far beyond what I’d ever thought necessary.

Here’s the new script:

#!/bin/bash

function md5_parse()
{
    local __md5_result=$(echo "$1" | sed 's:\(^[a-z0-9]*\) .*$:\1:')
    echo "$__md5_result"
}

HOST=mysite.org
SITE_BU=backup-website.tar.bz2
DB_BU=backup-db_dump.bz2

TIMESTAMP=$(date +%Y%m%d%H%M)

# run the remote script
ssh $HOST './remote-backup-script.sh'
rc=$?
if [ $rc != 0 ]; then
    exit $rc
fi

# make an MD5 hash snapshot of the files before the rsync transfer
MD5_PRERSYNC_SITE_BU=$(md5_parse "$(md5sum ~/path/to/$SITE_BU)")
MD5_PRERSYNC_DB_BU=$(md5_parse "$(md5sum ~/path/to/$DB_BU)")

rsync $HOST:$SITE_BU :$DB_BU ~/path/to/ && \
ssh $HOST rm $SITE_BU $DB_BU

if [ "$(md5_parse "$(md5sum ~/path/to/$SITE_BU)")" != "$MD5_PRERSYNC_SITE_BU" ]; then
    cp ~/path/to/$SITE_BU ~/path/to/$SITE_BU.$TIMESTAMP
fi

if [ "$(md5_parse "$(md5sum ~/path/to/$DB_BU)")" != "$MD5_PRERSYNC_DB_BU" ]; then
    cp ~/path/to/$DB_BU ~/path/to/$DB_BU.$TIMESTAMP
fi

exit 0

The basic gist remains the same as before: keeping working copies of the server files (stored in SITE_BU and DB_BU) locally so as to take advantage of rsync's transfer algorithm. The details have changed, starting with the fact that there are now two bzipped working copies: one for the website files and one for the database dump file.

Going through all of this, the md5_parse function was a result of good programming practice. Without it, I’d have the same sed operation sprinkled throughout the file. Yuck. It’s possible I could have managed to achieve the same effect using a different technique, but it wasn’t obvious.

I’d never mucked with sed prior to this, but it turns out to be pretty simple to use. All my sed usage does is pull out the MD5 hash from the output string that md5sum produces. It’s a pretty vanilla substitution/regex usage. The only thing I’ll note here is, for readability purposes, I took advantage of sed's delimiter substitution and used colons instead of the usual forward slashes. Otherwise, the sed usage would look like:

sed 's/\(^[a-z0-9]*\) .*$/\1/'

Really now, which would you rather read?
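To see the substitution in isolation, here it is run against a sample md5sum line (that’s the well-known MD5 of the empty string, not one of my actual backups):

```shell
# md5sum prints "<hash>  <filename>"; the sed strips everything after the hash
echo "d41d8cd98f00b204e9800998ecf8427e  backup-db_dump.bz2" \
    | sed 's:\(^[a-z0-9]*\) .*$:\1:'
# -> d41d8cd98f00b204e9800998ecf8427e
```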

For determining if the file content had changed, I had to go the MD5 route. Even though rsync won’t modify the file if it hasn’t changed on the server, it still affects the modification timestamp that stat returns; that detail threw stat under the bus. I realize that it’s still not 100%, but it’s close enough for my purposes.
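For the curious, the stat-based check I gave up on would have grabbed the timestamp something like this. This is a sketch (GNU coreutils stat syntax), not anything from the script I actually kept:

```shell
# Returns the file's modification time as seconds since the epoch.
# Abandoned approach: rsync updates this timestamp even when the
# transferred content is identical, so the script hashes contents instead.
function mtime_of()
{
    stat -c %Y "$1"
}
```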

Moving right along, the rsync line now grabs the two files from the server, and once the transfer is complete they are removed from the server. Then, for each file, the MD5 comparison is performed to check if the timestamped copies should be made. Hmmm, maybe I could make that into a loop…
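Were I to follow through on that loop idea, it might look something like this (a sketch only, reusing the md5_parse helper, with placeholder paths and the rsync step elided):

```shell
#!/bin/bash
# Sketch: fold the two per-file MD5 comparisons into one loop.
# Paths and file names are placeholders, not the real ones.

# same helper as in the script above
function md5_parse()
{
    echo "$1" | sed 's:\(^[a-z0-9]*\) .*$:\1:'
}

function snapshot_and_compare()
{
    local dir=$1; shift
    local files=("$@")
    local pre=()
    local ts=$(date +%Y%m%d%H%M)

    # snapshot MD5s of the working copies before the transfer
    for i in "${!files[@]}"; do
        pre[$i]=$(md5_parse "$(md5sum "$dir/${files[$i]}")")
    done

    # ... the rsync transfer and remote rm would happen here ...

    # keep a timestamped copy of anything whose content changed
    for i in "${!files[@]}"; do
        if [ "$(md5_parse "$(md5sum "$dir/${files[$i]}")")" != "${pre[$i]}" ]; then
            cp "$dir/${files[$i]}" "$dir/${files[$i]}.$ts"
        fi
    done
}

# e.g.: snapshot_and_compare ~/path/to backup-website.tar.bz2 backup-db_dump.bz2
```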

I expect the database dump file to continue to be updated daily. Partly because of my posting habits, but also because of comment spam. Bastards. The website backup file, on the other hand, should only change when I’ve added pictures or made other website modifications, like a new page or something. Seeing as the database dump is <1MB and the website backup is ~128MB, the change should go a long way towards saving disk space on my machine here at home.

Itch scratched.

For now.
