The importance of testing database restore processes

Posted on: 20 November 2009

It's been said before, but good judgement comes with experience, but experience comes from bad judgement

We all know the golden rules on database backups:

  1. do backups
  2. test that you can restore from those backups

But how many of us actually do that second part? I've recently gained some experience (ahem!) on the importance of step 2.

We have a system here which has a moderately large database (about 250GB). Until recently, it was running on a server with 400GB of disk space available to the database server. Unfortunately, what with InnoDB's reluctance to release space which was freed up by purging old data (the purge wasn't a straightforward delete - it was an update of a longtext field to an empty string - we're looking at better ways to do this!), the database files were using around 350GB of disk space, which along with MySQL's binary logs, was getting rather uncomfortable.

With insufficient disk space to optimise the table causing the biggest problem with unreleased empty space, our only option was to rebuild the database by running a mysqldump, dropping the database, and restoring from the dump.

Ideally there would be an additional step in there: test the restore process, right between the "run mysqldump" and "drop database" steps. Unfortunately, as a small startup, we're pretty limited in our available hardware, and had nowhere to test this restore. I did check that the mysqldump finished successfully, reported no errors, created a reasonable sized dump, and that the dump file looked OK by a manual inspection of the first few hundred lines of SQL.

I think we all know where this story is headed.

So, we dumped the database, did the checks we could, dropped the database, and started the restore. 24 hours later, I'm checking the progress, and can see that three quarters of the tables (representing 98% of the data by volume) have restored OK. Unfortunately, we've been at this point for two hours now. A quick call to our hosting provider tells me that the restore process has completed, and reported no errors.

Cue sick feeling in pit of stomach.

About this point, I check the end of the mysqldump file, and find that it mysteriously ends after about three quarters of the tables have been restored. No errors, no warnings, no mid-SQL statement end - the file just ends, apparently tidily.

We now start checking the nightly backups over the past month - even if we can't get the up-to-date restore of those missing tables, a restorefrom the previous night, or even the previous week, would be much better than nothing.

This is where we learn that the nightly dump has finished at that point every night.

Cue very sick feeling in pit of stomach.

There then followed several days of painstaking manual reconstruction of the missing data based on inferences from the data we had restored.

Always, always, always test that a restore process is going to restore properly before it becomes too late.