Adventures in backups

I’ve been quiet for the last few months, mainly because I’ve been working on a backup project rather than focusing on virtualisation.

Prior to this, I’d mostly left backups to the professionals, as they had generally fallen within the remit of the storage teams. But when I finished off my previous projects and the music stopped, the only chair remaining was on a ‘behind schedule’ backup capacity project.

I’m not going to go into the whys and wherefores of the project being stuck in limbo, but I did want to share what I learned from it.

  • Requirements, Requirements, Requirements
    If a project doesn’t have them, how do you know if you’ve been successful? Requirements that aren’t nailed down can, and almost certainly will, lead to scope creep in several directions.

    • Capacity
      • How much data are we trying to protect
      • How much replication traffic will there be between sites
      • How many simultaneous streams do we need to support
      • How many servers are we backing up
      • How much will de-duplication save
    • What kind of data are we protecting
      • VMs
      • Filesystems
      • Databases
    • How will we transfer the data
      • Network
      • SAN
    • What are we protecting against
      • Accidental deletion
      • Filesystem corruption
      • Loss of a server
      • Loss of a storage subsystem
      • Loss of a site
      • Rogue agents within the business
    • What RPO (maximum tolerable data loss) and RTO (maximum tolerable time to recover) are we aiming for
      • This will be affected by backup size/duration/policy
    • Are there any specific security requirements
      • Encryption – minimum cipher strength, on data and/or control traffic
      • Authorisation – granularity of access control

    I’m sure those who have spent more time dealing with backups than I have could easily add to this list!
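
To make the capacity questions above a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It is not how any particular backup product sizes itself: the `CapacityInputs` fields, the weekly-full-plus-daily-incremental scheme, and the idea that replication only ships changed, de-duplicated data are all assumptions for illustration, and every figure fed into it should come from your own measurements.

```python
from dataclasses import dataclass


@dataclass
class CapacityInputs:
    """Hypothetical sizing inputs; every value is an assumption to be measured."""
    protected_tb: float            # front-end data we are trying to protect, in TB
    daily_change_rate: float       # fraction of that data changing per day, e.g. 0.02
    weekly_fulls_kept: int         # number of weekly full backups retained
    daily_incrementals_kept: int   # number of daily incrementals retained
    dedup_ratio: float             # e.g. 5.0 means a 5:1 reduction


def logical_backup_tb(c: CapacityInputs) -> float:
    """Size of all retained backups before de-duplication."""
    fulls = c.protected_tb * c.weekly_fulls_kept
    incrementals = c.protected_tb * c.daily_change_rate * c.daily_incrementals_kept
    return fulls + incrementals


def stored_backup_tb(c: CapacityInputs) -> float:
    """Estimated capacity consumed once de-duplication is applied."""
    if c.dedup_ratio <= 0:
        raise ValueError("dedup_ratio must be greater than zero")
    return logical_backup_tb(c) / c.dedup_ratio


def daily_replication_tb(c: CapacityInputs) -> float:
    """Rough daily inter-site traffic, assuming only changed, de-duplicated data is sent."""
    if c.dedup_ratio <= 0:
        raise ValueError("dedup_ratio must be greater than zero")
    return c.protected_tb * c.daily_change_rate / c.dedup_ratio


if __name__ == "__main__":
    example = CapacityInputs(
        protected_tb=100,
        daily_change_rate=0.02,
        weekly_fulls_kept=4,
        daily_incrementals_kept=30,
        dedup_ratio=5.0,
    )
    print(f"Logical backup data: {logical_backup_tb(example):.1f} TB")
    print(f"Stored after dedup:  {stored_backup_tb(example):.1f} TB")
    print(f"Daily replication:   {daily_replication_tb(example):.1f} TB")
```

Even a crude model like this is useful for showing how sensitive the answer is to the de-duplication ratio and the change rate, which are usually the two numbers nobody can give you up front.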

  • Get people to think about what they are requesting backups for, and the impact of taking them, or not taking them.
    If you don’t put some constraints on what should be backed up, you might end up trying to back up the world. Nervous admins will invariably ask for everything to be backed up, “just to be on the safe side”, even when the service could be recovered through an automated build process.

    Some questions to think about are:

    • Why do we need to back up this data (or server)?
    • Why do we need to back up this data now, if it hasn’t been backed up before?
    • Can we not recover the data or server any other way?
    • How would we recover the service, if the data is restored from backup?
    • What would be the impact if the data is lost and we couldn’t recover it?
    • What is the failure scenario we are aiming to recover from?
    • How quickly does the data need to be recovered? (RTO)
    • How recent does the backup need to be, to be worthwhile? (RPO)
    • How long do the backups need to be kept? (Retention)
    • Who owns the data?
    • Is the data subject to any compliance requirements? (e.g. PCI DSS)
    • When can the backups be taken?
    • Is there any impact to the service when the backups are running?
    • Is there any impact to the service if the backups are not working?
    • If a backup is missed, do we need to reschedule it?
    • Do the backups need to go off-site, or to a different geographic region?
    • What is the size of the backup?
    • How much of the data changes between backups? (Delta)
    • Will there be a regularly scheduled restore test?
    • Who can request a data restore?
    • Who can request expiry of the stored backups?
    • Who can request removal of the backup policy?

    This set of questions can help you verify the need for a backup, and capture the constraints and other factors needed to run it in day-to-day operations.
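
One way to stop those questions getting lost is to record the answers per request and refuse to proceed while any are blank. The sketch below is only an illustration of that idea: the `BackupRequest` fields are a hypothetical subset of the questions above, and `restore_gb_per_hour` is an assumed throughput figure you would need to measure in a real restore test.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BackupRequest:
    """Hypothetical record of the answers gathered for one backup request."""
    service: str
    data_owner: Optional[str] = None
    failure_scenario: Optional[str] = None
    rto_hours: Optional[float] = None         # how quickly the data must be recovered
    rpo_hours: Optional[float] = None         # how recent the backup needs to be
    retention_days: Optional[int] = None      # how long backups are kept
    backup_size_gb: Optional[float] = None
    daily_change_gb: Optional[float] = None
    offsite_required: Optional[bool] = None


def unanswered(request: BackupRequest) -> List[str]:
    """Return the names of questions that have not been answered yet."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]


def estimated_restore_hours(size_gb: float, restore_gb_per_hour: float) -> float:
    """Data transfer time only; rebuilding the service itself is not included."""
    if restore_gb_per_hour <= 0:
        raise ValueError("restore_gb_per_hour must be greater than zero")
    return size_gb / restore_gb_per_hour


def meets_rto(request: BackupRequest, restore_gb_per_hour: float) -> bool:
    """Check whether a full restore at the assumed throughput fits inside the RTO."""
    if request.backup_size_gb is None or request.rto_hours is None:
        raise ValueError("backup size and RTO must be answered first")
    hours = estimated_restore_hours(request.backup_size_gb, restore_gb_per_hour)
    return hours <= request.rto_hours


if __name__ == "__main__":
    request = BackupRequest(
        service="wiki",
        data_owner="docs team",
        rto_hours=4,
        backup_size_gb=2000,
    )
    print("Still to answer:", ", ".join(unanswered(request)))
    print("Meets RTO at 400 GB/hour:", meets_rto(request, restore_gb_per_hour=400))
```

The RTO check deliberately only covers moving the data back. If the answer to "how would we recover the service?" involves rebuilding servers first, that time has to be added on top.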

  • Other thoughts
    • Before you put the new backup service into production usage:
      • Get all your install/config/upgrade automation tested
      • Get everything on matching versions
      • Run vulnerability scans and fix any issues
      • Involve your operational teams
      • Get your processes and procedures agreed
    • Plan out and prioritise your project tasks. Make sure you deliver what is required, and save anything else for ‘phase 2’.
    • And finally, learn to let go! Tie up (or hand over) any loose ends, and let the operational staff run the backups.
      OK, I’m finding this one difficult. It’s been my baby for the last 4 months, but I’m trying.

      PLEASE TAKE AWAY MY ACCESS SO I CAN’T KEEP CHECKING IT’S ALL STILL WORKING!!!!