AlwaysOn Availability Groups and SQL Server Jobs, Part 25: Final Remarks on Backups

In the last few posts (posts 19 – 24) I’ve covered some high-level concerns and reviewed some details about how to implement and manage database backups against AlwaysOn Availability Group databases. Most of the information I’ve provided was pretty basic or elementary – along with some additional guidelines for things to watch out for when managing backups. Which is to say that backups are never something you can simply take for granted and they’re never something you just take lightly. As such, I’ve got a few parting thoughts about backups before we move on to the next part of my series of posts on AlwaysOn Availability Groups.

Best Practices for AlwaysOn Availability Group Backups

As you might suspect, best practices for managing backups when AGs are involved isn’t much different than best practices for ‘normal’ databases – except there are a few wrinkles or potential pitfalls thrown into the mix to keep you on your toes.

Things you’ll want to do or ensure when managing backups for AlwaysOn Availability Groups:

DO make sure to regularly test your backups (once a week or once a month might be fine in some environments, and once a day might not even be enough in other environments – only you will have a solid idea of just how critical your data is in terms of needed protections; and just remember that Highly-Available data is NOT the same as data that’s protected from disasters like certain forms of corruption, software glitches, ‘stupid user’ errors, and a host of other potential disaster types that might render your redundant data invalid in a jiffy).
DO make sure to document all processes, procedures, and details needed to restore your data in an emergency. Such documentation should be written by your organization’s senior-most engineer (who can account for the greatest number of contingencies – and document ways to deal with or account for them) but should be written to the level of one of your junior-most techs – as that’s likely who will be ‘on call’ or in the hot-seat when your databases hit the fan.
DO make sure you’ve got SLAs – or RPOs and RTOs in place to help define and guide your expectations around uptime and continuity. Without these details clearly defined and in place, there is simply no way you will ever be able to be successful. (You might recover from the most spectacular of disasters, but no one in Management will ever have had a clue that SQL Server can and will go down for periods of time and no one will expect any ‘real’ down-time because, in their minds, you’ve paid a TON of $$ for SQL Server hardware and licenses to make sure that your databases are “always on”).
DO make sure that your documentation provides information on how to troubleshoot the widest variety of possible scenarios imaginable (cluster failures, database corruption, corrupt/busted master databases, suspect databases, and the likes).
DO make sure that your disaster recovery docs provide information on who to escalate problems to (including up to date contact information) should things go ‘outside of plan’ (which… they likely will – because you ARE dealing with a disaster).
DO make sure to document the exact specifics of where backup files should be and how you’ve architected your AlwaysOn Availability Group backups (i.e., how the preferred replica was defined, where the files go, and so on).
DO make sure to keep this documentation up to date.

In addition to those basic best practices for managing backups, there are a couple of things to watch out for (or gotchas) along the way that you’ll want to watch out for too:

If you do a good job of testing how Availability Groups work BEFORE you deploy them into production, you’ll quickly note that the need to RESTORE from backups is quite seriously painful – because you can’t restore over the top of a database that’s part of an Availability Group (meaning that you’ll have to pull it out of the AG or tear down the AG entirely before you can restore – and, of course, once you’ve restored, you’ll have to re-initialize synchronization of your database again by moving it back into an Availability Group).
Because of the hassles (and major potential for down-time) associated with the above, you REALLY need to have a 3^rd Party Log Reader on hand to address situations where data becomes ‘logically corrupt’ or where someone or something manages to ‘screw up’ the integrity of your database – because a 3^rd party log reader will let you recover from such a disaster (if you’ve got solid backups) without requiring you to ‘start over from scratch’ and without forcing you to take systems off-line OR without requiring you to recover EVERYHING back to a point-in-time before the data was hosed, and so on.
You’ll also need to watch out for backup ‘fragmentation’ (as outlined in post #20) and make sure you’ve correctly ‘synchronized’ your backup jobs and logic across all of the hosts for your replicas (as outlined in posts 21 – 24).

What’s Next?

With backups now addressed, it’s time to turn to part IV of this series – where I’ll provide an overview into some more advanced considerations (like using Master Servers, multi-subnet AGs, etc.) and where I’ll also cover some alternative options to addressing the problems of job management and synchronization – and then conclude with some more concrete examples of specific ways to tackle some of the details outlined in previous posts.

Part 1: Introduction
Part 2: Putting AlwaysOn into Context
Part 3: Defining Batch Jobs
Part 4: Synchronizing Server-Level Details
Part 5: Setting Up Failover Alerts
Part 6: High-Level Options and Rationale for Handling Batch Jobs
Part 7: Detecting Primary Replica Ownership
Part 8: Dynamic Detection Gotcha #1
Part 9: Dynamic Detection Gotcha #2
Part 10: Dynamic Detection Gotcha #3
Part 11: A Fork in the Road
Part 12: Issues with Enabling and Disabling Batch Jobs
Part 13: Creating and Populating a Batch Job State Table
Part 14: Establishing Synchronization Checks
Part 15: Setting Up Linked Servers
Part 16: Job Synchronization Checkup Logic
Part 17: Creating Jobs to Check on Synchronization
Part 18: Health Checks for Availability Groups
Part 19: Availability Group Database Backups
Part 20: Avoiding Backup Fragmentation
Part 21: Assigning Backup Preferences
Part 22: Executing Backups
Part 23: Maintenance Plan Backups and SSIS Packages
Part 24: Synchronizing SSIS Packages

Comments

Plain text