The Smart DBA's Guide to SQL Server Disaster Recovery, Part 1

Knowing how to restore database files is not disaster recovery. Scheduling regular backups— while a "must do"— is insufficient for proper disaster-recovery planning in SQL Server environments. As a SQL Server administrator, you need to know how to develop disaster-recovery procedures and plans that properly protect a range of SQL Server–powered business processes—databases that range from "never go dark" to "restore when you can." The good news? You don't need to know how to create an enterprise disaster-recovery plan.

Part 2: The Smart DBA's Guide to SQL Server Disaster Reocvery

You need to focus on SQL Server protection and learn how to contribute SQL Server disaster-recovery methods to your company's more comprehensive business-continuity/disaster-recovery plan. Even if your company has no such plan, you want to be able to plan and execute a robust process for protecting SQL Server assets.

Just as performing SQL Server backup and recovery is different from restoring folders and files from a file server, SQL Server disaster recovery has unique aspects that—if not considered in a corporate disaster-recovery plan—can affect recovery and corporate viability. This two-part article touches on topics that SQL Server people should consider when contributing to an enterprise disaster-recovery plan, as well as practices they can adopt in their own SQL Server management techniques.

For our purposes, we’re going to apply our 7D Method to SQL Server disaster-recovery. Originally developed to help SQL Server administrators manage the lifecycle of their databases, the 7D Method is a systematic decide and execute process whose seven stages include Discover, Design, Develop, Deploy, Day-to-Day, Defend, and Decommission. For this exercise, we’ll use only the first five Ds. In this article, we'll cover Discovery and Design, and in Part 2, we'll show you how to Develop, Deploy, and maintain your SQL Server disaster-recovery plan in a Day-to-Day environment.

The End Is Near!

Disaster recovery can be a scary term. It implies horrible things happening: visions of tornadoes carrying your data center to Oz, or your rack of database servers sinking beneath floodwaters streaming down from the Mississippi River. In truth, such catastrophic events rarely occur. It’s much more likely that you’ll be brought down by the simplest of disruptive events—a power outage. Whether you’re facing fire, flood, tornado, blizzard, terrorist attack, pandemic, or a straightforward power outage, your job is to keep the servers operational and connected to the network.

All that being said, two big misconceptions surround disaster-recovery planning: first, that preparing for the worst kind of catastrophe properly prepares you to deal with any lesser disruptions, and second, that large enterprises are fine with the same disaster-recovery plan development that SMBs perform.

Figure 1 shows the levels of disruption and the relative frequency of occurrence—signified by the width of each band—that you might expect from minor, significant, serious, and catastrophic interruptions to your core processes. Notice the frequency of catastrophic interruptions. Catastrophes—interruptions lasting longer than 10 days—have historically made up a very tiny percentage (1 to 2 percent) of disruptions. The biggest sources of disruptions—those that cause up to a full day of downtime—are minor.

Figure 1: Sample interruption categories

When you’re building your plan, focusing on a worst-case scenario doesn't adequately prepare you for more common minor interruptions. A cascade of minor interruptions can be just as devastating as a single major event, including putting the average SMB on life support.

Focusing on SQL Server

If you work for a critical private-sector firm such as an infrastructure service provider (e.g., a power company), your company probably has a disaster-recovery plan in place. Your challenge is to dovetail your SQL Server disaster-recovery plan into that of the enterprise. Critical infrastructure companies are required to have contingency plans. Such is not the case for most small-to-midsized businesses (SMBs); SMBs rarely have responses to severe interruptions prepared in advance, nor do they have business-continuity specialists—or budgets. This article is aimed squarely at SQL Server administrators.

But you’re probably wondering how you can possibly cram yet another project into your already crowded to-do list. Sadly, for some of you, a SQL Server disaster-recovery plan might not be optional. If you’re an infrastructure service provider, financial institution, government entity, or health-care company—or if you provide critical services to these entities—you might have to devise a plan. Your company might be required by law or by contract to have a disaster-recovery plan.

The good news? Most of you are already performing SQL Server disaster-recovery activities with backups, standby and failover servers, log shipping, database mirroring, and security policies. These are tactical SQL Server disaster-recovery activities; now, you need to pull them together into a strategic plan for survival. If your organization already has a plan, you have something that you can integrate into and build on.

SQL Server disaster-recovery planning must be customized for each organization. Think about what contribution you and your team can make to crisis communication and management. What techniques can you develop and share for business-unit and technology-recovery procedures? What can you do to mitigate the impact of supplier and distributor failure? What ideas can you come up with regarding relocating and restoring critical IT infrastructure? What can you add to testing and validating incident recovery?

The First D: Discover

To align stakeholders and confirm consensus about why you're doing the project, you should start every project with the Discover phase. In a SQL Server disaster-recovery planning project, you should understand and acknowledge the real objectives behind your labors: You want to ensure that your firm survives any level of interruption; you want to keep auditors and regulators happy; you want to keep the boss happy; and you want to keep your job. Any or all of these drivers are valid for your project.

A SQL Server disaster-recovery plan can't be hatched in a vacuum. The team you assemble will be your SQL Server Continuity Committee. Because your focus will be preserving the continuity of your SQL Server database operations, you should assemble those people who understand database systems, applications, OSs, networking infrastructure, security, and upstream and downstream data sources. Be sure to include representatives from the business end of the organization. Their perspectives are invaluable. You’ll need to select a project manager to orchestrate activities and resources. The more experience the project manager has with disaster-recovery and continuity planning, the better the results, but anyone with project-management experience will do.

The Discover process can be broken down into four steps: inventory (e.g., business functions, resources, people); identify priorities; implement the SQL/DR plan; and integrate into the enterprise DR plan.

Step 1: Inventory. Your first task is to inventory your core business functions and processes. You must approach discovery from a business perspective even though you’re developing a SQL Server disaster-recovery plan. During the Discover phase, you’re going to use a top-down approach to identify the core business processes and the activities that support these core functions. You can then work your way down the levels of importance, discovering and identifying non-core functions and the activities that support these operations. Finally, you’ll want to identify any cross-references and dependencies among core functions and between core and non-core functions. When you’re done, you’ll have a very good picture of what the company does.

Your second task is to map your technology resources (e.g., databases, applications) to each of the functions you just identified. While you’re at it, why not include any underlying hardware, software, and networking infrastructure, plus anything else you might know about, such as special instructions for activation and use? In a subsequent step, you’ll have these business functions prioritized for you by the committee, and when that happens, you’ll know what technologies support each business function.

Your third task is to inventory your personnel and their job roles. Who does what? Have them tell you, and don’t be surprised if you find out that Joe isn’t just a programmer—in fact, he’s been doing database backups for over a year now. A finding like that would be great, because redundancy in SQL Server skills is necessary as part of a continuity plan. If you don’t already have it, skills cross-training needs to be part of the implementation plan. But for now, simply find out who does what, who can fill in for whom, and who has overlapping skills and which skills have recurring certification requirements. You get extra points if you already have someone (besides yourself) with adequate skills who can step in and keep the place operational. (This is called the hit by a bus plan.)

Your fourth task is to inventory continuity features and functions already in place. Remember, backups don't equal continuity! If you’re backing up but not testing the restore, shame on you! Restorability is what matters. Do you have standby and/or failover servers? Are you doing log shipping or database mirroring? Windows clustering counts, but when that cluster fails, it’s going to take much longer to get the cluster (and the databases it houses) operational again than to get a single server running. Are your databases virtualized? If a virtualized database or virtual machine (VM) fails, how quickly can you start a new VM with SQL Server intact and running, and how much data loss will it entail? Are your SQL Server machines hosted at a hardened facility outside your geographic area, or are they on-premise?

Your fifth task is to identify hardware and networking infrastructure that needs to be upgraded or replaced, and prioritize the list. Identify physical plant components (generators and security systems, especially) that would be essential for local and/or remote continuity, and prioritize them. Identify key personnel and skills training still needed, and prioritize that list.

Finally, no inventory would be complete without a full materials inventory. You might already have something like this in your files. It’s normal to have to do a materials inventory—hardware, client computers, licensed and non-licensed software, and networking components—on an annual basis. Only this time, include everything on the physical premises that's essential for continuity: air conditioning, security access, work space accommodations, and so on. Make sure you can locate copies of all software, make sure you have backup copies of all software installation disks and registration keys, and make sure you have a record of the details of each hardware and networking component, so that when they fail you can quickly replace them with similar or compatible models. Your Property and Casualty insurance policy might have an asset inventory that can serve as a starting place. Consider downloading free inventory-discovery tools for identifying workloads and devices for server consolidation.

Step 2: Identify Priorities. The most important databases might not be the largest or those with the most number of transactions executed against them in a day’s time. Rather, they’re the ones that support the most critical business functions. During the first part of the Discover phase, you inventoried the core business functions. Using this list and your mapping of resources that support these core functions as a guide, you need to determine the categories of data criticality:

Class A—What data must be available 24×7×365?
Class B—What data must be available within 1 day?
Class C—What data must be available within 3 days?
Class D—What data can wait for a month or more before it can be brought back online?

What do you need to complete the plan, to fill out the gaps in each list? Does your office have generators that can kick in during power failures (the most common type of continuity interruption), and how long will they keep the systems alive and functioning? What about air conditioning? How long will the AC stay operational following a power outage?

What about weather priorities? If your company is located in a part of the country that is subject to weather events (e.g., blizzards, hurricanes, tornadoes), what’s your "foul-weather plan"? It’s not reasonable to ask your staff to navigate icy or flooded roads, risking life and limb, to get to the office; neither is it reasonable to close the office when you don’t have to. A simple plan that includes web-enabled apps for accessing data stored on servers located at hardened facilities—with the staff remotely accessing that data from home when the roads are impassible—solves both situations. But it takes planning. The great news is that a plan like this is adaptable for use in a disaster situation. If the building/block/campus becomes unusable, the data and servers are safely tucked away at the hosting facility, your staff is working from home or from some satellite office that you can quickly set up, and business continues as usual.

Before you can go to the Design stage or implement any of your findings, you must have a visible buy-in commitment from senior management. In a large organization that has a conventional plan in place, the decision-makers can sleep better at night knowing that they’re ready for a disaster that might never occur—and employees can feel better knowing that the company is looking out for them. That’s not how it works in an SMB. The cost/benefit ratio of a traditional disaster-recovery plan is seldom feasible for them. Business owners and managers are used to being at risk and are driven primarily by the bottom line. It’s a challenge to answer the question, “How will a SQL Server disaster-recovery plan help my business?”—particularly when an SMB owner asks the question. Even an SMB that requires a high level of trust from its customers might be reluctant to put a disaster-recovery plan in place unless it's mandated. The SMB incurs cost for something that isn’t visible, about which few if any customers will ask. But, if an SMB is to survive disruptions, it must have a tested plan. And that plan, or components of the plan, must be represented as budgetary line items, and the line items must persist in the budget, year after year.

The Second D: Design

Now that you have an idea of what you’re going to need to build a good SQL Server disaster-recovery plan and continuity scheme, it's time to bring all the lists together into a plan and prioritize the components. You’re creating a SQL Server disaster-recovery plan—not a disaster-recovery plan for your entire organization—so the order in which you’ll execute tasks might not be the same as you would expect to find in a full enterprise disaster-recovery plan.

Web Table 1 shows a sample template you can use during both the Discover and the Design phases. In the first section of the template, you can list your organizational business processes and functions by name and classify them as core or non-core. Create additional classes such as core support if that helps you get your arms around the big picture. In the second section, you can list the technology resources that you have in your shop. Then, cross-reference business processes to technology resources, using either resource names or associated numbers. Fill out the rest of the Inventory Resources section, using information you’ve collected about personnel/job roles/skills, continuity features already in place, infrastructure that needs upgrading or replacing, and the materials inventory. Nothing is sacred about this template; you can modify it to your own specific needs, then use it as a checklist once you have all your Discover phase inventory lists recorded.

Next, prioritize the data based on the class of required availability, ranging from Class A to Class D. For each data set, include the name of the data repository in as much detail as you need, even to the level of server_name.database_name.schema_name.tablename. If the data resides outside the SQL Server system, include the full path and file name, or at least the file folder name. List whatever continuity features you already have in place for each data set, and describe any upgrades or enhancements you feel are necessary to ensure the appropriate level of continuity.

Finally, perform a cursory design of your protection scenario, once again built on the data availability A–D classification. For all classes of data, list the business processes supported, names of the technology resources involved (not just SQL Server systems but network infrastructure, file servers, VPNs, web servers, and so on), what continuity enhancements will be needed to meet the availability classification, and the funding requirements and timelines involved.

Armed with a completed spreadsheet that describes the current situation and what’s needed to meet data availability requirements, you can now go to management and argue for funding for your SQL Server disaster-recovery plan. Then—or perhaps concurrently—you can take your report to the SQL Server Continuity Committee and discuss with them and with the project manager how your SQL Server plan dovetails into the enterprise disaster-recovery plan.

Next Time, We'll Go 3D

Here’s the reality of the situation: No matter how good your plan is, if the rest of the company falls apart, it means nothing. You can connect to the power grid, but if the power company is down, it means nothing. You can connect to the Internet, but if the DNS servers are offline, then the Internet is crippled and your work is minimized. Does this mean you shouldn’t do anything? Absolutely not! As you’ll see in Part 2, a simple approach will give you a solid foundation. By completing the Develop, Deploy, and Day-to-Day steps as the last steps of this exercise, you’ll have a viable framework. You’ll get a direct line of sight into one of the most important activities documenting a SQL Server disaster-recovery plan. Stay tuned!

Comments

Plain text