The Smart DBA's Guide to SQL Server Disaster Recovery, Part 2

You need a disaster-recovery program specific to your SQL Server assets—a program developed and maintained by SQL Server professionals. You're not necessarily looking for a general high-availability plan; that’s another topic for another time. No, you want a data-centric disaster-recovery program that concentrates on SQL Server availability, backup, and continuity. And this plan needs to have a laser focus on baking resiliency into your SQL Server systems so that, in the event of a local or regional disaster, your organization can continue its operations.

We’re applying our 7D Method to the SQL Server disaster-recovery process. This method is a decide and execute process comprised of seven stages, though we're using only the first five Ds for this application. In "The Smart DBA's Guide to SQL Server Disaster Recovery, Part 1," we covered Discover and Design. In this article, we'll continue through Develop, Deploy, and the Day-to-Day management of the SQL Server disaster-recovery program you create.

Get Robust!

In Part 1, we spoke of the relative frequency and impact of four classes of interruption. Traditional disaster-recovery planning focuses on a “smoking pile of rubble” scenario, which assumes that by preparing for the worst, all lesser incidents will be adequately addressed. We challenge that assumption!

Our 7D Method will help you outline a more robust program—one that includes disaster-recovery planning, testing, analysis, and continuous improvement. At the end of this series, your take-away will be a broader understanding of SQL Server disaster recovery, not just a plan-development process. While you're developing your program, we suggest you divide it into two perspectives: one for emergencies that require the servers to be in production offsite (i.e., for serious interruptions) and one for disruptions that don't require leaving the building (i.e., for less significant interruptions).

Each of these perspectives has different triggers for tuning your disaster-recovery program’s written plan and beginning the recovery processes it specifies. Adopting two perspectives means applying new rules, deploying new resources, and determining whether external resources are necessary or if you can recover using the resources you have on hand.

Sidebar 1: KPIs, Outcome Indicators, and Activity Indicators

The Third D: Develop

Optimal preparedness from both of the aforementioned perspectives requires that you measure and report using two types of key performance indicators (KPIs): Outcome Indicators (things that confirm that your program worked) and Activity Indicators (day-to-day protection practices or “in process” metrics). With an admitted shift in perception, your program will benefit if you borrow from the world of Maintenance, Repair, and Overhaul (MRO), whose focus is reliability. The MRO world analyzes incidents as chronic or sporadic.

Chronic incidents. Chronic incidents are variations from an accepted range of performance. The desired outcome is to take the range of variance back to accepted tolerances. Chronic incidents often fall into an “accepted losses” category—they're negative deviations from performance norms but aren't emergencies in and of themselves. They require attention, but they don't warrant dropping everything to correct the situation. They might fall under a “planned downtime” category, in which systems need to be taken offline for patches, updates, hardware refreshes, and so on. They’re often accepted as part of the job, such as the warning messages from a failing disk in a RAID array or warnings that you’re running out of space in the SQL Server transaction log.

Sporadic incidents. Sporadic incidents are typically more severe, and they're more infrequent and unpredictable. Sporadic incidents demand urgent attention. They're time-consuming to resolve and almost always carry a high dollar loss or human cost. We strongly encourage you to include near misses as well as actual incidents in your reviews for this set because these occurrences are generally few and far between. You’ll want as much testing of your preparation for such events as you can get, so that when the real event occurs, everyone has already practiced their roles.

Both chronic and sporadic incidents can be stepping stones whose resolution requires higher levels of performance. Getting back to normal is adequate, but it isn't progress. In most cases, normal is the status quo—not continuous improvement.

Let’s get started with activities that you can do within the next 30 days to jumpstart your SQL Server disaster-recovery program. When you finish, you'll be done with the Develop stage.

Figure 1 shows a SQL Server disaster-recovery program framework. It's a concept drawing that lays out the core processes and key activities of your program. The six columns represent the core processes; under each process, you should list the key activities necessary to successfully execute the process. Although they're presented here as sequential, these activities can be performed in parallel. For example, under Prepare, you can be budgeting at the same time that you’re forming your disaster-recovery committee. You’ll need two of these visual models—one for onsite-level incidents and one for offsite-level incidents. Try to limit the number of key activities to 10; otherwise, the model will be too complex to be useful as a tool. Each key activity will have one or more associated tasks (not shown).

Adopting this visual tool accomplishes several things. First, you get an alignment tool for business sponsors and the IT team, a common language and meaning that shows a general view of desired outcomes, deliverables, and performance directives. Second, you get an instant answer to the question “Where are we?”—which eliminates long report writing. We suggest coloring the activity/owner blocks red, yellow, or green to show progress (and quality, if you have best practices defined).

Peer-Oreiented Model

Assign ownership of each activity to a single individual. This person will handle the activity and all the tasks required to perform it, including information requests from other activity owners. Each person will interact directly with peers, each of whom will be handling other key activities, thus eliminating the usual decision-making chain, waiting for marching orders before work can get done. This is a peer-oriented model that's well suited to the informational requirements of a world-class SQL Server disaster-recovery program.

A word of advice: Spend 70 percent of your time with the framework's onsite perspective, and 30 percent of your time working the offsite perspective, since most operational interruptions will be local. Your offsite perspective should limit itself to recovering your SQL Server environment to an offsite location; make sure it integrates well with your enterprise disaster-recovery initiative, as well as your offsite data-center recovery plan (if your organization has one).

The good news is that you’re already doing many of the tasks that are part of a robust SQL Server disaster-recovery program: backups, log shipping, database mirroring, clustering, using third-party solutions. But have you looked at those tasks through the lens of disaster recovery? What questions do you have about these tasks? What don’t you like about what you’re currently doing? What assumptions have you made about these operations, about your databases, and about your ability to prepare, prevent, respond, recover, investigate, and report?

Which SQL Server competencies do you need to implement your SQL Server disaster-recovery program? What expertise, skill, knowledge, and training must you and your resources obtain and maintain? Which system reports must you know how to run and interpret? What SQL Server functions must you know how to execute? What SQL Server decisions will you need to make during each incident type? What SQL Server administration traps can you predict and avoid? These are the kinds of questions that you should be asking yourself as you populate the SQL Server disaster-recovery framework.

Sidebar 2: Why Not Use Traditional Project Management for SQL Server Disaster Recovery?

The Fourth D: Deploy

The Deploy phase consists of performing six core processes in pairs, as Figure 2 shows:

Prepare & Prevent—These processes translate to documenting, testing, and training. You'll document roles, people, and processes; test responses to incidents and near misses; and train in plan implementation and evaluate that training.
Respond & Recover—These processes equate to the coordination and facilitation of people and processes during an incident or near miss.
Investigate & Report—These processes mean performing root-cause and failure analysis. You'll analyze information-sharing processes, do “lessons learned” exercises, analyze performance and execute follow-up, and manage change as it becomes necessary.

In Part 1 of this series, we covered the Discover and Design phases of our 7D Method, so by now your program should have at least placeholders—and hopefully extensive documentation—for how you perform the Deploy phase's six core processes, their activities, and their associated tasks. As you can see in the figure, the lessons you learn from incidents and near misses feed into both benchmarks and leading indicators; these lessons raise the bar on benchmarks and strengthen the leading indicators. Therefore, the day-to-day best practices for managing and defending your SQL Server installation become better and better. Over time, this methodology should control and reduce the number of chronic incidents in your environment and enable you to better cope with sporadic incidents. The Deploy end-game is to know what to do when the time comes and to be able to move on it.

You might wonder when you'll know that you've finished with the Deploy phase. That moment will come when you've done one end-to-end real-time exercise for an onsite and an offsite incident, using the processes you laid out in your written plan. Once you've tested your incident-response processes, conducted a performance review, documented and shared lessons learned, updated your written program, scheduled training for individuals who need remediation, and given your recovery program an “inaugural voyage,” you can move to day-to-day operations.

The Fifth D: Day-to-Day

The final phase is simply program management, including periodic review, testing, and continuous refinement of recoverability using activity indictors. You’ll find that day-to-day operation involves a fine balance of people and processes. Unlike SQL Server applications, people can’t be automated out of the Deploy phase's six core disaster-recovery processes. What to do?

You need to balance process-management requirements and people-management requirements. Process management focuses on traditional process execution concepts (Plan/Do/Check/Act), whereas people management focuses on individual performer and team process execution concepts (Should/Can/Will/Did). “Should” is an agreement about what should happen. “Can” is proper resourcing. “Will” is individual motivation and commitment to perform. “Did” is performance evaluation and assessment, targeting continuous improvement.

Sidebar 3: Quick-Start Checklist

Final Checklist

How will you know when you’re done? How will you know when you’ve got a SQL Server disaster-recovery program that will work when you need it? Each organization has a different set of recoverability requirements, but we think it’s safe to say that if you’ve done the following, you’re as prepared as you can be:

You’ve written a service level agreement (SLA) that guarantees a return to operations within so many hours or days, and you can abide by it.
You’ve assembled a team of experts, and you have access to talent for disaster-recovery plan writing, testing, and incident management.
You’ve written a well documented SQL Server disaster-recovery plan, and you test it annually as a real-time event, quarterly as a table-top exercise.
You have access to replacement systems and facilities.
You have a confirmation that the disaster-recovery facility will be available as long as necessary until you can return to your original operations site.

Remember, it’s not enough to just perform backups. All disaster-recovery programs include backups, but backups alone aren’t a program. Backups are required, but they’re insufficient. Even if your data is recoverable, you won’t have a SQL Server disaster-recovery program if all you have are backups. Finally, know that the limit of this program is your SQL Server environment; this isn't an enterprise disaster-recovery program. If your company employs a business-continuity professional, be sure to make good use of this free resource.

Sidebar 1: KPIs, Outcome Indicators, and Activity Indicators

A key performance indicator (KPI) is a measure of performance commonly used to help an organization define and evaluate its success, typically in terms of making progress toward a long-term organizational goal or set of goals. Outcome Indicators answer the question, “What happened?” These include the number of “dry run” tests of system resiliency you’ve executed each quarter or each year; the number of people ready to respond as a result of cross-training or enhancement of their SQL Server competencies, skills, and experience; or a report of your performance in a major test or real incident compared with documented benchmarks and baselines. Activity Indicators answer the questions, “What's happening?” and “What do we want to happen?” These include knowing that your SQL disaster-recovery plan exists and is up-to-date; that it's distributed and discussed with line-of-business executives on a quarterly basis; that a wide range of scenarios is inventoried and used for “table top” discussions as well as real in-process tests; that emergency crew competencies and performance are evaluated, reviewed, and approved annually; that incident-response resources (computing and storage, workload processing, people) are continuously available; and that a system for alerting within the enterprise exists and tests show that it functions within acceptable parameters.

Sidebar 2: Why Not Use Traditional Project Management for SQL Server Disaster Recovery?

There are several reasons why traditional project management is insufficient for your SQL Server disaster-recovery planning needs. First, projects have an end date; this SQL Server disaster-recovery program will never end. Second, projects have a project manager—that is, a decision-maker—whereas SQL Server disaster-recovery frameworks don't. Every activity owner is the decision-maker for that particular activity, the go-to person for questions or resources about that activity. As the SQL Server professional, you coordinate and facilitate your SQL Server disaster-recovery program activities going. Whereas a project manager’s conditions for success are reduced to “on time, on budget, works as promised,” the SQL Server disaster-recovery program manager must ensure the recovery of operations dependent upon SQL Server data. These are two very different sets of “marching orders.”

Sidebar 3: Quick-Start Checklist

Backups done and tested periodically
Critical database and apps restore sequence documented
Scheduled jobs, DTS/SSIS packages, etc., scripted out and safely stored
Associated Windows jobs (FTP, Windows scheduled jobs) documented and safely stored
Current copy of passwords for SQL Server logons and encryption keys stashed in a safe place
Windows backups of relevant log data (SQL Server Engine, Change Data, etc.) made and secure
Disaster-recovery program budgeted for (beyond backup) as a persistent line item
Disaster-recovery facility—cold site, warm site, hot site—configured and functional
Standby or failover servers in place and operational
Standby or failover servers continuously tested
Required firewall and perimeter security in place at disaster-recovery site
Documented procedure to transfer DNS settings during failover readily available
“Remote into disaster-recovery site” plan in place for SQL Server People group
SQL Server People primary computer for managing SQL Server systems is portable and can travel with SQL Server person; it's set up to access both the primary site of operations and the disaster-recovery facility.

Comments

Plain text