Many people break the subject of high availability into two parts—disaster prevention and disaster recovery—and discuss the topic as if every step in a high-availability solution fits neatly into one arena or the other. However, while I was planning for this article and trying to determine which activities constitute disaster prevention and which constitute disaster recovery, I found that the line between the two isn't a neat one. I also realized that to distinguish between disaster prevention and disaster recovery, you need a clear definition of "disaster" for your organization. (See also, "Don't Confuse High Availability with Disaster Recovery").
If you define disaster as something outside the technical realm, such as a flood or earthquake, then preventing disaster isn't possible, at least for the IT staff. In such a case, you'd focus on disaster preparedness. But if you concentrate on disasters directly involving technology, drawing the line between prevention and recovery is difficult. For example, if you define the loss of your SQL Server as a disaster, you might implement a clustered SQL Server environment so that a SQL Server on another machine automatically takes over for the failed server. That failover lets you recover from the disaster of losing your primary SQL Server. However, most people think of a clustering solution as disaster prevention.
Similarly, most people think of backup-and-restore strategies as disaster recovery. However, if you define a disaster as a loss of data and if a complete restore from backups prevents data loss because of a disastrous event, did the disaster occur, or did you prevent it? Because you need good backups to perform a complete restore and you have to perform backups before a disaster occurs, you can think of an effective backup strategy as a disaster-prevention technique.
So what exactly constitutes a disaster? For the purpose of this article, a disaster is a loss of production data or downtime that causes a loss of productivity. For your own purposes, if you think a particular event is a disaster for your business, it's a disaster.
When I distinguish between disaster prevention and disaster recovery, I suggest that prevention is what you can do before something happens and recovery is what you do after. The "something" could be an event that's preventable, such as multiple disk failures, or it could be an event that's unpreventable, such as an earthquake. In the latter case, you don't look for ways to prevent the event; you look for ways to be prepared and prevent the event from causing an unacceptable loss of data or productivity. The sidebar "Is True Recovery Always Possible?" describes the personal experiences of one of the world's leading SQL Server disaster-prevention experts in the aftermath of the September 11, 2001, attack on the World Trade Center and underscores how important planning, preparation, and perseverance are to ensuring that your organization can survive a disaster. Let's look at some best practices that will help you prepare for the worst.
Strategies for Disaster Prevention
Dealing with disaster takes many forms, but I've come up with my top five best practices for disaster prevention. Adhering to these practices will help you plan and implement a disaster-prevention plan for your organization.
After planning for—or at least considering—the worst possible disasters and determining what steps you can take to minimize the risk, evaluate the cost-benefit ratio of taking those steps. If protecting your data against a catastrophic but low-probability event—such as a meteor strike—would cost a lot, you might decide to take the risk and not provide specific protection for that event. Document your decision so that coworkers and successors know that you at least considered the possibility. However, you might discover that protecting your data against some low-probability events doesn't cost much. For example, you might already store database backups in a location across town, but what would happen if the meteor destroyed the entire town? If your company already ships regular backups for another system or application to another town, the cost of adding tapes of your SQL Server backups to that shipment might be negligible. So although the likelihood of a meteor strike might be small, the cost to save your data if that unlikely event occurs might also be small.
Also, be sure to plan for the infamous "user error" in your list of possible disasters. User errors can be the most difficult disasters to prevent and can take the longest to recover from. Users commit such errors as accidentally deleting data rows, tables, or an entire database. SQL Server makes it easy for users with high levels of privilege to do lots of damage. Although referential integrity, schema binding, and triggers can help prevent inadvertent deleting of rows or dropping of tables, SQL Server provides little protection against dropping an entire database. Unlike in earlier SQL Server releases, in which the DROP DATABASE command didn't actually remove the physical files from the file system, when you drop a database in SQL Server 2000 or 7.0, SQL Server deletes the disk files holding the data, so the database is really gone.
Users can also create havoc by implementing the wrong procedure. Many procedures have similar names, and if a typing error causes a user or an application to call the wrong procedure, you might not notice the damage for some time. You'll lose more time while you track down the problem, and even more time before you start correcting the problem and undoing the damage.
Note that ineffective security can foster user errors. Weak or nonexistent passwords, along with allowing too many users too much privilege, can compromise system and data integrity. Accidental deletions are also more likely when users have higher permission levels than they need.
After you've created operational processes, you need to verify that your processes do what you need them to do. Make sure your test servers are configured exactly like your production machines—with the same hardware and the same workload—so that the test machines can accurately predict the capabilities of the production machines in the event of a disaster. When you run simulations of complete system failure and measure how long a complete system rebuild and restore takes, be sure that the time the test machines take to complete the operations is an accurate predictor of how long the operations will take on your production system.
Many solutions for avoiding downtime depend on your hard-disk system. The disks are the most active part of any system—and no system is foolproof. Even if you implement the best RAID system, the hardware system could still fail and make the entire array unavailable. With RAID, you also need to remember that fault tolerance might apply to only one drive failure, so you have to consider the possibility of a second drive failing. Don't get complacent and think that the chance of a second drive failing soon after the first is remote. If you purchased all your drives at the same time, the other drives are the same age as the failed drive and might also be nearing failure. And remember that even the most redundant drive system won't protect you against user errors. If a user accidentally drops a production table, all the redundant drives will reflect that drop. If you don't know a drive has failed and you don't replace the drive before a second failure occurs, you risk complete system failure. Make sure that you regularly check your logs for a message in the log that tells you a drive has failed or that you receive automatic notifications when failures occur.
The two most common warm-standby solutions are log shipping and replication. A great feature of both of these solutions is that they can have serendipitous performance benefits. For example, in some cases, log shipping can provide a second read-only copy of your data that you can use to off-load some of the reporting work from the production server. And depending on the type of replication you implement, you can frequently use replication to distribute your entire workload across more systems so that each system performs more effectively.
No one solution or plan can provide complete disaster protection for your system, so you need to put in place people, processes, and technology that combine to give your organization the greatest possible chance for survival. These suggested practices can give you a place to start planning for the survival of your organization's precious data.
Document all your planning decisions (including your list of possible disasters) and recovery procedures, and store copies of the documentation off site. Remember that high system availability involves people and processes as much as it involves technology. If you don't clearly define the processes for disaster prevention and disaster recovery, or if the people involved don't understand the processes, all the technology in the world won't help you.
Although you could consider planning a backup-and-recovery strategy as part of one of the previous best practices, I list this as a separate strategy because it's so crucial—even if you do nothing else, you must save the data. If your only management experience with SQL Server is through Enterprise Manager, you might think backup is a simple one-step operation or that once you define a maintenance task, you don't need to think any more about it. But if you want to be truly prepared to avoid disastrous loss of data or productivity, you need to learn all you can about SQL Server's backup and restore capabilities. SQL Server has three recovery modes, each of which affects backup and restore operations in different ways. SQL Server lets you back up the entire database, the transaction log, and individual files or filegroups. In addition, you can make differential backups of the database that let you back up just the data that has changed since the last full database backup. And in SQL Server 2000, you can also perform differential file or filegroup backups.
- Make a list of possible disasters.
- Create effective operational processes.
- Understand and implement technologies to prevent downtime.
- Consider warm-standby solutions.
- Plan your backup-and-recovery strategy.
SQL SERVER MAGAZINE ARTICLES
"All About Restore," May 2002
"Safe Transit," September 2002
"Two-Way Replication with SQL Server 7.0," July 2000
MICHAEL D. REILLY
"Top 11 Backup Tips," September 2001
"Split-Mirror Backup and Restore" December 2001
"Log Shipping in SQL Server 2000, Part 1"
"Log Shipping in SQL Server 2000, Part 2"
KIMBERLY L. TRIPP
"Before Disaster Strikes," September 2002