Disaster Recovery Plan Testing 101

You’ve written your disaster plan and distributed it to your staff. You’ve included all the points required for a decent plan: assigning a disaster recovery team and coordinator, creating detailed recovery procedures and instructions, and building call trees for employees and vendors. You’ve covered hot topics such as pandemic planning and long-term backup power. Now you sit back and wait for a disaster to show how ready your systems are for it, right?

Not so fast. True, you’ve done the hardest part, which is getting your plan ready. But how ready is it? One way to know is to put it through its paces and test it before it’s really needed.

But it’s not enough to test your backup tapes once in a while or cycle your generator once a year. Testing your disaster plan thoroughly involves testing all systems, people, and processes for their readiness and resiliency to help you see the gaps in your plan (and every plan has gaps). Testing also verifies that the information in your plan is correct, and it lets you improve your plan over time so that it’s a living document, not a dusty binder on a shelf.

So where do you get started? Let’s talk about the various levels of disaster recovery testing routines and the proper process to follow when testing your plans so as to get the most out of your testing.

Ways to Test Your Disaster Recovery Plan
Just as there are many types of disasters (natural, man-made, and the most common and most likely, hardware or software failure), there are many types and levels of disaster plan testing, ranging from simple to more involved (and typically more expensive). If you haven’t done much disaster recovery planning, you can start off with the simple methods and work your way up. Eventually you can develop a multi-year plan to do fully integrated tests of your disaster recovery plan every year.

Disaster Recovery Checklist. Develop a simple checklist and walk through it to make sure that every item is in place. This is not unlike the hurricane preparedness checklists that those of us on the Gulf Coast consult every year during hurricane season.

Your list might contain items such as “Generator working: check,” “Fuel for generator stored safely or under contract: check,” and “Backup tapes stored off site: check.” This is a simple process that you can do with minimal time and staff involvement. You should be doing this step no matter what.
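If you would rather keep the checklist in a script than on paper, a minimal sketch might look like the following. The class and function names are hypothetical, and the done flags are made up for illustration; only the three item descriptions come from the examples above.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    description: str
    done: bool = False


def open_items(items):
    """Return descriptions of checklist items that are not yet confirmed."""
    return [item.description for item in items if not item.done]


# Item text is from the article; the done flags are invented for this example.
checklist = [
    ChecklistItem("Generator working", done=True),
    ChecklistItem("Fuel for generator stored safely or under contract", done=True),
    ChecklistItem("Backup tapes stored off site", done=False),
]

for description in open_items(checklist):
    print("NOT CONFIRMED:", description)
```

Anything the script prints is an item somebody still has to walk over and verify.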

Disaster Recovery Walk-Through. Kick your disaster recovery testing regimen up a notch by involving your staff and walking through your disaster recovery plan with all key players present. Do a simple group reading of your plan, making sure everyone is aware of all elements. It also gives staff members the opportunity to ask questions and voice concerns about the plan.

As simple as it seems, many companies fail to do this basic step. Making sure everyone has read the plan in a group setting is vital to understanding and retention so everyone knows what to do when the time comes. Call trees should be walked through to make sure they make sense. Vendor lists and other information should be examined to make sure all the data in the plan is up to date.
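One way to check that a call tree “makes sense” before the meeting is to confirm that everyone on the staff list would actually get a call. The sketch below assumes the tree is written down as a simple mapping from each caller to the people that caller phones; the names and the function are hypothetical.

```python
from collections import deque


def unreachable_staff(call_tree, root, staff):
    """Return staff members nobody in the call tree would end up phoning.

    call_tree maps a caller to the list of people that caller phones.
    """
    reached = {root}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in call_tree.get(caller, []):
            if callee not in reached:
                reached.add(callee)
                queue.append(callee)
    return sorted(set(staff) - reached)


# Hypothetical names, for illustration only.
call_tree = {
    "coordinator": ["ops_lead", "network_lead"],
    "ops_lead": ["backup_admin"],
}
staff = ["coordinator", "ops_lead", "network_lead", "backup_admin", "dba"]
print(unreachable_staff(call_tree, "coordinator", staff))
```

In this made-up tree the DBA never gets a call, which is exactly the kind of gap a walk-through is meant to surface.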

Disaster Recovery Tabletop Test. This test extends the walk-through test, adding staged scenarios to see how the plan would work in real-world circumstances. Forcing your team to actually discuss what they would do in certain circumstances puts stress on the plan and can show the gaps.

By throwing these scenarios at your staff, you can see how your plan allows for different circumstances and unexpected situations. The mock scenarios can go from simple to actual simulated situations. Don’t publish the scenarios beforehand, but spring them on your staff. This mimics the way a real disaster comes upon us.

You can also come up with certain “curveballs,” such as a faulty generator or a backup failure. See what happens when things don’t go according to plan. In my tabletop tests, I’ve had groups draw straws to remove selected staff members who are “incapacitated” by flu or some other pandemic illness to see how the plan holds up under staffing losses.

When the person who knows everything is suddenly unavailable (as often happens in a disaster), who takes over and does he or she have access to everything that’s needed? These are questions a tabletop test can answer about your disaster recovery plan.
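The draw-straws exercise can also be rehearsed on paper ahead of time. The following is only a sketch, assuming you have a list of recovery tasks and the people able to perform each one; the task names, staff names, and both functions are hypothetical.

```python
import random


def draw_straws(staff, count, seed=None):
    """Pick `count` staff members at random to sit out the exercise."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(staff), count))


def uncovered_tasks(task_owners, unavailable):
    """Return tasks for which every listed owner is unavailable."""
    return sorted(
        task
        for task, owners in task_owners.items()
        if all(owner in unavailable for owner in owners)
    )


# Hypothetical tasks and names, for illustration only.
task_owners = {
    "restore file server": ["alice", "bob"],
    "start generator": ["carol"],
    "contact vendors": ["alice", "dave"],
}
staff = {name for owners in task_owners.values() for name in owners}
out = draw_straws(staff, 2)
print("Unavailable:", sorted(out))
print("Tasks with nobody left:", uncovered_tasks(task_owners, out))
```

Any task that shows up with nobody left points to a person who needs a trained backup.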

Disaster Recovery Technical Tests. Here is where disaster recovery testing gets interesting: No more meetings in conference rooms—you’re testing real systems in real-life situations. It can run the gamut from simple backup media tests to complex, hot site operational switchovers.

This is where you find out what systems you can recover successfully, according to your written plan. Most companies do some level of technical testing but many could do a lot more. Let’s examine technical testing in depth and look at some guidelines for testing.

Technical Testing Your Disaster Recovery Plan
There are two levels of technical testing: parallel and live. Parallel testing is where you back up or restore a system that’s running parallel to your production system, so you don’t affect any regular processing. This is the safest way to test your technical systems.

However, it does require that you already have redundant servers in place or are willing to fund spare servers. And parallel testing doesn’t truly assure you that you can recover the production system.

Live testing involves actually downing the main system and attempting to recover it. This type of testing is also known as “full interruption” testing. It gives you a true measure of a system’s recoverability. However, it’s expensive in terms of down time, and it’s risky: What if you can’t recover the production system?

Some situations won’t allow a true live test, since a failed test could be as bad as a real disaster and put lives at risk. For example, some healthcare, government, and military systems (e.g., air traffic control) can’t be live tested due to public safety or regulatory concerns.

You can do technical disaster recovery tests on many different systems, though you usually don’t test all your systems at once due to the risks and complexity. Most companies rotate their different technical tests, doing one a quarter or one every six months so that they get through all technical system tests every year or two. Here are some basic types of testing to include in your technical disaster recovery testing regimen:

Backup media restoration. There are two main ways to test backup and restore. The first one involves doing random data item restores such as restoring a few files from selected file folders. This tests the integrity of your backup media. You should do this with some regularity and not wait for a formal disaster recovery test, though you’ve probably already done this on the job for some hapless employee who deleted something by mistake. However, don’t wait for the opportunity—schedule it in with your normal weekly or monthly log review.
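A spot-check restore is only useful if you confirm the restored files are intact. One common approach, sketched here under the assumption that the original files have not changed since the backup ran, is to compare checksums of the originals and the restored copies. The function names and directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_restores(source_dir, restored_dir, relative_paths):
    """Return the relative paths whose restored copy is missing or differs."""
    bad = []
    for rel in relative_paths:
        original = Path(source_dir) / rel
        restored = Path(restored_dir) / rel
        if not restored.is_file() or sha256_of(original) != sha256_of(restored):
            bad.append(rel)
    return bad
```

You might call it as mismatched_restores("/data", "/restore_test", ["reports/q1.xls"]), where both paths and the file name are placeholders. An empty result means every sampled file came back byte-for-byte identical.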

The second type of testing involves actually restoring an entire server. This ensures you’re backing up everything you need and in the right manner. Sometimes a server has to be restored in a particular order (OS first, then database, then application program). Often, complicated programs such as SQL Server or Exchange Server don’t react well to being put on different hardware or OS versions than they were on originally.

Restoring a server can involve two different levels of difficulty: You can restore onto an existing similar server or do a bare-metal restore, restoring totally from scratch with only your backup media to work with. Using ghosting or disk image software can make this process a little easier.

Both of these methods of restoring a server require some kind of backup program and backup servers to work with. And both of them will reveal any flaws in your backup plan and show additional complications or time that it might take to do your restores. Testing is when you should find these things out, not when the building burns down with your servers in it.

Failover and failback. If you’re running in a redundant or high-availability environment, you should regularly test that capability by initiating a failover operation. Make sure that not only does your system fail over to its backup but that you can successfully fail back to your main production system when the disaster is over.
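It helps to time both directions of the test. The sketch below polls a health check until the service answers and reports how long that took. It assumes a plain TCP connection is an acceptable sign of life, which may be too weak for your application; the function names, host, and port are hypothetical.

```python
import socket
import time


def tcp_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_healthy(check, timeout_s, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Example use (placeholder host and port), run right after you trigger failover:
# elapsed = wait_until_healthy(lambda: tcp_port_open("standby.example.com", 1433), 600)
```

Run the same measurement again after failing back, and record both numbers alongside the targets in your plan.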

Test of power backup (generator/UPS). All the backups and redundant servers in the world won’t do you any good if your computer center doesn’t have power. Most organizations have a good uninterruptible power supply (UPS) system as the first line of defense and a generator for long-term power backup when grid power is out.

You should test your UPS systems for their ability to carry load. The batteries in these units typically don’t last more than a few years. You are probably constantly adding equipment to your racks as well.

Make sure that your UPS systems keep up. They should hold your entire computer center long enough for your generators to kick in or for you to safely power down your computers if you don’t have a generator.

The easiest way to test this in a smaller computer center is to pull the plug and see what happens (make sure you’re ready for downtime first). In larger environments, monitoring and testing software can assist with this.
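Before you pull the plug, a back-of-the-envelope estimate can tell you whether the UPS is even in the right range. The sketch below assumes runtime scales linearly with stored energy and load, which real batteries do not do exactly, so treat the result as a rough guide and trust the live test. The efficiency and health factors, the safety margin, and all the numbers are assumptions for illustration.

```python
def estimated_runtime_minutes(battery_wh, load_watts,
                              efficiency=0.9, battery_health=1.0):
    """Rough UPS runtime estimate; real runtime curves are not linear."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    usable_wh = battery_wh * efficiency * battery_health
    return usable_wh / load_watts * 60


def ups_covers_gap(runtime_minutes, required_minutes, safety_factor=2.0):
    """True if the runtime exceeds the required window with a safety margin."""
    return runtime_minutes >= required_minutes * safety_factor


# Made-up numbers: a 2000 Wh battery bank at 80% health carrying a 3000 W load.
runtime = estimated_runtime_minutes(2000, 3000, battery_health=0.8)
print(round(runtime, 1), "minutes")
print(ups_covers_gap(runtime, required_minutes=5))
```

Here required_minutes stands for either the time your generator needs to take the load or the time you need for a clean shutdown, whichever applies to your site.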

Generators should be regularly started up and serviced. Again, some of the larger units can be programmed to do this automatically. But you might want to force the issue and cut the building power and see how fast the generator kicks on.

You should probably run it for an extended period (say a day or more) to monitor fuel usage and heat and exhaust dissipation, to make sure it will run for the long term. Many companies in Houston were running on generator power for weeks after Hurricane Ike. Finally, make sure you have sufficient fuel for your generators. If your fuel vendor fails to show up, then you are out of luck.
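The fuel readings from an extended run can be turned into a simple projection. This sketch assumes you log elapsed hours and gallons remaining during the test and that the load stays roughly constant; the readings and function names are made up for illustration.

```python
def burn_rate(readings):
    """Average fuel used per hour between the first and last reading.

    readings is a list of (elapsed_hours, gallons_remaining) tuples.
    """
    (start_h, start_gal), (end_h, end_gal) = readings[0], readings[-1]
    if end_h <= start_h:
        raise ValueError("readings must span a positive amount of time")
    return (start_gal - end_gal) / (end_h - start_h)


def hours_until_empty(gallons_remaining, gallons_per_hour):
    """Project how long the remaining fuel lasts at the measured burn rate."""
    if gallons_per_hour <= 0:
        raise ValueError("gallons_per_hour must be positive")
    return gallons_remaining / gallons_per_hour


# Made-up readings from a 24-hour test run.
readings = [(0, 500), (12, 410), (24, 320)]
rate = burn_rate(readings)
print(rate, "gallons per hour")
print(round(hours_until_empty(320, rate), 1), "hours left on the current tank")
```

Compare the projected hours with how long your fuel vendor realistically needs to deliver after a regional disaster.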

Hot/warm site. If you have contracted with an outside company or plan to use your own facilities for a hot or warm site recovery, you should test these capabilities on a regular basis. Most companies that offer such services will allow you to do this and should be able to accommodate you, though there may be charges for such a service. If they don’t, you should question their ability to provide the service.

A standard test of the service involves cutting over to the recovery site and having staff on hand to process a set of sample transactions. The closer you can get to regular work volume, the more you will get out of your test.

Don’t forget the staffing element of all these plans. Your test should be done with personnel in place. Make sure everyone knows what to do and that different team members are sufficiently cross-trained.

Operations Testing. The best way to test your disaster recovery plan technically is with real users doing real production. If possible, non-IT people should be part of your technical tests. Just bringing the server up and being able to log into it as an administrator isn’t a true test: Put real users on it and make sure it can handle them with no hidden complications.

Common issues include bandwidth and processor capabilities on backup servers, authentication and user rights, and outside connectivity. If you’re testing as a single administrator, you aren’t going to see some of these errors. Being local to the server (on the console) will make you miss most connectivity issues.

Some environments might not allow this kind of risk of downtime, so you might have to assemble a group of test users and a test data set. Make sure these test users are sufficiently heterogeneous (sales, accounting, field service), from different parts of the network, and using a diverse set of features.
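Picking the test group at random from each department is one way to keep it from turning into “whoever sits near IT.” The sketch below assumes you can export a list of user names with their departments; the names and the function are hypothetical, and it covers only the department mix, not network location or feature usage.

```python
import random
from collections import defaultdict


def pick_test_users(users, per_department=1, seed=None):
    """Pick up to `per_department` users at random from every department.

    users is a list of (name, department) tuples.
    """
    rng = random.Random(seed)
    by_department = defaultdict(list)
    for name, department in users:
        by_department[department].append(name)
    chosen = {}
    for department, names in sorted(by_department.items()):
        chosen[department] = rng.sample(names, min(per_department, len(names)))
    return chosen


# Hypothetical user list, for illustration only.
users = [
    ("ann", "sales"), ("raj", "sales"),
    ("lee", "accounting"),
    ("kim", "field service"), ("omar", "field service"),
]
print(pick_test_users(users))
```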

Testing Methodology
After you figure out what kind of test you want to do and when, it’s time to plan your test. Badly planned disaster recovery tests can turn into real disasters in a hurry when a downed system fails to come back. You will get a lot more out of your test if you follow these steps for successful disaster recovery testing:

Plan. Think about what could go wrong. In other words, have a disaster recovery plan for testing your disaster recovery plan, especially for mission-critical apps.

Write down your test plan. Account for any possibilities where it could go wrong, and note what you expect to get out of the test. Specify what counts as a successful test or an unsuccessful test. Usually there are multiple categories, and a test is not a simple failure or success.

Also, make sure you have a plan for how to return to normal operations. If you failed over to a backup server, how and when do you fail back to the production unit? The devil is in the details, and the more details you have covered, the more likely your test will run safely and successfully.
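Writing the success criteria down in a form with more than two outcomes keeps the “it sort of worked” results from getting lost. The following is a minimal sketch, assuming each criterion is a time target with a hard limit; the three outcome categories, the class names, and the numbers are hypothetical, and your plan may need different categories.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    PARTIAL = "partial"
    FAIL = "fail"


@dataclass
class Criterion:
    name: str
    target_minutes: float  # goal stated in the test plan
    limit_minutes: float   # beyond this the criterion counts as failed


def grade(criterion, actual_minutes):
    """Grade one measured result against its written criterion."""
    if actual_minutes <= criterion.target_minutes:
        return Outcome.PASS
    if actual_minutes <= criterion.limit_minutes:
        return Outcome.PARTIAL
    return Outcome.FAIL


# Made-up criterion and measurement.
restore = Criterion("File server restored", target_minutes=120, limit_minutes=240)
print(restore.name, "->", grade(restore, 150).value)
```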

Notify. Make sure that you notify all potentially affected users of the system that you intend to test it and how it might affect their productivity. Most tests are done on the weekends or late at night to minimize any downtime. Another benefit of this approach is that if something does go wrong, you’ve got some time to make it right before the mass of employees show up at work.

Of course, in some environments, such as a hospital, there really is no “good” time for downtime. Also, some situations might call for doing midweek tests when vendor representatives are available or tech support can be expected. If your test goes longer than you expect or you encounter problems that will affect users, make sure you update them with progress reports and expected return to normal operations.

Execute. Execute your test according to your plan and use your written disaster recovery plan to recover. Does the recovery track according to plan? Are steps left out or not well documented? Here is where institutional knowledge can be developed and put to paper so all can benefit. And that leads to the next step.

Record. Make sure someone is assigned to record the test and its results. If possible, have a report format to capture the results so that you won’t be dealing with someone’s unintelligible notes after the fact.
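A report format does not have to be elaborate. The sketch below shows one possible structure that the recorder could fill in step by step and save as JSON; the field names and sample entries are assumptions, not a standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepResult:
    step: str
    worked: bool
    notes: str = ""


@dataclass
class DrTestReport:
    system: str
    test_type: str
    recorder: str
    steps: list = field(default_factory=list)

    def to_json(self):
        """Serialize the whole report, including nested step results."""
        return json.dumps(asdict(self), indent=2)


# Made-up entries, for illustration only.
report = DrTestReport(system="file server", test_type="parallel", recorder="pat")
report.steps.append(StepResult("Restore OS from image", worked=True))
report.steps.append(
    StepResult("Restore data volume", worked=False, notes="Tape 3 unreadable")
)
print(report.to_json())
```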

Documenting disaster recovery tests is one of the areas where most companies fall down. If you don’t have a record of what happened, how can you expect to learn from it afterwards? And that leads to the next step, which is the “lessons learned” meeting.

Review and improve. Now you dissect the test, see what went right, what went wrong, and how to do better next time. Closing the loop on your test in this manner is the best way to get future benefit from your tests.

Make sure you assign specific action items to address, then review those items to make sure they got taken care of. Do it fairly soon after the test so details are still fresh in everyone’s mind. This cycle of review and improvement is the final step in making sure your disaster recovery plan evolves into the future.
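Action items are easier to follow up on when each one has an owner and a due date. Here is a minimal sketch of such a list with a check for overdue items; the class, the function, and the sample items are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return the items that are still open past their due date."""
    return [item for item in items if not item.done and item.due < today]


# Made-up items, for illustration only.
items = [
    ActionItem("Replace UPS batteries", "facilities", date(2024, 3, 1)),
    ActionItem("Update vendor call list", "coordinator", date(2024, 2, 1), done=True),
    ActionItem("Document database restore order", "dba", date(2024, 6, 1)),
]
for item in overdue(items, today=date(2024, 4, 1)):
    print("OVERDUE:", item.description, "(owner:", item.owner + ")")
```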

Testing, Testing
Obviously your testing can involve endless variations with different configurations, systems, applications and organization types. But in the end, the concept is the same: Test, test, then test some more.

The more you test, the more likely you will be ready when the real disaster occurs. And as we all know, it’s not a matter of if, but when that hurricane hits or that system crashes or whatever disaster Fate has in store for your organization.
