It's 2:00 A.M. on a Monday morning, and your cell phone rings. The water fountain on the floor directly over your server room has malfunctioned, and your organization's servers and routers are standing in water, as are most of your employees' workstations. The office opens at 8:00 A.M. What do you do in the meantime?
Situations like this one separate IT departments that have planned for disaster from those that haven't. For the latter group, the situation I've described is more than a disaster; it's a catastrophe. When total data loss is possible, the absence of a disaster recovery program can put a business at risk, particularly a small-to-midsized business (SMB), which often doesn't have the financial wherewithal to survive unexpected catastrophic events. Although disasters themselves are, to a degree, unavoidable, being prepared for them is completely within your control. Increasingly, IT has become the focal point of many companies' disaster planning. Creating a program to preserve business continuity and recover from disaster is one of the central value propositions that an IT department can contribute to an organization.
A 6-Step Plan
In the terminology of disaster planning, two phrases are common: Business Continuity Planning (BCP) and Disaster Recovery Planning (DRP). Although many people use these phrases interchangeably, they represent different concepts. BCP traditionally refers to planning that ensures an organization can continue operating when faced with adverse events. DRP is a subset of BCP and traditionally focuses on recovering information and systems in the event of a disaster. As an example, the failure of a hard disk in a database server is an event that potentially affects business continuity but doesn't result from a disaster. By contrast, a water-pipe break that floods a server room and submerges the database server is both a threat to business continuity and within the scope of disaster recovery planning.
BCP and DRP can be complex; in fact, large organizations dedicate groups of people to them. But without getting into detailed risk analyses and other complexities that usually accompany BCP and DRP in large companies, all organizations can benefit by following six steps to create a program that will preserve business continuity and facilitate recovery in the event of disaster.
Step 1: Identify Critical Business Activities
The first step in BCP and DRP is to identify your organization's critical business activities—those things that must occur on a daily basis in order for your business to stay in business. For example, a customer service call center must be able to receive calls, look up customer records, and create new incident records for customers calling in. A law firm will need to be able to access client information and electronic schedules, send and receive email, research online law libraries, and make and receive telephone calls. As you work through this step, you'll need to partner with your organization's key business decision makers to identify the activities that are essential to your organization's continued functioning. Your organization's BCP will center on preserving continuity of operations by recovering these services.
Step 2: Map IT Systems to Critical Business Activities
With the identification of your organization's key business activities, you can determine which IT systems these activities depend on. For example, to enable the customer service call center to look up customer records and create new records for incoming calls, the database servers that store the records and the line-of-business applications that access them must be available. In turn, some degree of core network infrastructure will also need to be operable for this critical business activity to take place. These are the IT systems that you must be able to keep operating by quickly recovering them after a disaster.
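One way to keep this mapping honest is to write it down as a small dependency table and let a script work out everything an activity ultimately relies on. The following Python sketch assumes a hypothetical DEPENDS_ON table built from the call center example; the entry names and the required_systems helper are illustrative, not part of any particular tool.

```python
from collections import deque

# Hypothetical dependency table: each activity or system lists what it
# depends on directly. The names are illustrative placeholders.
DEPENDS_ON = {
    "look up customer records": ["line-of-business app"],
    "create incident records": ["line-of-business app"],
    "line-of-business app": ["customer database server", "core network"],
    "customer database server": ["core network"],
    "core network": [],
}


def required_systems(activity, depends_on):
    """Return every system the activity needs, directly or indirectly."""
    seen = set()
    queue = deque(depends_on.get(activity, []))
    while queue:
        system = queue.popleft()
        if system in seen:
            continue  # already recorded via another dependency path
        seen.add(system)
        queue.extend(depends_on.get(system, []))
    return seen


critical = required_systems("look up customer records", DEPENDS_ON)
print(sorted(critical))
# ['core network', 'customer database server', 'line-of-business app']
```

Running the same function for each critical activity and combining the results gives you the list of systems that your recovery plans have to cover.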
Step 3: Model Threats Posed by Predictable and Plausible Events
Nearly all disasters and failures in business continuity are predictable to a certain degree of precision and plausible within a certain degree of reason. Such events can be natural, such as an earthquake or flooding; human-caused, such as an accidental fire or deliberate sabotage; or mechanical, such as a hard disk failure or a water pipe bursting. For example, if a customer service call center is located in Wakita, Oklahoma, it is plausible that the center's IT systems could be in the direct path of a tornado. Likewise, for any company that relies on technology, it is predictable that computer hardware will eventually fail.
After you identify your critical IT systems, you can begin modeling the threats posed to these systems by predictable and plausible events. Threat modeling gives you a structured approach to identifying the threats with the greatest potential impact on your business continuity and deciding how to mitigate them. List all the ways that critical IT systems might be disrupted and the events that must happen for each threat to be realized. For example, something that would disrupt the call center's business continuity might be the customer record database's inaccessibility. Events that could cause such inaccessibility include computer hardware failure, a power failure, or something more severe, such as destruction of the data center by a tornado.
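A threat list like this can be captured in a few lines of code so that it is easy to sort and revisit. The Python sketch below is only an illustration: the 1-to-5 likelihood and impact scales, the simple likelihood-times-impact score, and the numbers assigned to each event are assumptions made for the example, not measurements or a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    disruption: str   # what stops working
    event: str        # what has to happen for the disruption to occur
    likelihood: int   # assumed scale: 1 (rare) to 5 (expected)
    impact: int       # assumed scale: 1 (minor) to 5 (business-ending)

    @property
    def score(self):
        # Simple illustrative ranking heuristic, not a formal risk analysis.
        return self.likelihood * self.impact


# Placeholder values for the call center example; replace with your own.
threats = [
    Threat("customer database inaccessible", "power failure", 3, 3),
    Threat("customer database inaccessible", "tornado destroys data center", 1, 5),
    Threat("customer database inaccessible", "hardware failure", 4, 3),
]


def rank(threats):
    """Return the threats ordered from highest to lowest score."""
    return sorted(threats, key=lambda t: t.score, reverse=True)


for threat in rank(threats):
    print(f"{threat.score:>2}  {threat.disruption}: {threat.event}")
```

With these placeholder numbers, hardware failure ranks first and the tornado last. The value of the exercise is less in the exact scores than in forcing every disruption to be paired with the events that could cause it.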
Step 4: Develop Plans and Procedures for Preserving Business Continuity
Now that you've listed your critical business activities, identified the IT systems your business depends on for carrying out those activities, and brainstormed the predictable and plausible events that could disrupt IT services, you can use your threat model to determine countermeasures to preserve business continuity. Four primary BCP countermeasures exist: fault tolerance and failover, backup, cold spares and sites, and hot spares and sites.
Fault tolerance and failover. This countermeasure relies on the use of redundant hardware to enable a system to operate when individual components fail. In IT, the most common fault tolerance and failover solutions for preserving IT operations are hard disk arrays, clustering technologies, and battery or generator power supplies.
Backup. On- and offsite backup programs are a core countermeasure in DRP. Backup gives you the ability to restore or rebuild recent data to a known good state in the event of data loss.
Cold spares and sites. Cold spares are offline pieces of equipment that you can easily prepare to take over operations. For example, you might maintain a set of servers that aren't connected to your network and that have your company's standard OS installed and configured. In the event of an emergency, you can complete the configuration and restore or copy necessary data to resume operation. Similarly, a cold site is a separate facility that you can use to resume operation if a disaster befalls your primary facility. Often, a cold site is nothing more than a large room that can accommodate desks and chairs. For most SMBs, cold sites aren't cost-effective.
Hot spares and sites. Hot spares are pieces of equipment that are ready for immediate use after a disaster. For example, you might continuously replicate a critical database's data to remote facilities so that client applications can be redirected to the data replicas if necessary. Hot sites are facilities that let you resume operations in a very short amount of time—typically, a hot site is operational within the time it takes for employees to arrive at the facility. Hot sites have real-time or near real-time replicas of data and are always operational. Because hot spares and sites are expensive to maintain, only organizations that must be operational in a disaster, such as a public safety organization, use them.
Step 5: Develop Plans and Procedures for Recovering from Disaster
Not all events are predictable or plausible. There is perhaps no better example of this kind of event than the September 11, 2001, attack on the World Trade Center. For these types of disastrous circumstances, as well as for other severe disasters in which total data or service loss from primary systems is possible, you must create plans and procedures for recovering systems. Because recovering from a disaster is stressful, having well-documented, tested, and practiced procedures in place beforehand is essential. Rehearsing recovery procedures also helps you verify that the data on backup media is usable and restorable. Be sure to store copies of your disaster recovery procedures offsite with your verified backups. For most organizations, bank safe deposit boxes are the most effective, affordable, and secure remote storage solution for verified backups and disaster recovery plans.
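One simple way to confirm that a restored backup is usable is to record a checksum for every file when the backup is made and compare the restored files against that record. The Python sketch below illustrates the idea with SHA-256; the function names, the in-memory manifest, and the customers.db demo file are hypothetical, and a real backup product may offer its own verification features.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large backups don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest, restored_root):
    """Return (path, problem) pairs for files that are missing or altered."""
    restored_root = Path(restored_root)
    problems = []
    for rel_path, expected in manifest.items():
        candidate = restored_root / rel_path
        if not candidate.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(candidate) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems


# Demo with throwaway directories standing in for source data and a restore.
with tempfile.TemporaryDirectory() as source, tempfile.TemporaryDirectory() as restored:
    (Path(source) / "customers.db").write_bytes(b"record-1\n")
    manifest = build_manifest(source)

    (Path(restored) / "customers.db").write_bytes(b"record-1\n")
    clean_result = verify_restore(manifest, restored)

    (Path(restored) / "customers.db").write_bytes(b"corrupted")
    damaged_result = verify_restore(manifest, restored)

print(clean_result)    # []
print(damaged_result)  # [('customers.db', 'checksum mismatch')]
```

In practice the manifest would be saved alongside the recovery procedures so that it survives the loss of the primary site.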
Step 6: Test Business Continuity Plans and Practice Disaster Recovery
Test, test, test. BCP and DRP exist for circumstances in which failure isn't an option, so the plans, procedures, and technologies you use to preserve business continuity must work when they are required. Conduct planned and spontaneous drills to test your BCP and DRP. These drills might include failing over cluster nodes on a monthly basis, restoring cold spare servers periodically, or even conducting full cold- or hot-site disaster simulations. At an absolute minimum, periodically perform a test restoration of critical data from offsite backup media. Offsite backup media is your last line of defense against total data loss.
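Drills are easy to postpone, so it can help to track when each one was last run. The following Python sketch flags overdue drills. The 30-day failover interval follows the monthly suggestion above, but the 90-day intervals, the drill names, and the sample dates are assumptions chosen only for illustration.

```python
from datetime import date, timedelta

# Drill intervals. Only the monthly failover comes from the text above;
# the 90-day values are assumed placeholders for "periodically".
DRILLS = {
    "fail over cluster nodes": timedelta(days=30),
    "restore cold spare server": timedelta(days=90),
    "restore critical data from offsite media": timedelta(days=90),
}


def overdue_drills(last_run, today, drills=DRILLS):
    """Return the drills never run or last run longer ago than their interval."""
    overdue = []
    for name, interval in drills.items():
        last = last_run.get(name)
        if last is None or today - last > interval:
            overdue.append(name)
    return overdue


# Hypothetical drill log; the offsite restore has never been rehearsed.
last_run = {
    "fail over cluster nodes": date(2024, 1, 10),
    "restore cold spare server": date(2023, 9, 1),
}

print(overdue_drills(last_run, today=date(2024, 2, 1)))
# ['restore cold spare server', 'restore critical data from offsite media']
```

A drill that has never been run is treated as overdue, which keeps the offsite restoration test from quietly dropping off the list.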
6 Steps Away from Disaster
By following these steps, you can help your organization create a BCP and DRP program that will shield it from the risk of natural, human-caused, and mechanical disasters. When the cell phone rings at two in the morning, the last thing you want to be doing is brainstorming ways to recover data from a server and backup tapes that have been under water for 30 hours or, even worse, recovering from the physical destruction of your data center after disastrous circumstances.