For Seth Copeland, taking the job of IT director for Tanner & Haley Resorts in Kansas City, Missouri, meant jumping into the development of a disaster recovery plan to support the company's 160 users. "When I first took over this position in October 2005, there was really no plan; we simply used tape backup," Seth said. Adding to the complexity of implementing such a plan, Tanner & Haley filed for Chapter 11 bankruptcy in July 2006 (and is in the process of being acquired by Ultimate Resort), just as the newly developed disaster recovery plan was going into effect. To take advantage of the cost and space savings, Seth decided to implement a virtualized disaster recovery solution. I spoke with Seth about how he used virtualization to provide a budget-friendly, two-tiered approach to high availability and recovery.
What triggered the development of a disaster recovery plan for the company at this time?
For me, the motivator event was when a tornado touched down not too far from the office. Most disasters in Kansas City are the local ones, tornados, that type of thing. I think there may be a fault line somewhere, but I don't know. The Missouri River might flood, but we're pretty far from the river. If somebody nukes Kansas City, then we're probably out, but I don't worry too much about that.
Explain your current disaster recovery solution. Does the plan define procedures according to a disaster's severity?
Our plan uses VMware products and Double-Take Software's Double-Take for Virtual Systems. I have a server here in my data center that runs VMware GSX Server, and I replicate all my individual servers onto it locally. The GSX Server machine serves as my local, high-availability server, in case I lose a single box. If my mail server dies or my database server goes down, the local high-availability server takes over. Additionally, all those virtual servers on the local high-availability box then replicate over to our Overland Park, Kansas, site via the Double-Take software, to four boxes there, on which we put about four or five servers each using VMware GSX Server. The virtualized server here at my main data center takes care of a single server failure. The Overland Park site takes care of site failure.
My two main concerns were how fast could I get the system back up and whether the virtual site could handle everybody coming into it. Could it handle the load? And it did. For our recovery time objective (RTO), we generally shoot for about two hours, losing no more than the last 10 minutes of data. That's what I set up, looking at the business needs.
Losing one disk, losing the mail system, the reservation system—that's a small disaster. The plan takes care of that by just failing the systems over to the local server.
The plan also takes into account an event that wipes out this building, where employees can't even come here and people have to work from home. In that event, we'll need to get the call center back as quickly as possible, so we'll have employees take home their office phones, reroute our phone lines to corporate headquarters in Connecticut, and connect to the Mitel Networks VoIP phone system to run the call center out of their homes.
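As an editorial aside, the two-tiered logic Seth describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tanner & Haley's actual tooling: the function names and the tier labels `local-ha-server` and `overland-park-site` are hypothetical, and the only figures taken from the interview are the two-hour recovery target and the 10-minute limit on lost data (the latter is what is usually called a recovery point objective).

```python
from datetime import timedelta

# Targets quoted in the interview: back up in about two hours,
# losing no more than the last 10 minutes of data.
RECOVERY_TIME_TARGET = timedelta(hours=2)
DATA_LOSS_TARGET = timedelta(minutes=10)


def choose_failover_target(main_site_available: bool) -> str:
    """Pick a recovery tier (labels are hypothetical).

    A single-server failure at a working main site fails over to the
    local high-availability box; losing the whole site fails over to
    the remote Overland Park servers.
    """
    if main_site_available:
        return "local-ha-server"
    return "overland-park-site"


def meets_targets(recovery_time: timedelta, replication_lag: timedelta) -> bool:
    """Check a test run against both targets from the interview."""
    return (recovery_time <= RECOVERY_TIME_TARGET
            and replication_lag <= DATA_LOSS_TARGET)


if __name__ == "__main__":
    # Example: a failover test that took 90 minutes with 5 minutes of lag.
    print(choose_failover_target(main_site_available=False))
    print(meets_targets(timedelta(minutes=90), timedelta(minutes=5)))
```

A periodic failover test could record the measured recovery time and replication lag and pass them to `meets_targets` to confirm the plan still fits the business needs.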
Why did you make virtualization the basis for your solution?
The main justification was the cost savings. After we filed Chapter 11, every dollar counted. Virtualization software has come far enough and the hardware is powerful enough that I have no worries about running our recovery servers in virtual machines. I mean, if I had to build my data center from scratch, I'd definitely do so by virtualizing more. So that's why we opted to go with virtualization, really—to save on hardware costs and save on rack space in our disaster recovery site.
In addition to restoring call-center services, what are some other components of your business continuity plan?
I've given the individual managers telephone plans—that is, how I'm going to communicate with them in the event our systems go down, because it's going to be a little while before our BlackBerries and email come back online.
I also ask the managers to prioritize a few items: Who are their most important people (whose service should be restored, and in what order), and what are their most important systems? If, on the first day of the disaster, I can get 10 of your people back on, who are your most important 10?
How's the solution running so far? Have you had the opportunity to use it in an actual disaster yet?
We haven't had any issues with the solution. We test it every few months. In testing, my two main concerns have been how fast can I get the system back up and can the virtual site in Overland Park handle the load of all our users switching over to it. And it performed well on both counts. We haven't had to use it in real life yet. I hope we never have to.