Continuous data protection (CDP) systems have gotten a lot of attention in the Exchange Server world over the last year or two. After the devastation caused by Hurricanes Rita and Katrina, many organizations that had previously been satisfied with their disaster recovery arrangements started to look for better protection. Conventional backup systems are like the spare tire on your car: When your tire goes flat, you need a properly inflated, usable spare and the related tools on hand, but changing a tire is still a hassle—especially if you have to do it in bad weather, alongside a busy road, or under other less-than-ideal conditions. It would be much nicer to have a dashboard button that, when pushed, would automatically fix your tire for you. That's the basic notion behind CDP: making disaster recovery easier with less data loss by increasing the frequency at which data items such as Exchange databases are backed up.
Do you need CDP? Any time you add complexity to a network or system, you're increasing the risk of its failure. However, your organization might benefit from the degree of protection that CDP offers.
RTO and RPO
Before we discuss how CDP works and how to implement it, we need to define two common terms in the disaster recovery world: recovery time objective (RTO), a measurement of how long you're willing to wait for a restore; and recovery point objective (RPO), the point in time to which you want to recover. Think of the RPO as the maximum amount of data you're willing to lose. For example, if you use a CDP product that copies data every hour, you should be able to restore to within the last hour, losing only up to an hour of data.
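The hourly example above can be worked through with a little date arithmetic. The following is a minimal sketch, not part of any CDP product: the function names, the sample schedule, and the failure time are all made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Sequence


def recovery_point(copy_times: Sequence[datetime], failure: datetime) -> datetime:
    """Return the most recent copy taken at or before the failure."""
    usable = [t for t in copy_times if t <= failure]
    if not usable:
        raise ValueError("no copy exists before the failure")
    return max(usable)


def data_loss(copy_times: Sequence[datetime], failure: datetime) -> timedelta:
    """Work done between the last usable copy and the failure is lost."""
    return failure - recovery_point(copy_times, failure)


def meets_rpo(copy_interval: timedelta, rpo: timedelta) -> bool:
    """A fixed copy interval can lose up to one full interval of data."""
    return copy_interval <= rpo


# Hypothetical schedule: one copy per hour from 08:00 through 12:00.
start = datetime(2007, 1, 15, 8, 0)
hourly = [start + timedelta(hours=h) for h in range(5)]
failure = datetime(2007, 1, 15, 11, 47)

print(data_loss(hourly, failure))  # prints 0:47:00
```

With hourly copies, the loss is always under an hour, so the schedule satisfies a one-hour RPO but not a 15-minute one.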
CDP solutions are generally designed to do two things: minimize the RTO by providing tools for quickly restoring a copy of the data; and provide the finest possible granularity for RPO by using either continuous copying of data or frequent intermediate replication checkpoints.
Host-Based vs. Storage-Based
You'll discover some crucial differences in how CDP products operate. The first difference is that some products protect data by using software that runs on the server you want to protect, whereas other products operate beneath the OS's notice because they run on a SAN controller. The first class of products is called host-based and the second class is called storage-based.
Most host-based systems use what's known as a file system filter driver. The system installs a driver that sits below the Windows I/O management subsystem (itself a part of the kernel), tracking which data items on a given volume are written to and copying those data items to a remote system over the network. Host-based CDP products typically protect a set of files or folders, although some can protect entire volumes. The responsibility generally lies with you to make sure that host-based CDP software is pointed at the correct set of folders to capture your Exchange databases and transaction logs, although some products are more Exchange-aware than others.
Some host-based systems implement transaction-level replication by monitoring changes to the Exchange database through the Messaging API (MAPI). These products often have the advantage of not requiring any software on the Exchange server itself; however, they typically require a gateway server that aggregates the transactions and acts as the replication target.
Storage-based CDP has the advantage of taking place on the SAN; you don't install or maintain drivers or other components on the servers. In theory, these systems should have a minimal effect on Exchange because they function without any connection to Exchange or Windows. In practice, storage-based systems have three primary drawbacks. First, they're expensive. Second, you must be using SANs (and generally you have to have identical SAN controllers on either end of the connection—a further expense). Third, they sometimes limit the number of users you can host on a protected Exchange server because the way they copy data to the remote system creates disk latency.
Synchronous vs. Asynchronous
CDP systems copy data in one of two ways, and the distinction mirrors the way applications perform disk I/O. Typically, when the Exchange Information Store (IS) makes a write request to the Windows I/O manager, the IS continues its work without waiting for the write to complete; at the time of the write request, the IS registers a callback function, and the I/O manager calls that function when the write finishes. This method is known as asynchronous I/O because the completion of the write is disconnected from what the requester is doing.
In synchronous I/O systems, the requester issues a write request, then waits for the write to finish. Synchronous I/O systems are simpler to code than asynchronous I/O systems, and it's easier to predict their behavior. However, they tend to be slower than asynchronous I/O systems, which is why Exchange uses asynchronous I/O.
These concepts might seem esoteric, but they play a central role when you're deciding which CDP solution to deploy. Microsoft's support policy for CDP and replication products makes the distinction explicit. Think about what happens when data from the source system is copied to a replica, whether it's on the same machine, on a SAN, or across a LAN or WAN. When the source issues a write request, the data has to be written to the local disk, but it also has to be copied to the replica. If the source system's write request doesn't complete until the remote write is finished, that's a synchronous CDP operation. If, as is more common, the source write and remote write take place independently (i.e., they're not coupled in a predictable sequence), that's an asynchronous CDP operation.
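The difference between the two kinds of CDP operation can be sketched in a few lines. This is a toy model under stated assumptions, not how any real product is built: the class names are invented, the replica is an in-memory list, and a short sleep stands in for the latency of the link to the remote site.

```python
import queue
import threading
import time


class Replica:
    """Stand-in for the remote copy; the delay simulates link latency."""

    def __init__(self, delay: float = 0.05):
        self.delay = delay
        self.blocks = []

    def write(self, block: bytes) -> None:
        time.sleep(self.delay)
        self.blocks.append(block)


class SyncReplicator:
    """The source write does not complete until the remote write finishes."""

    def __init__(self, replica: Replica):
        self.local = []
        self.replica = replica

    def write(self, block: bytes) -> None:
        self.local.append(block)
        self.replica.write(block)  # caller waits here for the replica


class AsyncReplicator:
    """The source write returns at once; a worker copies blocks later."""

    def __init__(self, replica: Replica):
        self.local = []
        self.replica = replica
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, block: bytes) -> None:
        self.local.append(block)
        self._pending.put(block)  # remote write is decoupled from the caller

    def _drain(self) -> None:
        while True:
            block = self._pending.get()
            self.replica.write(block)
            self._pending.task_done()

    def flush(self) -> None:
        """Block until every queued write has reached the replica."""
        self._pending.join()
```

In the synchronous class, every write pays the link latency but the replica is never behind. In the asynchronous class, writes return quickly, and whatever is still queued when the source fails is data the replica never received.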
What Microsoft Says
As you might expect, Microsoft has a pretty clear stance on the use of CDP products. The Microsoft article “Multi-site data replication support for Exchange 2003 and Exchange 2000” at http://support.microsoft.com/kb/895847 describes what's supported:
- If you use an asynchronous solution—whether host- or storage-based—Microsoft expects you to use the CDP vendor as the first line of support for the replicated data. If you encounter problems, Microsoft might ask you to show that the problem isn't caused by the CDP technology, possibly by removing it.
- If you use storage-based synchronous replication, Microsoft's policy depends on whether you're using a geographically dispersed, or stretched, cluster. The bottom line is that if you're using a stretched cluster, all your hardware must be certified for use in stretched clusters (according to the searchable list at http://www.microsoft.com/whdc/hcl/search.mspx). For solutions that are on the list, Microsoft provides full support, except that the storage or cluster vendor must provide support for the storage and replication components of your deployment. If you're not using a stretched cluster, Microsoft recommends but doesn't require that you use hardware that appears on the certification list, but the hardware and storage vendors are still on the hook for primary support.
- If you're using host-based synchronous replication, Microsoft treats it essentially the same as an asynchronous solution, unless you happen to use a configuration that appears on the Wolfpack Hardware Compatibility List (WHCL), in which case it's supported like a storage-based synchronous product.
Is this confusing? Well, yes. As a practical matter, what these support statements mean is that Microsoft doesn't guarantee that it can help solve problems if those problems are caused by (or even influenced by) the use of CDP products. Microsoft will try to help, but if the problem can be traced to the CDP solution, or if “less disruptive troubleshooting” methods (a charming phrase!) don't identify the problem, you might have to remove your CDP solution to continue troubleshooting.
Microsoft also provides some deployment guidelines for CDP products at http://www.microsoft.com/technet/prodtechnol/exchange/guides/E2k3DataRepl. The guidelines state three basic criteria for choosing an asynchronous solution, which I'll quote here:
- It can maintain the write-order consistency of all devices in a storage group, including being continuously consistent with each other;
- It has been proven to be recoverable, preferably in both a lab and a production environment;
- It is being provided by a vendor with a support plan in place for the replicated data.
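The first criterion, write-order consistency, is the easiest of the three to misread, so a small sketch may help. It assumes one simple interpretation: at any moment, the writes applied on the replica must be a prefix of the writes issued on the source, across every device in the storage group. The `Write` type, the device labels, and the sample payloads are illustrative only.

```python
from typing import NamedTuple, Sequence


class Write(NamedTuple):
    seq: int        # order in which the source issued the write
    device: str     # illustrative labels: "logs" or "database"
    payload: bytes


def is_write_order_consistent(source: Sequence[Write],
                              applied: Sequence[Write]) -> bool:
    """True if the replica holds exactly the first n source writes, in order."""
    return list(applied) == list(source[:len(applied)])


source = [
    Write(1, "logs", b"begin txn 17"),
    Write(2, "logs", b"commit txn 17"),
    Write(3, "database", b"page 204 updated by txn 17"),
]

behind_but_consistent = source[:2]           # replica lags by one write
out_of_order = [source[0], source[2]]        # database page arrived before its log record

print(is_write_order_consistent(source, behind_but_consistent))  # prints True
print(is_write_order_consistent(source, out_of_order))           # prints False
```

A replica that is merely behind can still be recovered to an earlier point in time. A replica that has a database page without the log records that precede it may not be recoverable at all, which is why the criterion covers all devices in the storage group together.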
CDP for Exchange 2003
How can you get continuous protection for Exchange Server 2003? The answer depends on what you're trying to accomplish and how much you can afford to spend. Several vendors offer CDP solutions for Exchange 2003, including EMC, Double-Take Software (formerly NSI Software), HP, SteelEye Technology, and XOsoft (now part of CA). Some SAN vendors also offer hardware-based solutions that work with Exchange.
When you're choosing a solution, the big things to consider involve what happens after you have a failure. Provided that you have enough bandwidth, and that you carefully monitor the replication solution, most products will do a sufficient job of replicating your data from one location to another. However, Exchange 2003 doesn't provide any native support for failing over operations to a remote site unless you're using clustering. Therefore, a true CDP solution will need to have some kind of failover mechanism, whether the product includes it or you have to do it yourself.
Failover requires several interlocking steps, including updating your mail exchanger (MX) record to point to the recovery server so that inbound mail flows, rehoming mailboxes by adjusting the homeMDB attribute of the affected users' objects in Active Directory (AD), and updating Outlook client profiles to point at the new server. When you're evaluating CDP products, be sure to test each product by failing production operations over to it, then failing them back. If you can't do this easily (or if the product doesn't meet Microsoft's three criteria above), you probably shouldn't use it.
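One way to make failover testing repeatable is to write the steps down as an ordered runbook that stops at the first failure. The sketch below shows only that structure. The `Step` and `Runbook` names are invented, and every action is a placeholder that does nothing; real steps would call whatever DNS, AD, and client-management tooling your environment actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


@dataclass
class Runbook:
    steps: List[Step]
    completed: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Run steps in order; stop at the first one that fails."""
        self.completed.clear()
        for step in self.steps:
            if not step.action():
                return False
            self.completed.append(step.name)
        return True


def placeholder() -> bool:
    """Hypothetical stand-in for real work; always reports success."""
    return True


failover = Runbook([
    Step("point the MX record at the recovery server", placeholder),
    Step("set homeMDB on the affected users' AD objects", placeholder),
    Step("update Outlook profiles to use the new server", placeholder),
])

if failover.run():
    print("failover steps completed:", len(failover.completed))
```

A matching failback runbook would list the same steps pointed at the original server, and `completed` shows how far a partial run got before it stopped.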
CDP for Exchange 2007
Exchange Server 2007 marks a radical departure from Exchange 2003 in many ways. One of the most important changes is that it includes native support for two different CDP methods: local continuous replication (LCR) and cluster continuous replication (CCR).
LCR copies storage groups to different disks on the same server. This type of replication helps protect against problems with the original storage group's physical storage, and it protects against some types of on-disk corruption, such as failed or corrupted writes. LCR replicas provide fast restores (provided you've fixed the problem that caused the original failure), which is great if you have a short RTO, and they might allow you to take fewer full backups to secondary storage such as tape. You can create backups from the LCR replica instead of from the production database, which can be a significant time-saver. However, LCR failover requires manual action.
CCR is designed to provide full replication of data between nodes in a cluster. The way CCR works is ingenious: You set up a two-node Exchange cluster that uses a special network share called a file share witness to keep track of the cluster state. The witness can be on any server accessible across the network, although Microsoft recommends using a server in the same AD site as the cluster. The two cluster nodes don't have to share any storage, whereas all previous versions of Exchange require shared storage in clusters. To copy data from one node to the other, the CCR feature uses the same basic log-copying mechanism as LCR.
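The log-copying idea can be illustrated with a toy model. This is a sketch under simplifying assumptions rather than a description of Exchange internals: it assumes that only closed logs are copied, that they are replayed strictly in generation order, and that a "database" is just a list of records. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActiveCopy:
    closed_logs: Dict[int, List[str]] = field(default_factory=dict)
    open_log: List[str] = field(default_factory=list)
    next_generation: int = 1

    def write(self, record: str) -> None:
        """New transactions always land in the currently open log."""
        self.open_log.append(record)

    def close_log(self) -> None:
        """Seal the open log so that it becomes eligible for copying."""
        self.closed_logs[self.next_generation] = self.open_log
        self.open_log = []
        self.next_generation += 1


@dataclass
class PassiveCopy:
    database: List[str] = field(default_factory=list)
    last_replayed: int = 0

    def ship_and_replay(self, active: ActiveCopy) -> int:
        """Copy each closed log not yet seen, in order, and replay it."""
        shipped = 0
        while self.last_replayed + 1 in active.closed_logs:
            generation = self.last_replayed + 1
            self.database.extend(active.closed_logs[generation])
            self.last_replayed = generation
            shipped += 1
        return shipped


active = ActiveCopy()
passive = PassiveCopy()
active.write("message A delivered")
active.write("message B delivered")
active.close_log()
active.write("message C delivered")  # still in the open log

print(passive.ship_and_replay(active))  # prints 1
print(passive.database)  # prints ['message A delivered', 'message B delivered']
```

In this model, message C is not on the passive copy until its log closes, so the contents of the open log are what a failure at that moment would put at risk.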
Both of these technologies work with a single database per storage group. Therefore, they're best suited for protecting high-value data instead of entire servers. In using LCR and CCR, you're also limited in the ways you can protect public folder databases; this isn't a big problem because public folders already include their own replication mechanism. CCR requires that you use the same location and paths for the storage groups and databases on both nodes, just as conventional clustering does.
Both LCR and CCR benefit from a little-noticed Exchange 2007 change: Transaction logs are now exactly 1MB, down from the 5MB size used in earlier versions. The smaller size makes replication more efficient.
And the Winner Is . . .
The majority of CDP solutions for Exchange use the host-based asynchronous approach. Products that take this approach generally offer the best balance between deployment flexibility, protection capability, and ownership cost. After all, buying software to protect your Exchange servers is almost guaranteed to be less expensive than buying a new SAN to use as a replication target!
With that fact in mind, you have some things to think about when choosing a host-based asynchronous CDP solution. First, you must clearly define your RTO and RPO and decide which is more important. Would you rather have extremely fast restores or lose less data? If restore speed is crucial, you might want to design a system that uses Microsoft Volume Shadow Copy Service (VSS) or a similar point-in-time copy mechanism as the primary means of backing up your data.
You could also design a hybrid solution. For example, you could use your favorite backup utility to stream an Exchange backup to a disk file, then use the CDP solution to replicate that file to a remote site. This approach avoids many of the pitfalls of direct Exchange replication, but its RPO granularity is limited by the interval at which you take the original backups. Still, for many companies, such an approach is better than keeping backups only on site.
To further complicate matters, Microsoft recently introduced a beta of System Center Data Protection Manager Version 2.0 (DPMv2). The original version could be used with Exchange, but only if you wanted to use a conventional backup program to stream Extensible Storage Engine (ESE) data to a backup disk file; DPM then could replicate the file to the DPM server. DPMv2 can directly protect Exchange, but I haven't had a chance to test it thoroughly yet.
It's too early to tell which CDP vendors will update their products to work with Exchange 2007, especially given that Exchange 2007 includes its own CDP functionality. Taking the time now to understand how the technology works will help you have the necessary tools on hand when your system's tires go flat.