High Availability in Exchange 2007 Is Well Within Your Reach

Executive Summary:

In Exchange Server 2007, high availability comes in three flavors: cluster continuous replication (CCR), single copy cluster (SCC), and local continuous replication (LCR). Learn about high availability and asynchronous log shipping and replay in this primer on Exchange 2007 high availability options.

High availability has traditionally carried a hefty price tag. A few years ago, I taught a class at EMC's Boston headquarters. I was surprised to learn that the typical-looking computer setup for the class was actually valued at $12 million, the cost of a system that, in EMC’s words, “you cannot bring down, even with an ax.” As you consider what your organization might need in terms of high availability, perhaps you're already worried about the cost of the hardware and software required, not to mention support for such a solution. In the past, these were valid concerns, but with Exchange Server 2007 you can find solutions that won’t break your budget and won't require your Exchange team to pursue PhDs in clustering. We’ll start our exploration of Exchange 2007 high-availability solutions with an overview here. In upcoming articles, we'll delve into the topic in more detail by walking through how to configure Local Continuous Replication (LCR) and Cluster Continuous Replication (CCR) in Exchange 2007.

What’s High Availability?
Often, the concept of high availability is confused with that of “uptime,” although the two are not one and the same. High availability is not just about the system being up, but about its being available and accessible to users and, in the case of your Exchange servers, ready to send and receive mail.

Availability is usually expressed as a percentage of uptime. To calculate it, take the number of unplanned downtime minutes for a year, divide this by the number of minutes in a year (about 525,600), and subtract the result from 100 percent. The higher-end availability percentages are:

99.9 percent = 43.8 minutes of downtime/month or 8.76 hours/year
99.99 percent = 4.38 minutes of downtime/month or 52.6 minutes/year
99.999 percent = 0.44 minutes of downtime/month or 5.26 minutes/year
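
As a quick check on the figures above, here is a minimal Python sketch of the arithmetic. It assumes a 365-day year and twelve equal months; the function names are my own, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Return (minutes per year, minutes per month) of unplanned downtime
    permitted at the given availability percentage."""
    per_year = MINUTES_PER_YEAR * (1 - availability_percent / 100)
    return per_year, per_year / 12


def availability_percent(downtime_minutes_per_year):
    """Return the availability percentage for a year with the given
    number of unplanned downtime minutes."""
    return 100 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        per_year, per_month = allowed_downtime_minutes(pct)
        print(f"{pct} percent = {per_month:.2f} minutes/month "
              f"or {per_year:.2f} minutes/year")
```

Running the calculation in the other direction is just as useful: feed last year's unplanned downtime into availability_percent to see which tier you actually achieved.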

These percentages can give you an idea of what level of availability you might be looking for from your systems, and from there you can determine what you need to do to provide that level of availability. There is no perfect solution because each business situation is unique, but all organizations would like to achieve a high level of uptime. Regarding the concept of downtime, some admins consider planned downtime as acceptable—for example, to perform maintenance on the systems—because such downtime occurs within a controlled, organized setting. However, high availability takes into account an environment being available even if a specific server has to be taken offline for maintenance.

I should note a distinction between high availability and disaster recovery. High availability involves preparation for a predefined set of failures (e.g., a disk fails, a power supply burns out, the network connection goes down), whereas disaster recovery indicates the need to restore from backups. This need to restore occurs when you exceed the set of predefined failures mentioned earlier; for example, if one system in your cluster crashes and it fails over to your secondary system and that one crashes too, you're looking at a disaster recovery solution.

High Availability in Exchange 2007
In Exchange 2007, high availability comes in three flavors. Each offers a different level of protection, with different hardware and software requirements.

LCR. LCR, a single-server solution, uses asynchronous log shipping and replay to copy a storage group (SG) that contains one database from one set of disks to another, as Figure 1 shows. It requires a manual switch to move from the primary copy of the data, in the active SG, to the secondary copy, in the passive SG.

CCR. CCR is a clustered solution that requires only two nodes in the cluster, where one is the active node and the other is the passive node for automatic failover, as Figure 2 shows. This solution allows for two different systems and two different sets of storage, offering a greater level of availability because it eliminates single points of failure. Asynchronous log shipping and replay is used in this solution to keep the database up to date between the active and passive copies of the data. CCR is relatively easy to set up, in contrast to more complex, hardware-level geoclusters (that is, clusters of nodes separated by great distances). However, one limitation of CCR (really of Windows Clustering) is that both nodes must be on the same IP subnet: you can either place both servers within a single datacenter or stretch the cluster between two datacenters that share a subnet. This limitation will be addressed in Windows Server 2008.

Single copy cluster (SCC). SCC offers a similar solution to what Exchange Server 2003 offered: multiple systems with a single SG that's shared between the cluster nodes, as Figure 3 shows. Again, we have one active server with one or more passive servers waiting for a failover. Because SCC doesn’t provide a redundant SG, it doesn’t use log shipping and replay.

Understanding Asynchronous Log Shipping and Replay
Transaction logs have been, and continue to be, essential to the operation of the Exchange server. When a message is sent to the Exchange server, it starts out in system memory and is written to a transaction log before being written to the Exchange database when the system load permits. If a server failure occurs, you'll lose the contents of the system memory, but the transaction logs will remain intact and can be used to update the database when the system comes back online.

Exchange maintains a single set of logs for the databases in an SG. Transaction logs are created sequentially, in what is called a log stream. These log files are now 1MB in size in Exchange 2007 (as opposed to the 5MB log files in Exchange 2003). The reduction in file size was one of the changes made to support continuous replication. A log stream can contain up to 2,147,483,647 log files. (This upper limit stems from the number of log file names that can be created, not from the amount of disk space on the server.)

In Exchange 2003, log files are named like this: Ennfffff.log, with nn being the prefix that changes from one SG to the next and fffff being a sequence number limited to about one million logs, each 5MB in size. In Exchange 2007, however, the naming convention changed to Ennffffffff.log, offering the ability to handle over 400 times more data. So even though the files are smaller, the longer sequence number lets you create far more of them before running out of log file names. Note that this limit is per SG, not per server. The standard license for Exchange 2007 enables you to create up to five SGs and to mount up to five databases. The enterprise license for Exchange 2007 lets you create up to 50 SGs and mount up to 50 databases.
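
The “over 400 times” figure can be checked with a little arithmetic. The Python sketch below rests on two assumptions of mine: that the sequence portion of the file name is hexadecimal, and that every five-digit value is usable in Exchange 2003 (the text says only “about one million”). The log_file_name helper is hypothetical and exists only to show how the two name lengths compare.

```python
# Exchange 2003: five-digit sequence number, 5MB per log file.
# Assumption: the digits are hexadecimal and all values are usable.
LOGS_2003 = 16**5 - 1            # 1,048,575, i.e. "about one million"
CAPACITY_2003_MB = LOGS_2003 * 5

# Exchange 2007: eight-digit sequence number, 1MB per log file,
# capped at the 2,147,483,647 files quoted in the text.
LOGS_2007 = 2**31 - 1            # 2,147,483,647
CAPACITY_2007_MB = LOGS_2007 * 1

# How much more log data one stream can hold in Exchange 2007.
RATIO = CAPACITY_2007_MB / CAPACITY_2003_MB   # roughly 409.6


def log_file_name(prefix, generation, digits):
    """Hypothetical helper: build a name of the form E<nn><sequence>.log,
    with the sequence zero-padded to the given number of hex digits."""
    return f"E{prefix}{generation:0{digits}X}.log"


if __name__ == "__main__":
    print(f"Exchange 2003 log stream: {CAPACITY_2003_MB:,} MB")
    print(f"Exchange 2007 log stream: {CAPACITY_2007_MB:,} MB")
    print(f"Ratio: {RATIO:.1f}x")
    print(log_file_name("00", 1, 5))   # five-digit style
    print(log_file_name("00", 1, 8))   # eight-digit style
```

In other words, the 2007 log stream holds roughly 2,048 times as many files, but each file is one-fifth the size, which works out to a little over 400 times the data.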

Log shipping is the process whereby logs are copied over to their secondary location and replayed into a copy of the database. The result is a failover copy (either an entire server ready to step in as an active server or simply a second set of data ready to be used in the event of a single server failure) that holds nearly all the transaction log data and is ready to go. The process is called “asynchronous” because there's an interval of time when a transaction log is still open and active and can't be shipped over to the secondary copy of the logs. So some data might be lost. In the case of an LCR solution, you have to decide if that loss of data is acceptable.
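
To make the source of that data loss concrete, here is a toy Python model of asynchronous log shipping. It is a sketch under simplifying assumptions, not a description of how Exchange implements replication: a log “closes” after a fixed number of transactions rather than at 1MB, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StorageGroupCopy:
    """Toy model of one storage group with an active and a passive copy."""
    txns_per_log: int = 3                             # stand-in for the 1MB log size
    open_log: list = field(default_factory=list)      # active log, still being written
    closed_logs: list = field(default_factory=list)   # complete logs waiting to ship
    passive_db: list = field(default_factory=list)    # transactions replayed into the copy

    def write(self, txn):
        """Record a transaction in the open log; close the log when it fills."""
        self.open_log.append(txn)
        if len(self.open_log) == self.txns_per_log:
            self.closed_logs.append(self.open_log)
            self.open_log = []

    def ship_and_replay(self):
        """Copy every closed log to the passive side and replay it in order."""
        for log in self.closed_logs:
            self.passive_db.extend(log)
        self.closed_logs = []

    def lost_on_failure(self):
        """Transactions the passive copy would lack if the active side failed now:
        anything in closed logs not yet shipped, plus the open log."""
        unshipped = [txn for log in self.closed_logs for txn in log]
        return unshipped + self.open_log


if __name__ == "__main__":
    sg = StorageGroupCopy()
    for n in range(1, 8):
        sg.write(f"msg{n}")
    sg.ship_and_replay()
    print("Replayed into passive copy:", sg.passive_db)
    print("At risk if the active copy fails now:", sg.lost_on_failure())
```

With seven messages and three transactions per log, the first two logs ship and replay, while the seventh message sits in the open log and exists only on the active side.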

In a CCR solution, another server can help mitigate the loss. The Hub Transport server role includes a feature called the transport dumpster, which retains a predefined amount of mail that has passed through on its way to the cluster. If the active node of the cluster fails, the passive node automatically takes over. In addition, the passive node contacts the Hub Transport server(s) and queries them for any messages in the transport dumpster. The passive node then goes through the messages, checks them against the database to see whether any of the messages are missing, and discards duplicates.
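
The check-and-discard step can be sketched in a few lines of Python. This is only an illustration of the idea: it assumes each message carries a unique ID, and the function name and data shapes are hypothetical rather than anything Exchange exposes.

```python
def recover_from_dumpster(passive_db_ids, dumpster_ids):
    """Compare the messages held in the transport dumpster with those already
    in the passive copy of the database.

    Returns (redelivered, discarded): messages that were missing from the
    database and are delivered, and messages dropped as duplicates."""
    present = set(passive_db_ids)
    redelivered, discarded = [], []
    for msg_id in dumpster_ids:
        if msg_id in present:
            discarded.append(msg_id)      # database already has it
        else:
            redelivered.append(msg_id)    # lost in the failover, so deliver it
            present.add(msg_id)           # a later copy is now a duplicate
    return redelivered, discarded


if __name__ == "__main__":
    database = ["m1", "m2"]                  # what log replay got across
    dumpster = ["m2", "m3", "m3", "m4"]      # what the Hub Transport server kept
    redelivered, discarded = recover_from_dumpster(database, dumpster)
    print("Redelivered:", redelivered)
    print("Discarded as duplicates:", discarded)
```

The point of the design is that the dumpster can safely hold more mail than was actually lost, because anything the database already has is simply thrown away.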

High Availability That Fits Your Needs
Exchange 2007 offers a variety of high-availability options that can be tailored to the needs and resources of any organization, as Table 1 shows. Even the smallest organization can use Exchange 2007 to provide a greater level of availability at a reasonable cost and with a relatively simple configuration. In addition, the passive copy of your data can be used to perform backups of your database from either the passive node of your CCR cluster or the LCR copy disk, using Microsoft Volume Shadow Copy Service (VSS). Now that you have an idea of what each solution offers, it’s up to you to decide which one fits your requirements. Stay tuned for the next article in this series, in which I’ll discuss how to implement LCR, and an upcoming article about how to implement CCR.
