Life with Exchange Server Clusters

Exchange Server 5.5 Enterprise Edition (Exchange Server 5.5/E) is the first Exchange version to support Microsoft Cluster Server (MSCS). For some time now, my mailbox has resided on a cluster, but that fact hasn't affected me as a user, because a cluster is invisible to users. Exchange Server on a cluster behaves and appears as it does on a standalone server. What experiences have people had with clusters during the product's first year?

Let's briefly review a few essential points about clustered Exchange Server computers, look at some problems that affect your decision to use clusters, and see whether the future promises any improvement. (For more information about MSCS and Exchange, see "Exchange 5.5/E and Microsoft Cluster Server," Windows NT Magazine, February 1998.)

Why Use Clusters?
Cluster technology lets a set of computers share common resources. The aim of clustering is for computers to continue to provide services to users even if hardware or software failures render one or more cluster members inoperative. Clustering technology is well understood today. Clustering's commercial origins were in VAXclusters, which Digital Equipment originally introduced in the mid-1980s. Today's MSCS supports only two computers in a Windows NT cluster. Microsoft built MSCS around a shared disk array, which the active computer accesses exclusively. Exchange Server runs on a virtual server, which has its own IP address and network name. MSCS switches the virtual server between the two physical servers in the cluster as circumstances change.

A transition process automatically switches services to the standby computer if a problem occurs on the active computer. Thus, an application such as Exchange Server is more available to users on a cluster than on standalone computers, which generally require manual intervention whenever a problem occurs. The prospect of enhanced resilience is the most attractive feature of clusters, especially if companies want to host thousands of mailboxes on one server. Obviously, a system crash that affects thousands of users causes more problems than a crash on a smaller server that hosts a few hundred mailboxes.

Cluster Basics
The two servers in an MSCS cluster must have symmetric hardware (i.e., the same CPU power and available memory), because Exchange Server tunes itself and stores many tuned parameters in the system Registry. Those parameters depend on a specific hardware configuration, so running with values tuned for one server on a differently configured server clearly affects performance. The effect becomes more marked as the difference between server configurations grows, so clustering a 100MHz Pentium with 64MB of memory with a dual-processor 200MHz Pentium equipped with 256MB makes little sense.

In practice, you don't see many people building clusters from servers that they already have. Microsoft supports clusters only on certified configurations, so you'll probably end up buying a packaged configuration from a major hardware vendor. When Microsoft first released MSCS, Microsoft was involved in the certification process, and the process took a long time to complete. Now, Microsoft offers a set of tools to vendors to carry out self-certification. The important point is never to select hardware that isn't on the cluster compatibility list. (The list is available at http://www.microsoft.com/isapi/hwtest/hcl.idc.)

The shared disk array is the heart of the cluster. You must run a high-quality RAID controller to protect the array. The cluster in my office, at the low end of the spectrum, runs a Compaq StorageWorks RA310 array, which is enough to satisfy my company's needs. Most Digital customers who run Exchange clusters have opted to operate controllers such as the StorageWorks RA7000, which offers dual redundancy and a large cache.

Experience shows that you must build production-quality clusters only from high-quality hardware. Any attempt to cut corners will result in problems. Use top-quality servers, redundant power supplies, Error-Correcting Code (ECC) memory, and the best possible RAID array you can afford. Be aware, however, that configuring and maintaining high-end controllers and arrays is not as straightforward as working with lower-end counterparts. Expect that systems administrators will need some time to get up to speed on clustering and the hardware configuration before they put the cluster into production.

Getting Exchange Running on a Cluster
MSCS requires Windows NT Server, Enterprise Edition (NTS/E). You must apply several NT fixes (available from the Microsoft FTP site) after you install NTS/E. Review Microsoft Support Online for the latest information about MSCS, and download the relevant hotfixes before you load Exchange. The only version of Exchange Server that you can install on a cluster is Exchange Server 5.5/E, and you can't upgrade a previously standalone server into a cluster.

These requirements bring up two problems. The first problem is the software cost associated with a cluster. Most large companies adopt Exchange Server 5.5/E because it contains the X.400 connector and supports the unlimited Information Store—two good reasons for large companies to adopt it as their default version. However, NTS/E doesn't offer so many immediately obvious advantages. NTS/E is expensive to license and support, and you probably don't need to run it unless you want to run MSCS.

The second problem is the inability to upgrade an existing server to form a cluster. Some people worry about this restriction, but I don't. If you're going to use new hardware for the cluster, why complicate matters by upgrading an existing server? In Digital's case, the cluster joined an existing site as a new server. Screen 1 shows the list of servers in the Exchange Administrator program. DBOIST-MSXCL is the cluster; PLATINUM and DBO-EXCHANGEIST are two ordinary servers. Note that the cluster appears no different from a standard server, at least to the Exchange Administrator program.

After you stabilize and test the cluster configuration to ensure that everything works as you expect, use the Move Mailbox option in Exchange Administrator to gradually migrate mailboxes to the new cluster. The old servers now act as bridgeheads for connectors to other sites or as public folder servers.

The time you'll need to move mailboxes is the only problem with this approach. My mailbox (about 175MB) took more than 30 minutes to move, and during this time, I couldn't access mail. My mailbox is a little larger than the average, but you can still expect to spend several hours moving mailboxes. Be aware that mailboxes can't receive new mail when you're moving them. When a mailbox is in transit, Exchange sends a nondelivery notification such as "The message was undeliverable because the recipient specified in the recipient postal address refused to accept the message. MSEXCH:MSExchangeMTA:Dublin:DBO-EXCHANGEIST" in response to incoming messages. The originator can resend the message after the move is complete. Plan to move mailboxes after business hours to avoid this problem.
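To plan an after-hours migration window, you can turn the single data point above into a rough estimate. The following Python sketch assumes that mailboxes move one after another and that throughput stays at the rate I observed (175MB in about 30 minutes). Because my move took more than 30 minutes, treat the result as optimistic. The function name and the sample batch are hypothetical, chosen only for illustration.

```python
# Rough planning aid, not a measurement tool.
# Observed in the article: one mailbox of about 175MB took a little over 30 minutes.
OBSERVED_MB = 175
OBSERVED_MINUTES = 30


def estimate_move_minutes(mailbox_sizes_mb, mb_per_minute=OBSERVED_MB / OBSERVED_MINUTES):
    """Estimate total minutes to move mailboxes sequentially at a fixed rate."""
    return sum(mailbox_sizes_mb) / mb_per_minute


# Hypothetical batch: 20 mailboxes of 50MB each (1,000MB in total).
batch = [50] * 20
minutes = estimate_move_minutes(batch)
print(f"Estimated move time: {minutes / 60:.1f} hours")
```

Substitute your own mailbox sizes and a rate measured on your own hardware before you commit to a schedule.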

As a user, I experienced two interesting side effects from the move. First, the Move Mailbox option doesn't move anything in the Deleted Items cache; therefore, you can't recover a deleted item after you migrate the mailbox.

Second, to let users connect to public folders on the same server as their mailbox, I created replicas on the cluster. When I connected to these replicas, I found that the read/unread status for the items in the folder disappeared and all items appeared to be unread. In most cases, changing the read status on an item isn't a serious problem, but some e-forms applications might depend on the read attribute, so you need to check your applications before you change the status. I could have avoided the problem by not creating new replicas, so Exchange would have redirected access to the replicas on the original server. However, the aim of the migration is to eventually replace the old server with the cluster, so you must move public folders at some point.

Another minor problem is the requirement to grant access to the hidden Events Root folder on the new server. If you don't hold owner permission on this folder, you can't edit or associate script code to folders for execution by the Exchange Event Service. Event scripting is still somewhat esoteric, and not many people are interested in it. You can easily set the permissions, but it's just one more task for you to do. These problems aren't specific to clusters; they occur any time you migrate mailboxes from one Exchange Server computer to another.

Over time, my plan is to gradually move the connector and public folder replicas to the cluster and then begin to decommission the old servers. Then again, I might elect to keep the old servers running as pure connector or public folder servers. Leaving a server to act as a pure hub makes sense when the server hosts multiple connectors to different sites or foreign email systems.

When the cluster is up and running, you manage clusterwide services, including Exchange Server, through the MSCS Cluster Administrator program. You access the Cluster Administrator through the Administrative Tools program group. Screen 2 shows the set of Exchange services running on the active server, as viewed through the Cluster Administrator. A red X on a service's icon signifies that the service hasn't been started or has encountered a problem. With time, managing clusters becomes second nature.

Not Everything Works
MSCS is a new technology, and as with other new products, not all the associated technology works with it yet. In the case of Exchange Server, I've found two problems. First, not all the code that ships with Exchange Server is cluster-aware. Second, most third-party software doesn't work with clusters.

Outlook Web Access (OWA) is the major missing piece of functionality. A cluster doesn't support OWA. A workaround exists, but Microsoft doesn't support it. Other missing pieces include support for the Lotus Notes, IBM PROFS, and IBM SNADS connectors, and X.25 connectivity.

Finally, you must use the ISINTEG utility (in -PATCH mode) with care. ISINTEG checks for inconsistencies in the Information Store and can alter the base globally unique ID (GUID) that the Information Store uses to create new objects. If ISINTEG changes the GUID, Exchange might not be able to restart the Information Store.

The documentation says that you can avoid the problem by setting the _CLUSTER_NETWORK_NAME variable beforehand, but a bug makes this step insufficient. However, a patch is available, and Exchange Server 5.5 Service Pack 1 (SP1) fixed the problem. See the Microsoft Support Online article Q185942 (http://support.microsoft.com/support/kb/articles/q185/9/42.asp) for further information. You can expect Microsoft to address these problems over time, but in the meantime, you must run a separate server to host the missing pieces, such as the Lotus Notes connector.

Few large operational sites use Exchange in isolation. They use third-party products for backups, virus checking, public folder indexing, document management and workflow, and many other purposes. However, I haven't found any add-on software that explicitly supports Exchange Server in a clustered environment. The only way to find out whether a product works is to install it and see what happens. All too often, the product fails.

Not being able to continue using products such as Computer Associates' ARCserve or Legato Systems' NetWorker for backup or NetIQ's AppManager to monitor the server is annoying, especially if you built your operational procedures around specific product features. Sufficient momentum has not yet built up for clusters, so vendors don't have any commercial reason to upgrade their products to support clusters. Meanwhile, before you install MSCS, you must review your third-party products to check whether they support clusters.

Are Clusters Really Resilient?
Clusters built with high-quality hardware are resilient and can meet targets of 99 percent uptime or greater. However, individual servers built from similar-quality hardware are just as reliable. MSCS's shared disk array represents a single point of failure, and no matter how many servers connect to the array, the fact remains that they can do nothing if the array fails. Of course, the same fact is true of individual servers.
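To put an uptime target in perspective, convert the percentage into hours of permitted downtime per year. The short Python sketch below does the arithmetic for the 99 percent figure. The 99.9 percent row is an extra comparison point that I've added for illustration, not a figure measured on any cluster.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent):
    """Return the hours per year a system may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


# 99 percent comes from the text; 99.9 percent is an illustrative comparison.
for target in (99.0, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.1f} hours of downtime per year")
```

A 99 percent target still permits more than three and a half days of downtime a year, which is why the quality of the hardware and the shared array matters so much.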

Outside of failures in the disk array, clusters achieve resilience because of their ability to move the NT services that form applications (e.g., Exchange) to the standby server if a hardware or software problem affects the active server. MSCS constantly monitors the heartbeat of the active server and the status of the services running on the active server. If MSCS detects a problem, MSCS transfers the services across to the passive server, which takes on the role of the active server.
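The following Python sketch models the failover idea just described: a monitor counts missed heartbeats from the active node and hands the virtual server to the standby node when a limit is reached. It is a toy simulation, not the MSCS API. The class names, the node names, and the three-heartbeat limit are all assumptions made for illustration; only the virtual server name, DBOIST-MSXCL, comes from this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One physical server in the cluster (names are hypothetical)."""
    name: str
    healthy: bool = True


@dataclass
class VirtualServer:
    """The network name and services that users actually connect to."""
    network_name: str
    services: List[str]
    owner: Optional[Node] = None


class TwoNodeCluster:
    """Toy model of a two-node, active/standby cluster."""

    def __init__(self, active, standby, virtual_server, missed_limit=3):
        self.active = active
        self.standby = standby
        self.vs = virtual_server
        self.vs.owner = active
        self.missed_limit = missed_limit  # assumed threshold, not an MSCS setting
        self.missed = 0

    def heartbeat(self):
        """Process one heartbeat interval; return True if a failover occurred."""
        if self.active.healthy:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed >= self.missed_limit and self.standby.healthy:
            # Swap roles: the standby node now owns the virtual server.
            self.active, self.standby = self.standby, self.active
            self.vs.owner = self.active
            self.missed = 0
            return True
        return False


node_a, node_b = Node("NODE-A"), Node("NODE-B")
exchange = VirtualServer("DBOIST-MSXCL", ["System Attendant", "MTA", "Information Store"])
cluster = TwoNodeCluster(node_a, node_b, exchange)

node_a.healthy = False  # simulate a failure on the active node
events = [cluster.heartbeat() for _ in range(3)]
print(events, exchange.owner.name)
```

Users keep connecting to the same network name throughout; only the physical owner behind that name changes.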

Exchange is a complex application, and MSCS might have to move several services (the System Attendant; the Message Transfer Agent, or MTA; the Internet Mail Service, or IMS; the Information Store; and so on) before Exchange can resume normal service. The transition is largely transparent to users, because clients can reconnect automatically. However, clients can reconnect only when the Information Store service is active again. This process can take up to 10 minutes, in part because the Information Store must commit any outstanding transactions that were waiting when the cluster transition began. However, some people believe that MSCS slows the process down a little. The transition time seems to increase with the size of the Information Store databases and the number of supported mailboxes. This characteristic is unfortunate, because clusters typically host large user communities and consequently large databases (50GB or larger).

Should I Deploy Clusters Now?
People who run clusters generally like their stability and robustness. High-quality hardware and good operational discipline have much to do with these features. However, if you're not convinced that clusters are mature enough yet for your purposes, or you depend on a third-party application that MSCS doesn't support, you can achieve much of the same resilience by buying two high-end servers and dividing the user community across them. You won't get automatic transition on failures, but you won't affect all your users if a catastrophic hardware failure occurs: a case of not putting all your eggs in one basket.

Clustering will improve in the long term. Microsoft knows that clustering is important for companies to accept NT for mission-critical computing. Microsoft plans to let multiple Exchange Server computers access shared disks concurrently rather than exclusively, but you won't see this feature until the release of NT 5.0 and of a version of Exchange Server with new functionality. Until that time, the choice is yours: Go with clustering now and accept its limitations, or use the same money to buy two servers.
