Message servers became more resilient last fall when Microsoft released two products: Microsoft Cluster Server (MSCS—formerly code-named Wolfpack) and Exchange 5.5, Enterprise Edition (Exchange 5.5/E). MSCS is the first clustering solution for Windows NT to fully support Exchange Server in production environments. And Exchange 5.5/E, released in November 1997, formally supports clusters. Many large installations have been eagerly awaiting Exchange 5.5/E because they want to build very large servers or consolidate several small servers into large clusters to reduce the load of systems administration. Exchange 5.5/E offers two technical advances—clustering and an unlimited information store—that are key to building very large servers.
Putting more than a thousand mailboxes on a nonclustered server—even a server that's protected by a UPS, a high-specification RAID-5 array, and other resiliency features—is an act of faith. A hardware or software problem can interrupt the email service, and people can't do their work. Exchange 5.5 addresses the problem by supporting an active/standby cluster configuration. (For descriptions of clustering terminology, see Joel Sloss, "Clustering Terms and Technologies," June 1997.) Exchange ordinarily runs on the active node in the cluster; if a problem occurs, MSCS automatically transfers work to the standby node, which becomes active and continues to process user requests. Let's see how MSCS and Exchange 5.5/E can work for you.
Helping Exchange Understand Clusters
Engineers who want to support Exchange in clusters have several challenges, in addition to the obvious requirement to provide redundancy through hardware. One challenge is how to handle the stores (Exchange's databases). Another challenge is how to associate user mailboxes with particular Exchange servers. Exchange configuration data relating to components such as bridgehead servers (servers that connect Exchange sites) for messaging and directory replication connectors can also be server-specific.
The Exchange information and directory stores use a complex transaction model; databases, transaction logs, and queues held in memory represent the full stores. Any failover must be able to seamlessly switch the stores back to the state they were in when a problem occurred. Exchange can handle this requirement because of its capability to roll outstanding transactions forward into the database from the transaction logs. The transaction logs also satisfy the MSCS requirement for data to be persistent. In other words, applications must always write data to a place where you can access it, even if the cluster fails over.
When any Exchange server suffers an unexpected failure (such as a power outage), Exchange automatically recovers transactions the next time the Exchange Information Store (IS) service starts. This process is a soft recovery. When the active server fails in a two-node cluster, Exchange performs a similar recovery: The newly activated server takes responsibility for committing any outstanding transactions to the database before letting users reconnect. Conceptually, therefore, the IS has always been reasonably well prepared for clustering.
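To make the roll-forward idea concrete, here is a minimal Python sketch of replaying a transaction log into a database after a failure. It is a toy model, not Exchange's storage engine: the record layout and the names LogRecord and soft_recovery are hypothetical, and I assume that only transactions with a logged commit are replayed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """One entry in a hypothetical transaction log."""
    txn_id: int
    kind: str          # "write" or "commit"
    key: str = ""
    value: str = ""


def soft_recovery(database, log):
    """Roll committed transactions forward into a copy of the database.

    Writes that belong to a transaction with no commit record are skipped.
    """
    committed = {rec.txn_id for rec in log if rec.kind == "commit"}
    recovered = dict(database)
    for rec in log:  # replay in log order
        if rec.kind == "write" and rec.txn_id in committed:
            recovered[rec.key] = rec.value
    return recovered


# State found on the shared disk after the active node failed.
database_on_disk = {"inbox/alice": "msg-1"}
log_on_disk = [
    LogRecord(1, "write", "inbox/alice", "msg-2"),
    LogRecord(1, "commit"),
    LogRecord(2, "write", "inbox/bob", "msg-9"),  # never committed
]

recovered = soft_recovery(database_on_disk, log_on_disk)
print(recovered)
```

In a cluster, the newly activated node would run this kind of replay against the logs on the shared disk before it lets users reconnect.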
Breaking the association between user mailboxes (and configuration data) and specific servers is trickier. Within sites, administrators can assign servers certain work. Some servers might handle only public folders, some deal with connections, and others might act as hosts for user mailboxes. Exchange knows the work that each server performs from information that the directory holds; the name of a physical server represents the namespace services use to access data.
Clusters make systems more resilient by ensuring that each server in the cluster can perform the work of its peers, if necessary. You must, therefore, develop a method to allocate work to the cluster as a whole, rather than to an individual server, and then modify the software to permit individual cluster members to assume tasks as the cluster state changes. To allocate work to the cluster, Exchange uses a cluster alias because an alias lets you address the cluster as a named entity. In short, you alter the namespace represented by a physical server to accommodate the concept of a virtual server whose workload any server in the cluster can perform. In clustering terms, you define the virtual server as part of a cluster resource group.
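The effect of a virtual server name can be shown with a small Python sketch. The ClusterDirectory class is hypothetical and does not model the real MSCS or Exchange directory; the names EXCHANGE, CSSNT1, and CSSNT2 are the ones used in the test cluster described in this article. The point is that mailboxes are tied to the virtual name, so a failover changes only which node answers for it.

```python
class ClusterDirectory:
    """Toy directory that maps a virtual server name to the active node."""

    def __init__(self, virtual_name, nodes):
        if not nodes:
            raise ValueError("a cluster needs at least one node")
        self.virtual_name = virtual_name
        self.nodes = list(nodes)
        self.active = self.nodes[0]

    def resolve(self, server_name):
        """Return the physical node currently answering for server_name."""
        if server_name != self.virtual_name:
            raise KeyError(f"unknown server: {server_name}")
        return self.active

    def fail_over(self):
        """Hand the virtual server to the next node in the list."""
        index = self.nodes.index(self.active)
        self.active = self.nodes[(index + 1) % len(self.nodes)]
        return self.active


cluster = ClusterDirectory("EXCHANGE", ["CSSNT1", "CSSNT2"])
mailbox_home = {"alice": "EXCHANGE"}  # mailboxes reference the virtual name

before = cluster.resolve(mailbox_home["alice"])
cluster.fail_over()
after = cluster.resolve(mailbox_home["alice"])
print(before, "->", after)
```

Nothing stored against the mailbox changes during the transition, which is why no directory entries need rewriting when the cluster fails over.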
Another set of changes completes support for clustering in Exchange. Most of these changes are at the application level (the Exchange services such as the Message Transfer Agent—MTA—the Information Store, connectors, etc.) on top of underlying APIs that MSCS provides. For example, Exchange treats each service as a separate resource, so an administrator can fail one service such as the Internet Mail Server (IMS) without stopping the IS. However, you can't fail one service and restart it on the standby cluster server. All Exchange services must run on the active cluster server.
In summary, Exchange 5.5/E supports clustering with these new features:
- A Setup utility cluster prompt: If you have installed MSCS, Setup will prompt you to install the cluster-aware version of Exchange, as Screen 1 shows. You can't install the standalone version of Exchange on a server where MSCS is active.
- Support for the concept of virtual and physical namespaces: Installing Exchange Server onto a cluster creates a new resource group to hold details of Exchange's clusterwide resources. These resources include the disks used to store the application files and binaries and a network name. The network name functions as the name of the virtual server that runs Exchange in the cluster, and it appears as the server name when you view the cluster through the Exchange Administrator program. You also use the network name to specify the name of the cluster when you define a bridgehead server for a messaging or directory replication connector.
- Support for cluster state transitions: When the active server fails, the cluster goes through a transition. Responsibility for running applications passes to the passive standby member, which now becomes active. Exchange 5.5/E includes modifications to let its services gracefully fail over to the other server. Transitions can also occur when an administrator voluntarily stops an application to perform maintenance.
- Administrative support for virtual servers: Microsoft has changed the Exchange Administrator program to let server monitors work with virtual servers. However, you can stop and restart services only by using the MSCS administration program.
Software and Hardware Requirements
To build an Exchange cluster, you need NT Server 4.0, Enterprise Edition (NTS/E), Exchange 5.5/E, and cluster-compatible hardware. You must configure MSCS and get it running before you install Exchange. You need the enterprise edition of Exchange because the standard edition doesn't support clustering, and the enterprise edition lets you use the unlimited store, which is the other essential component required to build very large servers. I based this article on my experience with a Digital AlphaServer 4000 cluster, composed of two 466MHz AlphaServer 1000 processors with 512MB of memory and a StorageWorks 450 RAID array.
You can build clusters only from specific hardware, so review your existing configurations to determine whether you can configure your hardware in clusters. (For a list of cluster-compatible hardware, see http://www.microsoft.com/ntserver/info/hwcompatibility.htm.) Microsoft does not support upgrades for existing standalone servers to form clusters. (Microsoft might relax its insistence that cluster hardware come from a controlled compatibility list as it gains more experience with clusters.)
Strictly speaking, the hardware must be symmetrical (i.e., the servers in the cluster must be identical in CPU power and memory), a requirement Exchange imposes to enable automatic tuning. The hardware must be symmetrical because the Performance Wizard (PerfWiz) runs only on the primary node. PerfWiz can't run on the secondary node because it can't access the shared disks where the Exchange data resides.
PerfWiz attempts to determine optimum performance settings for Exchange, including initial memory buffer allocations and the best (and the fastest) location for important Exchange files, such as the Information Store. PerfWiz writes this information into the Registry. If the hardware is asymmetrical, the performance settings for the primary node might not match the characteristics of the secondary node after a failover occurs, and performance will inevitably suffer. However, if the two servers have similar memory and CPU power, you can probably accept less than 100 percent of the performance settings after a failover.
Exchange 5.5/E introduces dynamic buffer allocation. This feature is code that constantly monitors and adjusts Exchange's memory utilization in response to other programs' demands for NT's memory.
Dynamic buffer allocation negates some, but possibly not all, of the effects of failing over to asymmetric hardware. For best results, follow Microsoft's recommendations and use identical hardware for both nodes.
Installing Exchange into a Cluster
You must use the MSCS administration program to create a cluster group for Exchange before you begin the installation. Installing Exchange on the primary and secondary servers in a cluster requires different processes.
When you install Exchange on a server where MSCS is present, the Setup program creates the Exchange Server directory structure (usually \EXCHSRVR) on a shared cluster drive. You can't select a drive destination that isn't a shared cluster drive. Setup copies all the Exchange executables and data files to the selected drive. The Exchange executables used on a cluster are different from those used on a standalone system.
After Setup creates the directory structure, it creates and registers the Exchange services, copies system shared files into the local %SystemRoot%\SYSTEM32 directory, and creates resource dependencies within MSCS. For example, the Exchange MTA depends on the Information Store. If the store isn't running, the MTA can't start.
When Setup has completed these steps, you can run PerfWiz. Note that PerfWiz analyzes only disks that are defined in the Exchange resource group; it ignores disks local to a server. In a cluster, you can't locate files such as the transaction logs on any disk that isn't available to the cluster as a whole. If you place files such as the transaction logs or Exchange MTA work files on local disks, cluster failovers won't work because the data needed to complete the transition won't be available. Although clustering provides some resilience, the shared disks still represent a potential single point of failure. Good backup discipline remains critical in a clustered environment.
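A quick way to audit this rule is to compare each file location against the drives in the Exchange resource group. The following Python sketch does that; the drive letters, the subdirectory names, and the function files_off_shared_disks are illustrative assumptions, not values that Setup or PerfWiz reports.

```python
from pathlib import PureWindowsPath


def files_off_shared_disks(file_locations, shared_drives):
    """Return the entries whose drive is not one of the shared cluster drives."""
    shared = {drive.upper() for drive in shared_drives}
    return {
        name: path
        for name, path in file_locations.items()
        if PureWindowsPath(path).drive.upper() not in shared
    }


# Hypothetical layout: S: and T: belong to the Exchange resource group.
shared_drives = {"S:", "T:"}
file_locations = {
    "information store transaction logs": r"T:\EXCHSRVR\MDBDATA",
    "information store databases": r"S:\EXCHSRVR\MDBDATA",
    "MTA work files": r"C:\EXCHSRVR\MTADATA",  # local disk: breaks failover
}

problems = files_off_shared_disks(file_locations, shared_drives)
for name, path in problems.items():
    print(f"{name} is on a local disk: {path}")
```

Any entry the check reports would be unreachable from the standby node after a transition.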
A cluster can begin operating after you install Exchange on the primary node; you don't have to perform secondary installations immediately. Installation on a secondary server is simpler because most of the files that Exchange uses are already located on the shared cluster drive.
In a secondary installation, Setup copies system shared files into the local %SystemRoot%\SYSTEM32 directory. Then Setup creates resource dependencies and creates and registers Exchange services. Exchange uses wizards to configure the IMS and the Internet News Server (INS). Microsoft has altered these wizards to deal with clusters; run them only on the primary node. However, Microsoft offers an update node option to update the Registry on the secondary node.
The View from the Administration Program
If you complete the secondary installation correctly, you can view a screen like Screen 2. This screen shows two servers, CSSNT1 and CSSNT2. The cluster group clearly shows that CSSNT2 is currently active, and all the Exchange services are running (online) on that server. Screen 2 shows two network names, EXCHANGE and INTERNAL. EXCHANGE is the name of the virtual server, which an administrator would use to associate user mailboxes with a server or to define the name for a bridgehead server. INTERNAL is the name of the heartbeat connection between the two servers. When the standby server notices that its partner's heartbeat has stopped, Exchange redirects users to the newly activated server.
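The heartbeat check can be pictured as a simple timeout test. This Python sketch is only an illustration of the idea: I don't describe the actual MSCS heartbeat protocol in this article, so the HeartbeatMonitor class and the 5-second timeout are hypothetical.

```python
class HeartbeatMonitor:
    """Toy failure detector run by the standby node."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = None

    def record_heartbeat(self, now):
        """Note the time at which the partner's heartbeat arrived."""
        self.last_seen = now

    def partner_failed(self, now):
        """Report failure once the partner has been silent longer than the timeout."""
        if self.last_seen is None:
            return False  # nothing heard yet, so there is nothing to compare against
        return now - self.last_seen > self.timeout


monitor = HeartbeatMonitor(timeout_seconds=5.0)  # hypothetical timeout
monitor.record_heartbeat(now=100.0)

status_soon_after = monitor.partner_failed(now=103.0)
status_much_later = monitor.partner_failed(now=110.0)
print(status_soon_after, status_much_later)
```

Times are passed in explicitly so the logic can be exercised without waiting on a real clock.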
Screen 3 shows the view of the network interfaces from CSSNT1. Each interface has a separate IP address.
Upgrading Exchange into a Cluster
As I noted previously, Microsoft does not officially support upgrading existing servers to form clusters; the Exchange engineering group has imposed this restriction. New installations are always simpler than upgrades, and developers are designing new hardware to support the requirements of MSCS. The requirement to use symmetrical hardware chosen from the compatibility list is enough to stop most people from even thinking about upgrading existing hardware. But if you have some suitable hardware, how can you introduce a cluster into an existing Exchange environment? You can take two approaches.
First, you can introduce the cluster as a new server within an existing site. This option is easier because you need only to make sure that the cluster joins the right site when you install Exchange. Later, you can move mailboxes, public folders, and connectors to the cluster from older servers. When you get the system working, you can remove the older servers from the site.
The other option is a "forklift" upgrade. In this instance, you take the name of an existing server and use it for the virtual server in the Exchange resource group when you form the new cluster. You stop Exchange services on the old server, back up its directory and information stores, and then remove the server from the network. Then, you install Exchange with the Setup /r option, which moves files onto the shared disks but doesn't start the Exchange services. After you complete the installation on the primary node, you restore the directory and information stores in the appropriate directories on the cluster and then start the Exchange services. Now, you have effectively swapped the cluster in place of the old server in your organization. You can install Exchange onto the secondary cluster node later.
These steps aren't particularly difficult, but they take careful planning, and a forklift upgrade can take up to one day to perform, or even longer if your databases are very large (more than 10GB). If administrators have to choose between installing a cluster as a new server into an existing site or taking the downtime to upgrade an existing server, I think most will choose the first option.
Failover for Core Services
In clustered environments, Microsoft has divided Exchange Server into core and noncore services. Core services represent the kernel functionality of the server; they automatically restart when a failure occurs. Noncore services require manual intervention and will not restart until the systems administrator has made the necessary changes to a service's configuration. The core services are System Attendant, Directory Service, Information Store, MTA (including the X.400 connector), IMS, and Event Service.
Screen 4 shows the view of a cluster administrator as MSCS restarts the set of core services after a failover. You can see that the cluster has turned over responsibility for running the set of Exchange services to node CSSNT1; MSCS has not yet started two services, the Information Store and the MTA. Before MSCS restarts any core services, it updates the system Registry from the primary to the secondary node. The checkpoints that MSCS sets tell MSCS which portions of the Registry it must keep synchronized between the two nodes in the cluster. This action ensures that MSCS will respect any configuration changes you make to Exchange (such as the path to the disk holding the transaction logs) on the original primary node when you restart the services.
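The checkpoint mechanism can be sketched as copying selected Registry subtrees from one node to the other. The Python below models each Registry as a flat dictionary keyed by path; the key names and the function replicate_checkpoints are hypothetical and stand in for whatever keys MSCS actually checkpoints.

```python
def replicate_checkpoints(source, target, checkpoints):
    """Return a copy of target with every checkpointed subtree taken from source."""

    def covered(key):
        return any(key == cp or key.startswith(cp + "\\") for cp in checkpoints)

    # Drop stale values under checkpointed subtrees, then copy the fresh ones.
    synced = {k: v for k, v in target.items() if not covered(k)}
    synced.update({k: v for k, v in source.items() if covered(k)})
    return synced


checkpoints = [r"Services\ExampleStore"]  # hypothetical checkpointed key
primary = {
    r"Services\ExampleStore\LogPath": r"T:\LOGS",  # changed by the administrator
    r"Services\Unrelated\Setting": "primary-only",
}
secondary = {
    r"Services\ExampleStore\LogPath": r"S:\LOGS",  # stale value
    r"Services\Unrelated\Setting": "secondary-only",
}

secondary = replicate_checkpoints(primary, secondary, checkpoints)
print(secondary)
```

Only the checkpointed subtree moves across; settings outside it stay local to each node.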
Dependencies Define the Starting Order
Dependencies are properties of each service. A dependency tree defines the services' starting order—the System Attendant and Directory Services start first, followed by the Information Store and then the MTA. The tree records the needs of each service, and each service checks the tree before it restarts. If you have experience with Exchange, you understand the concept of dependencies because they exist even outside a cluster. The MTA cannot start before the Information Store is available because the MTA interacts with the store when it sends messages. Similarly, the Exchange Administrator is of little use if the Directory Service has not started, because the administration program doesn't have a place to retrieve configuration data from.
The matrix of Exchange dependencies is extended in a cluster. You must make the cluster name and IP address available before anything can start, and then you must bring the shared disks online. When you have met these conditions, the Exchange services can start. Screen 5 shows that the Information Store can start only after the Directory is available.
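A dependency tree of this kind maps naturally onto a topological sort. The Python sketch below uses the standard graphlib module to derive a starting order. The edges are my reading of the description above, not an export of the real MSCS resource definitions; in particular, placing the shared disk after the network name and the Directory Service after the System Attendant are simplifying assumptions.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later

# Each resource maps to the set of resources that must be online first.
depends_on = {
    "Network Name": {"IP Address"},
    "Shared Disk": {"Network Name"},
    "System Attendant": {"Shared Disk"},
    "Directory Service": {"System Attendant"},
    "Information Store": {"Directory Service"},
    "MTA": {"Information Store"},
}


def start_order(dependencies):
    """Return one valid order in which the resources can be brought online."""
    return list(TopologicalSorter(dependencies).static_order())


def can_start(resource, online, dependencies):
    """Check whether everything the resource depends on is already online."""
    return dependencies.get(resource, set()) <= set(online)


order = start_order(depends_on)
print(order)
print(can_start("MTA", ["IP Address", "Network Name"], depends_on))
```

TopologicalSorter raises CycleError if the dependencies contradict one another, which is a useful sanity check when you edit dependencies by hand.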
Failover for Noncore Services
The noncore services are Microsoft Mail connector, Lotus cc:Mail connector, Key Management Server (KMS), and INS (for Network News Transfer Protocol, or NNTP). In most cases, the manual intervention to restart noncore services is straightforward. For example, you must use the name of the active node for the postoffice name when you reconfigure the cc:Mail connector. The default for the KMS is not to start on failover because KMS needs a password when the service starts. Ordinarily, Exchange retrieves the password from a diskette, but automatic failover would require the diskette to always be available. In addition, to support Outlook Web Access after failures, you must configure Internet Information Server (IIS) as a separate cluster group. Although the INS is not a core service, it does not require any manual configuration to restart; you can reconfigure the INS to restart automatically. MSCS does not support the new connectors in Exchange 5.5 (Lotus Notes, IBM PROFS, and SNADS). MSCS also does not yet support the Dynamic RAS connector, the IMS or INS over dial-up connections, the X.400 connector when run over an X.25 link, or asynchronous connections to Microsoft Mail.
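When you plan a failover test, it helps to list which services should come back on their own and which will wait for you. This Python sketch encodes the core and noncore lists given in this article; the function plan_failover and its override parameter are hypothetical conveniences, not part of any Microsoft tool.

```python
CORE_SERVICES = (
    "System Attendant", "Directory Service", "Information Store",
    "MTA", "IMS", "Event Service",
)
NONCORE_SERVICES = (
    "Microsoft Mail connector", "Lotus cc:Mail connector", "KMS", "INS",
)


def plan_failover(installed, auto_restart_overrides=()):
    """Split installed services into automatic and manual restart lists."""
    automatic, manual = [], []
    for service in installed:
        if service in CORE_SERVICES or service in auto_restart_overrides:
            automatic.append(service)
        elif service in NONCORE_SERVICES:
            manual.append(service)
        else:
            raise ValueError(f"unclassified service: {service}")
    return automatic, manual


installed = [
    "System Attendant", "Directory Service", "Information Store",
    "MTA", "KMS", "INS",
]
# The INS has been reconfigured to restart automatically in this example.
automatic, manual = plan_failover(installed, auto_restart_overrides={"INS"})
print("restart automatically:", automatic)
print("need manual attention:", manual)
```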
How Long Do Cluster Transitions Take?
Conceptually, clusters promise transparent failover. In real life, however, failover takes some time. When the transition occurs, Exchange might be only one application of several that must move services, so the transition might take several minutes before clients can access their mailboxes.
But even when Exchange is the only application on the cluster, don't expect the transition to occur in seconds. The MSCS resource libraries monitor the correct operation of the Exchange services. The cluster-aware versions of the services contain probes, and MSCS monitors the probes at regular intervals to provide feedback about whether a service is functioning as expected. However, not all the Exchange software is capable yet of providing feedback to the cluster (the engineers didn't have time to go completely through the massive code base).
Failover moves all services, but not always quickly. Even on a standalone Exchange server, properly closing down a service—especially large information stores—can take some time. After the cluster has switched over, you might have to roll transactions forward to the Information Store or Directory. You have to reestablish network connections to link to other Exchange servers, sites, and external messaging systems. The exact time required for a cluster state transition varies greatly, depending on workload, time of day, and hardware configuration (faster CPUs and I/O subsystems will complete transitions more quickly). Testing will establish initial baseline figures for your configurations and environment. Clustering is still in its early days, and as you learn the optimum configurations, failovers will get faster.
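One way to gather baseline figures is to poll the virtual server from a client and time how long it stays unreachable. The Python sketch below is a minimal example under stated assumptions: measure_outage and tcp_probe are hypothetical helpers, and the demonstration uses a simulated cluster with a fake clock so that it runs instantly. A real test would pass a probe such as tcp_probe with your own cluster name and a port your clients use.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def measure_outage(probe, interval=1.0, timeout=600.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it succeeds; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


class SimulatedCluster:
    """Stand-in for a cluster that fails a fixed number of probes."""

    def __init__(self, failures_before_recovery):
        self.remaining = failures_before_recovery
        self.now = 0.0

    def probe(self):
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


sim = SimulatedCluster(failures_before_recovery=3)
outage = measure_outage(sim.probe, interval=5.0, clock=sim.clock, sleep=sim.sleep)
print(f"service unreachable for about {outage} seconds")
```

The resolution of the measurement is only as fine as the polling interval, so choose an interval well below the transition times you expect to see.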
Client Communication with Clusters
For the most part, clustering doesn't affect clients. Point the client at the cluster name rather than an individual node, and everything works. If you transfer user mailboxes from a nonclustered server to a cluster within the same site, Messaging API (MAPI) clients (Exchange and Outlook) automatically adjust their profile to point to the cluster. Post Office Protocol (POP) 3 and Internet Mail Access Protocol (IMAP) 4 clients point to a host name that's bound to a TCP/IP address. The cluster transition takes care of rebinding the TCP/IP address to the new virtual server so the switchover is invisible to these clients, too. If you configure IIS to run on both nodes and start automatically on failover, Web browsers can continue to use Outlook Web Access.
You can consider the Administrator program as a form of client. Like Outlook, the Administrator program uses remote procedure calls (RPCs) to communicate with the Exchange services. However, admin.exe is one of the binaries stored on the shared disk resources; thus, running the administration program on the passive node is difficult. Performing administration from another system such as an NT workstation might be better because a cluster transition won't affect this setup.
Good But Not Perfect
Clustering for NT has taken a long time to arrive. Its implementation for Exchange helps deliver more robust servers, but the process has some imperfections. Load-balancing across available servers is the most notable omission; clustering helps scalability only by increasing server resilience. The requirements to use new hardware selected from the compatibility list and to have NTS/E will keep some people from rushing to implement clusters. And the disk I/O subsystem remains a single point of failure that even the most sophisticated application can't work around. Administrators must continue to pay close attention to the type of disks, arrays, and controllers used in mission-critical servers to protect the data in the Exchange IS.
However, for those people willing to invest in the necessary hardware, the combination of clustering and the unlimited Information Store in Exchange 5.5 opens the way to servers that support many thousands of mailboxes. Over the next year or so, I will be interested to see how many installations deploy clusters and how many mailboxes the clusters support.