Best Practices for High Availability

Master the uptime equation

A sizable portion of the analyst community has long been skeptical of Microsoft Exchange Server's ability to deliver a highly available messaging solution. In July 1999, GartnerGroup cautioned companies against consolidating smaller Exchange servers (e.g., systems that support fewer than 500 users) into large systems because of the difficulty of managing large databases and the poor performance that users often experience when Messaging API (MAPI) clients connect over extended WAN links. Exchange 2000 Server addresses many of the concerns that analysts have expressed. Better clustering, the partitioning of the Information Store (IS) into easily manageable databases and storage groups (SGs), and better integration with the OS all contribute to a more resilient service.

Advances in Windows 2000 (Win2K) and Exchange 2000 are important but are only part of the overall equation that determines how to deliver highly available systems. In this article, I want to reflect on how to achieve highly reliable systems with Exchange 2000 and Exchange Server 5.5.

An Uptime Survey
Before we look at an approach to uptime, let's look at what people are achieving today. In late 1999, the California research company Creative Networks conducted a survey of 63 companies. The companies' mean level of Exchange Server uptime was 99.6 percent. A full 56 percent of the companies were exceeding their uptime target. The survey left this target unstated, but it was probably higher than 99 percent. Clearly, this survey covered only a tiny portion of the Exchange Server installed base, but the 99.6 percent mean uptime level was greater than I expected. The survey also reported that the companies experienced an average of 71 minutes of unscheduled downtime per month, as well as an additional 112 minutes of scheduled downtime.

I want to know what problems caused the unscheduled downtime and why the administrators needed to take Exchange Server down on a scheduled basis each month. Installing service packs and hotfixes is a good reason, but I wonder whether some administrators are unnecessarily running Eseutil to compact databases. Unscheduled downtime of 71 minutes is a lot, especially if it occurs at peak times during the day, such as 9:00 a.m., when users are attempting to read their email after they've arrived at work.
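To put the survey's downtime figures in perspective, the following minimal sketch converts minutes of downtime per month into an uptime percentage. The 30-day month and the function name uptime_percent are assumptions made for illustration; only the 71-minute and 112-minute figures come from the survey.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, days_in_month=30):
    """Return uptime as a percentage of a month of the given length."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


unscheduled = 71   # minutes per month, from the survey
scheduled = 112    # minutes per month, from the survey

print(f"Unscheduled only: {uptime_percent(unscheduled):.2f}%")
print(f"All downtime:     {uptime_percent(unscheduled + scheduled):.2f}%")
```

Under the 30-day assumption, the combined 183 minutes of downtime works out to roughly 99.58 percent uptime, which is in line with the 99.6 percent mean that the survey reported.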

The Uptime Equation
Table 1 illustrates acceptable downtime at different levels of availability. Typically, highly available systems seek to attain uptime of 99.99 percent or greater. Few Windows NT systems—let alone those that run Exchange Server 5.5—ever attempt to meet such a lofty goal. And software isn't the only underlying reason.
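The arithmetic behind a downtime budget is simple enough to sketch. The following example computes the downtime that a given availability level permits over a year. The availability levels listed and the 365-day year are assumptions chosen for illustration, and the function name allowed_downtime_hours is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted at the given availability."""
    return period_hours * (1 - availability_percent / 100.0)


# Illustrative availability levels, from "two nines" to "five nines".
for level in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(level)
    print(f"{level:>7}% -> {hours:8.2f} hours/year ({hours * 60:9.1f} minutes)")
```

At 99.99 percent, the budget is less than an hour of downtime for the whole year, scheduled or otherwise, which helps explain why so few systems attempt it.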

The classic equation that expresses the factors that determine uptime is

Uptime = Software + Hardware + Operations + Environment

This equation shows that a combination of software, hardware, operations, and operating environment determines uptime. The failure of any one of these elements affects uptime. You can blame Microsoft for bugs in NT or Exchange Server, but you can't blame the company if a hardware fault causes Exchange Server to stop, if an operator fails to perform a backup the day before a disk corruption occurs, or if a network failure stops messages from transporting across a backbone or to the Internet.
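The equation above is qualitative, but one common way to put numbers on it is to treat the four elements as links in a chain: If every element must be working for the service to be up, and failures are independent, the overall availability is the product of the individual availabilities. The sketch below takes that approach. The series model is an assumption rather than something the equation itself states, and the per-element figures are hypothetical.

```python
from math import prod


def combined_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes that component failures are independent of one another.
    """
    return prod(component_availabilities)


# Hypothetical figures, chosen only for illustration.
components = {
    "software": 0.999,
    "hardware": 0.999,
    "operations": 0.999,
    "environment": 0.999,
}

overall = combined_availability(components.values())
print(f"Overall availability: {overall * 100:.2f}%")
```

Under this model, the overall figure is always lower than the weakest element's figure, so improving only one element (e.g., buying better hardware) can't compensate for weaknesses in the others.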

Companies that achieve high availability take a rigorous approach to mastering the uptime equation. As an example of how successful some companies are at achieving high availability, one company's OpenVMS servers were up continuously from 1981 to 1999 (more than 18 years). Systems administrators took the servers down only to perform Y2K-compatibility checks on an application that controlled a manufacturing process. Clearly, those administrators operated on the "If it ain't broke, don't fix it" principle: They never applied upgrades or patches to the OS or the application, and they resisted the temptation to upgrade hardware components. You can probably decide not to upgrade anything only on systems that operate within restricted networks (or no network at all) and in special circumstances. Given the number of service packs and hotfixes that Microsoft has issued for NT and Exchange Server over the past 5 years, the notion of keeping a server online all the time is difficult to fathom. You'd still be running Exchange Server 4.0 (with no patches) on NT 3.51 Service Pack 4 (SP4), you wouldn't be secure, and you wouldn't be Y2K-compliant. Of course, a comparison of OpenVMS and NT is unfair, largely because of the increased pace of development in both hardware and software today. In 1981, systems administrators typically saw one software upgrade and one new VAX computer per year. Now, new hardware debuts monthly, and service packs, hotfixes, and completely new OS releases (e.g., Win2K) occur frequently.

To build highly available servers, companies often plan their implementations based on the following simple principles:

Never deploy software without consideration. Carry out design and planning exercises to ensure that you can deploy software in a manner that delivers quality service.
Carefully test software before you put it into production. Test all aspects of the combination of OS, Exchange Server, service packs, and third-party software that will deliver the messaging service to users.
To protect your databases, always use high-quality hardware for Exchange servers, and pay careful attention to the disk subsystem (i.e., controller and disks). Monitor firmware updates to controllers and disks, and apply the updates regularly during scheduled maintenance. Be sure to protect the hardware from power surges or other electrical faults.
Take a highly disciplined approach to systems management and monitoring. Perform and verify backups. Scan event logs daily, and proactively identify any problems that might lurk in the background. Use regularly updated antivirus software. Run disaster-recovery exercises, and document the results. Record statistics monthly.
Pay close attention to any environmental variable that might affect the servers' smooth operation. Monitor the network, and do nothing to underlying parts of the infrastructure (e.g., DNS server, WINS server) that might affect client or server connectivity. Train users to make effective use of system and network resources.

Microsoft Classifications
In white papers such as "Microsoft Windows NT High Availability Operations Guide: Implementing Systems for Reliability and Availability" (http://www.microsoft.com/ntserver/nts/deployment/planguide/highavail.asp), Microsoft has attempted to bring these guidelines to the forefront. This paper defines six classifications for operational procedures: Planning and Design, Operations, Monitoring and Analysis, Help Desk, Recovery, and Root Cause Analysis. You could apply these classifications, which aren't unique to NT, to any enterprise-level OS—they lay down the foundation of a plan to achieve high availability.

Planning and Design. Clearly, you need to start with a good design, and you achieve a good design only through detailed planning. As Win2K and Exchange 2000 debut, you might want to assign a special team (e.g., employees responsible for the network, namespace, OS, and applications) to perform the detailed design work. All too often, a company's network team generates one design while the OS team generates another, and both designs are supposed to work seamlessly together to form a basis for application deployment. The unfortunate applications team typically gets the dirty end of the stick: This team must work within the constraints of design work that the other teams have performed without regard to the applications team's requirements. Because Active Directory (AD) integrates data from the OS and basic functions such as DNS with data from AD-integrated applications such as Exchange 2000, AD's design and implementation require that different teams work together to create a unified design. Companies that generate plans based on all needs will achieve much higher reliability than companies that expect fractured planning to automatically gel.

Operations. Drawing up a set of operational procedures is a good start. However, you must execute the procedures to achieve success.

Monitoring and Analysis. If you master the basics—such as performing full daily online backups, running Performance Monitor to keep an eye on the system, and checking the event logs for unexplained errors—you'll experience less downtime and fewer system outages than if you simply trust that Microsoft software never goes wrong.

Help Desk. You need to establish well-defined escalation paths so that people know what steps to take in problem resolution when the first potential solution fails. Systems administrators perform first-level resolution, but what happens when you find a backup tape that contains bad data that you can't restore? Who takes care of directory or public folder replication that doesn't work? Who sorts out a DNS problem that prevents servers from finding one another, thereby halting message routing?

Recovery. A tested disaster-recovery plan is an important component of your escalation procedure. As a speaker at the 1999 Microsoft Exchange Conference said, "An untested disaster-recovery plan is simply a set of well-organized and documented prayers."

Root Cause Analysis. Mainframe and minicomputer administrators are accustomed to analyzing why problems occur so that they can take measures to prevent the problems from recurring. This discipline comes from the earliest days of computers, when CPU time and memory were precious resources that you didn't want to waste with buggy programs or with insufficient or inaccurate operating procedures. Are we taking the same care with our Windows systems? In May 1997, the late and lamented Byte Magazine concluded that NT administrators are extremely prone to taking shortcuts to get systems back online after an outage and don't take the time to understand why the problem occurred. Perhaps this behavior is a throwback to Windows' history, when a quick reboot was often the only way to stop a looping program or to regain memory or other system resources that were gently leaking away. The temptation is to reach for the power switch. However, cycling the power not only hides the root cause but can also generate new problems. Win2K is more complex than NT, and the relationship between the OS and applications is deeper than ever. Going for the quick fix doesn't make sense, particularly if you're still learning how the software works. Instead, systems administrators need to understand why problems occur and how best to address them.

Set Your Priorities
You might wonder why a messaging system needs to attain an uptime record greater than 99.9 percent. The obvious answer is that email is like your telephone—users expect it to be available all the time, just like a dial tone. Even so, efforts to attain greater than 99.9 percent uptime might be a matter of expending much effort for little gain. Email isn't as time-critical as other applications. A material-wasting computer failure in a manufacturing plant's production process is an obvious example of a situation in which uptime is cost-critical, but a messaging system hardly falls into that category. Losing a messaging service irritates users and slows the delivery of some messages, but it's hardly the end of the world. Phone or fax messages can always cover the outage. Alternatively, you can maintain a free MSN Hotmail account for those occasions when your corporate email system is unavailable.

A system is a set of pieces that fit and work together. If you expect a system such as Exchange Server to work perfectly all the time, particularly if you concentrate on only one piece of the puzzle—such as the hardware design for servers—you're headed for a disappointing experience in which you'll probably lose some data. Mastering the uptime equation is essential to a successful messaging-system implementation.
