Editor's Note: Because of next week's US Independence Day holiday, the next issue of Exchange and Outlook UPDATE, Exchange Edition, will be on July 12, 2002.
How do you measure reliability in your Exchange Server environment? You can use a standard military specification that relates failure rates and the mean time between failures (MTBF) to the total operational period. However, such reliability measurements are best suited to individual physical system components. Do these values really mean anything to you as an Exchange administrator?
I prefer to think about availability rather than reliability. However, most Exchange administrators look at availability from a binary point of view (i.e., is your Exchange server up or down?) or as a measurement of the percentage of time a system is available for a given operational period (e.g., 99.999 percent). These views might not be the most effective way to measure availability in your Exchange environment.
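To see what a percentage target such as 99.999 percent actually permits, it helps to convert it into minutes of downtime. The following minimal sketch assumes a 365-day operational year (525,600 minutes); the function name is hypothetical. At 99.999 percent, the allowance works out to roughly 5.26 minutes per year.

```python
# Convert an availability percentage into the downtime it permits over an
# operational period. Assumption: a 365-day year, chosen for illustration.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

A single figure like this still says nothing about *why* the downtime occurred, which is the gap the next paragraph addresses.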
When you want to accurately measure Exchange availability, where should you start? First, you need to understand that downtime isn't simply about the server. The poor Exchange server often takes complete blame for an entire outage period, the majority of which isn't necessarily the server's fault. Suppose your Exchange server is down (i.e., unavailable) for 8 hours. Rather than simply blaming Exchange, look deeper. You might discover that you weren't notified of the problem for 2 hours, you took 1 hour to decide what to do, 2 hours to find a good backup tape, and another 3 hours to restore the server to operational status. In this case, you can attribute only a small part of the downtime to software or hardware; most relates to personnel, procedural, and process issues (e.g., monitoring, alerting, disaster recovery). When you understand that downtime and outages actually consist of multiple components, you start to rethink how you measure downtime. For an Exchange deployment, you need to identify the components of downtime, then figure out how to address each component to reduce downtime and thereby increase availability.
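Breaking an outage into its phases makes it clear where the time actually went. The sketch below records the 8-hour example as a hypothetical outage log (the phase names are illustrative, not a standard taxonomy) and computes each phase's share of the total. In this example, the 5 hours spent before the restore even began account for 62.5 percent of the outage.

```python
from typing import Dict, List, Tuple

# Hypothetical outage log: (phase, hours). Durations mirror the 8-hour
# example in the text; the phase names are illustrative assumptions.
OUTAGE: List[Tuple[str, float]] = [
    ("detection/notification", 2.0),
    ("decision", 1.0),
    ("locating backup tape", 2.0),
    ("restore", 3.0),
]


def downtime_breakdown(phases: List[Tuple[str, float]]) -> Tuple[float, Dict[str, float]]:
    """Return total hours and each phase's fraction of the total outage."""
    total = sum(hours for _, hours in phases)
    if total <= 0:
        raise ValueError("outage must have a positive duration")
    shares: Dict[str, float] = {}
    for name, hours in phases:
        # Accumulate in case the same phase appears more than once.
        shares[name] = shares.get(name, 0.0) + hours / total
    return total, shares


if __name__ == "__main__":
    total, shares = downtime_breakdown(OUTAGE)
    print(f"total outage: {total} hours")
    for name, share in shares.items():
        print(f"  {name}: {share:.1%}")
```

Tracking phases this way points improvement efforts at the largest components (here, notification and tape handling) rather than at the server alone.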
You also must come to terms with how you measure availability. Most people think that Exchange availability is synonymous with Exchange server uptime. However, this definition might not provide an accurate picture of true availability (i.e., the availability of Exchange services to users). An Exchange server might be up and running, but that doesn't mean that users can get the services they require. For example, if a user's mailbox is accessible but the user can't get to an important public folder on another server, the Exchange service (or a subset of it) is unavailable, even though the user's mailbox server is running just fine. Likewise, if Exchange points of access (e.g., mailbox and public folder servers) are fully operational but all the bridgehead (i.e., routing) servers are down, mail won't flow between sites, between routing groups, or to and from the Internet, so the Exchange service isn't completely available. The measurement of availability in your environment should be well thought out and should provide a picture of Exchange availability from your business's perspective.
Should you measure Exchange availability from the server's point of view (which can be a bit myopic), or should you primarily consider the client's or user's perspective? The best approach seems to be to incorporate both viewpoints by focusing on the availability of Exchange "service elements," an approach that's in line with defining appropriate service levels for your Exchange environment. Service elements might include message routing, mailbox access, public folder access, protocol access (e.g., POP3, IMAP4, HTTP, Messaging API [MAPI]), recovery services, security protection (e.g., protecting against viruses, blocking spam), and other Exchange functionality that you can treat as a service for measurement purposes.
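One way to put service elements into practice is to record, for each sampling interval, whether each element was usable, then report both per-element availability and the share of intervals in which the full service was available. The following is a minimal sketch under assumptions: the probe results, element names, and function names are hypothetical, and how you actually test each element (e.g., a synthetic logon or a test message) is left to your environment.

```python
from typing import Dict, List

# Hypothetical probe results: one boolean per sampling interval for each
# service element (True = the element was usable from the client side).
Samples = Dict[str, List[bool]]

SAMPLE: Samples = {
    "mailbox access":       [True, True, True, True],
    "public folder access": [True, False, True, True],
    "message routing":      [True, True, False, True],
}


def element_availability(samples: Samples) -> Dict[str, float]:
    """Percentage of intervals in which each service element was usable."""
    result: Dict[str, float] = {}
    for element, checks in samples.items():
        if not checks:
            raise ValueError(f"no samples for {element}")
        result[element] = 100.0 * sum(checks) / len(checks)
    return result


def full_service_availability(samples: Samples) -> float:
    """Percentage of intervals in which every service element was usable."""
    lengths = {len(checks) for checks in samples.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every element needs the same, nonzero number of samples")
    intervals = lengths.pop()
    # zip(*...) walks the samples interval by interval across all elements.
    all_up = sum(all(interval) for interval in zip(*samples.values()))
    return 100.0 * all_up / intervals


if __name__ == "__main__":
    for element, pct in element_availability(SAMPLE).items():
        print(f"{element}: {pct:.1f}%")
    print(f"full service: {full_service_availability(SAMPLE):.1f}%")
```

In this made-up sample, mailbox access (the closest proxy for server uptime) shows 100 percent, yet the full service was available in only half the intervals, which is exactly the gap between server uptime and service availability described earlier.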
I'm not talking about something new or revolutionary—I'm encouraging you to change the way in which you think about and measure Exchange availability. Whether you use Exchange 2000 Server or Exchange Server 5.5, reevaluating your definition of downtime and availability and focusing on the availability of service elements can help you get a more accurate picture of how your Exchange environment operates and measure availability in a way that's meaningful to your organization and how it does business.