Exchange Server and Uptime: The Search for More 9s

In some businesses, there's constant pressure to increase the uptime of messaging and other systems. I've worked with law firms, financial organizations, and other customers for whom time really is money, and their focus is often on squeezing the most possible uptime from their Microsoft Exchange Server organization. With that in mind, I want to start a discussion of how many 9s of uptime Exchange Server 2010 can offer.

Recall that four 9s is 99.99 percent uptime, meaning that the system is down for no more than 52 minutes and 36 seconds per year. That works out to a paltry 9 seconds or so per day! A 99.9 percent uptime would allow just under 9 hours of downtime per year, which still isn't enough for most maintenance purposes. How is it that companies are seeking, and vendors are claiming, 99.9 percent or better uptime?
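If you want to check that arithmetic for your own targets, here is a minimal Python sketch. The function name `downtime_budget` is my own invention, and I'm assuming an average year of 365.25 days; a 365-day year gives slightly smaller numbers.

```python
from datetime import timedelta

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included


def downtime_budget(availability_percent, period_days):
    """Return the downtime allowed in period_days at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=period_days * SECONDS_PER_DAY * unavailable_fraction)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        per_year = downtime_budget(pct, DAYS_PER_YEAR)
        per_day = downtime_budget(pct, 1)
        print(f"{pct}%: {per_year} per year, "
              f"{per_day.total_seconds():.2f} seconds per day")
```

Plugging in other percentages (99.999, say) is a quick way to see how fast the budget shrinks with each extra 9.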

Let's start with a definition of what qualifies as uptime. The first time you have to install the monthly security patches—much less an Exchange rollup or a service pack—you'll blow right through your 9-seconds-per-day downtime limit on a single server. For that reason, Exchange lets you use multiple or clustered servers, and almost everyone excludes planned maintenance from uptime calculations.
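To see how much that exclusion changes the number, here is a rough Python sketch. The function `availability_percent` and the sample figures (a 30-day month, two hours of patching, three minutes of unplanned outage) are hypothetical, and I'm assuming that excluded maintenance time is also removed from the measured period, which is one common convention but not the only one.

```python
def availability_percent(period_seconds, unplanned_seconds,
                         planned_seconds=0.0, exclude_planned=True):
    """Availability over a period, optionally ignoring planned maintenance."""
    if exclude_planned:
        # Assumption: planned maintenance is outside the agreed service time,
        # so it counts neither as downtime nor as measured time.
        measured = period_seconds - planned_seconds
        downtime = unplanned_seconds
    else:
        measured = period_seconds
        downtime = unplanned_seconds + planned_seconds
    return 100.0 * (measured - downtime) / measured


if __name__ == "__main__":
    month = 30 * 24 * 60 * 60   # hypothetical 30-day reporting period
    patching = 2 * 60 * 60      # hypothetical: 2 hours of planned patching
    outage = 3 * 60             # hypothetical: 3 minutes of unplanned outage

    counted = availability_percent(month, outage, patching, exclude_planned=False)
    excluded = availability_percent(month, outage, patching, exclude_planned=True)
    print(f"Planned maintenance counted:  {counted:.4f}%")
    print(f"Planned maintenance excluded: {excluded:.4f}%")
```

The same month of operations lands on opposite sides of four 9s depending on which definition you pick, which is why the definition has to come first.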

With that definition in mind, how many 9s is it reasonable to expect from Exchange? The real answer is a resounding "Who cares?" Not because uptime is unimportant, but because it's the wrong measurement. Rather than counting the seconds of downtime that you can tolerate, your efforts should be focused on two areas: recovery time objective (RTO) and recovery point objective (RPO).

RTO, of course, is the amount of time you're willing to allocate to recovery operations. This figure can range from seconds to days. For example, a complete restoration from a massive failure (like, say, a large office fire that melts all your servers) might take days, but failing over users from one Database Availability Group (DAG) member to another might take only seconds. You get to choose the RTO that's most appropriate for your business, then spend the right amount to ensure that you're protected.

RPO is a bit different, but equally important: It represents the amount of data loss you're willing to tolerate. For example, an RPO of four hours means that you're able to tolerate the loss of up to four hours of mail data. RPOs can range from seconds to weeks (imagine taking a full backup only once per month).
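One way to make RTO and RPO concrete is to write down each failure scenario with its expected recovery time and its worst-case data loss, then compare those against your objectives. The sketch below does that in Python. The class and function names, and every figure except the four-hour RPO from the example above, are hypothetical placeholders rather than measured Exchange numbers.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryScenario:
    name: str
    recovery_time: timedelta   # how long you expect recovery to take
    data_loss: timedelta       # worst-case age of the newest recoverable data


def meets_objectives(scenario, rto, rpo):
    """True if the scenario recovers within the RTO and loses no more than the RPO."""
    return scenario.recovery_time <= rto and scenario.data_loss <= rpo


if __name__ == "__main__":
    rto = timedelta(hours=1)   # hypothetical objective
    rpo = timedelta(hours=4)   # the four-hour RPO used as an example above

    scenarios = [
        # Illustrative guesses only; substitute figures from your own tests.
        RecoveryScenario("DAG failover", timedelta(seconds=30), timedelta(seconds=30)),
        RecoveryScenario("Restore from nightly backup", timedelta(hours=6), timedelta(hours=24)),
    ]
    for s in scenarios:
        verdict = "meets" if meets_objectives(s, rto, rpo) else "misses"
        print(f"{s.name}: {verdict} RTO {rto} / RPO {rpo}")
```

The useful part isn't the code; it's being forced to write down a recovery time and a data-loss figure for every scenario you care about.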

Together, these two factors make up a significant chunk of your service level agreement (SLA). You might not have a formal, written SLA, but I would bet a box of Krispy Kreme doughnuts that you have an implicit SLA that your messaging operations are expected to meet—even if you don't find out about it until an emergency happens. Fallout over implicit SLAs often takes the form of loud arguments about uptime after a failure, threats of firing, and so on, although the results can be more subtle.

Notice that I didn't spend any time in the preceding paragraphs telling you how many 9s Exchange 2010 can deliver. That's because the answer is a big fat "It depends." In future UPDATEs, I'll be delving into this topic in more detail. In the meantime, though, I'd love to hear what your RTO and RPO are, and what your SLA (if any) says they should be.
