As you might have heard, Exchange Online experienced a 7-hour outage that affected North American customers on June 24. According to a post-incident report (FO7216) completed by Microsoft, the incident started at 2:00PM UTC and finished at 9:00PM UTC, although some customers might have experienced additional time offline at the start and end of this period. The episode followed an outage for Lync Online the previous day. On the upside, as one customer remarked, “at least SharePoint Online didn’t fail on June 25 in sympathy with its Office 365 co-servers.”
So far there’s no news as to whether Microsoft will compensate affected customers with a service credit for failing to meet the committed 99.9% SLA for Office 365. Perhaps this will happen after the SLA is assessed at the end of June. Such an extended outage certainly seems to fall under the definition of downtime contained in the “Service Level Agreement for Microsoft Online Services,” the relevant part of which is cited below:
| Service | Downtime |
| --- | --- |
| Exchange Online | Any period of time when end users are unable to send or receive email with Outlook Web Access. |
| Exchange Online Protection | Any period of time when the network is not able to receive and process email messages. |
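To put a 7-hour outage in perspective against a 99.9% monthly SLA, here’s a back-of-the-envelope calculation. This assumes a 30-day month; Microsoft’s actual SLA arithmetic may differ in detail.

```python
# Rough check of a 7-hour outage against a 99.9% monthly SLA.
# Assumes a 30-day month; Microsoft's actual calculation may differ.

minutes_per_month = 30 * 24 * 60               # 43,200 minutes in a 30-day month
allowed_downtime = minutes_per_month * 0.001   # 43.2 minutes permitted at 99.9%
outage_minutes = 7 * 60                        # the reported 7-hour outage

availability = 100 * (1 - outage_minutes / minutes_per_month)
print(f"Allowed downtime at 99.9%: {allowed_downtime:.1f} minutes")
print(f"Availability for the month: {availability:.2f}%")
```

In other words, a single 7-hour incident burns through the monthly downtime allowance almost ten times over, which is why the question of service credits arises at all.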
A better definition of downtime for Exchange Online is included in the “Microsoft Exchange Online Dedicated Plans Version Service Level Agreement (SLA)”, which says:
“‘Downtime’ is defined as any period of time when users are unable to send or receive email via all supported mailbox access which is calculated using Exchange application availability in database minutes and combined data where applicable from server, operating system, application, network segments and infrastructure services managed by Microsoft.”
Getting back to the root cause of the problem: Rajesh Jha, the Microsoft corporate VP in charge of Office 365 engineering, described both incidents in a June 26 post. After reading it, you might wonder how “an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests” might cause a major outage, even if “the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw.” Here’s my best understanding of the situation.
All email services depend on some form of directory service to determine how to route messages, including the ability to drop inbound messages for unknown recipients. In this instance, it seems that a portion of the Azure Active Directory infrastructure that serves Exchange Online became unresponsive, meaning that lookup requests failed. The Exchange Online Protection (EOP) service could not process inbound and outbound messages, and message queues accumulated. And while the initial problem affected only a small group of tenants, it exposed a previously unknown flaw in the mail delivery flow that escalated the impact to many more.
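The dependency is easy to sketch. The code below is purely illustrative (it is not Microsoft’s code, and names like `lookup_recipient` and the retry queue are my invention), but it shows why an unresponsive directory stalls mail flow without losing messages: every message needs a recipient lookup before it can be routed, so failed lookups push messages onto a queue to be retried once the directory recovers.

```python
# Illustrative sketch (not Microsoft's code) of directory-dependent mail flow:
# a failed recipient lookup queues the message rather than dropping it.
from collections import deque

class DirectoryUnavailable(Exception):
    pass

def lookup_recipient(directory, address):
    if directory is None:                  # directory partition not responding
        raise DirectoryUnavailable(address)
    return directory.get(address)          # None means unknown recipient: drop

def process_mail(directory, inbound, retry_queue):
    for message in inbound:
        try:
            mailbox = lookup_recipient(directory, message["to"])
        except DirectoryUnavailable:
            retry_queue.append(message)    # queue for delivery after recovery
            continue
        if mailbox is not None:
            message["delivered_to"] = mailbox

directory = {"alice@contoso.com": "mbx-01"}
retry_queue = deque()
mail = [{"to": "alice@contoso.com"}]
process_mail(None, mail, retry_queue)          # directory down: message queued
process_mail(directory, retry_queue, deque())  # directory back: message delivered
```

This is consistent with what customers observed: mail was delayed while the directory was degraded, then delivered once it recovered.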
As noted in the Microsoft incident report:
“… some customers experienced delays when sending or receiving email messages to recipients who are outside of their organization. This also impacted hybrid customers sending mail between their hosted and on-premises mailboxes. Investigation determined that a portion of Exchange Online directory infrastructure was in a degraded state, causing impact to Exchange Online Protection mail flow.”
Some reports said that EOP was marking all inbound and outbound email as spam, meaning that the messages would not be processed. However, this seems to be an old wives’ tale as the queued messages were eventually delivered once the directory service began to respond properly again.
The same kind of directory access problems can be seen in on-premises deployments where Active Directory becomes unresponsive or unable to cope with the volume of inbound requests. Back in the early days of Exchange 2007, I was involved in a project that deployed Exchange 2007 in a form of “private cloud” where all clients connected across the Internet using Outlook, OWA, or ActiveSync. Everything went reasonably OK until we reached a load of 35,000 clients, at which point the infrastructure became unstable and inbound client connections failed. This was despite deploying a farm of over 20 ISA proxy servers and some 30-odd CAS servers to handle the expected connectivity load.
To cut a long story short, weeks of investigation proved that a) ISA Server was incapable of handling the kind of connections we wanted to process – we solved that by deploying an F5 BIG-IP load balancer – and b) our Active Directory infrastructure couldn’t handle the client authentication load. That problem was solved by increasing the MaxConcurrentAPI setting on the domain controllers.
Directory problems can be really difficult to diagnose and sort out, so I have a certain sympathy for the Exchange Online engineers. The problem was exacerbated by the length of the outage, the fact that it occurred during the working day, and some communication difficulties with the Office 365 Service Health dashboard, which blithely reported to some customers that all was well in the world of Exchange Online. Let’s hope that Microsoft can address these issues.
Running a directory service for a massive multi-tenant cloud environment is not a straightforward task. Office 365 has done pretty well to avoid this kind of problem since its introduction in June 2011, but a three-year run without problems does not excuse the kind of outage that has just happened. Rajesh Jha notes that Microsoft has to “harden” the layers where the issue erupted. No excuses asked for or intended. It just has to happen. And as it did after the major Office 365 outages in the autumn of 2011, Microsoft should move quickly to compensate affected users.
Microsoft has to work on its communications too. Apart from fixing the service dashboard, someone might point out to the employee who told some people that “on-premises Exchange won’t be available for too much longer” that a) Microsoft has a ten-year support policy for server software, b) not everyone wants to embrace the cloud as thoroughly as its zealots would like, and c) dumping on-premises software is a great way to force the installed base to consider non-Microsoft alternatives. Thankfully, the person who made this statement is unlikely to have any vote in the long-term strategy for on-premises software.
Follow Tony @12Knocksinna