In an outage that reminded many of the June 2014 problems that affected Exchange Online users in the U.S. for seven hours, the Azure Active Directory infrastructure serving Office 365 and other Microsoft cloud services in Europe experienced major problems on December 3. The effect of the outage was not uniform and appeared to be confined to Northern and Western Europe.
The dust has now settled on this outage and things have returned to normal. However, the circumstances surrounding the issue deserve some commentary.
First, the facts. The outage started at approximately 8:59 UTC. The UK-based DownDetector site, which logs outages, has some nice graphics of how the issue grew in intensity over time before abating after five hours or so at around 13:15 UTC. The initial assessment posted by Microsoft cited “A configuration error led to incorrect routing of production traffic. This resulted in the inability to access services dependent on Azure Active Directory authentication and services.”
A configuration error covers a multitude of sins, anything from a mistake made editing an XML file to a fundamental flaw in infrastructure design exposed by a change in circumstances. Whatever it was, the effect was that the Azure Active Directory infrastructure was overwhelmed when client demand for authentication began to ramp up. As authentications failed, clients were unable to connect to services. The problem was most evident in Europe because the load on Office 365 increased at the start of the European working day. North American and Asian tenants don’t appear to have suffered any impact.
Microsoft has now released the post-incident report (PIR IS34783) to customers. As with any customer-facing document, some internal details are not revealed, but there's enough in the PIR to make it interesting reading.
The initial diagnosis as reported said that a configuration error led to incorrect routing of production traffic. As demand grew, the situation degraded to become an outage. After Microsoft realized what had happened, they reversed the erroneous configuration and service was restored. The PIR says that the configuration change was made at 11:49. Although some improvements occurred soon thereafter, it took another hour or so before services were fully restored.
The PIR reveals that two primary factors contributed to the root cause and prevented users from signing on to Office 365.
First, a recent update exposed an unexplained configuration problem between the production and pre-production Azure Active Directory (authentication) environments. The pre-production environment is a test infrastructure that allows Microsoft to validate changes and new functionality before code is introduced into production. The configuration error caused some user requests to be routed to the pre-production environment (where obviously they should never have gone) and created a backlog of authentication requests on the Azure Active Directory front-end servers.
Second, as the backlog accumulated, it caused high system resource utilization that further compounded the issue experienced by the service and led to intermittent authentication request failures within the European datacenters. Those failures are what caused users to be unable to connect to Office 365.
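The way a backlog builds once demand exceeds the capacity left to serve it can be illustrated with a toy queue model. This is only a sketch: the request counts and capacities are invented for illustration and do not come from the PIR, and `simulate_backlog` is a hypothetical helper, not anything Microsoft uses.

```python
from typing import List


def simulate_backlog(arrivals: List[int], capacity: int) -> List[int]:
    """Return the backlog after each interval for a fixed per-interval capacity.

    Requests that cannot be served in an interval carry over to the next one.
    """
    backlog = 0
    history = []
    for demand in arrivals:
        # Unserved requests accumulate; the backlog can never go below zero.
        backlog = max(0, backlog + demand - capacity)
        history.append(backlog)
    return history


# Hypothetical authentication requests per interval as a working day starts.
morning_ramp = [200, 400, 800, 1200, 1200, 1200]

# Full capacity copes with the ramp; reduced capacity (assumed here to stand in
# for misrouted traffic) falls behind once demand peaks.
healthy = simulate_backlog(morning_ramp, capacity=1500)
degraded = simulate_backlog(morning_ramp, capacity=900)

print("healthy :", healthy)
print("degraded:", degraded)
```

In this model the degraded system looks fine while demand is low and only starts to fall behind when the morning peak arrives, which matches the pattern of a problem that surfaced as the European working day began.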
Among the questions that are not answered in the post-incident report are what configuration error occurred and how it came to impact a core piece of Microsoft's cloud infrastructure at a peak time in the European working day.
It seems pretty clear that the root cause of the December 3 incident was human error in the way that Azure Active Directory is managed. The folks who build and run this service are very smart, but life tells us that even the smartest people sometimes get things wrong. The PIR says that Microsoft has addressed the configuration issue and will add further "fault injection techniques" to improve testing, along with extra fallback mechanisms to ensure that an older version of the authentication service will be used in case a failure occurs in the latest version. They also plan to improve isolation across service endpoints to prevent cascading failures.
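To make the two ideas in the PIR concrete, here is a minimal sketch of a version fallback exercised by an injected fault. Everything in it is hypothetical: `AuthError`, `authenticate_with_fallback`, `inject_fault` and the two "versions" are invented names for illustration, and the PIR does not describe how Microsoft actually implements either technique.

```python
from typing import Callable

AuthHandler = Callable[[str], str]


class AuthError(Exception):
    """Raised when a (hypothetical) authentication handler fails."""


def authenticate_with_fallback(user: str, latest: AuthHandler, previous: AuthHandler) -> str:
    """Try the latest handler first; fall back to the previous one on failure."""
    try:
        return latest(user)
    except AuthError:
        return previous(user)


def inject_fault(handler: AuthHandler, should_fail: Callable[[str], bool]) -> AuthHandler:
    """Wrap a handler so that it fails whenever should_fail(user) is true."""
    def wrapped(user: str) -> str:
        if should_fail(user):
            raise AuthError(f"injected fault for {user}")
        return handler(user)
    return wrapped


# Stand-ins for two versions of an authentication service (assumptions).
def latest_version(user: str) -> str:
    return f"token-v2:{user}"


def previous_version(user: str) -> str:
    return f"token-v1:{user}"


# Fault injection: force the latest version to fail for every request.
faulty_latest = inject_fault(latest_version, lambda user: True)

print(authenticate_with_fallback("alice", latest_version, previous_version))
print(authenticate_with_fallback("alice", faulty_latest, previous_version))
```

The point of injecting the fault deliberately is that the fallback path gets tested before a real failure depends on it.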
Erosion of resources to handle client authentication requests also featured in the June 2014 incident, so it’s natural to be surprised to see much the same issue reoccur. In fact, that incident had a much worse impact than the most recent problem as the authentication problem in June 2014 caused massive backlogs of mail to build up because email could not be delivered. This didn’t happen last week as mail flow continued throughout.
Interestingly, Outlook and ActiveSync clients continued to connect to Exchange Online during the outage as clients were able to use cached credentials with basic authentication, including accounts configured to use multi-factor authentication. According to the PIR, 1% of Outlook and 35% of Outlook Web App requests were impacted during the incident. This isn't too surprising when you consider that clients like OWA need to authenticate to secure credentials and so would be more impacted because their authentication requests failed. The reasons for the different client behavior are another topic that deserves to be investigated in the post-incident report.
Many Office 365 administrators of the tenants that were affected used Twitter and other social media to voice their annoyance about the lack of information that appeared in the service health dashboard (SHD) for their tenants during the outage. Based on what appeared on dashboards, all seemed well in the world of Office 365 and no indication was given that the problem users were obviously experiencing was due to Azure Active Directory. More information was available through the Azure status dashboard, but the lack of integration between the two dashboards left much to be desired. Unfortunately, the Office 365 SHD depends on the same authentication path as web clients do and was therefore affected by the outage. As a result, the status for services like Exchange, SharePoint, Yammer, and Skype for Business could not be updated to reflect reality.
In fact, Microsoft anticipated that an outage might cause problems with the Office 365 SHD and built the “Emergency Broadcast System” (EBS) that runs on a different infrastructure and is accessible through status.office.com. The idea is that EBS should switch in automatically if SHD experiences problems, especially in situations where the authentication layer fails, which is what happened on December 3. However, it seems like another bug in Microsoft's Content Delivery Network (CDN) kicked in to prevent European tenants seeing the EBS data whereas tenants in other regions did. According to the PIR, Microsoft is addressing this issue with a planned completion date later this month.
Partners who support Office 365 tenants also complained that they couldn’t get information from Microsoft support and some were unable to log support tickets for the outage, probably due to the demand placed on the support organization as the problem evolved.
Although it really wasn’t the case, it is natural that customers and commentators alike regarded this outage as an Office 365 problem. In fact, the outage affected many other services including the Azure Management Portal, Dynamics CRM, the Azure Data Catalog and Operational Insights portal, Stream Analytics, Remote App, Visual Studio Team Services, and SQL Database. That’s a fair chunk of Microsoft’s complete cloud portfolio that became inaccessible across Europe.
Office 365 depends on Azure Active Directory, but this was not a failure of the Office 365 infrastructure. It’s similar to the way that Outlook is often blamed when things go wrong with either Exchange Online or on-premises Exchange. Users see what is in front of them rather than all the moving parts to the rear.
A system is only as robust as its weakest part. Complex cloud services such as Office 365 depend on many different components, the most important of which is Azure Active Directory. Largely through automated management and by deploying a lot of redundancy across multiple datacenters, Microsoft does an excellent job of keeping Office 365 online, but when failures happen, the scale that the infrastructure now operates at ensures that many companies are affected.
Fortunately, the experience since Office 365 was launched in June 2011 is that relatively few large outages have happened, something that indicates the high level of operational maturity that exists within Office 365. Although this outage hurt, the overall record is actually pretty impressive given the growth in the load put on Office 365 to the current level where over 60 million active users connect monthly.
No doubt, Microsoft will point to the 99.9% financially-backed SLA for Office 365 and the success they have achieved against this standard over the last four years. It’s true that the Office 365 SLA has been measured at between 99.95% and 99.99% over the last six quarters, but as I have been saying for some time, the sheer size of Office 365 now makes it very difficult for even a big outage to move the SLA needle.
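Some rough arithmetic shows why. The sketch below assumes that availability is weighted by the share of users affected (a user-minute style calculation) and measured over a 92-day quarter; the affected fractions are invented for illustration, and this is not Microsoft's published SLA formula or its actual figures. Only the outage window (08:59 to 13:15 UTC) comes from the reports above.

```python
def availability_pct(outage_minutes: float, period_days: int, affected_fraction: float) -> float:
    """Availability over a period, weighting downtime by the share of users affected."""
    total_minutes = period_days * 24 * 60
    lost_minutes = outage_minutes * affected_fraction
    return 100.0 * (1 - lost_minutes / total_minutes)


# Outage window reported above: 08:59 to 13:15 UTC.
OUTAGE_MINUTES = (13 * 60 + 15) - (8 * 60 + 59)

# Assumption: measure over the October to December quarter.
QUARTER_DAYS = 92

# Hypothetical shares of the worldwide user base hit by the outage.
for fraction in (1.0, 0.25, 0.05):
    pct = availability_pct(OUTAGE_MINUTES, QUARTER_DAYS, fraction)
    print(f"{fraction:>5.0%} of users affected -> {pct:.3f}% availability")
```

Under these assumptions, the quarterly figure only drops below 99.9% if every user worldwide is affected for the whole window. An outage confined to one region barely registers, which is exactly why the headline number says little about the experience of the tenants who were hit.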
Apart from the woes of Azure Active Directory, the inability of the Office 365 SHD to keep tenants informed during a major outage demonstrates that monitoring for the entire ecosystem that surrounds Office 365 is not at the level that it should be.
SHD does an acceptable job of letting tenant administrators know what’s happening when Office 365 problems are recognized and are being worked on by Microsoft. However, it’s not so good for tracking fast-developing problems in areas outside the strict boundaries of Office 365, such as those involving Azure Active Directory or components involved in directory synchronization and single sign-on for hybrid environments.
It’s clear that Microsoft could improve how they communicate outages across the entire spectrum of their cloud services to deliver a complete picture to tenants. This is especially so as more major enterprises like ABB, who recently moved 125,000 users from Lotus Notes to Office 365, are embracing the cloud. In the interim, the probes and synthetic transactions used by monitoring solutions like ENow Software’s Mailscape 365 or Exoprise CloudReady are able to detect Office 365 outages and highlight them early. Office365Mon also monitors the service and allows Office 365 customers to collect and measure the SLA for their tenant.
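For readers unfamiliar with the idea, a synthetic transaction is simply a scripted operation (a sign-in, say) that is run on a schedule and timed, with an alert raised when it fails repeatedly. The following is a minimal sketch of that pattern only. It does not reflect how Mailscape 365, CloudReady, or Office365Mon are implemented; the class names, the threshold of three consecutive failures, and the stand-in sign-in functions are all assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProbeResult:
    ok: bool
    seconds: float
    error: str = ""


def run_probe(transaction: Callable[[], None]) -> ProbeResult:
    """Run one synthetic transaction and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        transaction()
        return ProbeResult(True, time.monotonic() - start)
    except Exception as exc:  # any failure counts against the service
        return ProbeResult(False, time.monotonic() - start, repr(exc))


@dataclass
class OutageDetector:
    threshold: int = 3  # consecutive failures before raising the alarm (assumed)
    history: List[ProbeResult] = field(default_factory=list)

    def record(self, result: ProbeResult) -> bool:
        """Store a result and return True if an outage is now suspected."""
        self.history.append(result)
        recent = self.history[-self.threshold:]
        return len(recent) == self.threshold and not any(r.ok for r in recent)


# Stand-ins for a real sign-in attempt against the service (hypothetical).
def healthy_sign_in() -> None:
    return None


def failing_sign_in() -> None:
    raise ConnectionError("authentication endpoint unavailable")


detector = OutageDetector(threshold=3)
for transaction in (healthy_sign_in, failing_sign_in, failing_sign_in, failing_sign_in):
    suspected = detector.record(run_probe(transaction))
    print(f"{transaction.__name__}: outage suspected = {suspected}")
```

Because the probe exercises the same sign-in path that users do, it can notice an authentication failure even when the vendor's own dashboard still shows everything as healthy.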
Of course, the other way to find out what’s happening when things are going south is to constantly monitor the #Office365 hashtag on Twitter, but that can rapidly become boring. Microsoft needs to harden Azure Active Directory so it doesn’t become the Achilles heel for Office 365 and its other cloud services. And a better dashboard wouldn’t go amiss either.
Follow Tony @12Knocksinna