How small glitches can cause big problems for complex cloud infrastructures

Last month, I discussed whether a problem that caused Office 365 users to be unable to authenticate provided any indication that Azure Active Directory (AAD) was proving to be an Achilles Heel for Microsoft’s cloud services. That outage affected users in Western Europe on December 3 and underlined the dependency that Office 365 has on other parts of Microsoft's cloud infrastructure.

On December 18, another issue surfaced that affected some Office 365 and Azure customers in Western Europe. The natural reaction of those who couldn’t connect to Office 365 was that the new problem was a rerun of the previous event and that feeling was duly picked up by press commentary at the time. However, as it turns out, the root cause had absolutely nothing to do with AAD.

The December 18 incident was reasonably short-lived at 140 minutes (from 9:15AM UTC to 11:25AM UTC – the earlier incident lasted 316 minutes). All incidents are painful for those who are unable to work while a resolution is determined, but as I discuss later on, detecting and fixing a problem’s root cause within a complex infrastructure can take some time.

Both outages occurred during the morning peak in Western Europe. The biggest and most important difference between the two is how they compromised the ability of end users to work. The December 3 issue prevented many more end users from being able to access Office 365. On December 18, the effect was really only felt by those who wanted to log on to the Office 365 portal (portal.office.com) to perform administrative tasks.

I asked Microsoft about the incident and received the Post-Incident Report (PIR reference MO36910 dated 30 December). If you’re a tenant that might have been affected by the incident, you can get a copy of the PIR through the 30-day history section of the Office 365 Service Health Dashboard (SHD). The PIR cites the impact on users as:

“Affected users and administrators were unable to sign in to the Office 365 portal via office.com or portal.office.com. End-user access to Outlook on the web, SharePoint Online, OneDrive for Business, and other Office 365 services was not affected by this event; however, affected users would have been unable to navigate to those services through the Office 365 portal.”

The point made here is that the majority of end users continued to work as normal because they use clients (like Outlook or a mobile device) that don’t go anywhere near the portal, or they use bookmarks (such as a direct URL) to access web-based apps like Delve or the Office 365 Video Portal.

For example, if you type outlook.office365.com, Outlook Web App starts without any need to go near portal.office.com. I can’t think of why an end user would want to connect to the Office 365 portal en route to an application like those mentioned above, but I guess it’s possible and some do. In any case, because administrators were affected, the problem was noticed quickly and reports flowed into Microsoft to ask what was happening.

The PIR goes on to describe the root cause as:

“A code issue with a network interface driver caused intermittent packet loss to occur under certain conditions. As load increased, this resulted in latency and some loss of availability for infrastructure that hosts Office 365 portal services.”

Microsoft says that the incident was isolated to a portion of the Azure storage infrastructure used by Office 365 for its portal services. A networking glitch caused packets to drop and caused part of the Azure Storage service to be unable to respond to some requests. As soon as the issue was identified, portal traffic was rerouted elsewhere to resolve the issue. The packet loss resulted in some performance issues for administrators too.

So what really happened is that neither AAD nor Office 365 caused the problem, but connections to Office 365 were affected by the packet loss in a component belonging to another part of Microsoft’s overall cloud infrastructure.

Another issue revealed by the PIR is an admission that the Office 365 SHD was slow to report the problem. Customers reported the problem before Microsoft engineers determined that something was wrong and should therefore be flagged through the SHD. Microsoft is reviewing the Office 365 monitoring infrastructure to improve its ability to pick up future problems of this nature.

As noted in my previous post, the SHD does a less than acceptable job of advising tenant administrators when something has gone wrong. Microsoft has accepted that fact. It’s inevitable that time will elapse between a problem showing up and an investigation concluding that some engineering work is required to address it. You don’t want a lot of noise showing up in the SHD caused by events that might or might not be real problems. Even the most skilled troubleshooters have to follow a certain routine to review symptoms and other data before they can figure out where the root cause might lie. However, it seems that Microsoft could be faster to let people know what’s going on.

Perhaps it’s a natural unwillingness to tarnish the good name of their cloud services that makes Microsoft slow to acknowledge problems. You can defend this stance because some problems are intermittent, are specific to a certain tenant, disappear on their own, or are in the process of being fixed. Listing every incident on the SHD without applying a qualitative filter might induce a certain panic in more nervous administrators.

Of course, tenant administrators can deploy monitoring software to measure the real-time service delivered to their users. These solutions won’t show what Microsoft is doing within their datacenters, but they at least tell you when something is not working as it should so that users (or even administrators) can be calmed and the issue reported. That’s better than floundering in a state of ignorance, which seems to be the modus operandi of the SHD at times.
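
To give a flavor of what such monitoring does at its simplest, here is a minimal sketch in Python of a probe that times a request to a service URL and flags errors or slow responses. It is an illustration only, not how any commercial product works: the names (ProbeResult, http_fetch, probe, check_all) and the thresholds are my own assumptions, and an unauthenticated request like this only shows that an endpoint answers. It cannot prove that a user could actually sign in, which is what failed on December 18, so a real solution would run a synthetic sign-in as well.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    url: str
    ok: bool
    latency_s: float
    detail: str


def http_fetch(url, timeout_s):
    """Issue a GET request and return the final HTTP status code."""
    # urlopen follows redirects and raises an exception for 4xx/5xx
    # responses, timeouts, and DNS or connection failures.
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(url, fetch=http_fetch, timeout_s=10.0, slow_after_s=3.0):
    """Time one request; report failure on any error or a slow response.

    The thresholds are illustrative assumptions, not recommended values.
    """
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:
        return ProbeResult(url, False, time.monotonic() - start, f"error: {exc}")
    latency = time.monotonic() - start
    if latency > slow_after_s:
        return ProbeResult(url, False, latency, f"slow response (HTTP {status})")
    return ProbeResult(url, True, latency, f"HTTP {status}")


def check_all(urls=("https://portal.office.com", "https://outlook.office365.com")):
    """Probe each URL once and return the list of results."""
    return [probe(url) for url in urls]
```

Running check_all() every few minutes from a scheduled task and alerting on any result where ok is False would tell you that the portal was unreachable or sluggish while Outlook Web App still answered, which is roughly the pattern seen on December 18.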

Follow Tony @12Knocksinna
