Office 365 Engineering Lead Apologizes, Details Outage

The communication came, but it was way too little and way too late.

Technology doesn't always work, or at least work in the ways we expect. Outages are just something we deal with almost constantly. Outages give us humans something to complain about to help us assign our frustrations with a multitude of other life experiences to something, or someone, else. In a sense, we need outages. But, even during those outages we can generally expect communication of some sort from the provider. For example, when the power goes out due to a storm, there's a phone number to call with an automated response that gives us an estimated time for the electricity to be restored. It may not be entirely accurate, but having that lifeline makes us feel like we've done our part. It makes us feel like we're part of the situation – particularly when we're paying for the service. It gives us a sense that we have a small bit of control, even though we don't.

And that's what makes last week's Lync and Exchange outages extra frustrating. It wasn't exactly that the services were unavailable. Yes, that's frustrating enough. But it was more about the lack of proper communication on Microsoft's part. Microsoft's online Office service has led something of a charmed life since its inception. There have been outages, but those were early on, when its customer base was much smaller. More and more business customers rely on the service now, thanks to extraordinary features, continued development, product integration, and marketing. Microsoft has had enough time to work out communication policies.

If IT Pros were to use the same communication tactics, they'd be out on the street. In "Hello, Microsoft. Welcome to IT," I talked about why I think Microsoft may simply not be ready to be our IT, and the communication issue hammers that home.

On Thursday of last week, Rajesh Jha, Corporate Vice President, Office 365 Engineering, took to the Office 365 community blog to apologize and detail the outage.

The Lync Online issue was explained as a brief outage due to external network failures. Although the outage was brief and Microsoft restored connectivity in a few minutes, Rajesh stated that a traffic spike overloaded the network for hours.

The Exchange Online issue was a much bigger matter. However, Rajesh maintains that the outage affected only a small set of customers. A directory partition stopped responding to authentication requests, which revealed a flaw in the code.

Rajesh goes on to say…

"While we have fixed the root causes of the issues, we will learn from this experience and continue improving our proactive monitoring, prevention, recovery and defense in depth systems."

And yet, no mention of improving communication. I'm sure Rajesh could argue that a formal post two days after the resolution is pretty quick, but again, if IT couldn't provide an interim solution and just barricaded itself in the datacenter until the fix was complete, management would be knocking down the doors. The Cloud, I guess, has no doors.

What do you think? Was the communication timely? Was it enough?
