On Tuesday of this week, Azure suffered a severe outage. Unless you were working after hours, your users slammed the HelpDesk with calls, or you were using some of the affected Azure services at the time, you might not have noticed. Interestingly enough, I took notice because of a seemingly unrelated issue – my Xbox Live pinned apps had disappeared. I immediately took to the Internet to see if there were reported outages for Xbox Live, or if I might possibly be experiencing some common problem I had never seen before.
I dropped out to the Xbox Live site and all was well – at least that's what the service status showed me. But then I started to see reports on Twitter and other places that Azure was down for many, particularly Azure Storage, Virtual Machines, Visual Studio Online, Azure Websites, Search, and miscellaneous other Microsoft services. However, when I jumped out to the Azure Service status pages, they showed that Azure was fine, just like the Xbox Live site did.
We hear now, through an apology post by Jason Zander, CVP from the Microsoft Azure Team, that the cause of the outages was a "performance" update for Azure Storage Services. There are a couple of threads in the apology and explanation that most of us in IT are familiar with, and they show that the software company is still learning what it means to be IT:
- Service health reporting didn't work.
- The update had been tested for several weeks, but bombed in production.
- The test group apparently didn't match the production environment closely enough.
- The update was deployed en masse instead of incrementally.
- Alternate communication methods to notify users were used only after the incident instead of being part of the process.
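For anyone who wants to picture what "incrementally" looks like in practice, here is a minimal sketch of a staged rollout that checks health after each batch and sends a notification as part of the process rather than after the fact. This is not Microsoft's deployment tooling; the function names (`staged_rollout`, `apply_update`, `is_healthy`, `notify`) and the host names are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    batches: Sequence[Sequence[str]],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: health probe for one host
    notify: Callable[[str], None],         # hypothetical: out-of-band notification channel
) -> List[List[str]]:
    """Apply an update one batch at a time and stop at the first unhealthy batch.

    Returns the batches that were updated and passed their health check.
    """
    completed: List[List[str]] = []
    for index, batch in enumerate(batches, start=1):
        for host in batch:
            apply_update(host)
        # Gate the next batch on the health of the one just updated.
        if not all(is_healthy(host) for host in batch):
            notify(f"Rollout halted at batch {index}; remaining batches were not touched")
            return completed
        completed.append(list(batch))
    notify("Rollout completed on all batches")
    return completed


# Illustrative run: the second batch contains a host that fails its health check.
updated: List[str] = []
messages: List[str] = []
result = staged_rollout(
    batches=[["canary-1"], ["prod-1", "prod-2"], ["prod-3", "prod-4"]],
    apply_update=updated.append,
    is_healthy=lambda host: host != "prod-2",
    notify=messages.append,
)
```

In this run the third batch is never touched, which is the whole point: a bad update stops at a small slice of the environment instead of landing everywhere at once.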
Microsoft's apology is accepted (you can't trust technology 100%), and it highlights a few areas that seem to be common ground for anyone dealing with technology. As IT Pros, we've all experienced the same scenario and had to issue the same apology emails. Microsoft is learning how to be IT and taking the hard knocks, just like we've all had to do over the years. The difference is that when it happens to us, it affects only those in a single company. When it happens to Microsoft, it affects all companies that rely on a single service provider.
The other angle here, and the one I'm sure most patching IT Pros will enjoy, is that Microsoft got bitten by one of its own updates. Microsoft's updates over the past few years, and even more so the past few months, have lacked a certain, say, quality about them. Each month, it's just expected that one or more released updates will bomb and have to be fixed and rereleased. I hear rumors that Vegas bookies are now interested in Patch Tuesday (I kid, I kid). But it's a bit ironic (though not surprising) to hear that Microsoft was bitten by its own patch.