If you're a VS Online subscriber, you might have experienced a 90-minute outage on Friday, July 18th. It took a few days, but Microsoft has now provided the story behind the outage.
In essence, a slow SQL Azure database caused threads to pile up in the SPS thread pool. The ultimate fix was, you guessed it, essentially a reboot. More precisely, Microsoft had to manually disable the connection to the Azure Service Bus until the SPS thread pool cleared. The full, complicated description of the event is in a post by Brian Harry: Explanation of July 18th outage. The bottom line, though, is that a problem in the most critical service caused all services to shut down.
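To make the failure mode concrete, here is a minimal Python sketch of how slow calls can tie up every worker in a fixed-size thread pool so that unrelated requests queue behind them. It is an illustration only, not Microsoft's SPS code: the pool size, `simulate_pile_up`, `slow_database_call`, and `unrelated_request` are all hypothetical names chosen for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PoolTimeout


def simulate_pile_up(pool_size=2):
    """Return (blocked_while_slow, result_after_release) for a toy pool."""
    # Stands in for the slow database finally responding (hypothetical).
    database_recovered = threading.Event()

    def slow_database_call():
        # Holds a worker thread until the "database" answers.
        database_recovered.wait()
        return "db done"

    def unrelated_request():
        # Needs no database at all, but still needs a free worker.
        return "ok"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Fill every worker with a call that is stuck on the slow dependency.
        for _ in range(pool_size):
            pool.submit(slow_database_call)

        # This request is healthy, yet it can only wait in the queue.
        fast = pool.submit(unrelated_request)
        try:
            fast.result(timeout=0.2)
            blocked = False
        except PoolTimeout:
            blocked = True

        # Once the slow calls drain, the queued request finally runs.
        database_recovered.set()
        result = fast.result(timeout=5)

    return blocked, result
```

In this toy version the healthy request stays stuck until the slow calls release their workers, which is roughly the shape of the pile-up described above: one slow dependency ends up stalling work that never touches it.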
In the post, Microsoft seems to be taking a different tack than in previous Azure-related outages. As the company continues to grow its Azure services, there are many lessons to be learned. Imagine a software company being thrown into having to do the work of IT. IT pros have years of experience handling outages and following hard-won best practices for operations, and many of those operations are new to Microsoft.
The post also offers a humble apology.
Brian leaves it with:
I’m sorry for the interruption we caused. I can’t promise it won’t happen again, *but* after a few more weeks (for us to implement some of these defenses), it won’t happen again for these reasons.
So, Microsoft is learning, which is a good thing. Let's just hope the education process doesn't involve too many more outages.