It hasn’t been a good week for Exchange Online. The team running “the service” has done pretty well since the flurry of problems that emerged following the formal launch of Office 365 last year and the uptime performance for Exchange Online has run neck-and-neck with Gmail in the race to prove that they can deliver the best overall SLA.
I haven’t seen recent data since Google claimed a 99.99%+ SLA for Gmail last year, but what’s sure now is that the recent outages have wrecked any chance that Microsoft has of claiming a 99.9% SLA for Exchange Online in 2012. The total downtime allowed to meet 99.9% is roughly 525 minutes annually (8 hours 45 minutes) and according to Corporate VP Rajesh Jha’s apology to customers, the outage on Thursday, November 8 lasted 8 hours and a minute while the outage on Tuesday, November 13 lasted 5 hours and 2 minutes. A total of 13 hours and 3 minutes (783 minutes) is not what the corporate scorecard required.
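For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes a 365-day year (2012 was a leap year, which adds a minute or so to the budget) and treats both outages as if every user was affected for the full duration, which was not the case. The function names are my own and not part of any Microsoft tooling.

```python
# Back-of-the-envelope SLA arithmetic for the two outages described above.
# Assumptions: a 365-day year, and every user affected for the whole of
# each outage (a worst case; not all users were hit).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period while still meeting the SLA."""
    return period_minutes * (100 - sla_percent) / 100


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability achieved over the period, given the total downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)


# Outage durations in minutes: 8h 1m and 5h 2m.
outages = [8 * 60 + 1, 5 * 60 + 2]
total_downtime = sum(outages)

budget = downtime_budget_minutes(99.9)
achieved = availability_percent(total_downtime)

print(f"Budget for 99.9%: {budget:.1f} minutes")
print(f"Total downtime:   {total_downtime} minutes")
print(f"Over budget by:   {total_downtime - budget:.1f} minutes")
print(f"Availability:     {achieved:.3f}%")
```

On these assumptions the two outages alone overshoot the annual budget by more than four hours and put the worst-case figure at roughly 99.85%, before counting any other downtime during the year.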
Great fun can be had with any statistics and I’m sure that arguments will be made that the SLA for Exchange Online isn’t really that bad because not every user of the service was affected. There’s some truth in this claim. I use Exchange Online to host my email domain and totally missed the outage because my email is hosted in Microsoft datacenters in EMEA that didn’t experience the same problems. Thus, so far I have enjoyed an Exchange Online SLA of 100% for 2012 while North American and South American customers might not use the word “enjoy” after the last week.
It’s also fair to say that even in the affected geographies some customers continued to use the service without a problem. Not everyone is treated equally when a cloud drips some rain. Quite how the math wizards will determine lost minutes and the effect on the reported SLA is beyond the simple brain of someone like me.
In any case, any discussion about the SLA delivered by a cloud service has to be framed in the context of whether an internal IT department could do any better. My theory is that most IT departments would struggle to achieve anything close to 99.9% SLA because they don’t have the same resources in technology and people that Microsoft and Google deploy to run their services.
Back to Rajesh’s blog, I note that once again Microsoft has acknowledged that they have not achieved the expected service level and will refund customers without question (would an internal IT department refund their customers if a service failed? I think not…). More importantly, he goes on to discuss why the problems happened.
The first outage happened when Microsoft struggled to cope when anti-virus engines identified a problem message. Various factors combined to create a message backlog that eventually delayed throughput. It took time for Microsoft to figure out what was going on and to then deploy an “interceptor” to eradicate the problem messages. All very understandable for anyone who has experienced the problems involved in dealing with viruses going back to the original outbreak of the “I Love You” virus.
The second outage happened when a combination of component failure, ongoing maintenance, and a heavy load caused by customer onboarding slowed the service to a point where some users lost access. The components were part of the network and the blog says that they did not flag the failure. I assume the failures threw an additional load onto the remaining components and slowed performance and this, coupled with ongoing maintenance in the datacenter, probably created peaks in load that resulted in users losing connectivity. Again, all very understandable when you’ve seen the effect of failures on active servers.
The customer onboarding remark was the most interesting to me. Onboarding means the activities required to migrate user mailboxes from other systems to Exchange Online. In processing terms, the majority of the work is performed by the Mailbox Replication Service (MRS) as it transfers the content of user mailboxes from the existing system to Exchange. Moving mailboxes around creates lots of work for Exchange databases and it’s easy to see how the load generated by moving mailboxes from customers might have been the tipping point that caused the service to degrade to unacceptable levels when it was already having to cope with component failure. Clearly Microsoft is a victim of its own success here – if customers didn’t want to be on Office 365, Microsoft wouldn't have to do all the onboarding and MRS wouldn’t have to transfer mailboxes.
I’m pretty sure that the Exchange Online team is figuring out the finer points of what went wrong and when, and more importantly how to make sure that similar problems don’t happen again. Topics such as better network monitoring, aligning datacenter maintenance so that it doesn’t happen when large numbers of mailboxes need to be transferred, and ensuring that sufficient resources exist to handle unexpected peaks in demand caused by users or failures will probably be on the list.
Microsoft’s chances of achieving their desired SLA might be gone for 2012. So be it. Chasing perfection is the enemy of progress, or so it’s said. As a consumer, I want to see progress in the services that I use. I believe that Exchange Online has progressed since its debut (any look back on the spotty record of its BPOS predecessor will testify to the progress). The question now is how well Exchange Online will perform in 2013 when Microsoft upgrades the service to use Exchange 2013.
Follow Tony @12Knocksinna