I recently read a report about Office 365 that contained the following statement:
“Despite the rhetoric about stability, reliability, and high availability with cloud services, Office 365 still suffers from periodic downtime. Whether it is due to a patch that hasn’t been fully tested, an overlooked configuration requirement, a component failure, or a networking issue, Office 365 does not provide 100% service availability.”
Let’s be blunt here. As long as we depend on the Internet to connect users to cloud services, it is unreasonable and impractical to expect that any cloud service will deliver 100% availability. I imagine it’s for this reason that Microsoft’s financially backed SLA is actually 99.9%, a fact overlooked by this commentator.
Exploring this point further: on the one hand, Microsoft claims that its SLA record for Office 365 is very good indeed (99.99% in Q1 CY15 and 99.95% in Q2 CY15); on the other, tenants regularly protest about the service interruptions and problems they have experienced. Much the same debate happens with on-premises IT, where reliability and service quality often look very different when seen through the lenses of the IT department and its users.
I’ve already referred to the Internet as a major factor in cloud service quality. In fact, Microsoft is doing its best to take this out of the equation as far as is possible by deploying a network of local connection points around the world so that users don’t have to make an extended multi-hop connection to the Office 365 datacenter that holds their account. The idea is that the local connection points will route traffic to Microsoft as quickly as possible. It’s not a new idea as dial-up services like CompuServe and America Online took much the same approach with local “points of presence” in the modem era.
Larger organizations can also use the Azure ExpressRoute offering to connect their WAN with Microsoft’s network. The dedicated links required by the connection make this a relatively expensive solution, but ExpressRoute has the great benefit of making the Office 365 network appear to be an extension of the corporate WAN.
All of which is fine, but Microsoft already excludes Internet glitches from its SLA calculations. The Service Level Agreement for Microsoft Online Services says that service levels do not apply to performance or availability issues “due to factors outside our reasonable control… or a network or device failure external to our data centers, including at your site or between your site and our data center.” This is perfectly fair because not even Microsoft can be expected to control the Internet. The same document also excludes local conditions, “including, but not limited to, issues resulting from inadequate bandwidth.” In other words, if you use a crappy Wi-Fi network, don’t expect sparkling performance from Office 365.
For Exchange Online, downtime is defined as “Any period of time when users are unable to send or receive email with Outlook Web Access.” (yes, it will take time for Microsoft Legal to get with the program and use the new name for OWA) and the SLA equation is:
Monthly Uptime Percentage = (User Minutes - Downtime Minutes) / User Minutes x 100
Seems fair. But then you might ask why so many people seem to report issues with Office 365 while Microsoft continues to clock up really good SLA performance. The answer lies in the swelling numbers of people who use Office 365. Simply put, the more people who use Office 365, the less effect any individual outage has on the overall SLA performance.
Take an example loosely based on the outage experienced by a number of Exchange Online users in North America in June 2014. Let’s imagine that 2 million users were affected and that the outage lasted nearly 11 hours (about 655 minutes). This sounds horrible and it’s certainly not a great experience for anyone affected by the outage, but the mathematical fact is that the outage barely budged the Office 365 SLA needle. The massive number of user minutes lost (1,310.4 million) represents 0.01% of the available user minutes in a quarter (13,104,000 million), assuming that 100 million or so subscribers access Office 365.
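To make the arithmetic concrete, here is a rough Python sketch of the calculation. It assumes a 91-day quarter (which is what the 13,104,000 million figure implies), an outage of 655.2 minutes, and a function name, `uptime_percentage`, that is mine and not anything Microsoft publishes.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_IN_QUARTER = 91  # assumption: a 13-week quarter


def uptime_percentage(total_users, affected_users, outage_minutes,
                      days=DAYS_IN_QUARTER):
    """Uptime as (user minutes - downtime minutes) / user minutes * 100."""
    user_minutes = total_users * days * MINUTES_PER_DAY
    downtime_minutes = affected_users * outage_minutes
    return (user_minutes - downtime_minutes) / user_minutes * 100


# 2 million of 100 million users lose 655.2 minutes each in one quarter.
pct = uptime_percentage(100_000_000, 2_000_000, 655.2)
print(f"{pct:.2f}%")  # prints 99.99%
```

In other words, an outage that ruins the day for 2 million people still leaves the service-wide number at 99.99% for the quarter.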
Office 365 achieved SLA performance of 99.95% in the second quarter of 2015. This represents the loss of 4,550,000 days across 100 million users for the quarter, or just over an hour for every single user. There were some multi-hour outages in that mix, mostly in North America, but still you can see how the large number of users allows Office 365 a lot of latitude in terms of smoothing results out across the entire service.
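Going the other way, from a reported SLA percentage back to lost time, is just as simple. This sketch again assumes a 91-day quarter and 100 million users; `lost_time` is a hypothetical helper name used only for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def lost_time(sla_pct, users, days=91):
    """Return (minutes lost per user, total user days lost) for an SLA figure."""
    minutes_per_user = days * MINUTES_PER_DAY * (100 - sla_pct) / 100
    total_days = minutes_per_user * users / MINUTES_PER_DAY
    return minutes_per_user, total_days


per_user, total_days = lost_time(99.95, 100_000_000)
print(f"{per_user:.1f} minutes per user")   # 65.5 minutes per user
print(f"{total_days:,.0f} user days lost")  # 4,550,000 user days lost
```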
As time goes by and the number of people using Office 365 grows, the effect of any individual outage will continue to decrease. Whereas an outage of 655 minutes for 2 million users is required to reduce the quarterly SLA by 0.01% for a service with 100 million users, an outage of about 983 minutes is required to have the same influence over a service with 150 million users. Given its current trajectory, Office 365 is likely to be at that point sometime in 2016.
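The scaling effect can be sketched by turning the equation around and asking how long an outage must last to knock a given amount off the quarterly number. As before, the 91-day quarter and the name `outage_minutes_for_dent` are my assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def outage_minutes_for_dent(total_users, affected_users, dent_pct, days=91):
    """Minutes each affected user must be down to cut uptime by dent_pct points."""
    user_minutes = total_users * days * MINUTES_PER_DAY
    return user_minutes * dent_pct / 100 / affected_users


for total in (100_000_000, 150_000_000):
    minutes = outage_minutes_for_dent(total, 2_000_000, 0.01)
    print(f"{total:,} users: {minutes:.1f} minutes")
# 100,000,000 users: 655.2 minutes
# 150,000,000 users: 982.8 minutes
```

The required outage length grows in direct proportion to the size of the user base, which is why the same incident matters less to the headline number every quarter.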
All of which means that Microsoft is likely to be able to report good SLA numbers for Office 365 into the future. Individual tenants will vary from unhappy (if they have outages) to happy (if they don’t), but might find it difficult to put their experience in context with what Microsoft reports. Monitoring products like Office365Mon help by providing tenant-level data on outages, but even so, there might just be enough flexibility in Microsoft’s Service Level Agreement document to argue away a pile of sins.
Follow Tony @12Knocksinna