Impressive Office 365 uptime data means more pressure on on-premises IT managers

I guess that we should all be dazzled by Microsoft’s proclamation about “Cloud Services that you can Trust: Office 365 availability”. The numbers for availability look good and the statement lists some impressive commitments to continued performance, all of which is good news. Then again, the suspicion might arise that Microsoft is hinting that some cloud services exist that shouldn’t really be trusted, which quite takes the gloss off the whole thing. Perhaps they are referring to the cloud services that have been penetrated by PRISM, something that Microsoft’s General Counsel Brad Smith emphatically denies in relation to their “enterprise email and document storage”, which I assume to refer to Office 365.

But in any case, the news reported in the blog post is good for Office 365 (which is of course the reason why the post is there) because it describes the performance against SLA in the form of uptime numbers for Office 365 in the last four quarters:

July 2012	October 2012	January 2013	April 2013
99.98%	99.97%	99.94%	99.97%

These are very impressive numbers that will, no doubt, be compared to the data for Google’s competing cloud suite. As always with numbers, you have to be sure that you compare like with like. For instance, since January 2011, Google does not include scheduled downtime in its SLA calculation. Microsoft promises an SLA of 99.9% and says that services like Exchange Online or SharePoint Online have no scheduled downtime, so any faltering of these services immediately impacts their numbers.
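To put those percentages into more tangible terms, here is a rough sketch that converts the reported uptime figures into implied minutes of downtime and sets them against the 99.9% SLA. The 91-day quarter and the function name downtime_minutes are my own assumptions for illustration; Microsoft's actual measurement periods and method may differ.

```python
# Assumption: every quarter is treated as 91 days long. Real quarters
# differ by a day or two, so the results are approximations.
MINUTES_PER_QUARTER = 91 * 24 * 60

# Uptime figures from the table above, keyed by the labels used there.
reported_uptime = {
    "July 2012": 99.98,
    "October 2012": 99.97,
    "January 2013": 99.94,
    "April 2013": 99.97,
}

SLA_TARGET = 99.9  # Microsoft's promised SLA, in percent


def downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_QUARTER):
    """Minutes of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    for label, uptime in reported_uptime.items():
        minutes = downtime_minutes(uptime)
        print(f"{label}: {uptime}% uptime is roughly {minutes:.0f} minutes down")

    budget = downtime_minutes(SLA_TARGET)
    print(f"A {SLA_TARGET}% SLA allows roughly {budget:.0f} minutes per quarter")
```

Under that assumption, a 99.9% SLA leaves room for roughly 131 minutes of downtime per quarter, and even the weakest reported quarter (99.94%) works out to about 79 minutes.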

At first glance, Google’s SLA definition is much simpler than Microsoft’s, perhaps evidence that corporate lawyers have more influence over Office 365 contracts than their Google counterparts, but more likely reflecting the more complex nature (in a good sense) of Microsoft's offerings. I was surprised that Google's SLA doesn't cover Postini, as message hygiene filtering is a pretty fundamental part of an enterprise email system.

Gmail promises 99% uptime and trumpeted its achievement of 99.984% availability in 2011. However, Google's definition of SLA measurement contains an odd qualification: "'Downtime' means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate". This could be construed as a get-out clause to allow Google to avoid accruing downtime if fewer than five percent of its users are affected. Five percent seems small but it can be a pretty large number in cloud terms. For instance, if four percent of Gmail's users were having a problem, then Google would register no downtime even though some 17 million users were affected (taking 4% of the 425 million Gmail users reported in 2012). How odd!
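The arithmetic behind that example is easy to check. The sketch below is only my reading of the quoted clause: it treats the error rate as the fraction of users affected, which is a simplification because Google says downtime is measured on server-side error rate. The function names are hypothetical.

```python
GMAIL_USERS = 425_000_000    # Gmail users reported in 2012, as cited above
DOWNTIME_THRESHOLD = 0.05    # "more than a five percent user error rate"


def counts_as_downtime(error_rate, threshold=DOWNTIME_THRESHOLD):
    """True only when the error rate is strictly above the threshold."""
    return error_rate > threshold


def affected_users(error_rate, total_users=GMAIL_USERS):
    """Approximate number of users affected at a given error rate."""
    return round(error_rate * total_users)


if __name__ == "__main__":
    for rate in (0.04, 0.05, 0.06):
        print(
            f"{rate:.0%} error rate: about {affected_users(rate):,} users affected, "
            f"counts as downtime: {counts_as_downtime(rate)}"
        )
```

Read this way, an error rate of exactly five percent still would not count, because the clause says "more than".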

Google has been quieter about its SLA data recently, perhaps because of some recent problems such as the 40-minute outage on July 10. Perhaps it’s my inability to use Internet search tools that let me down, but I wasn’t able to track down any reports of Google performance against SLA more recent than 2011.

If only because we don’t have all the data necessary to make an apples-to-apples comparison, understanding the finer points of SLA measurement and reporting for Office 365 and Google Apps can be complicated. There’s also a big difference between a problem in a cloud datacenter that absolutely will affect the SLA and problems that arise from Internet or local network connectivity that prevent users getting to a cloud service. These problems do not count when SLAs are measured, even if the user perception is that the “service is down”. Cloud vendors cannot be blamed for excluding the Internet from their calculations as they exert no control outside the boundaries of their datacenters.

The complexity of SLA calculations doesn’t take away from the fact that the fears that many had that cloud services would be unreliable have proven to be unfounded. I’ve used Office 365 since its official launch in June 2011 and apart from some initial hiccups it’s been as reliable as I could have wished. I don’t use Gmail as much as I used to, but my perception is that it’s a reliable service too. Don't get me wrong - cloud outages happen all the time. For example, last week Outlook.com experienced problems for seven hours while Google was down for a few minutes. When cloud outages occur, they tend to affect millions of people and are awfully public. By comparison, an IT outage inside the boundaries of a single company affects just that company's users and is hardly ever revealed outside.

Given that the cloud services have delivered excellent performance against their published SLAs, the biggest problem that has arisen is the pressure created on on-premises IT managers to deliver the same kind of reliable and robust services from in-house systems. Sure, they don’t have the kind of resources that Microsoft or Google dedicate to their datacenters and a lot more of their work is likely to be manual instead of the automated processes used to deliver cloud services (automation is fundamentally necessary to achieve the economics of cloud services). However, these facts are unlikely to be given much importance when weighed by CIOs who compare the costs of in-house delivery against those for cloud services. Lower cost and better performance is a difficult duo to argue against – and that’s why successful cloud services create problems for IT managers.

Follow Tony @12Knocksinna
