Six months of solid Office 365 performance (but...)

Following up on Microsoft's announcement yesterday describing their sheer delight in being able to drop prices for enterprise Office 365 plans by between 13% and 20%, I wanted to write an upbeat note to congratulate Office 365 for delivering six months of flawless service since their last major outage on September 8-9, 2011. Although six months is nothing really to boast about in an era when Service Level Agreements (SLAs) call for ongoing and consistent delivery of well over 99.9% availability, I think it’s still important to acknowledge progress and say that Office 365 now exhibits signs of a mature and reliable service in which customers can have confidence.
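To put that 99.9% figure in perspective, here is a back-of-the-envelope sketch of how much downtime such a target leaves room for. The function name and the 365-day year are my own simplifying assumptions, not anything taken from Microsoft's SLA terms, which define their own measurement periods.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignores leap days for simplicity


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in hours, implied by an availability target."""
    return (1 - availability_percent / 100) * period_hours


if __name__ == "__main__":
    # Downtime budget for a 99.9% target over a full year and over six months.
    print(round(allowed_downtime_hours(99.9), 2))
    print(round(allowed_downtime_hours(99.9, HOURS_PER_YEAR / 2), 2))
```

At 99.9%, the budget works out to a little under nine hours a year, so a single outage spanning a good part of a day is enough to use up the allowance.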

And then my confidence was shattered by the Microsoft Azure outage (or rather, a service disruption) that was apparently caused by a software problem dealing with leap years. I respect Bill Laing, Microsoft Corporate VP for Server and Cloud, and am sure that his blood pressure was raised when he was forced to report that the problem appeared to be due to a “time calculation that was incorrect for the leap year”, later detailed as "the leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year." Ahem!

Of course, leap years only occur every four years and February 29, 2012 was the first time that a Microsoft cloud service woke up to find that February 29 was now part of its reality, but wouldn’t you have thought that someone like a software architect might have taken leap years into account when Microsoft designed the service?
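For the curious, the kind of mistake described in that quote is easy to reproduce. What follows is a minimal sketch in Python, not Microsoft's actual code, which I haven't seen. The function names `naive_valid_to` and `safe_valid_to` are hypothetical, and falling back to February 28 is just one reasonable choice among several.

```python
from datetime import date


def naive_valid_to(today: date) -> date:
    """Add one to the year and keep the month and day unchanged."""
    # Raises ValueError when today is February 29, because the
    # following year has no February 29.
    return today.replace(year=today.year + 1)


def safe_valid_to(today: date) -> date:
    """Same idea, but fall back to February 28 when the target date does not exist."""
    try:
        return today.replace(year=today.year + 1)
    except ValueError:
        # Only February 29 can fail here, so clamp to the last day of February.
        return today.replace(year=today.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_valid_to(leap_day))
    try:
        naive_valid_to(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
```

The naive version works on 1,460 days out of every 1,461, which is exactly why this sort of bug can sit unnoticed until the leap day arrives.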

The net result was an outage over two days that affected users worldwide. Ouch! To their credit, Microsoft has published a full blow-by-blow description of how the problem arose and how it impacted the fabric of the Azure service. They've also offered affected customers a 33% refund of their February subscription charge.

My battered confidence in cloud services was then further assailed when Facebook suffered what seems to have been a DNS problem that affected services to European users over a two-hour period on March 7. I’m a European Facebook user but I must have slept through this outage as I didn’t know anything about it until I read the news. This must reflect the sad lack of activity within my Facebook account!

Neither Azure nor Facebook has anything whatsoever to do with Office 365, so you might well be wondering why I seek to associate a successful six-month operating period with no outages with some recent problems for other cloud services. Well, I think the two outages serve as salient reminders that one swallow doesn’t make a summer and that six months of flawless operation of any cloud service isn’t an indication that it’s time to discard every on-premises deployment of technology because the cloud is safe, secure, and ultra-reliable. Safe and secure: yes, at least in terms of data integrity. Ultra-reliable: maybe not quite yet.

I was also reminded quite bluntly at a recent technology conference that some companies still operate in Internet-deprived parts of the world, including in parts of the United States. If you’re in a situation where your connection to the Internet functions across two pieces of knotted twine capable of delivering a steady 32 kbps connection, it might not be time to plunge into the cloud. It came as quite a surprise to me to discover that such a situation is not unique and that people who work in these companies are pretty tired of hearing all the fuss and commotion that surrounds the cloud hoo-ha at conferences these days. I guess those of us who have decent Internet connectivity lose sight of the fact that this is a sine qua non for using a cloud service.

I’m an Office 365 subscriber (but only plan P, so I didn't enjoy any benefit from the recent price decreases announced by Microsoft) and the performance of Office 365 against its SLA is important to me. Although I cheer and applaud Office 365 on achieving six months of reliable service, it seems that the complexity of running large-scale cloud services, coupled with the Achilles heel of a dependency that no one controls, will continue to represent huge operational challenges for cloud services going forward. Even as I look forward to what I hope will be the next six months of uninterrupted service from Office 365, my advice of “look before you leap”, advanced in an article dated October 17, 2011, remains valid.

Follow Tony’s ramblings via Twitter.
