Cloud services depend on a great deal of infrastructure that evolved gradually as the Internet expanded, and that dependency caused more headaches for Office 365 this week. Whereas the August 17 outage for North American users of Exchange Online was due to failed network components, Microsoft might not be to blame for the outage that afflicted Office 365, SkyDrive, Azure, Hotmail and other cloud services on September 8/9, as the root cause appears to lie within the Domain Name System (DNS). Clients reported an inability to resolve the DNS names required to reach the Microsoft services and so were unable to connect.
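For readers who want to see what "unable to resolve the DNS names" looks like from the client side, here is a minimal Python sketch. It is only an illustration, not a description of how any Microsoft client behaves: the function name `check_dns` and its `resolver` parameter are my own inventions, and the hostnames in the usage comment are placeholders. It relies on the standard library's `socket.getaddrinfo`, which raises `socket.gaierror` when a name cannot be resolved.

```python
import socket
from typing import Callable, Dict, Iterable


def check_dns(hostnames: Iterable[str],
              resolver: Callable = socket.getaddrinfo) -> Dict[str, bool]:
    """Map each hostname to True if it resolved, False if the lookup failed.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    that the check can be exercised without touching the network.
    """
    results = {}
    for name in hostnames:
        try:
            # Port 443 is an assumption (HTTPS); only the name lookup matters here.
            resolver(name, 443)
            results[name] = True
        except socket.gaierror:
            # The name could not be resolved, so no connection can be attempted.
            results[name] = False
    return results


# Example usage (placeholder hostnames, requires network access):
#   for name, ok in check_dns(["example.com", "no-such-host.invalid"]).items():
#       print(name, "resolved" if ok else "DNS lookup failed")
```

The point of the sketch is that a failed lookup stops the client before it ever talks to the service, which is why perfectly healthy servers can still look "down" to users.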
The incident started at approximately 5AM in mainland Europe (8PM on September 8 in Seattle). The effects were initially felt by users in the Asia-Pacific region. As time went by, European users began trying to connect as well and added their voices to the clamour asking what was going on. Of course, as DNS wasn’t working, it was impossible to get status updates from the Microsoft service dashboard, so the information void had to be filled by Twitter updates from the official Office365 account. We therefore saw the best and the worst of the Internet – some web services from some vendors (such as Twitter) were available while others were not.
In a blog post at 7:49AM, Microsoft reported that they had to make DNS configuration changes that then had to propagate before normal operations could be resumed. Microsoft didn't offer any details about what exactly they had to do to fix DNS. The reconfiguration seemed to take effect from about 8:20AM, with users gradually being able to reconnect to all services. Based on the non-scientific measurement of Twitter reports, all users were able to reconnect within an hour. In total, the incident lasted in the region of 140 minutes at best and 200 at worst.
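To put an outage of 140 to 200 minutes in context, it can help to express it as availability over a month. The sketch below is simple arithmetic under my own assumptions: a 30-day month (September has 30 days) and the whole outage counted as downtime for every user. The function name `monthly_availability` is hypothetical, and the calculation says nothing about how Microsoft actually measures uptime.

```python
MINUTES_PER_DAY = 24 * 60


def monthly_availability(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Return availability for the month as a percentage.

    Assumes the service was fully unavailable for `downtime_minutes`
    and fully available for the rest of the month.
    """
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Best-case and worst-case estimates of the outage length quoted above.
best_case = monthly_availability(140)
worst_case = monthly_availability(200)
```

Under these assumptions, either estimate leaves the month at roughly 99.5% to 99.7% availability, so a single incident of this length is enough to matter for any uptime target set above that range.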
It will be interesting to see whether Microsoft issues a service credit for this incident. As you might recall, they allowed a 25% credit to all users after the August 17 outage. However, in this case the root cause might be outside Microsoft’s control if the DNS issue was caused externally. On the other hand, someone may well have made a mistake in the configuration of DNS inside Microsoft's datacenters. We shall await the root cause analysis and the deliberations of Microsoft management.
The folks over in Mountain View also had their challenges this week, as Google Docs had an outage on September 7 that lasted approximately an hour (some reports state that the outage lasted 30 minutes; I use the data from the Google Apps dashboard shown below). The incident highlighted the lack of offline access for Google Docs, something that Google is busily working to provide. On the plus side, the Google team seems to have been pretty efficient at getting the service back online in short order.
It hasn’t been a good week for cloud services, and users could be forgiven for questioning the wisdom of moving any important application plus its data into the cloud. But then again, you can argue that the memory of users is selective and that any recollection of outages of internal IT systems has been erased. And you might also comment that internal IT departments have been no more capable of fast response and resolution than the cloud providers. It’s wonderful to live in a world where access to information is available all the time – until you lose that access!