Learning from cloud outages

Cloud services have been around for years, but only now are they a realistic choice of computing platform for many applications. Early cloud services were focused on consumers, and it was only after the launch of SalesForce.com that businesses, including major software and hardware vendors, began to focus on the task of transforming IT to accommodate both on-premises and cloud models. Recently we have seen an unfortunate storm of highly visible failures across an array of cloud services from multiple vendors, and these have caused people to wonder whether it’s wise to move work to the cloud now.

Users of consumer cloud services, especially free ones, are far more tolerant when problems occur. If Picasa or Snapfish fail to upload a photo because of a network glitch, you’re likely to blame “the Internet” or some other reason and simply restart the upload. When a business cloud service suffers the same kind of failure (for example, a message cannot be sent or an attachment uploaded), users are much less likely to be so forgiving. Yet both consumer and business cloud services depend on the same loosely-coupled Internet that no one really manages, so why should we be surprised when glitches occur? With this in mind, let’s look at some of the recent cloud events to see what can be learned to be better prepared for the future.

A recent outage for Microsoft’s Business Productivity Online Suite (BPOS) in August 2011 occurred when a transformer belonging to the national electricity network failed in Dublin. BPOS hasn’t had a great record for stability, so much so that Google was able to count some 113 unplanned incidents during 2010. An unplanned incident covers a variety of sins, but 113 in a year, or roughly one every three days, isn’t a track record to which anyone would aspire. It’s fair to say that BPOS ran software (such as Exchange 2007) in 2010 that was never designed to support the scale and complexity of cloud infrastructures. Older applications don’t function so well when asked to run in the cloud simply because they are usually designed to run inside the well-known boundaries of standard corporate deployments. It’s therefore a good idea to select applications that have been purpose-built for the cloud or those that have been re-engineered to support cloud infrastructures. SalesForce.com is an example of an application in the former category; the versions of SharePoint, Lync, and Exchange that run in Office 365 are examples of the latter.

Amazon has a great record for its online stores and is also in the business of selling compute power to third parties. The same failure that affected BPOS also struck Amazon’s European Datacenter and caused three major services (Elastic Compute Cloud (EC2), Elastic Block Store (EBS), and Relational Database Service (RDS)) to go offline. In this instance, the backup generators in the datacenter failed to restore power and the services stayed down. Amazon’s cloud services provide the fundamental underpinning for the applications of many other companies. Amongst others, the failure brought Reddit, FourSquare, Quora, and Indaba Music crashing to a halt. The lesson here is perhaps that concentrating so much computing horsepower in a relatively small number of massive datacenters might be compared to putting all one’s eggs in a single basket.

In its short production lifetime since its June 28 launch, Office 365 has experienced two very public outages, in August and September 2011, totaling some 330 minutes. These outages undermined Microsoft’s reputation for high-quality operations of a new service that is supposed to fix all of the problems that customers previously experienced with BPOS. The first problem affected Exchange Online users in North America because a network component failed and wasn’t backed up with redundant hardware. The second failure afflicted users across the world because a DNS configuration change went horribly wrong (the same problem affected Hotmail, Azure, and other Microsoft cloud services). Given the billions of dollars that Microsoft has invested in datacenters around the world and the software engineering to make their products cloud-ready, failures in basic operational disciplines are surprising and unwelcome. The most optimistic view is that these failures are merely growing pains and that Office 365 will prove itself to be a highly robust service over time. The good news is that Office 365 has survived a whole five weeks in production without a further outage, so things could be on the mend. We shall see.

RIM BlackBerry is the latest cloud service to suffer a meltdown. Customers experienced four days of degraded service that spread like a ripple across a pond after the failure of a “core Cisco switch” in the UK was followed by the corruption of an Oracle database. Ever since its inception, RIM has exerted close control over its service through a set of Network Operations Centers (NOCs). The function of the NOCs is to handle message traffic from BlackBerry devices: messages are transported over mobile networks back to the NOCs, where they are processed and routed to their final destination. In this instance, the failure in a UK-based NOC was not handled by a failover to different hardware, and the subsequent problems in processing messages due to the corrupt database caused a huge backlog to accumulate, in turn slowing the RIM network and delaying message delivery to users. This outage is similar to another experienced by RIM on April 17, 2007, when a software upgrade didn’t deliver the expected results and a backup system failed, leading to a huge backlog of email.

RIM has built its reputation on bulletproof messaging, and losing service for up to four days just heaped more woes on a company already struggling to refresh its device lineup and cope with the PlayBook’s lack of success in the market.

Some commentators have reflected that RIM’s problems might be the result of a failure to invest in sufficient capacity to handle their recent success in convincing consumers to choose BlackBerry over other devices, often because of BBM, the inbuilt BlackBerry Messaging service. After all, if your friends have BBM you want it too and everything goes swimmingly until a problem comes along to disrupt service.

Now boasting availability figures of 99.96% in the first six months of 2011, Gmail is doing very well. Even so, Gmail did have a significant issue in February 2011 when some 20,000 users “lost” their inbox for a period. The Google Documents application has also had problems, the latest coming on September 9 when a software change exposed a memory management bug. In this case the outage only lasted 30 minutes before Google restored access for users to their documents. Generally Google has demonstrated the ability to deliver reliable cloud services, even if you might quibble with the user interface of their applications.
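To put a figure like 99.96% into perspective, it helps to convert the percentage into minutes of downtime. The short Python sketch below does that arithmetic. The function name is made up for illustration, and the 181-day period (January 1 to June 30, 2011) is an assumption about how "the first six months" is counted, not something Google has stated.

```python
def downtime_minutes(availability_percent: float, period_days: float) -> float:
    """Return the minutes of downtime implied by an availability percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Assumption: the first six months of 2011 span 181 days (Jan 1 to Jun 30).
first_half_2011 = downtime_minutes(99.96, 181)
print(f"Implied downtime: about {first_half_2011:.0f} minutes")
```

Under those assumptions, 99.96% works out to roughly 104 minutes of downtime per user over the half year, which gives some context for an outage that lasts 30 minutes.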

So what can we learn from all the fuss and bother that flows from cloud outages so that we are better prepared for the future?

No one controls the Internet, so offline access to information is important because it allows users to continue working with locally cached data when problems occur. Microsoft is ahead of Google in terms of clients that are able to continue functioning when the network fails.
Service Level Agreements (SLAs) are often measured at the boundary of the provider’s datacenter and don’t take Internet glitches into account. Anyone signing up for a cloud service needs to understand exactly how the SLA is measured. I like the way that Google has eliminated planned downtime for maintenance from their SLA calculation.
The delivery of incredible support including timely and accurate communication is hyper-critical for cloud providers. After all, you are literally looking into an opaque cloud when something goes wrong and you depend on the cloud providers to keep you informed as to what has happened, what they are doing about it, and when they expect normal service to be resumed. Recent experience is that Google has done a good job with support in terms of timely resolution and up-front communication while Microsoft and RIM have struggled.
Many issues appear when operators make changes in cloud datacenters. It’s almost impossible to test for every possible scenario that might occur when a software, hardware, or configuration change is applied, and some problems cannot be anticipated. You therefore need to know when your providers have changes scheduled so that you can be prepared, just in case.
No one can anticipate something like a 110 megawatt transformer suddenly going “pop”. But you can ask questions of a provider about the levels of redundancy that are incorporated into their datacenters and how they plan for unforeseen circumstances. Better again, ask them what actions you’re expected to take to restore service – sometimes switching datacenters requires a lot of network configuration changes and that’s not so easy if you have to update individual PCs.
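The point about understanding exactly how an SLA is measured can be made concrete with a little arithmetic. The sketch below compares the availability figure a provider would report when planned maintenance counts as downtime against the figure when it is excluded. The function, its parameters, and the sample month are all hypothetical; they do not describe any real provider's SLA terms.

```python
def availability_percent(period_minutes: float,
                         unplanned_minutes: float,
                         planned_minutes: float = 0.0,
                         count_planned: bool = True) -> float:
    """Availability over a period, optionally ignoring planned maintenance."""
    downtime = unplanned_minutes
    if count_planned:
        downtime += planned_minutes
    return 100 * (1 - downtime / period_minutes)


# Hypothetical 30-day month: 20 minutes of unplanned outage
# and 120 minutes of scheduled maintenance.
month = 30 * 24 * 60
strict = availability_percent(month, 20, 120, count_planned=True)
lenient = availability_percent(month, 20, 120, count_planned=False)
print(f"Planned downtime counted:  {strict:.3f}%")
print(f"Planned downtime excluded: {lenient:.3f}%")
```

The same month of service produces two quite different headline numbers depending on the rule, which is why the measurement method deserves as much attention as the percentage itself.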

Handing over control of an application to a cloud provider makes exquisite sense for many companies and can return real benefits in terms of finance and operational efficiency. However, the old adage that one should “look before you leap” rings true, and the Boy Scout motto “Be Prepared” is also valuable when plans are put in place to execute the transition and elevate into the cloud.
