What Your Business Can Learn from Amazon Web Services' Recent Outage

When Amazon Web Services experienced a five-hour service disruption on Feb. 28, many cloud customers and websites were knocked offline, making it impossible for employees and customers to connect.

For businesses, an outage like that can hit the bottom line hard. According to a post on the AWS web page, the disruption was caused by an AWS technician who made a simple typing error while entering a command during maintenance. The command removed more servers from operation than intended, causing the outage. It was an easy mistake to make, but its implications were serious for customers.

Amazon reported that it has made some procedural changes since the incident, including modifications to the tool that was used to remove the servers. As built, the tool let too much capacity be removed too quickly, which allowed the input error to go undetected, the company said. "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level," the company stated. "This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks."
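To make the two safeguards Amazon describes more concrete, here is a minimal sketch in Python. It is not AWS's tool, and nothing about its internals is public in this article; the names `Subsystem`, `plan_removal` and `max_per_step` are hypothetical and exist only for illustration. The sketch refuses any request that would drop a subsystem below its minimum capacity, and it splits an accepted request into small batches so capacity comes out slowly.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    # Hypothetical model of a pool of servers, for illustration only.
    name: str
    active_servers: int
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int) -> list[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises ValueError if the request is malformed or would leave the
    subsystem below its minimum required capacity.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")

    # Safeguard 1: reject input that would breach the minimum capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise ValueError(
            f"removing {requested} servers would leave {subsystem.name} with "
            f"{remaining}, below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in batches of at most max_per_step.
    batches = []
    left = requested
    while left > 0:
        step = min(left, max_per_step)
        batches.append(step)
        left -= step
    return batches


index = Subsystem(name="index", active_servers=100, min_required=80)
print(plan_removal(index, requested=15, max_per_step=5))  # [5, 5, 5]
```

With a check like this, a mistyped request for 50 servers instead of 15 would be rejected outright rather than executed.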

All of that is good news for the cloud customers and internet users who rely on the many services and websites hosted by AWS. But what can your business learn from the experience so that your customers and employees can still connect the next time an outage occurs?

One of the most important things to do ahead of time is to plan for redundancy, several IT analysts told ITPro.com.

"Businesses are well-advised to not keep all of their IT eggs in one basket," Dan Olds, principal analyst at Gabriel Consulting Group, said. "Relying too much on a single cloud can lead to tears," as happened in the Feb. 28 AWS outage.

"Redundancy is the key," said Olds. "In data centers, they have redundant systems or failover mechanisms in place to handle outages at the data center level. Customers need to have these same types of mechanisms for their cloud workloads."

Ultimately, "anything that is dependent on humans and things that humans design and build can and will fail at some point," added Olds. "It’s inevitable. And customers need to have plans in place for what they need to do to keep operating when something, whether it's a single system, a private cloud, or a public cloud service, fails."

Another analyst, Charles King of research firm Pund-IT, said that even the heightened availability services offered by AWS didn't work during the Feb. 28 outage, pointing to the need for companies to consider their own redundancy plans, in spite of the costs.

"If maximizing resilience and availability are critically necessary, managing applications and data onsite or mirroring them in more than one public cloud would be wise," said King. "Expensive? Probably. Prohibitively? It depends. What would it cost to have your business go offline for 5 hours? I expect a number of the companies impacted by the AWS outage are re-evaluating their cloud computing perceptions and beliefs."

King said he has heard reports since the outage of companies that had mirrored copies of their websites up and running, and applications running on onsite servers, for redundancy. Such steps kept them in operation while other businesses that rely solely on AWS were offline. "If an organization is 100 percent invested in AWS or any other public cloud, it is entirely dependent on that provider," he said.

One thing to remember, said King, is that "major outages like this one don't happen often but they're remembered for years afterward. It wouldn't be surprising if some customers decided to leave AWS as a result of this event, and Amazon's competitors are likely trying to leverage the company's pain for their own advantage."

And there are plenty of options, he said. "Microsoft Azure is certainly positioning itself as a sophisticated, reliable, business-friendly cloud provider. IBM Cloud focuses on enterprise-class services, and emphasizes hybrid deployments that integrate its own assets and services with customers' IT data and applications."

For customers, "AWS may be the market's largest cloud provider but there are plenty of viable alternatives to consider," he said.

Deepak Mohan, an analyst with research firm IDC, told ITPro.com that the Feb. 28 AWS outage was as significant as it was because one of the affected services was read/write access to S3, one of the fundamental infrastructure services at AWS.

For this reason, cloud-based applications that are sensitive to downtime "must be designed to cater for outages or lack of availability with components of the underlying infrastructure layer," said Mohan. "Even though errors are becoming fewer and further apart, sensitive applications must be designed to be resilient through infrastructure availability failures. This may be by leveraging multiple Availability Zones, multiple regions or multiple infrastructure platforms."
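As a rough illustration of what designing for infrastructure failure can look like in application code, the Python sketch below tries a list of storage replicas in order and returns the first successful read. It is a simplified assumption, not a pattern prescribed by Mohan or AWS: the function `read_with_failover` and the replica names are hypothetical, and the fetch functions stand in for whatever client a real application would use to reach a second region or a second cloud.

```python
from typing import Callable, Sequence

# A replica is a (name, fetch) pair; fetch takes a key and returns bytes,
# or raises an exception if that replica is unavailable.
Replica = tuple[str, Callable[[str], bytes]]


def read_with_failover(key: str, replicas: Sequence[Replica]) -> tuple[str, bytes]:
    """Return (replica_name, data) from the first replica that answers."""
    errors = []
    for name, fetch in replicas:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure moves on to the next replica
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


# Simulated replicas, standing in for real storage clients.
def primary_region(key: str) -> bytes:
    raise ConnectionError("region unavailable")


def secondary_region(key: str) -> bytes:
    return b"contents of " + key.encode()


source, data = read_with_failover(
    "index.html",
    [("primary", primary_region), ("secondary", secondary_region)],
)
print(source)  # secondary
```

The sketch only covers reads. Keeping the replicas in sync in the first place, and deciding how stale a copy can be, is where most of the cost Mohan mentions comes from.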

In most cases, multi-site redundancy can be added at relatively low additional cost, with the level of disaster recovery that's needed depending on use cases, he said. "As the criteria become more stringent, the costs will naturally get higher."

Ultimately, "repeated failures with any provider will be a challenge, in terms of continued usage of the platform by customers," said Mohan. "This is not unique to AWS, and cloud customers need to eventually adopt redundancy practices -- that are common in cloud native applications -- for all sensitive applications in the cloud."
