Experts from Gartner, Cloudian and other industry watchers say the massive yet relatively brief Fastly cloud outage shows some of the resilience of the cloud but also what is at stake if it breaks.
On June 8, edge cloud platform Fastly experienced a global outage that, in a statement from the company, was attributed to “an undiscovered software bug” set off by a valid customer configuration change. According to Fastly, a software deployment in May introduced a bug that could be, and was, triggered by a specific yet normal set of circumstances.
The outage affected Amazon, Reddit, The New York Times and other major websites. Fastly detected the issue within one minute and had 95% of its network back and functional within 49 minutes. The company is taking steps to mitigate future incidents, but the outage highlighted the inescapable ubiquity of the cloud and what it takes to bounce back when it is down.
Josh Chessman, senior research director with Gartner, said, “It served as a good reminder that nothing is perfect.” The point is not to assume there will always be downtime, he said, but to have contingencies in place to identify issues. That might include acknowledging that there is nothing to do while the provider works on the problem, Chessman said, aside from alerting others.
The consolidation of information and resources into the cloud has created the possibility for widespread repercussions when there is an outage, he said. “As organizations plan to move to the cloud, it’s something they should be thinking about.”
In the Fastly incident, one customer made a legitimate change that just happened to have a cascade effect, Chessman said. “That’s one of the challenges with public cloud. We’re all sharing this infrastructure and we have limited control over it.”
He said outages might lead some companies to explore automatic content delivery network switches as a safeguard from outages, but probably not in a massive way. “Outages aren’t frequent enough to make it worthwhile.”
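For readers wondering what an automatic content delivery network switch might look like in practice, here is a minimal sketch, not a description of any vendor's product. The hostnames and the `is_healthy` probe are hypothetical placeholders; a real setup would supply its own health check and would typically act at the DNS or load balancer level.

```python
from typing import Callable, Iterable, Optional

# Hypothetical CDN hostnames, listed in order of preference.
CDN_ENDPOINTS = ["primary-cdn.example.com", "backup-cdn.example.net"]


def pick_cdn(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as an unhealthy endpoint.
            continue
    return None
```

The health check is passed in as a function so that the selection logic can be exercised without any network access. Whether the extra moving parts are worth it is exactly the cost question Chessman raises.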
“Organizations need to do an ROI calculation on cloud migration and digital transformation,” he said. That includes asking how to respond, and what the implications are, if a resource goes down.
Gary Ogasawara, CTO of data storage company Cloudian, said the outage has brought up considerations about diversifying dependencies among enterprises. This includes multicloud and hybrid cloud strategies. There is some expectation, he said, of reliable access to the cloud much like a utility -- but even utilities can experience disruptions in service. “You expect when you plug something into the wall that electricity will come out,” Ogasawara said. “That’s the type of advantage we all want from the cloud.”
He suggested companies categorize their data and workloads, so they can identify what is absolutely essential that cannot afford downtime and what type of data can withstand temporary unavailability. Ogasawara also suggested testing and playing out different scenarios of disruption.
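The kind of categorization Ogasawara describes can be sketched in a few lines. This is an illustrative assumption rather than a method from Cloudian: the `Workload` type, the workload names, and the 60-minute threshold are all hypothetical, chosen only to show the idea of sorting workloads by how much downtime they can tolerate.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Workload:
    name: str
    max_downtime_minutes: int  # longest outage the business can tolerate


def tier(workload: Workload, critical_threshold_minutes: int = 60) -> str:
    """Label a workload 'critical' if it tolerates less than the threshold."""
    if workload.max_downtime_minutes < critical_threshold_minutes:
        return "critical"
    return "tolerant"


def group_by_tier(
    workloads: Iterable[Workload],
    critical_threshold_minutes: int = 60,
) -> Dict[str, List[str]]:
    """Group workload names by tier so essential ones stand out."""
    groups: Dict[str, List[str]] = {"critical": [], "tolerant": []}
    for workload in workloads:
        groups[tier(workload, critical_threshold_minutes)].append(workload.name)
    return groups


# Hypothetical inventory used only for illustration.
inventory = [
    Workload("checkout", max_downtime_minutes=5),
    Workload("monthly-reports", max_downtime_minutes=1440),
]
```

With this inventory, `group_by_tier(inventory)` places `checkout` under `critical` and `monthly-reports` under `tolerant`. The critical list is where scenario testing and redundancy spending would go first.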
John Bates, chief product officer with testing and measurement equipment provider Keysight Technologies, said the outage emphasized a need for automated testing for organizations eager to maintain continuous delivery of software via the cloud to beat competitors. “You’ve got to prepare for the unknown unknowns,” he said.
The outage also put other topics in focus that might not have received consistent attention in the past. Though DevOps is frequently talked about in enterprise development circles, Bates questioned to what degree it is being implemented. “If we can truly get to a DevOps world, securing development and operations, it’s going to help a lot,” he said. “We talk very glibly about DevOps, but we don’t ask the really hard questions about if anyone is really doing this.”
Taken in the context of sudden moves to the cloud in response to the pandemic, the Fastly outage was a relatively quick blip, said Drew Firment, senior vice president of transformation with cloud training platform A Cloud Guru. The incident does offer a moment for reflection for organizations. “Folks are looking at their cloud architecture,” he said. “Architecture equals operations.” As organizations build in the cloud, decisions on cloud providers and services can have a dramatic effect on resiliency, Firment said. “That’s why cloud architects are in such demand, especially if they can take those things into consideration.”
Those who have been reluctant to migrate to the cloud might see such outages as a reason to back away from digital transformation. Furthermore, some organizations might try extreme measures, sacrificing the quality of their applications, just to avoid any possibility of downtime. Either approach may cause more headaches than it solves. “It’s like going multicloud for all the wrong reasons,” Firment said. “You have an application on three different cloud providers that no one is going to use because it sucks. Guess what? You don’t have to worry about vendor lock-in anymore.”
Maintaining an iron grip on applications by not leveraging cloud resources can also be an issue. “Congratulations, you have an application that won’t scale, can’t be used globally, but it will never go down,” Firment said.
Exploring alternative approaches to using the cloud will naturally continue, even though the Fastly outage was dealt with. Maria Paula Fernández, advisor to Golem Network, a decentralized cloud computing network, said even they experienced some disruption. “It makes us realize that we need unstoppable infrastructure that is able to power reliable applications and websites,” she said. “It’s a big reality check for everyone building this kind of infrastructure.”
There are more lessons to be learned from the Fastly outage, but momentum for the cloud and digital transformation shows no signs of stopping. “The outage exposes a traditional paradox,” said John Annand, director of infrastructure at Info-Tech Research Group. “If we don’t know things are happening, we don’t worry about them. When we start to get visibility into the reality, we may get overly concerned.” Outages have occurred in other types of business systems for decades, he said, whether physical or power-related. “Business has to be prepared for them to a degree; they have to look at the likelihood of them happening,” Annand said. “They have to decide how much of that risk they want to mitigate.”
Continuity planning for IT systems should include a plan of action for what he said is one of the most predictable scenarios in the world. “We know that there will be an outage at some point, of some sort with these systems,” Annand said. “Rather than pretend that it can’t happen, why don’t we plan for it and be reasonable about how we want to deal with it?”