Although the word "chaos" generally denotes an undesirable state of being, for DevOps teams the practice of chaos engineering can lead to very positive results.
For the past five years, San Jose, Calif.-based Gremlin has been in the business of helping organizations employ chaos engineering techniques. With chaos engineering, faults and error conditions are intentionally injected into running processes and platforms to see how they and the teams that support them react.
The practice of chaos engineering first gained public exposure in 2011, when streaming media giant Netflix released its open-source Chaos Monkey tool, providing developers with an open platform for testing. Gremlin has taken the concept a few steps further, providing a commercially supported, enterprise-grade platform for DevOps professionals to run and manage chaos experiments.
On its fifth anniversary on Jan. 26, Gremlin released its inaugural State of Chaos Engineering report, providing insights from more than 400 IT professionals from around the world. Among the high-level findings in the report is that regularly running chaos engineering experiments can lead to improved service availability. In fact, the survey found that with chaos engineering, 60% of teams had a mean time to resolution (MTTR) for issue remediation of less than 12 hours.
Five Years of Gremlin
Kolton Andrus, CEO and co-founder of Gremlin, told ITPro Today that there have been a few surprises over the last five years as he has ramped up the business.
"The first is how quickly the idea made sense to people," he said. "I’d been doing chaos engineering at Amazon and Netflix for years, but it was still nice to see how much our message is resonating across industries."
Andrus noted that his original idea was that if given the right tools, engineers could easily just start running chaos experiments on their own.
"The truth was that we’re introducing a new way of working, which means that we need to build more guidance into the product to get people started," he said. "That initial impression and onboarding is something we are constantly working to improve."
The State of Chaos Engineering
The survey found that network attacks are both the most commonly run chaos experiments and the source of the most common failures. Network attacks cover a wide range of error conditions, including latency, DNS issues and packet loss.
Within microservices architectures, a system often fails because someone else's system failed, Andrus said. As such, the ability to test the network and understand how various systems and applications impact one another is critical.
"Running latency attacks, for example, helps engineers ensure customer experience continuity," Andrus explained. "They allow engineers to intentionally slow down network requests and observe how this affects response time, page load time, application stability and ultimately the customer experience."
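The idea behind a latency attack can be sketched in a few lines of code. The example below is a minimal, hypothetical illustration (it is not Gremlin's implementation or API): a decorator that delays a fraction of calls to a downstream dependency, letting engineers observe how slower responses ripple through to overall response time.

```python
import random
import time
from functools import wraps

def inject_latency(min_ms=100, max_ms=500, probability=0.5):
    """Delay a fraction of calls to simulate a slow network dependency."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                # Sleep for a random duration within the configured window
                time.sleep(random.uniform(min_ms, max_ms) / 1000.0)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical downstream call; in practice this would be a real
# database query or internal service request.
@inject_latency(min_ms=200, max_ms=800, probability=1.0)
def fetch_user_profile(user_id):
    return {"id": user_id, "name": "example"}

start = time.monotonic()
fetch_user_profile(42)
elapsed_ms = (time.monotonic() - start) * 1000
print(f"call took {elapsed_ms:.0f} ms under injected latency")
```

Running the experiment with increasing delay values shows at what point timeouts fire, retries pile up, or the user experience degrades; that is the signal a latency attack is designed to surface.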
Chaos in Production
While it's common in DevOps models to test applications and services during the development phase, testing in production is far less common.
A surprising finding from the survey, according to Andrus, was that less than 10% of people said the fear of breaking things was a primary deterrent for getting started. However, only 34% of respondents indicated that they run chaos experiments in production environments.
"We hear sometimes from people that they don’t do chaos engineering because they already have enough chaos, and that’s exactly the wrong way to look at it — it’s about taming and removing chaos from your systems," Andrus said. "At the end of the day, production is where your customers live, so the people seeing the best ROI from the practice are doing it there."