IT professionals know that application outages don't just happen — they are often the result of undiscovered risks and misconfigurations that accumulate over time. Finding those vulnerabilities before they impact service is the goal of a new solution that chaos engineering startup Gremlin announced on Aug. 30.
Gremlin designed the new Detected Risks capability to help enterprise IT teams identify potential reliability risks in their systems before they cause operational incidents or outages. Unlike existing monitoring solutions that only detect problems as they are happening, Detected Risks takes a proactive approach to find risks and provide actionable recommendations before they ever impact users.
Detected Risks works by continuously running automated experiments that simulate failures and risky conditions in production environments.
Key features and capabilities provided by Detected Risks include:
- Continuous chaos experiments to surface risks 24/7 across an environment before they become problems
- Identification of risks like configuration errors, performance bottlenecks, and cascade failures
- Clear and directed remediation advice on addressing detected vulnerabilities
- Integration with existing observability stacks and workflows including Datadog, New Relic, and Splunk
- Risk templating and customization to tailor experiments to an organization's specific needs
"With Detected Risks, technology leaders in SRE [site reliability engineering], platform, cloud, and software engineering can enforce common reliability standards across hundreds or even thousands of services and teams," Aaron Kaffen, vice president of marketing at Gremlin, told ITPro Today.
Detected Risks Expands Chaos Engineering to Improve Software Reliability
The Detected Risks tool builds on Gremlin's existing chaos engineering platform, which helps developers and Ops teams stress-test system resiliency.
While chaos engineering itself provides ways to build confidence through failure testing, Detected Risks aims to make it easier to get actionable insights.
"Think of Detected Risks as a companion to chaos engineering," Kaffen said. "Where chaos engineering is designed to help you experiment and understand risks to your system, Detected Risks is designed to simply enforce common and well-defined best practices. Both are important for a modern reliability program."
Detected Risks runs continuous experiments, injecting conditions such as drained resources, traffic spikes, and configuration changes to uncover potential cascade failures before they happen. Risks are detected across infrastructure and applications using Gremlin's failure-as-a-service approach.
The Intersection of Vulnerability and Risk Scanning
A common best practice for software development is to conduct vulnerability scans to help improve security. According to Kaffen, it's possible to draw a parallel between Detected Risks and vulnerability scanning.
"Where vulnerability scanning looks for security risks, Gremlin is looking for reliability risks," he said.
Kaffen noted that Gremlin's dashboard shows a company's detected risks by team, including those that have been resolved. There is integration with multiple systems including Atlassian's Jira, and the Gremlin API can be used to integrate with any ticketing or chat system.
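To make the idea of a per-team risk view and a ticketing handoff concrete, here is a minimal Python sketch. It is not Gremlin's API or data model: the `DetectedRisk` record, its field names, the `summarize_by_team` and `ticket_payloads` helpers, and the sample entries are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedRisk:
    # Hypothetical record shape; not Gremlin's actual data model.
    team: str
    service: str
    description: str
    resolved: bool


def summarize_by_team(risks):
    """Return {team: {"open": n, "resolved": m}} for a list of risks."""
    summary = defaultdict(lambda: {"open": 0, "resolved": 0})
    for risk in risks:
        key = "resolved" if risk.resolved else "open"
        summary[risk.team][key] += 1
    return dict(summary)


def ticket_payloads(risks):
    """Build one generic ticket dict per unresolved risk."""
    return [
        {
            "title": f"[{r.team}] Reliability risk in {r.service}",
            "body": r.description,
        }
        for r in risks
        if not r.resolved
    ]


# Made-up sample data, used only to exercise the helpers above.
SAMPLE_RISKS = [
    DetectedRisk("payments", "checkout-api", "No CPU limit set", False),
    DetectedRisk("payments", "ledger", "Single availability zone", True),
    DetectedRisk("search", "indexer", "No memory limit set", False),
]

if __name__ == "__main__":
    print(summarize_by_team(SAMPLE_RISKS))
    for payload in ticket_payloads(SAMPLE_RISKS):
        print(payload["title"])
```

In a real integration, the generic dictionaries returned by `ticket_payloads` would be mapped onto whatever fields the target ticketing or chat system expects.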
Looking forward, the goal for Gremlin is to continue to build out capabilities that help organizations improve digital resilience.
"We're working with some of the world's largest banks, retailers, media companies, and others to better understand the growing need for digital resilience in enterprise software," Kaffen said. "Those customers have told us that the next horizon for reliability is scale."
About the author
Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He consults to industry and media organizations on technology issues.