Economic uncertainty, rising costs, and increasing pressure on businesses to earn consumer trust and loyalty are among the factors driving organizations to seek new ways to deliver responsive and high-performance digital services to the market. Cloud-native computing has given them impressive capabilities to build scalable applications that adapt to rapidly changing business conditions but has also introduced new levels of complexity.
Today's enterprise applications comprise hundreds or thousands of loosely coupled containerized services and ephemeral functions strung together with APIs. Diagnosing the causes of outages can be an overwhelming task, but it's a crucial one as new research reveals the exorbitant cost of those outages.
According to New Relic's 2023 Observability Forecast, the median annual cost of high-business-impact outages is $7.75 million. Three in five respondents to the survey said outages cost at least $100,000 per hour of downtime on average, while 32% put the cost at more than $500,000 and 21% said outages cost their organizations at least $1 million per hour.
What Causes Outages?
Errors, downtime, and outages are bound to happen for various reasons. The most common causes are hardware failures, software bugs, cyberattacks, and human errors. Minimizing risk starts with gaining support at the most senior levels of the organization for investing in robust IT infrastructure, cybersecurity, and cloud-native application development. With more than 70% of customer interactions flowing through digital channels, the case for making those investments is persuasive.
Proactive measures organizations can take to minimize downtime risk include:
- Performing system maintenance on a regular schedule to ensure that equipment is in good working order and up-to-date with patches and security fixes;
- Adopting robust cybersecurity practices such as multifactor authentication, zero-trust access controls, network segmentation, and endpoint scanning;
- Embracing "shift right" and "shift left" software development practices that integrate testing into the entire lifecycle from initial design through ongoing maintenance;
- Encouraging collaboration among developers so that potential outages can be addressed during the build and deployment cycles; and
- Investing in redundancy and failover mechanisms to reduce the risk of downtime and maintain uninterrupted service.
Despite best efforts, though, some outages are inevitable. When they occur, organizations need to have the tools in place to diagnose and remediate them as quickly as possible. Finding the source of an error isn't as straightforward as you might assume. Think about a flooded yard. You may notice water flowing near your hose only to find that the cause of the flood is actually a crack somewhere in your water main. If you assumed that the leaking hose caused the flood, you'd end up with a fixed hose but a ruined lawn.
Observability is a discipline that has arisen to collect and analyze, at scale, the metrics, logs, and traces generated by IT infrastructure and applications. It guides you to the source of an issue and empowers teams to dig as deeply as they need to before implementing a solution, so you can fix the problem before the flood happens.
New Relic's survey of 1,700 technology professionals in 15 countries highlighted the impact of observability on a business's bottom line. Organizations with full-stack observability in place reported median outage costs that were 59% lower than the $7.75 million median. They also reported fewer outages, faster mean times to detection and resolution (MTTD and MTTR), and a higher median annual return on investment than those that hadn't achieved full-stack observability.
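To make the MTTD and MTTR figures concrete, here is a minimal Python sketch of how a team might compute them from its own incident records. The `Incident` type, the sample timestamps, and the function names are hypothetical and not drawn from the survey. Definitions also vary between teams: this sketch assumes MTTD runs from the start of a fault to its detection and MTTR from detection to resolution. The last function simply applies the 59% reduction to the $7.75 million median, which implies roughly $3.18 million, a figure derived here by arithmetic rather than quoted from the survey.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detection: fault start to detection (assumed definition)."""
    return mean_minutes([i.detected - i.started for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolution: detection to resolution (assumed definition)."""
    return mean_minutes([i.resolved - i.detected for i in incidents])


def reduced_cost(baseline_dollars: int, percent_lower: int) -> int:
    """Apply a whole-number percentage reduction using integer arithmetic."""
    return baseline_dollars * (100 - percent_lower) // 100


# Made-up sample data for illustration only.
sample = [
    Incident(datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 10),
             datetime(2023, 5, 1, 9, 50)),
    Incident(datetime(2023, 5, 2, 14, 0), datetime(2023, 5, 2, 14, 30),
             datetime(2023, 5, 2, 16, 0)),
]

print(mttd_minutes(sample))         # 20.0
print(mttr_minutes(sample))         # 65.0
print(reduced_cost(7_750_000, 59))  # 3177500
```

Tracking these two averages over time is one simple way to check whether an observability investment is actually shortening the gap between a fault and its fix.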
Predictive, Not Reactive
Observability is proactive and predictive. It relies on large amounts of data drawn from multiple sources and seeks to identify not just what happened and when but why and how. One of the major advantages of observability is that it can identify unexpected problems, which are the most likely to cause extended downtime. It enhances the proactive measures outlined above by enabling administrators and developers to quickly see the impact of changes such as code releases or patches.
While logs, metrics, and traces are the three pillars of observability, as many as 17 distinct sub-capabilities may be involved, including network monitoring, database monitoring, error tracking, and AIOps. The survey results show a clear correlation between the number of observability capabilities an organization has in place and the payoff in less frequent and shorter downtime events. For example, respondents who said their organization had five or more observability capabilities in place were 40% more likely to detect high-business-impact outages in 30 minutes or less than those with fewer than five capabilities.
Achieving full-stack observability doesn't mean running the table with every possible tool. It's about having the ability to observe the status of each component in a distributed environment in real time. The elements needed to do that vary by the complexity of the IT stack.
More important than the number of tools used is that data is unified. A consolidated approach enables developers and engineers to shift their attention from fighting fires to fixing problems before they occur. It also improves collaboration and cross-pollination of skills. Survey respondents with unified telemetry data reported fewer high-business-impact outages, a faster MTTD, and a faster MTTR than those with more siloed data.
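As a rough illustration of what unified data makes possible, the Python sketch below joins log lines and trace spans on a shared request identifier. All record shapes, field names such as `trace_id`, and the sample values are assumptions made for this example, not the format of any particular product. The point is only that once telemetry shares a common key, a single lookup shows every service a failing request touched, not just the one that logged the error.

```python
from collections import defaultdict

# Hypothetical records from two separate sources that share a trace_id field.
logs = [
    {"trace_id": "t1", "service": "checkout", "level": "error",
     "message": "payment timeout"},
    {"trace_id": "t2", "service": "search", "level": "info",
     "message": "query ok"},
]
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 5200},
    {"trace_id": "t1", "service": "payments", "duration_ms": 5100},
    {"trace_id": "t2", "service": "search", "duration_ms": 40},
]


def unify(logs, spans):
    """Group log lines and trace spans under the request they belong to."""
    unified = defaultdict(lambda: {"logs": [], "spans": []})
    for record in logs:
        unified[record["trace_id"]]["logs"].append(record)
    for span in spans:
        unified[span["trace_id"]]["spans"].append(span)
    return dict(unified)


def services_behind_errors(unified):
    """For each request that logged an error, list every service it touched."""
    result = {}
    for trace_id, view in unified.items():
        if any(entry["level"] == "error" for entry in view["logs"]):
            result[trace_id] = sorted({s["service"] for s in view["spans"]})
    return result


by_request = unify(logs, spans)
print(services_behind_errors(by_request))
# {'t1': ['checkout', 'payments']}
```

In this made-up data the error is logged by the checkout service, but the unified view also surfaces the payments service on the same request, which is the kind of lead that siloed tools tend to hide.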
The value of this single view isn't lost on IT leaders, one-quarter of whom told us that juggling too many monitoring tools is a primary challenge to achieving full-stack observability. Fortunately, the situation is improving. Respondents to the 2023 survey reported using fewer tools on average than respondents to the previous year's survey, while the proportion using a single tool more than doubled. This indicates that as observability matures, IT organizations are shifting to a consolidation strategy.
Observability is only one element of a resilient IT infrastructure but one that ties into best practices across the board. When you consider the growing cost of downtime in lost business, customer frustration, and reputational damage, the case for investing in resilience becomes an easy one to make.
Peter Pezaris is Chief Strategy and Design Officer at New Relic.