SRE Practices Evolve as Systems Become More Complex

Catchpoint's SRE Report 2024 provides insight into the changing needs of site reliability engineering.

Sean Michael Kerner, Contributor

January 10, 2024


A new report from Catchpoint provides insights into how site reliability engineering (SRE) practices are adapting to increasingly decentralized and complex systems. The SRE Report 2024 summarizes findings from a survey of more than 400 SRE professionals worldwide.

The report reveals that most organizations now need to monitor third-party services and endpoints outside their direct control. This indicates a shift away from centrally managed services toward reliance on federated vendors and infrastructure. Organizations will have to rethink reliability as their architectures become more distributed.

Key highlights from the report:

  • 64% of organizations believe reliability practitioners should monitor experience-impacting endpoints outside their control, such as third-party services.

  • 66% of organizations use two to five monitoring tools due to their unique capabilities, with more tools used as staff size grows.

  • 44% of companies use team structures organized around platforms and capabilities, rather than products.

  • Learning from incidents has the most room for improvement, regardless of company size. Just 52% spend enough time reviewing major incidents.

  • 53% expect artificial intelligence (AI) to make work easier in the next two years, but views are mixed on its usefulness for reliability tasks.


"We were surprised by how interested organizations were in monitoring things outside their control," Leo Vasiliou, web performance expert at Catchpoint, told ITPro Today. "For us, this is a clear indication of the need for new approaches to critical visibility."

The Big Challenges Facing SREs in 2024

There is no shortage of challenges site reliability engineers will face in 2024.

The biggest challenges SRE teams will need to tackle in 2024, according to Vasiliou, are balancing costs, time, alignment between ranks, and complexity of architectures.

More than one-third of survey respondents cited resource constraints as a top concern, with 44% naming cost or budget as a challenge. Vasiliou said there is a significant opportunity for organizations to monitor elements of the internet stack that they do not directly manage, such as content delivery networks (CDNs) and the Domain Name System (DNS).


In his view, this is an important gap to fill in order to improve efficiency, extend reliability practices to third-party providers, and enhance the customer experience.
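The kind of outside-in check Vasiliou describes can be sketched in a few lines of Python. This is only an illustration, not anything taken from the report or from Catchpoint's tooling: the names `ProbeResult`, `timed_probe`, `probe_dns`, and `failing`, the 500 ms threshold, and the hostnames are all hypothetical. It times a DNS lookup through the system resolver and flags targets that fail or respond slowly.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ProbeResult:
    target: str
    ok: bool
    elapsed_ms: float
    error: Optional[str] = None


def timed_probe(target: str, action: Callable[[str], object]) -> ProbeResult:
    """Run one check against a target, recording success and latency."""
    start = time.perf_counter()
    try:
        action(target)
        ok, error = True, None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        ok, error = False, str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return ProbeResult(target, ok, elapsed_ms, error)


def resolve_dns(hostname: str) -> object:
    """Resolve a hostname through the system resolver."""
    return socket.getaddrinfo(hostname, 443)


def probe_dns(hostname: str,
              resolver: Callable[[str], object] = resolve_dns) -> ProbeResult:
    """Time a DNS lookup; the resolver is injectable so it can be swapped out."""
    return timed_probe(hostname, resolver)


def failing(results: Iterable[ProbeResult],
            max_ms: float = 500.0) -> List[str]:
    """Return targets that errored or exceeded the (assumed) latency budget."""
    return [r.target for r in results if not r.ok or r.elapsed_ms > max_ms]


if __name__ == "__main__":
    # Hypothetical third-party hostnames; replace with the vendors you rely on.
    hosts = ["example.com", "example.org"]
    results = [probe_dns(h) for h in hosts]
    for r in results:
        print(f"{r.target}: ok={r.ok} {r.elapsed_ms:.1f} ms {r.error or ''}")
    print("needs attention:", failing(results))
```

A real deployment would run checks like this from several locations and on a schedule, since a single vantage point says little about what end users actually experience.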

How SREs Can Learn From Incidents

Learning from incidents was cited as the top area for improvement across company sizes.


"We recommend taking the time to learn from both major and non-major incidents as they represent major learning opportunities for practitioners that will ultimately improve their company's resilience over time," Vasiliou said.

According to the report, 71% of respondents worked on dozens or even hundreds of non-ticketed incidents per month in 2023. Catchpoint flagged this as a major area for company improvement, Vasiliou said. For SRE teams to improve, they need to be able to track their work, he added.

"I believe instituting the practice of refining blameless feedback loops as part of the company's culture will help teams prepare to tackle major challenges," Vasiliou said.

The Role of AI for SRE

There is little doubt that AI will play some kind of role in SRE, though the report found mixed views on AI's usefulness over the next two years.

"An interesting finding from our report was that the mixed views were mostly based on rank in the organization," Vasiliou said.

He added that it's not surprising that management and leadership are looking at AI for potential cost savings. Whether that is from reducing headcount or accelerating time to market remains to be seen.

In contrast, individual contributors tended to view AI less positively. They said "being proud of their work" was most important to them, whereas management chose "being efficient." For individual contributors, that sense of pride may diminish when AI is performing the tasks.

"We believe that this difference in mindset will continue to drive mixed views," Vasiliou said. "Additionally, the AI applications we see as most promising include GenAI, although some may also be judging it based on the hype of AIOps."

About the Author(s)

Sean Michael Kerner

Contributor

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He consults to industry and media organizations on technology issues.

https://www.linkedin.com/in/seanmkerner/
