2023 SRE Report Identifies Site Reliability Engineering Best Practices

The SRE report doesn't see AIOps as the solution for ITOps — and tool sprawl apparently isn't a terrible thing after all.

The practice of site reliability engineering (SRE) has become increasingly central to IT operations in recent years.

SRE is all about having the right tools and processes in place to ensure the reliability and resilience of the applications and services that IT operations deliver and support. According to the 2023 SRE Report put together by SRE vendors Catchpoint and Blameless, there are a lot of different tools that IT operations teams can use, and that's not a bad thing. The report found that 54% of organizations use three or more tools to get telemetry from their operations, including application, network, and infrastructure resources.

Also of note, the SRE report found that about 46% of organizations said they get little or no value from AIOps tools.

"The low value received from AIOps was not a surprise to me, but it may be a surprise to some readers," Leo Vasiliou, director of product marketing at Catchpoint, told ITPro Today. "We did caution to not ignore underlying AIOps capabilities, but we did caution to ignore the hype as people consider those capabilities as part of larger observability implementations."

The Challenge of SRE Tool Sprawl

Across multiple segments of IT operations, tool sprawl is often identified as a primary challenge. For example, the recent GitLab DevSecOps survey and one from ESG and Mezmo both identified tool sprawl as a challenge, specifically for DevSecOps.

Practitioners need different tools to accomplish different tasks at different points in time, and as long as the value received from the tools in the stack is greater than their cost, there is no tool sprawl problem, Vasiliou said.

In the report's conclusions, Steve McGhee, reliability advocate, SRE, for Google Cloud, wrote that when an individual goes to a mechanic, they don't look for the place with the fewest tools on the wall.

McGhee suggests that SREs should not be forced to rationalize every tool in an attempt to prevent overlap.

"When it comes to skilled labor, or operations perhaps, you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they think they might need in the future," McGhee wrote.

Identifying the Top SRE Challenges

Vasiliou said the biggest challenges identified empirically in this year's report are:

  • finding talent
  • complex architectures
  • realizing business value
  • lack of end-to-end visibility
  • alignment or prioritization

Addressing these top issues is largely a matter of managing bias and predisposition, he said.

"Too many other research papers unidirectionally say, SREs/IT need to add business value," Vasiliou said. "Saying SREs/IT need to add business value is nefarious nothingness and does not help SREs/IT know which speeds and feeds are important."

On the other hand, he noted that SREs also need to know that geeking out over speeds and feeds does not help executives understand why they are valuable. Bridging this gap requires new or better conversations around capabilities, which form an important middle ground between the two ends of the spectrum.

"The only way to have conversations around required capabilities will be if involved parties let go of their bias; otherwise, the IT to business gap will always remain," he said.

About the author

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He consults to industry and media organizations on technology issues.