If you work in IT operations, you're likely familiar with site reliability engineering, or SRE, a trendy role that some IT organizations are now adding to their ranks.
But you may be wondering: How do you know when your team needs a site reliability engineer? How large do IT operations teams need to be to benefit from SREs? Which types of tasks or technological challenges require help from an SRE, and which can be left to traditional IT engineers?
Those are great questions without clear answers. But let's take a stab at them by exploring when it makes sense to add an SRE (or several) to an ITOps team.
What Is an SRE, and Why Should ITOps Care?
Site reliability engineers are engineers who specialize in making systems reliable and optimizing performance. The role originated at Google in the early 2000s, but only within the past few years has site reliability engineering started to become a common role at organizations far and wide.
Some ITOps engineers might contend that SREs don't actually do anything that ITOps teams can't do on their own. After all, don't IT engineers also specialize in reliability and performance?
The answer is that they do, but there are important differences between ITOps and SRE. The biggest include:
- Job scope: SREs specialize in reliability alone, but reliability is just one of many responsibilities that fall to IT engineers. Others include application deployment, ticketing management, end-user support, and beyond.
- Tools and approach: SREs leverage tools and concepts rooted in software engineering, such as infrastructure as code (IaC), to manage reliability. Traditionally, at least, IT engineers relied on tools and techniques that were distinct from those of software engineers.
(ITOps engineers are also paid significantly less than SREs on average. That doesn't necessarily mean the two roles are functionally different, of course, although it could be a source of tension between ITOps teams and SREs.)
Thus, although there is certainly overlap between SREs and ITOps engineers, SREs can complement ITOps by ensuring deep dedication to reliability, while also contributing new techniques and tools for optimizing reliability.
Signs Your ITOps Team Needs SREs
The remaining question is when ITOps teams need SREs, and when they can manage reliability well enough on their own, without the special focus or techniques of SREs.
There's a good chance that your ITOps team has crossed the line from self-sufficiency into needing an SRE if any of the following is true:
- You're consistently failing to meet service-level agreements, or SLAs, because you can't manage reliability effectively on your own.
- You struggle to know which SLAs to set in the first place because you're unsure which levels of reliability you can realistically promise to your users.
- You can't measure how well you are meeting SLAs because you don't know what to measure for that purpose, or you don't know how to collect that data.
- You are deploying complex new technologies, like containers and Kubernetes, which pose novel reliability challenges that SREs can help solve.
- Reliability currently feels like an afterthought within your IT operations workflows, rather than a primary focus. Even if you're not yet falling short of reliability goals, reliability shouldn't be something you ignore until it becomes a problem. It should be a first-order priority, and SREs can help make it so.
This list could be summarized as follows: "If you aren't sure how to define, measure, or achieve reliability goals, especially in the context of complex technology, you could probably benefit from an SRE (or two or three)."
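To make the "define and measure" part concrete, here is a minimal Python sketch of checking one reliability target against collected data. It assumes availability is measured as the fraction of successful requests over a reporting period, which is only one of several possible definitions. The names `SlaReport` and `sla_report` are hypothetical, and the "budget remaining" figure borrows the error-budget idea common in SRE practice rather than anything prescribed in this article.

```python
from dataclasses import dataclass


@dataclass
class SlaReport:
    availability: float      # fraction of requests that succeeded (0.0 to 1.0)
    target: float            # promised fraction, e.g. 0.999 for "three nines"
    met: bool                # True if availability is at or above the target
    budget_remaining: float  # share of the allowed failures not yet used up


def sla_report(total_requests: int, failed_requests: int, target: float) -> SlaReport:
    """Compare measured request-based availability against an SLA target.

    Assumption: the request counts come from whatever monitoring data
    the team already collects for the reporting period.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    availability = (total_requests - failed_requests) / total_requests
    # Number of failures the target tolerates over this period.
    allowed_failures = (1.0 - target) * total_requests
    # 1.0 means no budget used; a negative value means the target was missed.
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return SlaReport(availability, target, availability >= target, budget_remaining)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 failures, 99.9% target.
    print(sla_report(1_000_000, 400, 0.999))
```

A team that cannot fill in the three inputs to this function (what counts as a request, what counts as a failure, and what target was promised) is in the situation the list above describes.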
To be sure, not all IT operations teams need SREs, no matter how trendy the role is. If your IT organization is setting and achieving reliability goals well enough on its own, or if the types of technology it manages are relatively simple to support, SREs may be overkill. Plus, the fact that a single SRE costs roughly as much as 1.5 IT engineers means that hiring SREs could deprive organizations of standard IT engineering resources, which is not a good thing unless the SREs truly deliver unique benefits.
But for IT teams that do constantly struggle with reliability, SREs may be a great investment.
About the author
Christopher Tozzi is a technology analyst with subject matter expertise in cloud computing, application development, open source software, virtualization, containers and more. He also lectures at a major university in the Albany, New York, area. His book, “For Fun and Profit: A History of the Free and Open Source Software Revolution,” was published by MIT Press.