Part of the key to practicing site reliability engineering (SRE) is adopting a mindset focused on maximizing reliability and performance across all stages of the application delivery lifecycle.

The other part of being a site reliability engineer is having tools that help you operationalize that mindset. And while many of the tools that SREs use for planning, measuring, and improving reliability operations are ones that other types of IT pros (like DevOps engineers) might also use, SREs tend to deploy their tools in different ways.

Let's explore seven main types of SRE tools and the role that each plays in optimizing reliability.

1. Infrastructure-as-code: Infrastructure-as-code (IaC) platforms, like Terraform and CloudFormation, are popular among IT practitioners of all types because they help automate the provisioning of resources. For SREs in particular, however, IaC is critical for improving reliability. The main reason is that an IaC-based approach to provisioning enables consistent configurations across environments. In turn, consistency breeds reliability because it reduces the risk of accidental misconfigurations.
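
To make the consistency point concrete, here is a minimal Python sketch of the idea behind IaC, not of Terraform or CloudFormation themselves. Every name in it (`BASE_SPEC`, `OVERRIDES`, `render`, `find_drift`) and every setting is a hypothetical chosen for illustration. Each environment is rendered from one shared definition, so the only differences between environments are the ones declared on purpose, and any other difference shows up as drift.

```python
from copy import deepcopy

# Hypothetical base definition shared by every environment.
BASE_SPEC = {
    "instance_type": "small",
    "replicas": 2,
    "tls_enabled": True,
    "log_retention_days": 30,
}

# Only intentional, reviewed differences between environments live here.
OVERRIDES = {
    "staging": {},
    "production": {"instance_type": "large", "replicas": 6},
}


def render(environment):
    """Build the declared configuration for one environment."""
    spec = deepcopy(BASE_SPEC)
    spec.update(OVERRIDES[environment])
    return spec


def find_drift(declared, actual):
    """Return {setting: (declared, actual)} for every setting that differs."""
    keys = sorted(declared.keys() | actual.keys())
    return {
        key: (declared.get(key), actual.get(key))
        for key in keys
        if declared.get(key) != actual.get(key)
    }


if __name__ == "__main__":
    declared = render("staging")
    # Pretend someone disabled TLS by hand on the running system.
    actual = dict(declared, tls_enabled=False)
    print(find_drift(declared, actual))
```

In this sketch, a manual change to a running system is detectable because it no longer matches the declared definition, which is the property that makes provisioning from code useful for reliability work.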

2. Kubernetes: Kubernetes isn't a reliability tool per se. But if you're an SRE working in a business that deploys microservices-based applications, understanding how to use Kubernetes is almost certainly a must. When properly configured and administered, Kubernetes keeps applications running reliably across clusters of servers, reducing the risk of reliability problems that SREs need to respond to.

3. Chaos engineering tools: Chaos engineering, which means experimenting with systems to find faults or flaws that might otherwise go unnoticed until they cause a disruption to production operations, can help SREs be proactive about reliability issues. That's why chaos engineering tools, like Gremlin and Chaos Monkey, should be part of every SRE's toolchain.
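
As a rough illustration of what a chaos experiment does, the Python sketch below injects failures into a function on purpose and measures how often callers still succeed. It is not how Gremlin or Chaos Monkey work internally; `with_chaos`, `call_with_retries`, `fetch_price`, and `run_experiment` are hypothetical names, and the retry logic stands in for whatever resilience mechanism is being tested.

```python
import random


class InjectedFault(Exception):
    """Raised when the experiment deliberately fails a call."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times (attempts must be at least 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


def fetch_price():
    """Hypothetical dependency that the experiment disrupts."""
    return 42


def run_experiment(failure_rate, attempts, trials, seed=0):
    """Return the fraction of trials that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(fetch_price, failure_rate, rng)
    successes = 0
    for _ in range(trials):
        try:
            call_with_retries(flaky, attempts)
            successes += 1
        except InjectedFault:
            pass
    return successes / trials


if __name__ == "__main__":
    print("no retries:", run_experiment(0.5, attempts=1, trials=200))
    print("five tries:", run_experiment(0.5, attempts=5, trials=200))
```

Comparing the two printed success rates shows whether the retry policy actually absorbs the injected faults, which is the kind of question a chaos experiment is meant to answer before a real outage does.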

4. Observability platforms: Observing systems to detect, investigate, and fix reliability issues is at the core of what SREs do. That's why every SRE needs to master observability platforms, which automate the process of collecting, analyzing, and reporting on the data, such as logs, metrics, and traces, that applications and infrastructure generate.
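
The collect, analyze, and report cycle can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the event format (a `service` name plus an HTTP-style `status` code), the 10% threshold, and the function names `error_rates` and `services_over_threshold` are all hypothetical, and a real observability platform does far more.

```python
from collections import defaultdict


def error_rates(events):
    """Return {service: fraction of events with a 5xx status}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event["service"]] += 1
        if event["status"] >= 500:
            errors[event["service"]] += 1
    return {service: errors[service] / totals[service] for service in totals}


def services_over_threshold(events, threshold):
    """Report the services whose error rate exceeds the threshold."""
    rates = error_rates(events)
    return sorted(service for service, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Hypothetical collected events; in practice these come from logs or metrics.
    events = [
        {"service": "checkout", "status": 200},
        {"service": "checkout", "status": 503},
        {"service": "checkout", "status": 200},
        {"service": "checkout", "status": 200},
        {"service": "search", "status": 200},
        {"service": "search", "status": 404},
    ]
    print(error_rates(events))
    print(services_over_threshold(events, threshold=0.1))
```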

5. Source code management tools: SREs may not be primarily responsible for writing source code (that job falls to developers), but they may want to embrace practices such as GitOps to help standardize application deployment and management operations. That's why SREs, and not just developers, should learn how to use source code management tools, like Git, or platforms, like GitHub.

6. Incident response tools: When something fails despite SREs' best efforts, incident response begins. And while you can orchestrate incident response manually, incident response platforms, which automate tasks like assigning stakeholders to different tasks and managing communications, make the process faster and smoother.

7. Postmortem tools: After resolving an incident, SREs are often charged with performing a so-called postmortem, which means identifying which failure on the part of the organization allowed the incident to happen and how the team will prevent it from recurring. The market for SRE postmortem tools remains underdeveloped, but there are a few solutions out there that aim to streamline the postmortem process, like the aptly named Morgue.

About the author

Christopher Tozzi is a technology analyst with subject matter expertise in cloud computing, application development, open source software, virtualization, containers and more. He also lectures at a major university in the Albany, New York, area. His book, “For Fun and Profit: A History of the Free and Open Source Software Revolution,” was published by MIT Press.


7 Essential SRE Tools for Optimizing Reliability
