SREs are responsible for evaluating tools that can reduce toil on their teams and make their applications more reliable for end users. There are a lot of SRE tools out there, and it can be hard to know which ones matter most. We’ve compiled this guide to highlight the key categories of SRE tools and help you find the right fit for your team.
Monitoring tools are used to generate valuable metrics and insights about an application and help SREs do everything from creating benchmarks to debugging outages. Luckily, there are monitoring tools for practically anything you might want to measure relating to your application. These tools usually cover one or more of the following areas:
Models the response times perceived by end users of your application and compares this against performance benchmarks to detect latency and outages.
Examines all the incoming and outgoing traffic routed through your network to help with load balancing, debug client/server issues, and catch network attacks such as denial-of-service (DoS) attacks early, before they cause damage.
Looks at the consumption rates and SLOs for the components of your application, like the CPU load for your Kubernetes clusters, to help with resource management.
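To make the first of these areas concrete, here is a minimal sketch of comparing observed response times against a performance benchmark. It is an illustration only, not how any particular monitoring product works: the function names, the sample data, and the 300 ms benchmark are all assumptions made up for this example, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of response times in milliseconds."""
    ordered = sorted(samples_ms)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def breaches_benchmark(samples_ms, benchmark_ms, pct=95):
    """True when the chosen percentile of observed latency exceeds the benchmark."""
    return percentile(samples_ms, pct) > benchmark_ms

# Hypothetical response times collected over one monitoring window.
window = [120, 135, 110, 140, 125, 130, 900, 115, 128, 122]

if breaches_benchmark(window, benchmark_ms=300):
    print("p95 latency is above the benchmark; raise an alert")
```

Checking a high percentile rather than the average keeps a handful of very slow requests from being hidden by many fast ones, which is usually closer to what end users actually experience.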
No matter what type of monitoring tool you’re looking at, you’ll want to investigate the following important features:
Here are some of the monitoring tools that we hear about often from SREs:
SREs and engineers are often responsible for being on call both during and outside working hours, prepared to react immediately to any issue that might threaten the system’s health. Being on call can be stressful, and you want to be careful not to burn out your team. Fortunately, there are many tools SREs can use to reduce the burden and make being on call a bit less painful.
While most on-call tools also help your team manage the incident itself, we’ll be covering incident management in the next section. Here are some on-call rotation specific features to help you prioritize your assessment:
Your tool should help you distribute the on-call duty equally and fairly across your team, with flexibility in case someone needs to trade on-call rotations at the last minute.
Sometimes an issue touches multiple components, and several on-call engineers need to be involved in the incident response. For that reason, it’s important that your tool provides a centralized way to view the on-call calendars across your organization.
This one is pretty obvious, but it’s absolutely essential. Your on-call tool needs to integrate with your monitoring tools to deliver alerts, and it must provide a payload with enough context so that the on-call engineer can address the issue. Your tool should also give you rules-based configurations for alert routing and controls to combat spammy alerts, which lead to burnout.
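As a rough illustration of rules-based routing combined with a control for spammy alerts, the sketch below sends each alert to the first matching destination and suppresses repeats of the same alert inside a time window. Everything here is hypothetical: the `Alert` and `Router` classes, the destination names, and the 300-second window are assumptions for the example, not the API of PagerDuty, Opsgenie, or any other tool.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # e.g. "critical" or "warning"
    summary: str

@dataclass
class Router:
    # Ordered (predicate, destination) pairs; the first matching rule wins.
    rules: list
    default: str = "ops-backlog"
    suppress_seconds: int = 300
    _last_sent: dict = field(default_factory=dict)

    def route(self, alert, now):
        """Return a destination for the alert, or None if it is a suppressed repeat."""
        key = (alert.service, alert.summary)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.suppress_seconds:
            return None  # same alert fired recently; do not notify again
        self._last_sent[key] = now
        for matches, destination in self.rules:
            if matches(alert):
                return destination
        return self.default

router = Router(rules=[
    (lambda a: a.severity == "critical", "page-primary-on-call"),
    (lambda a: a.service == "billing", "billing-team-channel"),
])
alert = Alert("checkout", "critical", "error rate above 5%")
print(router.route(alert, now=0))   # first occurrence is routed to the pager
print(router.route(alert, now=60))  # repeat inside the window is suppressed
```

Passing `now` in explicitly, rather than reading the clock inside `route`, is a design choice that keeps the suppression logic easy to test.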
Popular on-call management tools include PagerDuty, Splunk On-Call (formerly VictorOps), and Atlassian’s Opsgenie.
Inevitably, failures will occur in your system, and the person on call is going to need powerful tools to help fix issues in the moment and make sure that they don’t happen again. According to Google’s SRE Book, managing an incident successfully comes down to three things: (1) clear escalation paths, (2) well-defined response procedures, and (3) a blameless post-mortem culture. SREs can find tools that codify these practices and help their teams communicate better and resolve incidents faster.
All of the on-call management tools previously mentioned also feature some form of incident management. Additional dedicated tools for this purpose include Blameless, ServiceNow, and Netflix’s open source crisis management tool, Dispatch.
A core function of the SRE role is to automate away the mindless busywork of software development. Automating your configurations reduces toil and ensures that the same steps are repeated every time resources are provisioned, managed, and destroyed.
Popular tools for automating your infrastructure configurations are Terraform and Ansible.
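The idea of always repeating the same steps can be sketched as a small reconcile loop: declare the desired state, compare it with what actually exists, and apply only the difference. This is a simplified illustration of the general declarative pattern, not the implementation or API of Terraform or Ansible; the `plan` and `apply` functions and the resource names are assumptions invented for this example.

```python
def plan(desired, actual):
    """Compare desired and actual resources; return the actions needed to converge."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("destroy", name))
    return actions

def apply(desired, actual):
    """Apply the plan to a copy of the actual state and return the new state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "destroy":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state

# Hypothetical resources: what we want versus what is currently running.
desired = {"web": {"replicas": 3}, "cache": {"replicas": 1}}
actual = {"web": {"replicas": 2}, "legacy": {"replicas": 1}}

print(plan(desired, actual))
state = apply(desired, actual)
print(plan(desired, state))  # nothing left to do once the state has converged
```

Because a second run against the converged state produces an empty plan, the process is safe to repeat, which is what removes the manual, error-prone steps.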
To make sure that incidents are avoided in the first place, SRE teams should look for tools that help them enforce high-level policies and best practices. A key tenet of microservice catalog tools should be governance over your service-oriented architecture. Governance refers to the rules and processes the team follows, as well as who is responsible for reliability across different services and applications. This helps avoid reliance on tribal knowledge, and it is especially important when teams work remotely and manage many different microservices.
The most common governance tool we’ve seen is unfortunately a giant spreadsheet (or, worse, several spreadsheets!). Since this is the opposite of the SRE philosophy, we’ve created Cortex to fill the gap. Our microservice catalog makes it easy to see your services and owners in one place, and we foster accountability and show you how to improve using a gamified scorecard.
With the right tools, your SRE team will be able to spend their energies improving your product’s reliability and performance instead of dealing with toil and overhead. We hope this article has been a helpful overview of SRE areas of focus and popular tools. If you’re interested in giving Cortex a try, request a demo here.