We see teams fall into a few common traps with SLOs, SLIs, and SLAs, particularly when they’re just starting out. In this article, we’ll first define these three acronyms (it’s easy to get confused!) and show you how to avoid the mistakes other teams make.
SLOs are your service-level objectives. When your team decides internally that you have a goal of 99.99% uptime for a service, you’re setting an SLO. And to know that you’re hitting that SLO, you need to measure your uptime. When you attach metrics to an SLO, these stats get acronymized (a real word, apparently) and become your SLIs, or service-level indicators. For example, your site’s response times and error rates could be SLIs if they tell you whether you’re hitting your SLOs (of super fast response times and super low error rates, presumably).
Finally, your SLAs are the service-level agreements that you’ve made with customers and are legally bound to uphold. For every SLA, you’ll want to have an SLO that you track internally. On the other hand, your SLOs don’t need to correspond to SLAs—they can just be ways for your team to track the health of your product.
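To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. The `Slo` class, the function names, the service name, and the request counts are all hypothetical and chosen only for illustration; treating a window with no traffic as "meeting the objective" is also an assumption you may want to change.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """An internal target for one indicator (hypothetical structure)."""
    name: str
    target: float  # e.g. 0.9999 for a 99.99% objective


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return successful_requests / total_requests


def meets_slo(sli: float, slo: Slo) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo.target


# Made-up numbers: 50 failed requests out of one million.
uptime_slo = Slo(name="checkout availability", target=0.9999)
sli = availability_sli(successful_requests=999_950, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli, uptime_slo)}")
```

The SLI is the measurement and the SLO is the threshold you compare it against. An SLA would sit outside this code entirely, as the promise you have made to a customer.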
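To sum it up: SLOs are the goals your team sets internally, SLIs are the metrics that tell you whether you’re meeting those goals, and SLAs are the commitments you’ve made to customers.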
We see teams make a few common mistakes with SLOs, SLIs, and SLAs, particularly when they’re just starting out.
There are a lot of different monitoring tools out there, and many of them can help automatically set SLOs for your services. It can be tempting to monitor all the SLOs. But if you have an SLO, you’re going to want to have an alert for when it fails, and if you have an alert for when it fails, you’re going to be nagging your team to fix it, and if you’re nagging your team…it’s like the engineering nightmare version of If You Give a Mouse a Cookie.
So it’s important to proceed with caution when it comes to defining your SLOs. Before you get tempted to set an SLO around, say, your server’s data processing throughput, you should ask yourself, do I actually need this SLO? Does it map to an SLA—something that we promised to our customers? If not, does this SLO correspond to a core business objective? Does this represent whether our product is working, or are we just trying to check a box?
In our experience, a lot of engineers have learned somewhere along the line (maybe in previous jobs at Google or Facebook) that you can’t launch a service without tracking certain SLOs that were designed for and popularized by large companies. So they end up focusing on SLOs that aren’t actually important for the stage that their service is at, and don’t make an impact for their end users. And these SLOs end up diverting time and energy from the team’s actual goals. Your monitoring should serve your team—not the other way around.
On a related note: beware of the pitfall of over-optimizing your SLOs for minimal returns. According to Google, “users cannot distinguish between a 100-millisecond (ms) and a 300-ms refresh and might accept any point between 300 ms and 1000 ms” (Adopting SLOs) and “It is generally recognized that achieving the next nine in availability costs you ten times as much as the preceding one” (Defining SLOs).
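One way to see why each extra nine is so demanding is to work out how much downtime it leaves you. The sketch below is plain arithmetic, assuming a 30-day measurement window; the function name and the choice of window are illustrative assumptions, not something prescribed by the sources quoted above.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the assumed window


def allowed_downtime_minutes(availability_target: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime you can spend in the window while still meeting the target."""
    return (1.0 - availability_target) * window_minutes


# Each additional nine shrinks the allowed downtime by a factor of ten.
for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} availability -> {minutes:.2f} minutes per 30 days")
```

Going from 99.9% to 99.99% takes you from roughly 43 minutes of slack per month to about 4, which is worth weighing against what your users would actually notice.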
Like your SLOs, your SLIs should correspond to what your product is trying to achieve at a high level. We commonly see teams tracking SLIs like uptime and error rate across all their services because it seems like a good idea. But SLIs aren’t one-size-fits-all metrics. Especially when you’re a small company, you should view your SLIs as an opportunity to identify what really matters for your team and stakeholders.
It helps to start by considering the core user journeys that you want to support in your product. For example, you might have an essential feature where users upload files to import data into your product. If users can’t upload files, or if the experience is very slow, this key user journey fails. Therefore, it would make sense to track SLIs around latency and failure rates for this particular service—but not for an ordinary asynchronous background process. Along the same lines, you might decide that if the health of your top 5 endpoints looks good, you can safely ignore random 500s elsewhere.
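As a rough illustration of scoping SLIs to a key user journey, the sketch below computes a failure rate and a latency percentile for a hypothetical `/upload` endpoint only, and ignores a background job. The `Request` record, the endpoint names, the sample log, and the choice to count only 5xx responses as failures are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int
    latency_ms: float


def failure_rate(requests: list[Request]) -> float:
    """Fraction of requests that returned a server error (5xx)."""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if r.status >= 500)
    return failures / len(requests)


def latency_percentile(requests: list[Request], percentile: float) -> float:
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        return 0.0
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical: only the upload journey is considered core.
CORE_ENDPOINTS = {"/upload"}

log = [
    Request("/upload", 200, 120.0),
    Request("/upload", 200, 340.0),
    Request("/upload", 500, 2100.0),
    Request("/upload", 200, 180.0),
    Request("/internal/reindex", 500, 50.0),  # background job, not tracked
]

core = [r for r in log if r.endpoint in CORE_ENDPOINTS]
print(f"upload failure rate: {failure_rate(core):.1%}")
print(f"upload p95 latency: {latency_percentile(core, 95):.0f} ms")
```

The filter on `CORE_ENDPOINTS` is the important part: the error from the background job never enters the SLI, so it cannot trigger an alert for something users do not experience.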
When you focus on metrics that directly relate to user engagement, you might find you can afford to be more flexible on your reporting. If you’re a five-person team with a small user base, maybe you don’t need to set up alerting yet—reporting on certain metrics once a week might be enough. This frees up your team to focus on all the other aspects of business, so that if you’re lucky, you’ll have a good reason to worry a lot more about SLIs later on.
One of the problems with SLOs is ownership and responsibility. Usually an engineering manager, or a site reliability engineer if there’s an SRE team, will be responsible for managing the SLOs and setting up alerts for your on-call rotation. But it’s likely that the on-call engineer won’t know anything about the service the SLO corresponds to or who is responsible for it.
That’s why at Cortex we make services and their owners first-class citizens and support SLO integrations out of the box. You no longer have to worry about having 50 SLOs where you don’t know which services they map to, and you can always find the engineer or team who owns the service you are inspecting. And you can drill down to identify other services that might be affected to see if there is a widespread issue at play. We’ve also developed Scorecards for your services, which put SLOs in the context of your overall service quality and reliability, giving you a broader view of the health of your services. If you’re failing your SLOs as a startup, that’s one thing, but if you’re failing your SLOs and your code coverage is really low and your development quality metrics are low, then it might be time to really focus on that service.
Finally, by putting SLOs front and center in our service catalog, engineers can better educate themselves on the health of their services. The service owner can immediately see how their SLOs are looking instead of needing the SRE to come and tell them when something goes wrong. By surfacing this information you can avoid tribalism on your team and create a culture where there is shared responsibility instead of blame and resentment.