If you're a frequent reader of the Cortex blog, you know that we care deeply about empowering Site Reliability Engineering (SRE) teams to adopt and manage a microservices architecture. In organizations of all sizes and industries, SRE teams are ultimately responsible for keeping systems up and running and for putting in place systems that mitigate risk, automate manual operations, and integrate alerting. Beyond that, successful SRE teams maintain clearly defined criteria for production, ensure developer accountability, and diligently measure success against availability targets (e.g. SLOs).
For organizations with maturing SRE teams, we've found that it's worth asking how those teams might garner influence beyond that function. What SRE principles can we evangelize and adopt across the wider engineering organization? What might other engineers gain by thinking like an SRE?
In this article, we'll provide a brief history of the SRE role and identify a number of key SRE principles that we've found to be impactful across engineering functions.
The ethos of the Site Reliability Engineer (SRE) was first defined in 2003 by Ben Treynor Sloss, a VP of Engineering at Google. As he puts it in this interview,
"SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor."
The root of the SRE function came from the need to build and automate software to solve operational problems. It was in large part a response to the dysfunctional model where a development team might be responsible for writing code, and a separate operations team might be responsible for maintaining that code in production. In this model, developers are incentivized towards development velocity whilst operations teams are incentivized against change. This system not only hinders feature development and innovation, but also fails to optimize for investment in automation that can bring significant gains in the long term.
Since the inception of the role in the early 2000s, Google has published a widely respected SRE book that documents both the SRE mission and best practices. In 2019, LinkedIn ranked the SRE role as the second-most promising job in the U.S.
Given the explosive growth of the SRE role in modern software development teams, it's worth noting the qualities of the role that make it so impactful. And perhaps beyond that — what might it mean to adopt an SRE mindset if you're not an SRE? Might there be an opportunity in evangelizing those qualities across an organization? We think so.
Below, we'll start with a few key SRE principles that we believe apply strongly across functions.
In the context of microservices, service ownership is critical. As we wrote in "How to Drive Ownership in Microservices",
"Service ownership means that there is a clear person or group of people who are ultimately held accountable for the success of each service."
For SRE teams managing dozens of services across multiple applications, a failure to track and assign service ownership can make it extraordinarily difficult to diagnose and address outages. For that reason, they might:
The developer accountability that comes with service ownership, however, can and should apply well beyond the SRE function. Engineers across an organization should always strive to assign clear ownership within their teams, whether that means requiring every GitHub issue and pull request to have an owner or assigning a single person to be held accountable for each step of a product release. Ultimately, engineers in high-ownership environments feel significantly more empowered to solve challenging problems and are much more likely to deliver high-quality output at a predictable rate.
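One way to make an ownership rule like this checkable is a small script that flags any work item without an owner. The sketch below is only an illustration: the `WorkItem` record and `find_unowned` helper are hypothetical names, and the sample data stands in for whatever an issue tracker or release checklist would actually export.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    """Hypothetical record for an issue, pull request, or release step."""
    title: str
    owner: Optional[str] = None


def find_unowned(items: list[WorkItem]) -> list[WorkItem]:
    """Return every item whose owner is missing or blank."""
    return [item for item in items if not (item.owner or "").strip()]


# Illustrative data only; a real check would load items from a tracker export.
items = [
    WorkItem("Fix login timeout", owner="dana"),
    WorkItem("Upgrade TLS config"),
    WorkItem("Cut release branch", owner="  "),
]

for item in find_unowned(items):
    print(f"No owner assigned: {item.title}")
```

A check like this could run on a schedule or before a release so that gaps in ownership surface early rather than during an incident.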
As noted above, SREs are responsible for keeping systems up and running — and building processes to ensure that those systems can handle scale. For example, an SRE might:
Beyond the SRE function, however, the principle across these three responsibilities is simple: create reliable systems that can scale. For engineers outside of the SRE function, adopting this principle can be incredibly impactful.
For example, consider a QA Engineering Lead who's responsible for writing test scenarios intended to produce performance benchmarks for each component of an application. Suppose the QA engineer can hire 10 other engineers onto her team. If she adopted the SRE mindset of creating reliable systems that scale, she might write a test scenario template for each application component, write a script that outputs performance benchmarks when a test scenario is run, and empower her team to grow those test scenarios to incorporate more complexity over time.
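As a rough illustration of the benchmark script described above, the sketch below times a scenario over several runs and reports summary figures. Everything here is an assumption made for the example: `run_benchmark` and `checkout_scenario` are hypothetical names, and a real scenario would exercise an actual application component rather than a toy computation.

```python
import statistics
import time
from typing import Callable


def run_benchmark(name: str, scenario: Callable[[], None], runs: int = 5) -> dict:
    """Run a scenario several times and summarize wall-clock duration in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        scenario()
        durations.append(time.perf_counter() - start)
    return {
        "scenario": name,
        "runs": runs,
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }


def checkout_scenario() -> None:
    """Hypothetical scenario; stands in for a test against a real component."""
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(run_benchmark("checkout", checkout_scenario))
```

Because every scenario is just a callable passed to the same function, new team members can add scenarios without touching the timing logic, which is the kind of scaling the template approach is meant to enable.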
As much as successful SRE teams create systems to maintain reliability, outages are inevitable. When incidents do happen, SREs are wholly responsible for building systems that make it easier to detect, diagnose, and mitigate their impact. To this end, SRE teams might:
While the initiatives above are specific to the SRE function, the principle is once again simple: build systems that make it easy to detect, mitigate, and prevent problems. A Support Engineer, for example, would benefit from enforcing a postmortem on high-priority tickets to better understand root cause and potentially identify a gap in the product. If the Support Engineer uses a tool like Zendesk, she might create a Slack integration that alerts the Customer Success team if a P1 ticket breaches an SLA. Support teams might already be working on a process of this nature, but adopting an SRE mindset can certainly strengthen the momentum behind those processes and cultivate a culture of preparedness.
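The SLA alert described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the API of Zendesk or Slack: the `Ticket` record, the one-hour response window, and the function names are all hypothetical, and the `notify` callable stands in for whatever actually delivers the message (for instance, a function that posts to a chat channel).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

# Assumed SLA for this sketch: a P1 ticket needs a first response within one hour.
P1_RESPONSE_SLA = timedelta(hours=1)


@dataclass
class Ticket:
    """Hypothetical ticket record; not the schema of any real support tool."""
    ticket_id: int
    priority: str
    created_at: datetime
    first_response_at: Optional[datetime] = None


def breached_p1_tickets(tickets: list[Ticket], now: datetime) -> list[Ticket]:
    """Return P1 tickets still awaiting a first response past the SLA window."""
    return [
        t for t in tickets
        if t.priority == "P1"
        and t.first_response_at is None
        and now - t.created_at > P1_RESPONSE_SLA
    ]


def alert_on_breaches(
    tickets: list[Ticket], now: datetime, notify: Callable[[str], None]
) -> int:
    """Send one message per breached ticket and return how many were sent."""
    breached = breached_p1_tickets(tickets, now)
    for t in breached:
        overdue_min = int((now - t.created_at - P1_RESPONSE_SLA).total_seconds() // 60)
        notify(f"P1 ticket #{t.ticket_id} has breached its response SLA by {overdue_min} minutes")
    return len(breached)


# Illustrative data; `print` stands in for a real notification channel.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tickets = [
    Ticket(101, "P1", now - timedelta(minutes=90)),  # unanswered and past the SLA
    Ticket(102, "P1", now - timedelta(minutes=20)),  # still within the SLA
    Ticket(103, "P2", now - timedelta(hours=5)),     # not a P1
]
alert_on_breaches(tickets, now, notify=print)
```

Passing `notify` in as a parameter keeps the breach logic testable on its own and lets the team swap the delivery channel without rewriting the check.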
Here at Cortex, we've helped teams of all sizes take steps towards widely adopting an SRE mindset. Here are a few places to get started.
This list is certainly not exhaustive, but we're confident it's a good place to start.
While the steps above might sound intuitive, we know that putting them into practice is hard and takes both time and commitment. Here at Cortex, however, we've learned that not investing in the SRE mindset risks:
SREs have significantly strengthened software development teams in the last decade, and the SRE mindset is just as powerful. We encourage you to embrace it. If you're looking for help or additional tips, don't hesitate to reach out to us at firstname.lastname@example.org. We'd love to learn from you.