What does your team’s production readiness practice look like for deploying a new service? If you or a dedicated SRE team within your engineering organization has ever green-lit a new service, you have probably used some form of production readiness checklist to ensure certain conditions are met and processes are followed. In this article, we’ll share some helpful approaches for creating a great production readiness checklist that will set your team up for a successful launch.
The ideal production readiness checklist is comprehensive, but flexible — it can and should look different based on the type of service you’re deploying and the impact of the launch. That said, there are certain categories that are essential for any checklist.
Production issues can be extremely costly, so it’s important to make sure you have the right logging and monitoring in place for any new service. By preparing now, you can make sure that you have the data you need to debug failures when they (inevitably) occur — otherwise, issues might go undetected or take too long to fix.
Make sure your team knows exactly what is being recorded across your application and access logs. If there’s a piece of information you might need to debug a failure, start logging it from day one. Although it might seem obvious, it’s also important to document where to find the logs, since the people debugging your service later on might not be the same people who wrote your logging code. Also document key information about the service, such as where the Git repository lives, which language and version it uses, and when it was deployed.
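As a rough illustration of the two habits above, here is a minimal Python sketch that keeps the key facts about a service in one place and tags every log line with the service name. Everything in it is an assumption made for the example: the `ServiceInfo` fields, the `payments-api` service, its repository URL, and the log location are hypothetical, and your own logging stack may look quite different.

```python
import json
import logging
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServiceInfo:
    """Key facts about a service. All values used below are hypothetical."""
    name: str
    repo_url: str
    language: str
    language_version: str
    deployed_at: str   # ISO 8601 timestamp of the deployment
    log_location: str  # where someone debugging later can find the logs


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object tagged with the service name."""

    def __init__(self, service: ServiceInfo):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service.name,
            "level": record.levelname,
            "message": record.getMessage(),
            # Debugging context passed via extra={"context": {...}}.
            "context": getattr(record, "context", {}),
        })


def describe(service: ServiceInfo) -> str:
    """Return the documented service facts as a JSON string."""
    return json.dumps(asdict(service), indent=2)


def build_logger(service: ServiceInfo) -> logging.Logger:
    """Create a logger for the service that writes JSON lines to stderr."""
    logger = logging.getLogger(service.name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    service = ServiceInfo(
        name="payments-api",
        repo_url="https://git.example.com/payments-api",
        language="Python",
        language_version="3.12",
        deployed_at="2024-01-15T09:30:00Z",
        log_location="logs/payments-api/",
    )
    print(describe(service))
    log = build_logger(service)
    log.info("charge failed", extra={"context": {"order_id": "A-17"}})
```

Keeping the service facts next to the logging setup is only one possible design choice. The point is that someone who did not write the service can find both the logs and the basic facts without asking around.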
Before launching, make sure you’ve implemented alerts that notify your team when certain SLA thresholds are exceeded. Also ensure there is appropriate tracing across the other services that might interact with this one.
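To make the idea of an SLA threshold alert concrete, here is a small Python sketch that checks a batch of latency samples against a percentile target. It is an illustration under stated assumptions, not how any particular monitoring tool works: the `LatencySla` type, the nearest-rank percentile method, and the numbers are all hypothetical, and latency is just one of many things an SLA might cover.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LatencySla:
    """A hypothetical latency target, e.g. p99 under 500 ms."""
    percentile: float    # for example 99.0 for p99
    threshold_ms: float  # the alert fires above this value


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct between 0 and 100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(samples: Sequence[float], sla: LatencySla) -> Optional[str]:
    """Return an alert message if the SLA threshold is exceeded, else None."""
    observed = percentile(samples, sla.percentile)
    if observed > sla.threshold_ms:
        return (
            f"p{sla.percentile:g} latency {observed:g} ms "
            f"exceeds threshold {sla.threshold_ms:g} ms"
        )
    return None


if __name__ == "__main__":
    # 98 fast requests and 2 slow ones (made-up numbers).
    recent_latencies = [100.0] * 98 + [900.0, 950.0]
    alert = check_sla(recent_latencies, LatencySla(percentile=99.0, threshold_ms=500.0))
    if alert:
        print(alert)  # in practice this would notify the on-call engineer
```

In a real setup, the monitoring tool you already use would most likely evaluate a rule like this for you. Writing it out mainly helps the team agree on what the threshold is before launch.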
You can get some of the above for free by using automated tools. Your team most likely relies on third-party software to automate your monitoring, on-call rotations, incident management, and more (if you need any tips, check out our guide to SRE tools). If this is the case, getting ready for production means making sure that everyone has access to the right tools and dashboards and knows where to find them. You should make sure your on-call rotation is set up and your teammates know how to use your playbooks for incident management. And you’ll want to make sure that a team is responsible for regularly checking that your tools are still configured and working as expected throughout the lifetime of the service.
There are some aspects of your production readiness checklist that aren’t easily automated. For example, you should make sure that whoever wrote new APIs also wrote good documentation and versioned the APIs properly. You will also want to ensure that whenever your service makes a call, it logs the status code it gets back. And have you done load testing and capacity planning?
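For the status code point, one possible pattern is a thin wrapper around every outbound call, sketched below in Python. It assumes that "making a call" means a request to another service over HTTP. The `call_and_log_status` helper and the `send` callable are hypothetical names for this example, and `send` stands in for whatever HTTP client your team actually uses.

```python
import logging
from typing import Callable

logger = logging.getLogger("outbound_calls")


def call_and_log_status(name: str, send: Callable[[], int]) -> int:
    """Run send(), which performs the request and returns its HTTP status code.

    The status code is always logged: at WARNING level for 4xx and 5xx
    responses, at INFO level otherwise.
    """
    try:
        status = send()
    except Exception:
        # No status code was received, so record the failure itself.
        logger.exception("call=%s failed before a status code was received", name)
        raise
    level = logging.WARNING if status >= 400 else logging.INFO
    logger.log(level, "call=%s status=%d", name, status)
    return status


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # A stand-in for a real request; a real `send` would call your HTTP client.
    call_and_log_status("billing-service", lambda: 503)
```

Routing calls through one helper like this makes it easier for a reviewer to confirm the practice is followed, since there is a single place to look.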
For a lower-impact service, some of these questions might default to the honor system. But most of the time, you will need to have a conversation between the engineers who developed the service and a manager or SRE to verify that certain best practices were followed.
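One way to picture a checklist that mixes automated and manual items, and that flexes with the impact of the launch, is as plain data. The Python sketch below is only an illustration: the item names, the `high_impact_only` flag, and the choice of which items apply only to high-impact services are assumptions made for the example, not a recommended standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Set


class Verification(Enum):
    AUTOMATED = "automated"  # a tool can check this
    MANUAL = "manual"        # needs a conversation or a sign-off


@dataclass(frozen=True)
class ChecklistItem:
    name: str
    verification: Verification
    high_impact_only: bool = False  # skipped for lower-impact services


# A hypothetical checklist; which items are high-impact-only is an assumption.
CHECKLIST = [
    ChecklistItem("on-call rotation configured", Verification.AUTOMATED),
    ChecklistItem("alerts defined for SLA thresholds", Verification.AUTOMATED),
    ChecklistItem("API documentation written", Verification.MANUAL),
    ChecklistItem("load testing completed", Verification.MANUAL, high_impact_only=True),
    ChecklistItem("capacity plan reviewed", Verification.MANUAL, high_impact_only=True),
]


def items_for(high_impact: bool, checklist: Sequence[ChecklistItem] = CHECKLIST) -> List[ChecklistItem]:
    """Select the items that apply to a service of the given impact."""
    return [item for item in checklist if high_impact or not item.high_impact_only]


def needs_review(items: Sequence[ChecklistItem], signed_off: Set[str]) -> List[str]:
    """Names of manual items that nobody has signed off on yet."""
    return [
        item.name
        for item in items
        if item.verification is Verification.MANUAL and item.name not in signed_off
    ]


if __name__ == "__main__":
    for name in needs_review(items_for(high_impact=True), {"API documentation written"}):
        print("still needs a review conversation:", name)
```

Separating the manual items this way gives the review conversation a concrete agenda: the automated checks can be read off a dashboard, and the meeting only has to cover what is left.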
SRE teams are more successful when they reduce tribal knowledge, and it’s important that for every new service you push into production, you have great communication from the beginning. This means identifying a clear owner — a single team or engineer who will take accountability for the service. It’s also important to document the way your team will discuss issues, like a dedicated Slack channel and a direct escalation path.
Defining a production readiness process is one of the most important things you can do for the health of your products, but getting it to stick as part of your team’s culture can be hard. No one wants to feel like a nag, and yet the honor system doesn’t always work in practice. At the same time, production readiness checklists are usually unwieldy and a hassle to manage. With all those questions and answers to document, teams typically end up using Excel or Google Sheets to store their lists. But it’s hard to standardize and communicate across many different spreadsheets that are floating around. And someone has to maintain those spreadsheets and keep them up to date.
At Cortex, we developed Scorecards to replace the production readiness spreadsheet and help your team understand the health of your services at a glance. You can set standardized requirements for each service and adjust them as needed (for example, maybe for 10% of your services, you enforce extra security vulnerability scans). We integrate with your third-party automation tools so that you get most of your scorecard filled in automatically, and we let you create and answer custom questions for everything else. If something changes out from under you, like your on-call rotation disappearing from PagerDuty, you can see it right in the Scorecard.
We built Scorecards because we wanted to make production readiness more objective and clear. As engineers ourselves who have worked with countless third-party tools, we know how important it is to create a single source of truth that your whole team can refer to. Scorecards help create a culture of ownership and reliability among service owners within the team!
The details of your production readiness checklist are important, but the details don’t matter if the overall process is broken. We encourage you to take a step back and think about how you can bring a culture of blameless communication to your production readiness practice. If you’re interested in finding out more about Cortex, sign up for a free account here.