What startup hasn’t dreamed about going viral and seeing their usage skyrocket overnight? However, when companies are small, teams tend to focus all their energy on improving their product — preparing for a sudden surge of usage is mostly an afterthought. But beware: when success comes knocking, sometimes it brings a sledgehammer and takes down your entire site.
When your site goes viral, you probably want to stay up late celebrating, not frantically trying to revive your crashed app. That’s where load testing comes in. It’s a great thing to be thinking about while your organization is still in the early stages. Investing in your strategy now will only continue to pay off as you grow and scale.
Every engineer knows that load testing is important. Yet, here’s what load testing often looks like at a startup: It’s one engineer’s personal project, where they run scripts on their local machine to test out some theories on what might be slowing the site down. No surprise, then, that teams are often caught unprepared when they finally get that coveted retweet.
If you’re interested in improving your load testing, there are three things you can focus on that most startups simply don’t do:

1. Run production-based load tests, not just isolated bottleneck hunts.
2. Assign clear ownership of load testing.
3. Build a culture that acts on the results.

Let’s dig deeper into each of these points.
There are two kinds of load testing. The first is about identifying basic bottlenecks. For example, is your application I/O-bound? Is it CPU-bound? Which microservices fail when you hammer them with concurrent requests? This type of load testing is good for testing out a hunch based on where you’ve seen latency on your site. But because everything is tested in isolation, you don’t get the full picture of how those services perform in production.
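This first kind of test can be as simple as a script that hammers one endpoint with concurrent requests and reports latency percentiles. Here is a minimal, self-contained sketch in Python: the handler, its 10 ms delay, and the worker and request counts are all placeholders standing in for a real service, not part of any particular tool.

```python
import time
import statistics
import threading
import http.server
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical local endpoint standing in for a real service.
class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.01)  # simulate 10 ms of server-side work
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

def timed_request(_):
    start = time.perf_counter()
    urlopen(url).read()
    return time.perf_counter() - start

# Hammer the endpoint with concurrent requests and collect latencies.
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = list(pool.map(timed_request, range(200)))

print(f"median: {statistics.median(latencies) * 1000:.1f} ms")
print(f"p95:    {sorted(latencies)[int(len(latencies) * 0.95)] * 1000:.1f} ms")
server.shutdown()
```

A script like this can confirm a hunch quickly, but notice what it can’t tell you: nothing here captures dependencies between services or real production traffic patterns.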
That’s why there’s a need for a second type of load testing that’s production-based. This means testing your entire setup in an environment that resembles production as closely as possible. Here, you’re interested in how the whole system takes on load: What are the dependencies and limitations between upstream and downstream services? How do your SLIs change as you scale up the simulated traffic?
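The question “how do your SLIs change as you scale up traffic?” can be made concrete by ramping load in steps and checking an SLI, such as p99 latency, against its objective at each step. The sketch below simulates this with a toy latency model (the capacity knee, the degradation factor, and the 200 ms objective are all invented for illustration):

```python
import random

random.seed(7)

def percentile(samples, q):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(len(ordered) * q))]

# Hypothetical latency model: latency degrades past a capacity knee.
def simulated_latency_ms(rps, capacity_rps=500):
    base = random.gauss(40, 5)
    overload = max(0.0, rps - capacity_rps) * 0.5
    return max(1.0, base + overload)

# Watch p99 latency (an SLI) as simulated traffic ramps up.
for rps in (100, 250, 500, 750, 1000):
    samples = [simulated_latency_ms(rps) for _ in range(1000)]
    p99 = percentile(samples, 0.99)
    print(f"{rps:>5} rps  p99={p99:6.1f} ms  "
          f"{'OK' if p99 < 200 else 'SLO BREACH'}")
```

In a real production-based test, the samples would come from your monitoring stack rather than a model, but the ramp-and-check loop is the same: the step where the SLI crosses its objective tells you roughly where the system’s capacity lies.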
The right SRE tools can help you perform production-based load testing. But there’s another challenge: overcoming the cultural and organizational obstacles that prevent you from taking action on your test results.
While many software engineering functions have a dedicated owner, load testing is one of those gray areas, especially at a startup where there’s no dedicated SRE. Should the QA team handle load testing? Or the cloud infrastructure engineering team? How about DevOps?
If you don’t make a decision about ownership, load testing probably won’t be prioritized — and your on-call engineers will be left cleaning up the mess when services fail in production. Engineering managers can help protect their team from this outcome by making a proactive decision about who will oversee the setup, automation, and reporting on load testing.
You can be doing all kinds of automated load testing on your production traffic, but it won’t help if there’s just one engineer on the team who cares about it. (And unfortunately, this is often the case.) If you own load testing at a startup, part of your job is going to be making sure the results have an impact and add value to your team.
You’ll find more tips about effecting cultural change in our guide to improving your influence as an SRE.
We’ve seen first-hand how some of our customers are using Cortex to scale up their load testing efforts. For example, one customer runs load tests every day and pushes the previous seven days of rolling results to Cortex at all times. As the reporting mechanism, they use a Cortex Scorecard they’ve designated specifically for load testing.
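The rolling-window part of that pattern is straightforward to sketch. The snippet below keeps only the most recent seven daily summaries; the field names and the fixed start date are illustrative, and the push to a reporting system (such as a scorecard API) is left as a comment rather than a real call:

```python
from collections import deque
from datetime import date, timedelta

# Keep only the most recent seven daily results (rolling window).
window = deque(maxlen=7)

def record_daily_result(day, p99_ms, error_rate):
    """Append a day's load-test summary; the oldest day falls off at 7."""
    window.append({"day": day.isoformat(), "p99_ms": p99_ms,
                   "error_rate": error_rate})
    # A real pipeline would push `list(window)` to its reporting
    # system here; we just return the current window.
    return list(window)

start = date(2023, 1, 10)  # fixed date so the example is reproducible
for offset in range(10):   # ten days of results -> only last seven kept
    snapshot = record_daily_result(start + timedelta(days=offset),
                                   p99_ms=120 + offset, error_rate=0.01)

print(len(snapshot), snapshot[0]["day"], snapshot[-1]["day"])
```

Running the daily test in CI and pushing the updated window after each run keeps the report current without any manual bookkeeping.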
Scorecards are ways to report on service health at a glance. Customers have used Scorecard notifications to set up reports on their load testing results that go straight to the service owners. The SREs can even set new performance goals and track progress using Cortex’s Initiatives feature.
With Scorecards, the whole team has a dashboard for seeing how the system is performing overall. The new Graph View lets you contextually overlay your Scorecard data on the graph of your microservices. This feature is especially useful for a load testing Scorecard: You can easily see all the dependencies and identify bottlenecks at a glance.
If you want to scale up your load testing efforts as an organization (maybe because your user base is scaling up) and make this a fully integrated part of your QA and production readiness process — we might be able to help! You can sign up for a free demo of Cortex here.