Monitoring application stability in Cortex with Bugsnag

Bugsnag monitors application stability so you can make data-driven decisions about whether to build new features or fix bugs.

The Bugsnag integration in Cortex lets you see the Bugsnag issues discovered for each service alongside your other integrations in Scorecards. This is particularly useful for measuring the operational maturity of a service.

To get started, see our documentation for adding Bugsnag to Cortex.

Scorecard Integration

By adding Bugsnag to Cortex, you’ll be able to create Scorecard rules that check: 

  • Whether a Bugsnag project is set
  • Number of Bugsnag issues
  • Number of Bugsnag issues matching custom filters, such as “filters[event.since][]=all&filters[error.status][]=open&filters[app.release_stage][]=production”. This filter counts the issues that are currently open in production, with no limit on how far back their events go.

Example of potential Bugsnag rules in a Scorecard.
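
A custom filter like the one above is an ordinary URL query string, so it can be assembled from field/value pairs rather than typed by hand. The snippet below is a minimal sketch using only the Python standard library. The helper name `build_filter_query` is hypothetical and is not part of Cortex or Bugsnag; the field names are the ones from the example filter.

```python
from urllib.parse import urlencode


def build_filter_query(filters: dict[str, str]) -> str:
    """Build a 'filters[field][]=value' query string (hypothetical helper)."""
    pairs = [(f"filters[{field}][]", value) for field, value in filters.items()]
    # safe="[]" keeps the square brackets literal instead of percent-encoding them.
    return urlencode(pairs, safe="[]")


# Field names taken from the example filter in this post.
query = build_filter_query({
    "event.since": "all",
    "error.status": "open",
    "app.release_stage": "production",
})
print(query)
```

Keeping the filter as a dictionary makes it easier to vary one field at a time, for example swapping the release stage, without retyping the bracket syntax.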

Measuring Operational Maturity 

Have you ever wondered:

  1. If your services are meeting SLOs? 
  2. If the on-call metrics are looking healthy? 
  3. If your customers are facing too many incidents? 

You can measure all of this within Cortex by creating a Scorecard that combines the issues Bugsnag finds in your application with data from your other integrations. To do so, create a Scorecard for operational maturity. It can look something like this:

  • Number of Bugsnag issues < 5 - confirm that Bugsnag isn’t finding too many issues in the platform.
  • MTTR < 1hr - an example threshold; the point is to make sure that issues are resolved in a reasonable amount of time. If they’re not, you can dig into the root cause.
  • Off-hour interruptions < 3 - if engineers are being paged off hours, it will lead to alert fatigue and low morale. By catching services that cause high numbers of off-hour interruptions, you can improve developer happiness.
  • Post-mortem tickets opened in the last 6 months that are still open - if developers are constantly creating Jira action items for services and not closing them, this is an organizational risk. Either the team is not prioritizing incident-related issues, or the team is not equipped with the right resources.
  • Customer-facing incidents in the last 3 months < 2 - check that Jira does not have too many customer-facing incidents.
  • Outstanding compliance issues < 3 - make sure the service is not affected by outstanding compliance or legal Jira issues.
  • Compare custom data to a number: "99-percentile-latency" < 500ms - either through direct integration with tools or through a batch process that queries Prometheus and sends the result to Cortex periodically, you can enforce specific metrics that services should be meeting.
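
To make the thresholds above concrete, here is a minimal sketch of how such rules could be checked against a service's metrics. It is plain Python, not Cortex's rule engine or rule syntax. The `Rule` class, the metric key names, and the sample numbers are all assumptions made for illustration. The post-mortem rule is left out because the list above does not give it a threshold.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str  # hypothetical key into a metrics dictionary
    compare: Callable[[float, float], bool]
    threshold: float


# Thresholds mirror the example Scorecard; the metric keys are made up.
RULES = [
    Rule("Bugsnag issues < 5", "bugsnag_issues", operator.lt, 5),
    Rule("MTTR < 1hr", "mttr_minutes", operator.lt, 60),
    Rule("Off-hour interruptions < 3", "off_hour_interruptions", operator.lt, 3),
    Rule("Customer-facing incidents (3 months) < 2", "customer_incidents_3mo", operator.lt, 2),
    Rule("Outstanding compliance issues < 3", "compliance_issues", operator.lt, 3),
    Rule("p99 latency < 500ms", "p99_latency_ms", operator.lt, 500),
]


def evaluate(metrics: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail per rule; a missing metric counts as a failure."""
    results = {}
    for rule in RULES:
        value = metrics.get(rule.metric)
        results[rule.name] = value is not None and rule.compare(value, rule.threshold)
    return results


# Made-up numbers for a hypothetical service; p99 latency is not reported.
sample = {
    "bugsnag_issues": 3,
    "mttr_minutes": 75,
    "off_hour_interruptions": 1,
    "customer_incidents_3mo": 0,
    "compliance_issues": 2,
}
results = evaluate(sample)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Treating a missing metric as a failure is a design choice in this sketch: it makes services that report no data stand out instead of passing silently.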

Setting up a Scorecard like the one above helps your team align on what it means for a service to be operationally mature. It also helps you identify gaps in your services, so you can create team-wide initiatives to close them. Cortex will remind you of upcoming action items and notify service owners if any services are regressing against these rules.

Start using Cortex & Bugsnag today

Using Bugsnag within Cortex will allow you to measure the operational maturity of your services and align your team on the standards. Visit our documentation to integrate Bugsnag with Cortex. If you're new to Cortex, set up a demo with our team to get started. 

By Cortex - June 17, 2021
By Ganesh Datta - June 1, 2020