You’re a fraud analyst, customer success manager, product manager, or trust & safety analyst. You’re in charge of making sure your company’s operations run smoothly, and that there isn’t a major fraud event or customer loss on your watch. How do you use your company’s data to manage an operational process?
Google’s site reliability lead wrote a classic engineering piece on his philosophy of DevOps alerting. We’ve interviewed 100+ leaders about business alerting, at companies across fintech, healthcare, and marketplaces, ranging in size from 2 to 4,000 people.
We found several common patterns across these interviews, along with some great best practices and failure modes to avoid. We bring you these lessons in this series about alerting.
If you take away only three things from this article, remember these:
Monitoring is a trade-off between your time and system complexity on one hand, and your confidence that it’s working well on the other. Ensure alerts are designed by the right stakeholders (Read: What you need to get started). While it might seem counterintuitive, over-monitoring is harder to solve than under-monitoring (Read: How to decide what to monitor).
There are several ways to ensure you’re crafting good alerts (Read: How to craft a good alert), but at a minimum, alerts should make it clear what the next steps are. Otherwise, you risk having different responses to the same alert depending on who’s on call, or worse, inaction.
Only add alerts for something that is, or will imminently be, a user-facing issue that you need to address immediately. Alerts are not the way to display regular business metrics or cases (Read: How to manage non-alert items).
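To make the last two takeaways concrete, here is a minimal Python sketch of how they might look in practice. It is only an illustration under our own assumptions: the names `Signal`, `Alert`, and `route`, the field names, and the example next steps are hypothetical and do not come from any particular alerting tool. The idea is that an alert cannot be created without next steps, and that anything that is not an urgent user-facing issue is routed to a regular report instead.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """A hypothetical observation derived from business data."""
    name: str
    user_facing: bool       # is, or will imminently be, a user-facing issue
    needs_action_now: bool  # someone has to respond immediately


@dataclass
class Alert:
    """An alert that must tell the responder what to do next."""
    signal: Signal
    next_steps: list[str]

    def __post_init__(self) -> None:
        # Refuse alerts with no next steps: without them, the response
        # depends on who happens to be on call.
        if not self.next_steps:
            raise ValueError(f"alert '{self.signal.name}' has no next steps")


def route(signal: Signal) -> str:
    """Return 'alert' for urgent user-facing issues, 'report' otherwise."""
    if signal.user_facing and signal.needs_action_now:
        return "alert"
    # Regular business metrics and cases belong in a report or dashboard.
    return "report"


chargeback_spike = Signal("chargeback_spike", user_facing=True, needs_action_now=True)
weekly_signups = Signal("weekly_signups", user_facing=False, needs_action_now=False)

if route(chargeback_spike) == "alert":
    alert = Alert(
        chargeback_spike,
        next_steps=["Review the affected merchants", "Escalate to the fraud lead"],
    )
```

In this sketch, `weekly_signups` would be routed to a report rather than an alert, and constructing an `Alert` with an empty `next_steps` list raises a `ValueError`.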