Alerting
Definition
The practice of automatically notifying the right people when a system metric crosses a threshold or an unexpected event occurs, so that they can respond promptly.
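As a minimal sketch of this idea, the Python below checks a metric value against a threshold and routes a notification to the owning team. The names `ThresholdRule`, `evaluate`, and the `ROUTES` table are hypothetical and not tied to any provider's API. A real system would hand the message to a paging or messaging service rather than return a string.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: which on-call group owns which service.
ROUTES = {"checkout": "payments-oncall", "search": "search-oncall"}


@dataclass
class ThresholdRule:
    service: str      # service the metric belongs to
    metric: str       # metric name, e.g. "error_rate"
    threshold: float  # the rule fires when the value is strictly above this


def evaluate(rule: ThresholdRule, value: float) -> Optional[str]:
    """Return a notification message if the rule fires, otherwise None."""
    if value <= rule.threshold:
        return None
    # Route to the owning team, falling back to a default responder.
    recipient = ROUTES.get(rule.service, "default-oncall")
    return (f"[ALERT] to={recipient} service={rule.service} "
            f"{rule.metric}={value} exceeds {rule.threshold}")


rule = ThresholdRule(service="checkout", metric="error_rate", threshold=0.05)
print(evaluate(rule, 0.02))  # below the threshold, so nothing to send
print(evaluate(rule, 0.12))  # above the threshold, so a message is produced
```

Keeping the routing table separate from the rule reflects the "right people" part of the definition: the condition decides whether to notify, and ownership decides whom.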
Use Cases
- Netflix: Detecting service degradation and infrastructure issues in a large microservices environment to reduce time-to-detect and time-to-recover. Netflix uses extensive telemetry and alerting practices (including automated notifications and on-call workflows) to surface abnormal conditions quickly and route them to the right responders. Alerts are typically tied to service-level indicators (SLIs) and operational metrics, with escalation to on-call engineers via incident management tooling. Outcome: faster detection of incidents and quicker remediation through structured on-call response and actionable alerts, improving service reliability at scale.
- Etsy: Catching production issues early during frequent deployments and rapid iteration. Etsy has publicly discussed using monitoring and alerting to detect regressions and operational anomalies, routing alerts to the engineering teams responsible for affected services and using dashboards and metrics to validate changes after deploys. Outcome: reduced impact of regressions by identifying problems quickly after releases and enabling faster rollback or fixes, supporting high deployment velocity.
- Shopify: Maintaining uptime during traffic spikes (e.g., major sales events) by reacting quickly to performance and capacity signals. Shopify has described reliability practices that include monitoring key platform metrics and alerting on abnormal behavior, integrating alerts with on-call processes so teams can respond rapidly during peak load. Outcome: improved operational readiness and faster response during high-traffic periods, helping protect revenue and customer experience.
Provider Equivalents
- AWS: Amazon CloudWatch (Alarms) + Amazon SNS
- Azure: Azure Monitor (Alerts) + Action Groups
- GCP: Google Cloud Monitoring (Alerting) + Notification Channels
- OCI: OCI Monitoring (Alarms) + Notifications
Frequently Asked Questions
- What's the difference between alerting and monitoring?
- Monitoring is collecting and visualizing signals (metrics, logs, traces) so you can understand system health over time. Alerting is the action layer on top of monitoring: it evaluates those signals against rules (like thresholds or missing data) and notifies people or triggers automation when something needs attention.
- When should I use alerting?
- Use alerting when a condition requires timely action, such as preventing downtime, protecting data, or avoiding customer impact. Start with alerts tied to user impact (latency, error rate, availability), then add alerts for capacity and safety (CPU/memory saturation, disk full, certificate expiration). Avoid alerting on every metric; alert only when someone can and should do something about it.
- How much does alerting cost?
- Costs depend on the cloud provider and what you alert on. Common pricing factors include: (1) number of monitored time series/metrics, (2) number of alert rules/alarms, (3) evaluation frequency, and (4) notification delivery (for example, SMS or phone calls can add charges). Many providers charge separately for the monitoring data you ingest/store and for notification services (such as Amazon SNS). Always check the specific service pricing page for your region and expected alert volume.
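The first answer above describes alerting as evaluating monitored signals against rules such as thresholds or missing data. A rough sketch of that evaluation step follows. The function name `check`, the window and staleness defaults, and the returned state labels are all assumptions chosen for illustration, not any provider's behavior.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def check(samples: List[Sample], now: float, threshold: float,
          window_s: float = 300.0, stale_after_s: float = 120.0) -> Optional[str]:
    """Evaluate one rule; return "NO_DATA", "THRESHOLD", or None when healthy."""
    # Only consider samples inside the evaluation window.
    recent = [(t, v) for t, v in samples if now - window_s <= t <= now]
    # Missing-data condition: nothing was reported recently enough.
    if not recent or now - max(t for t, _ in recent) > stale_after_s:
        return "NO_DATA"
    # Threshold condition: averaging over the window smooths single spikes.
    avg = sum(v for _, v in recent) / len(recent)
    if avg > threshold:
        return "THRESHOLD"
    return None


now = 1000.0
healthy = [(now - 60 * i, 0.01) for i in range(5)]  # steady low error rate
failing = [(now - 60 * i, 0.20) for i in range(5)]  # sustained high error rate
silent = [(now - 250, 0.01)]                        # last report is too old

print(check(healthy, now, threshold=0.05))
print(check(failing, now, threshold=0.05))
print(check(silent, now, threshold=0.05))
```

Treating silence as its own state matters because a crashed service often stops emitting metrics entirely, and a plain threshold check would then never fire.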
Category: monitoring
Difficulty: basic