Golden Signals
Definition
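The Golden Signals are four key metrics for monitoring distributed systems: latency (how long requests take), traffic (how much demand the service receives), errors (the rate of failed requests), and saturation (how close the service's resources are to their limits). They play the same role as the vital signs a doctor checks to assess a patient's health.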
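As a rough illustration, all four signals can be derived from a single window of request observations. The sketch below is a minimal example under stated assumptions, not a reference implementation: the `Request` record, the `percentile` and `golden_signals` helpers, the choice of a nearest-rank 95th percentile, and the use of in-flight requests divided by a concurrency limit as the saturation measure are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window into the four signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    failures = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": failures / len(requests),
        # Assumption: saturation is approximated by concurrency in use.
        "saturation": in_flight / capacity,
    }


# Example: ten requests seen in a 5-second window, two of them failed.
window = [Request(duration_ms=10.0 * i, ok=(i > 2)) for i in range(1, 11)]
signals = golden_signals(window, window_s=5, in_flight=8, capacity=10)
print(signals)
```

In practice a monitoring backend usually computes these values from counters and histograms rather than from raw request lists; the sketch only shows what each signal measures.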
Use Cases
- Google: Site Reliability Engineering (SRE) monitoring for large-scale, user-facing services, aimed at detecting outages and performance regressions quickly. Google popularized the Golden Signals in its SRE guidance, which focuses alerting and dashboards on latency, traffic, errors, and saturation. Teams define service-level indicators (SLIs) that map to these signals (for example, request latency percentiles, request rate, error rate, and resource utilization or queue depth) and page the on-call engineer only when thresholds or error-budget policies indicate user impact. Outcome: faster detection of user-impacting incidents and less alert fatigue, because a small set of high-signal indicators takes priority over many low-value metrics.
- Netflix: Monitoring streaming and API services to maintain playback quality and reliability during traffic spikes and regional failures. Netflix engineering practices emphasize strong observability and actionable alerting. Teams commonly track request latency, request volume, error rates, and saturation indicators (CPU, memory, thread pools, connection pools, queue depth) across microservices, then correlate these with distributed traces and logs to isolate bottlenecks. Outcome: improved operational response during peak demand, because responders can quickly tell whether increased traffic, elevated errors, or resource saturation is the primary driver of a degradation.
- Shopify: Keeping checkout and storefront APIs reliable during flash sales and seasonal peaks (for example, Black Friday/Cyber Monday). Shopify engineering has publicly discussed reliability and incident response practices that rely on monitoring key service health indicators. Teams typically alert on elevated latency and error rates, track traffic changes to separate demand shifts from regressions, and watch saturation signals such as database or queue utilization to prevent cascading failures. Outcome: a more stable customer experience during high-traffic events, because latency and error regressions are caught early and teams can scale or mitigate before the impact becomes widespread.
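The use cases above share one pattern: when a service degrades, responders compare the signals against a baseline to see whether demand, failures, or resource limits moved. The following is a minimal sketch of that comparison. The `Snapshot` type, the `likely_drivers` function, and all threshold values are hypothetical and would need tuning per service; none of them are taken from the companies named above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Signal values for one time window (hypothetical shape)."""
    traffic_rps: float
    error_rate: float   # fraction of failed requests, 0..1
    saturation: float   # fraction of capacity in use, 0..1


def likely_drivers(baseline, current, traffic_factor=1.5,
                   error_limit=0.01, saturation_limit=0.8):
    """Return which signals moved enough to explain a degradation."""
    drivers = []
    # Resource is close to its limit.
    if current.saturation >= saturation_limit:
        drivers.append("saturation")
    # Demand grew well beyond the baseline window.
    if current.traffic_rps >= baseline.traffic_rps * traffic_factor:
        drivers.append("traffic")
    # Failures are above the limit and worse than the baseline.
    if (current.error_rate >= error_limit
            and current.error_rate > baseline.error_rate):
        drivers.append("errors")
    return drivers or ["none"]


normal = Snapshot(traffic_rps=100, error_rate=0.001, saturation=0.4)
bad_deploy = Snapshot(traffic_rps=105, error_rate=0.05, saturation=0.45)
flash_sale = Snapshot(traffic_rps=300, error_rate=0.001, saturation=0.9)

print(likely_drivers(normal, bad_deploy))
print(likely_drivers(normal, flash_sale))
```

A result that contains only "errors" with flat traffic suggests a regression rather than a demand shift, while "traffic" together with "saturation" points toward scaling or load shedding.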
Provider Equivalents
- AWS: Amazon CloudWatch
- Azure: Azure Monitor
- GCP: Cloud Monitoring
- OCI: OCI Monitoring
Frequently Asked Questions
- What's the difference between Golden Signals and the RED method?
- They’re very similar. The RED method focuses on Request rate, Errors, and Duration (latency). Golden Signals include those three ideas (traffic, errors, latency) and add Saturation, which helps you spot when a resource is nearing its limit (like CPU, memory, queue depth, or connection pools).
- When should I use Golden Signals?
- Use them when you operate a service that users depend on (APIs, web apps, microservices, databases) and you need fast, reliable detection of user impact. They’re especially useful as a default dashboard/alerting baseline: start with the four signals, then add deeper metrics only when they help explain or prevent issues.
- How much do Golden Signals cost?
- Golden Signals are a monitoring concept, not a product, so adopting them is free. The cost comes from the tools that collect and store telemetry and send alerts. Pricing depends on factors such as metric volume and retention, log ingestion and storage, trace sampling rates, the number of monitored resources, and alert notifications. To control cost, teams often aggregate metrics, limit high-cardinality labels, sample traces, and set log retention policies.
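To illustrate the cost-control point about high-cardinality labels: every distinct combination of label values creates a separate time series to store, so dropping a label such as a user ID before export collapses many series into a few. The sketch below is a simplified, tool-independent example; the `aggregate` function, the sample format, and the label names are assumptions for illustration and do not correspond to any specific vendor's API.

```python
from collections import Counter


def aggregate(samples, keep_labels):
    """Sum counter samples after dropping labels not in keep_labels."""
    totals = Counter()
    for labels, value in samples:
        # The series identity is the sorted set of labels that are kept.
        key = tuple(sorted(
            (name, val) for name, val in labels.items()
            if name in keep_labels
        ))
        totals[key] += value
    return dict(totals)


# Hypothetical error-count samples labeled with a per-user ID.
samples = [
    ({"route": "/checkout", "status": "500", "user_id": "u1"}, 1),
    ({"route": "/checkout", "status": "500", "user_id": "u2"}, 2),
    ({"route": "/cart", "status": "200", "user_id": "u1"}, 5),
]

# Keeping only route and status merges the two /checkout series.
series = aggregate(samples, keep_labels={"route", "status"})
print(len(series))
```

The trade-off is that per-user detail is no longer available in metrics; teams that need it typically look it up in logs or traces instead.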
Category: monitoring
Difficulty: intermediate
Related Terms
See Also