Error Budget
Definition
The amount of downtime or errors a service can accumulate over a given period before it breaches its reliability target (its SLO). Teams use it to balance reliability work against feature development.
Use Cases
- Google: Balancing feature velocity with reliability for large-scale user-facing services. Google SRE popularized the error budget model: teams define SLIs (e.g., request success rate or latency) and SLOs (e.g., 99.9% availability). The difference between 100% and the SLO is the error budget, so a 99.9% SLO leaves a budget of 0.1%. Teams monitor the burn rate; when the budget is being consumed too quickly, releases are slowed or paused and engineering time shifts to reliability work (e.g., reducing toil, fixing top failure modes). (This creates a shared, quantitative decision framework between product and engineering, which helps prevent chronic over-release during instability while still allowing planned risk-taking when reliability is healthy.)
- Netflix: Managing reliability risk during frequent deployments of streaming platform services. Netflix is known for high deployment frequency and resilience engineering. Teams commonly use SLO-style targets and operational metrics to decide when to slow changes and focus on stability. Error-budget-like policies guide release decisions based on observed reliability (e.g., elevated error rates or latency) and incident trends, supported by strong observability and automated rollback and mitigation practices. (This supports rapid iteration while maintaining service reliability by making reliability impact measurable and tying it to deployment and operational decisions.)
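The burn-rate check described in the use cases above can be sketched in a few lines of Python. This is a minimal illustration, not any company's actual tooling: the function names and the `slow_threshold` and `freeze_threshold` values are assumptions made for the example. Burn rate is taken here as the observed error rate divided by the error rate the SLO allows.

```python
def burn_rate(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as the SLO
    permits; a sustained value above 1.0 exhausts it before the window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 0.0  # no traffic observed, so no budget has been spent
    allowed_error_rate = 1.0 - slo_target  # the error budget as a fraction
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / allowed_error_rate


def release_decision(rate: float,
                     slow_threshold: float = 1.0,
                     freeze_threshold: float = 2.0) -> str:
    """Map a burn rate to a release policy (thresholds are hypothetical)."""
    if rate >= freeze_threshold:
        return "freeze"  # pause releases and shift effort to reliability work
    if rate >= slow_threshold:
        return "slow"    # budget is burning faster than the SLO allows
    return "ship"        # budget is healthy; planned risk-taking is fine


if __name__ == "__main__":
    # 1,500 failures out of 1,000,000 requests against a 99.9% SLO.
    rate = burn_rate(0.999, 1_000_000, 1_500)
    print(f"burn rate {rate:.2f} -> {release_decision(rate)}")
```

In practice the thresholds and the measurement window would be agreed between product and engineering as part of the error budget policy; the values above are placeholders.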
Frequently Asked Questions
- What's the difference between an Error Budget and an SLA?
- An SLA (Service Level Agreement) is an external promise to customers, often with penalties if it’s not met. An error budget is an internal engineering tool based on an SLO (Service Level Objective). It measures how much unreliability you can “spend” (downtime, errors, or slow requests) before you must prioritize reliability work over new features.
- When should I use an Error Budget?
- Use an error budget when you have a service with clear reliability goals and frequent changes (deployments, config updates, new features). It’s especially useful if teams argue about whether to ship faster or stabilize. Start once you can measure a few key SLIs (availability, latency, correctness) and you have an agreed SLO for what “good enough” reliability means.
- How much does an Error Budget cost?
- The error budget itself doesn’t cost money; it’s a policy derived from your SLO. Costs come from implementing it: monitoring and observability tools (metrics, logs, traces), alerting and incident management, and engineering time to define SLIs/SLOs and improve reliability. Tighter SLOs usually increase cost because you may need more redundancy, better automation, and more operational effort to stay within the budget.
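To make the effect of a tighter SLO concrete, the sketch below converts an availability SLO into the downtime it allows. The 30-day window and the function name are assumptions for illustration; the right window length depends on how your SLO is defined.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of full downtime an availability SLO permits."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} SLO -> {minutes:.2f} minutes per 30 days")
    # 99.00% SLO -> 432.00 minutes per 30 days
    # 99.90% SLO -> 43.20 minutes per 30 days
    # 99.99% SLO -> 4.32 minutes per 30 days
```

Each additional "nine" cuts the budget by a factor of ten, which is why tighter targets tend to demand more redundancy and automation.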
Category: software
Difficulty: advanced