Error Budget

Definition

The amount of downtime or errors a service can accumulate over a given period before its reliability target (SLO) is breached. Teams use it to balance reliability work against feature development.

Frequently Asked Questions

What's the difference between an Error Budget and an SLA?
An SLA (Service Level Agreement) is an external promise to customers, often with penalties if it’s not met. An error budget is an internal engineering tool based on an SLO (Service Level Objective). It measures how much unreliability you can “spend” (downtime, errors, or slow requests) before you must prioritize reliability work over new features.
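As a rough illustration of how a budget falls out of an SLO, the sketch below converts an availability target into allowed downtime. The function name `error_budget`, the 99.9% target, and the 30-day window are assumptions chosen for the example, not values taken from this entry.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability SLO over a time window.

    `slo` is a fraction (0.999 means 99.9%). Hypothetical helper for illustration.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # The budget is the complement of the SLO, applied to the whole window.
    return window * (1.0 - slo)


# Assumed example: a 99.9% availability SLO measured over 30 days.
budget = error_budget(0.999, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumptions the service may be unavailable for roughly 43 minutes per 30 days before the SLO is breached.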
When should I use an Error Budget?
Use an error budget when you have a service with clear reliability goals and frequent changes (deployments, config updates, new features). It’s especially useful if teams argue about whether to ship faster or stabilize. Start once you can measure a few key SLIs (availability, latency, correctness) and you have an agreed SLO for what “good enough” reliability means.
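One way to turn the "ship faster or stabilize" argument into a number is to track how much of the budget is left. The sketch below assumes a request-based availability SLI. The function names, the request counts, and the simple "any budget left means ship" rule are illustrative assumptions, and real policies are usually more nuanced.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Number of failed requests the SLO tolerates in this window.
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_decision(remaining: float) -> str:
    """Hypothetical policy: ship while budget remains, otherwise stabilize."""
    return "ship features" if remaining > 0 else "prioritize reliability"


# Assumed numbers: 99.9% SLO, 1,000,000 requests in the window, 250 failures.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of budget left: {release_decision(remaining)}")
```

With these assumed numbers about three quarters of the budget is unspent, so the hypothetical policy favors shipping. Once failures exceed the tolerated count, the same rule points the team toward reliability work.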
How much does an Error Budget cost?
The error budget itself doesn’t cost money; it is simply the complement of your SLO (a 99.9% SLO leaves a 0.1% budget). Costs come from implementing it: monitoring/observability tools (metrics, logs, traces), alerting/incident management, and engineering time to define SLIs/SLOs and improve reliability. Tighter SLOs usually increase cost because a smaller budget leaves less room for failure, so you may need more redundancy, better automation, and more operational effort to stay within it.
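To see why tighter targets tend to cost more, it can help to look at how quickly the budget shrinks as the SLO rises. The sketch below assumes a 30-day window and three example targets. Both are illustrative choices, as is the helper name `allowed_downtime_minutes`.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day measurement window


def allowed_downtime_minutes(slo_percent: float) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return WINDOW_MINUTES * (100.0 - slo_percent) / 100.0


# Each additional "nine" cuts the budget by a factor of ten.
for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO: {allowed_downtime_minutes(slo):.2f} minutes per 30 days")
```

Going from 99% to 99.99% reduces the allowance from about 432 minutes to about 4.3 minutes per 30 days, which is why the extra redundancy and automation mentioned above usually become necessary.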

Category: software

Difficulty: advanced
