Amount of allowable downtime or errors before reliability targets are breached, used to balance reliability with feature development. Like having a spending allowance for risk.
With a 99.9% SLA, a service has 43 minutes of error budget per month - if it's depleted, teams focus on reliability instead of new features.