SRE

Definition

Site Reliability Engineering - an approach that applies software engineering principles to infrastructure and operations for improved reliability.

Use Cases

Frequently Asked Questions

What's the difference between SRE and DevOps?
DevOps is a broad culture and set of practices that improve collaboration between development and operations. SRE is a more specific approach to running services: it uses software engineering to automate operations and uses measurable reliability targets (SLIs/SLOs) and error budgets to decide how fast teams can safely ship changes.
When should I use SRE?
Use SRE when your service has clear uptime/latency expectations, frequent releases, and meaningful operational risk (customer impact, revenue loss, or compliance concerns). It’s especially helpful once manual operations work (toil) and incident load start slowing delivery, or when you need consistent on-call, incident response, and reliability metrics across multiple teams.
How much does SRE cost?
SRE cost is mostly people and process, plus tooling. Key factors include: staffing (on-call coverage, senior reliability engineers), time spent reducing toil via automation, and observability/incident tooling costs (metrics, logs, traces, paging). Cloud costs can also rise if you add redundancy (multi-AZ/multi-region), higher-capacity buffers, or more extensive testing environments. Many teams justify the cost by reducing downtime, improving customer experience, and enabling faster delivery with controlled risk.

Category: emerging

Difficulty: advanced

Related Terms

See Also