MTTR
Definition
Mean Time To Recovery - average time needed to restore service after a failure, a key metric for assessing operational efficiency and reliability.
Use Cases
- Netflix: Reducing customer impact from service failures by improving recovery speed and operational readiness. — Netflix is known for resilience engineering practices in AWS, including automated recovery patterns and failure testing (often referred to as chaos engineering) to validate that services can recover quickly when components fail. (Improved service resilience and faster recovery from failures, helping reduce the duration and impact of incidents on streaming availability.)
- Google: Improving reliability of large-scale production services by measuring incident outcomes and recovery performance. — Google’s Site Reliability Engineering (SRE) practices emphasize measuring reliability and using operational processes (incident response, postmortems, and automation) to reduce time to restore service after outages. (More consistent incident response and reduced time to restore service through automation and disciplined operational practices.)
- Etsy: Improving incident response and learning from outages to reduce recovery time for customer-facing systems. — Etsy has publicly discussed blameless postmortems and operational practices that help teams learn from incidents and improve detection, response, and recovery workflows over time. (Faster and more reliable incident handling over time by turning outage learnings into concrete operational and engineering improvements.)
Frequently Asked Questions
- What's the difference between MTTR and MTTD?
- MTTD (Mean Time To Detect) measures how long it takes to notice an incident. MTTR measures how long it takes to restore service after the incident is detected. Lowering MTTD helps you start fixing sooner; lowering MTTR helps you finish fixing sooner.
- When should I track MTTR?
- Track MTTR when you run production systems where downtime matters (customer-facing apps, APIs, data pipelines, internal business systems). It’s especially useful if you have on-call/incident response, SLOs/SLAs, or frequent changes (deployments) and want to quantify how quickly you recover from failures.
- How much does MTTR cost?
- MTTR itself has no direct cost because it’s a metric. Costs come from the tools and practices used to measure and reduce it: monitoring and logging platforms, incident management/on-call tooling, additional redundancy (multi-AZ/multi-region), automated remediation (runbooks, functions, pipelines), and engineering time for reliability work.
Category: software
Difficulty: intermediate
See Also