Resilience
Definition
A system's ability to absorb failures, keep operating under adverse conditions, and recover quickly, which is essential for maintaining service availability.
Use Cases
- Netflix: Keep video streaming available despite instance, zone, or service failures and sudden traffic spikes. — Built a microservices architecture on AWS with redundancy across multiple Availability Zones, extensive automation, and resilience testing practices (e.g., failure injection/chaos engineering) to validate recovery behavior. (Improved ability to continue serving customers during infrastructure disruptions and to recover quickly from component failures, supporting high availability at global scale.)
- Amazon: Maintain availability of high-traffic retail and checkout workflows during peak events and partial infrastructure failures. — Uses distributed, fault-tolerant service design with redundancy, load balancing, and automated scaling; designs systems to degrade gracefully (e.g., non-critical features can be reduced) while keeping core purchase flows running. (Better continuity of critical customer journeys during demand spikes and localized failures, reducing revenue impact from outages.)
- Google: Deliver reliable consumer and enterprise services that continue operating through machine and datacenter failures. — Applies site reliability engineering (SRE) practices, redundancy across failure domains, automated rollouts/rollbacks, and continuous monitoring with defined service level objectives (SLOs) to drive resilient operations. (Faster detection and recovery from incidents and more predictable reliability outcomes through measurable objectives and automation.)
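Graceful degradation, as in the use cases above, is often paired with a circuit breaker: after repeated failures, a non-critical dependency is skipped for a cool-down period and a fallback is served instead. The following is a minimal sketch only, not any of these companies' implementations. The class `CircuitBreaker`, its parameters, and the functions `fetch_recommendations` and `default_recommendations` are hypothetical names chosen for illustration.

```python
import time


class CircuitBreaker:
    """Skips a failing dependency for a cool-down period (illustrative sketch)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            # Circuit is open: serve the fallback until the cool-down elapses.
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cool-down elapsed: allow one trial call. The failure count is
            # kept, so a failed trial re-opens the circuit immediately.
            self.opened_at = None
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result


# Hypothetical non-critical dependency that is currently failing.
def fetch_recommendations():
    raise TimeoutError("recommendation service unavailable")


# Hypothetical degraded response that keeps the page usable.
def default_recommendations():
    return ["bestsellers"]


breaker = CircuitBreaker(max_failures=2, reset_after_s=30.0)
for _ in range(3):
    # The first two calls fail and fall back; the third skips the dependency.
    items = breaker.call(fetch_recommendations, default_recommendations)
```

In this sketch the core flow never raises: callers always get either the real result or the degraded one. Production systems would typically also add timeouts, metrics, and per-dependency tuning.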
Frequently Asked Questions
- What's the difference between resilience and high availability?
- High availability focuses on minimizing downtime (keeping the service up). Resilience is broader: it includes high availability plus the ability to absorb failures, degrade gracefully, and recover quickly—even when things go wrong (bad deploys, dependency outages, traffic surges). A system can be highly available in normal conditions but not resilient if it fails catastrophically under stress.
- When should I design for resilience in the cloud?
- Design for resilience when downtime or data loss would significantly impact customers or revenue, when you expect variable traffic, or when you rely on multiple services that can fail independently. Start with customer-facing and revenue-critical paths (login, checkout, payments, core APIs). For internal tools or low-impact workloads, you may accept simpler designs and add resilience later based on risk.
- How much does resilience cost in cloud computing?
- Costs usually increase with redundancy and faster recovery targets. Common cost drivers include running resources in multiple zones/regions, extra load balancers, replicated databases/storage, higher provisioned capacity to handle failover, more frequent backups, and additional monitoring/observability. You can control cost by matching resilience to business recovery targets (recovery time objective, RTO, and recovery point objective, RPO), using autoscaling, choosing managed services that include replication, and designing graceful degradation so you don’t need to over-provision everything.
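The trade-off between redundancy and cost can be made concrete with some rough arithmetic. The sketch below assumes replicas fail independently and that any single healthy replica can serve traffic. Real failures are often correlated, so treat the results as an upper bound. The 99% per-replica availability and the function names are illustrative assumptions, not figures from any provider.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def combined_availability(single: float, replicas: int) -> float:
    """Availability when any one of `replicas` independent copies suffices."""
    return 1.0 - (1.0 - single) ** replicas


def allowed_downtime_minutes(availability: float,
                             period_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Expected downtime over the period at the given availability."""
    return (1.0 - availability) * period_minutes


# Assumed per-replica availability of 99% (illustrative only).
for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(f"{n} replica(s): availability {a:.6f}, "
          f"downtime {allowed_downtime_minutes(a):.2f} min per 30 days")
```

Under these assumptions, going from one replica to two cuts expected downtime from roughly 432 minutes to about 4.3 minutes per 30 days while roughly doubling the resource cost, and each further replica buys a smaller absolute improvement. That diminishing return is one reason to match redundancy to the recovery targets the business actually needs.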
Category: cloud
Difficulty: intermediate