Resilience

Definition

A system's ability to keep operating under adverse conditions and to recover quickly from failures. Resilience is vital for maintaining service availability.

Frequently Asked Questions

What's the difference between resilience and high availability?
High availability focuses on minimizing downtime (keeping the service up). Resilience is broader: it includes high availability plus the ability to absorb failures, degrade gracefully, and recover quickly when things go wrong (bad deploys, dependency outages, traffic surges). A system can be highly available under normal conditions yet not resilient if it fails catastrophically under stress.
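
As a rough illustration of absorbing a failure and degrading gracefully, the sketch below retries a failing dependency with exponential backoff and then falls back to a reduced result instead of raising an error. The function name `call_with_fallback` and its parameters are hypothetical, chosen only for this example. Real systems often add jitter, timeouts, and circuit breakers on top of this.

```python
import time


def call_with_fallback(primary, fallback, retries=3, base_delay=0.1, sleep=time.sleep):
    """Call `primary`; on repeated failure, return `fallback()` instead.

    Hypothetical helper for illustration only. `primary` and `fallback`
    are zero-argument callables. `sleep` is injectable so the backoff
    can be observed without actually waiting.
    """
    for attempt in range(retries):
        try:
            return primary()
        except Exception:
            # Back off exponentially between attempts: base, 2*base, 4*base, ...
            # No sleep after the final attempt, since we degrade immediately.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: degrade gracefully rather than fail outright.
    return fallback()
```

For example, a product page might pass a call to a recommendations service as `primary` and a function returning a static list of popular items as `fallback`, so the page still renders during a dependency outage.
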
When should I design for resilience in the cloud?
Design for resilience when downtime or data loss would significantly impact customers or revenue, when you expect variable traffic, or when you rely on multiple services that can fail independently. Start with customer-facing and revenue-critical paths (login, checkout, payments, core APIs). For internal tools or low-impact workloads, you may accept simpler designs and add resilience later based on risk.
How much does resilience cost in cloud computing?
Costs usually increase with redundancy and faster recovery targets. Common cost drivers include running resources in multiple zones/regions, extra load balancers, replicated databases/storage, higher provisioned capacity to handle failover, more frequent backups, and additional monitoring/observability. You can control cost by matching resilience to business needs, expressed as a recovery time objective (RTO) and a recovery point objective (RPO), using autoscaling, choosing managed services that include replication, and designing graceful degradation so you don’t need to over-provision everything.
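
One way to picture matching resilience to RTO/RPO is to pick the cheapest design that still meets both targets. The sketch below is only an illustration: the `Option` type, the option names, and every cost and recovery figure are made-up placeholders, not real cloud pricing or measured recovery times.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Option:
    name: str
    monthly_cost: float  # placeholder figure, not real pricing
    rto: timedelta       # assumed time to restore service
    rpo: timedelta       # assumed worst-case window of lost data


# Hypothetical designs with invented numbers, for illustration only.
OPTIONS = [
    Option("single-zone + daily backups", 100.0, timedelta(hours=8), timedelta(hours=24)),
    Option("multi-zone + replicated database", 250.0, timedelta(minutes=15), timedelta(minutes=5)),
    Option("multi-region active-active", 700.0, timedelta(minutes=1), timedelta(seconds=0)),
]


def cheapest_meeting_targets(options, target_rto, target_rpo):
    """Return the lowest-cost option that satisfies both targets, or None."""
    eligible = [o for o in options if o.rto <= target_rto and o.rpo <= target_rpo]
    return min(eligible, key=lambda o: o.monthly_cost, default=None)
```

With these placeholder numbers, a workload that tolerates an hour of downtime and 15 minutes of data loss would select the multi-zone design, while an internal tool that tolerates a day of each would select the single-zone design.
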

Category: cloud

Difficulty: intermediate
