Fault Tolerance
Definition
The ability of a system to continue operating properly even when some of its components fail, much as a multi-engine plane can keep flying safely if one engine stops working.
Use Cases
- Netflix: Keeps video streaming available during server, instance, or Availability Zone failures. It runs services across multiple AWS Availability Zones with load balancing and automated instance replacement, and uses resilience testing (e.g., Chaos Engineering practices) to validate that services continue operating when components fail. Outcome: improved service continuity and fewer customer-visible interruptions when infrastructure components fail.
- Amazon: Maintains availability of high-traffic retail and backend services despite hardware and network failures. It designs services to be redundant across multiple Availability Zones, uses health checks and automated failover patterns, and relies on distributed systems that can tolerate node loss. Outcome: higher uptime and the ability to continue serving customer traffic during localized infrastructure failures.
- Google: Keeps global consumer services available during machine and data center component failures. It uses large-scale distributed systems with redundancy, automated recovery, and traffic management to route around failures, and designs services to tolerate the loss of individual machines without user impact. Outcome: resilient user experiences at global scale, with reduced impact from routine hardware failures.
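The redundancy-plus-failover pattern shared by these use cases can be illustrated with a minimal Python sketch. It is not how any of the companies above implement it. The names `call_with_failover`, `healthy_zone`, `failed_zone`, and `AllReplicasFailed` are hypothetical, and each "replica" is just a local function standing in for a service in a separate zone.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant replica in order and return the first successful reply."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:
            # One failed replica must not fail the whole request: record and move on.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for {request!r}")


# Hypothetical stand-ins for the same service deployed in two zones.
def healthy_zone(request: str) -> str:
    return f"served {request}"


def failed_zone(request: str) -> str:
    raise ConnectionError("zone unavailable")


# The first zone is down, so the request is transparently served by the second.
result = call_with_failover([failed_zone, healthy_zone], "GET /video/42")
print(result)  # served GET /video/42
```

The caller sees a normal response even though one component failed, which is the behavior the definition describes. A production system would typically add health checks, timeouts, and load balancing rather than trying replicas in a fixed order.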
Frequently Asked Questions
- What's the difference between Fault Tolerance and High Availability?
  - High availability focuses on minimizing downtime (often measured as uptime percentage) and typically uses redundancy and failover to recover quickly. Fault tolerance aims to keep the system operating correctly even when components fail, ideally with no interruption at all. In practice, fault-tolerant designs usually require more redundancy and automation than basic high-availability setups.
- When should I use Fault Tolerance?
  - Use fault tolerance when downtime or data loss is unacceptable or very costly. Examples include payment processing, healthcare systems, emergency services, core authentication, and mission-critical APIs. If brief interruptions are acceptable, a high-availability design (fast recovery) may be sufficient and cheaper than full fault tolerance.
- How much does Fault Tolerance cost?
  - Costs usually increase because you run redundant components (often in multiple zones or regions), add load balancers, use replicated databases/storage, and pay for extra network traffic and monitoring. The biggest cost drivers are duplicate compute capacity, cross-zone/region data replication and egress charges, and managed services with multi-zone or multi-region configurations.
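The trade-off between uptime percentage and redundancy cost raised in the questions above can be made concrete with a little arithmetic. The sketch below is a simplified model, not a pricing or SLA calculator. It assumes replicas fail independently, that any single healthy replica can serve all traffic, and that compute cost scales linearly with replica count. All function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability (e.g. 0.999 for 99.9%)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(per_replica: float, replicas: int) -> float:
    """Availability when any one of N replicas suffices (assumes independent failures)."""
    return 1.0 - (1.0 - per_replica) ** replicas


def relative_compute_cost(replicas: int) -> int:
    """Compute cost relative to a single replica (assumes linear scaling)."""
    return replicas


# A single component at 99% availability versus two or three redundant copies.
for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(n, round(a, 6), round(downtime_hours_per_year(a), 2), relative_compute_cost(n))
```

Under these assumptions each extra replica cuts expected downtime sharply while compute cost grows only linearly. Real deployments see smaller gains, because failures are often correlated (shared networks, bad deployments, regional outages) and because replication, load balancing, and data transfer add costs that this model ignores.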
Category: cloud
Difficulty: intermediate
Related Terms
See Also