Fault Tolerance

Definition

Ability of a system to continue operating properly even when some components fail. Like a plane that can fly safely even if one engine stops working.

Use Cases

Frequently Asked Questions

What's the difference between Fault Tolerance and High Availability?
High availability focuses on minimizing downtime (often measured as uptime percentage) and typically uses redundancy and failover to recover quickly. Fault tolerance aims to keep the system operating correctly even when components fail, ideally with no interruption at all. In practice, fault-tolerant designs usually require more redundancy and automation than basic high-availability setups.
When should I use Fault Tolerance?
Use fault tolerance when downtime or data loss is unacceptable or very costly—examples include payment processing, healthcare systems, emergency services, core authentication, and mission-critical APIs. If brief interruptions are acceptable, a high-availability design (fast recovery) may be sufficient and cheaper than full fault tolerance.
How much does Fault Tolerance cost?
Costs usually increase because you run redundant components (often in multiple zones or regions), add load balancers, use replicated databases/storage, and pay for extra network traffic and monitoring. The biggest cost drivers are duplicate compute capacity, cross-zone/region data replication and egress charges, and managed services with multi-zone or multi-region configurations.

Category: cloud

Difficulty: intermediate

Related Terms

See Also