Failover
Definition
Failover is the automatic switch to a standby (backup) system when the primary system fails, much like a backup generator starting up when grid power goes out.
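As a rough illustration of the definition, the sketch below tries a primary backend first and falls over to a standby only when the primary raises an error. The names `call_with_failover`, `primary`, `standby`, and `AllBackendsFailed` are hypothetical and exist only for this example. Production failover usually happens at the DNS, load balancer, or database layer rather than in application code.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when neither the primary nor any standby could serve the call."""


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in priority order; return the first successful result."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllBackendsFailed(errors)


# Hypothetical backends: the primary is simulated as being down.
def primary() -> str:
    raise ConnectionError("primary is down")


def standby() -> str:
    return "served by standby"


result = call_with_failover([primary, standby])
print(result)
```

The order of the list encodes priority, which mirrors an active-passive setup: the standby only handles traffic after the primary has failed.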
Use Cases
- Netflix: Keeping streaming services available during infrastructure failures. Netflix runs services across multiple AWS Availability Zones and uses automated health checks and load balancing to shift traffic when instances or zones fail. (Result: reduced customer impact from instance- or zone-level failures and higher overall availability through automated recovery.)
- Amazon: Maintaining availability of transactional systems during component failures. Amazon uses redundant infrastructure and automated failover across data centers and Availability Zones, including load balancing and database replication with standby capacity. (Result: critical services keep running with minimal downtime when individual components fail.)
Provider Equivalents
- AWS: Amazon Route 53 (Failover routing) + Elastic Load Balancing + Multi-AZ (RDS/Aurora)
- Azure: Azure Traffic Manager (priority routing) + Azure Load Balancer/Application Gateway + zone-redundant services
- GCP: Cloud Load Balancing (failover) + Cloud DNS (routing policies) + regional managed services
- OCI: OCI DNS Traffic Management (steering policies) + OCI Load Balancer + Data Guard (for databases)
Frequently Asked Questions
- What's the difference between failover and failback?
  - Failover is the automatic switch from a failed primary system to a standby system. Failback is switching back to the original primary after it has been repaired and is stable (often done carefully to avoid another outage).
- When should I use failover?
  - Use failover for systems where downtime is costly or unacceptable, such as customer-facing apps, payment systems, APIs, and databases that must stay online. It's especially important when you have clear availability targets (like an SLA) and you can run a secondary instance, zone, or region to take over during failures.
- How much does failover cost?
  - Costs depend on how you implement it. Active-passive designs cost more than single-instance setups because you pay for standby compute, replicated storage, and cross-zone or cross-region data transfer. Active-active can cost even more because you run multiple production-capable environments. You may also pay for health checks, load balancers, DNS queries, and database replication features (for example, Multi-AZ or cross-region replication).
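The failover and failback distinction from the first question can be sketched as a small controller driven by health checks. This is a minimal sketch, not any provider's actual algorithm: the class `FailoverController` and its thresholds are assumptions made for illustration. It fails over after a few consecutive failed checks and fails back only after a longer run of healthy checks, which reflects the advice to fail back carefully.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Hypothetical controller that picks the active system from health checks."""

    fail_threshold: int = 3      # consecutive failed checks before failover
    recover_threshold: int = 5   # consecutive healthy checks before failback
    active: str = "primary"
    _failures: int = 0
    _successes: int = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return the active system."""
        if self.active == "primary":
            # Count consecutive failures; a healthy check resets the count.
            self._failures = 0 if primary_healthy else self._failures + 1
            if self._failures >= self.fail_threshold:
                self.active = "standby"   # failover
                self._successes = 0
        else:
            # Require an unbroken run of healthy checks before failing back.
            self._successes = self._successes + 1 if primary_healthy else 0
            if self._successes >= self.recover_threshold:
                self.active = "primary"   # failback
                self._failures = 0
        return self.active


ctrl = FailoverController(fail_threshold=2, recover_threshold=3)
checks = [True, False, False, True, True, False, True, True, True]
history = [ctrl.observe(healthy) for healthy in checks]
print(history)
```

In this run the controller fails over on the third check (two failures in a row) and fails back only on the last check, after three healthy checks in a row. The single failed check in the middle resets the recovery count, which keeps traffic from bouncing back to a primary that is still unstable.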
Category: cloud
Difficulty: intermediate