Failback
Definition
Failback is the process of returning operations to the primary system after it has been restored and verified following a failover. It is like moving back home after an evacuation once it is safe to return.
Use Cases
- GitHub: Restoring normal operations after a major database incident by moving traffic back to the primary data store once stability was verified. During the widely reported October 2018 incident, GitHub used database failover and recovery procedures, then re-synchronized data and progressively shifted services back to the primary configuration after validation and monitoring checks. (Service availability was restored and operations returned to the primary setup once data consistency and system health were confirmed, reducing ongoing risk and operational complexity.)
- Netflix: Regional resilience testing and recovery workflows in which services are shifted away from a region and returned once the region is considered healthy. Netflix is known for chaos engineering practices and multi-region-capable architectures for critical systems. In a regional impairment scenario, traffic can be shifted to healthy capacity and moved back after dependencies, capacity, and error rates have been verified. (Practicing controlled traffic shifts and validating readiness before returning to normal routing improves confidence in recovery procedures and reduces downtime risk.)
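The progressive shift described in both use cases can be sketched as a staged loop that raises the primary's share of traffic, validates after each step, and falls back to the last good step if validation fails. This is a minimal illustration, not either company's actual tooling: `progressive_failback`, `set_primary_weight`, and `is_healthy` are hypothetical names, and the stage percentages are arbitrary.

```python
from typing import Callable, Iterable


def progressive_failback(
    stages: Iterable[int],
    set_primary_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> int:
    """Shift traffic back to the primary in stages.

    Returns the percentage of traffic left on the primary when the run ends.
    """
    last_good = 0  # before failback, the primary serves no traffic
    for target in stages:
        set_primary_weight(target)
        if not is_healthy():
            # Validation failed: return to the last weight that passed and stop.
            set_primary_weight(last_good)
            return last_good
        last_good = target
    return last_good


# Example run with stand-in callbacks that only record what was requested.
applied = []
final_weight = progressive_failback([10, 50, 100], applied.append, lambda: True)
print(applied, final_weight)
```

In a real system, `set_primary_weight` would call whatever controls routing (DNS weights, a load balancer, a service mesh) and `is_healthy` would wait for a soak period before reading metrics.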
Frequently Asked Questions
- What's the difference between failback and failover?
- Failover is the switch from the primary system to a standby/secondary system when there’s a problem. Failback is the planned move back to the primary system after it has been repaired and verified as stable. Failover is usually urgent; failback is usually controlled and scheduled.
- When should I use failback?
- Use failback after a failover event (or a disaster recovery (DR) activation) once the primary environment is fully restored, data is synchronized, and you’ve validated application health. It’s best done during a low-traffic window, with a rollback plan, clear success criteria (latency, error rate, replication lag), and stakeholder communication.
- How much does failback cost?
- Failback cost depends on how your DR is designed. Common cost drivers include: ongoing replication and storage in the secondary site/region, data transfer/egress charges between regions or providers, running standby compute (warm/hot standby costs more than cold), additional licensing for DR tooling, and the operational effort to test and execute failback. A controlled failback may also incur temporary double-running costs while both environments are active during validation.
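The success criteria mentioned above (latency, error rate, replication lag) can be expressed as an explicit gate that must pass before failback starts. The sketch below is illustrative only: the class and function names are hypothetical, and the default thresholds are placeholder assumptions that should be replaced with your own service-level targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    p99_latency_ms: float
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    replication_lag_s: float   # how far the primary trails the active site


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values (assumptions), not recommendations.
    max_p99_latency_ms: float = 300.0
    max_error_rate: float = 0.01
    max_replication_lag_s: float = 5.0


def failback_blockers(m: Metrics, t: Thresholds = Thresholds()) -> List[str]:
    """Return the reasons failback should not proceed; empty means ready."""
    blockers = []
    if m.p99_latency_ms > t.max_p99_latency_ms:
        blockers.append(f"p99 latency {m.p99_latency_ms} ms exceeds {t.max_p99_latency_ms} ms")
    if m.error_rate > t.max_error_rate:
        blockers.append(f"error rate {m.error_rate} exceeds {t.max_error_rate}")
    if m.replication_lag_s > t.max_replication_lag_s:
        blockers.append(f"replication lag {m.replication_lag_s} s exceeds {t.max_replication_lag_s} s")
    return blockers


# Example: a primary that is responsive but still far behind on replication.
for reason in failback_blockers(Metrics(120.0, 0.001, 30.0)):
    print("blocked:", reason)
```

Returning the list of reasons rather than a bare boolean makes the decision easy to log and to share with stakeholders.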
Category: cloud
Difficulty: intermediate