Handling Failures
Standard fault recovery techniques, including retries, backoff, jitter, and dead-letter queue routing.
What you'll learn
- Timeouts
- Retries with Exponential Backoff + Jitter
- Circuit Breaker
- Bulkheads
- Dead Letter Queue (DLQ)
- Fallback & Graceful Degradation
TL;DR
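Assume failures will happen. Put a timeout on every outbound call, retry transient failures with exponential backoff and jitter, trip a circuit breaker when a dependency keeps failing, isolate resource pools with bulkheads, route unprocessable messages to a dead letter queue, and fall back to cached or default responses during partial outages.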
Visual System Topology
[Figure: Handling Failures execution topology]
Concept Overview
Failure handling is the set of patterns and mechanisms a distributed system uses to detect, contain, and recover from component failures — hardware crashes, network timeouts, software bugs, and data corruption — without propagating failures to end users.
In distributed systems, the question is not whether failures will occur but when, and how many at once. In a production system with 1,000 services, each with 99.9% availability, one service is unavailable on average at any given moment, and (assuming independent failures) all 1,000 are up simultaneously only about 37% of the time. Designing for failure from the start is not pessimism; it is engineering reality.
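The arithmetic behind availability figures like these can be checked directly. The sketch below assumes services fail independently of one another, which real systems rarely satisfy exactly; the function names are illustrative.

```python
def expected_unavailable(n_services: int, availability: float) -> float:
    """Expected number of services that are down at a random instant."""
    return n_services * (1.0 - availability)


def prob_all_available(n_services: int, availability: float) -> float:
    """Probability that every service is up at once (independence assumed)."""
    return availability ** n_services


def downtime_minutes_per_day(availability: float) -> float:
    """Average downtime of a single service, in minutes per day."""
    return (1.0 - availability) * 24 * 60


if __name__ == "__main__":
    print(expected_unavailable(1000, 0.999))
    print(prob_all_available(1000, 0.999))
    print(downtime_minutes_per_day(0.999))
```

With 1,000 services at 99.9% each, the expected number of unavailable services at any instant is about one, and each individual service is down for roughly a minute and a half per day.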
The key failure handling patterns are: Timeouts (bound how long a caller waits on a slow dependency), Retries with backoff (handle transient failures), Circuit Breakers (prevent cascading failures), Bulkheads (isolate failure domains), Dead Letter Queues (handle unprocessable messages safely), and Fallbacks (keep the user experience acceptable during partial outages).
Key Architectural Pillars
Timeouts
Every outbound call (HTTP, database query, cache lookup) must have a timeout. Without timeouts, a slow downstream service can keep threads blocked indefinitely, exhausting the upstream service's thread pool and causing a cascading failure.
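A minimal sketch of bounding an outbound call, using Python's asyncio. The helper name call_with_timeout and the make_call parameter are illustrative assumptions; blocking client libraries usually expose their own timeout setting instead, which should be preferred where available.

```python
import asyncio


async def call_with_timeout(make_call, timeout_s: float):
    """Await make_call(), giving up after timeout_s seconds.

    make_call is a hypothetical zero-argument coroutine function that
    performs the outbound call (HTTP request, query, cache lookup).
    """
    try:
        # wait_for cancels the awaited call when the timeout expires,
        # so the caller is released instead of waiting indefinitely.
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout_s}s") from None


async def slow_dependency():
    """Stand-in for a downstream service that responds too slowly."""
    await asyncio.sleep(1.0)
    return "late response"


if __name__ == "__main__":
    try:
        asyncio.run(call_with_timeout(slow_dependency, timeout_s=0.1))
    except TimeoutError as exc:
        print("gave up:", exc)
```

The timeout value is a design choice: too short and healthy but slow calls are abandoned, too long and blocked callers pile up before the failure is noticed.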
Retries with Exponential Backoff + Jitter
After a transient failure, retry the operation with exponentially increasing delays, for example 1s, 2s, 4s, 8s, and cap the number of attempts. Add random jitter (e.g., ±500ms) to desynchronize retries from multiple clients, preventing synchronized retry storms that overwhelm a recovering service.
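The retry schedule described above can be sketched as follows. TransientError, backoff_delays, and retry are hypothetical names introduced for illustration; a real client would decide which of its own exceptions count as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(base_s=1.0, factor=2.0, attempts=4, jitter_s=0.5, rng=random):
    """Yield delays of 1s, 2s, 4s, 8s, each shifted by up to ±jitter_s."""
    for i in range(attempts):
        delay = base_s * factor ** i
        yield max(0.0, delay + rng.uniform(-jitter_s, jitter_s))


def retry(operation, *, sleep=time.sleep, **backoff_kwargs):
    """Call operation(); on TransientError wait, then try again.

    Once the delay schedule is exhausted, the last error is re-raised.
    """
    delays = backoff_delays(**backoff_kwargs)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the failure
            sleep(delay)


if __name__ == "__main__":
    attempts = []

    def flaky():
        attempts.append(1)
        if len(attempts) < 3:
            raise TransientError("temporary glitch")
        return "success"

    # Shortened delays so the demo finishes quickly.
    print(retry(flaky, base_s=0.01, jitter_s=0.005))
```

Retrying is only safe when the operation is idempotent or the failure is known to have happened before any side effect took place.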
Circuit Breaker
A proxy that monitors the failure rate of calls to a downstream service. When the failure rate exceeds a threshold (e.g., 50% of calls fail within 10 seconds), the circuit "trips open" and subsequent calls fail fast without reaching the downstream service. After a cooldown period, the circuit enters a "half-open" state and lets a few test requests through: if they succeed, the circuit closes and normal traffic resumes; if they fail, it opens again.
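A minimal, single-threaded sketch of the closed, open, and half-open states. The class and parameter names are assumptions, the min_calls guard is an added safeguard against tripping on a tiny sample, and the half-open state here is simplified to a single trial call rather than a few.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_s=10.0, min_calls=4,
                 cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.min_calls = min_calls
        self.cooldown_s = cooldown_s
        self._clock = clock
        self.state = "closed"
        self._results = []      # (timestamp, succeeded) within the window
        self._opened_at = 0.0

    def call(self, operation):
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast; downstream not called")
            self.state = "half-open"  # cooldown over: allow a trial call
        try:
            result = operation()
        except Exception:
            self._record(now, False)
            raise
        self._record(now, True)
        return result

    def _record(self, now, ok):
        if self.state == "half-open":
            # The trial call decides: close on success, reopen on failure.
            if ok:
                self.state = "closed"
                self._results.clear()
            else:
                self._trip(now)
            return
        # Keep only results inside the sliding window, then add this one.
        self._results = [(t, r) for t, r in self._results
                         if now - t <= self.window_s]
        self._results.append((now, ok))
        failures = sum(1 for _, r in self._results if not r)
        if (len(self._results) >= self.min_calls
                and failures / len(self._results) > self.failure_threshold):
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._results.clear()
```

A production breaker would also need locking for concurrent callers and would usually report state changes to monitoring.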
Bulkheads
Partitioning resource pools (thread pools, connection pools, memory) so that overload in one consumer cannot starve all other consumers. Named after ship compartments that prevent one hull breach from sinking the whole vessel.
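One way to sketch a bulkhead is a separate concurrency limit per dependency, so that calls to one saturated dependency are rejected while others keep their own capacity. The Bulkhead class and the "payments" and "search" names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Reject immediately rather than queueing, so an overloaded
        # dependency cannot tie up callers that are waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


if __name__ == "__main__":
    payments = Bulkhead("payments", max_concurrent=1)
    search = Bulkhead("search", max_concurrent=1)

    def while_payments_is_busy():
        # The single payments slot is held by this call, so a second
        # payments call is rejected, but search is unaffected.
        try:
            payments.run(lambda: None)
        except BulkheadFullError as exc:
            print("rejected:", exc)
        return search.run(lambda: "search results")

    print(payments.run(while_payments_is_busy))
```

Sizing each compartment is a trade-off: smaller pools contain failures more tightly but leave less headroom for legitimate bursts.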
Dead Letter Queue (DLQ)
When a message queue consumer fails to process a message after N retries, the message is moved to a separate "dead letter queue" rather than being discarded. Engineers can inspect DLQ messages to diagnose bugs and replay them after fixing the issue.
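The routing logic can be sketched with in-memory queues. This is an illustration under simplifying assumptions: real brokers typically provide dead-letter routing as a configuration option, and the Message, consume, and replay names are hypothetical. Here max_attempts counts total processing attempts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    attempts: int = 0
    last_error: str = ""


def consume(queue: deque, dead_letters: deque, handler, max_attempts: int = 3):
    """Drain queue; messages that fail max_attempts times go to dead_letters."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg.body)
        except Exception as exc:
            msg.attempts += 1
            msg.last_error = repr(exc)  # kept for later diagnosis
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)  # park it instead of discarding
            else:
                queue.append(msg)  # requeue for another attempt


def replay(dead_letters: deque, queue: deque):
    """Move parked messages back to the main queue after a fix is deployed."""
    while dead_letters:
        msg = dead_letters.popleft()
        msg.attempts = 0
        queue.append(msg)


if __name__ == "__main__":
    main_queue = deque([Message("1"), Message("not-a-number"), Message("2")])
    dlq = deque()
    consume(main_queue, dlq, handler=lambda body: int(body))
    for dead in dlq:
        print(dead.body, dead.attempts, dead.last_error)
```

Recording the last error alongside the message is what makes the dead letter queue useful for diagnosis rather than just a holding area.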
Fallback & Graceful Degradation
When a downstream service fails, provide a fallback response: cached data, default values, or a simplified version of the functionality. This keeps the user experience acceptable even during partial outages.
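A minimal sketch of the fallback chain described above: try the live call, then the last cached value, then a default. The with_fallback helper and its parameters are hypothetical, and the returned source label is an added convenience so callers can tell when they are serving degraded data.

```python
def with_fallback(primary, cache: dict, key, default):
    """Return (value, source) where source is "live", "cached", or "default".

    primary is a hypothetical callable that fetches the value for key
    from the downstream service and raises on failure.
    """
    try:
        value = primary(key)
    except Exception:
        if key in cache:
            return cache[key], "cached"   # stale but usable
        return default, "default"         # simplified experience
    cache[key] = value  # refresh the fallback copy on every success
    return value, "live"


if __name__ == "__main__":
    cache = {}
    healthy = lambda key: f"recommendations for {key}"

    def failing(key):
        raise ConnectionError("downstream unavailable")

    print(with_fallback(healthy, cache, "user-1", default=[]))
    print(with_fallback(failing, cache, "user-1", default=[]))
    print(with_fallback(failing, cache, "user-2", default=[]))
```

Whether stale data is acceptable depends on the feature: a cached recommendation list is usually fine, while a cached account balance may not be.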
