Handling Failures

TL;DR

Standard fault-recovery techniques, including retries, backoff, jitter delays, and dead-letter queue routing.

Visual System Topology

[Figure: Handling Failures execution topology. Nodes shown: Inbound Node (ingests request); Handling Failures Engine (processes operations); Target Replica (updates state).]

Concept Overview

Failure handling is the set of patterns and mechanisms a distributed system uses to detect, contain, and recover from component failures — hardware crashes, network timeouts, software bugs, and data corruption — without propagating failures to end users.

In distributed systems, the question is not if failures will occur but when, and how many at once. In a production system with 1,000 services, each with 99.9% availability, about one service is unavailable at any given moment on average (1,000 × 0.1%), assuming failures are independent. Designing for failure from the start is not pessimism; it is engineering reality.
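The arithmetic behind this kind of estimate is easy to check. The sketch below assumes failures are independent (real outages are often correlated) and uses hypothetical helper names; it computes the expected number of services down at a random instant and the probability that all 1,000 are up at once.

```python
def expected_unavailable(num_services: int, availability: float) -> float:
    """Expected number of services that are down at a random instant."""
    return num_services * (1.0 - availability)


def probability_all_up(num_services: int, availability: float) -> float:
    """Probability that every service is up at once (assumes independent failures)."""
    return availability ** num_services


down = expected_unavailable(1000, 0.999)
all_up = probability_all_up(1000, 0.999)
print(f"expected services down: {down:.2f}")
print(f"P(all services up):     {all_up:.3f}")
```

Under these assumptions, roughly one service is down at any instant, and all 1,000 are healthy at the same time only about 37% of the time.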

The key failure-handling patterns are retries with backoff (handle transient failures), circuit breakers (prevent cascading failures), timeouts (bound how long a caller waits on a failing dependency), bulkheads (isolate failure domains), dead-letter queues (handle unprocessable messages safely), and fallbacks (degrade gracefully during partial outages).

Key Architectural Pillars

Timeouts

Every outbound call (HTTP, database query, cache lookup) must have a timeout. Without timeouts, a slow downstream service can hold threads open indefinitely, exhausting the upstream service's thread pool and causing a cascade failure.

Example: Setting a 500ms timeout on all database queries ensures that slow queries cannot hold all 100 connection-pool slots for minutes.
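A minimal sketch of bounding a call with a timeout, using only the Python standard library. The function names are hypothetical and `asyncio.sleep` stands in for a real query; an actual database driver would normally be configured through its own timeout settings instead.

```python
import asyncio


async def slow_query(delay_s: float) -> str:
    """Stand-in for a database query that takes `delay_s` seconds."""
    await asyncio.sleep(delay_s)
    return "rows"


async def query_with_timeout(delay_s: float, timeout_s: float = 0.5) -> str:
    """Run the query, but give up (and cancel it) once `timeout_s` has elapsed."""
    try:
        return await asyncio.wait_for(slow_query(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The caller gets a prompt answer instead of waiting indefinitely.
        return "timed out"


fast = asyncio.run(query_with_timeout(0.01))                 # finishes in time
slow = asyncio.run(query_with_timeout(2.0, timeout_s=0.05))  # exceeds the bound
print(fast, "/", slow)
```

The important property is that the caller's wait is bounded by `timeout_s` no matter how slow the downstream call becomes.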

Retries with Exponential Backoff + Jitter

After a transient failure, retry the operation with increasing delays: 1s, 2s, 4s, 8s. Add random jitter (±500ms) to desynchronize retries from multiple clients, preventing synchronized retry storms that overwhelm recovering services.

Example: An S3 client retrying a failed upload: first retry after about 1s, second after 2.4s (2s + 400ms jitter), third after 3.7s (4s - 300ms jitter), giving up after 3 retries.
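The retry schedule described above can be sketched as follows. This is an illustration, not any particular client library's behavior: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the base delay and jitter range are taken from the numbers in the text.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt: int, base_s: float = 1.0, jitter_s: float = 0.5) -> float:
    """Delay before retry number `attempt` (1-based): base * 2^(attempt-1), +/- jitter."""
    delay = base_s * 2 ** (attempt - 1) + random.uniform(-jitter_s, jitter_s)
    return max(0.0, delay)


def call_with_retries(operation, max_retries: int = 3, sleep=time.sleep):
    """Call `operation`; on TransientError, back off and retry up to `max_retries` times."""
    for attempt in range(1, max_retries + 2):  # one initial try plus the retries
        try:
            return operation()
        except TransientError:
            if attempt > max_retries:
                raise  # retries exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt))


# Usage: an upload that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"count": 0}


def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated transient failure")
    return "uploaded"


delays = []
result = call_with_retries(flaky_upload, sleep=delays.append)
print(result, [round(d, 2) for d in delays])
```

Passing `sleep` as a parameter is a design choice for testability; production code would use the default `time.sleep` and retry only errors known to be transient.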

Circuit Breaker

A circuit breaker is a proxy that monitors the failure rate of calls to a downstream service. When the failure rate exceeds a threshold (e.g., 50% of calls fail in 10 seconds), the circuit "trips open" and subsequent calls fail fast without reaching the downstream service. After a cooldown period, the circuit enters a "half-open" state and lets a few trial requests through: if they succeed, the circuit closes and normal traffic resumes; if they fail, it opens again.

Example: Hystrix (Netflix) circuit breaker: if 50% of calls to the "recommendations service" fail in a 10-second window, all subsequent calls for the next 5 seconds return a cached fallback immediately.
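The closed, open, and half-open states can be sketched in a few lines. This is a simplified, single-threaded illustration and not Hystrix's implementation: it trips on a count of consecutive failures rather than a failure rate over a time window, and the class, parameter names, and injectable clock are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast; downstream not called")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed half-open trial, or too many failures while closed, (re)opens.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0      # any success closes the circuit
        self.opened_at = None
        return result


# Usage with a fake clock so the cooldown can be simulated instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=5.0, clock=lambda: now[0])


def failing():
    raise RuntimeError("downstream error")


for _ in range(2):
    try:
        breaker.call(failing)
    except RuntimeError:
        pass
state_after_failures = breaker.state      # "open"

try:
    breaker.call(lambda: "ok")
    failed_fast = False
except CircuitOpenError:
    failed_fast = True                     # rejected without running the operation

now[0] = 6.0                               # cooldown has elapsed
state_after_cooldown = breaker.state      # "half-open"
recovered = breaker.call(lambda: "ok")    # trial request succeeds
state_after_success = breaker.state       # "closed"
```

A production breaker would also need thread safety and a limit on how many trial requests pass through while half-open.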

Bulkheads

Bulkheads partition resource pools (thread pools, connection pools, memory) so that overload in one consumer cannot starve the others. The pattern is named after the ship compartments that prevent one hull breach from sinking the whole vessel.

Example: The "search service" has a dedicated 50-connection pool to the database, separate from the "checkout service" 20-connection pool. Search overload cannot starve checkout transactions.
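One way to sketch a bulkhead is a separate bounded semaphore per consumer, so each consumer can only exhaust its own slots. The `Bulkhead` class and the tiny pool sizes below are hypothetical and chosen only to keep the demonstration short.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a consumer has used up its own slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Reject immediately rather than queueing behind an overloaded consumer.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} pool exhausted")
        try:
            yield
        finally:
            self._slots.release()


search = Bulkhead("search", max_concurrent=2)
checkout = Bulkhead("checkout", max_concurrent=1)

# Saturate the search pool, then show that checkout is unaffected.
with search.slot(), search.slot():
    try:
        with search.slot():
            search_rejected = False
    except BulkheadFullError:
        search_rejected = True
    with checkout.slot():
        checkout_served = True

print(search_rejected, checkout_served)
```

Because each pool is sized and exhausted independently, a flood of search traffic is rejected at the search bulkhead while checkout keeps its reserved capacity.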

Dead Letter Queue (DLQ)

When a message queue consumer fails to process a message after N retries, the message is moved to a separate "dead letter queue" rather than being discarded. Engineers can inspect DLQ messages to diagnose bugs and replay them after fixing the issue.

Example: An SQS queue for order processing. Messages that fail 3 times are sent to orders-dlq. An alarm fires when DLQ depth exceeds 10.
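The retry-then-park behavior can be illustrated with an in-memory simulation. This does not use the SQS API; the `drain` function, the message dictionary layout, and the handler are assumptions made for the sketch, with the limit of 3 receives taken from the example above.

```python
from collections import deque

MAX_RECEIVES = 3


def drain(queue, dead_letters, handler, max_receives=MAX_RECEIVES):
    """Process every message; park those that fail `max_receives` times in the DLQ."""
    processed = []
    while queue:
        message = queue.popleft()
        message["receive_count"] += 1
        try:
            handler(message["body"])
            processed.append(message["body"])
        except Exception as exc:
            if message["receive_count"] >= max_receives:
                message["last_error"] = repr(exc)  # keep context for later diagnosis
                dead_letters.append(message)
            else:
                queue.append(message)              # redeliver later
    return processed


def handle_order(body):
    """Hypothetical handler that rejects malformed orders."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


orders = deque([
    {"body": {"order_id": 1}, "receive_count": 0},
    {"body": {"sku": "abc"}, "receive_count": 0},  # malformed: will never succeed
])
orders_dlq = deque()

done = drain(orders, orders_dlq, handle_order)
print(done, len(orders_dlq))
```

The malformed message is attempted three times and then set aside with its last error attached, so it neither blocks the queue nor disappears silently.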

Fallback & Graceful Degradation

When a downstream service fails, provide a fallback response: cached data, default values, or a simplified version of the functionality. This keeps the user experience acceptable even during partial outages.

Example: When the personalization service is unavailable, Netflix shows the generic top-10 trending list instead of crashing the homepage.
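A fallback can be as simple as catching the failure and returning a precomputed default. The sketch below is illustrative only: the function names, the response shape, and the trending list are hypothetical, and the personalization call is hard-coded to fail to simulate an outage.

```python
TRENDING_TOP_10 = [f"trending-title-{rank}" for rank in range(1, 11)]


def fetch_personalized(user_id: str) -> list:
    """Stand-in for a call to the personalization service; always fails here."""
    raise ConnectionError("personalization service unavailable")


def homepage_rows(user_id: str, fetch=fetch_personalized) -> dict:
    """Return personalized rows when possible, otherwise the generic trending list."""
    try:
        return {"source": "personalized", "items": fetch(user_id)}
    except (ConnectionError, TimeoutError):
        # Degrade gracefully: a generic page is better than an error page.
        return {"source": "fallback", "items": TRENDING_TOP_10}


degraded = homepage_rows("user-42")
healthy = homepage_rows("user-42", fetch=lambda user_id: ["picked-for-you"])
print(degraded["source"], healthy["source"])
```

Recording which source served the response, as the `source` field does here, makes it possible to monitor how often users are seeing the degraded experience.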
