Reliability

TL;DR

Reliability means building fault-tolerant systems that keep transactional operations correct through node crashes and hardware degradation.

Visual System Topology

Figure: Reliability execution topology. Inbound Node (ingests request) -> Reliability Engine (processes operations) -> Target Replica (updates state).

Concept Overview

Reliability is the probability that a system performs its required function correctly over a specified time period under given conditions. Availability measures uptime; reliability measures correctness. A system that is always "up" but frequently returns wrong results or corrupts data has high availability but low reliability.

Building reliable distributed systems requires acknowledging that everything fails: disks corrupt data, networks drop packets, cosmic rays flip bits and make hardware produce wrong results, and software has bugs. Reliability engineering is the discipline of designing systems that detect failures, limit their blast radius, recover automatically, and avoid propagating corrupted data.

The practice is formalized as Site Reliability Engineering (SRE), pioneered by Google. Key tools include error budgets, service level objectives (SLOs), chaos engineering, and a postmortem culture.
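To make the error-budget idea concrete, here is a minimal sketch of the arithmetic. It assumes the SLO is expressed as a fraction of "good" events (for example, correct responses) over some window; the function names `error_budget` and `budget_remaining` are hypothetical and not taken from any SRE tooling.

```python
def error_budget(slo: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (1.0 - slo) * total_events


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - bad_events / error_budget(slo, total_events)
```

With a 99.9% SLO over 1,000,000 requests, the budget is about 1,000 bad requests; after 250 bad requests, roughly three quarters of the budget remains.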

Key Architectural Pillars

Fault Tolerance

The ability of a system to continue operating correctly even when some components fail. Achieved through redundancy, graceful degradation, and isolation. A fault-tolerant system masks internal component failures wherever it can, so that they do not surface as user-visible errors.

Example: A payment service that falls back to a cached exchange rate if the live rate API times out, rather than failing the transaction.
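The cached-rate fallback above could be sketched as follows. This is an illustration under stated assumptions, not a production design: `RateProvider`, `fetch_live`, and the 300-second staleness limit are hypothetical, and the live API is assumed to signal failure by raising `TimeoutError` or `ConnectionError`.

```python
import time
from typing import Callable, Dict, Tuple


class RateProvider:
    """Serves live exchange rates, falling back to a recent cached value."""

    def __init__(
        self,
        fetch_live: Callable[[str], float],
        max_stale_seconds: float = 300.0,  # assumed staleness limit
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_live = fetch_live
        self._max_stale = max_stale_seconds
        self._clock = clock
        # currency pair -> (rate, time the rate was stored)
        self._cache: Dict[str, Tuple[float, float]] = {}

    def get_rate(self, pair: str) -> float:
        try:
            rate = self._fetch_live(pair)
        except (TimeoutError, ConnectionError):
            cached = self._cache.get(pair)
            if cached is None:
                raise  # nothing to fall back to
            cached_rate, stored_at = cached
            if self._clock() - stored_at > self._max_stale:
                raise  # cached value is too old to trust
            return cached_rate
        # Live call succeeded: refresh the cache for future fallbacks.
        self._cache[pair] = (rate, self._clock())
        return rate
```

The staleness limit matters: serving an arbitrarily old rate would trade an obvious failure for a silent correctness problem, which is the opposite of reliability.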

Graceful Degradation

When a subset of functionality is unavailable, the system continues serving the remaining functionality rather than failing completely. Core flows remain unaffected while auxiliary features degrade.

Example: Netflix disabling personalized recommendations (complex ML inference) during a cache outage, but still serving video playback normally.
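A minimal sketch of the same pattern, assuming a page made of one core component and one auxiliary component. The names `build_home_page`, `get_playback_catalog`, and `get_recommendations` are hypothetical and do not describe Netflix's actual implementation.

```python
from typing import Callable, Dict, List, Union

Page = Dict[str, Union[List[str], bool]]


def build_home_page(
    get_playback_catalog: Callable[[], List[str]],
    get_recommendations: Callable[[], List[str]],
) -> Page:
    """Assemble a page where only the core component is mandatory."""
    # Core flow: if this fails, the exception propagates to the caller.
    catalog = get_playback_catalog()

    # Auxiliary flow: any failure degrades the page instead of failing it.
    # A real system would likely catch narrower exceptions and log them.
    try:
        recommendations = get_recommendations()
        degraded = False
    except Exception:
        recommendations = []
        degraded = True

    return {
        "catalog": catalog,
        "recommendations": recommendations,
        "degraded": degraded,
    }
```

Returning an explicit `degraded` flag lets the caller and monitoring distinguish "no recommendations available" from "recommendation service is down".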

Retry with Exponential Backoff + Jitter

Transient network failures are unavoidable. Clients should retry failed requests, but with exponentially increasing delays and random jitter to avoid synchronized retry storms that overwhelm recovering services.

Example: Wait roughly 1s, 2s, 4s, then 8s between attempts, each with ±500 ms of random jitter. Give up after 5 attempts (the initial call plus 4 retries) and return an error to the user.
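The retry schedule in this example can be sketched as below. The helper names `backoff_delays` and `call_with_retries` are hypothetical, and the sketch assumes that transient failures surface as `TimeoutError` or `ConnectionError`; other error types are not retried.

```python
import random
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(
    max_attempts: int = 5, base: float = 1.0, jitter: float = 0.5, rng=random
) -> Iterator[float]:
    """Yield the wait before each retry: base * 2**i, plus or minus jitter."""
    for i in range(max_attempts - 1):
        yield max(0.0, base * (2 ** i) + rng.uniform(-jitter, jitter))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    jitter: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Run operation, retrying transient failures with backoff and jitter."""
    delays = backoff_delays(max_attempts, base, jitter)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(next(delays))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a testing convenience: a test can record the requested delays instead of actually waiting.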

Idempotency for Safe Retries

Making operations idempotent (repeating them produces the same result as performing them once) enables safe retries without duplicated side effects. Achieved using unique idempotency keys or by relying on operations that are naturally idempotent (such as PUT and DELETE in HTTP).

Example: Including a unique `X-Idempotency-Key: charge_xyz_789` header in payment requests so retries don't double-charge customers.
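On the server side, honoring such a key might look like the sketch below. It is a simplified, in-memory illustration: `PaymentProcessor` and its fields are hypothetical, and a real service would need a durable store shared by all instances rather than a process-local dictionary.

```python
import threading
from typing import Dict


class PaymentProcessor:
    """Remembers results by idempotency key so retries do not charge twice."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self._lock = threading.Lock()
        self.charges_made = 0  # counts real charges, for illustration

    def charge(self, idempotency_key: str, customer_id: str, amount_cents: int) -> dict:
        with self._lock:
            # A retry with a known key returns the stored result unchanged.
            previous = self._results.get(idempotency_key)
            if previous is not None:
                return previous

            # First time this key is seen: perform the charge once.
            self.charges_made += 1
            result = {
                "charge_id": f"ch_{self.charges_made}",
                "customer_id": customer_id,
                "amount_cents": amount_cents,
                "status": "succeeded",
            }
            self._results[idempotency_key] = result
            return result
```

Calling `charge("charge_xyz_789", "cust_1", 500)` twice returns the same charge record and increments `charges_made` only once.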

Bulkheads (Resource Isolation)

Isolating resource pools (thread pools, connection pools, memory) per service so that overload in one consumer does not exhaust resources for others. Named after ship compartments that prevent a single breach from sinking the whole vessel.

Example: Giving the "checkout service" a dedicated database connection pool of 50 connections, separate from the "analytics service" pool.
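One way to sketch a bulkhead is a separate concurrency limit per consumer, here built on `threading.BoundedSemaphore`. The `Bulkhead` class and `BulkheadFullError` are hypothetical names, and the sketch rejects work immediately when a compartment is full rather than queueing it.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class BulkheadFullError(RuntimeError):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent work for one consumer, independent of all others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Fail fast instead of waiting: a full compartment rejects new work.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()
```

With `checkout = Bulkhead("checkout", 50)` and `analytics = Bulkhead("analytics", 10)`, a burst of analytics queries can exhaust only the analytics slots; code wrapped in `with checkout.slot():` is unaffected. The analytics limit of 10 is an assumed value for illustration.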

Chaos Engineering

Proactively injecting failures in production (or staging) to discover reliability weaknesses before they cause real outages. Netflix's Chaos Monkey randomly terminates production EC2 instances to test whether services recover automatically.

Example: Running a weekly scheduled chaos experiment that terminates one random app server and verifies the system remains healthy.
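The shape of such an experiment can be sketched as below. This is not Chaos Monkey's implementation: `run_chaos_experiment`, `terminate`, and `is_healthy` are hypothetical hooks that a real setup would wire to its own orchestration and health checks, and scheduling the weekly run is left to an external scheduler.

```python
import random
from typing import Callable, Collection, Dict, Union


def run_chaos_experiment(
    servers: Collection[str],
    terminate: Callable[[str], None],
    is_healthy: Callable[[], bool],
    rng=random,
) -> Dict[str, Union[str, bool]]:
    """Terminate one random server and report whether the system stays healthy."""
    if len(servers) < 2:
        raise ValueError("refusing to terminate the only server")
    # Verify the steady state first, so the experiment never worsens an outage.
    if not is_healthy():
        raise RuntimeError("system is unhealthy before the experiment; aborting")

    victim = rng.choice(sorted(servers))  # sorted copy: stable, indexable
    terminate(victim)
    return {"victim": victim, "healthy_after": is_healthy()}
```

Checking health before injecting the failure is the important design choice here: an experiment that starts from an unhealthy system tells you nothing and may make an incident worse.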
