Beginner8 min readFoundations of Distributed Systems

30 Must-Know Concepts

Crucial, highly frequent distributed terminology checklist including elasticity, SLO, SLA, and SPOFs.

What you'll learn

1. Vertical Scaling (Scaling Up)
2. Horizontal Scaling (Scaling Out)
3. Service Level Agreement (SLA)
4. Service Level Objective (SLO)
5. Service Level Indicator (SLI)
6. High Availability (HA)
7. System Reliability
8. Single Point of Failure (SPOF)

TL;DR

Crucial, highly frequent distributed terminology checklist including elasticity, SLO, SLA, and SPOFs.

Visual System Topology

30 Must-Know Concepts Execution Topology

Inbound Node Ingests request

30 Must-Know Concepts Engine Processes operations

Target Replica Updates state

Concept Overview

To navigate system design challenges successfully, a core vocabulary of distributed systems terminology is mandatory. These concepts represent the primary tools engineers use to debug outages, design stateful storage clusters, and maintain five-nines availability under massive concurrent traffic.

From understanding how networking packets travel across networks to determining how databases partition transactions under latency limitations, mastering these patterns transforms standard software engineers into senior systems architects.

Key Architectural Pillars

1. Vertical Scaling (Scaling Up)

Increasing the computing capacity (such as CPU, RAM, or SSD size) of a single server machine to handle higher computational demands.

Example: Upgrading an AWS EC2 database server from a t3.medium to an r5.2xlarge instance.

2. Horizontal Scaling (Scaling Out)

Adding more commodity hardware servers to a shared resource pool managed under a dynamic load balancer, facilitating unlimited capacity expansion.

Example: Expanding a stateless web service cluster from 2 instances to 50 active container nodes during traffic spikes.

3. Service Level Agreement (SLA)

A binding legal agreement between a service provider and external clients detailing system reliability guarantees, uptime metrics, and financial penalties for failures.

Example: A cloud provider promising a service credit if monthly API endpoint availability drops below 99.9%.

4. Service Level Objective (SLO)

An internal target metric set by engineering teams to preserve service quality, serve as a reliability boundary, and guide developer error budgets.

Example: Targeting that 99% of user catalog database read requests must return in under 150ms over any rolling 30-day window.

5. Service Level Indicator (SLI)

The actual quantitative metric measured in real-time to track compliance against active SLO objectives.

Example: A real-time telemetry agent reporting that the current p99 catalog latency is 104ms.

6. High Availability (HA)

The capability of an architecture to remain operational and accessible under high-traffic spikes and hardware failures, typically measured in nines (e.g. 99.99%).

Example: A multi-region deployment keeping a service online with less than 52.56 minutes of total downtime per year.

7. System Reliability

The probability that a distributed platform executes its target functions correctly under stated conditions without errors, corruptions, or bitrot for a specified timeframe.

Example: Ensuring transaction storage records persist safely across database drives without data integrity loss.

8. Single Point of Failure (SPOF)

A central component or node in an infrastructure whose single failure triggers a complete cascading system outage.

Example: Running a monolithic primary database node without any automated failover mechanism or standby replicas.

9. Latency

The absolute round-trip time elapsed (typically measured in milliseconds) for a single network packet to travel from a client agent to a server and return.

Example: A mobile client experiencing a 45ms round-trip delay when calling a search autocomplete endpoint.

10. Throughput

The capacity of requests or operations a system safely processes per unit of time, typically measured in Requests Per Second (RPS) or Queries Per Second (QPS).

Example: An API Gateway load balancer routing up to 25,000 requests per second across worker nodes.

11. Bandwidth

The maximum rate of data transfer across a physical network connection interface, measured in bits per second (e.g. Gbps).

Example: An internal datacenter fiber backplane transferring massive files at speeds up to 100 Gbps.

12. Strong Consistency

A consistency model guaranteeing that a read operation always returns the absolute most recent write transaction, regardless of which distributed node is queried.

Example: A banking ledger balance immediately reflecting a deposit across all replica nodes before returning success.

13. Eventual Consistency

A consistency model promising that if no new updates are made, all distributed replicas will eventually synchronize and converge to the same value.

Example: A user profile picture change propagating to friends' social feeds over several seconds.

14. CAP Theorem

The fundamental constraint stating that under a network partition, a distributed system can guarantee Consistency (C) or Availability (A), but not both.

Example: A cluster choosing to block writes (Consistency) rather than allow conflicting updates (Availability) during network splits.

15. PACELC Theorem

An extension of CAP stating that even when no partition exists (E), a system must trade off response Latency (L) against data Consistency (C).

Example: Reading from a local replica to minimize response times (Latency) while accepting a small risk of reading stale data.

16. Load Balancing

Distributing inbound client network traffic dynamically across multiple backend servers to prevent compute overload.

Example: Using NGINX to partition HTTP traffic using the Round-Robin algorithm across stateless application nodes.

17. Content Delivery Network (CDN)

A network of cache proxy servers located at the network edge, closer to users, to serve media assets quickly.

Example: Caching global user interface assets on Cloudflare to bypass Origin backend server calls.

18. Domain Name System (DNS)

The global decentralized register mapping human-readable hostnames into computer-readable IP addresses.

Example: Resolving the domain name "revisealgo.com" into its target machine IP address "104.21.32.181".

19. Caching

Storing high-frequency query responses in fast temporary memory stores to bypass slow persistent disk storage blocks.

Example: Querying Redis for active user session keys before executing a SQL lookup in PostgreSQL.

20. Database Sharding

Partitioning a massive table horizontally across multiple physically distinct database instances based on a sharding key.

Example: Storing users with IDs 1-1,000,000 on Shard A, and IDs 1,000,001-2,000,000 on Shard B.

21. Read Replicas

Replicating transactions from a primary database asynchronously to auxiliary nodes, scaling out read capacity.

Example: Directing all customer analytics reporting queries exclusively to auxiliary read-only database nodes.

22. Consistent Hashing

A hash distribution strategy mapping nodes and keys onto a circular ring, minimizing key remapping when servers scale out or in.

Example: Distributing files across a caching ring so that adding Node 4 only moves 25% of existing keys.

23. Message Queues

Asynchronous event buffers that decouple publishers and consumers, safely absorbing large peak traffic ingestion spikes.

Example: Storing billing renewal events in RabbitMQ for background worker execution.

24. Publish-Subscribe (Pub/Sub)

An event-driven architectural pattern where publishers broadcast to topics, and multiple subscribers consume independently.

Example: A purchase event notifying both the Inventory system and the Shipping billing system simultaneously.

25. Circuit Breaker

A structural safety switch that intercepts calls to a failing downstream service to conserve backend threads.

Example: Tripping a gateway connection when downstream billing timeouts exceed 50%, returning fallback mock payloads.

26. Rate Limiting

Throttling client request volumes within a specific time window to protect downstream resources from exhaustion.

Example: Restricting a public API client key to a maximum of 100 requests per minute using a Token Bucket algorithm.

27. Heartbeats

Periodic lightweight signals transmitted between distributed cluster nodes to coordinate cluster status and detect silent node crashes.

Example: Worker nodes sending UDP pings every 500ms to the central master scheduler.

28. Gossip Protocol

A decentralized peer-to-peer communication framework where nodes exchange cluster status incrementally with adjacent neighbors, similar to epidemics.

Example: Cassandra clusters dynamically discovering newly booted database nodes without a central orchestrator.

29. Distributed Locks

A synchronization mechanism designed to coordinate mutual exclusion across shared resources in stateless server farms.

Example: Using Redis Redlock to ensure only one worker node can process a specific billing renewal subscription task.

30. Idempotency

A transactional guarantee ensuring that executing an API request multiple times yields the exact same outcome as running it a single time.

Example: Attaching a unique "idempotency-key" to bank API calls to prevent charging a client twice on connection retries.

More in this module

What is System Design?

Introduction to high-level architecture planning, standard requirements engineering, and modern scaling metrics.

Scalability

Deep-dive into horizontal vs. vertical scaling patterns and handling exponential resource growth curves.

Availability

Measuring high availability metrics, redundancy models, active-passive failover strategies, and "The Nines" SLA standards.