30 Must-Know Concepts
A checklist of essential, frequently used distributed systems terms, including elasticity, SLOs, SLAs, and SPOFs.
What you'll learn
- 1. Vertical Scaling (Scaling Up)
- 2. Horizontal Scaling (Scaling Out)
- 3. Service Level Agreement (SLA)
- 4. Service Level Objective (SLO)
- 5. Service Level Indicator (SLI)
- 6. High Availability (HA)
- 7. System Reliability
- 8. Single Point of Failure (SPOF)
Concept Overview
To navigate system design challenges successfully, a core vocabulary of distributed systems terminology is mandatory. Engineers use these concepts to debug outages, design stateful storage clusters, and sustain availability targets such as five nines (99.999%) under heavy concurrent traffic.
From how packets travel across a network to how databases partition data under latency constraints, these concepts underpin the judgment expected of senior systems architects.
Key Architectural Pillars
1. Vertical Scaling (Scaling Up)
Increasing the capacity of a single server (for example CPU, RAM, or disk) to handle higher load. It is simple to apply, but it is capped by the largest machine available and leaves that server as a single point of failure.
2. Horizontal Scaling (Scaling Out)
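Adding more servers, often commodity hardware, to a pool behind a load balancer so that capacity grows with the number of machines. Growth is not unlimited in practice: shared state, coordination overhead, and the load balancer itself can become bottlenecks.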
3. Service Level Agreement (SLA)
A contractual commitment between a service provider and its customers that specifies reliability guarantees such as uptime, together with the remedies (typically financial penalties or service credits) owed when those guarantees are missed.
4. Service Level Objective (SLO)
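An internal target for a service metric, set by engineering teams and usually stricter than the SLA (for example, 99.9% of requests succeed over 30 days). The gap between the target and 100% is the error budget, which guides how much risk the team can take with releases.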
5. Service Level Indicator (SLI)
The quantitative measurement of actual service behavior, such as the fraction of successful requests or a latency percentile, that is tracked continuously and compared against the SLO.
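As a rough illustration of how an SLI is compared against an SLO, the sketch below computes an availability SLI from request counts and the share of the error budget that remains. The function names and sample numbers are assumptions made for this example; they do not come from any particular monitoring tool.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (the measured SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo: float) -> float:
    """Share of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% SLO (or zero traffic) leaves no budget to spend.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical window: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(999_500, 1_000_000, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

In this hypothetical window the service meets its SLO and has spent about half of its budget.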
6. High Availability (HA)
The ability of an architecture to remain operational and accessible despite component failures and traffic spikes, typically expressed as a percentage of uptime counted in nines (e.g. 99.99%).
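To make the "nines" concrete, the following sketch converts an availability target into the downtime it allows per year. It assumes a 365-day year, and the helper name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Downtime permitted in the period while still meeting the target."""
    return (1.0 - availability) * period_seconds


for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target:.3%} availability -> {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the allowance by a factor of ten: four nines leaves roughly 52.6 minutes per year, five nines roughly 5.3 minutes.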
7. System Reliability
The probability that a system performs its intended functions correctly, without errors, data corruption, or bit rot, under stated conditions for a specified period of time.
8. Single Point of Failure (SPOF)
A component or node whose failure alone brings down the entire system, because no redundant component can take over its role.
9. Latency
The time elapsed between sending a request and receiving its response, typically measured in milliseconds. Network round-trip time (RTT) is one component of it; queuing and server processing time add to the total.
10. Throughput
The number of requests or operations a system processes per unit of time, typically measured in Requests Per Second (RPS) or Queries Per Second (QPS).
11. Bandwidth
The maximum rate of data transfer across a network link, measured in bits per second (e.g. Gbps). It sets an upper bound on the throughput the link can carry.
12. Strong Consistency
A consistency model guaranteeing that a read always returns the result of the most recent completed write, regardless of which node is queried.
13. Eventual Consistency
A consistency model promising that if no new updates are made, all distributed replicas will eventually synchronize and converge to the same value.
14. CAP Theorem
The fundamental constraint stating that under a network partition, a distributed system can guarantee Consistency (C) or Availability (A), but not both.
15. PACELC Theorem
An extension of CAP: if there is a Partition (P), a system trades off Availability (A) against Consistency (C); Else (E), when the network is healthy, it trades off Latency (L) against Consistency (C).
16. Load Balancing
Distributing inbound client network traffic dynamically across multiple backend servers to prevent compute overload.
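One of the simplest distribution policies is round robin, where requests are handed to each backend in turn. The sketch below is a minimal in-memory version; the class name and backend labels are hypothetical, and production balancers typically add health checks and weighting.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        """Return the backend that should receive the next request."""
        return next(self._cycle)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
picks = [balancer.next_backend() for _ in range(6)]
print(picks)
```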
17. Content Delivery Network (CDN)
A geographically distributed network of caching proxy servers located at the network edge, close to users, that serves media and other cacheable content with lower latency.
18. Domain Name System (DNS)
The global, hierarchical, distributed naming system that maps human-readable hostnames to IP addresses.
19. Caching
Storing frequently requested data in a fast, usually in-memory, store so that repeated reads avoid slower backends such as disk-based databases.
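A common way to apply this is the cache-aside pattern with least-recently-used (LRU) eviction. The sketch below is one possible in-memory version; `LRUCache`, `read_through`, and `load_from_database` are hypothetical names, and the "database" is a stand-in function that records each call.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry


backend_calls = []  # records every trip to the slow backend


def load_from_database(key):
    """Stand-in for a slow persistent lookup (assumption for this sketch)."""
    backend_calls.append(key)
    return f"row-for-{key}"


def read_through(cache: LRUCache, key):
    """Cache-aside read: try the cache first, fall back to the backend."""
    value = cache.get(key)
    if value is None:
        value = load_from_database(key)
        cache.put(key, value)
    return value


cache = LRUCache(capacity=2)
read_through(cache, "a")
read_through(cache, "a")  # served from the cache, no second backend call
print(backend_calls)
```

The sketch treats `None` as a cache miss, so it cannot cache `None` values; a sentinel object would lift that restriction.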
20. Database Sharding
Horizontally partitioning a large table across multiple physically distinct database instances, with each row assigned to a shard according to its sharding key.
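The simplest routing scheme hashes the sharding key and takes the result modulo the shard count. The sketch below assumes string keys and a fixed number of shards; `shard_for` is a hypothetical helper. It uses a stable hash rather than Python's built-in `hash()`, whose output for strings varies between processes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in ("user-1", "user-2", "user-3"):
    print(user_id, "-> shard", shard_for(user_id, num_shards=4))
```

With plain modulo routing, changing `num_shards` reassigns most keys, which is the problem consistent hashing is designed to reduce.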
21. Read Replicas
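Auxiliary database nodes that receive transactions replicated from a primary database, typically asynchronously, and serve read queries to scale out read capacity. Because replication is asynchronous, a replica can lag behind the primary and return stale data.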
22. Consistent Hashing
A hash distribution strategy that maps both nodes and keys onto a circular ring and assigns each key to the next node along the ring, so that only a small fraction of keys is remapped when servers are added or removed.
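The ring can be sketched with a sorted list of hash points and a binary search. The version below is a minimal illustration, not a reference implementation: the `HashRing` class, the use of SHA-256, and the choice of 100 virtual nodes per server are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes smooth out the distribution
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        """Return the owner of the first ring point after the key's hash."""
        if not self._points:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("cache-d")
moved = [k for k in keys if ring.get_node(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding one node")
```

Every key that moves lands on the newly added node; keys owned by the other servers stay where they were.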
23. Message Queues
Asynchronous buffers that decouple producers from consumers and absorb traffic spikes by holding messages until consumers are ready to process them.
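The decoupling can be illustrated inside a single process with Python's standard `queue.Queue`, which here stands in for a real message broker. The `run_pipeline` function and the doubling "work" are assumptions made for this sketch.

```python
import queue
import threading


def run_pipeline(events, maxsize: int = 100):
    """Push events through a bounded buffer to a single consumer thread."""
    buffer = queue.Queue(maxsize=maxsize)
    processed = []

    def consumer():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: the producer is finished
                break
            processed.append(item * 2)  # placeholder for real work

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in events:
        buffer.put(event)  # blocks while the buffer is full (backpressure)
    buffer.put(None)
    worker.join()
    return processed


print(run_pipeline(range(5)))
```

The producer never calls the consumer directly; it only sees the buffer, so a slow consumer delays processing without losing messages.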
24. Publish-Subscribe (Pub/Sub)
An event-driven architectural pattern where publishers broadcast messages to topics, and every subscriber to a topic receives its own copy and consumes it independently of the others.
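The fan-out behavior is what separates this pattern from a plain queue. A minimal in-memory sketch follows; the `Broker` class is hypothetical, and it delivers synchronously where a real broker would deliver asynchronously.

```python
from collections import defaultdict


class Broker:
    """Toy topic broker: every subscriber of a topic gets every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message) -> int:
        """Deliver to all subscribers of the topic; return the delivery count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
billing_inbox, email_inbox = [], []
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", email_inbox.append)
broker.publish("orders", {"order_id": 1})
print(billing_inbox, email_inbox)
```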
25. Circuit Breaker
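A safety switch that wraps calls to a downstream service. After repeated failures it opens and rejects further calls immediately instead of letting them wait and tie up threads, then allows a trial call after a cooldown to check whether the service has recovered.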
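A minimal sketch of the usual three-state breaker (closed, open, half-open) is shown below. The class name, the thresholds, and the injectable `clock` parameter are assumptions for illustration; the sketch is also not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open_failure = self._opened_at is not None
            if half_open_failure or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each downstream call, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is whatever function talks to the remote service, and handle `CircuitOpenError` with a fallback.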
26. Rate Limiting
Throttling client request volumes within a specific time window to protect downstream resources from exhaustion.
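The "specific time window" idea maps most directly onto a fixed-window counter, sketched below. Token buckets and sliding windows are common alternatives that smooth out bursts at window boundaries. The class name and the injectable `clock` are assumptions for this example.

```python
import time


class FixedWindowRateLimiter:
    """Allows at most `limit` requests per client in each time window."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counts = {}  # client_id -> (window_index, request_count)

    def allow(self, client_id: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        seen_window, count = self._counts.get(client_id, (window, 0))
        if seen_window != window:
            count = 0  # a new window started: reset the counter
        if count >= self.limit:
            self._counts[client_id] = (window, count)
            return False
        self._counts[client_id] = (window, count + 1)
        return True


limiter = FixedWindowRateLimiter(limit=2, window_seconds=60)
print([limiter.allow("client-1") for _ in range(3)])
```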
27. Heartbeats
Periodic lightweight signals transmitted between distributed cluster nodes to coordinate cluster status and detect silent node crashes.
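On the receiving side, crash detection usually reduces to a timeout check on the last heartbeat seen from each node. The sketch below shows that check; `HeartbeatMonitor`, the node names, and the injectable `clock` are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that went quiet."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def record_heartbeat(self, node_id: str) -> None:
        self._last_seen[node_id] = self._clock()

    def suspected_dead(self):
        """Nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("node-1")
print(monitor.suspected_dead())
```

A missed heartbeat only makes a node suspected: a slow network looks the same as a crash from the monitor's point of view, so the timeout trades detection speed against false alarms.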
28. Gossip Protocol
A decentralized peer-to-peer communication protocol in which each node periodically exchanges cluster state with a few peers, typically chosen at random, so that information spreads through the cluster the way an epidemic does.
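The epidemic-style spread can be seen in a small simulation. The sketch below models push gossip, where every informed node tells one randomly chosen peer per round; the function name, the seed, and the round cap are assumptions, and real protocols also handle membership changes and state merging.

```python
import random


def simulate_gossip(num_nodes: int, seed: int = 0, max_rounds: int = 1000):
    """Return (rounds_used, nodes_informed) for a single rumor from node 0."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes and rounds < max_rounds:
        rounds += 1
        newly_informed = set()
        for node in informed:
            # Pick a random peer other than the node itself.
            peer = rng.randrange(num_nodes - 1)
            if peer >= node:
                peer += 1
            newly_informed.add(peer)
        informed |= newly_informed
    return rounds, len(informed)


rounds, reached = simulate_gossip(num_nodes=64, seed=42)
print(f"{reached} nodes informed after {rounds} rounds")
```

Because the informed set can at most double each round, the number of rounds grows roughly with the logarithm of the cluster size rather than with the size itself.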
29. Distributed Locks
A synchronization mechanism that enforces mutual exclusion over a shared resource across multiple processes or servers, where a local in-process lock cannot be used.
30. Idempotency
A property of an operation whereby executing the same request multiple times leaves the system in the same state, and yields the same outcome, as executing it once. It makes retries safe.
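For operations that are not naturally idempotent, such as charging a card, a common approach is to have the client send an idempotency key and have the server replay the stored result on retries. The sketch below illustrates the idea in memory; `PaymentService` and its fields are hypothetical, and a real service would persist the keys and guard against concurrent duplicates.

```python
class PaymentService:
    """Toy service that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency_key -> stored response
        self.charges_made = 0  # counts real side effects

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Retry of a request already handled: replay the first response.
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-123", 500)
retry = service.charge("order-123", 500)  # e.g. the client timed out and retried
print(first == retry, service.charges_made)
```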
