Network Partitions & Split Brain

Detecting physical routing ruptures, isolating split node sub-graphs safely, and preventing independent sub-clusters from executing conflicting master operations through mutual consensus rules.

What you'll learn

Network Partition
Split Brain Problem
Quorum-Based Consensus
Fencing Tokens
STONITH (Shoot The Other Node In The Head)
Partition Detection & Timeout Tuning

TL;DR

Detecting physical routing ruptures, isolating split node sub-graphs safely, and preventing independent sub-clusters from executing conflicting master operations through mutual consensus rules.

Visual System Topology

Network Partitions & Split Brain Network Handshake Flow

Client Node Initiates Request

Multiplexed

Network Partitions & Split Brain Gateway Routes Traffic

Fast Payload

Backend Server Executes Logic

Concept Overview

A network partition is a failure in a distributed system where nodes can no longer communicate with each other, dividing the cluster into two or more disconnected sub-groups. Partitions occur due to cable cuts, router failures, AZ outages, or transient packet loss storms. They are not exceptional events — in large-scale systems, partitions happen several times per year.

The Split Brain Problem is the most dangerous consequence of a network partition: both partitions believe they are the authoritative primary and continue accepting writes independently. When the partition heals, the cluster discovers it has two conflicting versions of the data with no clear "correct" history. This is catastrophic for databases where writes in both partitions cannot be trivially merged.

Solving network partitions is the core challenge of distributed consensus. The key insight: a partition-tolerant system must either sacrifice consistency (allow divergence during the partition) or sacrifice availability (refuse requests from the minority partition).

Key Architectural Pillars

Network Partition

A network failure that divides a cluster into disconnected sub-groups. Nodes within each group can communicate internally but cannot reach nodes in other groups. Neither sub-group can distinguish between "the other group is down" and "the other group is unreachable but alive."

Example: An AWS cross-AZ link failure splits a 6-node Cassandra cluster into two groups of 3. Both groups continue operating independently.

Split Brain Problem

When a network partition causes two or more nodes to simultaneously believe they are the primary/leader, they accept writes independently, creating diverging data histories. When the partition heals, the cluster must reconcile conflicting writes — and in many cases, data loss is unavoidable.

Example: A PostgreSQL primary and its standby lose connectivity. The standby is promoted by a monitoring tool. Both primaries accept writes for 10 minutes. When the link heals, 10 minutes of writes have conflicts.

Quorum-Based Consensus

The primary defense against split brain: require a majority of nodes (quorum = N/2 + 1) to agree before executing any write. With 5 nodes, quorum is 3. During a 2+3 split, only the 3-node partition can form a quorum and accept writes. The 2-node partition is read-only or rejects writes.

Example: etcd with 5 nodes uses quorum=3. If nodes 1,2 partition from 3,4,5: only 3,4,5 accept leader elections. Nodes 1,2 serve stale reads only.

Fencing Tokens

A monotonically increasing token issued to the current leader. Every write to shared resources must include the fencing token; the storage system rejects writes with stale tokens. This ensures writes from an old leader are rejected after a new leader is elected.

Example: etcd issues fencing token 42 to the current leader. If a partition causes a new leader to be elected with token 43, all writes from the old leader with token 42 are rejected by the storage system.

STONITH (Shoot The Other Node In The Head)

A fencing mechanism that physically kills a node suspected of split-brain to guarantee it cannot continue accepting writes. Used in high-availability database clusters. Extreme but reliable.

Example: Pacemaker cluster fencing: when a node stops responding to heartbeats, the cluster controller powers off that node via IPMI or triggers a hard reboot via cloud API.

Partition Detection & Timeout Tuning

Nodes detect partitions by the absence of heartbeat signals. The timeout value is a critical trade-off: too short → false positives (transient network hiccup triggers split brain recovery); too long → real failures take too long to detect.

Example: etcd default election timeout is 1000ms; a typical production setting is 500ms-2000ms depending on network reliability.

Foundations of Distributed Systems

Networking & Communication

Data Storage & Databases

Performance & Scaling

System Architecture

Data Processing Systems

Reliability & Operations

Security

Trade-offs & Interview Thinking

Real-world Case Studies