Design WhatsApp

Designing real-time WebSocket gateways, message delivery statuses, and a Cassandra chat history store.

What you'll learn

  • WebSockets — Why Not HTTP Polling?
  • Presence Service (Redis Routing Table)
  • Cassandra for Message Storage
  • Message Delivery Receipts (3-State FSM)
  • Group Messaging via Kafka Fan-Out
  • End-to-End Encryption (Signal Protocol)

Visual System Topology

WhatsApp — Real-Time Messaging Architecture

  • Client App: iOS / Android
  • Load Balancer
  • Chat Servers: WebSocket pool
  • Presence Service: Redis routing table
  • Message Queue: Kafka (offline / groups)
  • Cassandra: chat history store

Online: A → Chat Server → Redis lookup → push to B's WS server → B's socket
Offline: Save to Cassandra → Kafka → Notification Service → APNs / FCM

Concept Overview

A real-time messaging system at WhatsApp scale serves 2B+ users sending 100B messages per day, roughly 1.15 million messages every second on average.

Functional Requirements:

  • 1-on-1 and group messaging (up to 1,024 members per group)
  • Message delivery status: Sent ✓, Delivered ✓✓, Read ✓✓ (blue)
  • User presence (online/offline/last seen)
  • Media sharing (images, video, voice, documents)
  • End-to-end encryption (every message)
  • Push notifications for offline users

Non-Functional Requirements:

  • < 100ms message delivery for online users
  • 99.99% availability (~52 minutes downtime/year)
  • Horizontal scalability for billions of messages/day
  • Durability — no message loss even on server crash
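
The availability target can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the 365.25-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target:.3%} available -> {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the target should be fixed before choosing replication and failover strategies.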

Capacity Estimation (2B users, 500M DAU):

  • Messages/day: 100B → ~1.15M/sec
  • Text storage/day: 100B × 100 bytes = 10 TB/day
  • Media (20% of messages, 100KB avg): 20B × 100KB = 2 PB/day
  • Active WebSocket connections: up to 500M persistent TCP connections (worst case, assuming every daily active user is connected at the same time)
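
The estimates above can be reproduced as a short back-of-envelope script. All inputs come from the list above; the chat server count is an extra derived figure that assumes 100K connections per server (the upper end of the per-server range used later in this article) and that all 500M daily active users are connected at once.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

# Inputs taken from the capacity estimation above (decimal units).
messages_per_day = 100 * 10**9
avg_text_bytes = 100
media_percent = 20
avg_media_bytes = 100 * 10**3      # 100 KB
concurrent_connections = 500 * 10**6
connections_per_server = 100_000   # assumption: upper end of 50K-100K per server

messages_per_sec = messages_per_day / SECONDS_PER_DAY
text_tb_per_day = messages_per_day * avg_text_bytes / 10**12
media_messages_per_day = messages_per_day * media_percent // 100
media_pb_per_day = media_messages_per_day * avg_media_bytes / 10**15
chat_servers = math.ceil(concurrent_connections / connections_per_server)

if __name__ == "__main__":
    print(f"messages/sec (average): {messages_per_sec:,.0f}")
    print(f"text storage/day:       {text_tb_per_day:,.0f} TB")
    print(f"media storage/day:      {media_pb_per_day:,.0f} PB")
    print(f"chat servers needed:    {chat_servers:,}")
```

These are averages. Real traffic has daily peaks, so provisioned capacity would need headroom above these numbers.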

Key Architectural Pillars

1. WebSockets: Why Not HTTP Polling?

HTTP polling (the client asks "any messages?" every N seconds) wastes bandwidth on empty responses and adds up to N seconds of latency. A WebSocket is a persistent, bidirectional connection over TCP, so the server can push a message the instant it arrives. This design budgets 50K–100K simultaneous WebSocket connections per chat server. WhatsApp's servers run on Erlang/OTP (BEAM VM), where each connection is handled by one lightweight Erlang process; that model has been reported to sustain over a million concurrent connections on a single server.

Example: Server-Sent Events only flow server→client, so every send would need a separate HTTP request. WebSockets are bidirectional: one connection carries outgoing messages and incoming delivery receipts.
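
To put numbers on the cost of polling, here is a rough sketch. The function and field names are hypothetical, and it assumes messages arrive at a uniformly random moment within the polling interval.

```python
from typing import NamedTuple


class PollingCost(NamedTuple):
    requests_per_sec: float     # load generated by polling alone
    avg_added_latency_s: float  # mean wait until the next poll
    max_added_latency_s: float  # worst-case wait until the next poll


def polling_cost(connected_users: int, interval_s: float) -> PollingCost:
    """Estimate request load and added latency for fixed-interval polling."""
    return PollingCost(
        requests_per_sec=connected_users / interval_s,
        avg_added_latency_s=interval_s / 2,
        max_added_latency_s=interval_s,
    )


if __name__ == "__main__":
    cost = polling_cost(connected_users=500_000_000, interval_s=5)
    print(f"{cost.requests_per_sec:,.0f} requests/sec, "
          f"{cost.avg_added_latency_s:.1f}s average added latency")
```

With 500M connected users polling every 5 seconds, polling alone generates far more requests than the ~1.15M messages actually sent per second, and the average added latency is well above the 100ms delivery target.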

2. Presence Service (Redis Routing Table)

A dedicated Presence Service maintains a mapping of user_id → (server_ip, socket_id) in Redis. When User A connects via WebSocket, the chat server writes A's entry to Redis and refreshes it on every heartbeat. When A disconnects, the entry can be deleted immediately; if the connection drops silently, the entry expires after the heartbeat timeout. Any chat server can then look up which server holds an online user's connection and route the message there.

Example: Redis key: presence:{user_id} → {server_ip, socket_id, last_heartbeat}. TTL = 30s. The client sends a heartbeat every 10s, so the key tolerates two missed heartbeats. Expired TTL → user marked offline. With up to 500M connected users, that is up to 50M heartbeat writes per second, so this Redis cluster needs careful sharding by user_id.
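
The heartbeat and TTL behaviour can be sketched with an in-memory stand-in for Redis. The class and method names below are hypothetical. With a real Redis client, the heartbeat would roughly correspond to a `SET presence:{user_id} <value> EX 30` command and the lookup to a `GET`.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HEARTBEAT_INTERVAL_S = 10  # the client sends a heartbeat this often
PRESENCE_TTL_S = 30        # an entry expires this long after the last heartbeat


class PresenceTable:
    """In-memory stand-in for Redis keys presence:{user_id} with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # user_id -> (server_ip, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def heartbeat(self, user_id: str, server_ip: str) -> None:
        """Create or refresh the routing entry and reset its TTL."""
        self._entries[user_id] = (server_ip, self._clock() + PRESENCE_TTL_S)

    def lookup(self, user_id: str) -> Optional[str]:
        """Return the chat server holding the user's socket, or None if offline."""
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # lazy expiry, standing in for the Redis TTL
            return None
        return server_ip

    def disconnect(self, user_id: str) -> None:
        """Remove the entry immediately on a clean disconnect."""
        self._entries.pop(user_id, None)


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo does not need to sleep
    table = PresenceTable(clock=lambda: now[0])
    table.heartbeat("alice", "10.0.0.7")
    print(table.lookup("alice"))  # still within the TTL
    now[0] = 31.0
    print(table.lookup("alice"))  # TTL expired without a heartbeat
```

Injecting the clock keeps the sketch testable; in production the expiry would be handled by Redis itself.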

3. Cassandra for Message Storage

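A single SQL database becomes a bottleneck for chat: every message is a write, and at over a million writes per second the data has to be sharded manually across many instances, each maintaining its own index for ordered reads. Cassandra's wide-column model fits the workload better: writes are appended without reading first, rows inside a partition are stored sorted by clustering key, and write throughput scales close to linearly as nodes are added. Schema: partition key = conversation_id, clustering key = message_id (a time-based UUID, so rows sort chronologically).

Example: SELECT * FROM messages WHERE conversation_id = 'A-B' ORDER BY message_id DESC LIMIT 50 reads a single partition and returns the 50 newest rows. The cost depends on the LIMIT, not on the length of the history. 100B messages/day = ~10TB of text appended to Cassandra daily, before replication.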
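
The access pattern can be illustrated with a small in-memory model. The CQL string shows one possible table definition; the column names `sender_id` and `payload` are assumptions. The `MessageStore` class is a hypothetical model of how rows are grouped by partition and kept sorted by clustering key, not a Cassandra client, and it uses plain integers in place of time-based UUIDs.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

# One possible table definition (illustrative, not taken from WhatsApp).
SCHEMA_CQL = """
CREATE TABLE messages (
    conversation_id text,
    message_id      timeuuid,
    sender_id       text,
    payload         blob,
    PRIMARY KEY ((conversation_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
"""

Row = Tuple[int, str]  # (message_id, body); int stands in for a timeuuid


class MessageStore:
    """Toy model: one sorted row list per partition key."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def append(self, conversation_id: str, message_id: int, body: str) -> None:
        # insort keeps rows ordered even if messages arrive out of order.
        bisect.insort(self._partitions[conversation_id], (message_id, body))

    def latest(self, conversation_id: str, limit: int = 50) -> List[Row]:
        """Newest rows first, touching only the requested partition."""
        if limit <= 0:
            return []
        rows = self._partitions.get(conversation_id, [])
        return rows[-limit:][::-1]


if __name__ == "__main__":
    store = MessageStore()
    for i in range(120):
        store.append("A-B", i, f"message {i}")
    page = store.latest("A-B", limit=50)
    print(page[0], page[-1])
```

The read touches one partition and slices a fixed number of rows, which mirrors why the query cost follows the LIMIT rather than the conversation's total size.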

4. Message Delivery Receipts (3-State FSM)

Three delivery states: (1) Sent ✓: the server received the message and saved it to Cassandra. (2) Delivered ✓✓: the recipient's device received the message and sent an ACK. (3) Read ✓✓ (blue): the recipient opened the conversation. The Sent acknowledgement goes from the sender's chat server straight back to the sender. The Delivered and Read transitions flow: recipient → their chat server → update status in Cassandra → push receipt event to the sender's WebSocket. Transitions should only move forward, so a late or duplicate receipt is ignored.

Example: B's app receives message → sends {type: DELIVERED, msg_id: X} to chat server → server updates Cassandra status → pushes {DELIVERED, msg_id: X} to A's WebSocket. A sees the single tick become a double tick.
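
The three states can be modelled as a small forward-only state machine. This is a sketch under stated assumptions: `ReceiptTracker` and its methods are hypothetical names, a dict stands in for the status column in Cassandra, and a READ receipt that arrives before DELIVERED is allowed to skip ahead.

```python
from enum import IntEnum
from typing import Dict, Optional


class Status(IntEnum):
    SENT = 1       # server saved the message
    DELIVERED = 2  # recipient's device acknowledged it
    READ = 3       # recipient opened the conversation


class ReceiptTracker:
    """Tracks per-message status and rejects backward or duplicate moves."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}  # stand-in for the stored status

    def on_saved(self, msg_id: str) -> None:
        """Called once the server has durably stored the message."""
        self._status[msg_id] = Status.SENT

    def on_receipt(self, msg_id: str, receipt: Status) -> Optional[dict]:
        """Apply a receipt; return the event to push to the sender, if any."""
        current = self._status.get(msg_id)
        if current is None or receipt <= current:
            return None  # unknown message, duplicate, or out-of-date receipt
        self._status[msg_id] = receipt
        return {"type": receipt.name, "msg_id": msg_id}

    def status(self, msg_id: str) -> Optional[Status]:
        return self._status.get(msg_id)


if __name__ == "__main__":
    tracker = ReceiptTracker()
    tracker.on_saved("X")
    print(tracker.on_receipt("X", Status.DELIVERED))
    print(tracker.on_receipt("X", Status.DELIVERED))  # duplicate, no event
```

Making transitions monotonic keeps the handler idempotent, which matters because receipts may be retried after a reconnect.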

5. Group Messaging via Kafka Fan-Out

For 1-on-1 messages, a direct server-to-server push is fast enough to do inline. For a group with 1,024 members, synchronous fan-out on the sender's chat server would hold up the send path for up to 1,023 deliveries. Instead, the message is saved to Cassandra once and published to a Kafka topic keyed by group_id, which keeps each group's messages in order within one partition. Fan-out workers consume from Kafka in parallel; for each member, a worker looks up the member's chat server in Redis Presence and forwards the message there, or hands offline members to the Notification Service.

Example: Group of 500 members: 1 Kafka event → 499 fan-out tasks run in parallel (one per recipient) → each checks Redis → delivers to the recipient's WebSocket server. This decouples group fan-out latency from the original send latency.
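
The per-member routing step of a fan-out worker might look like the sketch below. It is an in-process illustration only: a thread pool stands in for the Kafka consumer group, a dict stands in for the Redis presence table, and the names `fan_out` and `Delivery` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, NamedTuple


class Delivery(NamedTuple):
    msg_id: str
    member: str
    route: str  # "websocket:<server>" for online members, else "push_notification"


def fan_out(
    msg_id: str,
    sender_id: str,
    members: Iterable[str],
    presence: Dict[str, str],
    max_workers: int = 32,
) -> List[Delivery]:
    """Decide, in parallel, where each recipient's copy of the message goes."""
    recipients = [m for m in members if m != sender_id]

    def route(member: str) -> Delivery:
        server = presence.get(member)  # stand-in for the Redis lookup
        if server is None:
            return Delivery(msg_id, member, "push_notification")
        return Delivery(msg_id, member, f"websocket:{server}")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(route, recipients))


if __name__ == "__main__":
    group = [f"u{i}" for i in range(500)]
    online = {f"u{i}": f"chat-{i % 4}" for i in range(0, 500, 2)}
    deliveries = fan_out("m1", "u0", group, online)
    via_socket = sum(d.route.startswith("websocket:") for d in deliveries)
    print(len(deliveries), "deliveries,", via_socket, "over WebSocket")
```

The sender's request returns as soon as the message is stored and published; the routing work above happens afterwards and can be scaled by adding workers.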

6. End-to-End Encryption (Signal Protocol)

Every message is encrypted on the sender's device before transmission. The server only sees ciphertext and cannot read content. Key agreement uses Diffie-Hellman (in the Signal Protocol, the X3DH handshake over Curve25519) to establish a shared secret. Messages are encrypted with AES-256 under a fresh key per message, derived by the Double Ratchet. This gives forward secrecy: past messages cannot be decrypted even if current keys are compromised.

Example: Server stores: encrypted_payload, sender_id, recipient_id, timestamp. The encrypted_payload is opaque to the server, while the routing metadata stays visible. Even after a full server compromise, an attacker cannot decrypt stored messages without the private keys held on the users' devices.
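
The two ideas, a Diffie-Hellman shared secret and a new key for every message, can be shown with a toy sketch using only the standard library. This is not the Signal Protocol and must not be used for real encryption: the 127-bit prime is far too small, there is no authentication, and the real protocol uses elliptic-curve keys. All function names are hypothetical. The key derivation step follows the general shape of a symmetric hash ratchet.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

# Toy parameters for illustration only; real systems use Curve25519.
P = 2**127 - 1  # a Mersenne prime, far too small to be secure
G = 3


def keypair() -> Tuple[int, int]:
    """Return (private, public) for toy finite-field Diffie-Hellman."""
    private = secrets.randbelow(P - 3) + 2  # in [2, P - 2]
    return private, pow(G, private, P)


def shared_secret(my_private: int, their_public: int) -> bytes:
    """Both sides compute the same value without ever sending it."""
    secret = pow(their_public, my_private, P)
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()


def ratchet(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (message_key, next_chain_key); old keys cannot be recomputed
    from the new chain key because HMAC is one-way."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key


if __name__ == "__main__":
    a_private, a_public = keypair()
    b_private, b_public = keypair()
    a_chain = shared_secret(a_private, b_public)
    b_chain = shared_secret(b_private, a_public)
    print("secrets match:", a_chain == b_chain)
    key_1, a_chain = ratchet(a_chain)
    key_2, a_chain = ratchet(a_chain)
    print("per-message keys differ:", key_1 != key_2)
```

Only the public values cross the network, so the server relaying them never learns the shared secret. Discarding each message key after use is what limits the damage if a device's current state leaks later.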
