Advanced | 20 min read | Real-world Case Studies

Design Notification System

Configuring priority message queues, spam throttlers, and integrations with APNs/FCM gateways.

What you'll learn

  • Priority Kafka Queues
  • Deduplication with Idempotency Keys
  • Channel Worker Architecture (Per-Channel Isolation)
  • Retry with Exponential Backoff + Dead Letter Queue
  • User Preference Service
  • Rate Limiting + Carrier Throttling

Visual System Topology

Notification System: Multi-Channel Architecture (diagram components)

  • Trigger Services: Order / Auth / Marketing
  • Notification Gateway: validate + deduplicate
  • Kafka: high_priority / low_priority topics
  • APNs Worker: iOS push
  • FCM Worker: Android push
  • SMS Worker: Twilio
  • Email Worker: SendGrid
  • Redis: dedup + throttle
  • Retry Queue: exponential backoff + DLQ
  • Analytics: delivery rates

Flows:

  • OTP (urgent): high_priority Kafka → dedicated workers → APNs/FCM/Twilio → < 1s delivery
  • Marketing (bulk): low_priority Kafka → throttled workers → batched delivery, best-effort

Concept Overview

A notification system must reliably deliver messages across multiple channels (push, SMS, email) at high volume while ensuring OTPs arrive in < 1 second, marketing messages don't spam users, and third-party API failures don't cause permanent notification loss.

Functional Requirements:

  • Send push notifications (iOS APNs, Android FCM)
  • Send SMS (Twilio, carrier-based)
  • Send emails (SendGrid, SES)
  • Priority classification (OTP = urgent, marketing = bulk)
  • User preference management (opt-in/opt-out per channel)
  • Delivery tracking and analytics

Non-Functional Requirements:

  • OTP delivery: < 1 second (user is waiting at login screen)
  • Marketing notifications: best-effort, minutes acceptable
  • Zero duplicate notifications per event
  • Retry failed deliveries with exponential backoff
  • Handle third-party API unreliability (APNs, Twilio go down)

Capacity Estimation (100M notifications/day, 10M DAU):

  • Total: 100M/day ≈ 1,157 notifications/sec
  • OTP (urgent): 5M/day ≈ 58/sec (high priority)
  • Marketing (bulk): 80M/day ≈ 926/sec (low priority)
  • By channel: Push 70% (810/sec), SMS 15% (174/sec), Email 15% (174/sec)
  • Peak burst: 10x ≈ 11,570/sec during product launches
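
The averages above come from dividing daily volume by 86,400 seconds. A minimal sketch of that arithmetic follows; the helper name per_second and the variable names are illustrative, not part of any real API.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> int:
    """Average rate per second, rounded to the nearest whole notification."""
    return round(per_day / SECONDS_PER_DAY)


total_per_day = 100_000_000

total_rate = per_second(total_per_day)
otp_rate = per_second(5_000_000)
marketing_rate = per_second(80_000_000)

# Channel split from the estimate above.
channel_share = {"push": 0.70, "sms": 0.15, "email": 0.15}
channel_rates = {
    channel: per_second(total_per_day * share)
    for channel, share in channel_share.items()
}

# Peak burst is modeled as 10x the average rate.
peak_rate = total_rate * 10
```

These are averages over a full day; real traffic is bursty, which is why the 10x peak figure drives worker and partition sizing rather than the mean.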

Key Architectural Pillars

1. Priority Kafka Queues

Different notification types have different latency requirements. With a single Kafka topic, OTPs would wait behind bulk marketing emails. The solution is a separate Kafka topic per priority: high_priority (OTPs, transactional alerts, password resets) and low_priority (marketing campaigns, weekly digests, recommendations). High-priority workers are dedicated and always running, with a low consumer-lag target (< 100ms). Low-priority workers consume in larger batches, are throttled, and may lag by minutes.

Example: User clicks "Forgot Password": POST /notifications {type: OTP, priority: HIGH, channel: SMS} → high_priority Kafka topic → dedicated Twilio worker picks up in < 100ms → SMS delivered in < 1 second. Marketing email: same API → low_priority topic → bulk email worker processes it in next batch (may be 30-60 seconds later). OTPs never wait behind campaigns.
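
The routing decision itself can be very small. The sketch below shows one way the Gateway might map a notification type to a topic; the topic names come from the text, while the Notification class, the URGENT_TYPES identifiers, and topic_for are assumed names for illustration. Publishing to Kafka is left out.

```python
from dataclasses import dataclass

HIGH_PRIORITY_TOPIC = "high_priority"
LOW_PRIORITY_TOPIC = "low_priority"

# Assumed type identifiers for the urgent categories named in the text
# (OTPs, transactional alerts, password resets).
URGENT_TYPES = {"OTP", "TRANSACTIONAL_ALERT", "PASSWORD_RESET"}


@dataclass(frozen=True)
class Notification:
    type: str
    channel: str
    user_id: int


def topic_for(notification: Notification) -> str:
    """Pick the Kafka topic by priority class; anything not urgent is bulk."""
    if notification.type in URGENT_TYPES:
        return HIGH_PRIORITY_TOPIC
    return LOW_PRIORITY_TOPIC
```

Defaulting unknown types to low_priority is a design choice in this sketch: a misclassified campaign cannot crowd out OTPs, at the cost of a new urgent type being slow until it is added to the set.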

2. Deduplication with Idempotency Keys

Network failures, server retries, and event replay can cause a notification to be sent multiple times. Receiving two "Your OTP is 123456" messages is confusing; receiving two "Flash Sale 50% off!" emails is annoying and damages sender reputation. Each notification request must include an idempotency_key generated by the caller: a UUID, or a deterministic key derived from the event (such as "order-456-shipped"). The key must stay the same across retries of the same request, otherwise every retry looks new. The Gateway stores {idempotency_key → status} in Redis with TTL = 24 hours. A duplicate request with the same key returns the cached status and is NOT sent again.

Example: Order service sends: {idempotency_key: "order-456-shipped", type: EMAIL, user_id: 123, subject: "Your order shipped"}. Gateway: check Redis → not seen → process and send → store {order-456-shipped: SENT} in Redis. If order service retries due to timeout: same idempotency_key → Redis hit → return SENT, skip sending. User receives exactly one email.
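
The check-and-store step can be sketched as follows. To stay self-contained, this uses an in-memory stand-in for Redis; IdempotencyStore, set_if_absent, and handle are hypothetical names. With real Redis, the set_if_absent step would typically be a single SET with the NX and EX options so that the check and the write are atomic.

```python
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # 24 hours, as in the text


class IdempotencyStore:
    """In-memory stand-in for Redis (illustrative only, not thread-safe)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        status, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[key]  # expired: treat as never seen
            return None
        return status

    def set_if_absent(self, key: str, status: str, ttl: float = TTL_SECONDS) -> bool:
        """Return True only for the first caller to claim the key."""
        if self.get(key) is not None:
            return False
        self._entries[key] = (status, self._clock() + ttl)
        return True

    def set(self, key: str, status: str, ttl: float = TTL_SECONDS) -> None:
        self._entries[key] = (status, self._clock() + ttl)


def handle(request: dict, store: IdempotencyStore, send: Callable[[dict], None]) -> str:
    key = request["idempotency_key"]
    if not store.set_if_absent(key, "PENDING"):
        # Duplicate: return the cached status and do not send again.
        return store.get(key) or "PENDING"
    send(request)
    store.set(key, "SENT")
    return "SENT"
```

Claiming the key before sending means two concurrent retries cannot both send. The sketch does not handle a send that raises: the key would stay PENDING until the TTL expires, so a real Gateway would need to clear or update it on failure.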

3. Channel Worker Architecture (Per-Channel Isolation)

Each notification channel (APNs, FCM, Twilio SMS, SendGrid Email) has its own dedicated worker pool. This isolation means that a slow APNs doesn't delay SMS delivery, and a SendGrid outage doesn't affect push notifications. Each worker pool independently consumes from its own per-channel Kafka topic, calls the third-party API, and handles retries and circuit breaking. Pools scale independently based on channel-specific throughput.

Example: APNs workers: pool of 20, consume from kafka_topic_apns. FCM workers: pool of 30, consume from kafka_topic_fcm. Twilio workers: pool of 10, consume from kafka_topic_sms. If APNs has a 30-minute outage: APNs workers accumulate Kafka lag, FCM and SMS continue normally. APNs workers drain the lag when service recovers.
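
The isolation property can be illustrated with a toy simulation in which each channel has its own queue and a provider outage only stalls that queue. This is a sketch, not a Kafka client: ChannelPools and its methods are hypothetical, plain deques stand in for topics, and the email topic name is assumed by analogy with the three named in the example.

```python
from collections import deque
from typing import Deque, Dict, Iterable

# Topic names follow the example above; "kafka_topic_email" is an assumption.
CHANNEL_TOPICS = {
    "apns": "kafka_topic_apns",
    "fcm": "kafka_topic_fcm",
    "sms": "kafka_topic_sms",
    "email": "kafka_topic_email",
}


class ChannelPools:
    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {
            topic: deque() for topic in CHANNEL_TOPICS.values()
        }
        self.delivered: Dict[str, int] = {channel: 0 for channel in CHANNEL_TOPICS}

    def publish(self, channel: str, message: str) -> None:
        self.queues[CHANNEL_TOPICS[channel]].append(message)

    def drain(self, down: Iterable[str] = ()) -> None:
        """One worker pass: every healthy channel empties its own queue."""
        unavailable = set(down)
        for channel, topic in CHANNEL_TOPICS.items():
            if channel in unavailable:
                continue  # provider outage: messages stay queued, lag grows
            queue = self.queues[topic]
            while queue:
                queue.popleft()
                self.delivered[channel] += 1

    def lag(self, channel: str) -> int:
        return len(self.queues[CHANNEL_TOPICS[channel]])
```

Calling drain(down={"apns"}) leaves APNs messages queued while FCM and SMS are delivered; a later drain() with no outage clears the backlog, mirroring the 30-minute outage scenario above.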

4. Retry with Exponential Backoff + Dead Letter Queue

Third-party APIs (APNs, Twilio, SendGrid) are unreliable: they return 5xx errors, reject requests with rate-limit responses, or time out. A simple immediate retry hammers a struggling API. Exponential backoff doubles the wait before each retry: retry 1 → wait 1s, retry 2 → wait 2s, retry 3 → wait 4s, retry 4 → wait 8s, and so on, capped at 60s. After the maximum number of retries (e.g., 5), the message moves to a Dead Letter Queue (DLQ) for manual inspection and potential redelivery. The DLQ is critical: permanent delivery failures must be auditable.

Example: Twilio returns 503 (Service Unavailable). The worker retries with backoff: 1s, 2s, 4s, 8s, 16s (5 retries, ~31s of waiting in total). Still failing → publish to the Kafka DLQ topic. The ops team receives an alert. When Twilio recovers, DLQ messages can be replayed. Without a DLQ, failed notifications disappear silently.
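
A minimal sketch of this retry loop is shown below. It assumes one initial attempt followed by up to five retries, uses a plain list as a stand-in for the Kafka DLQ topic, and introduces the hypothetical names TransientError, backoff_delays, and deliver_with_retry. The sleep function is injectable so the schedule can be checked without actually waiting.

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Stand-in for a retryable provider failure (5xx, rate limit, timeout)."""


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Waits before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def deliver_with_retry(
    message: str,
    send: Callable[[str], None],
    dead_letters: List[str],
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = 5,
) -> bool:
    """Try once, then retry with backoff; park the message in the DLQ on failure."""
    try:
        send(message)
        return True
    except TransientError:
        pass
    for delay in backoff_delays(max_retries):
        sleep(delay)
        try:
            send(message)
            return True
        except TransientError:
            continue
    dead_letters.append(message)  # exhausted retries: keep for audit and replay
    return False
```

Only TransientError is retried here; a permanent failure such as an invalid device token would raise a different exception and should not be retried at all. Production systems often add random jitter to the delays so that many workers do not retry in lockstep, which this sketch omits.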

5. User Preference Service

Users should only receive notifications they opted into. The Preference Service stores {user_id, channel (push/sms/email), category (marketing/transactional/security), enabled: true/false}. The Notification Gateway queries preferences before routing. OTP and security notifications are always sent (users cannot opt out). Marketing notifications are routed to the queue only if the preference is enabled. SMS is sent only if the user has provided a phone number and opted in. This prevents sending to invalid channels and respects legal requirements (GDPR, CAN-SPAM).

Example: User disables marketing emails in settings → UPDATE preferences SET enabled=false WHERE user_id=123 AND category=MARKETING AND channel=EMAIL. Next marketing campaign: Gateway checks preference → disabled → skip email routing for user 123. OTP: Gateway checks → all transactional notifications enabled regardless of marketing preference.
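
The Gateway's preference check can be sketched as a single predicate. The version below keeps preferences in a dictionary rather than a database; PreferenceStore and should_send are hypothetical names, and treating a missing marketing preference as "not opted in" is an assumption made here to match the opt-in wording above.

```python
from typing import Dict, Tuple

# Categories that bypass opt-out, per the text above.
ALWAYS_SEND = {"transactional", "security"}

PrefKey = Tuple[int, str, str]  # (user_id, channel, category)


class PreferenceStore:
    def __init__(self) -> None:
        self._enabled: Dict[PrefKey, bool] = {}
        self._phones: Dict[int, str] = {}

    def set_enabled(self, user_id: int, channel: str, category: str, enabled: bool) -> None:
        self._enabled[(user_id, channel, category)] = enabled

    def set_phone(self, user_id: int, phone: str) -> None:
        self._phones[user_id] = phone

    def should_send(self, user_id: int, channel: str, category: str) -> bool:
        if channel == "sms" and user_id not in self._phones:
            return False  # no phone number on file: invalid channel for this user
        if category in ALWAYS_SEND:
            return True  # OTP/security traffic ignores marketing opt-outs
        # Assumption: no stored preference means the user has not opted in.
        return self._enabled.get((user_id, channel, category), False)
```

Running this check in the Gateway, before anything is published to Kafka, keeps opted-out messages from consuming queue and worker capacity.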

6. Rate Limiting + Carrier Throttling

Sending too many SMS or emails too fast triggers carrier blacklisting (your sender ID gets marked as spam) or rate limit errors from third-party APIs. Apply per-user throttling (e.g., max 5 SMS per hour, max 3 marketing emails per day) enforced in Redis with sliding window counters. Apply per-API-account throttling to stay within APNs/Twilio/SendGrid rate limits. For marketing campaigns targeting 10M users, distribute the send over hours (scheduled bulk send) rather than blasting all at once.

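Example: Suppose the SendGrid account is limited to 100K emails/hour (an illustrative figure; actual limits depend on the plan). Marketing campaign to 1M users: split into 10 batches, send one batch/hour over 10 hours. Redis throttle: INCR email_count:{user_id}:{date} → if > 3 → skip this user today. Keying the counter by date makes it a fixed daily window, a simpler approximation of a sliding window; set an expiry on the key so old counters are cleaned up. Prevents spam reputation damage and unsubscribe spikes.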
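
For comparison with the fixed daily counter, here is a sketch of a per-user sliding window kept in memory, plus the batch arithmetic for spreading a campaign across hours. It stores individual send timestamps (a sliding-window log) rather than counters; SlidingWindowLimiter and batches_needed are hypothetical names, and a production version would hold this state in Redis instead of process memory.

```python
import math
from collections import defaultdict, deque
from typing import DefaultDict, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends per user in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events: DefaultDict[int, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: int, now: float) -> bool:
        events = self._events[user_id]
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= now - self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True


def batches_needed(recipients: int, per_hour_limit: int) -> int:
    """Number of hourly batches required to stay within the provider limit."""
    return math.ceil(recipients / per_hour_limit)
```

With a limit of 5 per 3,600 seconds this enforces the "max 5 SMS per hour" rule from the text, and batches_needed(1_000_000, 100_000) reproduces the 10 hourly batches in the example. The log uses memory proportional to the limit per user, which is acceptable for small limits like these.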
