Design Notification System
Configuring priority message queues, spam throttlers, and integrations with APNs/FCM gateways.
What you'll learn
- Priority Kafka Queues
- Deduplication with Idempotency Keys
- Channel Worker Architecture (Per-Channel Isolation)
- Retry with Exponential Backoff + Dead Letter Queue
- User Preference Service
- Rate Limiting + Carrier Throttling
Visual System Topology
Notification System — Multi-Channel Architecture
Marketing (bulk): low_priority Kafka → throttled workers → batched delivery, best-effort
Concept Overview
A notification system must reliably deliver messages across multiple channels (push, SMS, email) at high volume while ensuring OTPs arrive in < 1 second, marketing messages don't spam users, and third-party API failures don't cause permanent notification loss.
Functional Requirements:
- Send push notifications (iOS APNs, Android FCM)
- Send SMS (Twilio, carrier-based)
- Send emails (SendGrid, SES)
- Priority classification (OTP = urgent, marketing = bulk)
- User preference management (opt-in/opt-out per channel)
- Delivery tracking and analytics
Non-Functional Requirements:
- OTP delivery: < 1 second (user is waiting at login screen)
- Marketing notifications: best-effort, minutes acceptable
- Zero duplicate notifications per event
- Retry failed deliveries with exponential backoff
- Handle third-party API unreliability (APNs and Twilio outages)
Capacity Estimation (100M notifications/day, 10M DAU):
- Total: 100M/day = 1,157 notifications/sec
- OTP (urgent): 5M/day = 58/sec (high priority)
- Marketing (bulk): 80M/day = 926/sec (low priority)
- By channel: Push 70% (810/sec), SMS 15% (174/sec), Email 15% (174/sec)
- Peak burst: 10x = 11,570/sec during product launches
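The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes traffic is spread evenly over an 86,400-second day; the variable names are illustrative, not part of any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Convert a daily volume into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


total_rate = per_second(100_000_000)      # all notifications
otp_rate = per_second(5_000_000)          # urgent traffic
marketing_rate = per_second(80_000_000)   # bulk traffic

# Channel split taken from the estimate: push 70%, SMS 15%, email 15%.
channel_share = {"push": 0.70, "sms": 0.15, "email": 0.15}
channel_rates = {ch: total_rate * share for ch, share in channel_share.items()}

# Peak burst modeled as a flat 10x multiplier on the average rate.
peak_rate = total_rate * 10

for name, rate in [("total", total_rate), ("otp", otp_rate),
                   ("marketing", marketing_rate), ("peak", peak_rate)]:
    print(f"{name}: {rate:,.0f}/sec")
```

These are averages; real traffic is bursty, which is why the 10x peak figure matters more for sizing than the mean.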
Key Architectural Pillars
Priority Kafka Queues
Different notification types have different latency requirements. On a single Kafka topic, OTPs would wait behind bulk marketing emails. The fix is a separate Kafka topic per priority: high_priority (OTPs, transactional alerts, password resets) and low_priority (marketing campaigns, weekly digests, recommendations). High-priority workers are dedicated and always running, with a consumer lag target below 100 ms. Low-priority workers consume in larger batches, are throttled, and may lag by minutes.
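The routing decision can be sketched as a simple lookup from notification type to topic name. The topic names come from the text; the type strings, the Notification class, and topic_for are hypothetical names chosen for illustration, and no Kafka client is involved.

```python
from dataclasses import dataclass

HIGH_PRIORITY_TOPIC = "high_priority"
LOW_PRIORITY_TOPIC = "low_priority"

# Hypothetical type labels, grouped the way the text groups them.
URGENT_TYPES = {"otp", "transactional_alert", "password_reset"}
BULK_TYPES = {"marketing_campaign", "weekly_digest", "recommendation"}


@dataclass(frozen=True)
class Notification:
    user_id: str
    type: str
    body: str


def topic_for(notification: Notification) -> str:
    """Pick the Kafka topic for a notification based on its type."""
    if notification.type in URGENT_TYPES:
        return HIGH_PRIORITY_TOPIC
    if notification.type in BULK_TYPES:
        return LOW_PRIORITY_TOPIC
    # Fail loudly rather than silently queueing an unknown type as bulk.
    raise ValueError(f"unclassified notification type: {notification.type!r}")
```

Rejecting unknown types is a design choice made for this sketch: it keeps a new urgent type from landing in the throttled queue by accident.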
Deduplication with Idempotency Keys
Network failures, server retries, and event replay can cause a notification to be sent multiple times. Receiving two "Your OTP is 123456" messages is confusing; receiving two "Flash Sale 50% off!" emails is annoying and damages sender reputation. Each notification request must include an idempotency_key (a UUID generated by the caller). The Gateway stores {idempotency_key → status} in Redis with a TTL of 24 hours. A duplicate request with the same key returns the cached status and does NOT send again.
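The dedup flow can be sketched with an in-memory stand-in for Redis. IdempotencyStore, claim, set_status, and handle_request are hypothetical names for this example. In a real deployment the claim step would need to be atomic, for instance a Redis SET with the NX and EX options, so that two concurrent requests cannot both claim the same key.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class IdempotencyStore:
    """In-memory stand-in for the Redis {idempotency_key -> status} map."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (status, expiry)

    def claim(self, key: str) -> Optional[str]:
        """Return None if the key was newly claimed, else the cached status."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        self._entries[key] = ("pending", now + self._ttl)
        return None

    def set_status(self, key: str, status: str) -> None:
        """Update the status of a claimed key without extending its TTL."""
        entry = self._entries.get(key)
        if entry is not None:
            self._entries[key] = (status, entry[1])


def handle_request(store: IdempotencyStore, key: str,
                   send: Callable[[], str]) -> str:
    """Send once per idempotency key; duplicates get the cached status."""
    cached = store.claim(key)
    if cached is not None:
        return cached
    status = send()
    store.set_status(key, status)
    return status
```

A duplicate that arrives while the first request is still in flight sees the "pending" status here; how the Gateway should respond in that case is not specified in the text.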
Channel Worker Architecture (Per-Channel Isolation)
Each notification channel (APNs, FCM, Twilio SMS, SendGrid Email) has its own dedicated worker pool. This isolation means that a slow APNs doesn't delay SMS delivery, and a SendGrid outage doesn't affect push notifications. Each worker pool independently consumes from its own Kafka topic (partitioned by channel), calls the third-party API, and handles retries and circuit breaking. Workers scale independently based on channel-specific throughput.
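One way to picture per-channel isolation is a separate circuit breaker per worker pool, so that failures on one provider never block calls to another. The sketch below is a simplified breaker; the class, the threshold of 3 failures, and the 30-second reset are assumptions for illustration, not values from the text.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call to the provider may be attempted right now."""
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


# One breaker per channel: state is never shared across providers.
CHANNELS = ("apns", "fcm", "twilio_sms", "sendgrid_email")
breakers = {channel: CircuitBreaker() for channel in CHANNELS}
```

Because each channel owns its breaker, three consecutive SendGrid failures stop email calls for the cool-down period while breakers["apns"].allow() keeps returning True.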
Retry with Exponential Backoff + Dead Letter Queue
Third-party APIs (APNs, Twilio, SendGrid) are unreliable: they return 5xx errors, rate-limit responses, or time out. An immediate retry hammers an API that is already struggling. With exponential backoff, the wait doubles on each retry: 1s before retry 1, 2s before retry 2, 4s before retry 3, 8s before retry 4, and so on, capped at 60s. After the maximum number of retries (e.g., 5), the message moves to a Dead Letter Queue (DLQ) for manual inspection and potential redelivery. The DLQ is critical because permanent delivery failures must be auditable.
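The retry schedule and DLQ hand-off can be sketched as follows. TransientError, backoff_delay, and deliver_with_retry are hypothetical names, and the DLQ is modeled as a plain list rather than a real queue. Only errors marked transient are retried; anything else propagates to the caller.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class TransientError(Exception):
    """Retryable failure, e.g. a 5xx response, a rate limit, or a timeout."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Wait before retry number `attempt` (1-based): 1s, 2s, 4s, 8s, ... up to cap."""
    return min(cap, base * 2 ** (attempt - 1))


def deliver_with_retry(message: Any,
                       send: Callable[[Any], None],
                       dead_letters: List[Dict[str, Any]],
                       max_retries: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, then retry with exponential backoff; dead-letter on exhaustion."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):  # attempt 0 is the initial try
        if attempt > 0:
            sleep(backoff_delay(attempt))
        try:
            send(message)
            return True
        except TransientError as exc:
            last_error = exc
    # Retries exhausted: keep the message and the last error for auditing.
    dead_letters.append({"message": message, "error": repr(last_error)})
    return False
```

Production systems commonly add random jitter to each delay so that many workers do not retry in lockstep; the text does not specify jitter, so it is left out here.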
User Preference Service
Users should only receive notifications they opted into. The Preference Service stores {user_id, channel (push/sms/email), category (marketing/transactional/security), enabled: true/false}. The Notification Gateway queries preferences before routing. OTP and security notifications are always sent (users cannot opt out). Marketing notifications are routed to the queue only if the user's preference allows it. SMS is sent only if the user has provided a phone number and opted in. This prevents sending to invalid channels and respects legal requirements (GDPR, CAN-SPAM).
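The Gateway's preference check might look like the sketch below. The function name, the dictionary-based storage, and the default of "not opted in" when no preference record exists are assumptions for this example. OTPs are treated as part of the security category, and the text does not say whether transactional messages are also mandatory, so only security is treated that way here.

```python
from typing import Dict, Tuple

# (user_id, channel, category) -> enabled
PrefKey = Tuple[str, str, str]

# Categories users cannot opt out of, per the text (OTP falls under security).
MANDATORY_CATEGORIES = {"security"}


def should_send(prefs: Dict[PrefKey, bool],
                phone_numbers: Dict[str, str],
                user_id: str, channel: str, category: str) -> bool:
    """Decide whether a notification may be routed to a channel queue."""
    # SMS is impossible without a phone number, whatever the category.
    if channel == "sms" and not phone_numbers.get(user_id):
        return False
    if category in MANDATORY_CATEGORIES:
        return True
    # Assumption: a missing preference record means the user has not opted in.
    return prefs.get((user_id, channel, category), False)
```

For example, should_send(prefs, phones, "u1", "email", "marketing") is True only if an explicit opt-in record exists for that user, channel, and category.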
Rate Limiting + Carrier Throttling
Sending too many SMS messages or emails too fast triggers carrier blacklisting (your sender ID gets marked as spam) or rate-limit errors from third-party APIs. Apply per-user throttling (e.g., max 5 SMS per hour, max 3 marketing emails per day), enforced in Redis with sliding window counters. Apply per-API-account throttling to stay within APNs/Twilio/SendGrid rate limits. For a marketing campaign targeting 10M users, spread the send over hours (scheduled bulk send) rather than sending everything at once.
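Per-user throttling can be sketched with an in-memory sliding window. Note that this example keeps a log of timestamps per key (a sliding window log) rather than the counter variant the text mentions, and it uses a local dictionary in place of Redis; SlidingWindowLimiter is a hypothetical name.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record and permit the event if the key is under its limit."""
        now = self._clock()
        events = self._events[key]
        cutoff = now - self._window
        # Drop timestamps that have slid out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        if len(events) >= self._limit:
            return False
        events.append(now)
        return True


# Limits from the text: 5 SMS per hour, 3 marketing emails per day.
sms_limiter = SlidingWindowLimiter(limit=5, window_seconds=3600)
email_limiter = SlidingWindowLimiter(limit=3, window_seconds=24 * 3600)
```

A call such as sms_limiter.allow("user-42") returns False on the sixth SMS within an hour; a rejected marketing message would typically be delayed or dropped rather than retried immediately.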
