Advanced · 20 min read · Real-world Case Studies

Design Netflix

Optimizing global video streams using localized ISP-embedded Open Connect hardware caches.

What you'll learn

  • Open Connect CDN (ISP Colocation)
  • Content Pre-Staging (Proactive Push)
  • Microservices on AWS (200+ Services)
  • Recommendation Engine (Collaborative Filtering)
  • AV1 Codec + Per-Title Encoding
  • A/B Testing at Scale

Visual System Topology

Netflix — Streaming Platform Architecture

Control Plane (AWS):

  • Auth + API Gateway
  • Recommendation Engine (collaborative filtering ML)
  • Catalog + Metadata

Data Plane (Open Connect CDN):

  • Open Connect Appliances: installed at ISPs worldwide
  • ISP Colocation: 95% of traffic served here
  • AWS S3 (Origin): source of truth for content

Request paths:

  • Control plane: Client ↔ AWS (auth, catalog browse, recommendations, billing)
  • Data plane: Client → nearest Open Connect ISP box → HLS/DASH video stream (no AWS involved)

Concept Overview

Netflix serves 250M+ subscribers and streams 200M+ hours of video per day at peak. Its defining architectural decision is the separation of the control plane (AWS microservices) from the data plane (custom Open Connect CDN hardware at ISPs): once AWS has returned the playback URL, the video stream itself never touches AWS.

Functional Requirements:

  • Browse movie/TV catalog with personalized recommendations
  • Stream video with adaptive quality and minimal rebuffering
  • Download for offline viewing
  • Continue watching across devices
  • Multiple user profiles per account
  • Creator tools (studio/production partner uploads)

Non-Functional Requirements:

  • < 0.5% rebuffering rate — smooth streaming is the #1 metric
  • 25M+ concurrent streams at peak (evenings)
  • Multi-device support (TV, phone, browser, game console)
  • 99.99% availability for streaming

Capacity Estimation (250M subscribers):

  • Active streams at peak: 25M concurrent × avg 4 Mbps = 100 Tbps of bandwidth
  • Storage: 36,000 titles × avg 5 GB/quality level × 5 quality levels ≈ 900 TB encoded
  • Control plane API: login, catalog, search — millions of requests/hour (AWS-based)
  • Open Connect traffic: 95% of 100 Tbps = 95 Tbps served from ISP-colocated hardware
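
The estimates above are plain arithmetic, so they can be checked with a short script. This is a minimal sketch: the constants are the section's own assumptions, and the function names are made up for illustration.

```python
# Back-of-envelope capacity figures. Every input is an estimate from the
# list above, not a measured Netflix number.
CONCURRENT_STREAMS = 25_000_000   # peak concurrent streams
AVG_BITRATE_MBPS = 4              # average bitrate per stream
TITLES = 36_000
GB_PER_QUALITY_LEVEL = 5          # average encoded size per quality level
QUALITY_LEVELS = 5
OPEN_CONNECT_SHARE = 0.95         # fraction of traffic served from ISP caches


def peak_bandwidth_tbps(streams: int, mbps: float) -> float:
    """Total peak egress in Tbps (1 Tbps = 1,000,000 Mbps)."""
    return streams * mbps / 1_000_000


def encoded_storage_tb(titles: int, gb_per_level: float, levels: int) -> float:
    """Total encoded catalog size in TB (1 TB = 1,000 GB)."""
    return titles * gb_per_level * levels / 1_000


peak = peak_bandwidth_tbps(CONCURRENT_STREAMS, AVG_BITRATE_MBPS)
storage = encoded_storage_tb(TITLES, GB_PER_QUALITY_LEVEL, QUALITY_LEVELS)
edge = peak * OPEN_CONNECT_SHARE

print(f"peak bandwidth:      {peak:.0f} Tbps")
print(f"encoded storage:     {storage:.0f} TB")
print(f"served by ISP boxes: {edge:.0f} Tbps")
```

Changing a single input, for example the average bitrate, shows how sensitive the 100 Tbps figure is to the assumptions behind it.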

Key Architectural Pillars

1. Open Connect CDN (ISP Colocation)

Netflix builds and operates its own CDN hardware, called Open Connect Appliances (OCAs). These are custom-built servers (8–250 TB of storage, 100 Gbps network interfaces) installed inside ISP data centers and internet exchange points worldwide. When a subscriber in New York watches Stranger Things, the video bytes travel only from the ISP's machine room to the subscriber's home and never cross the public internet to AWS. This short path is a major reason rebuffering stays low even during peak hours.

Example: Without Open Connect, the path is subscriber → public internet → AWS us-east-1 → public internet → subscriber. With Open Connect, it is subscriber → ISP local network → Open Connect box in the same ISP data center → subscriber. Illustrative figures: about 5 ms latency and near-0% rebuffering with Open Connect, versus about 80 ms and 2%+ rebuffering without it.
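
One way to picture the control plane's role here is as a steering step: when the client asks for a playback URL, it gets back a ranked list of appliances to stream from. The sketch below is hypothetical and is not Netflix's actual steering logic. The `Appliance` fields, the host names, and the `max_load` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Appliance:
    """A hypothetical appliance record; all field names are assumptions."""
    host: str
    isp: str
    titles: frozenset
    load: float  # fraction of capacity in use, 0.0 to 1.0


def rank_appliances(appliances, client_isp, title_id, max_load=0.9):
    """Return hosts that can serve title_id, best candidate first.

    Appliances that lack the title or are above max_load are dropped.
    The rest are ordered with the client's own ISP first, then by load.
    """
    candidates = [
        a for a in appliances
        if title_id in a.titles and a.load < max_load
    ]
    candidates.sort(key=lambda a: (a.isp != client_isp, a.load))
    return [a.host for a in candidates]


fleet = [
    Appliance("oca-1.isp-a.example", "isp-a", frozenset({"t1", "t2"}), 0.70),
    Appliance("oca-2.isp-a.example", "isp-a", frozenset({"t1"}), 0.40),
    Appliance("oca-1.ix.example", "exchange", frozenset({"t1", "t2"}), 0.20),
    Appliance("oca-3.isp-a.example", "isp-a", frozenset({"t1"}), 0.95),
]

ranked = rank_appliances(fleet, "isp-a", "t1")
print(ranked)
```

In this toy fleet the lightly loaded exchange-point box still ranks behind both healthy in-ISP boxes, which mirrors the idea that staying inside the subscriber's ISP matters more than raw server load.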

2. Content Pre-Staging (Proactive Push)

Netflix knows its release schedule in advance, and for its originals it produced the content itself. Every night during off-peak hours, Netflix pushes video chunks for upcoming releases to all relevant Open Connect appliances worldwide. By the time subscribers click "play" on a new season premiere, the files are already on their ISP's hardware, which eliminates origin-server load spikes on launch night. For older content, ML models predict which titles will be watched this week, and those titles are pre-staged accordingly.

Example: Suppose a new season of Stranger Things premieres on a Friday. On Thursday night, Netflix pushes every episode (in 5 quality levels each) to the relevant Open Connect boxes at ISPs worldwide. Friday at 9 PM ET, 5M users click play simultaneously → nearly 100% CDN cache hits → the AWS origin receives essentially no streaming requests.
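
The nightly push can be thought of as a packing problem: each appliance has limited space, and the goal is to fill it with the titles most likely to be watched. The following is a minimal sketch under that framing. The greedy "predicted views per GB" heuristic, the function name, and the forecast numbers are all assumptions, not Netflix's actual placement algorithm.

```python
def plan_prestage(predictions, capacity_gb):
    """Pick titles to push to one appliance during the off-peak window.

    predictions: list of (title_id, predicted_views, size_gb) tuples.
    Greedy heuristic: highest predicted views per GB first, skipping any
    title that no longer fits. This approximates, but does not guarantee,
    the best possible packing.
    """
    ranked = sorted(predictions, key=lambda p: p[1] / p[2], reverse=True)
    chosen, used = [], 0.0
    for title_id, _views, size_gb in ranked:
        if used + size_gb <= capacity_gb:
            chosen.append(title_id)
            used += size_gb
    return chosen, used


# Hypothetical forecast for one appliance with 300 GB of free space.
forecast = [
    ("new-season-premiere", 900_000, 180.0),
    ("catalog-drama", 120_000, 60.0),
    ("niche-documentary", 5_000, 40.0),
    ("kids-series", 200_000, 50.0),
]

titles, used_gb = plan_prestage(forecast, capacity_gb=300.0)
print(titles, used_gb)
```

Here the premiere is staged first because its predicted demand per gigabyte is highest, and the niche documentary is left to be fetched on demand because it no longer fits.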

3. Microservices on AWS (200+ Services)

The control plane runs on AWS as 200+ microservices: Authentication, Profile, Catalog, Search, Recommendations, Playback URL generation, Analytics, Billing, A/B Testing, and others. Netflix pioneered Chaos Engineering: deliberately injecting failures into production (Chaos Monkey kills random service instances) to verify resilience. Services communicate synchronously via REST-style APIs and asynchronously via Kafka. Services are deployed across 3 AWS regions (us-east-1, us-west-2, eu-west-1) so that traffic can fail over if one region goes down.

Example: Chaos Monkey randomly terminates an EC2 instance running the Recommendation Service every business day. If that causes user-visible errors, the team gets paged. This forces proper design of resilient, stateless services — if any single instance dying causes an outage, the design is wrong.
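
The selection step of a Chaos Monkey style tool is small enough to sketch. This is an illustration of the idea only, not the real Chaos Monkey code. The function name, the 9-to-17 weekday window, and the instance IDs are assumptions, and the actual termination call is deliberately left to the caller.

```python
import random
from datetime import datetime


def pick_victim(instances, now, rng, start_hour=9, end_hour=17):
    """Return one instance ID to terminate, or None outside business hours.

    Restricting terminations to weekday working hours means engineers are
    at their desks when a weakness is exposed.
    """
    is_weekday = now.weekday() < 5               # Monday=0 ... Friday=4
    in_hours = start_hour <= now.hour < end_hour
    if not (is_weekday and in_hours) or not instances:
        return None
    return rng.choice(sorted(instances))         # sorted for reproducibility


# Hypothetical instance IDs for the Recommendation Service.
recommendation_instances = ["i-0a1", "i-0b2", "i-0c3"]
rng = random.Random(42)

weekday_pick = pick_victim(recommendation_instances, datetime(2024, 1, 2, 10, 0), rng)
weekend_pick = pick_victim(recommendation_instances, datetime(2024, 1, 6, 10, 0), rng)
print(weekday_pick, weekend_pick)
```

A real deployment would pass the chosen ID to the cloud provider's terminate call and then watch error rates. The point of the sketch is only that the victim is random, so no instance can be treated as special.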

4. Recommendation Engine (Collaborative Filtering)

Netflix's recommendation system is its core differentiator: about 80% of watched content comes from recommendations rather than search. The system combines: (1) Collaborative filtering: "users similar to you also watched X" (matrix factorization on watch history), (2) Content-based filtering: "you liked action thrillers, here are more", (3) Contextual signals: time of day, device type, day of week, (4) A/B testing: different recommendation algorithms run simultaneously for different user cohorts.

Example: Matrix factorization represents each user and each title as a 100-dimension vector. Users with similar tastes end up with vectors close together in latent space. "You might like X" means finding titles whose vectors score highly against your taste vector. The factorization runs on Spark over a 250M-user × 36K-title matrix that is updated daily.
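
To make the latent-vector idea concrete, here is a toy matrix factorization trained with plain stochastic gradient descent. It is a sketch under heavy simplification: 4 users, 5 titles, and 2 latent dimensions instead of 100, in pure Python rather than Spark. The watch scores (fraction of a title watched) and all hyperparameters are invented for illustration.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rmse(ratings, users, items):
    """Root-mean-square error of predictions on the observed entries."""
    se = sum((r - dot(users[u], items[i])) ** 2 for (u, i), r in ratings.items())
    return (se / len(ratings)) ** 0.5


def init_factors(n, k, rng):
    """n small random vectors of dimension k."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]


def train(ratings, users, items, lr=0.05, reg=0.01, epochs=500):
    """Update user and item vectors in place with regularized SGD."""
    k = len(users[0])
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - dot(users[u], items[i])
            for f in range(k):
                pu, qi = users[u][f], items[i][f]
                users[u][f] += lr * (err * qi - reg * pu)
                items[i][f] += lr * (err * pu - reg * qi)


def recommend(user, ratings, users, items):
    """Return the unseen item with the highest predicted score."""
    seen = {i for (u, i) in ratings if u == user}
    unseen = [i for i in range(len(items)) if i not in seen]
    return max(unseen, key=lambda i: dot(users[user], items[i]))


# (user, title) -> fraction watched. Users 0-1 favor titles 0-2,
# users 2-3 favor titles 3-4. Values are made up.
ratings = {
    (0, 0): 1.0, (0, 1): 0.9, (0, 3): 0.1,
    (1, 0): 0.9, (1, 1): 1.0, (1, 2): 0.8, (1, 3): 0.2,
    (2, 3): 1.0, (2, 4): 0.9, (2, 0): 0.1,
    (3, 3): 0.9, (3, 4): 1.0, (3, 1): 0.2,
}

rng = random.Random(0)
users = init_factors(4, 2, rng)
items = init_factors(5, 2, rng)

before = rmse(ratings, users, items)
train(ratings, users, items)
after = rmse(ratings, users, items)
best = recommend(0, ratings, users, items)
print(f"RMSE {before:.3f} -> {after:.3f}, suggestion for user 0: title {best}")
```

User 0 has not seen titles 2 and 4, so the suggestion is whichever of the two the learned vectors score higher. At production scale the same idea is distributed across a cluster, and the scoring step uses approximate nearest-neighbor search rather than a full scan.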

5. AV1 Codec + Per-Title Encoding

Netflix doesn't use the same bitrate for every video. Each title is analyzed for visual complexity and encoded at the minimum bitrate that maintains quality for that specific content. An animated cartoon needs far less bandwidth than a grainy, detail-rich nature documentary at the same visual quality. This "per-title encoding" reduces storage and bandwidth by 20–40%. On supported devices Netflix also uses AV1, which needs at least 30% less bitrate than H.264 at comparable quality.

Example: Animated series "Arcane" at 1080p: 2.5 Mbps (smooth, predictable motion). Documentary "Our Planet" at 1080p: 5.5 Mbps (complex natural textures, grain, motion). The viewer perceives the same quality, but the bandwidth is very different. At Netflix's scale the resulting CDN savings are large (this guide's rough estimate: $1B+/year).
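
The core decision in per-title encoding is choosing the lowest bitrate that still meets a quality target. The sketch below assumes each title has already been trial-encoded at several bitrates and scored on a 0–100 perceptual quality scale (a VMAF-style metric would be one option). The scores, the target of 93, and the function name are invented for illustration; the bitrates echo the example above.

```python
def pick_bitrate(measurements, target_quality):
    """Return the lowest bitrate (kbps) whose measured quality meets the target.

    measurements: list of (bitrate_kbps, quality_score) from trial encodes.
    Returns None if no trial encode reaches the target.
    """
    passing = [kbps for kbps, score in measurements if score >= target_quality]
    return min(passing) if passing else None


# Hypothetical trial-encode results at 1080p.
animation = [(1500, 88.0), (2500, 94.0), (4000, 96.0), (5500, 97.0)]
documentary = [(1500, 70.0), (2500, 82.0), (4000, 90.0), (5500, 93.5)]

TARGET = 93.0
animation_kbps = pick_bitrate(animation, TARGET)
documentary_kbps = pick_bitrate(documentary, TARGET)
print(animation_kbps, documentary_kbps)
```

With a fixed one-size-fits-all ladder, both titles would be encoded at the documentary's bitrate, and the animated title would waste more than half of its bandwidth.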

6. A/B Testing at Scale

Netflix runs hundreds of A/B experiments simultaneously. Each user cohort may see a different thumbnail image for a show, row ordering on the home screen, recommendation algorithm, UI layout, or even video encoding quality level. The experimentation platform randomly assigns users to treatment groups, measures engagement metrics (click-through rate, completion rate, retention), and determines statistically which variant wins. Features ship only if they win their A/B test with statistical significance.

Example: Thumbnail A/B test: 50% of users see a dramatic action shot for "Breaking Bad," and 50% see a character portrait. After 2 weeks, the variant with the higher click-through rate wins. The same methodology showed that personalized thumbnails (a different thumbnail per user, chosen from their taste profile) increased click-through rate by 20–30%.
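
Two pieces of an experimentation platform are easy to sketch: stable assignment of users to variants, and a significance check on the result. The code below is a minimal illustration, not Netflix's platform. It assumes hash-based bucketing and a two-proportion z-test, and the experiment name and click counts are made up.

```python
import hashlib
import math


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to a variant.

    Hashing the experiment name together with the user ID keeps a user in
    the same group on every visit, while different experiments get
    independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


variant = assign_variant("user-123", "thumbnail-test")

# Hypothetical results: action shot (A) vs character portrait (B).
z, p_value = two_proportion_z(clicks_a=1000, views_a=10_000,
                              clicks_b=1200, views_b=10_000)
print(f"user-123 sees variant {variant}; z = {z:.2f}, p = {p_value:.2g}")
```

With these made-up counts, the 10% versus 12% click-through gap gives a z-score above 4, so variant B would be declared the winner at any conventional significance level. A much smaller sample with the same rates would not clear that bar.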
