YouTube Architecture

Most "YouTube architecture" articles are system design fanfic. This one sticks to what's public and justifiable:

- Google / YouTube / Google Cloud material
- Papers by Google engineers
- Vitess maintainer docs
- Measurement studies of YouTube’s delivery network

Where the evidence is thin, I'll say so.

The scale

YouTube serves 2B+ users on the same global infrastructure that underpins Google's Media CDN. That network spans 200+ countries/territories and 1,300+ cities. Google describes YouTube's recommendation system as one of the largest and most sophisticated industrial recommenders in existence.

If you're building anything that moves bytes globally, YouTube is the reference architecture.

TL;DR

YouTube is: frontend apps + backend microservices running on Borg, backed by sharded MySQL (Vitess), Bigtable, and object storage (GCS/Colossus), fronted by a multi-tier CDN/cache for video and metadata.

That's the whole game. Now let's break it into parts.

The four request shapes

If you want to understand YouTube, stop thinking "one architecture." It's four architectures sharing one brand:

- Dynamic pages & APIs: watch page, comments, subscriptions, search, recommendations.
- Video bytes: huge sequential reads, adaptive bitrate, cache hierarchy.
- Thumbnails & small objects: tiny payloads, brutal QPS, filesystem pain.
- Metadata & relationships: users, videos, counters, subscriptions; heavy reads/writes under consistency constraints.

Each path has different bottlenecks.

1) Video delivery: CDN is the product

At YouTube scale, the "video service" is mostly a caching problem.
A measurement-based study of the YouTube video delivery cloud describes:

- A flat video ID space
- Multiple DNS namespaces representing a multi-tier logical cache hierarchy
- A 3-tier physical cache hierarchy (primary / secondary / tertiary)

The important mechanism:

Video ID → DNS mapping → logical namespace → physical cache location

That means you can:

- add servers
- move load
- change cache policy

by updating mappings, not deploying application code.

Protocol behavior

Google’s Media CDN stack uses modern transport:

- QUIC / HTTP/3
- TLS 1.3
- modern congestion control (like BBR)

And on the client side, YouTube uses adaptive streaming:

- DASH / HLS
- player switches renditions based on bandwidth + buffer health

So "watching a video" is really:

- fetch manifest
- fetch segments
- adjust bitrate
- keep the buffer alive
- avoid rebuffering

The origin should be the last place traffic goes. If origin load is high, something upstream is failing.

2) Dynamic requests: microservices on Borg

For watch pages, feeds, comments, subscriptions, etc., the common public model is:

Client → Edge → LB/API layer → backend microservices → data layer

This isn't officially published as an exact diagram by YouTube, but the shape aligns with:

- how Google runs large services
- what’s visible from delivery behavior
- how system design analyses describe it

What is well-supported publicly: YouTube runs as a massive workload inside Google's internal cluster infrastructure (Borg).

Borg matters because YouTube has two very different workloads:

- latency-sensitive serving (APIs, page assembly)
- batch/throughput jobs (transcoding, analytics)

Borg is designed to run both types efficiently:

- scheduling and bin-packing for utilization
- fast recovery when machines die
- isolation and admission control to prevent one workload from eating the cluster

If your architecture assumes a stable fleet, you’re already behind. At YouTube scale, machines failing is background noise.
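To make the "update mappings, not application code" idea from the video delivery section concrete, here is a minimal sketch. It is not YouTube's implementation: the bucket count, the hash, the class name `CacheDirectory`, and the `*.example` hostnames are all assumptions chosen for illustration. A flat video ID hashes to a stable logical bucket, and a separate, mutable table (standing in for the DNS layer) maps each bucket to a physical cache.

```python
import hashlib
from typing import Dict, List

NUM_LOGICAL_BUCKETS = 64  # arbitrary size for the illustrative logical namespace


def logical_bucket(video_id: str, buckets: int = NUM_LOGICAL_BUCKETS) -> int:
    """Hash a flat video ID to a stable logical bucket (hypothetical scheme)."""
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


class CacheDirectory:
    """Mutable table from logical bucket to physical cache host.

    This plays the role of the DNS mapping layer: operators edit the table,
    callers keep calling resolve() unchanged.
    """

    def __init__(self, hosts: List[str], buckets: int = NUM_LOGICAL_BUCKETS):
        self.buckets = buckets
        # Spread buckets round-robin over the initial hosts.
        self.assignment: Dict[int, str] = {
            b: hosts[b % len(hosts)] for b in range(buckets)
        }

    def resolve(self, video_id: str) -> str:
        """Return the cache host currently responsible for this video ID."""
        return self.assignment[logical_bucket(video_id, self.buckets)]

    def move_bucket(self, bucket: int, new_host: str) -> None:
        """Shift one bucket's traffic to another host without touching callers."""
        self.assignment[bucket] = new_host


directory = CacheDirectory(["cache-a.example", "cache-b.example"])
video_id = "dQw4w9WgXcQ"

before = directory.resolve(video_id)
# An operator adds a server and moves this video's bucket onto it.
directory.move_bucket(logical_bucket(video_id), "cache-c.example")
after = directory.resolve(video_id)

print(before, "->", after)
```

The point of the two-step lookup is that the video-to-bucket hash never changes, so moving load or adding a server is a data change in one table rather than a client or application release.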
3) Thumbnails: "small files" are a trap

Most teams learn this too late: small objects at scale are harder than big objects.

Watch pages can show dozens of thumbnails. That creates massive request rates for tiny payloads.

Public sources describe Bigtable being used at YouTube for replicated data such as images/thumbnails and other large key-value datasets.

Why that direction makes sense:

A filesystem full of billions of tiny objects becomes an ops tax:

- inode pressure
- cache warmup pain
- directory scaling limits
- constant cold-cache misses

Bigtable-style storage lets you:

- pack data better than "one file per object"
- replicate across locations
- sit behind distributed caching layers

If you treat thumbnails like "just static assets," you get paged by thumbnails.

4) The most proven part: MySQL scaling with Vitess

This is where we have the cleanest, primary evidence because Vitess is open source and maintainer-documented.

YouTube's metadata started on MySQL. Then growth created three classic failure modes:

- replication lag (async replication under writes)
- too many connections (app tiers melting MySQL)
- tables too big (vertical scaling stops working)

The usual story is "move off SQL." YouTube's story is: build infrastructure to keep SQL.

Vitess in one line

Vitess is a control plane that makes MySQL sharding operable.

From the Vitess GitHub README:

Vitess was a core component of YouTube’s database infrastructure from 2011, and grew to encompass tens of thousands of MySQL nodes.

That sentence is the headline. Not because it's impressive. Because it tells you the real trade-off YouTube made:

They accepted sharding complexity and paid it down with automation.
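To give "sharding complexity" a concrete shape, here is a minimal sketch of range-based shard routing with a split operation. It is not Vitess code and does not use any Vitess API: the hash in `keyspace_id`, the `ShardMap` class, and the shard names are hypothetical. The idea it illustrates is that each shard owns a contiguous range of a hashed key space, so splitting one hot shard moves only the keys in that shard's range.

```python
import bisect
import hashlib

KEYSPACE_SIZE = 2**64  # illustrative 64-bit hashed key space


def keyspace_id(sharding_key: str) -> int:
    """Hypothetical hash of a sharding key (e.g. a user ID) into the key space."""
    digest = hashlib.sha256(sharding_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ShardMap:
    """Maps contiguous key ranges to shard names."""

    def __init__(self):
        # Range i covers [starts[i], starts[i + 1]); the last range ends at KEYSPACE_SIZE.
        self.starts = [0]
        self.names = ["shard-0"]

    def route(self, sharding_key: str) -> str:
        """Find the shard whose range contains this key's hashed ID."""
        idx = bisect.bisect_right(self.starts, keyspace_id(sharding_key)) - 1
        return self.names[idx]

    def split(self, name: str, new_name: str) -> None:
        """Split one shard's range in half; every other shard is untouched."""
        idx = self.names.index(name)
        lo = self.starts[idx]
        hi = self.starts[idx + 1] if idx + 1 < len(self.starts) else KEYSPACE_SIZE
        mid = (lo + hi) // 2
        self.starts.insert(idx + 1, mid)
        self.names.insert(idx + 1, new_name)


shards = ShardMap()
shards.split("shard-0", "shard-1")  # two shards, each owning half the key space

users = [f"user-{i}" for i in range(1000)]
before = {u: shards.route(u) for u in users}

shards.split("shard-1", "shard-2")  # pretend shard-1 got hot
after = {u: shards.route(u) for u in users}

moved = [u for u in users if before[u] != after[u]]
print(f"{len(moved)} of {len(users)} keys moved, all from shard-1")
```

Only keys that lived on the split shard can move, which is the isolation property: users on `shard-0` never notice the operation. The hard part, and the part a real system has to automate, is copying the data and cutting traffic over safely, which this sketch deliberately leaves out.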
The key Vitess primitives

VTGate (query router):

- routes queries to the right shard
- pools connections (protects MySQL from connection storms)
- adds query safety / guardrails (kills dangerous queries)

VTTablet (per-shard agent):

- manages the shard's MySQL
- participates in failover
- supports operational workflows (backups, reparenting)

Resharding:

- split/merge shards with minimal downtime
- required when "one shard gets hot" becomes your normal day

A subtle but huge lesson here:

Sharding isn't just about write throughput. It’s about isolation and blast radius.

- hot users shouldn't slow down everyone
- one failing shard shouldn’t take the site down
- cache locality should improve, not degrade

Upload → transcode → store → serve

Public deep-dives (not official, but consistent with the way modern streaming systems work) describe:

- chunked uploads (10–50 MB chunks) so uploads can resume and parallelize
- edge ingestion so the first hop is local
- async transcoding workers producing multiple renditions (codec/bitrate/resolution)
- storage optimized for large blobs (GCS/Colossus patterns)
- delivery optimized for cacheability + segment fetch behavior

While we don't have a single canonical "YouTube transcoding pipeline" spec from Google, the above is aligned with:

- Media CDN patterns
- common streaming architectures
- public system design analyses

The important mental model: upload latency and processing latency are different products and must be decoupled with queues + idempotency + backpressure.

Recommendations: two-stage DNN, watch-time driven

This is another area with strong primary sources. Google's paper "Deep Neural Networks for YouTube Recommendations" describes a two-stage system:

1) Candidate generation: reduce a corpus of millions of videos (the scale the paper gives) to a few hundred candidates, fast.
2) Ranking: score those candidates using richer features.

The key optimization detail: watch time is a first-class objective. Clicks are easy to game. Watch time is harder.
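The two-stage shape can be sketched in a few lines. This is a toy under loud assumptions, not the system from the paper: the random embeddings, the synthetic `corpus`, the brute-force dot product in `generate_candidates`, and the hand-written formula in `expected_watch_seconds` are all stand-ins for learned models and real retrieval infrastructure. What it shows is the division of labor: a cheap score over everything, then a more expensive watch-time score over a short list.

```python
import heapq
import random

random.seed(0)
DIM = 8  # illustrative embedding size

# Synthetic corpus: every field here is made up for the sketch.
corpus = {
    f"v{i}": {
        "embedding": [random.uniform(-1, 1) for _ in range(DIM)],
        "duration_s": random.choice([60, 300, 900]),
        "avg_watch_fraction": random.random(),
    }
    for i in range(10_000)
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_embedding, k=200):
    """Stage 1: cheap similarity score over the whole corpus, keep the top k."""
    return heapq.nlargest(
        k, corpus, key=lambda vid: dot(user_embedding, corpus[vid]["embedding"])
    )


def expected_watch_seconds(user_embedding, vid):
    """Stage 2 score: stand-in for a learned model that predicts watch time."""
    video = corpus[vid]
    affinity = max(dot(user_embedding, video["embedding"]), 0.0)
    return affinity * video["avg_watch_fraction"] * video["duration_s"]


def recommend(user_embedding, k=200, n=10):
    """Run both stages: narrow to k candidates, then rank by expected watch time."""
    candidates = generate_candidates(user_embedding, k)
    ranked = sorted(
        candidates,
        key=lambda vid: expected_watch_seconds(user_embedding, vid),
        reverse=True,
    )
    return ranked[:n]


user = [random.uniform(-1, 1) for _ in range(DIM)]
top = recommend(user)
print(top)
```

Note that the ranking stage never sees a click signal at all in this sketch. Swapping the stage 2 score for a click-probability estimate would leave the code structure identical and change what the feed rewards, which is why the metric is the architectural decision.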
Also: the system explicitly includes freshness features so new uploads can surface without waiting for long historical interaction data.

If you're building a recommender, your biggest architecture decision isn't the model. It’s the metric. Whatever you optimize for becomes the behavior you breed.

What to learn

- Design by request shape. Video bytes, thumbnails, dynamic metadata, and writes are different systems.
- Use caches as tiers, not a checkbox. Edge → regional → origin, with clear hit-rate goals.
- Make routing a control plane. DNS/service discovery should let you shift load without code.
- Shard for isolation. Throughput is a bonus. Blast radius is the real win.
- Pick a metric users can’t game easily. Watch time beats clicks for most media feeds.

You can copy the technologies or copy the principles. But you can’t ignore the constraints.

Thanks for reading. Y'all are the best.