A Million Requests Per Second: Inside High-Traffic Systems

A million requests per second. That's 60 million per minute. 3.6 billion per hour. Companies like Google, Netflix, and Discord handle this routinely. How? The answer involves clever engineering, ruthless prioritization, and accepting that some things will fail.

The Scale Intuition Problem

Humans struggle to grasp differences of several orders of magnitude. A thousand feels big. A million feels like "a lot of thousands." But the difference is qualitative, not just quantitative.

At 10 requests per second, you can log everything, debug interactively, and deploy whenever you want. At 1,000 rps, you have to start being deliberate about logging, deployments, and monitoring. At 1,000,000 rps, everything is different.

Consider: at 1M rps, if 0.01% of requests fail, that's 100 failures per second. Your error dashboard is on fire constantly. Even "five nines" (a 99.999% success rate) still means 10 failed requests per second.

Pattern 1: Cache Everything

The fastest request is one you don't process. High-scale systems cache aggressively:

Browser caching: Set appropriate headers, and users don't even hit your servers. A year-long cache for static assets eliminates billions of requests.
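In code, this is just a response header. A minimal sketch assuming a Flask-style handler (the route and directory names are hypothetical):

```python
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/assets/<path:filename>")
def asset(filename):
    # Fingerprinted filenames (e.g. app.3f9c2e.js) change whenever the content does,
    # so browsers can safely cache them for a full year.
    resp = send_from_directory("static", filename)
    resp.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return resp
```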

CDN caching: Cloudflare, Fastly, CloudFront—put your content on hundreds of edge servers worldwide. Users hit the nearest one.

Application caching: Redis, Memcached—keep hot data in memory. A database query that takes 50ms becomes a cache lookup that takes 1ms.
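The usual shape is cache-aside: check the cache, fall back to the database, write the result back. A rough sketch with the redis-py client (db.fetch_user is a stand-in for whatever data layer you have):

```python
import json
import redis  # redis-py

cache = redis.Redis(host="localhost", port=6379)

def get_user(user_id, db):
    key = f"user:{user_id}"
    hit = cache.get(key)                     # in-memory lookup, ~1 ms
    if hit is not None:
        return json.loads(hit)
    user = db.fetch_user(user_id)            # the slow ~50 ms database query
    cache.setex(key, 300, json.dumps(user))  # keep it hot for five minutes
    return user
```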

Precomputation: If you can compute it ahead of time, do so. A recommendation engine that runs hourly beats one that computes recommendations on every request.
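Precomputation is often nothing more than a scheduled job writing into the same cache the request path reads from. A sketch where db, cache, and model are hypothetical stand-ins:

```python
import json

def precompute_recommendations(db, cache, model):
    # Run hourly from a scheduler (cron, Airflow, etc.) instead of once per request.
    for user_id in db.all_user_ids():
        recs = model.recommend(user_id)                         # the expensive part
        cache.setex(f"recs:{user_id}", 3600, json.dumps(recs))  # request handlers just read this key
```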

Netflix caches entire video chunks at ISPs. YouTube caches popular videos at every edge location. Wikipedia serves most pages from cache—the application servers behind those caches see only a small fraction of the traffic.

Pattern 2: Distribute the Load

One server can't handle a million requests per second. You need thousands working together.

Horizontal scaling: Add more identical servers behind a load balancer. Each handles a fraction of traffic. Need more capacity? Add more servers.
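At its simplest, a load balancer is doing something like this round-robin rotation (hostnames are made up; real balancers add health checks, connection pooling, and weighting):

```python
import itertools

servers = ["app-01:8080", "app-02:8080", "app-03:8080"]
rotation = itertools.cycle(servers)

def pick_backend():
    # Each server ends up with roughly 1/N of the traffic.
    return next(rotation)
```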

Geographic distribution: Users in Tokyo hit Tokyo servers. Users in London hit London servers. Latency drops, and no single region handles everything.

Microservices: Instead of one big application, many small services. Each scales independently based on its own load profile.

Database sharding: Split data across multiple database servers. Users A-M go to one shard, N-Z to another. Each database handles less load.
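In practice, many systems hash the key instead of splitting alphabetically, because hashes spread load more evenly. A sketch with hypothetical shard names:

```python
import hashlib

SHARDS = ["users-db-0", "users-db-1", "users-db-2", "users-db-3"]

def shard_for(user_id: str) -> str:
    # Hashing avoids hot spots (far more usernames start with "s" than with "z").
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The catch is resharding: adding shards later means re-mapping keys, which is why many systems reach for consistent hashing or a lookup service instead of a bare modulo.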

Stripe processes millions of payments by routing each to one of many database shards. Uber runs thousands of microservices, each handling one narrow task.

Pattern 3: Asynchronous Processing

Not everything needs to happen immediately. Deferring work smooths load spikes and improves perceived performance.

Message queues: Instead of processing a request synchronously, drop it on a queue and return immediately. Workers process the queue at their own pace.
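Here's the shape of it, sketched in-process with Python's standard library (production systems use RabbitMQ, SQS, Kafka, or similar, but the pattern is the same):

```python
import queue
import threading
import time

jobs = queue.Queue()

def handle_signup(payload):
    jobs.put(payload)               # enqueue and return in microseconds
    return {"status": "accepted"}   # the caller never waits for the slow part

def worker():
    while True:
        payload = jobs.get()        # workers drain the queue at their own pace
        time.sleep(0.5)             # stand-in for the slow work, e.g. sending an email
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```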

Eventual consistency: Accept that not everything needs to be immediately consistent. Your follower count might be a few seconds behind reality. That's okay.

Batch processing: Instead of sending one email per request, batch them and send every minute. Same result, 60x fewer operations.
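A batching sketch: buffer work as it arrives and flush on a timer, so sixty individual sends become one downstream call per minute (the print is a placeholder for the real batch API):

```python
import threading

pending = []
lock = threading.Lock()

def queue_notification(message):
    with lock:
        pending.append(message)      # buffered, not sent yet

def flush():
    with lock:
        batch = list(pending)
        pending.clear()
    if batch:
        print(f"sending {len(batch)} notifications in one call")  # placeholder send
    threading.Timer(60.0, flush).start()   # re-arm: flush again in a minute

flush()
```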

Instagram doesn't update your feed in real time—it batches updates and refreshes periodically. Amazon's inventory counts aren't perfectly accurate; they reconcile asynchronously.

Pattern 4: Graceful Degradation

At scale, failures are constant. The question isn't whether things will fail, but how to fail gracefully.

Circuit breakers: If a dependency is failing, stop calling it. Return cached data, a default response, or a graceful error. Don't let one failure cascade everywhere.
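The core of a circuit breaker fits in a page. A minimal sketch; real libraries (pybreaker, resilience4j, Polly) add richer state machines, metrics, and per-dependency configuration:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before trying again
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()        # circuit open: skip the failing dependency entirely
            self.opened_at = None        # cooldown elapsed: let one attempt through
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()
```

A call site might look like breaker.call(lambda: ratings.fetch(user), fallback=lambda: None): a flaky ratings service degrades to missing stars instead of cascading timeouts.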

Load shedding: When overloaded, reject some requests cleanly rather than failing all requests poorly. 90% of users getting fast responses beats 100% getting timeouts.
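Load shedding can be as simple as a bounded concurrency limit with fail-fast rejection; a sketch (the limit of 1,000 is arbitrary and would be tuned per service):

```python
import threading

MAX_IN_FLIGHT = 1000
slots = threading.Semaphore(MAX_IN_FLIGHT)

def handle(request, process):
    if not slots.acquire(blocking=False):   # at capacity: reject cleanly instead of queueing
        return 503, "Busy, please retry"
    try:
        return 200, process(request)
    finally:
        slots.release()
```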

Feature flags: Turn off expensive features during traffic spikes. The recommendation engine can wait; checkout cannot.
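A feature flag is usually just a config lookup in the hot path; a rough sketch with hypothetical flag and service names:

```python
FLAGS = {"recommendations": True, "reviews": True}  # flipped at runtime via config or an admin UI

def build_product_page(product, recommender, reviews_service):
    page = {"product": product, "checkout_enabled": True}           # the core path always renders
    if FLAGS["recommendations"]:
        page["recommendations"] = recommender.for_product(product)  # expensive extras are optional
    if FLAGS["reviews"]:
        page["reviews"] = reviews_service.top(product)
    return page
```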

Netflix's system is famous for this. If the recommendation service fails, you get a generic list. If ratings fail, stars disappear. The core experience—streaming video—keeps working.

The Database Problem

Everything above lives in the application layer. The database is where scale gets really hard.

Databases were designed for correctness, not throughput. ACID guarantees, strong consistency, complex queries—these are expensive. At million-rps scale, traditional approaches break.

Solutions:

  • Read replicas: Write to one primary, read from many replicas. Reads scale horizontally (sketched after this list).
  • NoSQL: Trade some SQL features for scale. DynamoDB, Cassandra, MongoDB offer different trade-offs.
  • Caching layer: Don't hit the database if you don't have to. Most reads should be cache hits.
  • Denormalization: Duplicate data so you don't need joins. Violates textbook design but enables scale.
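Here's a sketch of the read/write split behind the read-replica bullet above, using psycopg2 (hostnames and the crude SQL inspection are placeholders; real deployments usually push this into a pooler or the driver):

```python
import random
import psycopg2

primary = psycopg2.connect(host="db-primary.internal", dbname="app")
replicas = [psycopg2.connect(host=f"db-replica-{i}.internal", dbname="app") for i in range(3)]

def run(sql, params=()):
    is_write = sql.lstrip().split()[0].upper() in ("INSERT", "UPDATE", "DELETE")
    conn = primary if is_write else random.choice(replicas)   # one box takes writes, many take reads
    with conn.cursor() as cur:
        cur.execute(sql, params)
        if is_write:
            conn.commit()
            return None
        return cur.fetchall()
```

One caveat: replicas lag the primary slightly, which is exactly the eventual-consistency trade-off from Pattern 3.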

Discord stores billions of messages across Cassandra clusters. Google's Spanner database spans the globe while maintaining consistency (using atomic clocks, no less).

The Cost of Scale

Building for a million rps is expensive—in engineering time, infrastructure cost, and operational complexity.

Most systems don't need it. If you're handling 100 rps, optimizing for a million is wasted effort. Start simple. Add complexity as load demands it. The goal is solving your actual problem, not preparing for hypothetical scale.

But when you do need scale, these patterns are proven. Companies have figured out how to serve billions of users reliably. The knowledge exists. Applying it thoughtfully is the art.

Building for Scale?

MKTM Studios designs systems that grow with your business. Let's talk architecture.

Start the Conversation