On December 7, 2021, the internal network in AWS's us-east-1 region began to buckle. Over the next several hours, services across the internet crumbled: Disney+, Venmo, Tinder, Instacart, McDonald's apps, Amazon's own Ring doorbells. Even Amazon's warehouse robots stopped working. AWS holds roughly a third of the cloud infrastructure market. When it stumbles, the world notices.
The 2021 us-east-1 Outage
The trigger was seemingly minor: an automated process to scale capacity in AWS's internal network. The process set off a surge of connection activity that overwhelmed networking devices, creating a traffic jam in the internal network that connects AWS services to one another.
When internal AWS services couldn't communicate, the cascades began:
- EC2: Couldn't launch new instances because it couldn't reach internal services
- Lambda: Functions failed to execute
- DynamoDB: Tables became inaccessible
- ECS/EKS: Container management broke
- CloudWatch: Monitoring went dark—when you most needed it
The irony: AWS's own status dashboard couldn't update because it depended on the same infrastructure that was failing. Teams knew something was wrong but had no visibility into what.
The Pattern of Cloud Outages
Major cloud outages share common patterns:
Small triggers, big consequences: A misconfigured deployment. An overloaded network device. A bad certificate rotation. The trigger is usually mundane. But cloud systems are so interconnected that failures cascade unpredictably.
Control planes fail: Cloud providers distinguish between "control plane" (managing resources) and "data plane" (actually running your stuff). Control plane outages are especially brutal—existing resources might keep running, but you can't manage them, scale them, or see their status.
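To make the distinction concrete, here's a minimal Python sketch (the application health endpoint is an assumption, and it presumes boto3 is installed with credentials configured): a control-plane probe calls the EC2 API, while a data-plane probe hits the running workload directly. In a control-plane outage, the first can fail while the second stays healthy.

```python
# Sketch only: separate "can I manage resources?" from "is my workload serving?".
# Assumes boto3 with AWS credentials configured; the app endpoint is hypothetical.
import urllib.request

import boto3
from botocore.exceptions import BotoCoreError, ClientError

APP_ENDPOINT = "https://app.example.com/healthz"  # hypothetical data-plane check


def control_plane_reachable() -> bool:
    """Control plane: can we still call the management APIs (e.g. DescribeInstances)?"""
    try:
        boto3.client("ec2", region_name="us-east-1").describe_instances(MaxResults=5)
        return True
    except (BotoCoreError, ClientError):
        return False


def data_plane_healthy() -> bool:
    """Data plane: is the already-running workload still answering requests?"""
    try:
        with urllib.request.urlopen(APP_ENDPOINT, timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    # A common shape of control-plane outages: the API is unreachable, the workload is fine.
    # Alert on the data plane; treat control-plane failures as "cannot scale or change".
    print("control plane:", "ok" if control_plane_reachable() else "unreachable")
    print("data plane:   ", "ok" if data_plane_healthy() else "unhealthy")
```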
Monitoring breaks when you need it: Monitoring systems often depend on the same infrastructure being monitored. When things fail catastrophically, you're flying blind precisely when visibility matters most.
Recovery takes hours: Even after identifying the problem, recovery is slow. Systems must be brought back up carefully to avoid recreating the cascade. The backlog of failed requests must drain. Then customers must clear backlogs of their own.
The us-east-1 Problem
The us-east-1 region (Northern Virginia) is AWS's oldest and largest. It hosts AWS's core internal services. Many customers default to it. Some AWS services are only available there.
This creates concentration risk. Outages in us-east-1 are more common and more impactful than those in other regions. The 2017 S3 outage, the 2020 Kinesis outage, the 2021 network outage: all us-east-1.
AWS's advice: use multiple regions. In practice, this is expensive and complex. Multi-region architectures require duplicating everything—databases, services, state—and keeping them synchronized. Many organizations accept the risk rather than pay the cost.
What Google and Azure Teach Us
It's not just AWS. Google Cloud has had spectacular outages of its own. A 2019 incident took down YouTube, Gmail, and Google Cloud Platform for hours; the cause was a configuration change intended for a small group of servers that was mistakenly applied across several regions, starving the network of capacity.
Azure's major outages often involve Active Directory or DNS—foundational services everything else depends on. A 2018 outage caused by a lightning strike and subsequent cooling failure took down services for hours.
The lesson: all clouds fail. The question is how often, for how long, and what you can do about it.
The Illusion of Availability
Cloud providers advertise impressive availability numbers. S3 is designed for 99.99% availability (about 52 minutes of downtime per year), with an SLA that pays credits below 99.9%. EC2's SLA is 99.99% for multi-AZ deployments.
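For intuition, the downtime budget behind each headline number is simple arithmetic; here's a quick sketch:

```python
# How much downtime per year each availability target actually allows
# (using a 365.25-day year).
MINUTES_PER_YEAR = 365.25 * 24 * 60

for target in (0.999, 0.9995, 0.9999, 0.99999):
    budget = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} availability -> {budget:6.1f} minutes of downtime per year")

# Output:
# 99.900% availability ->  526.0 minutes of downtime per year
# 99.950% availability ->  263.0 minutes of downtime per year
# 99.990% availability ->   52.6 minutes of downtime per year
# 99.999% availability ->    5.3 minutes of downtime per year
```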
But these numbers come with asterisks:
- SLAs cover credits, not damages—you get refunds, not compensation for lost business
- Availability is measured regionally—a global outage affecting multiple regions might not "count"
- Control plane vs. data plane distinctions matter—your instance might be running while you can't manage it
The practical availability most customers experience is lower than the headline numbers suggest.
Building for Failure
The mature response: assume failures will happen and design accordingly.
Multi-region deployment: The gold standard. Run your application in multiple regions with automatic failover. Expensive and complex, but eliminates single-region risk.
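As a simplified illustration (the regional endpoints below are hypothetical; production deployments usually put this logic in DNS failover or a global load balancer rather than in application code), client-side failover can be as basic as trying regions in order:

```python
# Sketch of client-side regional failover across hypothetical per-region endpoints.
import urllib.error
import urllib.request

REGIONAL_ENDPOINTS = [
    "https://api.us-east-1.example.com",  # primary (hypothetical)
    "https://api.us-west-2.example.com",  # secondary (hypothetical)
    "https://api.eu-west-1.example.com",  # tertiary (hypothetical)
]


def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each regional endpoint in order; raise only if every region fails."""
    last_error = None
    for endpoint in REGIONAL_ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint + path, timeout=timeout) as resp:
                return resp.read()  # success: stop at the first healthy region
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # remember the failure, move on to the next region
    raise RuntimeError(f"all regions failed; last error: {last_error!r}")
```

The hard part isn't the failover code; it's keeping the data those regions serve in sync, which is where most of the cost of multi-region lives.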
Multi-cloud: Even more resilient—and even more complex. Running on both AWS and GCP means you're not dependent on either. But managing two clouds is more than twice the work.
Graceful degradation: Can your application do something useful when dependencies fail? Show cached data? Accept writes to a queue? Display a useful error? Not all features are equally critical.
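A minimal sketch of the idea, assuming an in-process cache and a hypothetical fetch_recommendations() call standing in for a downstream service:

```python
# Graceful degradation sketch: fresh data if possible, stale cache if not,
# and a generic non-personalized fallback as a last resort.
import time

CACHE = {}  # user_id -> (timestamp, recommendations)
CACHE_TTL_SECONDS = 300


def fetch_recommendations(user_id: str) -> list[str]:
    """Stand-in for a call to a downstream service; raises during an outage."""
    raise ConnectionError("recommendation service unavailable")


def recommendations_with_fallback(user_id: str) -> list[str]:
    try:
        fresh = fetch_recommendations(user_id)
        CACHE[user_id] = (time.time(), fresh)  # refresh the cache on success
        return fresh
    except ConnectionError:
        cached = CACHE.get(user_id)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]  # stale-but-useful beats an error page
        return ["bestseller-1", "bestseller-2"]  # generic fallback, clearly degraded
```

The point isn't the cache; it's deciding, feature by feature, what "degraded but useful" looks like before an outage forces the decision.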
Circuit breakers: Stop calling failing services. Return cached responses or errors rather than timing out repeatedly. Failing fast is better than failing slow.
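A bare-bones version is just a failure counter plus a cool-down timer. The sketch below is illustrative rather than production-grade; real systems usually reach for an off-the-shelf implementation:

```python
# Minimal circuit-breaker sketch: after enough consecutive failures, stop calling
# the dependency for a cool-down period and fail fast instead of timing out.
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failure_count = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        half_open = False
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")  # skip the doomed call
            half_open = True  # cool-down elapsed: allow one trial request
        try:
            result = fn()
        except Exception:
            self.failure_count += 1
            if half_open or self.failure_count >= self.max_failures:
                self.opened_at = time.time()  # (re)open the circuit
            raise
        self.opened_at = None   # success closes the circuit
        self.failure_count = 0
        return result
```

Wrap calls to a flaky dependency, for example breaker.call(lambda: client.get_profile(user_id)) with whatever client you already use, and handle the fast failure the same way you would handle a timeout.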
Offline capabilities: Can users do anything without connectivity? Local storage, offline modes, and eventual sync reduce dependence on always-available backends.
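One sketch of the pattern, using a local JSON-lines file as the offline queue and a stand-in sync_to_backend() function (both are assumptions for illustration): writes always succeed locally and are replayed when the backend is reachable again.

```python
# Offline-write-queue sketch: record the user's action locally no matter what,
# then flush pending actions whenever connectivity returns.
import json
from pathlib import Path

QUEUE_FILE = Path("pending_writes.jsonl")  # hypothetical local store


def record_action(action: dict) -> None:
    """Append the action locally; this works even with no network at all."""
    with QUEUE_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(action) + "\n")


def sync_to_backend(action: dict) -> bool:
    """Stand-in for the real upload call; return False while offline."""
    return False


def flush_queue() -> None:
    """Replay pending actions; keep whatever still fails for the next attempt."""
    if not QUEUE_FILE.exists():
        return
    lines = QUEUE_FILE.read_text(encoding="utf-8").splitlines()
    pending = [json.loads(line) for line in lines if line.strip()]
    still_pending = [a for a in pending if not sync_to_backend(a)]
    QUEUE_FILE.write_text(
        "".join(json.dumps(a) + "\n" for a in still_pending), encoding="utf-8"
    )
```

Real clients also need de-duplication (so replayed writes aren't applied twice) and conflict handling, which is where "eventual sync" gets hard.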
The Uncomfortable Reality
We've built the internet on shared infrastructure. This creates efficiency—cloud providers achieve economies of scale no individual company could match. But it also creates correlation—when Amazon has a bad day, we all have a bad day.
The 2021 outage affected warehouse robots and Ring doorbells because everything connects to the same cloud. The efficiency that makes cloud computing possible is the same interconnection that makes failures cascade.
There's no perfect answer. Multi-cloud is expensive. Multi-region is complex. On-premises is a step backward for most organizations. The practical reality: accept that outages will happen, build for resilience where it's worth the cost, and have a plan for when things break.
Because they will break. The only question is whether you're ready.
Building Resilient Infrastructure?
MKTM Studios designs systems that handle cloud failures gracefully. Let's discuss your architecture.
Get in Touch