AWS Cloud Engineer — October 2025

When DNS Ate the Cloud

A first-person account of the October 19-20, 2025 AWS us-east-1 outage, in which a race condition in DynamoDB's DNS management system took down 70,000+ organizations

The Moment It Started

It was 11:48 PM PDT on October 19th when the first customer calls started coming in. I was on the AWS support team that night, and what began as isolated DynamoDB connection errors quickly escalated into something we'd never seen before.

"My DynamoDB calls are timing out," the first customer reported. Then another: "EC2 instances won't launch." Within minutes: "Lambda functions failing," "ECS tasks stuck," "My entire application is down."

The pattern that emerged was both subtle and devastating: everything in us-east-1 that touched DynamoDB was failing, but each service was failing in its own unique way. Customers were seeing their own infrastructure as the problem when the actual failure was completely invisible to them.

The Invisible Failure: The hardest incidents to triage are the ones where the root cause is invisible to customers. Their monitoring showed their infrastructure failing, but the actual problem was deep inside AWS's internal DNS management system.

The Root Cause: A Race Condition in DNS Management

The failure originated in DynamoDB's internal automated DNS management system, which was designed with two independent components for high availability:

DNS Planner

Monitors load balancer health and creates DNS update plans

DNS Enactor

Applies DNS changes via Route 53 API calls

The race condition unfolded like this:

1. 11:48 PM PDT: DNS Enactor #1 experiences unusually high delays. It is processing a routine DNS update plan, but running much slower than normal.
2. 11:49 PM PDT: The DNS Planner generates new plans. Unaware of Enactor #1's delays, it keeps creating fresh DNS update plans.
3. 11:50 PM PDT: DNS Enactor #2 starts processing the newer plans. It begins applying the latest DNS plans and runs its cleanup process.
4. 11:51 PM PDT: The race condition fires. Enactor #1 finishes its delayed run just as Enactor #2's cleanup deletes the "stale" plan, removing ALL IP addresses for DynamoDB's regional endpoint.

Why Empty DNS Records Are Catastrophic: When a DNS record becomes empty, clients receive negative (NXDOMAIN) responses. Resolvers cache those negative answers for the negative-caching TTL, meaning even after AWS fixed the DNS record, clients that had cached the NXDOMAIN kept failing until their cache expired.

This wasn't a Route 53 bug. Route 53 faithfully executed what it was told to do. The bug was in the system that told Route 53 what to do. Once the race condition fired, the system was left in an inconsistent state that prevented any further automated DNS updates.
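The interleaving can be reconstructed in a stripped-down sketch; all names here (`DnsState`, `enactor_apply`, `cleanup_stale`) are hypothetical stand-ins, since the real system's interfaces aren't public. The bug to notice is that neither apply nor cleanup checks whether a slower Enactor might still be working from an older plan:

```python
import threading
import time

class DnsState:
    def __init__(self):
        self.plans = {}          # plan_id -> list of IPs
        self.live_record = []    # what Route 53 currently serves
        self.lock = threading.Lock()

    def add_plan(self, plan_id, ips):
        with self.lock:
            self.plans[plan_id] = ips

    def enactor_apply(self, plan_id, delay=0.0):
        time.sleep(delay)                     # Enactor #1's unusual slowness
        with self.lock:
            # BUG: no check that plan_id still exists or is still current.
            self.live_record = self.plans.get(plan_id, [])

    def cleanup_stale(self, newest_plan_id):
        with self.lock:
            # Deletes every plan except the newest, including one that a
            # delayed Enactor may still be about to apply.
            self.plans = {newest_plan_id: self.plans[newest_plan_id]}

state = DnsState()
state.add_plan(1, ["198.51.100.1"])       # older plan
state.add_plan(2, ["198.51.100.2"])       # newer plan

slow = threading.Thread(target=state.enactor_apply, args=(1,), kwargs={"delay": 0.2})
slow.start()                              # Enactor #1, delayed, still holds plan 1
state.enactor_apply(2)                    # Enactor #2 applies the newest plan
state.cleanup_stale(2)                    # ...and deletes plan 1 as "stale"
slow.join()                               # Enactor #1 finishes: plan 1 is gone
print(state.live_record)                  # prints [] -- no IPs left for the endpoint
```

In the sketch the slow Enactor's final write lands on a plan that cleanup has already deleted, leaving the live record empty, which is exactly the end state the incident produced.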

The Cascade: How One DNS Failure Became a Multi-Hour Outage

What made this incident particularly devastating wasn't just the initial DNS failure—it was how that failure cascaded through AWS's interconnected systems:

DynamoDB DNS failure: all IP addresses removed from the DNS record
→ EC2 DropletWorkflow Manager (DWFM): manages leases for physical servers and depends on DynamoDB
→ DWFM recovery causes congestive collapse: the entire EC2 fleet tries to renew leases simultaneously
→ Network Manager backlog: new EC2 instances launch without network configuration
→ NLB health check flapping: instances fail health checks, get removed, then restored
→ Lambda, ECS, EKS, and Fargate failures: all depend on EC2 instance launches
DNS (11:48 PM - 2:25 AM): DynamoDB DNS resolution fails. All internal and external traffic to DynamoDB fails DNS resolution.
EC2 (2:25 AM - 5:28 AM): EC2 congestive collapse. DWFM recovery causes a thundering herd problem across the entire EC2 fleet.
NET (5:28 AM - 8:00 AM): Network configuration delays. The backlog of network configurations causes ongoing issues.

What Customers Experienced vs. What Was Actually Happening

The most challenging aspect of this incident was that customers' own monitoring showed their infrastructure as the problem:

What Customers Saw → Actual Root Cause

"My DynamoDB calls are timing out" → DNS NXDOMAIN responses; no IPs to connect to
"My EC2 instances won't launch" → DWFM congestive collapse preventing lease acquisition
"My Lambda functions are failing" → underlying EC2 capacity unavailable for function execution
"My app is down but EC2 shows running" → network config delays and NLB health check flapping
"I fixed it but it's still broken" → DNS negative caching; NXDOMAIN cached until the TTL expires
Key Insight: The hardest part of customer triage was explaining that their infrastructure wasn't broken—the failure was completely internal to AWS and invisible to their monitoring systems.

DNS Deep Dive: The Technical Lessons

Why DNS Is the Most Dangerous Single Point of Failure

DNS failures are uniquely catastrophic in distributed systems because:

- Everything depends on resolution first: retries, failovers, and circuit breakers can't even run until the client gets an IP address back.
- Negative answers are cached: fixing the record upstream doesn't fix clients until their cached NXDOMAIN expires.
- The failure is invisible to the caller: customer monitoring shows their own requests failing, not the empty record inside the provider.

The Thundering Herd Problem at AWS Scale

When DynamoDB recovered, DWFM tried to re-establish leases for the entire EC2 fleet simultaneously. This is a textbook example of congestive collapse:

Recovery Load > System Capacity
→ Requests timeout faster than they can be processed
→ Clients retry, increasing load further
→ System becomes less available during recovery than during outage
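One standard client-side mitigation is capped exponential backoff with full jitter, so retries spread out over time instead of synchronizing into a herd. A minimal sketch (the `base` and `cap` values are illustrative):

```python
import random

def backoff_delays(attempts, base=0.1, cap=20.0, rng=random.random):
    """Capped exponential backoff with full jitter: each retry waits a
    uniform random time in [0, min(cap, base * 2**attempt)] seconds."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

delays = backoff_delays(8)
assert len(delays) == 8
assert all(0 <= d <= 20.0 for d in delays)
# Without jitter, every client retries at the same instants and the load
# spikes recur in lockstep; full jitter decorrelates retries across the fleet.
```

The cap matters as much as the jitter: it bounds how long a single client waits, while the randomization is what keeps a whole fleet from hammering a recovering service in synchronized waves.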

The "Cleanup as a Weapon" Anti-Pattern

The cleanup process that deleted "stale" DNS plans was doing the right thing locally but the wrong thing globally. In distributed systems, cleanup operations that can't be rolled back are dangerous.
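One defensive rewrite of such a cleanup, assuming plans carry monotonically increasing IDs and every Enactor publishes the oldest plan it might still be applying (all names here are hypothetical): cleanup may only reclaim plans that every Enactor has provably moved past.

```python
def safe_cleanup(plans, newest_id, in_flight_ids):
    """Reclaim a plan only if no Enactor can still be working on it.

    plans:          dict plan_id -> plan payload
    newest_id:      the current plan, always kept
    in_flight_ids:  for each Enactor, the oldest plan id it may still be applying
    """
    fence = min(in_flight_ids, default=newest_id)
    # Keep the newest plan plus anything at or beyond the fence.
    return {pid: p for pid, p in plans.items()
            if pid == newest_id or pid >= fence}

plans = {1: ["198.51.100.1"], 2: ["198.51.100.2"]}
# Enactor #1 is delayed and may still be applying plan 1: nothing is deleted.
assert safe_cleanup(plans, newest_id=2, in_flight_ids=[1, 2]) == plans
# Once every Enactor has moved past plan 1, it can safely be reclaimed.
assert safe_cleanup(plans, newest_id=2, in_flight_ids=[2, 2]) == {2: ["198.51.100.2"]}
```

The fence turns "delete everything older than the newest plan" (a local judgment) into "delete everything older than what anyone could still be using" (a global one), which is the distinction the incident exposed.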

Distributed Systems Lesson: Independence at the component level doesn't guarantee independence at the interaction level. The race condition was in the interaction between DNS Planner and DNS Enactor, not in either component individually.

What Good Disaster Recovery Looked Like

During this incident, some customers were completely unaffected. What worked was true multi-region architecture: traffic that could fail over out of us-east-1, data already replicated to a second region, and no hard dependencies on us-east-1-only endpoints.

The Multi-Region Lesson: The value of multi-region architecture isn't just about AZ failures—it's about any single-region dependency, including internal AWS dependencies you can't see.
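On the client side, the pattern boils down to trying the primary region and failing over to a replica on error. A minimal generic sketch, assuming reads can be served from a second region (as with DynamoDB Global Tables); the region names and stub clients below are illustrative, not real AWS SDK calls:

```python
class RegionalFailover:
    """Try each region's client in order; fail over on the listed exceptions."""

    def __init__(self, clients, failover_on=(ConnectionError, TimeoutError)):
        self.clients = clients            # list of (region_name, callable)
        self.failover_on = failover_on

    def call(self, *args, **kwargs):
        last_err = None
        for region, client in self.clients:
            try:
                return region, client(*args, **kwargs)
            except self.failover_on as e:
                last_err = e              # this region unreachable; try the next
        raise last_err

def us_east_1(key):
    raise TimeoutError("DNS resolution failed")   # simulating the outage

def us_west_2(key):
    return {"pk": key, "value": 42}               # healthy replica region

table = RegionalFailover([("us-east-1", us_east_1), ("us-west-2", us_west_2)])
region, item = table.call("order#123")
assert region == "us-west-2" and item["value"] == 42
```

The hard part in practice isn't this wrapper; it's making sure the second region already has the data and capacity before the day you need it.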

AWS's Response and What Changed

AWS's immediate response was decisive but drastic: the DynamoDB DNS Planner and DNS Enactor automation was disabled worldwide, not to be re-enabled until the race condition was fixed and additional protections against applying incorrect DNS plans were in place.

In November 2025, AWS introduced the DNS failover feature.

The Broader Lesson: The fix for a race condition in automation isn't just fixing the race condition—it's adding invariants that prevent the system from ever reaching an invalid state.

Personal Takeaways

Working customer-facing triage during this incident taught me several critical lessons:

The Hardest Incidents Are Invisible Ones

When customers can see the failure (server down, network partition), they understand what's happening. When the failure is invisible to them but their applications are broken, trust erodes quickly.

DNS Failures Are Uniquely Bad

Unlike most failures, DNS failures are cached. Fixing the root cause doesn't immediately fix the symptom. This creates a secondary wave of customer confusion when their "fixes" don't work.

At AWS Scale, Recovery Is a Distributed Systems Problem

The DWFM congestive collapse showed that at sufficient scale, even recovery becomes a distributed systems challenge requiring careful orchestration.

The Value of True Multi-Region Architecture

This incident reinforced that multi-region isn't just about protecting against natural disasters or AZ failures—it's about protecting against any single-region dependency, including ones inside the cloud provider itself.

Final Thought: This incident affected 70,000+ organizations and caused hundreds of millions in losses, but it also demonstrated the incredible interconnectedness of modern cloud infrastructure. A race condition in a DNS management system became a global business continuity event. That's both the power and the fragility of the cloud.