AWS Cloud Engineer — October 2025
When DNS Ate the Cloud
A first-person account of the October 19-20, 2025 AWS us-east-1 outage, in which a race condition in DynamoDB's DNS management system took down more than 70,000 organizations.
The Moment It Started
It was 11:48 PM PDT on October 19th when the first customer calls started coming in. I was on the AWS support team that night, and what began as isolated DynamoDB connection errors quickly escalated into something we'd never seen before.
"My DynamoDB calls are timing out," the first customer reported. Then another: "EC2 instances won't launch." Within minutes: "Lambda functions failing," "ECS tasks stuck," "My entire application is down."
The pattern that emerged was both subtle and devastating: everything in us-east-1 that touched DynamoDB was failing, but each service was failing in its own unique way. Customers were seeing their own infrastructure as the problem when the actual failure was completely invisible to them.
The Invisible Failure: The hardest incidents to triage are the ones where the root cause is invisible to customers. Their monitoring showed their infrastructure failing, but the actual problem was deep inside AWS's internal DNS management system.
The Root Cause: A Race Condition in DNS Management
The failure originated in DynamoDB's internal automated DNS management system, which was designed with two independent components for high availability:
DNS Planner
Monitors load balancer health and creates DNS update plans
DNS Enactor
Applies DNS changes via Route 53 API calls
The race condition unfolded like this:
1. 11:48 PM PDT — DNS Enactor #1 experiences unusually high delays. It is processing a routine DNS update plan, but running much slower than normal.
2. 11:49 PM PDT — The DNS Planner generates new plans. Unaware of Enactor #1's delays, it continues creating fresh DNS update plans.
3. 11:50 PM PDT — DNS Enactor #2 starts processing the newer plans. It begins applying the latest DNS plans and runs its cleanup process.
4. 11:51 PM PDT — The race condition fires. Enactor #1 finishes its delayed run just as Enactor #2's cleanup deletes the "stale" plan, removing ALL IP addresses for DynamoDB's regional endpoint.
Why Empty DNS Records Are Catastrophic: When a DNS record becomes empty, clients receive NXDOMAIN responses. These negative responses are cached for the negative-caching TTL (taken from the zone's SOA record), meaning even after AWS fixed the DNS record, clients that had cached the NXDOMAIN kept failing until their cache expired.
This wasn't a Route 53 bug. Route 53 faithfully executed what it was told to do. The bug was in the system that told Route 53 what to do. Once the race condition fired, the system was left in an inconsistent state that prevented any further automated DNS updates.
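The interaction can be replayed as a short sequential simulation. This is a hypothetical reconstruction with illustrative names and data structures (the real planner/enactor internals are not public); it shows how two actors that are each locally correct can still empty the record together:

```python
# Hypothetical reconstruction of the planner/enactor race.
# All names, plan IDs, and IPs are illustrative, not AWS internals.

plans = {}                                   # plan_id -> list of endpoint IPs
dns_record = {"plan_id": None, "ips": []}    # the live regional DNS record

def planner_emit(pid, ips):
    """Planner: publish a new DNS update plan."""
    plans[pid] = ips

def enactor_apply(pid):
    """Enactor: apply a plan to the live record (no staleness check)."""
    dns_record["plan_id"] = pid
    dns_record["ips"] = list(plans[pid])

def enactor_cleanup(applied_pid):
    """Enactor: delete plans older than the one just applied, scrubbing
    any records tied to them -- locally correct, globally fatal."""
    for stale in [p for p in list(plans) if p < applied_pid]:
        del plans[stale]
        if dns_record["plan_id"] == stale:
            dns_record["ips"] = []           # live record emptied

planner_emit(1, ["10.0.0.1"])                # older plan
planner_emit(2, ["10.0.0.2"])                # newer plan
enactor_apply(2)                             # Enactor #2 applies the new plan
enactor_apply(1)                             # delayed Enactor #1 finishes,
                                             # re-applying the stale plan
enactor_cleanup(2)                           # Enactor #2's cleanup deletes
                                             # plan 1 -- and the record with it
print(dns_record["ips"])                     # -> []
```

Run step 5 or step 6 alone and nothing bad happens; it is the interleaving that wipes the record.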
The Cascade: How One DNS Failure Became a Multi-Hour Outage
What made this incident particularly devastating wasn't just the initial DNS failure—it was how that failure cascaded through AWS's interconnected systems:
DynamoDB DNS Failure
All IP addresses removed from DNS record
↓
EC2 DropletWorkflow Manager (DWFM)
Manages leases for physical servers, depends on DynamoDB
↓
DWFM Recovery Causes Congestive Collapse
Entire EC2 fleet tries to renew leases simultaneously
↓
Network Manager Backlog
New EC2 instances launch without network configuration
↓
NLB Health Check Flapping
Instances fail health checks, get removed, then restored
↓
Lambda, ECS, EKS, Fargate Failures
All depend on EC2 instance launches
- DNS phase (11:48 PM to 2:25 AM): DynamoDB DNS resolution fails. All internal and external traffic to DynamoDB fails DNS resolution.
- EC2 phase (2:25 AM to 5:28 AM): EC2 congestive collapse. DWFM recovery causes a thundering herd problem across the entire EC2 fleet.
- NET phase (5:28 AM to 8:00 AM): Network configuration delays. A backlog of network configurations causes ongoing issues.
What Customers Experienced vs. What Was Actually Happening
The most challenging aspect of this incident was that customers' own monitoring showed their infrastructure as the problem:
| What Customers Saw | Actual Root Cause |
| --- | --- |
| "My DynamoDB calls are timing out" | DNS NXDOMAIN responses - no IPs to connect to |
| "My EC2 instances won't launch" | DWFM congestive collapse preventing lease acquisition |
| "My Lambda functions are failing" | Underlying EC2 capacity unavailable for function execution |
| "My app is down but EC2 shows running" | Network config delays and NLB health check flapping |
| "I fixed it but it's still broken" | DNS negative caching - NXDOMAIN cached until TTL expires |
Key Insight: The hardest part of customer triage was explaining that their infrastructure wasn't broken—the failure was completely internal to AWS and invisible to their monitoring systems.
DNS Deep Dive: The Technical Lessons
Why DNS Is the Most Dangerous Single Point of Failure
DNS failures are uniquely catastrophic in distributed systems because:
- Negative Caching: NXDOMAIN responses get cached for the TTL duration
- Universal Dependency: Every network connection starts with DNS resolution
- Invisible Failure Mode: Applications see connection timeouts, not DNS failures
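Because applications usually surface DNS failure as a generic connection error, it helps to classify the failure mode explicitly in health checks. A minimal sketch using only the standard library (the hostname is a deliberately unresolvable example):

```python
import socket

def classify_failure(host, port=443, timeout=2.0):
    """Return 'dns' if the name will not resolve, 'connect' if it resolves
    but no address accepts a TCP connection, else 'ok'."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns"              # NXDOMAIN / resolution failure
    for family, type_, proto, _, addr in infos:
        try:
            with socket.create_connection(addr[:2], timeout=timeout):
                return "ok"
        except OSError:
            continue              # try the next resolved address
    return "connect"

# Names under the reserved .invalid TLD are guaranteed not to resolve:
print(classify_failure("dynamodb.example.invalid"))  # -> dns
```

A dashboard that separates "dns" from "connect" would have pointed at the real culprit within minutes instead of hours.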
The Thundering Herd Problem at AWS Scale
When DynamoDB recovered, DWFM tried to re-establish leases for the entire EC2 fleet simultaneously. This is a textbook example of congestive collapse:
Recovery Load > System Capacity
→ Requests timeout faster than they can be processed
→ Clients retry, increasing load further
→ System becomes less available during recovery than during outage
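The standard mitigation is capped exponential backoff with jitter, so a fleet that failed at the same moment does not also retry at the same moment. A minimal sketch (the base, cap, and attempt count are illustrative):

```python
import random

def backoff_delays(base=0.5, cap=60.0, attempts=6):
    """Capped exponential backoff with 'full jitter': each retry waits a
    random duration in [0, min(cap, base * 2**attempt)], which de-correlates
    a fleet of clients all recovering from the same failure."""
    return [random.uniform(0, min(cap, base * 2 ** a)) for a in range(attempts)]

for attempt, delay in enumerate(backoff_delays()):
    print(f"attempt {attempt}: sleep {delay:.2f}s before retrying")
    # time.sleep(delay); retry the lease renewal here
```

Without the jitter term, every client computes the same schedule and the herd simply thunders again on each retry interval.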
The "Cleanup as a Weapon" Anti-Pattern
The cleanup process that deleted "stale" DNS plans was doing the right thing locally but the wrong thing globally. In distributed systems, cleanup operations that can't be rolled back are dangerous.
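One defensive pattern is to make cleanup check a global invariant before acting: never delete state the live record still references, even if it looks stale locally. A hypothetical sketch reusing the plan-table shape from earlier (not the actual AWS implementation):

```python
class CleanupRefused(Exception):
    """Raised when deleting 'stale' state would break a global invariant."""

def safe_cleanup(plans, live_plan_id, applied_pid):
    """Delete plans older than applied_pid, but refuse to touch the plan
    the live DNS record still points at -- cleanup validates the global
    state, not just its local notion of staleness."""
    for pid in [p for p in list(plans) if p < applied_pid]:
        if pid == live_plan_id:
            raise CleanupRefused(f"plan {pid} is still live; aborting cleanup")
        del plans[pid]

plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}
# A delayed enactor has re-applied plan 1, so the live record references it:
try:
    safe_cleanup(plans, live_plan_id=1, applied_pid=2)
except CleanupRefused as e:
    print(e)        # cleanup halts and pages a human instead of wiping IPs
```

Halting and alerting is strictly better than an irreversible delete: a stuck cleanup is an operational annoyance, an empty DNS record is an outage.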
Distributed Systems Lesson: Independence at the component level doesn't guarantee independence at the interaction level. The race condition was in the interaction between DNS Planner and DNS Enactor, not in either component individually.
What Good Disaster Recovery Looked Like
During this incident, some customers were completely unaffected. Here's what worked:
- Multi-region active-active architectures: No single region dependency
- DynamoDB Global Tables: Could fail over reads to other regions
- Route 53 health checks + failover routing: Automatically routed to healthy regions
- ElastiCache in front of DynamoDB: Cache absorbed reads during outage
- Circuit breakers in application code: Failed fast instead of hanging
- Hardcoded IPs (ironically): customers following this bad practice kept working right through the DNS failure
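The circuit-breaker item is small enough to sketch in full: after N consecutive failures the breaker "opens" and fails fast for a cooldown period, instead of letting every request hang on a dead dependency. A minimal version with illustrative thresholds:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0               # any success closes the circuit
        return result

breaker = CircuitBreaker()
def flaky():
    raise ConnectionError("DynamoDB endpoint unreachable")

for _ in range(4):
    try:
        breaker.call(flaky)
    except (ConnectionError, RuntimeError) as e:
        print(type(e).__name__)         # 4th call fails fast: RuntimeError
```

During the outage, failing fast in milliseconds instead of hanging for a 30-second timeout kept thread pools and connection pools from exhausting themselves.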
The Multi-Region Lesson: The value of multi-region architecture isn't just about AZ failures—it's about any single-region dependency, including internal AWS dependencies you can't see.
AWS's Response and What Changed
AWS's immediate response was decisive but drastic:
- Disabled DynamoDB DNS Planner and DNS Enactor automation worldwide
- Switched to manual DNS management until safeguards could be implemented
- Conducted a comprehensive review of all automated DNS management systems
In November 2025, AWS introduced the DNS failover feature:
- Services can specify backup DNS records that activate automatically
- Invariant checking prevents DNS records from ever being set to empty
- Staggered recovery mechanisms prevent thundering herd problems
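The invariant-checking idea is simple to state in code: validate every plan against hard invariants before applying it, and refuse any change that would leave the record empty. A hypothetical sketch of the shape of such a guard (not AWS's actual implementation):

```python
def apply_dns_plan(record, new_ips, min_ips=1):
    """Apply a DNS update only if it preserves the hard invariant that the
    record always holds at least min_ips addresses; otherwise refuse and
    leave the last valid state in place."""
    if len(new_ips) < min_ips:
        raise ValueError("refusing update: plan would empty the DNS record")
    record[:] = new_ips
    return record

record = ["10.0.0.1", "10.0.0.2"]
apply_dns_plan(record, ["10.0.0.3"])    # a valid update proceeds
try:
    apply_dns_plan(record, [])          # the October failure mode
except ValueError as e:
    print(e)
print(record)                           # -> ['10.0.0.3'], last valid state kept
```

With this guard in place, the race condition could still fire, but its worst outcome would be a stale record rather than an empty one.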
The Broader Lesson: The fix for a race condition in automation isn't just fixing the race condition—it's adding invariants that prevent the system from ever reaching an invalid state.
Personal Takeaways
Working customer-facing triage during this incident taught me several critical lessons:
The Hardest Incidents Are Invisible Ones
When customers can see the failure (server down, network partition), they understand what's happening. When the failure is invisible to them but their applications are broken, trust erodes quickly.
DNS Failures Are Uniquely Bad
Unlike most failures, DNS failures are cached. Fixing the root cause doesn't immediately fix the symptom. This creates a secondary wave of customer confusion when their "fixes" don't work.
At AWS Scale, Recovery Is a Distributed Systems Problem
The DWFM congestive collapse showed that at sufficient scale, even recovery becomes a distributed systems challenge requiring careful orchestration.
The Value of True Multi-Region Architecture
This incident reinforced that multi-region isn't just about protecting against natural disasters or AZ failures—it's about protecting against any single-region dependency, including ones inside the cloud provider itself.
Final Thought: This incident affected 70,000+ organizations and caused hundreds of millions in losses, but it also demonstrated the incredible interconnectedness of modern cloud infrastructure. A race condition in a DNS management system became a global business continuity event. That's both the power and the fragility of the cloud.