A Technical Retrospective on Infrastructure Resilience During Crisis
On March 1st, 2026, I was halfway through my morning coffee when the first alerts started flooding in. What began as routine monitoring quickly escalated into one of the most challenging incidents I've faced as a Cloud Engineer at AWS.
Iranian retaliatory drone and missile strikes hit AWS me-central-1 (UAE/Dubai) data centers on March 1-2, 2026. Debris struck the facility, causing sparks, fire, and a complete loss of power, backup generators included. This wasn't a software failure; it was a physical infrastructure disaster.
First impact detected: mec1-az2 and mec1-az3 went offline immediately. The fire department responded and cut all power, backup generators included, for safety.
mec1-az1 was also affected by facility damage, and me-south-1 (Bahrain) experienced degraded connectivity from the regional network disruption.
AWS's public statement: recovery would take at least a day and required repairing the facility, cooling, and power systems, in coordination with local authorities.
Facility repairs began. Data extraction from the impaired AZs was not possible; recovery had to come from pre-existing cross-region backups.
What made this uniquely challenging wasn't just the scale—it was the nature of the failure. When both the control plane AND data plane go down simultaneously, your recovery options become severely constrained. The fundamental lesson I learned: you can only recover from what existed BEFORE the event.
Customers running multi-AZ workloads were not impacted. Single-AZ customers faced total loss unless they had cross-region backups. Multi-AZ architecture isn't optional—it's the difference between "not impacted" and "total loss."
The hardest conversations were with customers who had never considered cross-region disaster recovery. I spent the first 48 hours walking through the same decision tree with hundreds of customers.
The control plane versus data plane distinction became critical during the incident. The control plane (AWS APIs, the console) was completely unavailable for the affected AZs. The data plane (running EC2 instances, active RDS connections) was also down due to the physical infrastructure failure. This meant no API access, no snapshot creation, no emergency backups.
Every customer conversation followed this pattern:
The customers with existing cross-region backups could migrate immediately. Those without had to wait—and hope their data survived the physical damage.
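That triage logic can be sketched as a small decision function. This is a hypothetical reconstruction, not an official AWS runbook; the names and return strings are mine:

```python
from dataclasses import dataclass


@dataclass
class CustomerPosture:
    multi_az: bool              # workload already spans unaffected AZs
    cross_region_backups: bool  # backups exist in another region from before the event


def recovery_path(p: CustomerPosture) -> str:
    """Hypothetical triage decision tree from the incident.

    Key constraint: the affected AZs had no control plane or data plane,
    so nothing new could be read, snapshotted, or exported from them.
    """
    if p.multi_az:
        return "not impacted: traffic already shifted to healthy AZs"
    if p.cross_region_backups:
        return "restore in another region from pre-existing backups"
    return "wait for facility recovery and hope the data survived"
```

The branches mirror the outcomes above: multi-AZ customers kept running, customers with cross-region backups migrated, and everyone else waited.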
Customers pointing to me-central-1.console.aws.amazon.com could switch to alternate regional console endpoints. This simple change restored management access for resources in unaffected regions.
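The console is addressed per region, so the swap is just a different subdomain. A trivial sketch of the URL pattern (the exact path component here is illustrative):

```python
def console_url(region: str) -> str:
    # Regional AWS console endpoint; swapping the region subdomain reaches
    # a healthy region's console, e.g. eu-west-1 instead of me-central-1.
    return f"https://{region}.console.aws.amazon.com/console/home?region={region}"
```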
One of the biggest surprises was how service quotas became a migration blocker. Customers trying to restore hundreds of EC2 instances hit default limits in target regions. I found myself submitting quota increase requests at 2 AM, explaining the emergency situation to the service teams.
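Before a bulk restore, it is worth checking whether the target region's quotas can even absorb the fleet. A minimal pre-check sketch, assuming you have already fetched the applied quota from the Service Quotas API; the helper is hypothetical, and note that EC2 On-Demand quotas are counted in vCPUs, not instances:

```python
def quota_shortfall(required_vcpus: int,
                    applied_quota_vcpus: float,
                    in_use_vcpus: int = 0) -> int:
    """Return how many additional vCPUs of quota must be requested in the
    target region before a bulk restore can succeed (0 if none needed)."""
    headroom = applied_quota_vcpus - in_use_vcpus
    return max(0, required_vcpus - int(headroom))
```

Running this before the migration, rather than discovering the limit at 2 AM when instance launches start failing, is the whole point.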
Working with customers during the incident, we developed several migration patterns based on what backup resources were available:
For customers with AWS MGN (Application Migration Service) agents already installed, we could perform live migrations—but only if the data plane was still accessible, which wasn't the case for most affected instances.
Container workloads presented unique challenges. ECS tasks that were running survived initially, but the control plane was unavailable for management operations.
The most subtle issue we encountered: EKS IRSA (IAM Roles for Service Accounts) trust policies. When recreating clusters in new regions, customers forgot to update OIDC provider associations, leading to silent AccessDenied failures that took hours to debug.
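The fix was updating each role's trust policy to reference the new cluster's OIDC provider. The general shape is below; the account ID, OIDC issuer ID, region, namespace, and service account name are all placeholders, and an additional `aud` condition on `sts.amazonaws.com` is common:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:default:my-app"
        }
      }
    }
  ]
}
```

If the `Federated` ARN or the `sub` condition still points at the old region's OIDC issuer, `AssumeRoleWithWebIdentity` fails with AccessDenied and nothing in the pod logs says why.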
When AWS APIs were unavailable, we fell back to traditional data transfer methods.
The most technically interesting challenge was KMS key management during cross-region migration. This is where many customers got stuck, and it revealed some subtle aspects of AWS encryption.
KMS keys cannot be migrated between regions. Even multi-region keys don't solve this for integrated services like RDS, EBS, and S3 SSE-KMS—they treat multi-region keys as single-region keys.
Every encrypted resource required the same pattern: decrypt in source region → re-encrypt with destination region key.
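A minimal sketch of that pattern, with KMS clients injected per region so the flow is visible; real code needs error handling, and KMS `Encrypt` only accepts plaintext up to 4 KB, so larger payloads go through data keys instead:

```python
def re_encrypt_cross_region(src_kms, dst_kms,
                            ciphertext: bytes, dest_key_id: str) -> bytes:
    """Decrypt with the source region's key, then re-encrypt under the
    destination region's key.

    src_kms / dst_kms are KMS clients scoped to each region, e.g.
    boto3.client("kms", region_name="me-central-1") and
    boto3.client("kms", region_name="eu-west-1").
    """
    # KMS ciphertext embeds the key's metadata, so only the source
    # region's client can decrypt it.
    plaintext = src_kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
    # Re-encrypt under a key that actually lives in the destination region.
    return dst_kms.encrypt(KeyId=dest_key_id, Plaintext=plaintext)["CiphertextBlob"]
```

The decrypt step is exactly what was impossible for affected resources while the source region's control plane was down, which is why the pattern only works before an event or after recovery.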
For S3 objects encrypted with SSE-KMS, S3 Batch Operations handled the decrypt/re-encrypt automatically during CopyObject operations. This was a lifesaver for customers with millions of encrypted objects.
Symmetric imported key material is NOT interoperable between regions, even with identical key material. However, asymmetric and HMAC imported keys ARE interoperable for client-side encryption—just not for integrated AWS services.
Beyond the immediate technical challenges, this incident exposed several architectural assumptions that many customers (and frankly, some AWS engineers) hadn't fully considered:
The shared responsibility model became viscerally clear: AWS is responsible for the facility, customers are responsible for their DR architecture. No amount of AWS engineering can protect against customer architectural choices during a physical disaster.
This wasn't academic anymore. During the incident, understanding this distinction determined what recovery actions were possible:
You cannot create snapshots during an AZ failure. Backup strategy must include cross-region copies BEFORE an event. This seems obvious in retrospect, but many customers learned this lesson the hard way.
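The proactive version is cheap to automate with the EC2 `CopySnapshot` API, which can re-encrypt under a destination-region key during the copy. A sketch with the destination-region client injected; the snapshot ID, regions, and key ID are placeholders:

```python
def copy_snapshot_cross_region(dest_ec2, snapshot_id: str,
                               source_region: str, dest_kms_key_id: str) -> str:
    """Copy an EBS snapshot into the destination region, re-encrypting it
    with a key that lives there.

    dest_ec2 is an EC2 client created in the DESTINATION region, e.g.
    boto3.client("ec2", region_name="eu-west-1"); CopySnapshot is always
    called from the region receiving the copy.
    """
    resp = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Encrypted=True,
        KmsKeyId=dest_kms_key_id,  # must be a key in the destination region
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return resp["SnapshotId"]
```

Scheduled on a daily cadence (or handled by AWS Backup copy rules), this is the pre-existing cross-region copy that determined who could restore at all.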
Based on what I learned working with customers through this incident, here are the architectural patterns that made the difference between "not impacted" and "total loss":
Multi-AZ deployment is not optional; it's the baseline for any production workload. This incident made that painfully clear to every customer I worked with.
The customers who recovered fastest had practiced their DR procedures. Not just documentation—actual drills. They knew their RTO/RPO numbers because they had measured them, not estimated them.
Three months later, I still think about this incident regularly. It changed how I approach architecture reviews and customer conversations about resilience.
The hardest conversations weren't about technology—they were about explaining to customers that their data might be gone forever. No amount of technical skill can recover from architectural decisions made months earlier.
What struck me most was how the incident revealed the gap between theoretical DR planning and practical implementation. Customers had disaster recovery plans, but they hadn't accounted for a simultaneous control-plane and data-plane outage, physical destruction of the underlying facility, or the impossibility of creating new backups once the event was underway.
The customers who recovered successfully shared common traits: they had practiced their DR procedures, maintained cross-region backups, and understood the shared responsibility model. Most importantly, they treated multi-AZ architecture as a requirement, not an option.
Some data was lost forever. Not because of AWS engineering failures, but because of architectural choices. The incident was a harsh reminder that in the cloud, as in life, you can't recover from what you never backed up.
As I write this retrospective, I'm reminded that infrastructure resilience isn't just about technology—it's about understanding the constraints of the systems we build on and designing within those constraints. The me-central-1 incident taught me that the difference between a minor inconvenience and a business-ending disaster often comes down to decisions made long before the crisis hits.
The customers who emerged stronger from this incident didn't just restore their systems—they fundamentally rethought their approach to resilience. That's the real lesson: disasters don't just test your backups, they test your assumptions.