AWS Cloud Engineer (CSE) — March 2026

The me-central-1 Incident

A Technical Retrospective on Infrastructure Resilience During Crisis

Apurva Desai · March 2026 · 15 min read

Incident Overview

On March 1st, 2026, I was halfway through my morning coffee when the first alerts started flooding in. What began as routine monitoring quickly escalated into one of the most challenging incidents I've faced as a Cloud Engineer at AWS.

Critical Infrastructure Failure

Iranian retaliatory drone and missile strikes hit AWS me-central-1 (UAE/Dubai) data centers on March 1-2, 2026. Physical objects struck the facility, causing sparks, fire, and a complete loss of power, including backup generators. This wasn't a software failure—it was a physical infrastructure disaster.

March 1, 2026 - 14:30 UTC

First impact detected. mec1-az2 and mec1-az3 went offline immediately. Fire department responded and cut all power including backup generators for safety.

March 1, 2026 - 15:45 UTC

mec1-az1 also affected due to facility damage. me-south-1 (Bahrain) experiencing degraded connectivity due to regional network disruption.

March 1, 2026 - 16:00 UTC

AWS issues a public statement: recovery expected to take at least a day, requiring repair of facilities, cooling, and power systems, plus coordination with local authorities.

March 2, 2026 - 08:00 UTC

Facility repairs begin. Data extraction from impaired AZs not possible—recovery must come from pre-existing cross-region backups.

What made this uniquely challenging wasn't just the scale—it was the nature of the failure. When both the control plane AND data plane go down simultaneously, your recovery options become severely constrained. The fundamental lesson I learned: you can only recover from what existed BEFORE the event.

Key Insight

Customers running multi-AZ workloads were not impacted. Single-AZ customers faced total loss unless they had cross-region backups. Multi-AZ architecture isn't optional—it's the difference between "not impacted" and "total loss."

Technical Triage

The hardest conversations were with customers who had never considered cross-region disaster recovery. I spent the first 48 hours walking through the same decision tree with hundreds of customers:

Control Plane vs Data Plane

This distinction became critical during the incident. The control plane (AWS APIs, console) was completely unavailable for the affected AZs. The data plane (running EC2 instances, active RDS connections) was also down due to physical infrastructure failure. This meant no API access, no snapshot creation, no emergency backups.

The Decision Tree

Every customer conversation followed this pattern:

Customer Assessment Framework
Do you have cross-region AMIs/snapshots?
├── YES → Migrate now to alternate region
│   ├── Check service quotas in target region
│   ├── Request limit increases if needed
│   └── Begin restoration process
└── NO → Wait for facility recovery
    ├── Prepare for potential data loss
    ├── Document lessons learned
    └── Plan multi-AZ architecture for future

The customers with existing cross-region backups could migrate immediately. Those without had to wait—and hope their data survived the physical damage.

Console Access Workaround

Customers pointing to me-central-1.console.aws.amazon.com could switch to alternate regional console endpoints. This simple change restored management access for resources in unaffected regions.
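
For illustration, the change is nothing more than the console hostname; the service path and regions below are placeholders:

Regional Console Endpoints
# Affected region's console endpoint
https://me-central-1.console.aws.amazon.com/ec2/home?region=me-central-1

# Alternate regional console endpoint for resources in an unaffected region
https://eu-west-1.console.aws.amazon.com/ec2/home?region=eu-west-1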

Service Quota Reality Check

One of the biggest surprises was how service quotas became a migration blocker. Customers trying to restore hundreds of EC2 instances hit default limits in target regions. I found myself submitting quota increase requests at 2 AM, explaining the emergency situation to the service teams.
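
For reference, those requests looked roughly like this from the CLI. The quota code below is the one for running On-Demand Standard instances as I remember it; verify it with list-service-quotas before depending on it:

Emergency Quota Check and Increase
# Check the current On-Demand instance quota in the target region
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --region eu-west-1

# Request an increase large enough to cover the restored fleet
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 500 \
  --region eu-west-1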

Migration Patterns

Working with customers during the incident, we developed several migration patterns based on what backup resources were available:

EC2 Migration Strategies

Cross-Region AMI Copy
# Copy AMI to target region
aws ec2 copy-image \
  --source-region me-central-1 \
  --source-image-id ami-12345678 \
  --name "emergency-restore-$(date +%Y%m%d)" \
  --region eu-west-1

# Copy EBS snapshots
aws ec2 copy-snapshot \
  --source-region me-central-1 \
  --source-snapshot-id snap-12345678 \
  --region eu-west-1

For customers with AWS MGN (Application Migration Service) agents already installed, we could perform live migrations—but only if the data plane was still accessible, which wasn't the case for most affected instances.

RDS Recovery Approaches

RDS Recovery Options
  • Cross-region read replicas: Promote to primary (fastest recovery)
  • Automated snapshots: Copy to target region and restore
  • Native dumps: When API unavailable but DB connectivity intact
RDS Cross-Region Restore
# Copy DB snapshot to target region (run from the target region;
# cross-region copies need the source snapshot's full ARN)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:me-central-1:123456789012:snapshot:mydb-snapshot-20260301 \
  --target-db-snapshot-identifier mydb-emergency-restore \
  --source-region me-central-1 \
  --region eu-west-1

# Restore from copied snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-restored \
  --db-snapshot-identifier mydb-emergency-restore \
  --region eu-west-1
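
For customers who already had a cross-region read replica (the first option above), recovery was a single promotion call rather than a snapshot copy. A minimal sketch with a placeholder identifier:

Read Replica Promotion
# Promote the replica in the surviving region to a standalone primary
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica-euwest1 \
  --region eu-west-1

# Wait for the promoted instance to become available before repointing applications
aws rds wait db-instance-available \
  --db-instance-identifier mydb-replica-euwest1 \
  --region eu-west-1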

Container Workloads (ECS/EKS)

Container workloads presented unique challenges. ECS tasks that were running survived initially, but the control plane was unavailable for management operations.
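
Where task definitions and container images were available outside the region (for example via ECR replication or an external registry), standing a workload back up elsewhere was mostly re-registration. A simplified sketch; the cluster, subnet, and security group identifiers are placeholders:

ECS Redeploy in Target Region
# Register the saved task definition JSON in the target region
aws ecs register-task-definition \
  --cli-input-json file://taskdef.json \
  --region eu-west-1

# Recreate the service on a cluster in the target region (Fargate example)
aws ecs create-service \
  --cluster dr-cluster \
  --service-name web-app \
  --task-definition web-app \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc1234],securityGroups=[sg-0abc1234],assignPublicIp=ENABLED}" \
  --region eu-west-1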

EKS IRSA Critical Issue

The most subtle issue we encountered: EKS IRSA (IAM Roles for Service Accounts) trust policies. When recreating clusters in new regions, customers forgot to update OIDC provider associations, leading to silent AccessDenied failures that took hours to debug.
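
The fix itself is mechanical once you know where to look: the IRSA role's trust policy has to reference the new cluster's OIDC issuer, and the matching IAM OIDC provider has to exist in the account. A rough sketch with placeholder cluster and role names:

IRSA Trust Policy Repair
# Find the OIDC issuer URL of the recreated cluster in the target region
aws eks describe-cluster \
  --name dr-cluster \
  --region eu-west-1 \
  --query "cluster.identity.oidc.issuer" \
  --output text

# Point the IRSA role's trust policy at the new OIDC provider
# (trust-policy.json must reference the issuer returned above, minus the https:// prefix)
aws iam update-assume-role-policy \
  --role-name my-app-irsa-role \
  --policy-document file://trust-policy.json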

Data Plane Fallbacks

When AWS APIs were unavailable, we fell back to traditional data transfer methods:

Direct Data Transfer (Linux)
# rsync over SSH for file systems
rsync -avz -e "ssh -i key.pem" \
  /data/ ec2-user@target-instance:/data/

# MySQL dump piped to S3 (when RDS API down but connectivity intact)
mysqldump --all-databases | \
  aws s3 cp - s3://emergency-backup/mysql-dump-$(date +%Y%m%d).sql

KMS Deep Dive

The most technically interesting challenge was KMS key management during cross-region migration. This is where many customers got stuck, and it revealed some subtle aspects of AWS encryption.

KMS Regional Constraint

KMS keys cannot be migrated between regions. Even multi-region keys don't solve this for integrated services like RDS, EBS, and S3 SSE-KMS—they treat multi-region keys as single-region keys.

The Re-encryption Pattern

Every encrypted resource required the same pattern: decrypt in source region → re-encrypt with destination region key.

EBS Snapshot Re-encryption
# Copy encrypted snapshot with new KMS key (run from the target region)
aws ec2 copy-snapshot \
  --source-region me-central-1 \
  --source-snapshot-id snap-encrypted-source \
  --kms-key-id arn:aws:kms:eu-west-1:123456789012:key/target-key-id \
  --encrypted \
  --region eu-west-1
RDS Encrypted Snapshot Migration
# Copy encrypted DB snapshot with target region KMS key (run from the target region)
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:me-central-1:123456789012:snapshot:encrypted-db-snap \
  --target-db-snapshot-identifier migrated-db-snap \
  --source-region me-central-1 \
  --kms-key-id arn:aws:kms:eu-west-1:123456789012:key/target-key-id \
  --region eu-west-1

S3 Batch Operations for Re-encryption

For S3 objects encrypted with SSE-KMS, S3 Batch Operations handled the decrypt/re-encrypt automatically during CopyObject operations. This was a lifesaver for customers with millions of encrypted objects.
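
At the level of a single object, the decrypt/re-encrypt that the batch job performs for every manifest entry is equivalent to a copy that names the target-region key; the bucket names and key ARN here are placeholders:

Per-Object SSE-KMS Re-encryption Copy
# Server-side copy that re-encrypts the object under a key in the target region
aws s3 cp \
  s3://source-bucket/data/object.parquet \
  s3://dr-bucket-eu-west-1/data/object.parquet \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:eu-west-1:123456789012:key/target-key-id \
  --region eu-west-1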

Imported Key Material Gotcha

Symmetric imported key material is NOT interoperable between regions, even with identical key material. However, asymmetric and HMAC imported keys ARE interoperable for client-side encryption—just not for integrated AWS services.

What This Incident Revealed

Beyond the immediate technical challenges, this incident exposed several architectural assumptions that many customers (and frankly, some AWS engineers) hadn't fully considered:

Shared Responsibility in Physical Disasters

The shared responsibility model became viscerally clear: AWS is responsible for the facility, customers are responsible for their DR architecture. No amount of AWS engineering can protect against customer architectural choices during a physical disaster.

Single Points of Failure Exposed

The Control Plane vs Data Plane Distinction

This wasn't academic anymore. During the incident, understanding this distinction determined what recovery actions were possible:

Impact Assessment Framework
Control Plane Down + Data Plane Down = Total Loss
├── No API access for snapshots
├── No emergency backup creation
├── No graceful shutdown procedures
└── Recovery only from pre-existing cross-region resources

Control Plane Down + Data Plane Up = Limited Recovery
├── Direct data extraction possible (rsync, mysqldump)
├── Application-level backups feasible
├── Graceful data migration to alternate regions
└── Time-sensitive but manageable

The Backup Timing Constraint

You cannot create snapshots during an AZ failure. Backup strategy must include cross-region copies BEFORE an event. This seems obvious in retrospect, but many customers learned this lesson the hard way.

Architecture Recommendations

Based on what I learned working with customers through this incident, here are the architectural patterns that made the difference between "not impacted" and "total loss":

Resilience Baseline

Multi-AZ deployment is not optional—it's the baseline for any production workload. This incident made that viscerally clear to every customer I worked with.
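
For RDS specifically, converting an existing single-AZ instance is a one-line change; it builds a standby and applies immediately, so plan around the brief I/O pause:

Enable RDS Multi-AZ
# Convert an existing single-AZ RDS instance to Multi-AZ
aws rds modify-db-instance \
  --db-instance-identifier mydb \
  --multi-az \
  --apply-immediately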

Proactive Cross-Region Strategy

AWS Backup Cross-Region Copy Rule
{ "BackupPlan": { "BackupPlanName": "CrossRegionDR", "Rules": [ { "RuleName": "DailyBackupWithCrossRegionCopy", "TargetBackupVault": "default", "ScheduleExpression": "cron(0 2 ? * * *)", "CopyActions": [ { "DestinationBackupVaultArn": "arn:aws:backup:eu-west-1:123456789012:backup-vault:dr-vault", "Lifecycle": { "DeleteAfterDays": 30 } } ] } ] } }

Service-Specific Patterns
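
Each service has a proactive equivalent of the recovery steps above. The pattern I now raise in nearly every RDS architecture review is a standing cross-region read replica, created long before it is needed; the identifiers, account, and regions below are placeholders:

Proactive Cross-Region Read Replica
# Create the replica ahead of time, running the command from the DR region
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-euwest1 \
  --source-db-instance-identifier arn:aws:rds:me-central-1:123456789012:db:mydb \
  --kms-key-id arn:aws:kms:eu-west-1:123456789012:key/dr-key-id \
  --region eu-west-1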

Operational Readiness

DR Drills Are Not Optional

The customers who recovered fastest had practiced their DR procedures. Not just documentation—actual drills. They knew their RTO/RPO numbers because they had measured them, not estimated them.
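
A drill doesn't have to be elaborate. Even timing a scripted restore from the most recent cross-region copy turns a guessed RTO into a measured one. A minimal sketch, assuming a copied snapshot already exists in the DR region:

Timing a Restore Drill
# Measure how long a real restore takes in the DR region
# (the default waiter polls for about 30 minutes; long restores may need a longer custom poll)
time (
  aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier drill-restore-20260601 \
    --db-snapshot-identifier mydb-emergency-restore \
    --region eu-west-1 && \
  aws rds wait db-instance-available \
    --db-instance-identifier drill-restore-20260601 \
    --region eu-west-1
)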

Identity and Access Resilience

Multi-Region SAML Configuration
# Configure SAML IdP with multiple regional endpoints
Primary: https://signin.aws.amazon.com/saml
Backup:  https://eu-west-1.signin.aws.amazon.com/saml
Backup:  https://us-west-2.signin.aws.amazon.com/saml

# IAM Identity Center multi-region replication (Feb 2026+)
aws sso-admin create-instance-access-control-attribute-configuration \
  --instance-arn arn:aws:sso:::instance/ssoins-primary \
  --access-control-attributes file://multi-region-config.json

Lessons Learned

Three months later, I still think about this incident regularly. It changed how I approach architecture reviews and customer conversations about resilience.

Personal Takeaways

The hardest conversations weren't about technology—they were about explaining to customers that their data might be gone forever. No amount of technical skill can recover from architectural decisions made months earlier.

What struck me most was how the incident revealed the gap between theoretical DR planning and practical implementation. Customers had disaster recovery plans, but they hadn't accounted for:

  • The control plane and data plane failing at the same time, which removed any option to create emergency snapshots
  • Service quotas in the target region capping how quickly a fleet could be restored
  • KMS keys being bound to their region, forcing re-encryption during migration
  • Console and identity endpoints that quietly assumed the home region was reachable

The customers who recovered successfully shared common traits: they had practiced their DR procedures, maintained cross-region backups, and understood the shared responsibility model. Most importantly, they treated multi-AZ architecture as a requirement, not an option.

The Uncomfortable Truth

Some data was lost forever. Not because of AWS engineering failures, but because of architectural choices. The incident was a harsh reminder that in the cloud, as in life, you can't recover from what you never backed up.

As I write this retrospective, I'm reminded that infrastructure resilience isn't just about technology—it's about understanding the constraints of the systems we build on and designing within those constraints. The me-central-1 incident taught me that the difference between a minor inconvenience and a business-ending disaster often comes down to decisions made long before the crisis hits.

The customers who emerged stronger from this incident didn't just restore their systems—they fundamentally rethought their approach to resilience. That's the real lesson: disasters don't just test your backups, they test your assumptions.