The Secret Life of AWS: The Controlled Burn (Chaos Engineering)

How to prove your serverless resilience by intentionally breaking it using the AWS Fault Injection Simulator

#AWS #FaultInjection #ChaosEngineering #Reliability

Margaret is a senior software engineer. Timothy is her junior colleague. They work in a grand Victorian library in London — the kind of place where code quality is the unspoken objective, and craftsmanship is the only thing that matters.

Episode 72

Timothy was leaning back in his chair, casually scrolling through his CloudWatch dashboards and enjoying the flat, green lines. His continent-spanning architecture was fully deployed. His multi-region failover was configured. His Dead-Letter Queues were empty, and his FinOps budgets were safely under the limit.

"It is finished," Timothy declared as Margaret walked past his desk. "The architecture is completely bulletproof. We have accounted for every possible failure mode."

"Have we?" Margaret asked, pausing to look at his screen. "How do you know the Route 53 DNS failover we built in Episode 70 will actually redirect traffic if us-east-1 goes down?"

"Because I wrote the CDK code," Timothy replied confidently. "The logic is sound. The health checks are configured. If the primary region fails, the traffic will route to Oregon. It is mathematically certain."

Margaret pulled up a chair. "Architecture is not a mathematical proof, Timothy. It is a living, breathing ecosystem. You have spent months building an incredible array of fire extinguishers, but you have never actually started a fire to see if they work. You are relying on hope, and hope is not an engineering strategy. It is time to implement Chaos Engineering using the AWS Fault Injection Simulator."

The Experiment

Margaret opened the AWS Console and navigated to the Fault Injection Simulator (FIS).

"Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions," Margaret explained. "Instead of waiting for a 3:00 AM emergency to find out if our failover works, we are going to intentionally break the primary API in us-east-1 right now, in broad daylight."

She opened Timothy's CDK stack and defined a new FIS Experiment Template.

const { CfnExperimentTemplate } = require('aws-cdk-lib/aws-fis');

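const regionalOutageExperiment = new CfnExperimentTemplate(this, 'RegionalOutageSimulation', {
    description: 'Simulate API Gateway 500 Errors to test Route 53 multi-region failover',
    roleArn: chaosRole.roleArn,
    // Tags label the template; some CloudFormation/CDK versions treat them as required
    tags: { Name: 'RegionalOutageSimulation' },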
    // 1. The Safety Valve: Stop the chaos if it impacts global revenue
    stopConditions: [{
        source: 'aws:cloudwatch:alarm',
        value: globalRevenueDropAlarm.alarmArn 
    }],
    // 2. The Blast Radius: Target ONLY the primary Virginia API
    targets: {
        'PrimaryApiGateway': {
            resourceType: 'aws:apigateway:rest-api',
            resourceTags: { 'Environment': 'Production', 'Region': 'us-east-1' },
            selectionMode: 'ALL'
        }
    },
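    // 3. The Fault: Inject 100% internal server errors for 5 minutes
    // NOTE: check the action ID and target type against the current FIS actions reference.
    // As documented, 'aws:fis:inject-api-internal-error' targets IAM roles ('aws:iam:role',
    // target key 'Roles') and faults AWS API calls, so this API Gateway wiring is illustrative.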
    actions: {
        'Inject500Errors': {
            actionId: 'aws:fis:inject-api-internal-error',
            targets: { 'RestApis': 'PrimaryApiGateway' },
            parameters: {
                duration: 'PT5M', 
                percentage: '100' 
            }
        }
    }
});

The Game Day

"Notice the stopConditions," Margaret pointed out. "We remember the three rules of chaos: start with a hypothesis, minimize the blast radius, and always have a stop condition. If our failover fails and global revenue drops, that CloudWatch alarm triggers and FIS immediately aborts the experiment, restoring the API."

Timothy swallowed hard. "So we are going to intentionally return 500 Internal Server Errors to our live customers? Shouldn't we run this in the staging environment first?"

"We did, yesterday," Margaret replied calmly. "Staging proves the configuration. But only production traffic reveals production weaknesses. Furthermore, before we press this button, we practice the golden rule of a Game Day: communication. I have already notified the customer support team and the on-call engineers. A coordinated chaos experiment is an engineering exercise. An unannounced one is a panic."

She clicked Start Experiment.

Timothy immediately pulled up his monitoring dashboards. Within seconds, the primary API Gateway in Virginia began throwing massive spikes of 500 errors.

"The Route 53 health checks are failing," Timothy narrated, his heart racing slightly despite his earlier confidence. "One failure. Two failures. Three failures... The DNS switch just flipped."

He switched to the Oregon dashboard. Instantly, traffic began flooding into the us-west-2 standby compute layer. Because the DynamoDB Global Table was already synchronized, every customer's cart was intact. The total downtime was less than ninety seconds. After five minutes, FIS automatically stopped the API errors in Virginia, and traffic smoothly routed back to the primary region.
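The timing Timothy watched can be reasoned about with a little arithmetic. The sketch below is a simplified model, not Route 53's actual algorithm: it treats the health check as a single checker that fails over after a run of consecutive failures, and it estimates the failover delay as detection time plus the DNS record TTL. The function names and the numbers in the example (a 10-second interval, a threshold of 3, a 30-second TTL) are assumptions for illustration, not values taken from the stack in this series.

```javascript
// Simplified model: returns the index of the check at which failover
// would trigger, or -1 if the threshold is never reached.
// `results` is an array of booleans (true = healthy response).
function failoverTriggerIndex(results, failureThreshold) {
    let consecutiveFailures = 0;
    for (let i = 0; i < results.length; i++) {
        // A healthy response resets the streak; a failure extends it.
        consecutiveFailures = results[i] ? 0 : consecutiveFailures + 1;
        if (consecutiveFailures >= failureThreshold) {
            return i;
        }
    }
    return -1;
}

// Rough estimate: time to detect the outage plus time for cached
// DNS answers to expire. Real clients and resolvers may behave differently.
function estimateFailoverSeconds({ requestIntervalSeconds, failureThreshold, recordTtlSeconds }) {
    const detectionSeconds = requestIntervalSeconds * failureThreshold;
    return detectionSeconds + recordTtlSeconds;
}

// Two healthy checks, then the fault starts: the third failure flips the switch.
console.log(failoverTriggerIndex([true, true, false, false, false], 3)); // 4

// Hypothetical settings: 10 s interval, threshold 3, 30 s TTL.
console.log(estimateFailoverSeconds({
    requestIntervalSeconds: 10,
    failureThreshold: 3,
    recordTtlSeconds: 30
})); // 60
```

Under this model, a shorter check interval or a lower TTL shrinks the outage window, at the price of more health check traffic and more DNS queries.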

Timothy let out a breath he didn't realize he was holding. "It actually worked."

"Now you know it works," Margaret smiled. "You have moved from theoretical resilience to proven resilience. Your safety nets are no longer just code; they are verified reality."

Key Concepts Introduced

Chaos Engineering & The Three Rules: The discipline of deliberately injecting failures into a software system to discover systemic weaknesses. It relies on three core rules: 1) Start with a clear hypothesis, 2) Minimize the blast radius, and 3) Always have an automated stop condition.
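The three rules can be turned into a small pre-flight check that runs before anyone presses Start. The sketch below is a hypothetical helper, not part of FIS or the CDK: `preflightCheck` is an invented name, it assumes the hypothesis is recorded in the template's description, and it expects a plain object shaped like the props passed to CfnExperimentTemplate in this episode. The alarm ARN in the example is a placeholder.

```javascript
// Hypothetical pre-flight check; returns a list of rule violations.
function preflightCheck(template) {
    const problems = [];

    // Rule 1: a hypothesis. Assumption: it lives in the description field.
    if (!template.description || template.description.trim() === '') {
        problems.push('No description: write down the hypothesis being tested.');
    }

    // Rule 2: a bounded blast radius. Every target needs a tag or ARN filter.
    for (const [name, target] of Object.entries(template.targets || {})) {
        const hasTags = Object.keys(target.resourceTags || {}).length > 0;
        const hasArns = (target.resourceArns || []).length > 0;
        if (!hasTags && !hasArns) {
            problems.push(`Target "${name}" has no tag or ARN filter.`);
        }
    }

    // Rule 3: an automated stop condition backed by a CloudWatch alarm.
    const hasAlarm = (template.stopConditions || []).some(
        (condition) => condition.source === 'aws:cloudwatch:alarm' && Boolean(condition.value)
    );
    if (!hasAlarm) {
        problems.push('No CloudWatch alarm stop condition.');
    }

    return problems;
}

const example = {
    description: 'Simulate API Gateway 500 Errors to test Route 53 multi-region failover',
    stopConditions: [{
        source: 'aws:cloudwatch:alarm',
        // Placeholder ARN for illustration only
        value: 'arn:aws:cloudwatch:us-east-1:123456789012:alarm:GlobalRevenueDrop'
    }],
    targets: {
        PrimaryApiGateway: {
            resourceTags: { Environment: 'Production', Region: 'us-east-1' },
            selectionMode: 'ALL'
        }
    }
};

console.log(preflightCheck(example)); // []
```

A check like this could run in CI against the synthesized template, so an experiment without a stop condition never reaches the console.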

AWS Fault Injection Simulator (FIS): A fully managed service for running fault injection experiments on AWS workloads, since renamed AWS Fault Injection Service. FIS allows teams to simulate real-world failures (like API throttling, network latency, database instance termination, or CPU stress) in a controlled, auditable manner.

Blast Radius & Stop Conditions: The blast radius strictly defines the boundary of the experiment (e.g., targeting a single region or a specific resource tag). Stop conditions are automated "abort buttons," linked to CloudWatch Alarms, that halt the chaos experiment as soon as a critical business metric crosses a safe threshold (for example, global revenue falling or the overall error rate rising).
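To see how tags bound the blast radius, here is a minimal sketch of tag-based target selection. It is a local simulation, not the FIS implementation: `selectTargets` and the resource list are invented for illustration. The selection modes mirror the documented FIS forms (ALL, COUNT(n), PERCENT(n)), but where FIS chooses at random for COUNT and PERCENT, this sketch simply takes the first matches so the output is predictable.

```javascript
// Hypothetical simulation of how resource tags and a selection mode
// narrow the set of resources an experiment may touch.
function selectTargets(resources, resourceTags, selectionMode) {
    // A resource matches only if it carries every requested tag value.
    const matches = resources.filter((resource) =>
        Object.entries(resourceTags).every(([key, value]) => resource.tags[key] === value)
    );

    if (selectionMode === 'ALL') {
        return matches;
    }
    const count = /^COUNT\((\d+)\)$/.exec(selectionMode);
    if (count) {
        return matches.slice(0, Number(count[1]));
    }
    const percent = /^PERCENT\((\d+)\)$/.exec(selectionMode);
    if (percent) {
        // Round down, so 50% of three matches selects one resource.
        return matches.slice(0, Math.floor((matches.length * Number(percent[1])) / 100));
    }
    throw new Error(`Unsupported selection mode: ${selectionMode}`);
}

// Invented inventory for illustration
const apis = [
    { id: 'api-virginia', tags: { Environment: 'Production', Region: 'us-east-1' } },
    { id: 'api-oregon', tags: { Environment: 'Production', Region: 'us-west-2' } },
    { id: 'api-staging', tags: { Environment: 'Staging', Region: 'us-east-1' } }
];

const selected = selectTargets(
    apis,
    { Environment: 'Production', Region: 'us-east-1' },
    'ALL'
);
console.log(selected.map((api) => api.id)); // [ 'api-virginia' ]
```

Even with selection mode ALL, the two tags together keep the Oregon standby and the staging API out of the experiment.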

Game Days: Dedicated, planned periods where engineering teams execute chaos experiments in a production or production-like environment. They require strict cross-team communication. The goal is to train the team on incident response and verify that automated disaster recovery mechanisms execute correctly under stress.

Aaron Rose is a software engineer and technology writer at tech-reader.blog.
