The Secret Life of AWS: Canary Deployments with AWS CodeDeploy

How to safely release new code to 10% of users and automatically roll back if something breaks

#AWS #CodeDeploy #CanaryDeployment #DevOps

Margaret is a senior software engineer. Timothy is her junior colleague. They work in a grand Victorian library in London — the kind of place where code quality is the unspoken objective, and craftsmanship is the only thing that matters.

Episode 73

Timothy was staring at the blinking cursor in his terminal, his finger hovering uncertainly over the Enter key. He had just finished writing a major update to their checkout microservice, but he couldn't bring himself to execute the deployment command.

Margaret walked into the grand Victorian library studio and noticed his hesitation. "Having second thoughts about your code?"

"The code is fine," Timothy sighed, pulling his hand away from the keyboard. "It passed all our local unit tests. But our architecture is so massive now. If I hit deploy, AWS Lambda instantly swaps out the old code for the new code across the entire continent. If there is a hidden bug that only appears in production, 100% of our customers will immediately start getting 500 Internal Server Errors. We will lose thousands of dollars before I can manually revert it."

"Your infrastructure is now region-resilient. That's impressive," Margaret said, taking a seat. "But there's one thing it can't survive: a bad line of code. Remember when we first explored canary deployments back in Episode 58? That was for individual Lambda functions. Now we need to apply that exact same traffic-shifting pattern at scale across our entire continent-spanning API. We need a Canary Deployment using AWS CodeDeploy."

The Ten Percent Test

Margaret took the keyboard and opened Timothy's AWS Cloud Development Kit (CDK) stack.

"Instead of replacing the live code for everyone simultaneously," Margaret explained, "AWS CodeDeploy lets us shift traffic gradually by adjusting the weights on the Lambda alias that API Gateway invokes. We can route just ten percent of our live customer traffic to your new version. The other ninety percent stays on the old, stable version."
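To make the ten-percent split concrete, here is a minimal sketch of a weighted choice between two function versions. It is plain JavaScript rather than an AWS API call, and the names `pickVersion`, `countRouted`, and `liveAlias`, along with the version numbers '41' and '42', are made up for illustration.

```javascript
// Hypothetical model of a weighted alias; this is not an AWS SDK call.
// The canary version receives `canaryWeight` of the traffic and the
// primary (old) version receives whatever is left.
function pickVersion(alias, roll) {
    // `roll` is a number in [0, 1), for example Math.random()
    return roll < alias.canaryWeight ? alias.canaryVersion : alias.primaryVersion;
}

// Count where `samples` evenly spaced rolls would land.
function countRouted(alias, samples) {
    const counts = { [alias.primaryVersion]: 0, [alias.canaryVersion]: 0 };
    for (let i = 0; i < samples; i++) {
        counts[pickVersion(alias, i / samples)] += 1;
    }
    return counts;
}

// Assumed example: version 41 is the stable code, 42 is the new release.
const liveAlias = { primaryVersion: '41', canaryVersion: '42', canaryWeight: 0.10 };
```

Calling `countRouted(liveAlias, 1000)` sends 100 of the 1,000 rolls to the new version and leaves 900 on the old one. Real traffic is random rather than evenly spaced, so the actual split will hover around ten percent rather than hit it exactly.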

She updated his deployment configuration to implement the pattern.

const { LambdaDeploymentGroup, LambdaDeploymentConfig } = require('aws-cdk-lib/aws-codedeploy');
const { Alias } = require('aws-cdk-lib/aws-lambda');

// Defined elsewhere in the stack: newCheckoutFunctionVersion (the published
// Lambda version being released) and checkoutErrorRateAlarm (a CloudWatch alarm).

// 1. Create a Lambda Alias to act as the traffic router
const prodAlias = new Alias(this, 'ProdAlias', {
    aliasName: 'live',
    version: newCheckoutFunctionVersion,
});

// 2. Configure the safe deployment strategy using AWS CodeDeploy
const deploymentGroup = new LambdaDeploymentGroup(this, 'SafeDeployment', {
    alias: prodAlias,
    // Shift 10% of traffic first, wait 5 minutes, then shift the rest
    deploymentConfig: LambdaDeploymentConfig.CANARY_10PERCENT_5MINUTES,

    // 3. The Safety Net: auto-rollback if the CloudWatch alarm fires
    alarms: [checkoutErrorRateAlarm],
});
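The stack above assumes that `checkoutErrorRateAlarm` already exists. One way it might be defined is sketched below, as an error-rate alarm on the alias. The helper name `buildCheckoutErrorRateAlarm`, the 5 percent threshold, and the one-minute period are assumptions for illustration, not values from Timothy's stack.

```javascript
const { Duration } = require('aws-cdk-lib');
const cloudwatch = require('aws-cdk-lib/aws-cloudwatch');

// Hypothetical helper: builds an error-rate alarm for the given Lambda alias.
function buildCheckoutErrorRateAlarm(scope, prodAlias) {
    const period = Duration.minutes(1);

    // Percentage of invocations through the alias that ended in an error.
    const errorRate = new cloudwatch.MathExpression({
        expression: '100 * errors / invocations',
        usingMetrics: {
            errors: prodAlias.metricErrors({ statistic: 'Sum', period }),
            invocations: prodAlias.metricInvocations({ statistic: 'Sum', period }),
        },
        period,
    });

    return new cloudwatch.Alarm(scope, 'CheckoutErrorRateAlarm', {
        metric: errorRate,
        threshold: 5,          // assumed: alarm at a 5% error rate
        evaluationPeriods: 1,  // assumed: a single bad minute is enough
        comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
        // No traffic means no data; treat that as healthy rather than failing.
        treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
    });
}
```

Inside the stack this would be called as `buildCheckoutErrorRateAlarm(this, prodAlias)` before the deployment group is created. A tighter threshold catches smaller regressions but is more likely to roll back a healthy release on a burst of unrelated errors, so the numbers are worth tuning against real traffic.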

The Automated Rollback

Timothy studied the infrastructure logic. "CANARY_10PERCENT_5MINUTES. So CodeDeploy routes ten percent of our checkouts to my new code and holds that state for exactly five minutes."

"Exactly," Margaret smiled. "Those ten percent of users act as our 'canaries in the coal mine.' During that five-minute window, CodeDeploy actively monitors the CloudWatch alarm we attached. If your new code has a hidden production bug and error rates spike, the alarm triggers."

"And then what?" Timothy asked. "Does it alert me to fix it?"

"It does alert you, but it doesn't wait for you," Margaret said. "If that alarm goes into the ALARM state, CodeDeploy stops the deployment and automatically shifts 100% of the traffic back to the old, stable version. The switch itself is quick, far quicker than any manual revert. The blast radius of your bad code is contained to roughly one in ten requests for a few minutes, rather than bringing down the entire continent."
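The decision Margaret describes can be modeled in a few lines. This is only a toy simulation of the behavior, not CodeDeploy's implementation; `runCanary`, `estimateAffectedRequests`, and the traffic figures in the usage note are hypothetical.

```javascript
// Toy model: alarmStates[i] is the alarm state seen during minute i
// of the canary wait window ('OK' or 'ALARM').
function runCanary(alarmStates, waitMinutes = 5) {
    for (let minute = 0; minute < waitMinutes; minute++) {
        if (alarmStates[minute] === 'ALARM') {
            // Roll back: all traffic returns to the old version.
            return { outcome: 'ROLLED_BACK', newVersionPercent: 0, minutesExposed: minute + 1 };
        }
    }
    // The alarm stayed quiet for the whole window: finish the shift.
    return { outcome: 'SUCCEEDED', newVersionPercent: 100, minutesExposed: waitMinutes };
}

// Rough blast radius: requests that reached the new code before rollback.
function estimateAffectedRequests(requestsPerMinute, newVersionPercent, minutesExposed) {
    return (requestsPerMinute * newVersionPercent / 100) * minutesExposed;
}
```

With an assumed 600 checkout requests per minute and an alarm that fires in the second minute, `estimateAffectedRequests(600, 10, 2)` gives 120 requests on the bad code, against 1,200 for an all-at-once release over the same two minutes.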

"And if the alarm doesn't trigger?" Timothy asked.

"Then after five minutes of healthy metrics, CodeDeploy routes the remaining ninety percent of traffic to the new version," Margaret replied. "The deployment completes automatically. We could also configure a LINEAR deployment, shifting 10% of traffic every minute, but a 5-minute Canary gives our alarms enough time to gather reliable metric data. And running two versions side by side during the window costs very little: Lambda bills per invocation, so the overlap only really shows up if you pay for provisioned concurrency on both versions. Either way, it is far cheaper than a massive production outage."
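The difference between the two strategies is easiest to see as a schedule of how much traffic the new version receives over time. The sketch below is illustrative only; the function names are invented, and it assumes the first increment takes effect at minute zero.

```javascript
// Canary: one small step, a wait, then everything else at once.
function canaryPercent(elapsedMinutes, firstStep = 10, waitMinutes = 5) {
    return elapsedMinutes < waitMinutes ? firstStep : 100;
}

// Linear: equal steps at equal intervals until 100% is reached.
function linearPercent(elapsedMinutes, step = 10, intervalMinutes = 1) {
    const stepsTaken = Math.floor(elapsedMinutes / intervalMinutes) + 1;
    return Math.min(100, stepsTaken * step);
}

// Percent of traffic on the new version at each minute from 0 to 9.
const minutes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
const canarySchedule = minutes.map((m) => canaryPercent(m));
const linearSchedule = minutes.map((m) => linearPercent(m));
```

Under these assumptions the canary holds at 10% through minute four and jumps to 100% at minute five, while the linear schedule is already at 50% by minute four. The canary therefore keeps exposure low for the whole evaluation window, which is the trade-off Margaret is pointing at.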

Timothy pressed Enter and watched the terminal as the deployment pipeline kicked off. For the first time since their user base exploded, he wasn't terrified of releasing new code. His deployments were no longer a leap of faith; they were a calculated, automated transition with a bounded blast radius.

Key Concepts Introduced

All-at-Once Deployments: The default deployment strategy for AWS Lambda, where 100% of live traffic is instantly shifted from the old version of a function to the new version. While fast, it carries massive risk—if the new code contains a critical bug, the entire user base is immediately impacted.

Canary Deployments: A safe deployment strategy where new code is released to a small, controlled percentage of users (e.g., 10%) for a set evaluation period. If the new version proves stable, the remaining traffic is shifted over in a second step. This pattern scales from single functions (as seen in Episode 58) to the aliases that serve an entire API behind API Gateway.

Linear Deployments: An alternative to Canary deployments where traffic is shifted in equal increments with an equal number of minutes between each increment (e.g., shifting 10% of traffic every minute until 100% is reached).

AWS CodeDeploy & Lambda Aliases: CodeDeploy is a fully managed deployment service that automates software releases. In a serverless architecture, it works by manipulating a Lambda Alias (a pointer to a specific function version). CodeDeploy shifts the weights on the Alias to split incoming traffic between the old and new code, so API Gateway must invoke the Alias rather than the unversioned function. On standard per-invocation billing, running two versions concurrently during the shift adds little or no cost; provisioned concurrency is the main exception.

Automated Rollbacks: The practice of linking deployment pipelines to operational health metrics (like CloudWatch Alarms). If an alarm triggers during the canary evaluation window, the deployment system automatically aborts the release and safely reroutes all traffic back to the previous stable version without requiring human intervention.

Aaron Rose is a software engineer and technology writer at tech-reader.blog.
