Solve: ECS Rollouts and Rollbacks—How to Keep Your CDK Deployments from Breaking in Production


By now, you've solved the initial ECS deploy paradox and safely updated your service to use your real container image. That means you’ve gone from “why won’t this even launch?” to “we’re deploying our own code into ECS now.”

But there’s a next level—not just deploying successfully, but deploying safely.

In this post, we’ll show how to strengthen your deployment pipeline by focusing on what happens after CDK runs: ECS rollout behavior, container health checks, rollback settings, and optional traffic shifting. These aren’t luxuries—they’re the difference between weekend peace and a Saturday pager alert.


What Happens During an ECS Deployment (And Why It Matters)

Every time you change your ECS task definition—whether it’s a new image, an updated env var, or a different port—ECS creates a new revision. When CDK deploys that change, ECS replaces your old tasks with new ones, gradually.

This is called a rolling update. By default:
  • ECS starts a few new tasks using the new definition
  • It waits for them to become “healthy”
  • Then it drains and stops the old ones

This process is governed by two key settings:
  • minimumHealthyPercent (the lower limit, as a percentage of the desired count, on tasks that must stay running during the update)
  • maximumPercent (the upper limit, as a percentage of the desired count, on old plus new tasks that can run at once during the update)
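
In CDK, these surface as the minHealthyPercent and maxHealthyPercent props on the service construct. Here's a minimal sketch of setting them explicitly; the cluster and task definition are assumed to be defined elsewhere in your stack:

typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Explicit rollout bounds on a Fargate service.
// 'cluster' and 'taskDefinition' are placeholders for your own constructs.
const service = new ecs.FargateService(this, 'AppService', {
  cluster,
  taskDefinition,
  desiredCount: 4,
  minHealthyPercent: 50,   // keep at least half the desired count running during a deploy
  maxHealthyPercent: 200,  // allow up to double the desired count while new tasks spin up
});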

If your app takes time to warm up or fails to start cleanly, ECS may kill it before it even gets a chance. That’s where health checks and rollback controls come in.


Health Checks Are Your First Line of Defense

To stop bad deploys early, you need to teach ECS what “healthy” looks like.

There are two places to define this:

  • In the container itself, using the Docker HEALTHCHECK instruction (a CDK equivalent is sketched after this list):

dockerfile
HEALTHCHECK CMD curl -f http://localhost:3000/healthz || exit 1

  • In the ECS service, especially if you're using a load balancer:

typescript
service.targetGroup.configureHealthCheck({
  path: '/healthz',
  interval: Duration.seconds(30),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 2,
});
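
If you'd rather not bake the check into the image, ECS also supports a container-level health check that you can declare directly in CDK when you add the container to the task definition. A minimal sketch, assuming an app that serves /healthz on port 3000 and an image that includes curl; the task definition and image URI are placeholders:

typescript
import { Duration } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// The same health check as the Dockerfile version, declared in CDK.
// 'taskDefinition' and the image URI are placeholders for your own.
taskDefinition.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('my-app:latest'),
  portMappings: [{ containerPort: 3000 }],
  healthCheck: {
    command: ['CMD-SHELL', 'curl -f http://localhost:3000/healthz || exit 1'],
    interval: Duration.seconds(30),
    timeout: Duration.seconds(5),
    retries: 3,
    startPeriod: Duration.seconds(60), // don't count failures during startup
  },
});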

You can also set a grace period:

typescript
healthCheckGracePeriod: Duration.seconds(60),

This gives your container time to boot and pass its checks before ECS starts judging it. That one line has saved many a rollout.
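
To be concrete about where that property lives: it belongs on the service construct itself, for example an ApplicationLoadBalancedFargateService. A minimal sketch, with the cluster and image as placeholders:

typescript
import { Duration } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

// A load-balanced Fargate service that gets 60 seconds to boot
// before load balancer health-check failures count against it.
const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'AppService', {
  cluster,
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('my-app:latest'),
    containerPort: 3000,
  },
  healthCheckGracePeriod: Duration.seconds(60),
});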


Circuit Breakers: Let ECS Roll Back a Bad Deploy

Even with health checks, mistakes happen. That’s why you should always enable the ECS circuit breaker: 

typescript
circuitBreaker: { rollback: true }

If ECS sees that the new task revision is failing to stabilize (e.g., tasks are restarting, failing health checks, or being killed), it stops the rollout and rolls back to the last known good revision.
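
In CDK, the flag sits on the service construct alongside the rollout settings from earlier. A minimal sketch; as before, the cluster and task definition are assumed to exist elsewhere in your stack:

typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

// A rolling deployment that automatically rolls back on failure.
// 'cluster' and 'taskDefinition' are placeholders for your own constructs.
const service = new ecs.FargateService(this, 'AppService', {
  cluster,
  taskDefinition,
  minHealthyPercent: 50,
  maxHealthyPercent: 200,
  circuitBreaker: { rollback: true }, // stop a bad rollout and revert to the last working revision
});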

You can monitor this via:
  • ECS console (look for "Deployment failed. Rolling back.")
  • CloudWatch Events or logs (optional alarm triggers)
  • cdk deploy output (the command fails when the service never stabilizes)

Without the circuit breaker, ECS keeps trying to launch the failing tasks, and your deployment can hang for a long time or leave the service half-broken.


Optional: Blue/Green Deployments with Traffic Shifting

If you want even more control, AWS offers blue/green deployments via CodeDeploy. This means:
  • ECS starts new tasks in parallel (the “green” set)
  • You gradually shift traffic over (10% → 50% → 100%)
  • You have a chance to validate and approve the change

It sounds great—but it adds real complexity. You need:
  • A load balancer with production and test listeners
  • A CodeDeploy application and deployment group
  • Additional IAM permissions and rollback hooks

For most teams, CDK's default rolling update plus a circuit breaker is enough. But if you're handling sensitive workloads or regulated environments, blue/green is worth exploring—especially if paired with automated smoke tests during the shift.
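
For a sense of the moving parts, here's a rough sketch of the CodeDeploy side in CDK. It assumes the service was created with the CODE_DEPLOY deployment controller and that the two target groups and the production listener already exist; every name here is a placeholder:

typescript
import * as codedeploy from 'aws-cdk-lib/aws-codedeploy';

// A blue/green deployment group that shifts 10% of traffic first,
// then the remainder five minutes later if nothing fails.
// 'service', 'blueTargetGroup', 'greenTargetGroup', and 'prodListener'
// are placeholders assumed to be defined elsewhere in your stack.
new codedeploy.EcsDeploymentGroup(this, 'BlueGreenGroup', {
  service,
  blueGreenDeploymentConfig: {
    blueTargetGroup,
    greenTargetGroup,
    listener: prodListener,
  },
  deploymentConfig: codedeploy.EcsDeploymentConfig.CANARY_10PERCENT_5MINUTES,
});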


Observability: What to Watch After You Deploy

Rollbacks and health checks are great—but they’re reactive. Observability is your proactive safety net.

Here’s what you should watch:
  • CloudWatch Alarms on:
    - Task CPU/memory overuse
    - Application error rate
    - Deployment failures
  • Logs from new task definitions (using ECS console or CloudWatch Logs)
  • EventBridge Rules to catch failed deploys or excessive restarts
  • Container metrics like restart count and exit codes

In CDK, you can wire some of this up using built-in constructs, or feed data into dashboards you already maintain.
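
As a starting point, here's a sketch of a CPU alarm on the service plus an EventBridge rule for ECS deployment state changes, both notifying an SNS topic; the topic is a stand-in for whatever alerting target you already use:

typescript
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as sns from 'aws-cdk-lib/aws-sns';

// 'service' is the ECS service defined elsewhere in your stack.
const alertTopic = new sns.Topic(this, 'DeployAlerts');

// Alarm when average CPU stays above 80% for three evaluation periods.
const cpuAlarm = new cloudwatch.Alarm(this, 'HighCpu', {
  metric: service.metricCpuUtilization(),
  threshold: 80,
  evaluationPeriods: 3,
});
cpuAlarm.addAlarmAction(new cloudwatchActions.SnsAction(alertTopic));

// Notify on ECS deployment state changes, which include failed and rolled-back deploys.
new events.Rule(this, 'DeploymentStateChange', {
  eventPattern: {
    source: ['aws.ecs'],
    detailType: ['ECS Deployment State Change'],
  },
  targets: [new targets.SnsTopic(alertTopic)],
});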


Conclusion: Production-Ready Means Fail-Ready

Deploying with CDK and ECS is satisfying—but it only becomes trustworthy when it survives failure. What you’ve added here—health checks, rollbacks, alarms—isn’t complexity. It’s insurance.

You’re no longer walking on eggshells after each deploy. You’re operating like someone who expects success but plans for turbulence.

If you’d like to go further, we can explore runtime automation patterns next—like triggering rollbacks from alarms or pausing traffic shifts on anomaly detection.

But for now? You’re stable. You’re ready. You’re live.

* * * 

Written by Aaron Rose, software engineer and technology writer at Tech-Reader.blog.
