The Secret Life of AWS: The Backup Plan (SQS Dead-Letter Queues & Redrive)

 

How to catch, store, and replay data when async events fail

#AWS #SQS #DeadLetterQueue #EventDriven




Margaret is a senior software engineer. Timothy is her junior colleague. They work in a grand Victorian library in London — the kind of place where code quality is the unspoken objective, and craftsmanship is the only thing that matters.

Episode 69

Timothy was staring at an urgent vendor memo with a look of deep concern. He turned his monitor toward Margaret as she walked into the grand Victorian library they used as their studio.

"We have a massive integration problem," Timothy announced. "Our external payment platform is temporarily going offline from May 1 to May 17 for a major system migration. That is sixteen days of downtime."

"A scheduled, multi-week outage for a core dependency," Margaret observed. "How does our current architecture handle it?"

"Badly," Timothy admitted. "Our billing microservice publishes an asynchronous PaymentProcessed event to EventBridge, which routes it to a Lambda function that updates the external portal. I know EventBridge has built-in retries, but the default maximum retry window is only 24 hours. If the portal is down for sixteen days, EventBridge will exhaust its retries and drop the events into the void. We are going to permanently lose thousands of billing records."

"You have discovered the silent danger of asynchronous architecture," Margaret said, taking a seat. "When synchronous APIs fail, the user sees an error immediately. When asynchronous events fail after their retry window, the data vanishes without a sound. We need to build a safety net to catch those dropped events. We need an Amazon SQS Dead-Letter Queue."

The Event Graveyard

Margaret opened the AWS Console and navigated to Amazon Simple Queue Service (SQS).

"For EventBridge targets, the safety net is a standard SQS queue acting as a holding tank for messages that cannot be processed," Margaret explained. "We configure EventBridge so that when it finally gives up on an event after 24 hours of retries, it does not delete it. Instead, it routes the exact payload into this DLQ."

She updated Timothy's infrastructure code using the AWS Cloud Development Kit (CDK) to attach the DLQ to his existing EventBridge rule:

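// Duration is exported from the aws-cdk-lib root module, not from aws-sqs
const { Duration } = require('aws-cdk-lib');
const { Queue } = require('aws-cdk-lib/aws-sqs');
const { Rule } = require('aws-cdk-lib/aws-events');
const { LambdaFunction } = require('aws-cdk-lib/aws-events-targets');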

// 1. Create the safety net (The DLQ)
const billingDeadLetterQueue = new Queue(this, 'BillingDLQ', {
    queueName: 'payment-portal-dlq',
    retentionPeriod: Duration.days(14) // 14 days is the longest SQS will retain a message
});

// 2. Configure the EventBridge target with the DLQ
//    (updatePortalLambda is the existing Lambda function construct)
const paymentPortalTarget = new LambdaFunction(updatePortalLambda, {
    deadLetterQueue: billingDeadLetterQueue,
    maxEventAge: Duration.hours(24),
    retryAttempts: 185
});

// 3. Attach the target to the existing rule
new Rule(this, 'PaymentProcessedRule', {
    eventPattern: { detailType: ['PaymentProcessed'] },
    targets: [paymentPortalTarget]
});

The Redrive

Timothy studied the code. "So during the sixteen-day migration, EventBridge will try to send the events, fail, execute all 185 retry attempts over 24 hours, and then quietly park the failed payloads into payment-portal-dlq. The data is safe."

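"Almost," Margaret said. "Fourteen days is the longest SQS will retain a message, and this outage runs sixteen, so the earliest events would expire before the platform returns unless we copy them somewhere durable first. We also set up a CloudWatch Alarm on that queue. If the ApproximateNumberOfMessagesVisible metric goes above zero, you get an alert. That tells you events are failing and landing in the net. And it is not just for massive outages. Sometimes an event payload is simply malformed, a 'poison pill' that will always crash the function. A dead-letter queue is the right place for those too, isolating the bad data so it doesn't clog the system."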
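The alarm Margaret describes could be expressed roughly as follows. This is a minimal sketch, not the configuration from the episode: `buildDlqAlarmParams` is a hypothetical helper, and the alarm name, five-minute period, and SNS topic ARN are assumptions. The returned object follows the input shape of CloudWatch's PutMetricAlarm API.

```javascript
// Hypothetical helper: builds the input for CloudWatch's PutMetricAlarm API.
// The alarm fires when at least one message is visible in the queue.
function buildDlqAlarmParams(queueName, alertTopicArn) {
  return {
    AlarmName: `${queueName}-not-empty`, // assumed naming scheme
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateNumberOfMessagesVisible',
    Dimensions: [{ Name: 'QueueName', Value: queueName }],
    Statistic: 'Maximum',
    Period: 300, // seconds; an assumed evaluation window
    EvaluationPeriods: 1,
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'notBreaching', // no datapoints is treated as an empty queue
    AlarmActions: [alertTopicArn],
  };
}

// Placeholder account, region, and topic name for illustration only.
const alarmParams = buildDlqAlarmParams(
  'payment-portal-dlq',
  'arn:aws:sns:eu-west-2:123456789012:billing-alerts'
);
console.log(JSON.stringify(alarmParams, null, 2));

// With the AWS SDK for JavaScript v3 installed, the object could be sent as:
//   const { CloudWatchClient, PutMetricAlarmCommand } = require('@aws-sdk/client-cloudwatch');
//   await new CloudWatchClient({}).send(new PutMetricAlarmCommand(alarmParams));
```

Building the parameters as a plain object keeps the threshold logic easy to review before anything is deployed.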

"But what do we do on May 18th, when the payment platform finally comes back online?" Timothy asked. "Do I have to write a custom script to read thousands of messages out of the queue and manually inject them back into the system?"

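"You used to have to do exactly that," Margaret smiled. "SQS now has a native feature called DLQ Redrive. From the console or the API, it moves every message in a dead-letter queue back to its source queue, or to another queue you choose. It moves messages queue to queue, though, not into a function, and AWS documents it for dead-letter queues attached to other SQS queues. So we give the Lambda function an SQS work queue to read from and redrive the backlog into that, after confirming the redrive accepts our EventBridge DLQ as a source. If it does not, a short replay script is the fallback. Either way, the function works through the backlog from the outage without anyone re-entering events by hand."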

Timothy updated his architecture diagram. His asynchronous system was no longer a black box where data could silently disappear. By establishing a Dead-Letter Queue, his architecture was now resilient enough to survive even the most catastrophic downstream outages.


Key Concepts Introduced

Asynchronous Data Loss
In event-driven architectures, services process data in the background. If a downstream service is offline, message brokers (like EventBridge or SNS) retry delivery for a configured period. Once those retries are exhausted, the broker discards the message, and the result is permanent, silent data loss unless a safety mechanism is in place.

Dead-Letter Queue (DLQ)
An Amazon SQS queue specifically designated as a holding tank for messages or events that fail to process successfully. By attaching a DLQ to event targets or Lambda functions, developers ensure that "poison pill" messages (malformed data) or events that fail due to prolonged downstream outages are caught and stored safely for future investigation.
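One nuance worth checking when attaching DLQs at both levels: EventBridge invokes Lambda asynchronously, so the target DLQ holds events EventBridge could not hand over to Lambda, while errors thrown inside the function fall under Lambda's own asynchronous retry settings. A minimal sketch of the function-level side follows, assuming the same queue is reused as the failure destination. `buildAsyncFailureConfig` is a hypothetical helper, and the function name and ARN are placeholders. The object follows the input shape of Lambda's PutFunctionEventInvokeConfig API.

```javascript
// Hypothetical helper: builds the input for Lambda's PutFunctionEventInvokeConfig API,
// routing events that still fail after Lambda's own retries to an SQS queue.
function buildAsyncFailureConfig(functionName, failureQueueArn) {
  return {
    FunctionName: functionName,
    MaximumRetryAttempts: 2, // Lambda allows 0 to 2 retries for async invocations
    MaximumEventAgeInSeconds: 21600, // 6 hours, the longest Lambda will hold an async event
    DestinationConfig: {
      OnFailure: { Destination: failureQueueArn },
    },
  };
}

// Placeholder names for illustration only.
const failureConfig = buildAsyncFailureConfig(
  'update-portal',
  'arn:aws:sqs:eu-west-2:123456789012:payment-portal-dlq'
);
console.log(JSON.stringify(failureConfig, null, 2));

// With the AWS SDK for JavaScript v3 installed, the object could be sent as:
//   const { LambdaClient, PutFunctionEventInvokeConfigCommand } = require('@aws-sdk/client-lambda');
//   await new LambdaClient({}).send(new PutFunctionEventInvokeConfigCommand(failureConfig));
```

Records written by an on-failure destination wrap the original event under a `requestPayload` field, so a replay consumer would need to unwrap them.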

Maximum Event Age & Retry Attempts
Configuration settings in EventBridge and Lambda that define how long, and how many times, the service attempts to deliver a payload before giving up and routing it to the DLQ. For EventBridge targets, the defaults (which are also the maximums) are a 24-hour event age and 185 retry attempts, spaced with exponential backoff.
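These settings interact with the DLQ's retention period, and the arithmetic is worth doing before a long outage. The sketch below assumes the worst case for the scenario in this episode: an event published at the very start of the outage is retried for the full 24 hours, then sits in a queue with 14-day retention. The year is an assumption (the episode gives only month and day), and `dlqTimeline` is a hypothetical helper.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Returns when an undeliverable event would reach the DLQ (after the full
// retry window) and when SQS would delete it (after the retention period).
function dlqTimeline(publishedAt, maxEventAgeHours, retentionDays) {
  const enteredDlqAt = new Date(publishedAt.getTime() + maxEventAgeHours * HOUR_MS);
  const expiresAt = new Date(enteredDlqAt.getTime() + retentionDays * DAY_MS);
  return { enteredDlqAt, expiresAt };
}

// Dates use UTC; the year 2025 is assumed for illustration.
const outageStarts = new Date('2025-05-01T00:00:00Z');
const outageEnds = new Date('2025-05-17T00:00:00Z');

const { enteredDlqAt, expiresAt } = dlqTimeline(outageStarts, 24, 14);
console.log(enteredDlqAt.toISOString()); // 2025-05-02T00:00:00.000Z
console.log(expiresAt.toISOString());    // 2025-05-16T00:00:00.000Z
console.log(expiresAt < outageEnds);     // true: the first events expire before the outage ends
```

Under these assumptions the earliest events age out about a day before the platform returns, so they would need to be moved or archived ahead of that date.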

SQS DLQ Redrive
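A native SQS feature that simplifies operational recovery. Instead of writing custom code to extract and replay failed messages, developers can start a redrive from the AWS Console or the API to move the messages in a DLQ back to their source queue, or to another queue, for reprocessing once the underlying issue is resolved. Redrive moves messages between queues, so a target such as a Lambda function receives them by consuming the destination queue. AWS documents redrive for DLQs that belong to other SQS queues, so confirm that it accepts a DLQ fed by EventBridge before relying on it.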
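When replayed events reach a Lambda function through an SQS queue, they arrive in a different envelope than a direct EventBridge invocation: a `Records` array whose `body` fields hold the original event as a JSON string. A handler that serves both paths could normalize its input along these lines. This is a sketch only: `extractEvents` is a hypothetical helper and `paymentId` is an assumed field in the event detail.

```javascript
// Returns the EventBridge events contained in a Lambda invocation payload,
// whether the function was invoked by EventBridge directly or by an SQS queue.
function extractEvents(invocation) {
  if (Array.isArray(invocation.Records)) {
    // SQS batch: each record body is the original event serialized as JSON.
    return invocation.Records
      .filter((record) => record.eventSource === 'aws:sqs')
      .map((record) => JSON.parse(record.body));
  }
  // Direct EventBridge invocation: the payload is the event itself.
  return [invocation];
}

// Sample payloads; paymentId is an assumed field for illustration.
const directEvent = { 'detail-type': 'PaymentProcessed', detail: { paymentId: 'p-1' } };
const queuedBatch = {
  Records: [{ eventSource: 'aws:sqs', body: JSON.stringify(directEvent) }],
};

console.log(extractEvents(directEvent)[0].detail.paymentId);
console.log(extractEvents(queuedBatch)[0].detail.paymentId);
```

Keeping the unwrapping in one small function lets the portal-update logic stay identical for live traffic and for the replayed backlog.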


Aaron Rose is a software engineer and technology writer at tech-reader.blog
