AWS SQS Error: SQS Dead-Letter Queue (DLQ) Redrive Misconfigured

 


When messages never reach a Dead-Letter Queue, SQS isn’t broken. It’s doing exactly what it was told.





Problem

You’ve configured a Dead-Letter Queue (DLQ) for Amazon SQS.

Messages fail processing repeatedly.
But nothing ever shows up in the DLQ.

No errors.
No warnings.
Just a growing sense that SQS is ignoring your configuration.


Clarifying the Issue

This is almost never a bug in SQS.

It is almost always a redrive failure caused by one of three things:

❌ The redrive policy is incomplete or incorrect
❌ Messages are being deleted before SQS can redrive them
❌ SQS is silently unable to write to the DLQ (permissions or encryption)

SQS only moves messages to a DLQ under very strict conditions.
If any requirement is unmet, messages will retry — or be dropped — without ceremony.


Why It Matters

DLQs are meant to be your last line of defense:

  • Capture poison messages
  • Preserve failed data
  • Stop infinite retry loops
  • Enable debugging and reprocessing

When DLQs don’t work:

  • Bad messages churn invisibly
  • Lambda costs rise
  • Downstream systems degrade
  • Failures become harder to diagnose

A broken DLQ is worse than no DLQ — because it creates false confidence.


Key Terms

  • Dead-Letter Queue (DLQ) – A queue that receives messages after repeated failures
  • Redrive Policy – Configuration linking a source queue to a DLQ
  • maxReceiveCount – Number of times a message can be received from the source queue before SQS moves it to the DLQ
  • Receive Count – How many times a message has been received (reported by SQS as ApproximateReceiveCount)
  • Explicit Delete – Consumer deletes a message, ending its lifecycle
  • Silent Drop – Message discarded because SQS cannot redrive it

Steps at a Glance

  1. Confirm the redrive policy, DLQ permissions, and encryption
  2. Verify maxReceiveCount is understood correctly
  3. Ensure failed messages are not deleted
  4. Understand what increments receive count
  5. Validate consumer behavior (especially Lambda)

Detailed Steps

Step 1: Confirm Redrive Policy, Permissions, and Encryption

DLQs are configured on the source queue, not the DLQ itself.

Creating a DLQ alone does nothing.
Linking it incorrectly does nothing.

Even worse: if SQS cannot write to the DLQ, messages are dropped silently after retries.

Action

  • Link it: Source Queue → Redrive Policy → Select DLQ
  • Permit it: Ensure the DLQ Access Policy allows sqs:SendMessage from the source queue (or account)
  • Decrypt it: If the DLQ uses KMS, ensure SQS has kms:GenerateDataKey and kms:Decrypt
  • ❌ If permissions are missing, SQS will drop messages silently

This is one of the most dangerous DLQ failure modes because nothing errors loudly.
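As a minimal sketch of the "Link it" action, the redrive policy can be built and attached from code. This assumes `sqs_client` is a boto3 SQS client (for example `boto3.client("sqs")`); the helper names, the queue URL, and the ARN in the usage note are hypothetical placeholders, not values from this article.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Return the Attributes dict that links a source queue to its DLQ."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the policy as a JSON string, not as a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


def apply_redrive_policy(sqs_client, source_queue_url: str, dlq_arn: str,
                         max_receive_count: int = 5) -> None:
    """Attach the policy to the SOURCE queue, never to the DLQ itself."""
    sqs_client.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )
```

A call might look like `apply_redrive_policy(boto3.client("sqs"), source_url, "arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)`. Reading the attribute back with `get_queue_attributes` is a reasonable way to confirm the link took effect.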


Step 2: Verify maxReceiveCount

This is the most misunderstood setting.

maxReceiveCount means:

“After this many receives, move the message to the DLQ.”

It does not mean:

  • Processing attempts
  • Lambda retries
  • Visibility timeout expirations

❌ Setting it to 1 and expecting instant DLQ
❌ Setting it too high and assuming DLQ is broken

✅ Typical values: 3–5

Action

  • Set maxReceiveCount deliberately
  • Align it with how many retries you actually want
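To make the counting rule concrete, here is a toy model of a message that fails every time it is delivered. It is a simplified illustration of the rule that a message moves once its receive count exceeds maxReceiveCount, not SQS code, and the function name is hypothetical.

```python
from typing import List


def trace_failing_message(max_receive_count: int) -> List[str]:
    """Trace a message that fails on every delivery until it is redriven."""
    events = []
    receive_count = 0
    while True:
        receive_count += 1
        if receive_count > max_receive_count:
            # The count now exceeds the threshold, so the message is moved
            # to the DLQ instead of being handed to a consumer again.
            events.append("moved to DLQ")
            return events
        events.append(f"receive {receive_count}: delivered, processing failed")


print(trace_failing_message(1))
# ['receive 1: delivered, processing failed', 'moved to DLQ']
```

Even with a value of 1, the message is delivered once and must fail and become visible again before it moves, which is why the DLQ is never instant.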

Step 3: Ensure Failed Messages Are NOT Deleted

In SQS, deletion is final.

If your consumer deletes a message — even after failure — SQS considers it successfully processed and will never redrive it.

Common traps:

  • finally { deleteMessage() }
  • Catching exceptions and returning success
  • Lambda handlers swallowing errors

❌ Delete on failure
✅ Let the message become visible again

Action

  • Audit delete logic carefully
  • Delete only after true success
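The delete-only-on-success rule can be sketched as a small polling function. It assumes `sqs_client` is a boto3 SQS client and `handler` is your own processing function that raises on failure; `drain_once` is a hypothetical name used for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def drain_once(sqs_client, queue_url: str, handler) -> list:
    """Receive one batch; delete only the messages that were processed."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
    )
    deleted_ids = []
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do NOT delete here. Leaving the message alone lets it become
            # visible again, so its receive count can reach maxReceiveCount.
            logger.exception("processing failed for %s", message["MessageId"])
            continue
        # Reached only after the handler returned without raising.
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        deleted_ids.append(message["MessageId"])
    return deleted_ids
```

The delete call sits after the `try` block rather than in a `finally`, which is the difference between this sketch and the first trap listed above.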

Step 4: Understand What Increments Receive Count

A receive is counted when:

  • A consumer's ReceiveMessage call returns the message
  • The visibility timeout for that delivery begins

It increments even if:

  • Processing never starts
  • Lambda times out
  • The consumer crashes

This means:

  • Visibility timeout + retries drive DLQ behavior
  • Fast failures can hit maxReceiveCount quickly

Action

  • Ensure visibility timeout matches processing time
  • Expect receive count to increase even on crashes
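Two small helpers can make the receive count visible while debugging. The first reads ApproximateReceiveCount from a `receive_message` response, which assumes the call requested that attribute (for example with `AttributeNames=["ApproximateReceiveCount"]`). The second is only a rough estimate under stated assumptions: the consumer polls continuously, never deletes the message, and never changes its visibility timeout. Both function names are hypothetical.

```python
def receive_counts(response: dict) -> dict:
    """Map MessageId to ApproximateReceiveCount for each received message."""
    return {
        message["MessageId"]: int(message["Attributes"]["ApproximateReceiveCount"])
        for message in response.get("Messages", [])
    }


def rough_seconds_until_dlq(max_receive_count: int, visibility_timeout_s: int) -> int:
    """Estimate the delay before an always-failing message is redriven.

    Each failed receive hides the message for one visibility timeout, so the
    message needs about max_receive_count timeouts before the next receive
    attempt pushes its count over the threshold.
    """
    return max_receive_count * visibility_timeout_s


print(rough_seconds_until_dlq(5, 30))  # 150
```

With a 30-second visibility timeout and a maxReceiveCount of 5, a poison message would take roughly two and a half minutes to appear in the DLQ, so an empty DLQ shortly after a failure is not necessarily a misconfiguration.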

Step 5: Validate Lambda-Specific Behavior

When using Lambda with SQS:

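❌ Returning success makes Lambda delete the entire batch, including messages your code failed on
❌ Raising an error without partial batch responses returns every message in the batch to the queue, including ones that succeeded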

✅ Enable partial batch failure handling (ReportBatchItemFailures)
✅ Let only failed messages retry

If Lambda never signals failure, the DLQ will never trigger.
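A minimal sketch of a handler that reports partial batch failures is shown here. It assumes ReportBatchItemFailures is enabled on the event source mapping; `process` is a hypothetical stand-in for real business logic, and the `order_id` field is invented for the example.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic: reject payloads without an order_id."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    """Report only the failed records so the rest of the batch is deleted."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Listing the messageId asks Lambda to leave this message on the
            # queue, so it can be retried and eventually reach the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list means the whole batch succeeded. If the response type is not enabled on the mapping, this return value is treated as a plain success and the failed messages are deleted along with the rest.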


Pro Tips

  • DLQs only work if messages are allowed to fail
  • Deleting a message is irreversible
  • maxReceiveCount counts receives, not errors
  • Test DLQs intentionally with bad payloads
  • Queue types must match: FIFO → FIFO, Standard → Standard
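The queue type rule in the last tip can be checked before deploying. This sketch relies on the fact that FIFO queue names, and therefore their ARNs, end in `.fifo`; the function name and the ARNs in the usage note are hypothetical.

```python
def queue_types_match(source_arn: str, dlq_arn: str) -> bool:
    """Return True when both queues are FIFO or both are Standard."""
    return source_arn.endswith(".fifo") == dlq_arn.endswith(".fifo")
```

For example, `queue_types_match("arn:aws:sqs:us-east-1:123456789012:orders.fifo", "arn:aws:sqs:us-east-1:123456789012:orders-dlq")` returns `False`, which flags a FIFO source paired with a Standard DLQ.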

Conclusion

When messages never reach a Dead-Letter Queue, SQS isn’t broken.

It’s doing exactly what it was told.

DLQs are precise tools. They only activate when:

  • A valid redrive policy exists
  • Permissions and encryption allow redrive
  • The receive count exceeds maxReceiveCount
  • Messages are not deleted prematurely

Once those conditions are aligned, DLQs work predictably — and failures become visible again.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
