AWS SQS Error: SQS Redrive Back to Source

Problem

You have a Dead-Letter Queue (DLQ) full of failed messages.
You identified the bug in your consumer code.
You used the SQS console to "Start DLQ Redrive" to move messages back to the source queue.

The DLQ drains to zero.
But minutes (or seconds) later, the DLQ fills up again with the exact same messages.

No processing succeeded.
No data was saved.
You just created a loop.


Clarifying the Issue

This is not a failure of the Redrive tool.
It is a timing and logic failure.

When you redrive a message:

  1. SQS moves it back to the Source Queue.
  2. It re-exposes the message to consumers immediately.
  3. If the consumer fails again, the message cycles back to the DLQ.

The Failure: If the consumer code is not actually fixed, or if the data itself is invalid ("poison"), the consumer will fail again. The message will increment its ReceiveCount until it hits the limit, and SQS will dutifully toss it back into the DLQ.


Why It Matters

Blindly redriving messages creates chaos:

  • Infinite Loops: You pay for every receive and every redrive cycle.
  • Log Noise: Error logs explode, hiding new errors.
  • False Confidence: You think the backlog is clearing, but it’s just churning.
  • Throttling: A massive redrive can overwhelm downstream databases.

Key Terms

  • Redrive: The process of moving messages from a DLQ back to a Source Queue.
  • Poison Message: A message that is malformed and will always cause a crash, no matter how many times it is retried.
  • Receive Count: The counter SQS uses to decide when to DLQ a message; the limit is the maxReceiveCount in the source queue's RedrivePolicy (see the sketch after this list).
  • Backoff: Deliberately adding delay (visibility timeout) to slow down processing.
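
To make Redrive and Receive Count concrete, here is a minimal sketch, using boto3 and hypothetical queue names, of how a DLQ is wired to a source queue. The maxReceiveCount in the RedrivePolicy is the threshold SQS checks before moving a message to the DLQ.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical URLs/ARNs -- substitute your own.
SOURCE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"

# Attach the DLQ: after 3 failed receives, SQS moves the message there.
sqs.set_queue_attributes(
    QueueUrl=SOURCE_QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": DLQ_ARN,
            "maxReceiveCount": "3",
        })
    },
)
```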

Steps at a Glance

  1. Stop! Do not redrive until the consumer is verified fixed.
  2. Analyze the DLQ payload first.
  3. Understand the Receive Count trap.
  4. Consider a Slow Redrive (throttling).
  5. Watch for Poison Messages.

Detailed Steps

Step 1: Fix the Consumer BEFORE Redriving

This sounds obvious, but it is the #1 cause of redrive loops.
Redriving does not fix bugs. It only retries execution.

The Trap:
Thinking "Maybe it was just a glitch/timeout" and redriving without checking logs.

Action:

  • Isolate one message from the DLQ (a sketch follows this list).
  • Test it locally against your consumer code.
  • Verify the fix has been deployed to the actual environment.
  • Only then start the redrive.
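
A minimal sketch of the "isolate one message" step, assuming boto3 and a hypothetical DLQ URL. The visibility timeout hides the message from other consumers while you test it locally; nothing is deleted.

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical DLQ URL -- replace with yours.
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"

# Fetch one message and hide it for 5 minutes while you test.
resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=1,
    VisibilityTimeout=300,
    AttributeNames=["ApproximateReceiveCount"],
)

for msg in resp.get("Messages", []):
    print("ReceiveCount:", msg["Attributes"]["ApproximateReceiveCount"])
    print("Body:", msg["Body"])
    # Run msg["Body"] through your patched consumer logic locally
    # before you even think about clicking "Start DLQ Redrive".
```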

Step 2: Analyze the Payload (Is it Poison?)

Sometimes the code is fine, but the message is bad.

  • Missing JSON fields?
  • Negative ID numbers?
  • Strings where integers should be?

If the message is "poison," redriving it 100 times will result in 100 failures.

Action:

  • Inspect the Body of several DLQ messages (see the sketch after this list).
  • If the data is invalid, do not redrive.
  • Delete them, or move them to a separate "Investigation" queue for manual repair.
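
Here is one way to triage the payloads, as a sketch: pull a handful of DLQ messages and check them against the fields your consumer expects. The queue URL and the REQUIRED_FIELDS set are assumptions; adjust them to your schema.

```python
import json
import boto3

sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"  # hypothetical
REQUIRED_FIELDS = {"order_id", "customer_id", "amount"}                  # assumed schema

resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=10,
    VisibilityTimeout=60,
)

for msg in resp.get("Messages", []):
    try:
        payload = json.loads(msg["Body"])
    except json.JSONDecodeError:
        print(f"POISON (not JSON): {msg['MessageId']}")
        continue

    if not isinstance(payload, dict) or REQUIRED_FIELDS - payload.keys():
        print(f"POISON (missing fields): {msg['MessageId']}")
    else:
        print(f"Looks valid: {msg['MessageId']}")
```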

Step 3: Understand the Receive Count Trap

Do not assume redriving gives you a "clean slate."
When messages are redriven, they are reintroduced into the source queue, but DLQ conditions may still be met almost immediately.

In practice:

  • Messages are available to consumers instantly.
  • If the underlying error persists, the message will fail, increment ReceiveCount, and hit the limit again very quickly.

This happens fast. If you redrive 10,000 messages with a maxReceiveCount of 3 and your consumers are broken, you will generate roughly 30,000 failed invocations in minutes.
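
To see how tight the loop is, you can compare the source queue's maxReceiveCount with the ApproximateReceiveCount on the messages themselves. A sketch, again with a hypothetical queue URL:

```python
import json
import boto3

sqs = boto3.client("sqs")

SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

# How many failed receives before SQS sends a message (back) to the DLQ?
attrs = sqs.get_queue_attributes(
    QueueUrl=SOURCE_URL,
    AttributeNames=["RedrivePolicy"],
)["Attributes"]
policy = json.loads(attrs["RedrivePolicy"])
print("maxReceiveCount:", policy["maxReceiveCount"])

# Peek at in-flight messages to see how close each is to the limit.
# Careful: this peek itself counts as a receive for these messages.
resp = sqs.receive_message(
    QueueUrl=SOURCE_URL,
    MaxNumberOfMessages=10,
    AttributeNames=["ApproximateReceiveCount"],
)
for msg in resp.get("Messages", []):
    print(msg["MessageId"], "receives so far:",
          msg["Attributes"]["ApproximateReceiveCount"])
```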


Step 4: Redrive at a Safe Pace

If you have 100,000 failed messages, dumping them all instantly onto the Source Queue can DDoS your own database.

The Fix:

  • Velocity Control: Don't just click "Redrive All" if your downstream systems are fragile.
  • Custom Redrive: For massive volumes, consider writing a script to move messages in small batches (e.g., 50 at a time); see the sketch after this list.
  • Console Speed: The AWS Console redrive is fast, so be sure your SQL databases or APIs can handle the sudden spike.
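
A custom redrive can be as simple as the following sketch: pull small batches from the DLQ, re-send them to the source queue, delete them from the DLQ, and pause between batches. The queue URLs and pacing values are assumptions; tune them to what your downstream systems can absorb.

```python
import time
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URLs -- replace with yours.
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"
SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"

BATCH_SIZE = 10       # SQS caps receive/send batches at 10 messages
PAUSE_SECONDS = 2     # breathing room for downstream systems

while True:
    resp = sqs.receive_message(
        QueueUrl=DLQ_URL,
        MaxNumberOfMessages=BATCH_SIZE,
        VisibilityTimeout=120,
    )
    messages = resp.get("Messages", [])
    if not messages:
        print("DLQ drained.")
        break

    # Re-send the batch to the source queue...
    # (FIFO queues would also need MessageGroupId / deduplication handling.)
    sqs.send_message_batch(
        QueueUrl=SOURCE_URL,
        Entries=[{"Id": m["MessageId"], "MessageBody": m["Body"]} for m in messages],
    )
    # ...then remove it from the DLQ.
    sqs.delete_message_batch(
        QueueUrl=DLQ_URL,
        Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                 for m in messages],
    )

    print(f"Moved {len(messages)} messages; sleeping {PAUSE_SECONDS}s...")
    time.sleep(PAUSE_SECONDS)
```

If you would rather not script it, the managed DLQ redrive API (StartMessageMoveTask) also accepts a MaxNumberOfMessagesPerSecond limit to slow the move down.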

Step 5: Validate the "Clean" State

After redriving, do not close the tab. Watch two metrics:

  1. Source Queue Depth: Should go UP, then DOWN.
  2. DLQ Depth: Should go to ZERO and stay there.

❌ If Source goes down and DLQ goes back up: Stop consumers immediately. You are burning money on a loop.
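
One way to watch both numbers is to poll the queue depth attributes in a loop, as sketched below with hypothetical queue URLs (a CloudWatch dashboard works just as well).

```python
import time
import boto3

sqs = boto3.client("sqs")

SOURCE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"    # hypothetical
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq"   # hypothetical

def depth(queue_url: str) -> int:
    """Approximate number of visible messages in the queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    return int(attrs["ApproximateNumberOfMessages"])

# Poll every 30 seconds. Source should rise then fall; DLQ should stay at zero.
while True:
    print(f"source={depth(SOURCE_URL)}  dlq={depth(DLQ_URL)}")
    time.sleep(30)
```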


Pro Tips

  • One-Message Test: Redrive a single message first to prove the fix works.
  • Dead ends are dead: If a message is malformed (poison), delete it. Redriving won't fix bad JSON.
  • Logs are truth: Watch your application logs during the redrive, not just the queue depth.
  • Disable the Source Trigger (Optional): For massive redrives, you might want to disable the Lambda trigger, load the messages, and then re-enable it to control the flow (a sketch follows this list).
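
If your consumer is a Lambda function, pausing the trigger means disabling its SQS event source mapping. A sketch, assuming a hypothetical function name; the mapping UUID comes from list_event_source_mappings.

```python
import boto3

lambda_client = boto3.client("lambda")

FUNCTION_NAME = "process-orders"  # hypothetical consumer function

# Find the event source mapping(s) attached to the function.
mappings = lambda_client.list_event_source_mappings(
    FunctionName=FUNCTION_NAME
)["EventSourceMappings"]

for m in mappings:
    # Pause the trigger before a massive redrive...
    lambda_client.update_event_source_mapping(UUID=m["UUID"], Enabled=False)
    print(f"Disabled mapping {m['UUID']} ({m['EventSourceArn']})")

# ...redrive, let the source queue fill, then re-enable with Enabled=True.
```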

Conclusion

Redriving is not a "Retry" button—it is a "Reload" button.
It reloads the queue and fires again.

If you haven't changed the target (fixed the code) or the ammo (fixed the data), you will miss again.

  • Fix the code.
  • Check the data.
  • Then redrive.

Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
