AWS Lambda Error – Lambda Invoked Too Many Times (Retry Storm Diagnostic Guide)

A practical diagnostic guide for resolving Lambda retry storms, where a persistent failure causes AWS services to repeatedly invoke your function—sometimes thousands of times—until systems fail, costs spike, or downstream components collapse.





Problem

Your Lambda is being invoked far more than expected. You may see:

  • Exploding CloudWatch invocation counts
  • Surging costs
  • Thousands of failed attempts in minutes
  • Event source backlog warnings
  • High iterator age (DynamoDB Streams)

You might not see a specific Lambda error, but the pattern is unmistakable:

Retry Storm.
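
To confirm the scale of the spike, pull the raw invocation count from CloudWatch. A minimal sketch (the time window is a placeholder you should adjust):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=my-fn \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 300 \
  --statistics Sum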

Common causes include:

  • A bug causing every invocation to fail
  • Misconfigured retry policies
  • SQS without a Redrive Policy (infinite retries)
  • DynamoDB Streams shard blockage
  • SNS or EventBridge repeatedly invoking Lambda because deliveries keep failing

Clarifying the Issue

Each AWS service has a different retry behavior, and this determines how retry storms form:

  • SQS (poll-based): Lambda polls SQS; failures return the message to the queue → infinite retries unless Redrive Policy + DLQ is configured
  • DynamoDB Streams: A single bad record blocks an entire shard → iterator age climbs
  • Kinesis: Similar to DynamoDB; one poisoned record freezes the shard
  • SNS: Retries for hours with exponential backoff
  • EventBridge: Retries for 24 hours unless DLQ/event bus target is configured
  • Async Lambda sources (S3, SES, SNS, and EventBridge when they invoke Lambda asynchronously): Lambda retries twice, then sends the event to a DLQ if one is configured

Retry storms occur when one message or event repeatedly fails, and the event source keeps trying again.
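
For asynchronous sources, it can help to check whether the function already has retry limits or an on-failure destination configured (if nothing has ever been set, the call returns an error, which simply means the defaults of two retries and no destination apply):

aws lambda get-function-event-invoke-config --function-name my-fn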


Why It Matters

Retry storms cause:

  • Runaway AWS costs
  • Stuck pipelines
  • Lost or duplicated messages
  • Cascading failures across microservices
  • Very noisy on-call incidents
  • Hot partitions in DynamoDB Streams
  • SLAs being blown apart

Stopping the storm quickly is critical.


Key Terms

  • Retry Storm – A feedback loop where an event source keeps re-invoking a failing Lambda.
  • Redrive Policy – For SQS, defines when messages move to a DLQ.
  • DLQ (Dead Letter Queue) – Stores failed events so processing can continue.
  • Iterator Age – For DynamoDB Streams/Kinesis, measures how far behind a consumer is.
  • Poison Pill – A message that can never succeed, causing infinite retries.

Steps at a Glance

  1. Identify the event source that is triggering retries
  2. Inspect CloudWatch logs for actual error messages
  3. Pause or isolate the failing event source
  4. Apply correct retry control based on service type
  5. For SQS: configure a Redrive Policy, not a Lambda DLQ
  6. For DynamoDB Streams: handle the blocking record
  7. For SNS/EventBridge: ensure proper DLQs or fallback targets
  8. Fix the underlying code error
  9. Re-enable event sources safely

Detailed Steps

Step 1: Identify the event source.

For poll-based sources, this works:

aws lambda list-event-source-mappings --function-name my-fn

This reveals:

  • SQS
  • DynamoDB Streams
  • Kinesis

However, this command does not show push-based sources such as SNS, EventBridge, S3, SES, or API Gateway.

To discover push-based sources, use:

aws lambda get-policy --function-name my-fn

Look for:

  • SNS topic ARNs
  • EventBridge rules
  • S3 bucket notifications

Or check the Triggers tab in the Lambda console.

Now that you know the source, investigate the failure.


Step 2: Inspect CloudWatch logs.

Tail logs:

aws logs tail /aws/lambda/my-fn --since 5m

Look for:

  • The persistent error
  • Whether all messages fail identically
  • Whether failures come in bursts (e.g., SNS retry pattern)
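
If the log volume is overwhelming, you can narrow the tail to error lines; this sketch assumes your handler logs the word ERROR:

aws logs tail /aws/lambda/my-fn --since 1h --filter-pattern ERROR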

Now pause the storm.


Step 3: Pause or isolate the malfunctioning event source.

Different services require different strategies:

  • SQS: Disable the event source mapping:
  aws lambda update-event-source-mapping \
    --uuid <uuid> \
    --enabled false
  • DynamoDB Streams / Kinesis: Disable the mapping the same way as SQS.
  • SNS: DO NOT run aws sns unsubscribe unless you are prepared to lose the subscription's configuration. Instead, disable the Lambda trigger from the console or pause delivery via subscription attributes (for example, a filter policy that matches no messages).
  • EventBridge: Disable the rule temporarily, as shown below.
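
Disabling an EventBridge rule is a single call (the rule name is a placeholder; add --event-bus-name if the rule is not on the default bus):

aws events disable-rule --name my-rule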

With the storm paused, apply the proper remediation.


Step 4: Apply correct retry control (service-specific).

Retry behavior varies dramatically.

Asynchronous Lambda Sources (S3, SES, EventBridge async, SNS async)

Lambda retries the event twice, then sends it to the function's DLQ if one is configured:

aws lambda update-function-configuration \
  --function-name my-fn \
  --dead-letter-config TargetArn=arn:aws:sqs:us-east-1:123:my-dlq
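
You can also cap how many times Lambda retries asynchronous events, or route failures to an on-failure destination, with an event invoke config. A minimal sketch, with the queue ARN as a placeholder:

aws lambda put-function-event-invoke-config \
  --function-name my-fn \
  --maximum-retry-attempts 0 \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123:my-dlq"}}'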

SNS (Push model)

SNS itself retries for hours unless the subscription has a DLQ configured. Set the subscription DLQ via:

aws sns set-subscription-attributes \
  --subscription-arn <arn> \
  --attribute-name RedrivePolicy \
  --attribute-value '{"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-dlq"}'

EventBridge

Add a DLQ or secondary event bus target.
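
From the CLI, that means updating the rule's target with a dead-letter queue and, optionally, a tighter retry policy. A sketch, with the rule name and ARNs as placeholders; reuse the existing target's Id so this updates it rather than adding a second target:

aws events put-targets \
  --rule my-rule \
  --targets '[{"Id":"my-fn-target","Arn":"arn:aws:lambda:us-east-1:123:function:my-fn","DeadLetterConfig":{"Arn":"arn:aws:sqs:us-east-1:123:my-dlq"},"RetryPolicy":{"MaximumRetryAttempts":4,"MaximumEventAgeInSeconds":3600}}]'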


Step 5: For SQS → Lambda, configure the Redrive Policy (NOT the Lambda DLQ).

This is the critical correction.

Setting a DLQ on the Lambda does nothing for SQS triggers.

You must configure the DLQ on the queue itself:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123:my-dlq\",\"maxReceiveCount\":\"5\"}"}'
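
To confirm the policy is attached, read it back from the queue:

aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --attribute-names RedrivePolicy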

This is the only way to stop infinite retries.


Step 6: For DynamoDB Streams and Kinesis, handle the blocking record.

A single poison pill will:

  • Stop the shard
  • Freeze all records behind it
  • Increase iterator age
  • Cause a severe backlog

You must:

  • Log the failing record
  • Patch the code to handle malformed items
  • Or manually move/delete/fix the bad item upstream
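
For stream sources, the event source mapping itself supports retry limits, batch bisection, and an on-failure destination, so a poison pill is eventually discarded or routed aside instead of blocking the shard indefinitely. A sketch, with the UUID and ARN as placeholders:

aws lambda update-event-source-mapping \
  --uuid <uuid> \
  --maximum-retry-attempts 2 \
  --bisect-batch-on-function-error \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123:my-dlq"}}'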

Then re-enable the event source mapping.


Step 7: Fix your Lambda code’s underlying error.

Now that storms are paused and retries controlled:

  • Add validation
  • Improve error messages
  • Add guard clauses
  • Handle missing fields
  • Add try/catch around risky operations

Once fixed, redeploy.


Step 8: Re-enable event sources.

For SQS/DynamoDB/Kinesis:

aws lambda update-event-source-mapping \
  --uuid <uuid> \
  --enabled true

For SNS/EventBridge:
Enable the subscription or rule again.
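
After re-enabling, watch the function live for a few minutes to confirm the failures are gone:

aws logs tail /aws/lambda/my-fn --follow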

Storm is resolved.


Pro Tips

  • Configure DLQs before production.
  • SQS with no Redrive Policy is a ticking time bomb.
  • Add log statements to distinguish “retryable” vs “terminal” errors.
  • For high-throughput systems, monitor:

    • Iterator Age
    • SQS ApproximateReceiveCount
    • SNS NumberOfNotificationsFailed (delivery failures)
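
A CloudWatch alarm on iterator age makes a simple early-warning signal. A sketch, with the alarm name, threshold (one minute, in milliseconds), and SNS topic ARN as placeholders to tune:

aws cloudwatch put-metric-alarm \
  --alarm-name my-fn-iterator-age \
  --namespace AWS/Lambda \
  --metric-name IteratorAge \
  --dimensions Name=FunctionName,Value=my-fn \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 60000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123:alerts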

Conclusion

Lambda retry storms are dangerous because they are silent, rapid, and expensive.

But with proper identification of the event source, correct use of DLQs vs Redrive Policies, safe SNS/EventBridge handling, and awareness of shard blockage behavior, you can restore stability quickly and prevent future cascades.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
