AWS Lambda Error – Lambda Invoked Too Many Times (Retry Storm Diagnostic Guide)

A practical diagnostic guide for resolving Lambda retry storms, where a persistent failure causes AWS services to repeatedly invoke your function—sometimes thousands of times—until systems fail, costs spike, or downstream components collapse.





Problem

Your Lambda is being invoked far more than expected. You may see:

  • Exploding CloudWatch invocation counts
  • Surging costs
  • Thousands of failed attempts in minutes
  • Event source backlog warnings
  • High iterator age (DynamoDB Streams)

You might not see a specific Lambda error, but the pattern is unmistakable:

Retry Storm.
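
To confirm the scale of the spike, pull the raw invocation count from CloudWatch. A minimal sketch (the time window is a placeholder you should adjust):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=my-fn \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 300 \
  --statistics Sum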

Common causes include:

  • A bug causing every invocation to fail
  • Misconfigured retry policies
  • SQS without a Redrive Policy (infinite retries)
  • DynamoDB Streams shard blockage
  • SNS or EventBridge repeatedly invoking Lambda because deliveries keep failing

Clarifying the Issue

Each AWS service has a different retry behavior, and this determines how retry storms form:

  • SQS (poll-based): Lambda polls SQS; failures return the message to the queue → infinite retries unless Redrive Policy + DLQ is configured
  • DynamoDB Streams: A single bad record blocks an entire shard → iterator age climbs
  • Kinesis: Similar to DynamoDB; one poisoned record freezes the shard
  • SNS: Retries for hours with exponential backoff
  • EventBridge: Retries for 24 hours unless DLQ/event bus target is configured
  • Async Lambda sources (S3, SES, SNS, and EventBridge when they invoke Lambda asynchronously): Lambda retries twice, then sends the event to a DLQ if one is configured

Retry storms occur when one message or event repeatedly fails, and the event source keeps trying again.
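
For asynchronous sources, it can help to check whether the function already has retry limits or an on-failure destination configured (if nothing has ever been set, the call returns an error, which simply means the defaults of two retries and no destination apply):

aws lambda get-function-event-invoke-config --function-name my-fn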


Why It Matters

Retry storms cause:

  • Runaway AWS costs
  • Stuck pipelines
  • Lost or duplicated messages
  • Cascading failures across microservices
  • Very noisy on-call incidents
  • Hot partitions in DynamoDB Streams
  • SLAs being blown apart

Stopping the storm quickly is critical.


Key Terms

  • Retry Storm – A feedback loop where an event source keeps re-invoking a failing Lambda.
  • Redrive Policy – For SQS, defines when messages move to a DLQ.
  • DLQ (Dead Letter Queue) – Stores failed events so processing can continue.
  • Iterator Age – For DynamoDB Streams/Kinesis, measures how far behind a consumer is.
  • Poison Pill – A message that can never succeed, causing infinite retries.

Steps at a Glance

  1. Identify the event source that is triggering retries
  2. Inspect CloudWatch logs for actual error messages
  3. Pause or isolate the failing event source
  4. Apply correct retry control based on service type
  5. For SQS: configure a Redrive Policy, not a Lambda DLQ
  6. For DynamoDB Streams: handle the blocking record
  7. For SNS/EventBridge: ensure proper DLQs or fallback targets
  8. Fix the underlying code error
  9. Re-enable event sources safely

Detailed Steps

Step 1: Identify the event source.

For poll-based sources, this works:

aws lambda list-event-source-mappings --function-name my-fn

This reveals:

  • SQS
  • DynamoDB Streams
  • Kinesis

However, this command does not show push-based sources such as SNS, EventBridge, S3, SES, or API Gateway.

To discover push-based sources, use:

aws lambda get-policy --function-name my-fn

Look for:

  • SNS topic ARNs
  • EventBridge rules
  • S3 bucket notifications

Or check the Triggers tab in the Lambda console.

Now that you know the source, investigate the failure.


Step 2: Inspect CloudWatch logs.

Tail logs:

aws logs tail /aws/lambda/my-fn --since 5m

Look for:

  • The persistent error
  • Whether all messages fail identically
  • Whether failures come in bursts (e.g., SNS retry pattern)
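
If the log volume is overwhelming, you can narrow the tail to error lines; this sketch assumes your handler logs the word ERROR:

aws logs tail /aws/lambda/my-fn --since 1h --filter-pattern ERROR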

Now pause the storm.


Step 3: Pause or isolate the malfunctioning event source.

Different services require different strategies:

  • SQS: Disable the event source mapping:
  aws lambda update-event-source-mapping \
    --uuid <uuid> \
    --enabled false
  • DynamoDB Streams / Kinesis: Disable the mapping the same way as SQS.
  • SNS: DO NOT run aws sns unsubscribe unless you are prepared to lose the subscription's configuration. Instead, disable the Lambda trigger from the console or pause delivery via subscription attributes (for example, a filter policy that matches no messages).
  • EventBridge: Disable the rule temporarily, as shown below.
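
Disabling an EventBridge rule is a single call (the rule name is a placeholder; add --event-bus-name if the rule is not on the default bus):

aws events disable-rule --name my-rule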

With the storm paused, apply the proper remediation.


Step 4: Apply correct retry control (service-specific).

Retry behavior varies dramatically.

Asynchronous Lambda Sources (S3, SES, EventBridge async, SNS async)

Lambda retries the event twice, then sends it to the function's DLQ if one is configured:

aws lambda update-function-configuration \
  --function-name my-fn \
  --dead-letter-config TargetArn=arn:aws:sqs:us-east-1:123:my-dlq
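
You can also cap how many times Lambda retries asynchronous events, or route failures to an on-failure destination, with an event invoke config. A minimal sketch, with the queue ARN as a placeholder:

aws lambda put-function-event-invoke-config \
  --function-name my-fn \
  --maximum-retry-attempts 0 \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123:my-dlq"}}'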

SNS (Push model)

SNS itself retries for hours unless the subscription has a DLQ configured. Set the subscription DLQ via:

aws sns set-subscription-attributes \
  --subscription-arn <arn> \
  --attribute-name RedrivePolicy \
  --attribute-value '{"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-dlq"}'

EventBridge

Add a DLQ or secondary event bus target.
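
From the CLI, that means updating the rule's target with a dead-letter queue and, optionally, a tighter retry policy. A sketch, with the rule name and ARNs as placeholders; reuse the existing target's Id so this updates it rather than adding a second target:

aws events put-targets \
  --rule my-rule \
  --targets '[{"Id":"my-fn-target","Arn":"arn:aws:lambda:us-east-1:123:function:my-fn","DeadLetterConfig":{"Arn":"arn:aws:sqs:us-east-1:123:my-dlq"},"RetryPolicy":{"MaximumRetryAttempts":4,"MaximumEventAgeInSeconds":3600}}]'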


Step 5: For SQS → Lambda, configure the Redrive Policy (NOT the Lambda DLQ).

This is the critical correction.

Setting a DLQ on the Lambda does nothing for SQS triggers.

You must configure the DLQ on the queue itself:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123:my-dlq\",\"maxReceiveCount\":\"5\"}"}'
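
To confirm the policy is attached, read it back from the queue:

aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --attribute-names RedrivePolicy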

This is the only way to stop infinite retries.


Step 6: For DynamoDB Streams and Kinesis, handle the blocking record.

A single poison pill will:

  • Stop the shard
  • Freeze all records behind it
  • Increase iterator age
  • Cause a severe backlog

You must:

  • Log the failing record
  • Patch the code to handle malformed items
  • Or manually move/delete/fix the bad item upstream
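
For stream sources, the event source mapping itself supports retry limits, batch bisection, and an on-failure destination, so a poison pill is eventually discarded or routed aside instead of blocking the shard indefinitely. A sketch, with the UUID and ARN as placeholders:

aws lambda update-event-source-mapping \
  --uuid <uuid> \
  --maximum-retry-attempts 2 \
  --bisect-batch-on-function-error \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123:my-dlq"}}'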

Then re-enable the event source mapping.


Step 7: Fix your Lambda code’s underlying error.

Now that storms are paused and retries controlled:

  • Add validation
  • Improve error messages
  • Add guard clauses
  • Handle missing fields
  • Add try/catch around risky operations

Once fixed, redeploy.


Step 8: Re-enable event sources.

For SQS/DynamoDB/Kinesis:

aws lambda update-event-source-mapping \
  --uuid <uuid> \
  --enabled true

For SNS/EventBridge:
Enable the subscription or rule again.
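
After re-enabling, watch the function live for a few minutes to confirm the failures are gone:

aws logs tail /aws/lambda/my-fn --follow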

Storm is resolved.


Pro Tips

  • Configure DLQs before production.
  • SQS with no Redrive Policy is a ticking time bomb.
  • Add log statements to distinguish “retryable” vs “terminal” errors.
  • For high-throughput systems, monitor:

    • Iterator Age
    • SQS ApproximateReceiveCount
    • SNS NumberOfNotificationsFailed (delivery failures)
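
A CloudWatch alarm on iterator age makes a simple early-warning signal. A sketch, with the alarm name, threshold (one minute, in milliseconds), and SNS topic ARN as placeholders to tune:

aws cloudwatch put-metric-alarm \
  --alarm-name my-fn-iterator-age \
  --namespace AWS/Lambda \
  --metric-name IteratorAge \
  --dimensions Name=FunctionName,Value=my-fn \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 60000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123:alerts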

Conclusion

Lambda retry storms are dangerous because they are silent, rapid, and expensive.

But with proper identification of the event source, correct use of DLQs vs Redrive Policies, safe SNS/EventBridge handling, and awareness of shard blockage behavior, you can restore stability quickly and prevent future cascades.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
