AWS Lambda: “The Silent Retry” — When Lambda Quietly Runs Twice (or Thrice)
A silent retry is Lambda’s way of asking, “Are you sure you finished?”
Problem
Everything looks normal.
Your Lambda reports success, CloudWatch shows no alarms, and the system seems healthy.
Then you notice:
- Two database inserts where there should be one.
- A user notified twice.
- A payment processed again.
No errors. No warnings. Just… duplicates.
Welcome to The Silent Retry — when AWS Lambda automatically re-invokes your function after an asynchronous failure or timeout you never saw.
Lambda isn’t broken. It’s doing exactly what it was designed to do.
You just didn’t realize it was doing it again.
Clarifying the Issue
By default, when Lambda is invoked asynchronously (for example, from S3, SNS, or EventBridge), AWS manages the retry logic for you — up to two more times if the first attempt fails or times out.
The catch: if your function times out or the runtime crashes partway through its work, Lambda marks the attempt as failed, even though some side-effects (a database write, an email) may already have happened. It can't roll those back, and it can't tell how far you got.
So it retries. Silently.
That’s how you end up with duplicate side-effects — two records written, two emails sent, or the same event processed twice.
Root causes include:
- Functions that time out mid-work.
- Asynchronous invocations without idempotent design.
- Missing or misconfigured DLQs or destinations.
- Poor observability — no correlation IDs, so retries look like new events.
AWS gives you three silent retry loops to think about:
- Service-level retries (SNS, S3, EventBridge, etc.)
- Lambda’s own async retry logic
- SDK / client retries inside your code
These can stack on top of each other, multiplying quietly.
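Lambda's own layer of that stack is configurable per function. Here is a minimal boto3 sketch (assuming the ProcessOrders function used throughout this article) that turns Lambda's async retries off entirely and leans on a failure destination instead:

import boto3

lam = boto3.client("lambda")

# Cap Lambda's own async retry layer. The defaults are 2 retries and a
# maximum event age of 6 hours; here retries are disabled so every
# failure flows straight to the OnFailure destination (see Step 4).
lam.put_function_event_invoke_config(
    FunctionName="ProcessOrders",   # assumed function name
    MaximumRetryAttempts=0,         # allowed values: 0, 1, or 2
    MaximumEventAgeInSeconds=3600,  # discard events older than 1 hour
)

Note this only tames the middle loop: the event source and any SDK clients inside your code still retry on their own schedules.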
Why It Matters
Silent retries break the contract of at-most-once execution.
You get at-least-once — which sounds harmless until your side-effects aren’t idempotent.
That means:
- Payments may double-charge.
- Database inserts may duplicate.
- Notifications may annoy your customers.
Even worse, retries appear hours later as AWS drains the retry queue, so the symptoms look random and unrelated.
This is one of the most expensive bugs to find after deployment, because by then, the damage is already in production data.
Key Terms
- Idempotency: The property of an operation that can run multiple times without changing the result.
- Asynchronous Invocation: Invocation mode where AWS queues the request and retries automatically.
- Dead Letter Queue (DLQ): An SQS queue or SNS topic where failed async events are sent after retries are exhausted.
- Event Destinations: Modern replacement for DLQs that captures both success and failure outcomes.
- Correlation ID: A unique event identifier used to detect duplicate processing across retries.
Steps at a Glance
- Detect retry patterns in CloudWatch metrics and logs.
- Add correlation IDs to track event lineage.
- Make all side-effects idempotent.
- Configure DLQs or destinations for async sources.
- Review and tune timeout settings.
- Monitor for delayed or batch retries.
Detailed Steps
Step 1: Detect retry patterns
Run this AWS CLI command to see how many times the function was invoked, minute by minute:
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=ProcessOrders \
--start-time $(date -u -d '15 minutes ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--statistics Sum \
--period 60
Compare this with:
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=ProcessOrders \
--start-time $(date -u -d '15 minutes ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--statistics Sum \
--period 60
If Invocations climb faster than the number of events your sources are actually emitting, the same events are being replayed. Flat Errors alongside rising Invocations points at service-level redelivery; matching error spikes point at Lambda's own retry layer.
Look for bursts of duplicate invocations clustered at the same timestamps.
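If you'd rather script that comparison than eyeball two CLI outputs, here is a rough boto3 sketch that pulls both metrics for the same window (ProcessOrders and the 15-minute window mirror the commands above):

from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=15)

def sum_by_minute(metric_name):
    # Per-minute Sum for one AWS/Lambda metric on a single function.
    resp = cw.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": "ProcessOrders"}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return {dp["Timestamp"]: dp["Sum"] for dp in resp["Datapoints"]}

invocations = sum_by_minute("Invocations")
errors = sum_by_minute("Errors")

# Print the two series side by side; minutes where invocations spike
# without matching errors are candidates for service-level redelivery.
for ts in sorted(invocations):
    print(ts.isoformat(), "invocations:", invocations[ts],
          "errors:", errors.get(ts, 0.0))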
Step 2: Add correlation IDs
Include a unique event ID in every log line and side-effect:
import uuid

def handler(event, context):
    # Prefer an ID carried in the event itself (EventBridge events have one).
    # A retry replays the same payload, so an event-supplied ID survives
    # across attempts; a freshly generated UUID is only a fallback and
    # will differ on every attempt.
    correlation_id = event.get("id") or str(uuid.uuid4())
    print(f"correlation_id={correlation_id} Processing order event")
    process_order(event, correlation_id)  # your business logic
Then search logs for the same correlation ID appearing multiple times — that’s a retry.
Step 3: Make side-effects idempotent
If your Lambda writes to DynamoDB, S3, or sends notifications, make sure duplicate processing doesn’t break anything.
Example with DynamoDB conditional write:
table.put_item(
Item=record,
ConditionExpression="attribute_not_exists(OrderId)"
)
If a retry replays the same OrderId, the write is rejected with a ConditionalCheckFailedException instead of creating a duplicate record.
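Catch that exception and treat the replay as a clean no-op. A sketch, assuming an Orders table keyed on OrderId:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")  # assumed table name

def save_order_once(record):
    # Write the order exactly once; a replayed OrderId becomes a no-op.
    try:
        table.put_item(
            Item=record,
            ConditionExpression="attribute_not_exists(OrderId)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # a retry replayed an OrderId we already stored
        raise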
Step 4: Configure DLQs or destinations
Never let retries vanish into the void.
Configure a DLQ (the legacy mechanism) or an on-failure event destination (recommended). Destinations are set with put-function-event-invoke-config:
aws lambda put-function-event-invoke-config \
--function-name ProcessOrders \
--destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123456789012:LambdaFailures"}}'
For a legacy DLQ instead, use update-function-configuration:
aws lambda update-function-configuration \
--function-name ProcessOrders \
--dead-letter-config TargetArn=arn:aws:sqs:us-east-1:123456789012:LambdaFailures
This ensures failed or exhausted events are captured for inspection.
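Once failed events start landing in that queue, pull a few and look at what actually died. A quick boto3 sketch, assuming the LambdaFailures queue from the command above:

import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="LambdaFailures")["QueueUrl"]

# Peek at failed async invocations without deleting them from the queue.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,
)
for msg in resp.get("Messages", []):
    body = json.loads(msg["Body"])
    # Destination records carry the original event plus context about why
    # retries were exhausted. (Legacy DLQs instead put the raw event in
    # the body, with error details in message attributes.)
    print(body.get("requestContext", {}).get("condition"),
          json.dumps(body.get("requestPayload")))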
Step 5: Review timeout settings
If your function is being retried silently, check whether it’s timing out.
A function that hits its timeout is killed mid-work: the side-effects it already committed look like success, but Lambda records the attempt as a failure and schedules a retry.
aws lambda get-function-configuration \
--function-name ProcessOrders \
--query "Timeout"
If invocation Duration regularly runs close to that limit, raise the timeout or break the work into smaller pieces.
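You can also make the headroom visible from inside the function itself; the Python runtime's context object exposes get_remaining_time_in_millis(). A sketch, where do_work is a stand-in for your real logic:

def handler(event, context):
    do_work(event)  # placeholder for your existing logic

    # How close did this run come to the configured timeout?
    remaining_ms = context.get_remaining_time_in_millis()
    print(f"remaining_ms={remaining_ms}")
    if remaining_ms < 1000:
        # Under ~1s of headroom: the next slow dependency call turns
        # this run into a timeout, and a timeout into a silent retry.
        print("WARNING: finished within 1s of timeout")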
Step 6: Monitor for delayed retries
Retries for asynchronous Lambda can happen minutes or hours later.
Use CloudWatch Logs Insights:
fields @timestamp, @message
| filter @message like "correlation_id="
| parse @message "correlation_id=* " as correlation_id
| stats count(*) as attempts, min(@timestamp) as first_seen, max(@timestamp) as last_seen by correlation_id
| filter attempts > 1
| sort last_seen desc
Any correlation ID with more than one attempt is a retry; a large gap between first_seen and last_seen is a delayed one.
Pro Tip #1: Silence Isn’t Success
A retry queue that never drains is the quietest failure you’ll ever have.
Measure “retries attempted” as carefully as “errors thrown.”
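One concrete way to do that: alarm on the AWS/Lambda DeadLetterErrors metric, which counts events Lambda tried and failed to hand to your DLQ. A sketch, with an assumed SNS alerting topic:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when Lambda cannot deliver exhausted events to the DLQ;
# those events are lost unless something notices.
cloudwatch.put_metric_alarm(
    AlarmName="ProcessOrders-DeadLetterErrors",
    Namespace="AWS/Lambda",
    MetricName="DeadLetterErrors",
    Dimensions=[{"Name": "FunctionName", "Value": "ProcessOrders"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:OnCallAlerts"],  # assumed topic
)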
Pro Tip #2: Fail Once, Not Twice
If you must fail, fail loudly the first time.
Retries are for recovery, not repetition.
Conclusion
A silent retry is Lambda’s way of asking, “Are you sure you finished?”
If you don’t answer clearly — it will assume you didn’t.
By making your functions idempotent, tracking correlation IDs, and surfacing retry metrics, you turn invisible replays into visible signals.
In serverless systems, no news isn’t good news — it’s déjà vu.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.