AWS Lambda: “The Silent Retry” — When Lambda Quietly Runs Twice (or Thrice)
A silent retry is Lambda’s way of asking, “Are you sure you finished?”
Problem
Everything looks normal.
Your Lambda reports success, CloudWatch shows no alarms, and the system seems healthy.
Then you notice:
- Two database inserts where there should be one.
- A user notified twice.
- A payment processed again.
No errors. No warnings. Just… duplicates.
Welcome to The Silent Retry — when AWS Lambda automatically re-invokes your function after an asynchronous failure or timeout you never saw.
Lambda isn’t broken. It’s doing exactly what it was designed to do.
You just didn’t realize it was doing it again.
Clarifying the Issue
By default, when Lambda is invoked asynchronously (for example, from S3, SNS, or EventBridge), AWS manages the retry logic for you — up to two more times if the first attempt fails or times out.
The catch: if your function times out or the runtime crashes partway through its work, Lambda marks the attempt as failed, even though some side-effects (a database write, an email) may already have happened. It can't roll those back, and it can't tell how far you got.
So it retries. Silently.
That’s how you end up with duplicate side-effects — two records written, two emails sent, or the same event processed twice.
Root causes include:
- Functions that time out mid-work.
- Asynchronous invocations without idempotent design.
- Missing or misconfigured DLQs or destinations.
- Poor observability — no correlation IDs, so retries look like new events.
AWS gives you three silent retry loops to think about:
- Service-level retries (SNS, S3, EventBridge, etc.)
- Lambda’s own async retry logic
- SDK / client retries inside your code
These can stack on top of each other, multiplying quietly.
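Lambda's own layer of that stack is configurable per function. Here is a minimal boto3 sketch (assuming the ProcessOrders function used throughout this article) that turns Lambda's async retries off entirely and leans on a failure destination instead:

import boto3

lam = boto3.client("lambda")

# Cap Lambda's own async retry layer. The defaults are 2 retries and a
# maximum event age of 6 hours; here retries are disabled so every
# failure flows straight to the OnFailure destination (see Step 4).
lam.put_function_event_invoke_config(
    FunctionName="ProcessOrders",   # assumed function name
    MaximumRetryAttempts=0,         # allowed values: 0, 1, or 2
    MaximumEventAgeInSeconds=3600,  # discard events older than 1 hour
)

Note this only tames the middle loop: the event source and any SDK clients inside your code still retry on their own schedules.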
Why It Matters
Silent retries break the contract of at-most-once execution.
You get at-least-once — which sounds harmless until your side-effects aren’t idempotent.
That means:
- Payments may double-charge.
- Database inserts may duplicate.
- Notifications may annoy your customers.
Even worse, retries appear hours later as AWS drains the retry queue, so the symptoms look random and unrelated.
This is one of the most expensive bugs to find after deployment, because by then, the damage is already in production data.
Key Terms
- Idempotency: The property of an operation that can run multiple times without changing the result.
- Asynchronous Invocation: Invocation mode where AWS queues the request and retries automatically.
- Dead Letter Queue (DLQ): An SQS queue or SNS topic where failed async events are sent after retries are exhausted.
- Event Destinations: Modern replacement for DLQs that captures both success and failure outcomes.
- Correlation ID: A unique event identifier used to detect duplicate processing across retries.
Steps at a Glance
- Detect retry patterns in CloudWatch metrics and logs.
- Add correlation IDs to track event lineage.
- Make all side-effects idempotent.
- Configure DLQs or destinations for async sources.
- Review and tune timeout settings.
- Monitor for delayed or batch retries.
Detailed Steps
Step 1: Detect retry patterns
Run this AWS CLI command to see how many times the function was invoked, minute by minute:
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=ProcessOrders \
--start-time $(date -u -d '15 minutes ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--statistics Sum \
--period 60
Compare this with:
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=ProcessOrders \
--start-time $(date -u -d '15 minutes ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--statistics Sum \
--period 60
If Invocations climb faster than the number of events your sources are actually emitting, the same events are being replayed. Flat Errors alongside rising Invocations points at service-level redelivery; matching error spikes point at Lambda's own retry layer.
Look for bursts of duplicate invocations clustered at the same timestamps.
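If you'd rather script that comparison than eyeball two CLI outputs, here is a rough boto3 sketch that pulls both metrics for the same window (ProcessOrders and the 15-minute window mirror the commands above):

from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=15)

def sum_by_minute(metric_name):
    # Per-minute Sum for one AWS/Lambda metric on a single function.
    resp = cw.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": "ProcessOrders"}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return {dp["Timestamp"]: dp["Sum"] for dp in resp["Datapoints"]}

invocations = sum_by_minute("Invocations")
errors = sum_by_minute("Errors")

# Print the two series side by side; minutes where invocations spike
# without matching errors are candidates for service-level redelivery.
for ts in sorted(invocations):
    print(ts.isoformat(), "invocations:", invocations[ts],
          "errors:", errors.get(ts, 0.0))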
Step 2: Add correlation IDs
Include a unique event ID in every log line and side-effect:
import uuid

def handler(event, context):
    # Prefer an ID carried in the event itself (EventBridge events have one).
    # A retry replays the same payload, so an event-supplied ID survives
    # across attempts; a freshly generated UUID is only a fallback and
    # will differ on every attempt.
    correlation_id = event.get("id") or str(uuid.uuid4())
    print(f"correlation_id={correlation_id} Processing order event")
    process_order(event, correlation_id)  # your business logic
Then search logs for the same correlation ID appearing multiple times — that’s a retry.
Step 3: Make side-effects idempotent
If your Lambda writes to DynamoDB, S3, or sends notifications, make sure duplicate processing doesn’t break anything.
Example with DynamoDB conditional write:
table.put_item(
Item=record,
ConditionExpression="attribute_not_exists(OrderId)"
)
If a retry replays the same OrderId, the write is rejected with a ConditionalCheckFailedException instead of creating a duplicate record.
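Catch that exception and treat the replay as a clean no-op. A sketch, assuming an Orders table keyed on OrderId:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")  # assumed table name

def save_order_once(record):
    # Write the order exactly once; a replayed OrderId becomes a no-op.
    try:
        table.put_item(
            Item=record,
            ConditionExpression="attribute_not_exists(OrderId)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # a retry replayed an OrderId we already stored
        raise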
Step 4: Configure DLQs or destinations
Never let retries vanish into the void.
Configure a DLQ (the legacy mechanism) or an on-failure event destination (recommended). Destinations are set with put-function-event-invoke-config:
aws lambda put-function-event-invoke-config \
--function-name ProcessOrders \
--destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:us-east-1:123456789012:LambdaFailures"}}'
For a legacy DLQ instead, use update-function-configuration:
aws lambda update-function-configuration \
--function-name ProcessOrders \
--dead-letter-config TargetArn=arn:aws:sqs:us-east-1:123456789012:LambdaFailures
This ensures failed or exhausted events are captured for inspection.
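Once failed events start landing in that queue, pull a few and look at what actually died. A quick boto3 sketch, assuming the LambdaFailures queue from the command above:

import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="LambdaFailures")["QueueUrl"]

# Peek at failed async invocations without deleting them from the queue.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,
)
for msg in resp.get("Messages", []):
    body = json.loads(msg["Body"])
    # Destination records carry the original event plus context about why
    # retries were exhausted. (Legacy DLQs instead put the raw event in
    # the body, with error details in message attributes.)
    print(body.get("requestContext", {}).get("condition"),
          json.dumps(body.get("requestPayload")))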
Step 5: Review timeout settings
If your function is being retried silently, check whether it’s timing out.
A function that hits its timeout is killed mid-work: the side-effects it already committed look like success, but Lambda records the attempt as a failure and schedules a retry.
aws lambda get-function-configuration \
--function-name ProcessOrders \
--query "Timeout"
If invocation Duration regularly runs close to that limit, raise the timeout or break the work into smaller pieces.
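You can also make the headroom visible from inside the function itself; the Python runtime's context object exposes get_remaining_time_in_millis(). A sketch, where do_work is a stand-in for your real logic:

def handler(event, context):
    do_work(event)  # placeholder for your existing logic

    # How close did this run come to the configured timeout?
    remaining_ms = context.get_remaining_time_in_millis()
    print(f"remaining_ms={remaining_ms}")
    if remaining_ms < 1000:
        # Under ~1s of headroom: the next slow dependency call turns
        # this run into a timeout, and a timeout into a silent retry.
        print("WARNING: finished within 1s of timeout")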
Step 6: Monitor for delayed retries
Retries for asynchronous Lambda can happen minutes or hours later.
Use CloudWatch Logs Insights:
fields @timestamp, @message
| filter @message like "correlation_id="
| parse @message "correlation_id=* " as correlation_id
| stats count(*) as attempts, min(@timestamp) as first_seen, max(@timestamp) as last_seen by correlation_id
| filter attempts > 1
| sort last_seen desc
Any correlation ID with more than one attempt is a retry; a large gap between first_seen and last_seen is a delayed one.
Pro Tip #1: Silence Isn’t Success
A retry queue that never drains is the quietest failure you’ll ever have.
Measure “retries attempted” as carefully as “errors thrown.”
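One concrete way to do that: alarm on the AWS/Lambda DeadLetterErrors metric, which counts events Lambda tried and failed to hand to your DLQ. A sketch, with an assumed SNS alerting topic:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when Lambda cannot deliver exhausted events to the DLQ;
# those events are lost unless something notices.
cloudwatch.put_metric_alarm(
    AlarmName="ProcessOrders-DeadLetterErrors",
    Namespace="AWS/Lambda",
    MetricName="DeadLetterErrors",
    Dimensions=[{"Name": "FunctionName", "Value": "ProcessOrders"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:OnCallAlerts"],  # assumed topic
)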
Pro Tip #2: Fail Once, Not Twice
If you must fail, fail loudly the first time.
Retries are for recovery, not repetition.
Conclusion
A silent retry is Lambda’s way of asking, “Are you sure you finished?”
If you don’t answer clearly — it will assume you didn’t.
By making your functions idempotent, tracking correlation IDs, and surfacing retry metrics, you turn invisible replays into visible signals.
In serverless systems, no news isn’t good news — it’s déjà vu.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.