AWS Lambda: Phantom Retries — When Logs Multiply but Events Don’t Exist

Phantom retries can make a stable system look unstable





Problem

Your logs show the same Lambda event twice.

The same RequestId appears again.

CloudWatch metrics show two invocations where you expected one.

You check your SQS queue or Kinesis stream — and there’s no duplicate message.

No second trigger. Nothing out of the ordinary.

Welcome to Phantom Retries — the haunting illusion of multiple Lambda executions that never actually happened.

In this scenario, Lambda appears to “retry” work that it never re-received, leaving engineers chasing ghosts in their logs.


Clarifying the Issue

Phantom retries occur when signals, not actions, repeat. Lambda’s event model is at-least-once delivery, which means retries can happen — but what often looks like a retry is instead the result of lag, visibility timeout overlaps, or duplicate reporting at the monitoring layer.

A few common sources of confusion lead to this:

CloudWatch Log Delay – A log stream may show entries out of order, or the same entry repeated across log streams, especially during high concurrency. What looks like a “second” invocation is often a late-arriving write from a previous execution. Think of it like a slow mail service: a letter sent yesterday might arrive at the same time as one sent today, making it appear as if two were sent at once.

Partial Batch Reprocessing – In SQS and Kinesis integrations, if even one record in a batch fails, the entire batch is retried. That can make successfully processed records appear to “re-run.”

Visibility Timeout Overlaps – When a function runs close to the SQS visibility timeout, the message can become visible again while still being processed. If the function finishes late, both the “first” and “second” attempts may appear in CloudWatch logs even though only one completed.

Async Delivery Retries – Asynchronous invocations (such as from EventBridge or SNS) retry automatically on transient errors — sometimes after the function has already succeeded. These can appear as duplicates in logs while the invocation count stays correct.

Phantom retries don’t come from bad code — they come from how AWS propagates signals across its distributed logging, queuing, and delivery systems.


Why It Matters

Phantom retries waste time, money, and trust.

Engineers may tighten retry logic, add unnecessary locks, or even throttle concurrency based on misleading data.

False retry signals can:

  • Inflate CloudWatch Invocations and Errors counts.
  • Create the appearance of over-billing when duplicated traces appear tied to the same event.
  • Lead to incorrect assumptions about concurrency saturation.
  • Undermine confidence in production monitoring and post-incident forensics.

A clean event pipeline depends on clean visibility. Phantom retries pollute that visibility.


Key Terms

  • At-Least-Once Delivery – AWS guarantees that each message is delivered at least once, but possibly more than once.
  • Partial Batch Response – A mechanism that lets a function report which records in a batch failed, so only those records are retried rather than the entire batch.
  • Visibility Timeout – The period an SQS message stays hidden after being received. If the consumer doesn’t delete it before this time expires, it becomes visible again.
  • Idempotency Key – A unique key or fingerprint used to ensure the same event doesn’t get processed twice.
  • Idempotency – The property of an operation that can be applied multiple times without changing the result beyond the initial application.
  • Trace Correlation – Mapping a single event across multiple logs or metrics using a consistent identifier.

Steps at a Glance

  1. Confirm whether retries actually occurred.
  2. Trace by RequestId and EventId instead of timestamps.
  3. Align visibility timeouts with Lambda execution time.
  4. Implement partial batch handling.
  5. Deduplicate at the application layer.
  6. Separate “events processed” from “logs generated.”

Detailed Steps

Step 1: Confirm whether retries actually occurred

Start with the data source. Use CloudWatch metrics (Invocations, Errors, Throttles) to verify that AWS itself reports more invocations — not just duplicate logs.

If the metrics don’t rise, you’re not seeing a real retry — you’re seeing a reporting artifact.
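
If you want to pull those numbers directly rather than eyeball the console, a quick check with the AWS CLI looks roughly like this (the function name and time window are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-01T01:00:00Z \
  --period 300 \
  --statistics Sum

Run the same query for Errors. If the summed Invocations match the number of events you expected, the duplicates live in the logs, not in Lambda.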


Step 2: Trace by RequestId and EventId

Each Lambda invocation includes a RequestId. For SQS, each message includes a MessageId; for Kinesis, a SequenceNumber.

Cross-reference these IDs:

aws logs filter-log-events --log-group-name /aws/lambda/my-function \
  --filter-pattern "RequestId" | grep <request-id>

If the same RequestId appears twice while the Invocations metric hasn’t risen, you’re looking at a duplicate log write — not a retry.

If the RequestId differs but the event fingerprint matches, you’re seeing a legitimate retry.
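
A small helper makes the fingerprint comparison concrete. This is a rough sketch; the field names (body for SQS, kinesis.data for Kinesis) match the standard event shapes, but adjust them to whatever your source actually sends:

import hashlib

def event_fingerprint(record):
    # Hash the stable payload portion: "body" for SQS records,
    # base64-encoded "data" for Kinesis records.
    payload = record.get("body") or record.get("kinesis", {}).get("data", "")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

If two log entries share a fingerprint but carry different RequestIds, treat it as a real redelivery and let your idempotency layer absorb it.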


Step 3: Align visibility timeouts

In SQS-triggered Lambdas, ensure the visibility timeout is longer than the maximum Lambda runtime:

aws sqs set-queue-attributes \
  --queue-url <url> \
  --attributes VisibilityTimeout=120

For example, if your function runs for 90 seconds, a 120-second timeout ensures messages won’t reappear before processing finishes.
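
You can confirm the two values line up straight from the CLI (queue URL and function name are placeholders):

# Function timeout, in seconds
aws lambda get-function-configuration \
  --function-name my-function --query Timeout

# Queue visibility timeout, in seconds
aws sqs get-queue-attributes \
  --queue-url <url> --attribute-names VisibilityTimeout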


Step 4: Handle partial batches explicitly

Enable partial batch responses to acknowledge successful records while retrying only failed ones.

aws lambda update-event-source-mapping \
  --uuid <uuid> \
  --function-name my-function \
  --function-response-types ReportBatchItemFailures

Then, in your handler, report failed records:

def handler(event, context):
    failed = []
    for record in event["Records"]:
        try:
            process(record)  # Attempt to process each record individually
        except Exception:
            # Mark the failed message for retry
            failed.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failed}

This prevents good messages from being re-processed.


Step 5: Deduplicate at the application layer

Use an idempotency key or fingerprint derived from event payloads to ensure each message is processed only once — regardless of how many times it’s delivered:

import hashlib

key = hashlib.sha256(event["body"].encode("utf-8")).hexdigest()
if already_processed(key):
    return {"status": "duplicate"}
process_event(event)

Even if AWS retries, your application won’t.
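
In practice, the already_processed check and the bookkeeping around it need to be atomic, or two concurrent deliveries can both pass the check. A common pattern is a conditional write to a small DynamoDB table. This is a sketch, not a drop-in: the table name idempotency-keys and its pk partition key are assumptions.

import hashlib

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("idempotency-keys")  # assumed table name

def process_once(event):
    key = hashlib.sha256(event["body"].encode("utf-8")).hexdigest()
    try:
        # The conditional write succeeds only the first time this key is seen
        table.put_item(
            Item={"pk": key},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return {"status": "duplicate"}
        raise
    process_event(event)  # your existing business logic
    return {"status": "processed"}

Adding a TTL attribute to the table keeps old keys from accumulating forever.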


Step 6: Separate event metrics from log metrics

Create separate CloudWatch dashboards for:

  • Events processed (based on payload IDs or success counters)
  • Logs generated (based on RequestIds)

This distinction lets you see which anomalies come from the data layer and which from the monitoring layer.
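
To feed the “events processed” side of that dashboard, one option is to emit a custom metric from the handler after each record succeeds. A minimal sketch follows; the namespace MyApp/Pipeline and metric name EventsProcessed are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_processed(count=1):
    # Counts business-level successes, independent of how many log lines were written
    cloudwatch.put_metric_data(
        Namespace="MyApp/Pipeline",
        MetricData=[{
            "MetricName": "EventsProcessed",
            "Value": count,
            "Unit": "Count",
        }],
    )

Chart this next to the built-in Invocations metric; when the two diverge, the gap itself tells you which layer is misreporting.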


Pro Tip #1: Don’t Debug the Ghosts

When logs show duplicates but metrics don’t, resist the urge to “fix” a problem that isn’t real.

Validate the signal before you touch production code.


Pro Tip #2: Treat Observability as a Source of Truth — but Verify It

Your logging and metrics pipelines are mirrors, not microscopes. Always confirm they reflect reality before assuming correlation equals cause.


Conclusion

Phantom retries can make a stable system look unstable. They inflate logs, distort metrics, and create false alarms that erode confidence in your operations.

The solution isn’t more retry logic — it’s clarity. By aligning timeouts, handling partial batches, and using idempotent keys, you turn phantom noise into actionable insight.

In a world of distributed systems and asynchronous delivery, not every repeated signal means repeated work. Sometimes, it’s just the echo of success.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
