AWS Lambda: “The Stuck Invocation” — When Lambda Never Finishes but Never Fails

A “stuck” Lambda isn’t a mystery — it’s a symptom of unfinished work





Problem

Some Lambda functions never fail, but they also never complete. They sit in execution limbo—accumulating concurrency, blocking new invocations, and driving up duration metrics without producing a single error log.

They don’t crash. They don’t return. They just stay... stuck.

Clarifying the Issue

A stuck invocation occurs when the Lambda runtime never reaches a completion signal: a return, a callback, or context.done(). The container stays alive until the configured timeout expires. The invoker waits, the metrics climb, and nothing new gets processed.

Common root causes, depending on the runtime, include:

  • Node.js: Unawaited async calls or improper event loop management. (The solution involving context.callbackWaitsForEmptyEventLoop is detailed in Step 3.)
  • All Runtimes: Blocking I/O, such as socket reads, unclosed streams, or subprocesses waiting forever.
  • Python/Java: Background threads or daemons that keep the runtime alive after the handler ends.

Why It Matters

A stuck invocation is more dangerous than a crash. When a function fails, you get an error, a retry, and a paper trail. When it hangs, AWS bills you for the full duration, and the hung invocations hold concurrency that would otherwise serve new requests, throttling them once you hit your limit.

These invisible stalls can:

  • Inflate duration and cost metrics.
  • Block concurrency and throttle scaling.
  • Obscure the root cause during debugging.

Key Terms

  • Event Loop: The Node.js runtime structure that processes asynchronous operations.
  • Blocking I/O: A process that prevents the Lambda from completing until data is read or written.
  • Execution Timeout: The maximum runtime configured for your function (default 3 seconds, up to 900 seconds).
  • Deadlock: A state where two or more tasks wait on each other indefinitely.
  • Zombies: Lambda instances that consume concurrency but perform no useful work.

Steps at a Glance

  1. Detect stuck invocations.
  2. Review configured timeouts.
  3. Await or close async tasks.
  4. Handle background threads cleanly.
  5. Layer timeouts defensively.
  6. Monitor for patterns of saturation.

Detailed Steps

Step 1: Detect Stuck Invocations

Use CloudWatch metrics to identify Lambdas that start but never finish. You’ll focus on the Duration metric (Maximum statistic) and ConcurrentExecutions.

Option 1 — Quick Duration Check

This command retrieves recent invocation durations so you can spot any that run up to the timeout limit:

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=ProcessOrders \
  --statistics Average Maximum \
  --period 60 \
  --start-time $(date -u --date='10 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

Line in question:

--statistics Average Maximum \

This flag tells CloudWatch to return both Average and Maximum values for invocation duration.

Look at: the "Maximum" value in the output—it shows the longest-running invocation in the sample window. If "Maximum" equals your configured timeout (for example 90000 ms = 90 s), those invocations hung until AWS forcibly terminated them.

Where this timeout is configured:

  • Console: Configuration tab → General Configuration → Timeout (default 3 seconds; max 15 minutes).
  • CLI: aws lambda update-function-configuration --function-name ProcessOrders --timeout 90
  • IaC (Terraform/SAM): The timeout or Timeout property in your deployment file.

Sample Output:

{
  "Datapoints": [
    { "Timestamp": "2025-10-17T20:02:00Z", "Average": 89950.0, "Maximum": 90000.0 }
  ],
  "Label": "Duration"
}

✅ Look at: "Maximum": 90000.0 ← matches timeout → stuck invocation detected.

Option 2 — Duration + Concurrency Overlay

This variant layers in ConcurrentExecutions to show when invocations are hanging and piling up:

aws cloudwatch get-metric-data \
  --metric-data-queries '[
    {
      "Id": "duration",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "Duration",
          "Dimensions": [{"Name": "FunctionName","Value":"ProcessOrders"}]
        },
        "Period": 60,
        "Stat": "Maximum"
      }
    },
    {
      "Id": "concurrency",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Lambda",
          "MetricName": "ConcurrentExecutions"
        },
        "Period": 60,
        "Stat": "Maximum"
      }
    }
  ]' \
  --start-time $(date -u --date='10 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

Line in question:

"Id": "duration",

"Id" is just the label for this query; its results hold the Duration values referenced below.

Look at:

  • "duration.Values" → Maximum durations for each interval.
  • "concurrency.Values" → Active concurrent executions.

If "duration.Values" are near timeout and "concurrency.Values" rise, invocations are hanging and not releasing capacity.

Sample Output:

{
  "MetricDataResults": [
    { "Id": "duration", "Values": [88000.0, 89000.0, 90000.0] },
    { "Id": "concurrency", "Values": [12.0, 13.0, 14.0] }
  ]
}

✅ Look at: "duration.Values" (≈ timeout) + "concurrency.Values" (increasing) → active zombie invocations.


Step 2: Review Configured Timeouts

Check your Lambda’s configured timeout to ensure it reflects realistic runtime expectations. Default is 3 seconds; max is 900 seconds. Misaligned timeouts cause early termination or wasted compute.
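
To confirm the current setting for the example function from Step 1, you can read it straight from the function configuration:

aws lambda get-function-configuration \
  --function-name ProcessOrders \
  --query Timeout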

Step 3: Await or Close Async Tasks

Always await asynchronous operations before returning. With an async handler, the invocation completes when the returned promise resolves, so anything you fire and forget is cut off when the handler returns, while a promise you await but that never settles leaves the function hanging until the timeout. In Node.js callback-style handlers, context.callbackWaitsForEmptyEventLoop defaults to true, so the invocation does not complete until the event loop is empty; an open database connection, socket, or timer can hold it open until the timeout. Close those handles before returning, or set the flag to false when you intentionally keep connections warm across invocations.

exports.handler = async (event, context) => {
  await db.connect();                        // assume db is initialized elsewhere
  const result = await processOrders(event); // every async step is awaited
  await db.close();                          // nothing left open when we return
  return result;
};
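
If you deliberately keep the connection open between warm invocations (a common pooling pattern), the alternative is to tell the runtime not to wait on it. A minimal sketch, reusing the db and processOrders placeholders from above in a callback-style handler:

exports.handler = (event, context, callback) => {
  // Send the response as soon as the callback runs, even though the
  // pooled connection keeps the event loop non-empty.
  context.callbackWaitsForEmptyEventLoop = false;

  db.connect()
    .then(() => processOrders(event))
    .then((result) => callback(null, result)) // connection stays open for reuse
    .catch((err) => callback(err));
};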

Step 4: Handle Background Threads Cleanly

In Python or Java, ensure any background threads or daemons end gracefully before Lambda exits.
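
A minimal Python sketch of that idea (the worker body is a stand-in for real background work): start the thread, then join it with a timeout before returning so it can't quietly outlive the handler.

import threading

def handler(event, context):
    results = []

    def worker():
        # Stand-in for real background work (an API poll, a batch write, etc.)
        results.append(len(str(event)))

    t = threading.Thread(target=worker)  # non-daemon: it must finish or be joined
    t.start()

    # ... main handler logic ...

    t.join(timeout=5)  # wait for the worker, but never past your own time budget
    if t.is_alive():
        # A thread still running here is exactly the kind of work that
        # keeps an invocation "stuck" -- log it loudly.
        print("WARNING: background worker did not finish in time")
    return {"results": results}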

Step 5: Layer Timeouts Defensively

Implement layered timeouts in downstream SDK calls (e.g., S3, DynamoDB, external APIs) so slow dependencies don’t outlive your function’s window.
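
One way to do this in Node.js, sketched with the AWS SDK for JavaScript v3 and its DynamoDB client. (Current releases ship the request handler as @smithy/node-http-handler; older v3 releases use @aws-sdk/node-http-handler, and the exact option names vary slightly by version.)

const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const { NodeHttpHandler } = require("@smithy/node-http-handler");

// Cap how long any single DynamoDB call can stall, well below the
// function's own timeout, so a slow dependency fails fast instead of
// silently consuming the whole execution window.
const ddb = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1000, // ms allowed to establish the connection
    socketTimeout: 3000,     // ms of socket inactivity before aborting
  }),
});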

Step 6: Monitor for Saturation Patterns

Use CloudWatch alarms on both Duration and ConcurrentExecutions to catch rising concurrency without corresponding completions.
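
For example, an alarm that fires when the Maximum duration has sat near the 90-second timeout from Step 1 for three straight minutes. (The alarm name and the 85000 ms threshold are placeholders; add --alarm-actions with your own SNS topic ARN to get notified.)

aws cloudwatch put-metric-alarm \
  --alarm-name ProcessOrders-near-timeout \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=ProcessOrders \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 85000 \
  --comparison-operator GreaterThanThreshold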


Pro Tip #1: A Hanging Lambda Is Operational Debt

A Lambda that doesn’t fail isn’t reliable—it’s an expensive zombie. Kill it fast, log it clearly, and redesign the logic to self-terminate when dependencies stall.

Pro Tip #2: Force Visibility with Internal Timeouts

Don’t rely solely on AWS to time out your function. Implement internal timers to abort gracefully and report the stall in logs.
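
A minimal sketch of that idea in Node.js, reusing the processOrders placeholder: race the real work against an internal deadline derived from context.getRemainingTimeInMillis(), so a stall is logged and surfaced before AWS cuts the function off.

exports.handler = async (event, context) => {
  // Leave 5 seconds of headroom before the configured Lambda timeout.
  const budget = context.getRemainingTimeInMillis() - 5000;
  let timer;

  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("internal timeout: downstream work stalled")),
      budget
    );
  });

  try {
    return await Promise.race([processOrders(event), deadline]);
  } catch (err) {
    console.error("Aborting before the AWS timeout:", err.message);
    throw err; // fail loudly instead of hanging silently
  } finally {
    clearTimeout(timer); // don't leave the timer pending on the event loop
  }
};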


Conclusion

A stuck invocation doesn’t announce itself—it hides in clean dashboards and normal billing. By explicitly monitoring Duration.Maximum and ConcurrentExecutions, enforcing timeouts, and ensuring clean async handling, you restore visibility and control.

In Lambda, silence isn’t stability—it’s how functions get stuck.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
