NAT Gateway Timeouts — Lambda in a Private Subnet Can’t Reach the Internet

 

Problem

A Lambda function running inside a private VPC subnet suddenly starts timing out whenever it calls external services — S3, DynamoDB, STS, third-party APIs, anything requiring outbound access. CloudWatch shows long durations with little or no log output. The function simply hangs until the timeout expires.

This is the classic symptom of a broken NAT Gateway path.


Clarifying the Issue

A Lambda inside a private subnet cannot reach the internet directly. It relies on the following chain:

Lambda ENI → Route Table → NAT Gateway → Internet Gateway → External Service

If any part of this path is misconfigured or degraded, Lambda does not fail fast — it waits. The result is a full timeout with no helpful logs.

Common failure modes include:

  • Missing or incorrect 0.0.0.0/0 route
  • Route table pointing to an unhealthy or unreachable NAT Gateway
  • NAT Gateway deployed in a private subnet (incorrect)
  • NAT Gateway in a different AZ (allowed, but discouraged due to cross-AZ cost/latency)
  • NAT Gateway failures or throttling under load
  • DNS resolution disabled at VPC level
  • Security Group egress rules too restrictive
  • NACLs blocking ephemeral return traffic
  • Incorrect or conflicting VPC endpoint configurations

From Lambda’s perspective, everything looks normal — until the request hangs.


Why It Matters

A failing NAT Gateway effectively isolates your Lambda from most AWS APIs and all external services. This causes:

  • API stalls and cascading delays
  • SQS/SNS processing failures
  • STS credential errors
  • Broken authentication flows
  • Hanging ETL pipelines
  • Costly retry storms

NAT reliability is foundational. If the NAT path is broken, the Lambda is blind.


Key Terms

  • Private Subnet: A subnet with no route to an Internet Gateway.
  • NAT Gateway: Provides outbound internet access from private subnets.
  • Route Table: Defines traffic paths for a subnet.
  • VPC Endpoint: Private AWS service access that bypasses NAT.
  • Security Groups (SGs): Stateful virtual firewalls.
  • Network ACLs (NACLs): Stateless subnet-level filters.

Steps at a Glance

  1. Check logs for long, silent timeouts.
  2. Verify route table has a correct 0.0.0.0/0 → NAT entry.
  3. Confirm NAT Gateway is healthy and in a public subnet.
  4. Enable VPC DNS support.
  5. Validate Security Group outbound rules.
  6. Validate NACL rules for ephemeral ports.
  7. Test outbound connectivity inside Lambda.
  8. Replace NAT with VPC Endpoints when possible.

Detailed Steps

Step 1: Check CloudWatch for Hanging Behavior

Look for full-duration timeouts:

REPORT RequestId: ... Duration: 30000 ms    Billed Duration: 30000 ms

No stack trace and no application error, at most a "Task timed out" line. Just a stall.
This strongly indicates a network egress failure.
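
To scan exported log lines for this pattern, a small parser can compare the logged duration with the configured timeout. This is a minimal sketch: the function name and the 50 ms slack are assumptions for illustration, and the sample line mirrors the REPORT line above.

```python
import re

# Matches the "Duration" field of a REPORT line, but not "Billed Duration".
_DURATION_RE = re.compile(r"(?<!Billed )Duration:\s*([\d.]+)\s*ms")


def is_full_timeout(report_line: str, timeout_ms: float, slack_ms: float = 50.0) -> bool:
    """Return True when the logged duration reached the configured timeout."""
    match = _DURATION_RE.search(report_line)
    if match is None:
        return False  # not a REPORT line we can parse
    return float(match.group(1)) >= timeout_ms - slack_ms


line = "REPORT RequestId: abc Duration: 30000 ms    Billed Duration: 30000 ms"
print(is_full_timeout(line, timeout_ms=30_000))
```

A run of invocations that all return True here, with no application errors in between, points at the network path rather than the code.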


Step 2: Validate the Route Table

Your private subnet route table must contain:
0.0.0.0/0 → nat-xxxxxxxxxxxxx

Check via CLI:

aws ec2 describe-route-tables \
  --route-table-ids rtb-123 \
  --query "RouteTables[*].Routes"

Common issues:

  • Missing 0.0.0.0/0 route.
  • Route pointing to "local" only.
  • Route pointing to a NAT Gateway in a different AZ. This works, but is discouraged because of cross-AZ charges, added latency, and failure-domain coupling.

Best practice: Keep NAT Gateway routing within the same AZ as the Lambda’s ENI.
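
The same check can be scripted against the JSON that describe-route-tables returns. The sketch below assumes you have already loaded one route table entry into a dict; the function name and the sample IDs are hypothetical.

```python
def default_route_target(route_table: dict) -> str:
    """Classify where 0.0.0.0/0 points in one describe-route-tables entry."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackhole"  # target was deleted or is unreachable
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"  # this would make the subnet public
        return "other"
    return "missing"  # only the local route (or no default route) exists


# Hypothetical sample data in the shape returned by the CLI.
private_rtb = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-123", "State": "active"},
    ]
}
print(default_route_target(private_rtb))
```

For a Lambda subnet you want "nat". A result of "missing" or "blackhole" explains a silent hang on its own.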


Step 3: Confirm NAT Gateway Health and Placement

Check NAT Gateway state:

aws ec2 describe-nat-gateways \
  --nat-gateway-ids nat-123 \
  --query "NatGateways[*].State"

  • Valid state: available
  • Problem states: pending, failed, deleting

Critical architectural requirement:
A NAT Gateway must be deployed in a public subnet, meaning that specific subnet must have:
0.0.0.0/0 → igw-xxxxxxxxxxxxxx

If the NAT Gateway is placed in a private subnet, outbound traffic can never reach the Internet Gateway — Lambda will always hang.
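
Both conditions (state and placement) can be checked together once you have the describe-nat-gateways entry and the route table of the NAT Gateway's own subnet. A minimal sketch, with a hypothetical function name and sample IDs:

```python
from typing import List


def nat_gateway_problems(nat_gateway: dict, nat_subnet_route_table: dict) -> List[str]:
    """List reasons why this NAT Gateway cannot forward traffic to the internet."""
    problems = []
    state = nat_gateway.get("State")
    if state != "available":
        problems.append(f"state is {state!r}, expected 'available'")
    # The NAT Gateway's subnet is public only if its default route targets an IGW.
    has_igw_default = any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        and route.get("State") != "blackhole"
        for route in nat_subnet_route_table.get("Routes", [])
    )
    if not has_igw_default:
        problems.append("NAT subnet has no 0.0.0.0/0 route to an Internet Gateway")
    return problems


# Hypothetical sample: a healthy NAT Gateway that was placed in a private subnet.
nat = {"NatGatewayId": "nat-123", "State": "available", "SubnetId": "subnet-123"}
rtb = {"Routes": [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"}]}
print(nat_gateway_problems(nat, rtb))
```

An empty list means the NAT Gateway itself looks fine and the fault is elsewhere in the chain.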


Step 4: Confirm DNS Support is Enabled

Lambda must resolve AWS service endpoints (e.g., sts.amazonaws.com).
Check DNS attributes:

aws ec2 describe-vpc-attribute \
  --vpc-id vpc-123 \
  --attribute enableDnsSupport

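Both attributes must be true. The command accepts one attribute per call, so run it a second time with --attribute enableDnsHostnames: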

  • enableDnsSupport
  • enableDnsHostnames

If DNS is off, nothing works — not even AWS internal API calls.
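
To separate a DNS failure from a routing failure, resolve a hostname from inside the function before attempting any connection. This sketch uses only the standard library; the handler shape and the returned key are assumptions for illustration.

```python
import socket


def can_resolve(hostname: str, port: int = 443) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        return False  # resolver unreachable or name unknown


def lambda_handler(event, context):
    # Resolution does not need the NAT path, so a False here points at DNS settings.
    return {"sts_resolves": can_resolve("sts.amazonaws.com")}
```

If resolution fails, fix DNS first. If it succeeds and connections still hang, move on to the security group, NACL, and routing checks.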


Step 5: Validate Security Group Egress Rules

Security Groups are stateful — return traffic is automatically allowed if outbound is permitted.
Ensure the Lambda’s SG allows outbound HTTPS:

  • Outbound Protocol: TCP
  • Port: 443
  • Destination: 0.0.0.0/0

If outbound port 443 is blocked, Lambda cannot reach any HTTPS endpoint, which includes AWS service APIs and VPC interface endpoints.
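
The egress rules can also be checked from the IpPermissionsEgress list that describe-security-groups returns. The sketch below is deliberately simplified: it only considers IPv4 CIDR ranges and ignores prefix lists and security-group references. The function name and the destination address (from the documentation range 203.0.113.0/24) are assumptions.

```python
import ipaddress
from typing import List


def egress_allows(ip_permissions_egress: List[dict], dest_ip: str, port: int = 443) -> bool:
    """Return True if any IPv4 egress rule permits TCP traffic to dest_ip:port."""
    dest = ipaddress.ip_address(dest_ip)
    for perm in ip_permissions_egress:
        protocol = perm.get("IpProtocol")
        if protocol == "-1":  # all protocols, all ports
            port_ok = True
        elif protocol in ("tcp", "6"):
            port_ok = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            port_ok = False
        if not port_ok:
            continue
        for ip_range in perm.get("IpRanges", []):
            if dest in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False


# Hypothetical rule set: only PostgreSQL inside the VPC is allowed out.
restricted = [
    {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]}
]
print(egress_allows(restricted, "203.0.113.10"))
```

A False result for a public address on port 443 means the security group alone is enough to cause the hang.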


Step 6: Validate NACLs (Stateless Filters)

NACLs are stateless, meaning you must allow both outbound requests and inbound response traffic explicitly.

Required Rules for Lambda as a Client:

Direction   Protocol   Ports         Purpose
Outbound    TCP        443           Traffic leaving Lambda to Internet
Inbound     TCP        1024–65535    Return traffic from Internet to Lambda

Note: Many real outages come from NACLs allowing Outbound 443 but blocking Inbound 1024–65535 (ephemeral return ports).
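
Because NACL rules are evaluated in rule-number order and the first match wins, it helps to simulate both directions separately. The sketch below works on the Entries list from describe-network-acls and handles IPv4 entries only; the function name and sample rules are hypothetical.

```python
import ipaddress
from typing import List


def nacl_allows(entries: List[dict], egress: bool, remote_ip: str, port: int) -> bool:
    """Evaluate NACL entries in rule-number order; the first match decides."""
    remote = ipaddress.ip_address(remote_ip)
    for entry in sorted(entries, key=lambda e: e["RuleNumber"]):
        if entry["Egress"] != egress:
            continue
        if "CidrBlock" not in entry:  # skip IPv6 entries in this sketch
            continue
        if remote not in ipaddress.ip_network(entry["CidrBlock"]):
            continue
        if entry["Protocol"] == "-1":  # all protocols
            port_match = True
        elif entry["Protocol"] == "6":  # TCP
            port_range = entry.get("PortRange", {})
            port_match = port_range.get("From", 0) <= port <= port_range.get("To", 65535)
        else:
            port_match = False
        if port_match:
            return entry["RuleAction"] == "allow"
    return False  # nothing matched: implicit deny


# Hypothetical NACL that allows 443 both ways but forgets the ephemeral range.
entries = [
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": True,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 443, "To": 443}},
    {"RuleNumber": 100, "Protocol": "6", "RuleAction": "allow", "Egress": False,
     "CidrBlock": "0.0.0.0/0", "PortRange": {"From": 443, "To": 443}},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": True,
     "CidrBlock": "0.0.0.0/0"},
    {"RuleNumber": 32767, "Protocol": "-1", "RuleAction": "deny", "Egress": False,
     "CidrBlock": "0.0.0.0/0"},
]
print("request out:", nacl_allows(entries, egress=True, remote_ip="203.0.113.10", port=443))
print("response in:", nacl_allows(entries, egress=False, remote_ip="203.0.113.10", port=50000))
```

For the inbound check, the port is the ephemeral port on the Lambda side, which is why an inbound rule for 443 does not help a client.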


Step 7: Test Outbound Connectivity Inside Lambda

Use only standard libraries to verify connectivity:

Node.js

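const https = require("https");
// Fail after 5 s instead of waiting for the Lambda timeout
const req = https.get("https://aws.amazon.com", { timeout: 5000 }, res => {
  console.log("reachable:", res.statusCode);
  res.resume(); // drain the response so the socket is released
});
req.on("timeout", () => req.destroy(new Error("connection timed out")));
req.on("error", err => console.error("unreachable:", err.message));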

Python

import urllib.request
# timeout=5 makes a broken path fail quickly instead of hanging
print(urllib.request.urlopen("https://aws.amazon.com", timeout=5).getcode())

If this hangs or times out → your NAT path is broken.
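
A lower-level probe can tell the failure modes apart, since a silent drop, a DNS failure, and a refused connection surface as different exceptions. This is a sketch using only the standard library; the function name, return labels, and handler shape are assumptions for illustration.

```python
import socket


def probe_tcp(host: str, port: int = 443, timeout: float = 3.0) -> str:
    """Try a TCP connection and label the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        return "dns_failure"  # name did not resolve: see Step 4
    except socket.timeout:
        return "timeout"  # packets silently dropped: route, NAT, SG, or NACL
    except ConnectionRefusedError:
        return "refused"  # the path works, the remote end rejected us
    except OSError as exc:
        return f"error: {exc}"


def lambda_handler(event, context):
    return {"aws.amazon.com:443": probe_tcp("aws.amazon.com")}
```

A "timeout" result is the signature described in this article. A "refused" result means packets are making the round trip, so the NAT path is not the problem.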


Step 8: Replace NAT with VPC Endpoints (Best Practice)

Many Lambdas do not need raw internet access. For AWS-hosted services, VPC Endpoints are faster, safer, and often cheaper.

Gateway Endpoints (Free):

  • S3
  • DynamoDB

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-123 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-123

Interface Endpoints (PrivateLink - Billed):

  • SQS, SNS, STS, Secrets Manager, EventBridge

These endpoints bypass NAT but incur hourly + data charges, so compare costs carefully.
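
The comparison itself is simple arithmetic: a fixed hourly charge plus a per-GB charge on each side. The sketch below shows the shape of that calculation. Every rate in it is a placeholder, not an AWS price, so substitute the current figures for your region; the function names and the 730-hour month are also assumptions.

```python
def nat_monthly_cost(gb: float, hourly: float, per_gb: float, hours: float = 730.0) -> float:
    """Monthly cost of one NAT Gateway processing `gb` gigabytes."""
    return hourly * hours + per_gb * gb


def interface_endpoints_monthly_cost(
    gb: float, endpoint_count: int, hourly: float, per_gb: float, hours: float = 730.0
) -> float:
    """Monthly cost of `endpoint_count` interface endpoints carrying `gb` gigabytes in total."""
    return endpoint_count * hourly * hours + per_gb * gb


# Placeholder rates for illustration only.
nat = nat_monthly_cost(gb=1000, hourly=0.05, per_gb=0.05)
endpoints = interface_endpoints_monthly_cost(gb=1000, endpoint_count=3, hourly=0.01, per_gb=0.01)
print(f"NAT: {nat:.2f}  interface endpoints: {endpoints:.2f}")
```

Interface endpoints are provisioned per service (and per AZ), so the endpoint count grows quickly. If the function also calls third-party APIs, the NAT Gateway stays in the bill either way, and only its per-GB portion shrinks.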


Pro Tips

Pro Tip #1: NAT Can Become a Bottleneck at Scale
Large Lambda bursts → NAT saturation → cascading latency.

Pro Tip #2: VPC Endpoint Cost Awareness

  • Gateway endpoints → free
  • Interface endpoints → billed per hour + per GB

Use them intentionally.

Pro Tip #3: Warm Starts Don’t Fix NAT Problems
Provisioned Concurrency can reduce init latency, but it cannot repair a broken network path.

Pro Tip #4: Monitor NAT Gateways
Watch CloudWatch metrics for PacketsDropCount, ErrorPortAllocation, and spikes in the byte-count metrics (for example BytesOutToDestination and BytesInFromDestination). These correlate directly with Lambda stalls.
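
Once you have pulled datapoints for ErrorPortAllocation (for example with aws cloudwatch get-metric-statistics using the Sum statistic), flagging the bad intervals takes a few lines. A minimal sketch, with a hypothetical function name and made-up sample datapoints:

```python
from typing import List


def port_allocation_errors(datapoints: List[dict]) -> List[dict]:
    """Return datapoints whose Sum is above zero, oldest first."""
    flagged = [dp for dp in datapoints if dp.get("Sum", 0) > 0]
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(flagged, key=lambda dp: dp["Timestamp"])


# Made-up sample in the shape of the "Datapoints" list from the CLI.
sample = [
    {"Timestamp": "2024-01-01T10:05:00Z", "Sum": 12.0, "Unit": "Count"},
    {"Timestamp": "2024-01-01T10:00:00Z", "Sum": 0.0, "Unit": "Count"},
    {"Timestamp": "2024-01-01T10:10:00Z", "Sum": 0.0, "Unit": "Count"},
]
for dp in port_allocation_errors(sample):
    print(dp["Timestamp"], dp["Sum"])
```

Any non-zero interval means the NAT Gateway could not allocate a source port. Line those timestamps up against the Lambda timeouts from Step 1.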


Conclusion

NAT Gateway timeouts are one of the most common — and most frustrating — Lambda VPC failures. The function is healthy; the network path is not. Once you validate routing, confirm NAT Gateway placement and health, enable DNS, configure SGs and NACLs properly, and adopt VPC endpoints where appropriate, you restore predictable outbound performance.

This is disciplined VPC-aware Lambda engineering — building stable, resilient serverless applications on top of a clear, intentional network design.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
