The Coffee Break That Caught the Bug

How one moment of doubt saved my first production deploy

Hi, I'm Maya.

Four months into my first job as a backend engineer at FinTrack, I was about to deploy my first AWS Lambda to production. Solo. No hand-holding, no senior engineer standing over my shoulder when I hit the button.

The function was simple: process incoming transaction data, enrich it with metadata, and store it in our data warehouse. Tests passed. Code review approved. My finger was literally hovering over the merge button.

But something felt... off.

The Setup

I'd spent two weeks building this Lambda. It handled real customer financial data—transaction amounts, merchant details, timestamps. The kind of data you absolutely cannot mess up.

The code was clean. I'd even added a simple in-memory cache to avoid redundant API calls: a metadata_cache dictionary defined at the top level of my Python file, so it would persist across invocations whenever the same container was reused.

# Module-level cache - persists across warm container invocations!
metadata_cache = {}

def lambda_handler(event, context):
    transaction_id = event['transaction_id']
    user_id = event['user_id']

    # Cache the user metadata to avoid redundant API calls
    if 'metadata' not in metadata_cache:
        metadata_cache['metadata'] = fetch_user_metadata(user_id)

    enriched_data = {
        'transaction_id': transaction_id,
        'metadata': metadata_cache['metadata'],
        'timestamp': event['timestamp']
    }

    return store_transaction(enriched_data)

My test suite covered everything. Edge cases. Error handling. Retries. All green.

Elena, our senior engineer, had approved the PR with a simple "LGTM 🚢".

It was 4:30 PM on a Friday. I could merge this, watch it deploy, and head into the weekend knowing I'd shipped my first real feature.

But I didn’t click merge.

The Nagging Feeling

Something bothered me. I couldn't articulate what.

I'd read the AWS Lambda documentation cover to cover. I knew about cold starts and warm containers. I understood that Lambda might reuse the same execution environment across multiple invocations to save time.

But my tests had run the function dozens of times. They all passed. What was I worried about?

I stood up. Grabbed my coffee mug. Walked to the window.

Our office overlooks downtown Seattle. I stared at the traffic below, not really seeing it. My mind was elsewhere, turning over the code like a Rubik’s cube.

Warm containers. Reused execution environments. Same Lambda, different invocations.

And then it hit me.

Same Lambda. Same... cache?

The Test I Almost Didn't Run

I sat back down and opened the AWS console. My Lambda was deployed in our staging environment. I pulled up the test interface and created two test events:

Test Event 1:

{
  "transaction_id": "txn_001",
  "user_id": "user_alice",
  "timestamp": "2025-10-04T16:35:00Z"
}

Test Event 2:

{
  "transaction_id": "txn_002", 
  "user_id": "user_bob",
  "timestamp": "2025-10-04T16:35:30Z"
}

I ran the first test. Success. Alice’s metadata populated correctly.

I immediately ran the second test.

The response came back. I scanned the JSON output.

My stomach dropped.

Bob’s transaction was showing Alice’s metadata.

The Realization

I stared at my screen. Refreshed. Ran the test again.

Same result. Bob’s transaction. Alice’s metadata.

The cache was persisting between invocations.

Oh no.

I looked at the top of my code:

metadata_cache = {}

That variable. The one I’d defined at the module level. I’d assumed each invocation would have a clean slate, but when AWS reused the warm container, it reused that variable.
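
If you want to see the reuse in isolation, a throwaway function makes it painfully obvious. A minimal sketch you could deploy as its own test Lambda and invoke a few times in quick succession:

# Module-level state survives as long as the execution environment does
invocation_count = 0

def lambda_handler(event, context):
    global invocation_count
    invocation_count += 1  # increments across warm invocations, resets on cold start

    return {'invocation_count': invocation_count}

On a cold start it returns 1. On a warm container it just keeps counting: 2, 3, 4. My metadata_cache was doing exactly the same thing, except what it carried forward was Alice's financial metadata.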

If I'd deployed this to production, user Alice's financial metadata would have bled into user Bob's transaction record. And into user Carol's. And into every transaction that warm container processed after hers. A privacy violation affecting every customer whose request hit the same warm Lambda container.

I felt cold.

This wasn’t a theoretical bug. This was data leakage. This was the kind of mistake that ends up in incident reports and regulatory filings.

The Fix

The solution was embarrassingly simple: move the cache initialization inside the handler function.

# No global cache here
def lambda_handler(event, context):
    metadata_cache = {}  # The fix: a brand-new cache for every invocation

    transaction_id = event['transaction_id']
    user_id = event['user_id']

    # Cache the user metadata to avoid redundant API calls within this invocation
    if 'metadata' not in metadata_cache:
        metadata_cache['metadata'] = fetch_user_metadata(user_id)

    enriched_data = {
        'transaction_id': transaction_id,
        'metadata': metadata_cache['metadata'],
        'timestamp': event['timestamp']
    }

    return store_transaction(enriched_data)

Because the dictionary is created inside the handler, every invocation gets its own fresh, clean state. Nothing survives from one request to the next.

I ran the two test events again. Alice’s metadata stayed with Alice. Bob’s stayed with Bob.

Green across the board.
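
A footnote for anyone who genuinely needs a cache that survives warm invocations: the safer shape is to keep the module-level dictionary but key every entry by the user and give entries an expiry, so the worst case is stale data rather than someone else's data. Here's a sketch of that pattern, with an illustrative TTL rather than anything we shipped:

import time

METADATA_TTL_SECONDS = 300  # illustrative value, not a recommendation
metadata_cache = {}  # maps user_id -> {'metadata': ..., 'fetched_at': ...}

def get_user_metadata(user_id):
    entry = metadata_cache.get(user_id)
    if entry and time.time() - entry['fetched_at'] < METADATA_TTL_SECONDS:
        return entry['metadata']

    metadata = fetch_user_metadata(user_id)
    metadata_cache[user_id] = {'metadata': metadata, 'fetched_at': time.time()}
    return metadata

Keyed by user, Alice's entry can never land on Bob's transaction. The remaining risks are stale entries and unbounded growth, which is why the TTL is there and why a size cap would be the next addition.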

The Validation

Monday morning, I walked Elena through what had happened. I showed her the bug, the fix, the test results.

She nodded slowly. “How’d you catch it?”

“Something felt wrong. I ran two test events against the same Lambda back-to-back, Alice then Bob, and saw the data leak.”

“Most people find this bug in production,” she said. “You know why your tests didn’t catch it?”

I thought about it. “My unit tests exercised each invocation in isolation, with everything around it mocked out. They never actually tested container reuse.”

“Exactly. Testing individual invocations in isolation won’t reveal state leakage. You need to test the same instance multiple times.” She paused. “Good instinct to run it manually.”

What I Learned

That coffee break saved me from what could have been a career-defining mistake—the bad kind.

But here’s the thing: it wasn’t magic intuition. It was pattern recognition. It was all those blog posts I’d read about Lambda best practices, the Stack Overflow threads about Python gotchas, the AWS documentation about execution contexts. My subconscious had assembled the pieces; I just needed to give it space to surface.

Now, before every deployment, I ask myself: What happens if this runs twice?

For Lambda functions, I test sequential invocations manually. I verify that state doesn’t leak between calls. I treat module-level state and mutable defaults—dictionaries, lists, sets—as production landmines.
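
If you'd rather codify that check than click through the console, a double-invocation test does it. A minimal sketch, assuming pytest, a handler module named handler, and hypothetical fakes standing in for fetch_user_metadata and store_transaction:

import importlib

import handler  # the module that defines lambda_handler; the name is assumed for this sketch

def test_no_state_leaks_between_sequential_invocations(monkeypatch):
    # Reload once, then reuse the same module for both calls, like a warm container would
    importlib.reload(handler)

    # Hypothetical fakes: metadata that is obviously tied to one user, and a no-op store
    monkeypatch.setattr(handler, 'fetch_user_metadata', lambda user_id: {'owner': user_id})
    monkeypatch.setattr(handler, 'store_transaction', lambda data: data)

    first = handler.lambda_handler(
        {'transaction_id': 'txn_001', 'user_id': 'user_alice',
         'timestamp': '2025-10-04T16:35:00Z'}, None)
    second = handler.lambda_handler(
        {'transaction_id': 'txn_002', 'user_id': 'user_bob',
         'timestamp': '2025-10-04T16:35:30Z'}, None)

    assert first['metadata']['owner'] == 'user_alice'
    assert second['metadata']['owner'] == 'user_bob'  # fails against the module-level cache bug

Against the buggy version, the second assertion fails; against the fix, it passes. That one test would have caught what all my green unit tests missed.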

And when something feels off, even if I can’t articulate why, I don’t ignore it.

I step away. I grab coffee. I stare out the window.

Because the bugs you catch before deployment are the ones that make you a better engineer.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
