The Secret Life of AWS: The Fire Drill
Why panic is a bug, how to write a runbook, and the power of the "5 Whys"
Part 13 of The Secret Life of AWS
It was 2:14 AM when the phone screamed.
Timothy jolted awake, heart pounding. He fumbled for the device in the dark. The screen glowed with a terrifying red notification from the CloudWatch Alarm he had set up just yesterday: CRITICAL: PaymentGatewayFailure.
Panic set in. He rushed to his desk, eyes blurring. He logged into the console and saw red lines everywhere. The database CPU was at 100%. The Lambda functions were timing out.
"I have to fix it," he muttered. "I have to do something."
He started clicking. He rebooted the database. He increased the Lambda timeout. He quickly wrote a hotfix for the payment code and deployed it without testing.
The red lines didn't stop. They got worse.
Margaret found him at 8:00 AM, slumped over his keyboard, head in his hands.
"I broke it," Timothy whispered. "I tried to fix it, but I made it worse. You can fire me now."
Margaret pulled up a chair and sat next to him. She didn't look angry. She looked calm.
"You are not fired, Timothy," she said softly. "But you are finished with 'Hero Mode'. Today, we discuss Incident Response."
The Runbook (Instructions for Wartime)
Margaret placed a blank binder on the desk.
"Last night, you panicked," she said. "Panic is a bug. When the alarm rings at 2:00 AM, your brain shuts down. You lose 20 IQ points instantly. You cannot be trusted to make complex decisions."
"So what should I have done?"
"You should have opened this." She tapped the binder. "This is a Runbook."
"A Runbook is a set of simple, dumb instructions written in peacetime to be read in wartime. It tells you exactly what to do, step by step."
She picked up a pen and wrote an example on the whiteboard:
Runbook: High Database CPU
- Check CloudWatch: Is CPU > 80%?
- Check Active Queries: Look for queries running longer than 5 minutes.
- Action: If a stuck query is found, KILL that specific query ID. Do NOT reboot.
- Escalate: If not resolved in 10 minutes, page the Senior Engineer.
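For readers who think in code, the whiteboard runbook above can be sketched as a single decision function. This is a minimal illustration, not Margaret's actual tooling: the names `ActiveQuery` and `next_runbook_action` are hypothetical, the inputs are passed in by hand (in practice they would come from CloudWatch and the database's list of running queries), and checking the 10-minute escalation before anything else is one reading of the runbook's last step.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the runbook on the whiteboard.
CPU_THRESHOLD_PERCENT = 80.0   # "Is CPU > 80%?"
LONG_QUERY_SECONDS = 5 * 60    # "running longer than 5 minutes"
ESCALATE_AFTER_MINUTES = 10    # "If not resolved in 10 minutes"


@dataclass
class ActiveQuery:
    """Hypothetical record for one running query (not a real AWS type)."""
    query_id: int
    runtime_seconds: float


def next_runbook_action(cpu_percent: float,
                        active_queries: List[ActiveQuery],
                        minutes_elapsed: float) -> str:
    """Return the next step of the 'High Database CPU' runbook as text."""
    # Step 1: confirm the alarm condition actually holds.
    if cpu_percent <= CPU_THRESHOLD_PERCENT:
        return "CPU is at or below 80%: this runbook does not apply"

    # Step 4 (assumption: checked early so escalation is never skipped).
    if minutes_elapsed >= ESCALATE_AFTER_MINUTES:
        return "ESCALATE: page the Senior Engineer"

    # Steps 2 and 3: find long-running queries and kill only the worst one.
    stuck = [q for q in active_queries if q.runtime_seconds > LONG_QUERY_SECONDS]
    if stuck:
        worst = max(stuck, key=lambda q: q.runtime_seconds)
        return f"KILL query {worst.query_id} (do NOT reboot)"

    return "No stuck query found: keep investigating, do NOT reboot"


queries = [ActiveQuery(41, 30), ActiveQuery(42, 900)]
print(next_runbook_action(95.0, queries, minutes_elapsed=2))
# prints: KILL query 42 (do NOT reboot)
```

Note that no branch ever returns "reboot the database". The point of writing the steps down in peacetime is that the dangerous options are simply not on the menu at 2:00 AM.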
"You do not think," Margaret emphasized. "You follow the script. Pilots use checklists when engines fail. Surgeons use checklists. Engineers must do the same."
"And most importantly," she added, "you communicate. You tell the team, 'I am investigating.' Do not try to fix it in silence."
The Blameless Post-Mortem (The Investigation)
Timothy looked at the binder. "Okay. I will write the steps. But... I still broke the system. I deployed a bad fix."
"We need to find out why," Margaret said. "We will hold a Post-Mortem."
Timothy flinched. "That sounds like an autopsy."
"It is. But in this library, we follow one rule: Blamelessness."
"We do not ask 'Who broke it?'" she explained. "We ask 'How did the system allow this to happen?' If you were able to deploy broken code at 2:00 AM, that is not your fault. That is the system's fault for letting you."
The 5 Whys (Finding the Root)
"So how do we fix the system?" Timothy asked.
"We use the 5 Whys," Margaret said. "Let's trace the fire back to the spark."
Why did the site go down?
Because the database CPU spiked to 100%.
Why did the CPU spike?
Because a new Lambda function sent too many heavy queries at once.
Why did the Lambda send heavy queries?
Because the code contained a loop that didn't close connections.
Why was that bad code deployed?
Because Timothy wrote a hotfix at 2:00 AM and bypassed the testing stage.
Why was he able to bypass testing?
Root Cause: Because the deployment pipeline does not force a test run before deploying to production.
Margaret circled the last line.
"We do not fire Timothy," she said. "We fix the pipeline so that no one—not even me—can deploy without passing tests first. We fix the guardrails."
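The guardrail Margaret describes, where tests must pass before anything ships, can be sketched in a few lines. This is only an illustration of the ordering, under stated assumptions: `gated_deploy` is a hypothetical helper, the test command and the deploy step are supplied by the caller, and a real guardrail would live in the deployment pipeline itself and in who holds deploy permissions, not in a script someone can choose to skip.

```python
import subprocess
import sys
from typing import Callable, Sequence


def gated_deploy(test_command: Sequence[str],
                 deploy: Callable[[], None]) -> bool:
    """Run the test command; call deploy() only if it exits with status 0.

    Returns True if the deployment ran, False if the tests blocked it.
    """
    # A non-zero exit code is treated as "tests failed".
    result = subprocess.run(test_command)
    if result.returncode != 0:
        print("Tests failed: deployment blocked.")
        return False

    # Tests passed, so the deploy step is allowed to run.
    deploy()
    return True
```

A caller might write `gated_deploy([sys.executable, "-m", "pytest"], deploy=push_to_production)`, where `push_to_production` is a placeholder for whatever actually ships the code and pytest is assumed to be the project's test runner. The design choice is that the deploy step is only reachable through the test step, so a 2:00 AM hotfix takes the same path as every other change.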
The Lesson
Timothy sat up straighter. The weight of failure was lifting, replaced by a list of tasks.
"I need to write a Runbook for 'High CPU'," he said. "And I need to lock down the deployment permissions."
"Precisely," Margaret smiled. "Reliability is not about being perfect, Timothy. Computers break. Humans make mistakes."
She handed him the binder.
"Reliability is about how calmly you handle the fire, and how much you learn from the ashes."
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.