The Secret Life of AWS: The Checklist (AWS Systems Manager & Runbooks)

 

Why your brain is a single point of failure at 3 AM, and how a Runbook is your solution





Part 36 of The Secret Life of AWS

It was 3:14 AM.

The vibration of the phone on the nightstand sounded like a jackhammer in the quiet room. Timothy fumbled for it, knocking a glass of water over in the process.

SMS: ALARM: "Checkout-Error-High" in us-east-1.

He groaned, rubbed his eyes, and opened his laptop. The screen brightness blinded him. He squinted at the AWS Console, his brain feeling like it was full of cotton.

"Okay," he whispered, typing with one finger. "Go to... logs. Which log group? Was it Checkout-Prod or Prod-Checkout?"

He clicked the wrong one. Empty.

He clicked back.

He tried to write a Logs Insights query (from Episode 33) but forgot the syntax. Was it filter @message or filter message?

Syntax Error.

By the time he finally found the error—a simple database timeout—it was 3:25 AM. The outage had lasted 11 minutes.

The next morning, Timothy looked like a zombie.

"I knew what to do," he told Margaret, clutching his coffee. "I just... couldn't remember how to do it in the dark."

"Of course you couldn't," Margaret said. "You were relying on your memory. And at 3:00 AM, your memory is a Single Point of Failure."

"We need to get the process out of your head," she said. "We need a Runbook."

The Document

Margaret navigated to AWS Systems Manager (SSM).

"A Runbook is a set of instructions," she explained. "But in the cloud, we don't write them on paper. We write them as code."

She clicked Documents -> Create document.
She named it: Restart-Checkout-Service.

"This is an SSM Document," she said. "It defines the exact steps to take when things go wrong. We can make it do anything: run a shell script, query logs, or restart a Lambda function."

She typed in a simple definition:

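# Automation runbook ("0.3" is the schema version used by Automation documents)
description: "Emergency Restart for Checkout Service"
schemaVersion: "0.3"
mainSteps:
  # Changing an environment variable makes Lambda launch fresh execution
  # environments, which acts as a restart.
  - name: "restart_function"
    action: "aws:executeAwsApi"
    inputs:
      Service: "lambda"
      Api: "UpdateFunctionConfiguration"
      FunctionName: "CheckoutFunction"
      # Caution: this call replaces the function's entire set of environment
      # variables, so any existing variables must be listed here as well.
      Environment:
        Variables:
          FORCE_RESTART: "{{ global:DATE_TIME }}"
  # Invoke a separate health-check function to confirm recovery.
  - name: "verify_health"
    action: "aws:invokeLambdaFunction"
    inputs:
      FunctionName: "HealthCheckFunction"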

"Now," Margaret said. "We don't just rely on the script. We add a Manual Approval step."

"Why?" Timothy asked.

"Because automation is powerful, but dangerous," she warned. "You don't want a robot restarting production without a human saying 'Yes' first. This gives you the speed of a script, but the safety of a pilot."

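Margaret's approval step is not shown in the definition above. A minimal sketch of what it might look like, using the built-in aws:approve action, follows. The step name, the SNS topic, the account ID, and the approver role are all placeholders invented for illustration; they are not from Margaret's actual runbook.

```yaml
mainSteps:
  # Hypothetical approval gate. In the full runbook it would sit
  # before the restart_function step.
  - name: "approve_restart"
    action: "aws:approve"
    inputs:
      # Assumption: placeholder SNS topic that notifies the on-call engineer.
      NotificationArn: "arn:aws:sns:us-east-1:111122223333:OnCallApprovals"
      Message: "Approve emergency restart of CheckoutFunction?"
      MinRequiredApprovals: 1
      # Assumption: placeholder IAM role allowed to approve.
      Approvers:
        - "arn:aws:iam::111122223333:role/OnCallEngineer"
```

With a step like this in place, the automation pauses until one of the listed approvers responds, and only then moves on to the restart.
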
The Test

"One more thing," Margaret added. "A runbook that hasn't been tested is just a wish."

She clicked Execute Automation.
The script ran. Green checks appeared next to every step.
Status: Success.

"Now, let's link this to your Alarm."

She went back to the Alarm from Episode 35. Under "Actions," she added a Systems Manager action that creates an OpsItem.
"Now, when the alarm fires, it will create a ticket that links directly to this Runbook."

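The same alarm action can also be written as code. The CloudFormation sketch below is one way it might look. The metric, threshold, severity, and category are assumptions chosen for illustration, since the story does not say how the Episode 35 alarm was configured. It covers only the OpsItem creation; associating the runbook with the OpsItem is a separate step that is not shown here.

```yaml
Resources:
  CheckoutErrorHighAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: Checkout-Error-High
      # Assumption: the alarm watches Lambda errors on CheckoutFunction.
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: CheckoutFunction
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 5            # Assumption: illustrative threshold only.
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        # OpsItem action ARN: severity 2, category Availability (both assumed).
        - !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:opsitem:2#CATEGORY=Availability"
```

Keeping the alarm in a template means the link between the alarm and OpsCenter lives in version control rather than in someone's memory of a console setting.
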
The Next Night

Two nights later. 3 AM.

Bzzzt.

Timothy woke up. He didn't panic. He didn't try to remember log group names.

He opened the alert on his phone. It had a link: View OpsItem.

He clicked it.
The screen showed the error.
Right below it was a button: 
Execute Runbook: Restart-Checkout-Service.

Timothy looked at the button.
He clicked it, and when the approval prompt appeared, he clicked Approve.

The bar turned green.
Recovery Complete.

Time elapsed: 45 seconds.

Timothy put his phone down, rolled over, and went back to sleep.


Key Concepts

  • AWS Systems Manager (SSM) Documents: Templates that define a sequence of actions to perform on your managed instances or AWS resources. They are "executable documentation."
  • Runbook: A documented set of routine procedures and operations that a system administrator or operator carries out. In Systems Manager, a runbook is an SSM Document that Automation can execute.
  • Cognitive Offloading: Reducing mental workload by using external tools (like checklists or scripts) so you don't have to "think" during a crisis.
  • MTTR (Mean Time To Resolution): The average time it takes to fix a broken system. Runbooks can lower this number drastically: in Timothy's case, from 11 minutes by hand to 45 seconds with the runbook.

Timothy has learned the golden rule of Operations:

Don't think at 3:00 AM. Just follow the checklist.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
