The Secret Life of AWS: The Night Watchman (CloudWatch Alarms & SNS)

 


AWS Alerting 101: A Guide to CloudWatch Alarms and SNS Text Messages





Part 35 of The Secret Life of AWS

Timothy loved his new AWS CloudWatch dashboard (Part 34). In fact, he loved it a little too much.

He was sitting at his desk, staring intently at the "Engine Gauge" (Lambda Concurrency). He took a bite of his sandwich, his eyes never leaving the screen.

"Timothy," Margaret said, pausing at his desk. "Why are you eating lunch here? It is a beautiful day outside."

"I can't leave, Margaret," Timothy said, chewing anxiously. "What if the Checkout Function fails? What if the database latency spikes? If I'm not here to see the dashboard turn red, the customers will be furious."

Margaret sighed. "Timothy, the dashboard is for diagnosis, not for surveillance."

"Do you stare at your ceiling all night and wonder if your smoke detector works?" she asked.

"No," Timothy said. "I know my smoke detector works."

"Exactly," Margaret smiled. "You sleep. The detector watches. And if there is smoke, it lets you know."

"We need to install a smoke detector for your application. We need a Night Watchman."

The Alarm (The Eyes)

Margaret navigated to the CloudWatch console and clicked Alarms.

"An Alarm is a simple rule," she explained. "It watches a single metric, and if that metric crosses a line you draw, the alarm changes state."

She clicked Create Alarm.

  1. Select Metric: She chose CheckoutFunction -> Errors.
  2. Statistic: Sum. (We want to know the total number of errors).
  3. Period: 1 minute.

"Now, we define the line," she said.

Threshold: Greater than or equal to 1.

"This means," Margaret said, "if even one user gets an error in a 1-minute period, the alarm goes into the ALARM state."
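For readers who prefer code to console clicks, the same rule can be sketched with boto3's `put_metric_alarm`. This is a minimal sketch, not the exact alarm Margaret built: the alarm name is borrowed from the text message later in the story, and `EvaluationPeriods=1` and `TreatMissingData="notBreaching"` are assumptions (the story does not state them). The parameters are built as a plain dictionary so they can be inspected without an AWS account.

```python
def build_error_alarm(function_name, topic_arn=None):
    """Return keyword arguments for CloudWatch's PutMetricAlarm call."""
    params = {
        "AlarmName": "Checkout-Error-High",
        "Namespace": "AWS/Lambda",            # Lambda publishes its metrics here
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",                   # total errors in the period
        "Period": 60,                         # 1 minute, in seconds
        "EvaluationPeriods": 1,               # assumption: alarm on a single bad minute
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",   # assumption: no invocations means no errors
    }
    if topic_arn:
        # The "voice": where the alarm sends its notification.
        params["AlarmActions"] = [topic_arn]
    return params


def create_alarm(params):
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Calling `create_alarm(build_error_alarm("CheckoutFunction"))` would register the alarm without any notification target; passing a topic ARN as the second argument attaches one.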

"Okay," Timothy said. "So the alarm turns red in the console. But if I'm at lunch, I still won't see it."

"That is why the Alarm needs a voice," Margaret said.

The Notification (The Voice)

She scrolled down to the Notification section.
"When this alarm triggers," the console read, "send a message to..."

She clicked Create new topic.
She named it: Critical-System-Alerts.

"This is Amazon SNS (Simple Notification Service)," Margaret explained. "Think of it as a megaphone. The Alarm whispers into the megaphone, and the megaphone shouts to everyone on the list."

She typed Timothy's email address into the box: timothy@example.com.
Then, she added his phone number for an SMS text message.
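The topic and its two subscribers can also be set up in code. The sketch below assumes an SNS client shaped like `boto3.client("sns")` is passed in; the helper name `create_alert_topic` and the phone number in the usage comment are hypothetical placeholders. Note that an email subscriber has to click a confirmation link before SNS delivers anything to that address.

```python
def create_alert_topic(sns, topic_name, email, phone_number):
    """Create (or reuse) an SNS topic and subscribe an email and an SMS endpoint.

    `sns` is any object exposing create_topic/subscribe the way
    boto3.client("sns") does.
    """
    # create_topic returns the existing topic's ARN if the name is already taken.
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]

    # Each subscriber is a (protocol, endpoint) pair attached to the topic.
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint=phone_number)
    return topic_arn


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   arn = create_alert_topic(
#       boto3.client("sns"),
#       "Critical-System-Alerts",
#       "timothy@example.com",
#       "+15555550100",  # placeholder number in E.164 format
#   )
```

The returned ARN is what the alarm's notification action points at.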

"Pro Tip," she added. "In a real company, we wouldn't just text you. We would point this megaphone at Slack, PagerDuty, or a ticketing system so the whole team knows."
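One common way to point the megaphone at a team tool is to subscribe a small Lambda function to the topic and let it reformat the alert. The sketch below is an assumption about how that might look, not something from the story: it reads the JSON that CloudWatch publishes to SNS (fields such as `AlarmName`, `NewStateValue`, `NewStateReason`, and `Region`) and builds a one-line summary. Actually posting that line to a chat or paging webhook is left out, since the URL and payload format depend on the tool.

```python
import json


def format_alarm(event):
    """Turn an SNS-delivered CloudWatch alarm event into one text line per record."""
    lines = []
    for record in event.get("Records", []):
        # SNS wraps the alarm details as a JSON string in the Message field.
        alarm = json.loads(record["Sns"]["Message"])
        lines.append(
            f'{alarm["NewStateValue"]}: "{alarm["AlarmName"]}" '
            f'in {alarm["Region"]}. {alarm["NewStateReason"]}'
        )
    return lines


def lambda_handler(event, context):
    """Lambda entry point: summarize each alarm record."""
    lines = format_alarm(event)
    for line in lines:
        # A real handler would send this line to the team's chat or paging tool.
        print(line)
    return {"forwarded": len(lines)}
```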

She hit Create Alarm.

The Test

"Now," Margaret said, "go to lunch."

Timothy hesitated. "Really?"

"Go," she ordered. "I will break the system while you are gone."

Timothy walked to the breakroom. He heated up his coffee. He looked out the window. He felt a strange vibration in his pocket.

Bzzzt.

He pulled out his phone.
Text message: ALARM: "Checkout-Error-High" in US-East-1. Threshold crossed: 1 >= 1.

Timothy ran back to his desk. Margaret was standing there, smiling.

"I manually triggered an error," she admitted. "And look."

She pointed to his phone. "The cloud tapped you on the shoulder."

Timothy looked at the message. It was concise. It was immediate. And most importantly, it allowed him to look away.

"This is called being On-Call," Margaret said. "It sounds stressful, but it is actually liberating. It means you don't have to worry about the system unless your phone buzzes."

Timothy put his phone in his pocket and picked up his sandwich.

"I think I'll go eat this outside," he said.

"Good idea," Margaret replied. "The Watchman has the shift."


Key Concepts

  • Amazon CloudWatch Alarms: A feature that watches a single metric over a specific time period and performs one or more actions based on the value of the metric relative to a threshold.
  • Amazon SNS (Simple Notification Service): A fully managed messaging service. In this context, it acts as the "Pub/Sub" system that takes the alert from CloudWatch and "Publishes" it to "Subscribers" (such as email addresses, SMS numbers, or integrations that forward to tools like Slack and PagerDuty).
  • Alarm States:
      • OK: Everything is fine.
      • ALARM: The metric has breached the threshold.
      • INSUFFICIENT_DATA: The Watchman can't see clearly (e.g., no data is coming in). This is often an alarm in itself!

  • Interrupt vs. Polling: A fundamental shift in operations. Instead of "Polling" (checking the dashboard constantly), you rely on "Interrupts" (being notified only when necessary).
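The three alarm states and the interrupt idea can be sketched together as a toy model. This is a deliberate simplification, not CloudWatch's actual evaluation logic (which also handles multi-period evaluation and configurable missing-data treatment); the function names `evaluate` and `watch` are hypothetical. Each period is a list of error counts, and the notifier is called only when the state changes.

```python
def evaluate(datapoints, threshold=1.0):
    """Classify one period's datapoints into an alarm state."""
    if not datapoints:
        return "INSUFFICIENT_DATA"  # nothing reported, so the watchman cannot judge
    return "ALARM" if sum(datapoints) >= threshold else "OK"


def watch(periods, notify, threshold=1.0):
    """Walk through periods in order and notify only on state transitions."""
    state = "INSUFFICIENT_DATA"  # a new alarm starts here until data arrives
    for datapoints in periods:
        new_state = evaluate(datapoints, threshold)
        if new_state != state:
            notify(state, new_state)  # the "interrupt": fired only on a change
            state = new_state
    return state


if __name__ == "__main__":
    # Five one-minute periods: quiet, one error, two errors, no data, quiet.
    history = [[0, 0], [0, 1], [2], [], [0]]
    watch(history, lambda old, new: print(f"{old} -> {new}"))
```

Nobody has to look at the third period at all: the alarm was already in ALARM, so no new message is sent until something changes.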


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
