The Secret Life of AWS: The Watchtower

 

You cannot manage what you cannot measure: A guide to Metrics, Logs, and Alarms.





Part 12 of The Secret Life of AWS

Timothy leaned back in his chair, feet up on the desk. The room was quiet.

"I am done," he announced. "The system is built. The functions trigger, the state machine coordinates, and the database stores. I can finally relax."

Margaret walked over and tapped a stack of papers on his desk. "Is the system working, Timothy?"

"Of course," Timothy said. "I tested it yesterday."

"Is it working right now?" Margaret asked. "Is the database CPU high? Did the last three payments fail? Is the inventory function timing out?"

Timothy hesitated. "I... I assume so. No one has complained."

"Hope is not a strategy," Margaret said, pulling him out of the chair. "You cannot manage what you cannot measure."

She led him to the chalkboard. "Today, we discuss Amazon CloudWatch. It is the eyes and ears of your infrastructure."

Metrics (The Numbers)

Margaret drew a simple graph on the board—a line trending upwards.

"First," she said, "we have Metrics. These are numerical data points that AWS collects automatically over time."

She listed a few examples:

  • CPUUtilization: How hard is the server working?
  • Invocations: How many times did your function run?
  • Duration: How long did the function take to finish?

"Think of this as the pulse," Margaret explained. "It tells you that something is happening, and how much of it is happening. If your 'Duration' metric suddenly jumps from 200ms to 10 seconds, you know you have a performance problem."
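To make the "pulse" concrete, here is a minimal sketch of how the Duration metric for one Lambda function could be requested and scanned for slow spots. It only builds the arguments for CloudWatch's GetMetricStatistics call; the function name `inventory-function`, the one-second limit, and both helper names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def duration_query(function_name, end_time, minutes=60):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",  # Lambda reports Duration in milliseconds
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }


def slow_datapoints(datapoints, limit_ms=1000.0):
    """Return datapoints whose Maximum exceeds limit_ms, oldest first."""
    slow = [p for p in datapoints if p["Maximum"] > limit_ms]
    return sorted(slow, key=lambda p: p["Timestamp"])


# Hypothetical function name and time window, for illustration only.
params = duration_query(
    "inventory-function",
    datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_statistics(**params)
#   print(slow_datapoints(response["Datapoints"]))
```

Keeping the request as a plain dictionary means it can be inspected or tested without touching an AWS account.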

"So I have to check these graphs every minute?" Timothy asked.

"No," Margaret said. "That is what Alarms are for."

Alarms (The Threshold)

Margaret drew a red horizontal line across the top of the graph.

"We define a Threshold," she said. "For example, 'If CPU Utilization goes above 80% for 5 minutes.' If the metric stays across that line for the whole period, CloudWatch moves the Alarm into the ALARM state."

"What does the alarm do?"

"It takes action," Margaret said. "It can send you an email via Amazon SNS (Simple Notification Service). It can trigger an Auto Scaling policy to add more servers. It can even reboot an EC2 instance."

"So the system tells me when it is broken," Timothy realized.

"Exactly. You do not watch the graph. You wait for the alarm."
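Margaret's example threshold, "above 80% for 5 minutes", could be written down roughly as follows. This is a sketch that builds the arguments for CloudWatch's PutMetricAlarm call; the instance ID, the SNS topic ARN, the alarm name, and the helper name `cpu_alarm` are placeholders invented for illustration.

```python
def cpu_alarm(instance_id, topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # evaluate the 5-minute average...
        "EvaluationPeriods": 1,   # ...over a single period
        "Threshold": 80.0,        # percent
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # notify this SNS topic on ALARM
    }


# Hypothetical identifiers, for illustration only.
alarm = cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Subscribing an email address to the SNS topic is what turns the alarm into a message in Timothy's inbox.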

Logs (The Diagnosis)

"But wait," Timothy said. "The Alarm tells me that something is wrong. It doesn't tell me why."

"Correct," Margaret said. "If your Lambda function fails, the Metric just says 'Errors: 1'. It does not tell you that you had a typo in your variable name."

She moved to the other side of the board and drew a series of text lines.

"For the 'Why', we need Logs."

"CloudWatch Logs captures the raw text output from your applications," she explained. "When your code crashes, the reason it prints ends up here, for example: Error: Connection Refused at 10.0.0.5."

"So when the Alarm rings..."

"You go to the Logs," Margaret finished. "You check the timestamp of the alarm and read the error message. The Metric is the symptom; the Log is the diagnosis."
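Going from an alarm's timestamp to the matching log lines might look like the sketch below. It builds the arguments for the CloudWatch Logs FilterLogEvents call, searching a few minutes on either side of the alarm. The log group name, the five-minute window, and the helper name `log_search` are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def log_search(log_group, alarm_time, window_minutes=5, pattern="Error"):
    """Build keyword arguments for the CloudWatch Logs FilterLogEvents call."""
    start = alarm_time - timedelta(minutes=window_minutes)
    end = alarm_time + timedelta(minutes=window_minutes)
    return {
        "logGroupName": log_group,
        # CloudWatch Logs expects milliseconds since the Unix epoch.
        "startTime": int(start.timestamp() * 1000),
        "endTime": int(end.timestamp() * 1000),
        "filterPattern": pattern,  # keep only events containing this term
    }


# Hypothetical log group and alarm time, for illustration only.
alarm_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
query = log_search("/aws/lambda/inventory-function", alarm_time)

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   events = boto3.client("logs").filter_log_events(**query)["events"]
#   for event in events:
#       print(event["timestamp"], event["message"])
```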

"Just be careful," she added. "By default, a log group keeps its logs forever, and storing logs forever is expensive. Set a Retention Policy to delete them after 30 days, or you will pay for history you do not need."
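The 30-day rule is a single setting per log group. A small sketch, assuming a hypothetical log group name and a helper called `retention_policy`, of the arguments for the CloudWatch Logs PutRetentionPolicy call:

```python
def retention_policy(log_group, days=30):
    """Build keyword arguments for the CloudWatch Logs PutRetentionPolicy call."""
    # CloudWatch Logs accepts only certain retention values; 30 is one of them.
    return {"logGroupName": log_group, "retentionInDays": days}


# Hypothetical log group name, for illustration only.
policy = retention_policy("/aws/lambda/inventory-function")

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("logs").put_retention_policy(**policy)
```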

Dashboards (The Single View)

Timothy looked at the board. "So I have Metrics in one place, Alarms in another, and Logs in a third."

"That can be scattered," Margaret admitted. "That is why we build Dashboards."

She drew a large rectangle containing several small graphs and a list of text.

"A Dashboard is a custom view. You can put your database CPU, your Lambda error rate, and your most recent log errors all on one screen. It gives you a Single Pane of Glass to see the health of your entire system at a glance."
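A dashboard is defined as a JSON document listing widgets. The sketch below assembles one with the three items Margaret names. It assumes the database is an Amazon RDS instance (the story does not say which database is in use), and the region, instance name, function name, dashboard name, and helper name `dashboard_body` are all placeholders.

```python
import json


def dashboard_body(region, db_instance, function_name):
    """Return a CloudWatch dashboard body (a JSON string) with three widgets."""
    widgets = [
        {   # Database CPU, assuming an RDS instance
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Database CPU", "region": region,
                "stat": "Average", "period": 300,
                "metrics": [["AWS/RDS", "CPUUtilization",
                             "DBInstanceIdentifier", db_instance]],
            },
        },
        {   # Lambda errors per 5 minutes
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Lambda errors", "region": region,
                "stat": "Sum", "period": 300,
                "metrics": [["AWS/Lambda", "Errors",
                             "FunctionName", function_name]],
            },
        },
        {   # Most recent log lines containing "Error"
            "type": "log", "x": 0, "y": 6, "width": 24, "height": 6,
            "properties": {
                "title": "Recent log errors", "region": region, "view": "table",
                "query": (
                    f"SOURCE '/aws/lambda/{function_name}' "
                    "| fields @timestamp, @message "
                    "| filter @message like /Error/ "
                    "| sort @timestamp desc | limit 20"
                ),
            },
        },
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical names, for illustration only.
body = dashboard_body("us-east-1", "orders-db", "inventory-function")

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="morning-health", DashboardBody=body)
```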

The Lesson

Timothy looked at the empty room. It was still quiet, but now he understood that silence didn't mean success.

"I need to set up an Alarm for 'Payment Failures'," Timothy said. "If that happens, I need to know instantly."

"Precisely," Margaret smiled. "And create a Dashboard for the morning. I do not want to ask you if the system is working, Timothy. I want you to point to the screen and show me."

She handed him the chalk.

"Now, go define your thresholds. A silent system is only good if you know it is actually running."


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
