The War Room: When One Customer Nearly Broke the Internet

Inside the room where the internet gets saved. A dramatized account of the four-hour congestion crisis between Cloudflare's edge and AWS us-east-1.


Aaron Rose
Software Engineer & Technology Writer


The War Room: Inside Cloudflare's August 21 Crisis

The technical report from Cloudflare's August 21, 2025, network incident is a masterclass in post-mortem analysis. It details the cascade of events that caused severe congestion between their infrastructure and AWS us-east-1, affecting thousands of customers for four hours. But what the report doesn't capture is the human story—the fear, the controlled chaos, and the split-second decisions made inside the war room. This is my imagined account of what that day might have been like.


5:22 PM UTC - Conference Room Zulu

The war room at Cloudflare's San Francisco headquarters buzzed with the nervous energy of controlled chaos. Multiple monitors displayed cascading graphs, network topology diagrams, and real-time metrics that painted an increasingly alarming picture. Harvey Chen, Principal Engineering Manager and today's Incident Commander, stood at the head of the table like a general surveying a battlefield.

"Alright people, let's cut through the noise," Harvey announced, his voice carrying that particular brand of confidence that only comes from having weathered dozens of these storms. "We don't have problems here—we have solutions we haven't implemented yet. Talk to me."

At his laptop, Mike Rodriguez, a 26-year-old Site Reliability Engineer just eight months into the job, was locked onto his monitors. His fingers flew across the keyboard in a flurry of motion, yet his expression held the calm focus of a gamer in a high-stakes tournament. He lived and breathed packet flows.

"Harvey, this is unlike anything I've seen," Mike said, not looking up from his screen. "At 4:27 PM we saw traffic from AWS us-east-1 double—no, triple—in the span of minutes. But it's not distributed traffic. This is concentrated. One customer, pulling massive amounts of cached data."

"Define massive," Harvey said, walking over to peer at Mike's monitor.

"We're talking terabytes per minute. They're requesting objects from our cache like they're trying to download the entire internet." Mike's cursor traced the spike on his graph. "Our direct peering links with AWS are completely saturated. Queue depths are through the roof."

Donna Kim, Senior Operations Manager and the unofficial nerve center of any crisis, looked up from her three phones and two laptops. She had the uncanny ability to manage executive communications, customer escalations, and vendor coordination simultaneously while still being two steps ahead of everyone else in the room.

"I've already got AWS on standby," she said before anyone could ask. "Their network team is seeing the same congestion we are. Oh, and Harvey? Marketing wants to know if they should prepare a statement."

"Tell Marketing to prepare for success, not failure," Harvey replied smoothly. "We're going to fix this."

From the corner of the room, Louis Steinberg, Senior Infrastructure Engineer, was hunched over his workstation running what appeared to be seventeen different diagnostic scripts. His three monitors showed packet traces, BGP route tables, and network latency heatmaps. Louis was brilliant, but he had the unfortunate habit of catastrophizing during incidents.

"This is bad, Harvey. This is really, really bad," Louis muttered, running his hands through his hair. "I'm seeing packet loss across multiple peering points. And look—" He pointed at his screen with a trembling finger. "The DCI link to our offsite interconnect is running at 95% capacity. We're one failed component away from complete service degradation."

Harvey walked over and put a reassuring hand on Louis's shoulder. "Louis, breathe. What's the actual customer impact?"

"Latency is through the roof. Timeouts are spiking. Customers with origins in AWS us-east-1 are basically dead in the water," Louis replied, pulling up another dashboard. "Our SLO metrics are in free fall."

Mike suddenly straightened up, his eyes widening as he stared at his screen. "Guys... AWS just started withdrawing BGP prefixes. They're trying to reroute traffic away from the congested links."

"That's good, right?" Louis asked hopefully.

"No," Mike said, his voice dropping. "That's very bad." His eyes met Louis's. "They're pushing all that traffic onto our secondary paths through the offsite interconnect. Louis, what's the capacity on that DCI link?"

Louis's face went pale. "It's... it's scheduled for an upgrade next month. Current capacity is nowhere near what we need if AWS dumps all this traffic on us."

Harvey's jaw tightened slightly—the only outward sign that he was processing the gravity of the situation. In the span of thirty seconds, their incident had gone from "challenging" to "potentially catastrophic."

"Donna, get me a direct line to AWS's incident commander. Not a support ticket, not a regular channel. I want to talk to whoever is making the BGP decisions right now."

"Already dialing," Donna replied, phone pressed to her ear.

Mike was frantically typing, his screen now showing traffic flows that looked like a digital heart attack. "Harvey, the prefix withdrawals are making everything worse. We've got nowhere to send this traffic. It's like AWS just closed half the exits on a freeway during rush hour."

"Options, Mike. I need options."

"We could rate-limit the customer causing this," Mike suggested. "But that's going to take time to implement safely. Or we could try to engineer traffic to less congested paths, but with AWS pulling prefixes..."

"We're in a box," Louis said, his voice rising. "They're withdrawing routes faster than we can adapt. Our queues are melting down. I'm seeing drops on critical customer traffic."

Harvey stood in the center of the room, synthesizing information at lightning speed. Four years of managing incidents had taught him that panic was contagious, but so was confidence.

"Here's what we're going to do," he announced. "Mike, start working on rate-limiting for this customer immediately. I don't care if it's elegant—I care if it works. Louis, I need you to identify every available path to AWS that isn't completely saturated. Get creative."

"Harvey," Donna interrupted, holding up her phone. "AWS incident commander on line one. Their name is Sarah Wood, Senior Network Engineer."

Harvey took the phone. "Sarah, this is Harvey Chen at Cloudflare. Your BGP withdrawals are creating a traffic jam on our end. We need to coordinate before this gets worse."

Through the speakerphone, Sarah's voice was tense but professional. "Harvey, we're seeing complete saturation on our direct interconnects with you. We're trying to distribute the load, but—"

"But you're pushing traffic onto paths that can't handle it," Harvey finished. "We need to work together here. Can you hold off on further withdrawals while we implement customer rate limiting?"

There was a pause. "How long do you need?"

Harvey looked at Mike, who held up both hands—ten fingers.

"Ten minutes," Harvey said.

"You've got it. But Harvey, if this doesn't work, we're going to have to take more aggressive action."

Harvey hung up and turned to his team. "You heard the woman. Ten minutes. Mike, where are we on rate limiting?"

Mike's fingers were flying across his keyboard. "I'm building the config now. This customer is pulling from hundreds of different cache keys, but I can create a pattern match based on their IP ranges and request signatures."

"Louis, backup paths?"

Louis was deep in a routing table that looked like digital spaghetti. "There's a path through our Chicago PoP that has capacity, but it's going to add latency. And there's some dark fiber to New York that we could light up, but—"

"Do it. All of it. Donna, customer communications?"

"Status page is updated, support team is briefed, and I'm drafting targeted notifications for affected customers," Donna replied without missing a beat. "Also, your boss wants an update."

"Tell him we're handling it."

The next few minutes were a carefully orchestrated dance. Mike deployed his rate-limiting rules while monitoring their impact in real time. Louis worked with the network operations center to activate backup paths and rebalance traffic flows. Donna juggled communications with AWS, internal stakeholders, and increasingly anxious customers.
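If you're wondering what a "pattern match based on their IP ranges and request signatures" might actually look like, here is a minimal, purely illustrative sketch. It is not Cloudflare's tooling, and every concrete value in it is invented for the example: the prefixes, the path signature, and the byte rate are hypothetical, and a real edge would enforce this in the request-handling path rather than in a standalone script.

```python
# Illustrative only: a token-bucket throttle applied just to traffic that
# matches a suspect client signature (source prefix + request pattern).
# All prefixes, paths, and limits below are hypothetical.
import ipaddress
import time

SUSPECT_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]  # example customer ranges
SUSPECT_PATH_PREFIX = "/bulk-export/"                        # example request signature

RATE_BYTES_PER_SEC = 500 * 1024 * 1024    # allow ~500 MB/s for this client
BUCKET_CAPACITY = 2 * RATE_BYTES_PER_SEC  # tolerate brief bursts

_tokens = BUCKET_CAPACITY
_last_refill = time.monotonic()

def _matches_suspect(src_ip: str, path: str) -> bool:
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in SUSPECT_PREFIXES) and path.startswith(SUSPECT_PATH_PREFIX)

def allow(src_ip: str, path: str, response_bytes: int) -> bool:
    """Return True if the response may be served now, False if it should be throttled."""
    global _tokens, _last_refill
    if not _matches_suspect(src_ip, path):
        return True  # everyone else is unaffected
    now = time.monotonic()
    _tokens = min(BUCKET_CAPACITY, _tokens + (now - _last_refill) * RATE_BYTES_PER_SEC)
    _last_refill = now
    if _tokens >= response_bytes:
        _tokens -= response_bytes
        return True
    return False  # throttle: delay the response or return an HTTP 429
```

In a real deployment the bucket would likely be keyed per account or per ASN rather than held in module-level globals, but the essential point is the same one Harvey cared about: only the matching traffic takes the hit, and it can ship in minutes.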

At 7:05 PM UTC, Mike's rate limiting began to take effect.

"Traffic's dropping," he announced. "The customer's requests are being throttled. Queue depths are starting to normalize."

Louis looked up from his routing configurations. "Packet drops are decreasing. Latency is still high, but it's trending in the right direction."

Harvey allowed himself a small smile. "Donna, tell Sarah at AWS that we're seeing improvement. They can start reverting their BGP changes."

But they weren't done yet. As AWS began re-advertising prefixes, the team had to carefully manage the transition to prevent traffic from swinging wildly between paths. It was like performing surgery on a patient who was running a marathon.

"Easy does it," Harvey murmured as they watched traffic patterns slowly stabilize. "Like landing a plane in a storm."

By 8:18 PM UTC, nearly four hours after the incident began, normal service had been restored. The war room, which had been buzzing with controlled urgency, finally began to quiet.

Harvey looked around at his team. Mike was slumped over his laptop, exhausted but victorious. Louis was still monitoring metrics with the intensity of someone watching a pot that might boil over at any moment. Donna was on her phone with the communications team, already planning the post-incident customer outreach.

"Alright, people," Harvey said, loosening his collar slightly. "We just handled what could have been a career-defining disaster. Mike, that rate limiting implementation was textbook. Louis, your traffic engineering kept us from losing half the internet. Donna, you managed more moving pieces than a chess grandmaster."

Mike looked up with a tired grin. "So what's our next move, Harvey?"

"Our next move is learning from this," Harvey replied. "We're going to build better safeguards, upgrade our infrastructure, and make sure no single customer can ever hold the internet hostage again."

Donna was already taking notes. "Post-incident review meeting scheduled for tomorrow at 9 AM. I'll send out the calendar invite."

Louis finally leaned back in his chair. "You know what the scariest part of this was? How fast it escalated. One minute we're looking at unusual traffic, the next minute we're coordinating with AWS to prevent an internet-scale outage."

"That's why we do what we do," Harvey said. "Anyone can keep the lights on when everything's working. We keep them on when everything's falling apart."

As the team began packing up their laptops and heading home, Harvey remained in the war room for a few more minutes, studying the incident timeline they'd just lived through. Tomorrow would bring detailed root cause analysis, infrastructure planning, and process improvements. But tonight, they'd earned the right to feel proud.

They'd just saved the internet. Again.


The actual incident involved complex interactions between Cloudflare's edge infrastructure, AWS's network engineering decisions, and one customer's unusual traffic patterns. While this narrative is imagined, it's grounded in the technical realities described in Cloudflare's detailed incident report. The real heroes of this story are the engineers who work around the clock to keep our digital infrastructure running, often under immense pressure and with little public recognition.


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of The Rose Theory series on math and physics.
