The Rate Limiting Cascade: Lessons from Cloudflare's August 21 Incident
Aaron Rose
Software Engineer & Technology Writer
On August 21, 2025, a single customer's traffic pattern triggered a four-hour outage that affected thousands of users accessing AWS us-east-1 through Cloudflare. This wasn't a DDoS attack, a BGP hijack, or a hardware failure. It was something far more insidious: a rate limiting cascade failure that exposed fundamental gaps in how we architect resilience across interconnected systems.
The incident serves as a masterclass in how local problems become global ones when proper safeguards are missing at multiple layers. More importantly, it demonstrates why rate limiting can't be an afterthought—it must be a first-class architectural concern that spans the entire request lifecycle.
The Anatomy of a Cascade Failure
The Trigger: Runaway Customer Traffic
At 16:27 UTC, a single Cloudflare customer began requesting cached objects stored in AWS us-east-1 at an enormous rate. The scale was unprecedented: terabytes per minute of response traffic, equivalent to downloading the entire Wikipedia database every few minutes.
This wasn't malicious—it was likely a misconfigured batch job, a runaway script, or an application with aggressive retry logic that went haywire. But the impact was the same: legitimate traffic that overwhelmed the infrastructure.
The First Failure: The Customer's Application
The customer's application lacked proper rate limiting and backoff mechanisms. In distributed systems, this is the equivalent of driving without brakes. Every well-architected application should implement:
# Example: Exponential backoff with jitter
import time
import random

import requests


class RateLimitedClient:
    def __init__(self, max_requests_per_second=10):
        self.max_rps = max_requests_per_second
        self.last_request_time = 0
        self.retry_count = 0

    def make_request(self, url):
        # Rate limiting: enforce a minimum interval between requests
        now = time.time()
        time_since_last = now - self.last_request_time
        min_interval = 1.0 / self.max_rps
        if time_since_last < min_interval:
            time.sleep(min_interval - time_since_last)
        self.last_request_time = time.time()

        try:
            response = requests.get(url, timeout=30)
            self.retry_count = 0  # Reset on success
            return response
        except requests.exceptions.RequestException:
            # Exponential backoff with jitter, capped at 300 seconds
            delay = min(300, (2 ** self.retry_count) + random.uniform(0, 1))
            time.sleep(delay)
            self.retry_count += 1
            raise
The Amplification: AWS's Egress Saturation
When the customer's requests hit AWS, they generated massive response traffic that completely saturated the direct peering connections between AWS and Cloudflare. Think of it as trying to empty a swimming pool through a garden hose—the infrastructure simply wasn't designed for this volume.
The Second Failure: AWS's Egress Controls
AWS lacked sufficient egress rate limiting to prevent a single customer from monopolizing shared network resources. In a multi-tenant environment, this is a critical oversight.
AWS's response was technically sound but operationally disastrous: they began withdrawing BGP prefixes from the congested peering connections, attempting to reroute traffic to less congested paths.
# BGP prefix withdrawal - what AWS likely did (illustrative config)
router bgp 65000
 neighbor 192.0.2.1 route-map EMERGENCY_WITHDRAW out
!
route-map EMERGENCY_WITHDRAW deny 10
 match ip address prefix-list CONGESTED_PREFIXES
!
route-map EMERGENCY_WITHDRAW permit 20
 # All other prefixes continue to be advertised
This seemed logical—if direct paths are congested, use indirect ones. But it created a new problem: the indirect paths had even less capacity.
The Cascade: Traffic Engineering Gone Wrong
AWS's BGP withdrawals pushed traffic onto Cloudflare's secondary paths through an offsite network interconnection switch. This Data Center Interconnect (DCI) link was already scheduled for a capacity upgrade and couldn't handle the sudden influx.
The Third Failure: Cloudflare's Traffic Isolation
Cloudflare lacked per-customer traffic budgets and automatic throttling mechanisms. When AWS rerouted the traffic, Cloudflare had no way to preferentially drop or throttle the problematic customer's requests.
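One way to close that gap is preferential shedding: when a shared link approaches saturation, throttle the heaviest contributors first instead of degrading every customer equally. Below is a minimal sketch, assuming per-customer rates arrive from flow telemetry; the class name and the numbers are illustrative, not Cloudflare's actual implementation.

# Sketch: shed the heaviest customers first when a shared link saturates
class PreferentialThrottler:
    def __init__(self, link_capacity_gbps, high_watermark=0.85):
        self.link_capacity = link_capacity_gbps
        self.high_watermark = high_watermark

    def customers_to_throttle(self, per_customer_gbps):
        """Return customers to throttle, heaviest first, until below the watermark."""
        total = sum(per_customer_gbps.values())
        target = self.link_capacity * self.high_watermark
        throttled = []
        # Walk customers from heaviest to lightest, shedding until under target
        for customer, rate in sorted(per_customer_gbps.items(), key=lambda kv: -kv[1]):
            if total <= target:
                break
            throttled.append(customer)
            total -= rate
        return throttled


# Usage with hypothetical telemetry: one runaway customer on an 80 Gbps link
throttler = PreferentialThrottler(link_capacity_gbps=80)
print(throttler.customers_to_throttle({"cust-a": 70, "cust-b": 10, "cust-c": 5}))
# -> ['cust-a']: only the runaway customer is throttled; everyone else is untouched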
The result was a cascade failure:
1. The customer generates excessive traffic
2. AWS's direct links saturate
3. AWS withdraws BGP prefixes to reroute traffic
4. The alternative paths also saturate
5. Packet loss and latency spike for all customers
The Missing Safeguards: A Multi-Layer Analysis
This incident revealed gaps at every layer of the infrastructure stack. Let's examine what should have been in place:
Layer 1: Application-Level Rate Limiting
Every application making requests to external services should implement rate limiting and circuit breaker patterns:
# Circuit breaker implementation
from enum import Enum
import time


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is open")

        try:
            result = func(*args, **kwargs)
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise e
Layer 2: Cloud Provider Egress Controls
AWS should have implemented per-customer egress rate limiting to prevent any single tenant from saturating shared network resources:
Conceptual AWS Config Rule for Egress Monitoring
# Example: AWS Config rule for egress monitoring
EgressRateLimitRule:
  Type: AWS::Config::ConfigRule
  Properties:
    Source:
      Owner: AWS
      SourceIdentifier: EC2_INSTANCE_DETAILED_MONITORING_ENABLED
    MaximumExecutionFrequency: TwentyFour_Hours
    InputParameters:
      # Custom logic to monitor egress rates per customer
      MaxEgressRateMbps: "1000"      # 1 Gbps per customer
      AlertThresholdPercent: "80"
Better yet, implement dynamic traffic shaping:
Dynamic Egress Rate Limiter Implementation
# Conceptual egress rate limiter
class EgressRateLimiter:
    def __init__(self, customer_id, base_limit_mbps=100):
        self.customer_id = customer_id
        self.base_limit = base_limit_mbps
        self.current_limit = base_limit_mbps
        self.burst_allowance = base_limit_mbps * 2

    def should_allow_request(self, current_rate_mbps):
        # Allow bursts but throttle sustained high traffic
        if current_rate_mbps > self.burst_allowance:
            self.current_limit = max(
                self.base_limit,
                self.current_limit * 0.8  # Gradual throttling
            )
            return current_rate_mbps <= self.current_limit
        return True
Layer 3: CDN/Edge Traffic Budgets
Cloudflare's planned solution—per-customer traffic budgets—addresses the final layer:
Per-Customer Traffic Budget System
# Conceptual per-customer traffic budget
import random
import time


class CustomerTrafficBudget:
    def __init__(self, customer_id, daily_budget_gb=100):
        self.customer_id = customer_id
        self.daily_budget = daily_budget_gb * 1024 * 1024 * 1024  # Convert to bytes
        self.current_usage = 0
        self.reset_time = time.time() + 86400  # 24 hours

    def consume_bandwidth(self, bytes_used):
        if time.time() > self.reset_time:
            self.current_usage = 0
            self.reset_time = time.time() + 86400

        self.current_usage += bytes_used

        # Implement graduated throttling
        usage_percent = self.current_usage / self.daily_budget
        if usage_percent > 0.9:
            return False  # Block requests
        elif usage_percent > 0.8:
            # Introduce artificial delay
            time.sleep(random.uniform(0.1, 0.5))
        return True
BGP Withdrawals: When Network Engineering Backfires
AWS's decision to withdraw BGP prefixes during the incident highlights a critical challenge in network operations: the tools designed to manage traffic can themselves become sources of instability.
BGP withdrawals are a blunt instrument. When AWS withdrew prefixes from congested peering connections, they essentially told the internet: "Don't send traffic this way." But the internet had to send it somewhere, and the alternative paths were even less prepared.
A Better Approach: Graduated Traffic Engineering
Instead of binary BGP withdrawals, network operators should implement graduated traffic engineering:
Graduated BGP Traffic Engineering
# Graduated traffic engineering approach
router bgp 65000
 # Instead of withdrawing completely, adjust routing preferences
 neighbor 192.0.2.1 route-map TRAFFIC_ENGINEERING out
!
route-map TRAFFIC_ENGINEERING permit 10
 match ip address prefix-list CONGESTED_PREFIXES
 # Increase AS path length to make the route less preferred
 set as-path prepend 65000 65000 65000
!
route-map TRAFFIC_ENGINEERING permit 20
 # Normal routes continue as usual
This approach gradually shifts traffic away from congested links without creating sudden routing changes that can overwhelm alternative paths.
Monitoring and Alerting: Early Warning Systems
One of the most striking aspects of this incident was how quickly it escalated. The gap between the initial traffic surge and complete saturation was a matter of minutes. Effective monitoring could have provided early warning:
Key Metrics to Monitor
Critical Metrics for Cascade Prevention
# Critical metrics for cascade failure prevention
monitoring_metrics = {
    "customer_egress_rate": {
        "threshold": "100 Mbps sustained for 5 minutes",
        "action": "automatic_rate_limiting"
    },
    "peering_link_utilization": {
        "threshold": "80% utilization",
        "action": "traffic_engineering_alert"
    },
    "bgp_route_changes": {
        "threshold": "10 prefix changes in 5 minutes",
        "action": "network_operations_alert"
    },
    "customer_request_patterns": {
        "threshold": "10x normal request rate",
        "action": "circuit_breaker_evaluation"
    }
}
Implementing Proactive Alerts
Proactive Cascade Failure Detection
# Example monitoring implementation
class CascadeFailureDetector:
    def __init__(self):
        self.baseline_metrics = {}
        self.alert_thresholds = {
            'traffic_spike_multiplier': 5,
            'sustained_high_traffic_minutes': 10,
            'concurrent_customer_alerts': 3
        }

    def analyze_traffic_pattern(self, customer_id, current_metrics):
        baseline = self.baseline_metrics.get(customer_id, {})

        # Detect traffic spikes relative to the customer's baseline
        if current_metrics.get('requests_per_minute', 0) > \
           baseline.get('avg_requests_per_minute', 0) * self.alert_thresholds['traffic_spike_multiplier']:
            return {
                'alert_level': 'WARNING',
                'message': f'Traffic spike detected for customer {customer_id}',
                'recommended_action': 'enable_rate_limiting'
            }
        return None
Lessons for System Architects
This incident offers several crucial lessons for anyone designing distributed systems:
1. Design for Blast Radius Containment
Every system should be designed to contain failures within well-defined boundaries. In this case, one customer's behavior affected thousands of others because proper isolation wasn't in place.
Implementation principle: Assume every component will misbehave eventually, and design safeguards accordingly.
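One concrete expression of that principle is the bulkhead pattern: cap the resources any single tenant can hold so that a misbehaving tenant exhausts only its own allotment, not the shared pool. A minimal sketch follows; the per-tenant cap of 10 concurrent calls is an assumed, illustrative figure, not something from the incident report.

# Sketch: bulkhead pattern - cap per-tenant concurrency on a shared resource
import threading
from collections import defaultdict


class TenantBulkhead:
    def __init__(self, max_concurrent_per_tenant=10):
        self.max_concurrent = max_concurrent_per_tenant
        self.semaphores = defaultdict(
            lambda: threading.Semaphore(self.max_concurrent)
        )

    def run(self, tenant_id, func, *args, **kwargs):
        sem = self.semaphores[tenant_id]
        # Refuse immediately instead of queueing: a noisy tenant fails fast
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"Tenant {tenant_id} exceeded its concurrency budget")
        try:
            return func(*args, **kwargs)
        finally:
            sem.release()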
2. Implement Defense in Depth
Rate limiting shouldn't exist at just one layer—it should be implemented throughout the stack, as the token bucket sketch after this list illustrates:
- Application layer: Circuit breakers, backoff, request queuing
- Infrastructure layer: Per-tenant resource limits, egress controls
- Network layer: Traffic shaping, graduated routing preferences
- Edge layer: Customer budgets, automatic throttling
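A token bucket is the workhorse primitive behind most of these layers, whether it gates API calls, egress bandwidth, or edge requests. A minimal sketch, with illustrative numbers:

# Sketch: a token bucket, the usual building block for rate limiting at any layer
import time


class TokenBucket:
    def __init__(self, rate_per_second, burst_size):
        self.rate = rate_per_second          # Tokens added per second
        self.capacity = burst_size           # Maximum bucket size (burst allowance)
        self.tokens = float(burst_size)
        self.last_refill = time.monotonic()

    def allow(self, cost=1.0):
        # Refill tokens based on elapsed time, capped at the burst size
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage: 100 requests/second sustained, bursts up to 200 (illustrative numbers)
bucket = TokenBucket(rate_per_second=100, burst_size=200)
if not bucket.allow():
    pass  # Shed, queue, or delay the request depending on the layer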
3. Coordination is Critical
The fact that AWS's well-intentioned BGP withdrawals made the situation worse highlights the importance of coordination between network operations teams. In a multi-provider environment, traffic engineering decisions by one party can have unintended consequences for others.
4. Monitor Aggregate Impact, Not Just Individual Metrics
Traditional monitoring focuses on individual services or customers. This incident shows the importance of monitoring aggregate effects—how does one customer's traffic pattern affect the broader system?
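Even a simple aggregate view helps: compute each customer's share of total traffic on a shared link and alert when any single share dominates. A small sketch with hypothetical telemetry values:

# Sketch: flag customers whose share of aggregate traffic crosses a threshold
def dominant_customers(per_customer_mbps, share_threshold=0.5):
    """Return customers consuming more than share_threshold of total traffic."""
    total = sum(per_customer_mbps.values())
    if total == 0:
        return []
    return [
        customer
        for customer, rate in per_customer_mbps.items()
        if rate / total > share_threshold
    ]


# Usage with hypothetical telemetry: one customer is 90% of aggregate egress
print(dominant_customers({"cust-a": 900, "cust-b": 60, "cust-c": 40}))
# -> ['cust-a']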
Building Resilient Interconnected Systems
The August 21st incident wasn't just a Cloudflare problem or an AWS problem—it was an internet architecture problem. As our systems become more interconnected, we need to think about resilience differently.
The Shared Responsibility Model for Operational Resilience
Just as we have shared responsibility models for security, we need them for operational resilience:
- Customers: Implement proper rate limiting, circuit breakers, and graceful degradation
- Infrastructure providers: Enforce per-tenant resource limits and provide traffic management tools
- Network operators: Coordinate traffic engineering decisions and implement graduated controls
- Edge providers: Implement customer isolation and automatic safeguards
Future-Proofing Internet Infrastructure
This incident points toward several areas where the industry needs to evolve:
- Standardized traffic management protocols that allow providers to coordinate responses to unusual traffic patterns
- Automatic traffic budgeting that dynamically adjusts based on overall system health
- Cross-provider monitoring that gives visibility into how traffic patterns affect the broader ecosystem
- Graduated response frameworks that replace binary controls (like BGP withdrawals) with more nuanced approaches
Conclusion
The Cloudflare incident of August 21st, 2025, serves as a stark reminder that in our interconnected world, local problems can quickly become global ones. But it also provides a roadmap for building more resilient systems.
The solution isn't perfect prediction or elimination of failures—it's building systems that fail gracefully and contain the blast radius of problems when they occur. Rate limiting isn't just about protecting individual services; it's about protecting the entire ecosystem.
As we continue to build increasingly complex distributed systems, the lessons from this incident become more critical. Every application we build, every infrastructure decision we make, and every operational procedure we implement should ask: "How does this behavior affect not just our system, but the broader ecosystem we're part of?"
The internet works because millions of systems cooperate effectively. Incidents like this remind us that with that cooperation comes responsibility—to build systems that are good citizens in the global network we all depend on.
The technical details in this analysis are based on Cloudflare's public incident report. Code examples are provided for illustrative purposes and should be adapted for specific environments and requirements.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of The Rose Theory series on math and physics.