The Rate Limiting Cascade: Lessons from Cloudflare's August 21 Incident
Aaron Rose
Software Engineer & Technology Writer
On August 21, 2025, a single customer's traffic pattern triggered a four-hour outage that affected thousands of users accessing AWS us-east-1 through Cloudflare. This wasn't a DDoS attack, a BGP hijack, or a hardware failure. It was something far more insidious: a rate limiting cascade failure that exposed fundamental gaps in how we architect resilience across interconnected systems.
The incident serves as a masterclass in how local problems become global ones when proper safeguards are missing at multiple layers. More importantly, it demonstrates why rate limiting can't be an afterthought—it must be a first-class architectural concern that spans the entire request lifecycle.
The Anatomy of a Cascade Failure
The Trigger: Runaway Customer Traffic
At 16:27 UTC, a single Cloudflare customer began requesting cached objects stored in AWS us-east-1 at an enormous rate. The scale was unprecedented: terabytes per minute of response traffic, equivalent to downloading the entire Wikipedia database every few minutes.
This wasn't malicious—it was likely a misconfigured batch job, a runaway script, or an application with aggressive retry logic that went haywire. But the impact was the same: legitimate traffic that overwhelmed the infrastructure.
The First Failure: The Customer's Application
The customer's application lacked proper rate limiting and backoff mechanisms. In distributed systems, this is the equivalent of driving without brakes. Every well-architected application should implement:
# Example: Exponential backoff with jitter
import time
import random

import requests


class RateLimitedClient:
    def __init__(self, max_requests_per_second=10):
        self.max_rps = max_requests_per_second
        self.last_request_time = 0
        self.retry_count = 0

    def make_request(self, url):
        # Rate limiting: enforce a minimum interval between requests
        now = time.time()
        time_since_last = now - self.last_request_time
        min_interval = 1.0 / self.max_rps
        if time_since_last < min_interval:
            time.sleep(min_interval - time_since_last)
        self.last_request_time = time.time()

        try:
            response = requests.get(url, timeout=30)
            self.retry_count = 0  # Reset on success
            return response
        except requests.exceptions.RequestException:
            # Exponential backoff with jitter, capped at 300 seconds
            delay = min(300, (2 ** self.retry_count) + random.uniform(0, 1))
            time.sleep(delay)
            self.retry_count += 1
            raise
The Amplification: AWS's Egress Saturation
When the customer's requests hit AWS, they generated massive response traffic that completely saturated the direct peering connections between AWS and Cloudflare. Think of it as trying to empty a swimming pool through a garden hose—the infrastructure simply wasn't designed for this volume.
The Second Failure: AWS's Egress Controls
AWS lacked sufficient egress rate limiting to prevent a single customer from monopolizing shared network resources. In a multi-tenant environment, this is a critical oversight.
AWS's response was technically sound but operationally disastrous: they began withdrawing BGP prefixes from the congested peering connections, attempting to reroute traffic to less congested paths.
# BGP prefix withdrawal - what AWS likely did (illustrative config)
router bgp 65000
 neighbor 192.0.2.1 route-map EMERGENCY_WITHDRAW out
!
route-map EMERGENCY_WITHDRAW deny 10
 match ip address prefix-list CONGESTED_PREFIXES
!
route-map EMERGENCY_WITHDRAW permit 20
 # All other prefixes continue to be advertised
This seemed logical—if direct paths are congested, use indirect ones. But it created a new problem: the indirect paths had even less capacity.
The Cascade: Traffic Engineering Gone Wrong
AWS's BGP withdrawals pushed traffic onto Cloudflare's secondary paths through an offsite network interconnection switch. This Data Center Interconnect (DCI) link was already scheduled for a capacity upgrade and couldn't handle the sudden influx.
The Third Failure: Cloudflare's Traffic Isolation
Cloudflare lacked per-customer traffic budgets and automatic throttling mechanisms. When AWS rerouted the traffic, Cloudflare had no way to preferentially drop or throttle the problematic customer's requests.
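One way to close that gap is preferential shedding: when a shared link approaches saturation, throttle the heaviest contributors first instead of degrading every customer equally. Below is a minimal sketch, assuming per-customer rates arrive from flow telemetry; the class name and the numbers are illustrative, not Cloudflare's actual implementation.

# Sketch: shed the heaviest customers first when a shared link saturates
class PreferentialThrottler:
    def __init__(self, link_capacity_gbps, high_watermark=0.85):
        self.link_capacity = link_capacity_gbps
        self.high_watermark = high_watermark

    def customers_to_throttle(self, per_customer_gbps):
        """Return customers to throttle, heaviest first, until below the watermark."""
        total = sum(per_customer_gbps.values())
        target = self.link_capacity * self.high_watermark
        throttled = []
        # Walk customers from heaviest to lightest, shedding until under target
        for customer, rate in sorted(per_customer_gbps.items(), key=lambda kv: -kv[1]):
            if total <= target:
                break
            throttled.append(customer)
            total -= rate
        return throttled


# Usage with hypothetical telemetry: one runaway customer on an 80 Gbps link
throttler = PreferentialThrottler(link_capacity_gbps=80)
print(throttler.customers_to_throttle({"cust-a": 70, "cust-b": 10, "cust-c": 5}))
# -> ['cust-a']: only the runaway customer is throttled; everyone else is untouched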
The result was a cascade failure:
1. The customer generates excessive traffic
2. AWS's direct links saturate
3. AWS withdraws BGP prefixes to reroute traffic
4. The alternative paths also saturate
5. Packet loss and latency spike for all customers
The Missing Safeguards: A Multi-Layer Analysis
This incident revealed gaps at every layer of the infrastructure stack. Let's examine what should have been in place:
Layer 1: Application-Level Rate Limiting
Every application making requests to external services should implement rate limiting and circuit breaker patterns:
# Circuit breaker implementation
from enum import Enum
import time


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is open")

        try:
            result = func(*args, **kwargs)
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            raise e
Layer 2: Cloud Provider Egress Controls
AWS should have implemented per-customer egress rate limiting to prevent any single tenant from saturating shared network resources:
Conceptual AWS Config Rule for Egress Monitoring
# Example: AWS Config rule for egress monitoring
EgressRateLimitRule:
  Type: AWS::Config::ConfigRule
  Properties:
    Source:
      Owner: AWS
      SourceIdentifier: EC2_INSTANCE_DETAILED_MONITORING_ENABLED
    MaximumExecutionFrequency: TwentyFour_Hours
    InputParameters:
      # Custom logic to monitor egress rates per customer
      MaxEgressRateMbps: "1000"      # 1 Gbps per customer
      AlertThresholdPercent: "80"
Better yet, implement dynamic traffic shaping:
Dynamic Egress Rate Limiter Implementation
# Conceptual egress rate limiter
class EgressRateLimiter:
    def __init__(self, customer_id, base_limit_mbps=100):
        self.customer_id = customer_id
        self.base_limit = base_limit_mbps
        self.current_limit = base_limit_mbps
        self.burst_allowance = base_limit_mbps * 2

    def should_allow_request(self, current_rate_mbps):
        # Allow bursts but throttle sustained high traffic
        if current_rate_mbps > self.burst_allowance:
            self.current_limit = max(
                self.base_limit,
                self.current_limit * 0.8  # Gradual throttling
            )
            return current_rate_mbps <= self.current_limit
        return True
Layer 3: CDN/Edge Traffic Budgets
Cloudflare's planned solution—per-customer traffic budgets—addresses the final layer:
Per-Customer Traffic Budget System
# Conceptual per-customer traffic budget
import random
import time


class CustomerTrafficBudget:
    def __init__(self, customer_id, daily_budget_gb=100):
        self.customer_id = customer_id
        self.daily_budget = daily_budget_gb * 1024 * 1024 * 1024  # Convert to bytes
        self.current_usage = 0
        self.reset_time = time.time() + 86400  # 24 hours

    def consume_bandwidth(self, bytes_used):
        if time.time() > self.reset_time:
            self.current_usage = 0
            self.reset_time = time.time() + 86400

        self.current_usage += bytes_used

        # Implement graduated throttling
        usage_percent = self.current_usage / self.daily_budget
        if usage_percent > 0.9:
            return False  # Block requests
        elif usage_percent > 0.8:
            # Introduce artificial delay
            time.sleep(random.uniform(0.1, 0.5))
        return True
BGP Withdrawals: When Network Engineering Backfires
AWS's decision to withdraw BGP prefixes during the incident highlights a critical challenge in network operations: the tools designed to manage traffic can themselves become sources of instability.
BGP withdrawals are a blunt instrument. When AWS withdrew prefixes from congested peering connections, they essentially told the internet: "Don't send traffic this way." But the internet had to send it somewhere, and the alternative paths were even less prepared.
A Better Approach: Graduated Traffic Engineering
Instead of binary BGP withdrawals, network operators should implement graduated traffic engineering:
Graduated BGP Traffic Engineering
# Graduated traffic engineering approach
router bgp 65000
 # Instead of withdrawing completely, adjust routing preferences
 neighbor 192.0.2.1 route-map TRAFFIC_ENGINEERING out
!
route-map TRAFFIC_ENGINEERING permit 10
 match ip address prefix-list CONGESTED_PREFIXES
 # Increase AS path length to make the route less preferred
 set as-path prepend 65000 65000 65000
!
route-map TRAFFIC_ENGINEERING permit 20
 # Normal routes continue as usual
This approach gradually shifts traffic away from congested links without creating sudden routing changes that can overwhelm alternative paths.
Monitoring and Alerting: Early Warning Systems
One of the most striking aspects of this incident was how quickly it escalated. The gap between the initial traffic surge and complete saturation was a matter of minutes. Effective monitoring could have provided early warning:
Key Metrics to Monitor
Critical Metrics for Cascade Prevention
# Critical metrics for cascade failure prevention
monitoring_metrics = {
    "customer_egress_rate": {
        "threshold": "100 Mbps sustained for 5 minutes",
        "action": "automatic_rate_limiting"
    },
    "peering_link_utilization": {
        "threshold": "80% utilization",
        "action": "traffic_engineering_alert"
    },
    "bgp_route_changes": {
        "threshold": "10 prefix changes in 5 minutes",
        "action": "network_operations_alert"
    },
    "customer_request_patterns": {
        "threshold": "10x normal request rate",
        "action": "circuit_breaker_evaluation"
    }
}
Implementing Proactive Alerts
Proactive Cascade Failure Detection
# Example monitoring implementation
class CascadeFailureDetector:
    def __init__(self):
        self.baseline_metrics = {}
        self.alert_thresholds = {
            'traffic_spike_multiplier': 5,
            'sustained_high_traffic_minutes': 10,
            'concurrent_customer_alerts': 3
        }

    def analyze_traffic_pattern(self, customer_id, current_metrics):
        baseline = self.baseline_metrics.get(customer_id, {})

        # Detect traffic spikes relative to the customer's baseline
        if current_metrics.get('requests_per_minute', 0) > \
           baseline.get('avg_requests_per_minute', 0) * self.alert_thresholds['traffic_spike_multiplier']:
            return {
                'alert_level': 'WARNING',
                'message': f'Traffic spike detected for customer {customer_id}',
                'recommended_action': 'enable_rate_limiting'
            }
        return None
Lessons for System Architects
This incident offers several crucial lessons for anyone designing distributed systems:
1. Design for Blast Radius Containment
Every system should be designed to contain failures within well-defined boundaries. In this case, one customer's behavior affected thousands of others because proper isolation wasn't in place.
Implementation principle: Assume every component will misbehave eventually, and design safeguards accordingly.
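One concrete expression of that principle is the bulkhead pattern: cap the resources any single tenant can hold so that a misbehaving tenant exhausts only its own allotment, not the shared pool. A minimal sketch follows; the per-tenant cap of 10 concurrent calls is an assumed, illustrative figure, not something from the incident report.

# Sketch: bulkhead pattern - cap per-tenant concurrency on a shared resource
import threading
from collections import defaultdict


class TenantBulkhead:
    def __init__(self, max_concurrent_per_tenant=10):
        self.max_concurrent = max_concurrent_per_tenant
        self.semaphores = defaultdict(
            lambda: threading.Semaphore(self.max_concurrent)
        )

    def run(self, tenant_id, func, *args, **kwargs):
        sem = self.semaphores[tenant_id]
        # Refuse immediately instead of queueing: a noisy tenant fails fast
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"Tenant {tenant_id} exceeded its concurrency budget")
        try:
            return func(*args, **kwargs)
        finally:
            sem.release()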
2. Implement Defense in Depth
Rate limiting shouldn't exist at just one layer—it should be implemented throughout the stack, as the token bucket sketch after this list illustrates:
- Application layer: Circuit breakers, backoff, request queuing
- Infrastructure layer: Per-tenant resource limits, egress controls
- Network layer: Traffic shaping, graduated routing preferences
- Edge layer: Customer budgets, automatic throttling
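A token bucket is the workhorse primitive behind most of these layers, whether it gates API calls, egress bandwidth, or edge requests. A minimal sketch, with illustrative numbers:

# Sketch: a token bucket, the usual building block for rate limiting at any layer
import time


class TokenBucket:
    def __init__(self, rate_per_second, burst_size):
        self.rate = rate_per_second          # Tokens added per second
        self.capacity = burst_size           # Maximum bucket size (burst allowance)
        self.tokens = float(burst_size)
        self.last_refill = time.monotonic()

    def allow(self, cost=1.0):
        # Refill tokens based on elapsed time, capped at the burst size
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage: 100 requests/second sustained, bursts up to 200 (illustrative numbers)
bucket = TokenBucket(rate_per_second=100, burst_size=200)
if not bucket.allow():
    pass  # Shed, queue, or delay the request depending on the layer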
3. Coordination is Critical
The fact that AWS's well-intentioned BGP withdrawals made the situation worse highlights the importance of coordination between network operations teams. In a multi-provider environment, traffic engineering decisions by one party can have unintended consequences for others.
4. Monitor Aggregate Impact, Not Just Individual Metrics
Traditional monitoring focuses on individual services or customers. This incident shows the importance of monitoring aggregate effects—how does one customer's traffic pattern affect the broader system?
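Even a simple aggregate view helps: compute each customer's share of total traffic on a shared link and alert when any single share dominates. A small sketch with hypothetical telemetry values:

# Sketch: flag customers whose share of aggregate traffic crosses a threshold
def dominant_customers(per_customer_mbps, share_threshold=0.5):
    """Return customers consuming more than share_threshold of total traffic."""
    total = sum(per_customer_mbps.values())
    if total == 0:
        return []
    return [
        customer
        for customer, rate in per_customer_mbps.items()
        if rate / total > share_threshold
    ]


# Usage with hypothetical telemetry: one customer is 90% of aggregate egress
print(dominant_customers({"cust-a": 900, "cust-b": 60, "cust-c": 40}))
# -> ['cust-a']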
Building Resilient Interconnected Systems
The August 21st incident wasn't just a Cloudflare problem or an AWS problem—it was an internet architecture problem. As our systems become more interconnected, we need to think about resilience differently.
The Shared Responsibility Model for Operational Resilience
Just as we have shared responsibility models for security, we need them for operational resilience:
- Customers: Implement proper rate limiting, circuit breakers, and graceful degradation
- Infrastructure providers: Enforce per-tenant resource limits and provide traffic management tools
- Network operators: Coordinate traffic engineering decisions and implement graduated controls
- Edge providers: Implement customer isolation and automatic safeguards
Future-Proofing Internet Infrastructure
This incident points toward several areas where the industry needs to evolve:
- Standardized traffic management protocols that allow providers to coordinate responses to unusual traffic patterns
- Automatic traffic budgeting that dynamically adjusts based on overall system health
- Cross-provider monitoring that gives visibility into how traffic patterns affect the broader ecosystem
- Graduated response frameworks that replace binary controls (like BGP withdrawals) with more nuanced approaches
Conclusion
The Cloudflare incident of August 21st, 2025, serves as a stark reminder that in our interconnected world, local problems can quickly become global ones. But it also provides a roadmap for building more resilient systems.
The solution isn't perfect prediction or elimination of failures—it's building systems that fail gracefully and contain the blast radius of problems when they occur. Rate limiting isn't just about protecting individual services; it's about protecting the entire ecosystem.
As we continue to build increasingly complex distributed systems, the lessons from this incident become more critical. Every application we build, every infrastructure decision we make, and every operational procedure we implement should ask: "How does this behavior affect not just our system, but the broader ecosystem we're part of?"
The internet works because millions of systems cooperate effectively. Incidents like this remind us that with that cooperation comes responsibility—to build systems that are good citizens in the global network we all depend on.
The technical details in this analysis are based on Cloudflare's public incident report. Code examples are provided for illustrative purposes and should be adapted for specific environments and requirements.
Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of The Rose Theory series on math and physics.