
Designing for a 10x Traffic Spike You Didn't Predict

Ravinder · 9 min read
Architecture · Reliability · Scalability · Distributed Systems

The Spike You Did Not See Coming

Every system gets a traffic spike eventually. Sometimes you cause it yourself — a marketing campaign, a product launch, a viral moment. Sometimes it happens to you — a competitor goes down, a news story links to you, a bot starts hammering an endpoint. The distinguishing factor between systems that survive and systems that collapse under unexpected load is whether resilience was built in or bolted on.

Bolted-on resilience means spinning up more servers and hoping the spike is slow enough that autoscaling catches it. Built-in resilience means the system is designed to degrade gracefully, protect its most important functions, and recover without human intervention.

This post covers the patterns that deliver built-in resilience: queue-based load leveling, bulkheads, load shedding, and circuit breakers. Not as a checklist — as a design philosophy with concrete implementation details.


The Fundamental Problem: Synchronous Systems Are Brittle Under Load

A synchronous request-response system is a chain. If any link in the chain is saturated, the entire request fails. Under a 10x spike:

Normal load:      Client → API → DB (20ms) → Client ✓

10x load spike:   Client → API → DB (saturated)
                                   ↓
                       Connection pool exhausted
                                   ↓
                       API threads blocked
                                   ↓
                       API response queue full
                                   ↓
                       Clients time out
                                   ↓
                       Clients retry → 100x load

Retry amplification is what turns a bad situation into a catastrophe. Clients that time out retry. With 10x the clients, each retrying 3 times with no backoff, the failed requests alone generate 30x the normal load, and every round of timeouts triggers another round of retries, so the multiplier keeps climbing. This is the death spiral.

The patterns below break this chain at different points.


Pattern 1: Queue-Based Load Leveling

The core idea: accept requests at the edge immediately, process them asynchronously at a rate the backend can sustain.

flowchart LR
    C[Clients] --> A[API layer\naccepts immediately]
    A --> Q[Queue\nfull = reject early]
    Q --> W1[Worker 1]
    Q --> W2[Worker 2]
    Q --> W3[Worker 3]
    W1 --> DB[(Database)]
    W2 --> DB
    W3 --> DB

The API layer decouples acceptance from processing. Under a 10x spike, the queue absorbs the burst while workers drain it at a controlled pace. The client gets an immediate 202 Accepted with a job ID, not a 5-second timeout.

import json
import uuid
from datetime import datetime, timezone

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from redis import Redis

app = FastAPI()
queue = Redis(host="redis")

# Bound the queue: deep enough to absorb a burst, shallow enough that
# everything still in it is worth processing
MAX_QUEUE_DEPTH = 10_000

class OrderRequest(BaseModel):
    sku: str        # Illustrative fields
    quantity: int

@app.post("/orders", status_code=202)
async def submit_order(order: OrderRequest):
    # Accept immediately, process async
    job_id = str(uuid.uuid4())

    # Reject if queue is too deep (the important part)
    queue_depth = queue.llen("order_queue")
    if queue_depth > MAX_QUEUE_DEPTH:
        raise HTTPException(
            status_code=503,
            detail="System under load. Retry after 30s.",
            headers={"Retry-After": "30"},
        )

    queue.rpush("order_queue", json.dumps({
        "job_id": job_id,
        "order": order.model_dump(),
        "submitted_at": datetime.now(timezone.utc).isoformat(),
    }))

    return {"job_id": job_id, "status": "queued"}

@app.get("/orders/{job_id}/status")
async def order_status(job_id: str):
    # db stands in for whatever result store the workers write to
    result = db.get_job_result(job_id)
    return {"job_id": job_id, "status": result.status}

The queue depth check is essential. Without it, the queue becomes unbounded — you absorb all traffic but never apply backpressure, and you end up processing orders from 3 hours ago while new orders pile up. Set a depth limit and return 503 with a Retry-After header when you hit it. Clients that handle 503 correctly will retry when the queue drains.
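
The worker side is the other half of the pattern: a plain loop that drains the queue at whatever rate the database sustains. A minimal sketch against the endpoint above; process_order and record_status are assumptions standing in for your business logic and result store:

import json
import time
from redis import Redis

queue = Redis(host="redis")

def worker_loop():
    while True:
        # Block up to 5s waiting for the next job; returns None on timeout
        item = queue.blpop("order_queue", timeout=5)
        if item is None:
            continue
        _key, payload = item
        job = json.loads(payload)
        try:
            process_order(job["order"])           # Assumed business logic
            record_status(job["job_id"], "done")  # Assumed result-store write
        except Exception:
            record_status(job["job_id"], "failed")
            time.sleep(0.1)  # Brief pause so a poison message cannot spin the CPU

Scaling throughput is then a matter of running more worker processes, not of letting the API layer take more load.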


Pattern 2: Bulkheads

A bulkhead is an isolation boundary between workloads. The concept comes from ship design: watertight compartments mean a breach in one section does not sink the whole ship.

In a system context: a CPU-bound export job should not be able to starve the thread pool handling user-facing API requests. A slow third-party payment provider should not exhaust the connection pool that real-time search depends on.

graph TD
    subgraph APIPool[API Thread Pool - 50 threads]
        T1[User-facing\nrequests\n30 threads]
        T2[Admin\nrequests\n10 threads]
        T3[Webhooks\n10 threads]
    end
    subgraph Workers[Worker Pool - separate processes]
        W1[Export workers\n5 processes]
        W2[Report workers\n3 processes]
    end
    subgraph Pools[Connection pool per workload]
        D1[API pool\n20 connections]
        D2[Worker pool\n10 connections]
    end
    T1 --> D1
    T2 --> D1
    W1 --> D2
    W2 --> D2

The key: if the export workers exhaust their connection pool, the API pool is unaffected. A flood of admin requests cannot displace user-facing request threads.

# Separate connection pools per workload type
from sqlalchemy import create_engine

DATABASE_URL = "postgresql://app:secret@db:5432/app"  # Placeholder

# User-facing API: larger pool, short checkout timeout
api_engine = create_engine(
    DATABASE_URL,
    pool_size=20,
    max_overflow=5,
    pool_timeout=3,     # Fail fast if no connection available
    pool_pre_ping=True, # Validate connections before handing them out
)

# Background workers: smaller pool, more patient
worker_engine = create_engine(
    DATABASE_URL,
    pool_size=10,
    max_overflow=2,
    pool_timeout=30,    # Workers can wait longer
    pool_pre_ping=True,
)

The pool timeout is the critical parameter. A user-facing request that cannot get a connection in 3 seconds should fail fast with a 503. A background worker can afford to wait 30 seconds. Without separate pools, a burst of background work can hold connections long enough to time out user-facing requests.
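
The same isolation applies to concurrency itself, not just connections. A sketch of per-workload bulkheads built from asyncio semaphores; the workload names and limits mirror the diagram above and are illustrative:

import asyncio

class OverloadedError(Exception):
    pass

# One semaphore per workload: a flood of admin calls can exhaust its own
# slots but never touch the user-facing budget
BULKHEADS = {
    "user": asyncio.Semaphore(30),
    "admin": asyncio.Semaphore(10),
    "webhook": asyncio.Semaphore(10),
}

async def with_bulkhead(workload: str, coro_fn, *args):
    sem = BULKHEADS[workload]
    try:
        # Fail fast instead of queueing behind a saturated bulkhead
        await asyncio.wait_for(sem.acquire(), timeout=0.5)
    except asyncio.TimeoutError:
        raise OverloadedError(f"{workload} bulkhead is full")
    try:
        return await coro_fn(*args)
    finally:
        sem.release()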


Pattern 3: Load Shedding

When a system is genuinely overloaded, trying to serve all requests is worse than serving a fraction of them well. Load shedding is the deliberate decision to reject low-priority requests so high-priority requests succeed.

Priority tiers for a typical B2B SaaS:

Priority     Workload                         Action under load
--------     --------                         -----------------
Critical     Auth, checkout, payment          Never shed
High         Core product features            Shed last
Medium       Search, recommendations          Shed at 70% capacity
Low          Analytics, reporting, exports    Shed at 50% capacity
Background   Notifications, cleanup jobs      Shed at 30% capacity

from dataclasses import dataclass
from enum import IntEnum

from fastapi import Request, Response

class Priority(IntEnum):
    BACKGROUND = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5

@dataclass
class LoadShedder:
    current_cpu_pct: float      # 0.0-1.0, updated by a background monitor
    current_queue_depth: int    # Updated by a background monitor

    THRESHOLDS = {
        Priority.BACKGROUND: 0.30,
        Priority.LOW: 0.50,
        Priority.MEDIUM: 0.70,
        Priority.HIGH: 0.90,
        Priority.CRITICAL: 1.01,  # Never shed critical
    }

    def should_shed(self, priority: Priority) -> bool:
        threshold = self.THRESHOLDS[priority]
        return self.current_cpu_pct > threshold

load_shedder = LoadShedder(current_cpu_pct=0.0, current_queue_depth=0)

def get_request_priority(request: Request) -> Priority:
    # Illustrative: map route prefixes to priorities; a header works too
    if request.url.path.startswith(("/auth", "/checkout")):
        return Priority.CRITICAL
    if request.url.path.startswith("/reports"):
        return Priority.LOW
    return Priority.MEDIUM

# FastAPI middleware example
@app.middleware("http")
async def load_shedding_middleware(request: Request, call_next):
    priority = get_request_priority(request)

    if load_shedder.should_shed(priority):
        return Response(
            status_code=503,
            headers={"Retry-After": "10"},
            content="Service under load",
        )

    return await call_next(request)

The client-side counterpart is exponential backoff with jitter. A client that gets a 503 and retries immediately with full concurrency makes the problem worse. The Retry-After header is a contract — a well-behaved client honours it.

import random
import time

class ServiceUnavailableError(Exception):
    def __init__(self, retry_after: float | None = None):
        self.retry_after = retry_after  # Parsed from the Retry-After header

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except ServiceUnavailableError as e:
            if attempt == max_retries - 1:
                raise
            # Honour the server's Retry-After if present, else back off exponentially
            retry_after = e.retry_after or (base_delay * 2 ** attempt)
            jitter = random.uniform(0, retry_after * 0.1)
            time.sleep(retry_after + jitter)

Pattern 4: Circuit Breakers

A circuit breaker stops a failing downstream dependency from cascading into your own system. Without one, requests pile up waiting on a slow or dead service, exhausting your thread pool while they wait for timeouts.

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Error rate > threshold\n(e.g. 50% errors in 10s window)
    Open --> HalfOpen: Wait period elapsed\n(e.g. 30 seconds)
    HalfOpen --> Closed: Probe request succeeds
    HalfOpen --> Open: Probe request fails
    Closed --> Closed: Normal operation
    Open --> Open: Fail fast\nno requests sent

import time
from threading import Lock

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 1,
    ):
        self._failures = 0
        self._last_failure_time = None
        self._state = "closed"
        self._half_open_calls = 0
        self._lock = Lock()
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

    def call(self, fn, *args, **kwargs):
        with self._lock:
            if self._state == "open":
                elapsed = time.monotonic() - self._last_failure_time
                if elapsed < self.recovery_timeout:
                    raise CircuitOpenError("Circuit open, failing fast")
                self._state = "half_open"
                self._half_open_calls = 0
            if self._state == "half_open":
                # Let only a limited number of probes through
                if self._half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError("Probe in flight, failing fast")
                self._half_open_calls += 1

        try:
            result = fn(*args, **kwargs)
            with self._lock:
                self._failures = 0
                self._state = "closed"
            return result
        except Exception:
            with self._lock:
                self._failures += 1
                self._last_failure_time = time.monotonic()
                if self._failures >= self.failure_threshold:
                    self._state = "open"
            raise

The critical design choice: fail fast when the circuit is open. Return an error immediately rather than waiting for a timeout. This frees your threads to handle requests that will succeed, and it stops applying load to a downstream that is already struggling.
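
Wiring it up can look like the sketch below; the payment endpoint is hypothetical, and the 2-second client timeout is what keeps a slow provider from holding threads even while the circuit is still closed:

import requests
from fastapi import HTTPException

payments_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0)

def charge_card(payload: dict) -> dict:
    # Hypothetical provider endpoint; any slow dependency behaves the same
    resp = requests.post("https://payments.example.com/charge",
                         json=payload, timeout=2)
    resp.raise_for_status()
    return resp.json()

def charge_with_breaker(payload: dict) -> dict:
    try:
        return payments_breaker.call(charge_card, payload)
    except CircuitOpenError:
        # Fail fast upstream instead of burning a thread on a doomed call
        raise HTTPException(status_code=503, headers={"Retry-After": "30"})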


Putting It Together: Layered Resilience

These patterns work best in combination. A realistic architecture for a high-traffic endpoint:

flowchart TD
    C[Client] --> LB[Load Balancer\nRate limiting]
    LB --> LS[Load Shedder\nPriority-based rejection]
    LS --> BH[Bulkhead\nPer-workload thread pools]
    BH --> Q[Queue\nAsync for non-critical work]
    BH --> CB[Circuit Breaker\nProtects downstream calls]
    CB --> DB[(Database)]
    CB --> EXT[External APIs]
    Q --> W[Workers\nControlled drain rate]
    W --> DB

The layers each handle a different failure mode:

  • Rate limiting at the load balancer stops volumetric attacks (a minimal sketch follows this list)
  • Load shedding at the API layer preserves capacity for high-priority work
  • Bulkheads prevent one workload from starving another
  • Queues decouple acceptance from processing for async workloads
  • Circuit breakers prevent cascade from downstream failures
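
In production the rate limit usually lives in the load balancer itself (nginx's limit_req, or an Envoy rate-limit filter), but the algorithm underneath is a token bucket. A minimal in-process sketch, with illustrative limits:

import time

class TokenBucket:
    """Per-client rate limiter: refill at `rate` tokens/sec up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Top up tokens for the time elapsed since the last request
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# e.g. 100 req/s steady state, bursts of 200, one bucket per client IP
buckets: dict[str, TokenBucket] = {}

def is_allowed(client_ip: str) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=100, burst=200))
    return bucket.allow()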

A 10x traffic spike that reaches all layers still causes degraded service for low-priority requests. It does not cause a total outage.


What Not to Do

Do not rely on autoscaling alone. Autoscaling is slow — typically 2–5 minutes to spin up new capacity. A spike that saturates your system in 30 seconds has done its damage long before autoscaling responds.

Do not leave queues unbounded. An unbounded queue absorbs infinite load, which means requests from 45 minutes ago are still being processed while fresh ones pile up behind them. Set a maximum depth. Reject when you hit it.

Do not circuit-break without a fallback. A circuit that opens and returns a raw 500 with no information is not resilient — it is just a different kind of failure. Return a meaningful error code, a Retry-After header, and if possible, a cached or degraded response.
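
Here is one shape that fallback can take; reco_breaker, fetch_recommendations, and the Redis cache are assumptions standing in for your own dependency and cache:

import json
from redis import Redis

cache = Redis(host="redis")
reco_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0)

def get_recommendations(user_id: str) -> dict:
    try:
        fresh = reco_breaker.call(fetch_recommendations, user_id)  # Assumed upstream call
        cache.set(f"reco:{user_id}", json.dumps(fresh), ex=3600)   # Refresh the fallback
        return {"items": fresh, "degraded": False}
    except CircuitOpenError:
        cached = cache.get(f"reco:{user_id}")
        if cached is not None:
            # Stale recommendations beat an error page
            return {"items": json.loads(cached), "degraded": True}
        return {"items": [], "degraded": True}  # Empty but well-formed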

Do not skip load testing. These patterns are hypotheses until you have tested them under realistic load. Run load tests against staging before you need to trust these patterns in production.
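
A spike test does not need to be elaborate. A sketch using Locust against the /orders endpoint from Pattern 1; the payload fields are the same illustrative ones:

from locust import HttpUser, task, between

class SpikeUser(HttpUser):
    wait_time = between(0.1, 0.5)  # Aggressive clients to simulate a spike

    @task
    def submit_and_poll(self):
        resp = self.client.post("/orders", json={"sku": "ABC-123", "quantity": 1})
        if resp.status_code == 202:
            job_id = resp.json()["job_id"]
            # Group all status checks under one name in the stats
            self.client.get(f"/orders/{job_id}/status", name="/orders/[id]/status")

Run it with something like locust -f spike_test.py --host https://staging.example.com --users 2000 --spawn-rate 200, and watch whether the 503s land on low-priority routes first, as the load shedder intends.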


Key Takeaways

  • Synchronous systems are inherently brittle under spikes; retry amplification turns a 10x spike into a 30x load in seconds.
  • Queue-based load leveling decouples acceptance from processing, but requires a queue depth limit and a 503 response with Retry-After to apply backpressure.
  • Bulkheads isolate workloads by allocating separate thread pools and connection pools so a slow background job cannot starve user-facing requests.
  • Load shedding is a deliberate choice to reject low-priority requests under load; without it, you try to serve everyone and succeed for no one.
  • Circuit breakers protect downstream dependencies by failing fast when error rates spike, preventing cascade failures from blocked threads.
  • These patterns require load testing to validate — they are hypotheses until tested against realistic spike traffic in a staging environment.