Rate-Limiter Walkthrough
Rate limiting is a deceptively deep system design problem. Most candidates name-drop "token bucket" or "sliding window" and sketch a Redis counter without engaging with the real questions: What are the precision and cost tradeoffs between algorithms? What happens when you distribute the limiter across multiple nodes? How do you handle the thundering herd at window resets? These are the questions interviewers are actually probing.
Why Rate Limiting Exists
Rate limiting protects backend services from three failure modes: accidental overload (buggy clients retrying in a tight loop), deliberate abuse (API scraping, credential stuffing), and noisy-neighbor resource contention in multi-tenant systems. The goal is to shed excess load predictably and inform clients about limits in a way that allows graceful retry.
A rate limiter should never be the system's only defense — it is one layer. But it is a critical one, because it is the first layer that sees the incoming request.
Algorithm 1: Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., 60-second buckets). Each window has a counter per client. Increment on request; reject if counter exceeds limit.
```python
import time
import redis as redis_lib

redis = redis_lib.Redis()

def allow_request(client_id: str, limit: int, window_sec: int) -> bool:
    window_key = f"rl:{client_id}:{int(time.time() // window_sec)}"
    count = redis.incr(window_key)
    if count == 1:
        # First request in this window: set a TTL so the key expires on its own.
        redis.expire(window_key, window_sec)
    return count <= limit
```

Problem: boundary spike. A client can make limit requests at 11:59:59 and another limit requests at 12:00:00 — 2× the allowed rate in a two-second span. For strict rate limiting, this is unacceptable.
Use when: strict sub-window precision is not required. Simple dashboards, coarse abuse detection.
Algorithm 2: Sliding Window Log
Maintain a timestamped log of every request. On each new request, remove entries older than the window, count remaining entries, allow or reject.
```python
import time
import redis as redis_lib

redis = redis_lib.Redis()

def allow_request(client_id: str, limit: int, window_ms: int) -> bool:
    now = time.time_ns() // 1_000_000  # milliseconds
    window_start = now - window_ms
    key = f"rl_log:{client_id}"
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, '-inf', window_start)  # drop entries outside the window
    pipe.zcard(key)                                   # count what remains
    pipe.zadd(key, {str(now): now})                   # record this request (even if rejected)
    pipe.expire(key, window_ms // 1000 + 1)
    _, count, *_ = pipe.execute()
    return count < limit
```

Problem: memory cost. If a client sends 10,000 requests in a window, you store 10,000 timestamps in a sorted set. At scale, this is prohibitive.
Use when: you need exact precision and the per-client request volume is low.
Algorithm 3: Sliding Window Counter
A hybrid approach that divides time into fixed windows but approximates the sliding window by weighting the current and previous window counts:
```
approx_count = prev_window_count * (1 - elapsed_fraction) + curr_window_count
```

Where elapsed_fraction is how far into the current window you are (0 to 1). This gives a smooth approximation with O(1) storage per client.
This is what Cloudflare and many production rate limiters use. The approximation error is at most a few percent under uniform traffic — acceptable for nearly all use cases.
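A minimal single-node sketch of this weighting scheme — the SlidingWindowCounter class and its field names are illustrative, not a reference implementation:

```python
import time
from typing import Optional

class SlidingWindowCounter:
    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.curr_window = 0  # index of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_sec)
        if window != self.curr_window:
            # Roll forward: if more than one window elapsed, the previous count is 0.
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window
        elapsed_fraction = (now % self.window_sec) / self.window_sec
        approx = self.prev_count * (1 - elapsed_fraction) + self.curr_count
        if approx < self.limit:
            self.curr_count += 1
            return True
        return False
```

Note how the previous window's count fades out linearly as the current window fills up, which is exactly what suppresses the fixed-window boundary spike.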
Algorithm 4: Token Bucket
A token bucket holds up to capacity tokens. Tokens refill at a fixed rate (e.g., 100 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected.
Advantage over window-based approaches: it naturally handles bursts. A client that was idle can accumulate tokens and then burst above the per-second rate, up to capacity. This is often the desired behavior for API clients that batch requests.
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For a distributed token bucket: store (tokens, last_refill_timestamp) in Redis. On each request, load, compute refill, check, decrement, and save — atomically, via a Lua script, to avoid race conditions.
```lua
-- Redis Lua script for an atomic token bucket
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = now - last_refill
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    return 0  -- rejected
end
```

Distributed Rate Limiting: The Real Challenge
A single Redis node running one of the above algorithms can handle hundreds of thousands of counter updates per second, which is plenty for most services. The problem appears when you run multiple rate-limiter nodes without shared state.
If Client A sends 3 requests and they hit three different rate-limiter instances, each instance counts 1 request — all pass, even if the limit is 2. Without coordination, distributed limiters are inaccurate.
Option 1 — Centralized Redis: all limiter instances read/write the same Redis cluster. Accurate, but Redis becomes a bottleneck and a single point of failure. Mitigate with Redis Cluster (hash-slot sharding by client ID).
Option 2 — Sticky routing: route all requests from a given client ID to the same limiter instance using consistent hashing at the LB. No coordination needed. Fails if that instance is down or restarted.
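Sticky routing hinges on a consistent-hash ring at the load balancer. A toy sketch — the HashRing class, the node names, and the virtual-node count are all illustrative:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Each node appears vnodes times on the ring for even key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, client_id: str) -> str:
        # First ring position clockwise of the client's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[idx][1]
```

Because the mapping depends only on the client ID and the set of nodes, every request from a given client lands on the same limiter instance, and adding or removing a node remaps only a small fraction of clients.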
Option 3 — Approximate global counter: each limiter instance maintains a local counter and periodically (every 100ms) syncs with a central store. Local decisions use the approximate global count. Over-counts are possible during the sync interval; the system errs slightly toward allowing requests rather than rejecting them. This is a common production tradeoff.
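A toy simulation of the periodic-sync approach — CentralStore and LocalLimiter are hypothetical stand-ins for Redis and a limiter process, and a real deployment would flush on a background timer rather than an explicit sync() call:

```python
class CentralStore:
    """Stand-in for the shared store (e.g. Redis)."""
    def __init__(self):
        self.counts = {}

    def add(self, client_id: str, delta: int) -> int:
        # Apply a delta and return the new global count, like Redis INCRBY.
        self.counts[client_id] = self.counts.get(client_id, 0) + delta
        return self.counts[client_id]

class LocalLimiter:
    def __init__(self, store: CentralStore, limit: int):
        self.store = store
        self.limit = limit
        self.local = {}        # requests seen since the last sync
        self.global_view = {}  # last known global counts (possibly stale)

    def allow(self, client_id: str) -> bool:
        seen = self.global_view.get(client_id, 0) + self.local.get(client_id, 0)
        if seen >= self.limit:
            return False
        self.local[client_id] = self.local.get(client_id, 0) + 1
        return True

    def sync(self):
        # Called every ~100 ms: push local deltas, pull the global count.
        for client_id, delta in self.local.items():
            self.global_view[client_id] = self.store.add(client_id, delta)
        self.local.clear()
```

The simulation makes the tradeoff concrete: between syncs, each instance decides on a stale view, so the cluster can admit somewhat more than the limit, but it never blocks on the central store in the request path.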
Response Headers: The Often-Missed Part
A rate limiter that rejects requests without explanation is useless. Well-behaved rate limiters return:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714600800
Retry-After: 47
```

Retry-After is the most important: it tells the client exactly when to retry. Without it, clients implement their own backoff, often incorrectly, and create thundering herds at the reset boundary.
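One way to assemble these headers from limiter state — the rate_limit_headers helper and its arguments are illustrative, not a standard API:

```python
def rate_limit_headers(limit: int, remaining: int,
                       reset_epoch: int, now_epoch: int) -> dict:
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Seconds until the window resets: the value clients should sleep on.
        headers["Retry-After"] = str(max(0, reset_epoch - now_epoch))
    return headers
```

Computing Retry-After on the server from the actual reset time, rather than leaving clients to guess, is what spreads retries out instead of concentrating them at the window boundary.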
Key Takeaways
- Fixed window has O(1) cost but a 2× burst vulnerability at window boundaries — use only when coarse precision is acceptable.
- Sliding window log is exact but memory-expensive at high request volumes; use sliding window counter (weighted approximation) for production-grade precision at O(1) cost.
- Token bucket is the right algorithm when burst headroom is a feature, not a bug — it naturally accumulates capacity during idle periods.
- Distributed rate limiting requires a coordination strategy: centralized Redis (accurate but bottleneck), sticky routing (accurate but brittle), or periodic sync (approximate but resilient).
- Use Redis Lua scripts for all multi-step rate-limiter operations to ensure atomicity without WATCH/MULTI overhead.
- Return Retry-After on every 429 response — without it, clients create thundering herds at window reset boundaries.