
Rate Limiting Beyond Token Bucket

Ravinder · 11 min read

Architecture · Rate Limiting · Distributed Systems · API Design

Most teams implement rate limiting the same way: they add a library, configure a token bucket with some number of requests per minute, and call it done. Then they hit production. The noisy tenant problem appears — one customer consuming 80% of the allowed burst capacity and degrading response times for everyone else. The distributed system problem appears — rate limit counters living in a single Redis instance that becomes the bottleneck they were trying to prevent. The fairness problem appears — a limit that is technically enforced but practically punishes burst-heavy workloads like report generation while permitting a steady stream of cheap requests that consume more total backend resources.

Token bucket is not wrong. It is just the beginning of the problem space, not the end of it. This post covers the full algorithm landscape — token bucket, leaky bucket, fixed window, sliding window log, sliding window counter — and then the harder problems: weighted fair queuing, per-tenant fairness, and running a limiter that does not become a single point of failure.

Why Rate Limiting Is Harder Than It Looks

Rate limiting has two distinct goals that are in tension:

  1. Protection. Prevent a single client from overwhelming the backend.
  2. Fairness. Ensure that a client under the limit does not suffer because another client is approaching theirs.

Most algorithms optimize for protection. Fairness is the harder, less-discussed problem. If you have 1,000 tenants each with a 100 req/s limit, and one tenant is at 99 req/s while 999 are at 1 req/s, the system is "fair" by limit compliance but the 99 req/s tenant is consuming disproportionate backend resources. Weighted fair queuing addresses this. But first, the algorithms.

Token Bucket

The token bucket algorithm is the most common because it naturally handles bursty traffic. A bucket holds up to capacity tokens. Tokens refill at a fixed rate. Each request costs one token. If the bucket is empty, the request is rejected.

import time
import threading
 
class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()
 
    def consume(self, tokens: int = 1) -> bool:
        with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.refill_rate,
            )
            self.last_refill = now
 
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

Pros: Allows controlled bursting up to capacity. Simple. Low memory footprint (one bucket per client).

Cons: Burstiness can arrive at exactly the wrong time. A client that lets the bucket fill can spend all capacity tokens instantly, and because refill is continuous it keeps consuming at the refill rate immediately afterward: over a short interval the observed rate approaches capacity plus refill, nearly double what the steady-state limit suggests. Many clients bursting simultaneously (say, cron-driven jobs at the top of the hour) produce a thundering-herd start against the backend.

Leaky Bucket

Leaky bucket enforces a strict constant output rate. Requests enter the bucket (a queue); they leave at a fixed rate. If the bucket overflows, requests are rejected.

The distinction from token bucket is subtle but operationally significant: token bucket allows bursts (it is concerned with input rate over a window), leaky bucket enforces output rate (it is concerned with what reaches the backend per unit time).

Leaky bucket is the right choice when the backend is sensitive to concurrent load — a database that can handle 100 writes per second but falls over at 200. The constant output rate protects the backend from bursts that a token bucket would permit.
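
For reference, a minimal in-process sketch of the queued variant. The offer/drain split is illustrative and assumes some worker loop calls drain() on a timer:

import time
import threading
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity      # max queued requests
        self.leak_rate = leak_rate    # requests released per second
        self.queue: deque = deque()
        self.last_leak = time.monotonic()
        self._lock = threading.Lock()

    def offer(self, request: dict) -> bool:
        # Enqueue the request, or reject it if the bucket is full.
        with self._lock:
            if len(self.queue) >= self.capacity:
                return False
            self.queue.append(request)
            return True

    def drain(self) -> list:
        # Release the requests that have "leaked" since the last call.
        with self._lock:
            now = time.monotonic()
            n = int((now - self.last_leak) * self.leak_rate)
            if n == 0:
                return []
            self.last_leak = now
            return [self.queue.popleft() for _ in range(min(n, len(self.queue)))]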

Cons: Poor user experience for bursty, legitimate workloads. A user who submits a 50-item batch job will have it trickle out over 50 seconds at 1/s even if the backend has spare capacity.

Fixed Window Counter

A counter per time window (e.g., per minute). Simple. Cheap. Wrong.

Window 1: [00:00 - 01:00]   99 requests
Window 2: [01:00 - 02:00]   99 requests

A client who makes 99 requests in the last second of window 1 and 99 requests in the first second of window 2 makes 198 requests in a 2-second span — but never exceeds the per-minute limit. The boundary attack is real and easy.

Fixed window counter is only acceptable as a rough backstop when the exact enforcement boundary does not matter (e.g., billing quotas that are reconciled daily). Never use it as a primary rate limiting mechanism for API protection.
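
For completeness, a sketch of the counter itself; the reset at every window edge is exactly where the boundary attack lives:

import time

class FixedWindowCounter:
    def __init__(self, window_seconds: int, max_requests: int):
        self.window = window_seconds
        self.max_requests = max_requests
        self.count = 0
        self.window_index = int(time.time() / window_seconds)

    def allow(self) -> bool:
        current = int(time.time() / self.window)
        if current != self.window_index:
            # All history is discarded at the boundary -- this is the flaw.
            self.window_index = current
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False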

Sliding Window Log

The sliding window log tracks every request timestamp. The count for a given window is the number of timestamps within the last N seconds.

import time
from collections import deque
 
class SlidingWindowLog:
    def __init__(self, window_seconds: int, max_requests: int):
        self.window = window_seconds
        self.max_requests = max_requests
        self.log: deque = deque()
 
    def allow(self) -> bool:
        now = time.time()
        cutoff = now - self.window
 
        # Remove expired entries
        while self.log and self.log[0] < cutoff:
            self.log.popleft()
 
        if len(self.log) < self.max_requests:
            self.log.append(now)
            return True
        return False

Pros: Accurate. No boundary attack. The allowed count is always exactly the requests in the last N seconds.

Cons: Memory scales with admitted traffic, bounded by the limit: a client allowed 1,000 requests per minute can require up to 1,000 stored timestamps at any moment. At scale across many clients, this is significant. Redis Sorted Sets make this tractable (score = timestamp, member = a unique request ID), but the memory cost is real.
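
A sketch of the sorted-set version using standard Redis commands (key naming and client setup are illustrative). Note the count-then-add sequence here is not atomic; under real concurrency you would fold it into a Lua script, as discussed below:

import time
import uuid
import redis

r = redis.Redis()

def allow_sliding_log(client_id: str, window: int, max_requests: int) -> bool:
    key = f"ratelimit:log:{client_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)  # drop expired timestamps
    pipe.zcard(key)                              # count what remains
    _, count = pipe.execute()
    if count >= max_requests:
        return False
    # Unique member per request so concurrent identical timestamps don't collapse.
    r.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    r.expire(key, window)
    return True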

Sliding Window Counter

The sliding window counter is a hybrid that approximates the sliding window log with O(1) memory. It keeps two fixed-window counters — the current window and the previous window — and interpolates based on the elapsed position within the current window.

import time
import math
 
class SlidingWindowCounter:
    def __init__(self, window_seconds: int, max_requests: int):
        self.window = window_seconds
        self.max_requests = max_requests
        self.current_count = 0
        self.previous_count = 0
        self.window_start = math.floor(time.time() / window_seconds) * window_seconds
 
    def allow(self) -> bool:
        now = time.time()
        current_window_start = math.floor(now / self.window) * self.window
 
        if current_window_start != self.window_start:
            # Window rolled over. If more than one full window has elapsed,
            # the previous window saw no requests at all.
            if current_window_start - self.window_start >= 2 * self.window:
                self.previous_count = 0
            else:
                self.previous_count = self.current_count
            self.current_count = 0
            self.window_start = current_window_start
 
        # Position within current window (0.0 to 1.0)
        position = (now - self.window_start) / self.window
        # Weighted count: previous window's contribution decreases as we move through current window
        weighted_count = self.previous_count * (1 - position) + self.current_count
 
        if weighted_count < self.max_requests:
            self.current_count += 1
            return True
        return False

Pros: O(1) memory regardless of traffic. Highly accurate for smooth traffic distributions. No boundary attack.

Cons: Slight inaccuracy for very bursty traffic at window boundaries. For most API protection use cases, the approximation error is negligible (typically <1%).

Algorithm Comparison

quadrantChart
    title Rate Limiting Algorithms
    x-axis Low Memory Usage --> High Memory Usage
    y-axis Low Accuracy --> High Accuracy
    quadrant-1 High Memory, High Accuracy
    quadrant-2 Low Memory, High Accuracy
    quadrant-3 Low Memory, Low Accuracy
    quadrant-4 High Memory, Low Accuracy
    Token Bucket: [0.2, 0.65]
    Leaky Bucket: [0.15, 0.70]
    Fixed Window Counter: [0.1, 0.15]
    Sliding Window Log: [0.85, 0.95]
    Sliding Window Counter: [0.15, 0.88]

For most API protection use cases, the sliding window counter is the right default: low memory, high accuracy, no boundary attack.

Weighted Fair Queuing

When you have multiple tenants sharing a pool of backend capacity, weighted fair queuing (WFQ) ensures that each tenant gets a fair share proportional to their tier, even if some tenants are below their limit.

The idea: instead of a per-tenant limit, assign each tenant a weight. Schedule requests from each tenant proportionally to their weight.

import heapq
import time
from dataclasses import dataclass, field
 
@dataclass(order=True)
class TenantQueue:
    virtual_finish_time: float
    tenant_id: str = field(compare=False)
    request: dict = field(compare=False)
 
class WeightedFairQueue:
    def __init__(self, tenant_weights: dict[str, float]):
        self.weights = tenant_weights
        self.virtual_time: dict[str, float] = {}
        self.heap: list[TenantQueue] = []
 
    def enqueue(self, tenant_id: str, request: dict) -> None:
        weight = self.weights.get(tenant_id, 1.0)
        # Virtual finish time = virtual start time + cost/weight
        vst = max(self.virtual_time.get(tenant_id, 0.0), time.monotonic())
        vft = vst + (1.0 / weight)
        self.virtual_time[tenant_id] = vft
        heapq.heappush(self.heap, TenantQueue(vft, tenant_id, request))
 
    def dequeue(self) -> dict | None:
        if not self.heap:
            return None
        item = heapq.heappop(self.heap)
        return item.request

A tenant with weight 2.0 gets twice the throughput of a weight 1.0 tenant, but neither starves when the other is idle. This is significantly more fair than independent per-tenant limits where one tenant's idle capacity goes unused rather than being redistributed.
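
A quick check of the interleaving, with illustrative tenant IDs:

wfq = WeightedFairQueue({"tenant_a": 2.0, "tenant_b": 1.0})
for i in range(4):
    wfq.enqueue("tenant_a", {"id": f"a{i}"})
    wfq.enqueue("tenant_b", {"id": f"b{i}"})

order = []
while (req := wfq.dequeue()) is not None:
    order.append(req["id"])
# tenant_a is scheduled roughly twice as often while both queues are
# non-empty, e.g. a0, a1, b0, a2, a3, b1, b2, b3.
print(order)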

Distributed Rate Limiting with Redis

In a distributed system, rate limit state must live somewhere shared. Redis is the standard choice. The key operations — increment a counter and check against a limit — must be atomic to prevent race conditions.

Use a Lua script for atomic check-and-increment:

-- KEYS[1]: rate limit key (e.g., "ratelimit:tenant_1:2025071510")
-- ARGV[1]: max requests
-- ARGV[2]: window in seconds
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
 
local current = redis.call("INCR", key)
 
if current == 1 then
    redis.call("EXPIRE", key, window)
end
 
if current > limit then
    return 0  -- Denied
else
    return 1  -- Allowed
end

Register the script once at startup and invoke it on every request; from Python with redis-py:

import time
import redis
 
r = redis.Redis()
CHECK_AND_INCREMENT = r.register_script("""
    local key = KEYS[1]
    local limit = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local current = redis.call("INCR", key)
    if current == 1 then
        redis.call("EXPIRE", key, window)
    end
    if current > limit then return 0 else return 1 end
""")
 
def is_allowed(tenant_id: str, limit: int, window: int) -> bool:
    window_key = int(time.time() / window)
    key = f"ratelimit:{tenant_id}:{window_key}"
    result = CHECK_AND_INCREMENT(keys=[key], args=[limit, window])
    return bool(result)
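
One caveat: window-bucketed keys make this a fixed window counter, the algorithm dismissed above. It is acceptable as a building block, but to get sliding-window-counter semantics in Redis, read the previous window's key as well and interpolate. A sketch (key naming is illustrative; the {tenant} hash tag keeps both keys on one Cluster shard):

import time
import redis

r = redis.Redis()

SLIDING_CHECK = r.register_script("""
    -- KEYS[1]: current window key, KEYS[2]: previous window key
    -- ARGV[1]: max requests, ARGV[2]: window seconds, ARGV[3]: position (0.0-1.0)
    local limit = tonumber(ARGV[1])
    local window = tonumber(ARGV[2])
    local position = tonumber(ARGV[3])
    local current = tonumber(redis.call("GET", KEYS[1]) or "0")
    local previous = tonumber(redis.call("GET", KEYS[2]) or "0")
    if previous * (1 - position) + current < limit then
        if redis.call("INCR", KEYS[1]) == 1 then
            -- Keep the key alive long enough to serve as next window's "previous"
            redis.call("EXPIRE", KEYS[1], window * 2)
        end
        return 1
    end
    return 0
""")

def is_allowed_sliding(tenant_id: str, limit: int, window: int) -> bool:
    now = time.time()
    idx = int(now / window)
    position = (now % window) / window
    keys = [f"ratelimit:{{{tenant_id}}}:{idx}", f"ratelimit:{{{tenant_id}}}:{idx - 1}"]
    return bool(SLIDING_CHECK(keys=keys, args=[limit, window, position]))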

Redis single-instance: Fine for most workloads. Redis can handle hundreds of thousands of operations per second. The risk is single-point-of-failure; mitigate with Redis Sentinel or Redis Cluster.

Redis Cluster: Shard rate limit keys by tenant prefix. All keys for a tenant land on the same shard via hash tags: {tenant_1}:ratelimit:window. This keeps tenant operations local to a single shard without cross-slot transactions.

Local limiter with periodic sync: For the highest throughput requirements, run a local in-process token bucket and sync to Redis every 100ms. Accept slightly stale counts in exchange for no per-request network hop. Each server instance takes a fraction of the global limit, periodically pushes its consumed count to Redis, and pulls fresh allowance from the global pool. A sketch follows.
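
This sketch reuses the TokenBucket class from above. The static per-instance share and the key naming are illustrative; production versions rebalance shares from the global pool rather than dividing statically:

import threading
import time
import redis

class LocalSyncLimiter:
    def __init__(self, tenant_id: str, global_rate: float, num_instances: int,
                 sync_interval: float = 0.1):
        self.key = f"ratelimit:global:{tenant_id}"
        self.redis = redis.Redis()
        # Static share of the global rate; a rebalancer would adjust this.
        share = global_rate / num_instances
        self.bucket = TokenBucket(capacity=int(share), refill_rate=share)
        self._consumed = 0
        self._lock = threading.Lock()
        t = threading.Thread(target=self._sync_loop, args=(sync_interval,), daemon=True)
        t.start()

    def allow(self) -> bool:
        # No network hop: admission is decided entirely in-process.
        if self.bucket.consume():
            with self._lock:
                self._consumed += 1
            return True
        return False

    def _sync_loop(self, interval: float) -> None:
        while True:
            time.sleep(interval)
            with self._lock:
                n, self._consumed = self._consumed, 0
            if n:
                # Publish this instance's usage so a rebalancer (or dashboards)
                # can observe global consumption.
                self.redis.incrby(self.key, n)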

Per-Tenant Fairness Beyond Simple Limits

Flat rate limits are blunt instruments. A more sophisticated model:

  • Tier-based limits: Free tier at 100 req/min, Pro at 1,000 req/min, Enterprise at custom.
  • Cost-weighted requests: Not all requests are equal. A POST /export/pdf might cost 10x a GET /user/profile. Weight the token cost accordingly (a sketch follows the diagram below).
  • Adaptive limits: Detect when the system is under load and temporarily reduce limits proportionally across all tenants, restoring when load drops.
  • Burst credits: Allow tenants to accumulate unused capacity up to a burst ceiling (e.g., 10x their base rate, capped at 30 seconds of burst).

flowchart TD
    Request["Incoming Request"] --> CostCalc["Calculate Request Cost\n(1x for reads, 10x for exports)"]
    CostCalc --> TierCheck["Lookup Tenant Tier\n& Weight"]
    TierCheck --> Limiter["Weighted Token Bucket\n(Redis Lua)"]
    Limiter -->|Allowed| Backend["Backend Services"]
    Limiter -->|Denied| Response["429 Too Many Requests\nRetry-After header"]
    style Limiter fill:#4f46e5,color:#fff
    style Response fill:#dc2626,color:#fff
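
Cost weighting drops straight into the token bucket shown earlier: map each route to a token cost and pass it to consume(). The cost table below is illustrative:

# Illustrative per-route costs; tune these to measured backend load.
REQUEST_COSTS = {
    ("GET", "/user/profile"): 1,
    ("POST", "/export/pdf"): 10,
}

def handle(method: str, path: str, bucket: TokenBucket) -> bool:
    cost = REQUEST_COSTS.get((method, path), 1)
    return bucket.consume(cost)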

Response Headers That Actually Help

When you reject a request, the response must tell the client when they can try again. The standard headers:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1721044800
Retry-After: 47
Content-Type: application/json
 
{"error": "rate_limit_exceeded", "retry_after_seconds": 47}

Retry-After is the most important header. Without it, clients implement their own backoff heuristics, which are usually worse than yours. Give them the exact number of seconds. X-RateLimit-Reset is the Unix timestamp when the window resets; include both for clients that prefer epoch timestamps.

Include these headers on every request, not just 429 responses. Clients who monitor X-RateLimit-Remaining can self-throttle before hitting the limit.

Key Takeaways

  • Token bucket is the right default for burst-tolerant API protection; sliding window counter is better when boundary attack resistance matters and memory is constrained.
  • Fixed window counter has a well-known boundary attack vulnerability; do not use it as a primary protection mechanism.
  • Weighted fair queuing distributes capacity proportionally across tenants instead of applying flat limits, preventing noisy-tenant problems even within the limit envelope.
  • Redis Lua scripts provide the atomic check-and-increment required for correct distributed rate limiting; avoid non-atomic read-then-write patterns across network calls.
  • Cost-weighted request pricing (expensive operations cost more tokens) is more accurate than flat per-request limits for protecting backend resources.
  • Always include Retry-After and X-RateLimit-Remaining headers; give clients the information to self-throttle and avoid building their own worse heuristics.