Rate Limiting and Quotas
Series
API Design Mastery
Without rate limiting, one misbehaving client can saturate your API and degrade service for everyone else. This is not a theoretical risk: it happens regularly in production, more often because a client has a bug in its retry loop than because of a malicious actor. Rate limiting and quotas are the contract that gives every client a fair share of your API's capacity and lets that capacity scale predictably as your customer base grows.
Algorithms: Fixed Window, Sliding Window, Token Bucket
Fixed Window
Count requests within a fixed time window (e.g., per minute):
Window start: 09:00:00
Window end: 09:01:00
Limit: 100 requests
Problem: a client can send 100 requests at 08:59:59 and 100 more at 09:00:01, which is 200 requests in 2 seconds without triggering the limit.
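The boundary burst can be reproduced with a minimal in-memory sketch. The class name `FixedWindowLimiter`, the key `k123`, and the timestamp are illustrative assumptions, not part of any particular library; the clock is passed in explicitly so the scenario is deterministic.

```python
class FixedWindowLimiter:
    """Counts requests per key inside aligned, fixed-length windows (sketch only)."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> request count. Old windows are never purged here;
        # a real implementation would expire them.
        self._counts = {}

    def allow(self, key: str, now: float) -> bool:
        # Every timestamp inside the same window maps to the same integer index.
        window_index = int(now // self.window_seconds)
        bucket = (key, window_index)
        count = self._counts.get(bucket, 0)
        if count >= self.limit:
            return False
        self._counts[bucket] = count + 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
window_start = 1_699_183_200  # hypothetical timestamp that falls exactly on a minute boundary

# 100 requests one second before the boundary, 100 more one second after it.
before = sum(limiter.allow("k123", window_start - 1) for _ in range(100))
after = sum(limiter.allow("k123", window_start + 1) for _ in range(100))
burst_total = before + after  # every request is accepted, although all arrive within 2 seconds
```

Both batches pass because each one lands in a different window and each window starts with a fresh counter.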
Sliding Window
Track the request count for the last N seconds on a rolling basis using a sorted set in Redis:
ZADD rate:apikey:k123 1699183200.5 "req_1"
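ZREMRANGEBYSCORE rate:apikey:k123 -inf 1699183140.5
ZCARD rate:apikey:k123
The first command records the request with its timestamp as the score, the second drops entries older than 60 seconds (its upper bound is NOW - 60), and the third counts what remains. This eliminates the boundary burst problem but costs more memory and compute per check.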
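The same rolling-window logic can be sketched in memory, with a deque standing in for the Redis sorted set. This is an illustration under assumptions: `SlidingWindowLimiter` is a hypothetical name, and unlike the command sequence above it records only accepted requests.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Keeps a per-key log of accepted request timestamps (sketch only)."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._log = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key: str, now: float) -> bool:
        timestamps = self._log[key]
        # Drop entries that have left the rolling window (the ZREMRANGEBYSCORE step).
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()
        # Count what remains (the ZCARD step).
        if len(timestamps) >= self.limit:
            return False
        # Record the accepted request (the ZADD step).
        timestamps.append(now)
        return True


limiter = SlidingWindowLimiter(limit=100, window_seconds=60)
boundary = 1_699_183_200.0  # hypothetical minute boundary

first = sum(limiter.allow("k123", boundary - 1) for _ in range(100))
second = sum(limiter.allow("k123", boundary + 1) for _ in range(100))
later = limiter.allow("k123", boundary + 59)  # the first batch is now 60 seconds old
```

The second batch is rejected in full because the first 100 requests are still inside the last 60 seconds. The memory cost is visible too: one stored timestamp per accepted request.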
Token Bucket
Tokens accumulate at a fixed rate (refill rate) up to a maximum (burst capacity). Each request consumes one token. If the bucket is empty, the request is rejected.
Refill rate: 10 tokens/second
Burst capacity: 100 tokens
Token bucket is the best fit for most public APIs: it allows short bursts (a client can send 100 requests instantly if they have tokens saved up) while enforcing an average rate over time.
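A minimal sketch of the refill-and-consume logic, using the numbers above. The class and its lazy refill (tokens are topped up when a request arrives rather than by a background timer) are one common way to do it, shown here as an assumption rather than a reference implementation.

```python
class TokenBucket:
    """Token bucket with lazy refill; the caller supplies the clock (sketch only)."""

    def __init__(self, refill_rate: float, capacity: float, now: float) -> None:
        self.refill_rate = refill_rate  # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # start with a full bucket
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = max(self.updated_at, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(refill_rate=10, capacity=100, now=0.0)

burst = sum(bucket.allow(now=0.0) for _ in range(150))         # only the saved-up 100 pass
one_second_later = sum(bucket.allow(now=1.0) for _ in range(20))  # 10 tokens were refilled
```

After the initial burst drains the bucket, throughput settles at the refill rate: ten more requests per second of waiting.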
Quota Dimensions
Apply rate limits along several dimensions:
| Dimension | Scope | Example |
|---|---|---|
| Per API key | Single credential | 1,000 req/min |
| Per tenant/org | All keys in org | 10,000 req/min |
| Per endpoint | Specific operation | /search 100 req/min |
| Per user IP | Unauthenticated | 10 req/min |
| Per plan tier | Pricing plan | Free: 100/day, Pro: 10,000/day |
Check every applicable dimension on each request, starting with the most specific. If any bucket is exhausted, reject the request. Because all limits apply together, a tenant cannot use a single high-limit credential, such as an admin key, to bypass plan-level limits.
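One way to sketch the combined check is shown below. The `Counter` type, the `check_request` helper, and the tiny limits are hypothetical and chosen for readability; window resets are omitted. The sketch charges a request to every dimension only after all of them have room, which is a design choice, not something the rules above require.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Counter:
    """Usage counter for one limit dimension (window reset omitted for brevity)."""
    limit: int
    used: int = 0

    def has_room(self) -> bool:
        return self.used < self.limit


def check_request(counters: Dict[str, Counter]) -> Optional[str]:
    """Return the name of the first exhausted dimension, or None if the request is allowed."""
    for name, counter in counters.items():
        if not counter.has_room():
            return name
    # Charge every dimension only after all of them passed, so a rejected
    # request does not consume capacity elsewhere.
    for counter in counters.values():
        counter.used += 1
    return None


tenant = Counter(limit=3)  # shared by every key in the org
key_a = {"api-key": Counter(limit=2), "tenant": tenant}
key_b = {"api-key": Counter(limit=2), "tenant": tenant}

results = [
    check_request(key_a),
    check_request(key_a),
    check_request(key_a),  # key A has used its own 2 requests
    check_request(key_b),
    check_request(key_b),  # key B still has room, but the tenant total is spent
]
```

The last result is the interesting one: key B is under its own limit, yet the request is rejected because the shared tenant counter is exhausted.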
Standard Rate Limit Headers
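Use the header fields from the emerging IETF draft (draft-ietf-httpapi-ratelimit-headers). It is still a draft and its syntax has changed between revisions, so document which revision your API follows: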
HTTP/1.1 200 OK
RateLimit-Limit: 1000
RateLimit-Remaining: 487
RateLimit-Reset: 47
RateLimit-Policy: 1000;w=60
- RateLimit-Limit: the limit for the current window
- RateLimit-Remaining: requests remaining in this window
- RateLimit-Reset: seconds remaining until the window resets (the draft specifies delta-seconds, not a Unix timestamp)
- RateLimit-Policy: machine-readable description of the policy
For the rejection response:
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
RateLimit-Limit: 1000
RateLimit-Remaining: 0
RateLimit-Reset: 47
Retry-After: 47

{
  "type": "https://api.example.com/problems/rate-limited",
  "title": "Rate Limit Exceeded",
  "status": 429,
  "detail": "You have exceeded the limit of 1000 requests per minute.",
  "retryAfter": 47,
  "limitDimension": "api-key",
  "resetAt": "2025-11-12T09:01:00Z"
}
The Retry-After header tells clients exactly how long to wait. Clients that respect it stop hammering your API immediately.
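A server-side helper that assembles this rejection could look roughly like the sketch below. The function name and parameters are assumptions for illustration, the problem `type` URL is taken from the example above, and only some of the headers are produced. `Retry-After` is derived from the reset time so the header and the body cannot disagree.

```python
import math
from datetime import datetime, timezone


def build_rate_limited_response(limit, window_seconds, dimension, reset_at, now):
    """Return (status, headers, body) for a 429 response (sketch only)."""
    # Round up so a client that waits exactly Retry-After seconds is never early.
    retry_after = max(0, math.ceil((reset_at - now).total_seconds()))
    headers = {
        "Content-Type": "application/problem+json",
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "Retry-After": str(retry_after),
    }
    body = {
        "type": "https://api.example.com/problems/rate-limited",
        "title": "Rate Limit Exceeded",
        "status": 429,
        "detail": f"You have exceeded the limit of {limit} requests per {window_seconds} seconds.",
        "retryAfter": retry_after,
        "limitDimension": dimension,
        "resetAt": reset_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return 429, headers, body


now = datetime(2025, 11, 12, 9, 0, 13, tzinfo=timezone.utc)
reset_at = datetime(2025, 11, 12, 9, 1, 0, tzinfo=timezone.utc)
status, headers, body = build_rate_limited_response(1000, 60, "api-key", reset_at, now)
```

The body is returned as a plain dictionary; serializing it with `json.dumps` is left to the web framework in use.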
Fairness at the Tenant Level
Per-key limiting is not enough. One tenant with 1,000 API keys can bypass a 1,000 req/min per-key limit by distributing requests across keys. Always enforce a tenant-level aggregate limit.
Quota vs Rate Limit
Distinguish between rate limits (speed) and quotas (volume):
- Rate limit: 1,000 requests per minute. A rolling limit on throughput.
- Quota: 100,000 requests per month. A hard cap on total consumption.
Quotas are for billing and plan enforcement. Surface them through a separate endpoint or in a response header:
X-Quota-Limit: 100000
X-Quota-Used: 74821
X-Quota-Reset: 2025-12-01T00:00:00Z
Send warning notifications (email, webhook) when a tenant reaches 80% quota consumption; do not let them hit the wall without warning.
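The quota headers and the 80% warning can be sketched as two small helpers. Both function names are hypothetical. The warning check compares usage before and after a request so that a notification fires once, at the moment the threshold is crossed, and it uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def quota_headers(used: int, limit: int, reset_iso: str) -> dict:
    """Build the X-Quota-* headers shown above."""
    return {
        "X-Quota-Limit": str(limit),
        "X-Quota-Used": str(used),
        "X-Quota-Reset": reset_iso,
    }


def crossed_warning(previous_used: int, used: int, limit: int, warn_percent: int = 80) -> bool:
    """True only when this update moves usage from below the threshold to at or above it."""

    def at_or_above(n: int) -> bool:
        return n * 100 >= limit * warn_percent

    return at_or_above(used) and not at_or_above(previous_used)


headers = quota_headers(74_821, 100_000, "2025-12-01T00:00:00Z")

not_yet = crossed_warning(74_820, 74_821, 100_000)       # about 74.8% used
just_crossed = crossed_warning(79_999, 80_000, 100_000)  # reaches exactly 80%
already_warned = crossed_warning(80_000, 80_001, 100_000)
```

Only `just_crossed` is true, so a tenant hovering above 80% does not receive a notification on every request.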
Handling Rate-Limited Clients Well
A well-designed rate-limiting implementation helps clients recover instead of only rejecting them:
HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Scope: tenant
X-RateLimit-Policy-URL: https://api.example.com/docs/rate-limits
Include the scope that triggered the limit (api-key, tenant, endpoint) so clients know which limit to address. Link to documentation. When possible, distinguish between "slow down" (429) and "you're over quota for this billing period" (a different error code or a 402).
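From the client's side, the distinction might be consumed as in the sketch below. The function name and the three action labels are assumptions, header lookup is case-sensitive because a plain mapping is used, and only the delta-seconds form of Retry-After is parsed (the HTTP-date form falls back to a default wait).

```python
from typing import Mapping, Tuple


def next_action(status: int, headers: Mapping[str, str], default_wait: float = 1.0) -> Tuple[str, float]:
    """Return (action, seconds_to_wait), where action is "proceed", "wait", or "stop"."""
    if status == 402:
        # Over quota for the billing period: retrying will not help.
        return ("stop", 0.0)
    if status != 429:
        return ("proceed", 0.0)
    raw = headers.get("Retry-After", "").strip()
    # Retry-After may also be an HTTP-date; this sketch handles only whole seconds.
    wait = float(raw) if raw.isdigit() else default_wait
    return ("wait", wait)


throttled = next_action(429, {"Retry-After": "47", "X-RateLimit-Scope": "tenant"})
over_quota = next_action(402, {})
ok = next_action(200, {})
date_form = next_action(429, {"Retry-After": "Wed, 12 Nov 2025 09:01:00 GMT"})
```

A caller would sleep for the returned number of seconds on "wait" and surface an error to the user on "stop" rather than retrying.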
Key Takeaways
- Token bucket is the most practical algorithm for public APIs: it permits short bursts while enforcing average throughput.
- Apply rate limits at multiple dimensions — per key, per tenant, per endpoint — and enforce all of them on every request.
- Tenant-level aggregate limits prevent distributed bypass through multiple API keys.
- Use the standard RateLimit-* headers and always include Retry-After in 429 responses so clients can back off intelligently.
- Separate rate limits (throughput) from quotas (volume); quotas belong to billing plans and should trigger proactive warning notifications at 80% usage.
- Tell clients which limit dimension they hit in the error response — debugging a 429 that says nothing is one of the most frustrating developer experiences.