API Design Mastery

Rate Limiting and Quotas

Ravinder · 5 min read
API Design · REST · GraphQL · gRPC · Rate Limiting

Without rate limiting, one misbehaving client can saturate your API and degrade service for everyone else. This is not a theoretical risk — it happens regularly in production, often from a client with a bug in its retry loop rather than a malicious actor. Rate limiting and quotas are the contract that guarantees every client gets a fair share of your API's capacity, and that capacity can scale predictably as you grow your customer base.

Algorithms: Fixed Window, Sliding Window, Token Bucket

Fixed Window

Count requests within a fixed time window (e.g., per minute):

Window start: 09:00:00
Window end:   09:01:00
Limit:        100 requests

Problem: a client can send 100 requests at 08:59:59 and 100 more at 09:00:01 — 200 requests in 2 seconds without triggering a limit.
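A minimal in-process sketch makes the boundary burst concrete. The class name and limits below are illustrative, not a production implementation:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed-window counter: one counter per (key, window-start) pair."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (key, window_start) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        # All requests in the same 60s window share one counter.
        window_start = int(now // self.window) * self.window
        bucket = (key, window_start)
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True
```

Because the counter resets at the window boundary, 100 requests just before the boundary and 100 just after are all accepted, which is exactly the burst problem described above.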

Sliding Window

Track the request count for the last N seconds on a rolling basis using a sorted set in Redis:

ZADD rate:apikey:k123 1699183200.5 "req_1"
ZREMRANGEBYSCORE rate:apikey:k123 -inf (NOW-60
ZCARD rate:apikey:k123

Here NOW-60 stands for the computed cutoff timestamp, and the leading ( makes the bound exclusive in Redis syntax. This eliminates the boundary burst problem but costs more memory and compute per check.
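The same logic can be sketched in-process without Redis, using a deque of timestamps in place of the sorted set (illustrative only; a real deployment would keep this state in Redis as shown above):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-process equivalent of the Redis sorted-set check."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        q = self.events[key]
        # Drop timestamps older than the window (the ZREMRANGEBYSCORE step).
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:  # the ZCARD check
            return False
        q.append(now)             # the ZADD step
        return True
```

Unlike the fixed window, a burst straddling a minute boundary is counted against a single rolling 60-second window, so it is rejected.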

Token Bucket

Tokens accumulate at a fixed rate (refill rate) up to a maximum (burst capacity). Each request consumes one token. If the bucket is empty, the request is rejected.

Refill rate:    10 tokens/second
Burst capacity: 100 tokens

Token bucket is the best fit for most public APIs: it allows short bursts (a client can send 100 requests instantly if they have tokens saved up) while enforcing an average rate over time.

graph LR
    A[Token Source] -- "refill: 10/sec" --> B[(Bucket\ncapacity: 100)]
    C[Incoming Request] --> D{Tokens > 0?}
    D -- Yes --> E[Consume Token → Process]
    D -- No --> F[Reject → 429]
    B --> D
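A token bucket can be implemented with lazy refill, recomputing the token count from elapsed time on each check. This sketch uses the example rates above; the class name is illustrative:

```python
import time

class TokenBucket:
    """Token bucket with lazy refill: tokens are recomputed on each call."""

    def __init__(self, refill_rate=10.0, capacity=100.0):
        self.refill_rate = refill_rate  # tokens per second
        self.capacity = capacity        # burst capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lazy refill avoids a background timer: the bucket only needs to store a token count and a last-seen timestamp per client, which is cheap to keep in Redis or memory.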

Quota Dimensions

Rate limits should be applied at multiple dimensions:

Dimension        Scope                 Example
Per API key      Single credential     1,000 req/min
Per tenant/org   All keys in org       10,000 req/min
Per endpoint     Specific operation    /search: 100 req/min
Per user IP      Unauthenticated       10 req/min
Per plan tier    Pricing plan          Free: 100/day; Pro: 10,000/day

Evaluate the checks from most specific to most general; if any bucket is exhausted, reject the request. Because every applicable limit is enforced on every request, a tenant cannot use a single generously provisioned admin key to bypass its plan-level limits.
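The layered check can be sketched as a single function that walks the dimensions in order and reports which one tripped. The dimension names and limits here are illustrative:

```python
def check_limits(buckets, consumed):
    """Return the first exhausted dimension name, or None if the request is allowed.

    `buckets` maps dimension name -> limit; `consumed` maps name -> usage so far.
    Dimensions are checked most specific first.
    """
    for dimension in ("endpoint", "api_key", "tenant", "plan"):
        if dimension in buckets and consumed.get(dimension, 0) >= buckets[dimension]:
            return dimension
    # All checks passed: charge every dimension the request touches.
    for dimension in buckets:
        consumed[dimension] = consumed.get(dimension, 0) + 1
    return None
```

Returning the dimension name (rather than a bare boolean) is what lets the 429 response tell the client which limit it hit, a point the error-response section below comes back to.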

Standard Rate Limit Headers

Use the emerging IETF standard headers (draft-ietf-httpapi-ratelimit-headers):

HTTP/1.1 200 OK
RateLimit-Limit: 1000
RateLimit-Remaining: 487
RateLimit-Reset: 1699183260
RateLimit-Policy: 1000;w=60
  • RateLimit-Limit — the limit for the current window
  • RateLimit-Remaining — requests remaining in this window
  • RateLimit-Reset — Unix timestamp when the window resets
  • RateLimit-Policy — machine-readable description of the policy
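Deriving these headers from bucket state is mechanical. This sketch follows the timestamp form of RateLimit-Reset shown above (note that later revisions of the IETF draft express Reset as delta-seconds instead); the function name is illustrative:

```python
def ratelimit_headers(limit, remaining, reset_epoch, window_seconds):
    """Build the four RateLimit-* headers from the current bucket state."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),   # never negative
        "RateLimit-Reset": str(reset_epoch),             # Unix timestamp form
        "RateLimit-Policy": f"{limit};w={window_seconds}",
    }
```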

For the rejection response:

HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
RateLimit-Limit: 1000
RateLimit-Remaining: 0
RateLimit-Reset: 1699183260
Retry-After: 47
 
{
  "type": "https://api.example.com/problems/rate-limited",
  "title": "Rate Limit Exceeded",
  "status": 429,
  "detail": "You have exceeded the limit of 1000 requests per minute.",
  "retryAfter": 47,
  "limitDimension": "api-key",
  "resetAt": "2025-11-12T09:01:00Z"
}

The Retry-After header tells clients exactly how long to wait. Clients that respect it stop hammering your API immediately.
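From the client side, honoring Retry-After is a small loop. This sketch assumes a hypothetical `send()` callable returning an object with `.status` and `.headers`; it is not tied to any particular HTTP client library:

```python
import time

def request_with_backoff(send, max_attempts=5):
    """Call send() and retry on 429, honoring the server's Retry-After hint."""
    for attempt in range(max_attempts):
        resp = send()
        if resp.status != 429:
            return resp
        # Prefer the server's hint; fall back to exponential backoff.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Clients that follow this pattern converge on the server's advertised reset time instead of retrying blindly, which is precisely why returning Retry-After is worth the effort.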

Fairness at the Tenant Level

Per-key limiting is not enough. One tenant with 1,000 API keys can bypass a 1,000 req/min per-key limit by distributing requests across keys. Always enforce a tenant-level aggregate limit.

flowchart TD
    A[Request with API Key] --> B[Resolve Tenant ID from Key]
    B --> C{Check per-key bucket}
    C -- exhausted --> G[429 Rate Limited]
    C -- ok --> D{Check tenant bucket}
    D -- exhausted --> G
    D -- ok --> E{Check endpoint bucket}
    E -- exhausted --> G
    E -- ok --> F[Forward to Handler]

Quota vs Rate Limit

Distinguish between rate limits (speed) and quotas (volume):

  • Rate limit: 1,000 requests per minute. A rolling limit on throughput.
  • Quota: 100,000 requests per month. A hard cap on total consumption.

Quotas are for billing and plan enforcement. Surface them through a separate endpoint or in a response header:

X-Quota-Limit: 100000
X-Quota-Used: 74821
X-Quota-Reset: 2025-12-01T00:00:00Z

Send warning notifications (email, webhook) when a tenant reaches 80% quota consumption — do not let them hit the wall without warning.
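The 80% warning is a simple threshold-crossing check on each quota increment. The `notify` callable here is a placeholder for whatever email or webhook sender you use; the threshold and function shape are illustrative:

```python
def record_quota_usage(used, limit, notify):
    """Charge one request against the quota; fire notify() once when crossing 80%."""
    new_used = used + 1
    threshold = int(limit * 0.8)
    # Fire exactly once, on the increment that crosses the threshold.
    if used < threshold <= new_used:
        notify(f"Quota warning: {new_used}/{limit} requests used this period")
    return new_used
```

Checking the crossing (old value below, new value at or above) rather than the current value guarantees a single notification per billing period instead of one per request after 80%.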

Handling Rate-Limited Clients Well

A well-designed rate-limiting implementation helps clients recover rather than simply rejecting them:

HTTP/1.1 429 Too Many Requests
Retry-After: 47
X-RateLimit-Scope: tenant
X-RateLimit-Policy-URL: https://api.example.com/docs/rate-limits

Include the scope that triggered the limit (api-key, tenant, endpoint) so clients know which limit to address. Link to documentation. When possible, distinguish between "slow down" (429) and "you're over quota for this billing period" (a different error code or a 402).

Key Takeaways

  • Token bucket is the most practical algorithm for public APIs: it permits short bursts while enforcing average throughput.
  • Apply rate limits at multiple dimensions — per key, per tenant, per endpoint — and enforce all of them on every request.
  • Tenant-level aggregate limits prevent distributed bypass through multiple API keys.
  • Use the standard RateLimit-* headers and always include Retry-After in 429 responses so clients can back off intelligently.
  • Separate rate limits (throughput) from quotas (volume); quotas belong to billing plans and should trigger proactive warning notifications at 80% usage.
  • Tell clients which limit dimension they hit in the error response — debugging a 429 that says nothing is one of the most frustrating developer experiences.