Idempotency Keys Done Right
A payment API that charges a customer twice is not a billing bug — it is a trust-destroying event. Yet most idempotency implementations I have audited will do exactly that under one of three conditions: the client retries before the first request finishes, the server crashes after charge but before storing the response, or the key expires while the client is still retrying.
The fix is not hard, but it requires thinking through each failure mode precisely. Let us do that.
What an Idempotency Key Actually Promises
An idempotency key says: for a given key, the server will execute the side-effectful operation at most once and will return the same response on every subsequent call carrying that key.
That is two promises, not one:
- At-most-once execution — the mutation fires once.
- Stable response replay — retries get the same response body and status code.
Most teams implement promise 1 and forget promise 2. The client then cannot distinguish "retry saw cached 200" from "retry triggered a second charge that also succeeded."
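To make promise 2 concrete from the client's side, here is a minimal retry loop (the `post` helper and endpoint are hypothetical, purely for illustration): the key is generated once and reused verbatim on every attempt, so the server can collapse all attempts into one execution and one response.

```python
import uuid

def charge_with_retries(post, amount_cents: int, max_attempts: int = 3):
    """
    `post` is a hypothetical HTTP helper: post(path, body, headers) -> (status, body).
    The same key is sent on every attempt so the server can deduplicate.
    """
    key = str(uuid.uuid4())  # generated ONCE, before the first attempt
    for _ in range(max_attempts):
        try:
            return post("/charges", {"amount_cents": amount_cents},
                        headers={"Idempotency-Key": key})
        except ConnectionError:
            continue  # retry with the SAME key, never a fresh one
    raise RuntimeError("charge failed after retries")
```

The one non-obvious rule: the key must be created before the first network attempt, not inside the retry loop, or every retry becomes a new logical request.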
Key Design
The key must be:
- Client-generated — never server-generated. The client must be able to produce the same key across process restarts and across network failures before the request lands.
- Request-scoped, not user-scoped — user_id is a terrible key. Use a UUID v4 or a deterministic hash of the operation parameters.
- Coupled to request body — if the body changes, the key must change. Accepting the same key for a different body is either an error or a security vulnerability.
```python
import uuid
import hashlib
import json

def generate_idempotency_key(operation: str, params: dict) -> str:
    """
    Deterministic key: survives process restarts, safe for retries.
    Use this when the client must regenerate the key without storage.
    """
    canonical = json.dumps({"op": operation, **params}, sort_keys=True)
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return f"{operation}:{digest[:32]}"

def generate_random_key() -> str:
    """
    Random key: simpler, but requires client-side persistence across retries.
    Prefer this for payment SDKs where the client can store the key.
    """
    return str(uuid.uuid4())
```

Storage Schema
The idempotency record needs to capture more than you think.
```sql
CREATE TABLE idempotency_keys (
    key             TEXT PRIMARY KEY,
    request_hash    TEXT NOT NULL,   -- SHA-256 of request body
    status          TEXT NOT NULL DEFAULT 'in_flight',
                                     -- in_flight | completed | failed
    response_status INT,
    response_body   JSONB,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at    TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ NOT NULL,
    lock_token      TEXT             -- used for concurrent request dedup
);

CREATE INDEX ON idempotency_keys (expires_at);  -- for TTL cleanup job
```

status is the underappreciated column. Without it you cannot distinguish "first request is still running" from "key does not exist."
The State Machine
The in_flight state is critical. A naive implementation only checks for existence and writes on completion. Under that design, a concurrent retry will execute the operation again before the first response is stored.
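The full lifecycle is small enough to write down explicitly. A minimal sketch — the state names follow the schema above, while the transition table and helper are illustrative, not part of any library:

```python
from enum import Enum

class KeyState(str, Enum):
    IN_FLIGHT = "in_flight"
    COMPLETED = "completed"
    FAILED = "failed"

# A key is born in_flight and settles exactly once; both settled
# states are terminal — retries against them only replay.
TRANSITIONS = {
    KeyState.IN_FLIGHT: {KeyState.COMPLETED, KeyState.FAILED},
    KeyState.COMPLETED: set(),
    KeyState.FAILED: set(),
}

def can_transition(src: KeyState, dst: KeyState) -> bool:
    return dst in TRANSITIONS[src]
```

Enforcing this in the database (`WHERE status = 'in_flight'` on every settle, as shown later) is what makes concurrent retries safe.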
Handling Concurrent Retries
Insert the key before executing the operation. Use a unique constraint to detect races.
```python
import hashlib
import json
from contextlib import contextmanager

class ConflictError(Exception): pass
class BadRequestError(Exception): pass

def sha256(body: dict) -> str:
    """Canonical hash of the request body; matches request_hash in the schema."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

@contextmanager
def idempotency_guard(conn, key: str, request_body: dict):
    # conn is a psycopg2 connection; fetch_key / mark_failed are small
    # helpers over the idempotency_keys table, defined elsewhere.
    body_hash = sha256(request_body)
    try:
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO idempotency_keys
                    (key, request_hash, status, expires_at)
                VALUES
                    (%s, %s, 'in_flight', NOW() + INTERVAL '24 hours')
                ON CONFLICT (key) DO NOTHING
                RETURNING key
            """, (key, body_hash))
            inserted = cur.fetchone()
            if inserted is None:
                # Key exists — check its state
                row = fetch_key(conn, key)
                if row["status"] == "in_flight":
                    raise ConflictError("Request already in flight. Retry after 2s.")
                if row["request_hash"] != body_hash:
                    raise BadRequestError("Key reuse with different body.")
                # completed or failed — replay the stored response
                yield {"replay": True, "row": row}
                return
        conn.commit()
        yield {"replay": False}
    except (ConflictError, BadRequestError):
        # Client-side errors: do NOT mark the key failed — the original
        # request may still be running or already completed.
        raise
    except Exception as e:
        mark_failed(conn, key, str(e))
        raise
```

Response Caching Strategy
Cache the exact HTTP response: status code, headers relevant to the client, and body. Do not cache derived data or re-serialize from a database entity — you will hit serialization drift after schema changes.
```python
import json
import logging

logger = logging.getLogger(__name__)

def complete_idempotency_key(conn, key: str, status_code: int, body: dict):
    with conn.cursor() as cur:
        cur.execute("""
            UPDATE idempotency_keys SET
                status = 'completed',
                response_status = %s,
                response_body = %s,
                completed_at = NOW()
            WHERE key = %s AND status = 'in_flight'
        """, (status_code, json.dumps(body), key))
        if cur.rowcount == 0:
            # Lost a race — log and return, do not double-write
            logger.warning("idempotency key %s already completed by concurrent request", key)
    conn.commit()

def replay_response(row: dict) -> HTTPResponse:
    return HTTPResponse(
        status=row["response_status"],
        body=row["response_body"],
        headers={"X-Idempotency-Replayed": "true"},
    )
```

The X-Idempotency-Replayed header is not cosmetic — it lets the client distinguish a fresh response from a cached one without parsing the body.
TTL: Longer Than You Think
Most teams set a 1-hour TTL. That is wrong for payment operations.
Consider the retry schedule of a well-behaved client: exponential backoff starting at 1s, capped at 60s, max 10 attempts — a total window of roughly 10 minutes. But the client may also retry after a process restart, or a mobile app may come back online hours later. Set the TTL to 24–72 hours for financial operations, 1 hour for idempotent reads, and 5 minutes for very short-lived operations like OTP validation.
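As a sanity check on that window, the arithmetic is easy to write down. Using the parameters quoted above, backoff sleep alone comes to about five minutes; per-attempt request timeouts push the full wall-clock window toward ten.

```python
def retry_window_seconds(base: float = 1.0, cap: float = 60.0, attempts: int = 10) -> float:
    """Total sleep for capped exponential backoff (no jitter): sum of min(base * 2^i, cap)."""
    return sum(min(base * 2**i, cap) for i in range(attempts))

# 1 + 2 + 4 + 8 + 16 + 32 + 60 + 60 + 60 + 60 = 303 seconds of sleep,
# before counting the time each attempt spends waiting on the server.
```

Either way, the retry window is minutes — which is exactly why a 1-hour TTL leaves no margin for restarts and offline clients.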
Clean up with a background job, not on read:
```sql
-- Run every 15 minutes
DELETE FROM idempotency_keys
WHERE expires_at < NOW()
  AND status IN ('completed', 'failed');
-- Do NOT delete in_flight keys; they may indicate a hung worker.
```

The Partial Failure Problem
The hardest case: the database write for the business operation succeeds, then the server crashes before writing the idempotency response. On retry:
- The key status is still in_flight.
- The operation already happened.
You have two options:
Option A — Fencing token + idempotent business write. The business operation itself is idempotent keyed on the same key. On retry, the business DB upsert is a no-op. Promote the key to completed and replay.
Option B — Stuck in_flight detection. If status = 'in_flight' and NOW() - created_at > threshold, treat it as a failed operation and return an error the client can retry with a new key.
Option A is correct. Option B is pragmatic when you cannot make the business write idempotent (e.g., calling a third-party payment processor).
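A toy model of Option A helps make the no-op behavior concrete. The dict below stands in for a payments table with a unique index on the idempotency key (table and column names are made up for illustration); in SQL this is `INSERT ... ON CONFLICT (idempotency_key) DO NOTHING`.

```python
payments: dict[str, dict] = {}

def record_payment(key: str, amount_cents: int) -> bool:
    """
    Idempotent business write keyed on the idempotency key.
    Returns True if this call performed the write, False if it was a no-op.
    """
    if key in payments:  # analogous to ON CONFLICT (idempotency_key) DO NOTHING
        return False
    payments[key] = {"amount_cents": amount_cents, "status": "captured"}
    return True
```

On a crash-and-retry, the second call returns False, the handler promotes the key to completed, and the customer is charged exactly once.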
Conflict Handling: Wrong Body, Same Key
If a client sends the same key with a different request body, reject it with 422:
```python
def validate_request_hash(row: dict, incoming_hash: str):
    if row["request_hash"] != incoming_hash:
        raise UnprocessableEntityError(
            "Idempotency key reuse with different request body. "
            "Generate a new key for a new request."
        )
```

Do not silently accept it. A key collision on different bodies is almost always a client bug — surface it loudly.
Edge Cases Worth Handling
Clock skew. If your idempotency store is distributed, NOW() comparisons for stuck in-flight detection may be unreliable. Use a monotonic sequence or a dedicated timeout service rather than wall-clock comparisons.
Key rotation under A/B deployment. If you deploy a new version that changes the request body schema, old keys may collide with new request shapes. Version the key namespace: v2:payment:<uuid>.
Distributed idempotency store. Redis is popular here. Use SET key value NX EX ttl for atomic insert-if-absent. But Redis is not durable by default — use appendfsync always or accept the risk that a crash loses in-flight records.
```
# Redis atomic idempotency insert
SET "idem:abc123" '{"status":"in_flight","hash":"..."}' NX EX 86400
# Returns OK if inserted, nil if key exists
```

Key Takeaways
- Insert the idempotency record in in_flight state before executing the operation — not after.
- Cache the raw HTTP response (status + body), not a derived entity, to survive schema changes.
- Set TTL to match your longest realistic retry window — 24 hours for payments, not 1 hour.
- Return 409 Conflict for concurrent in-flight retries; return the cached response for completed/failed replays.
- Validate request body hash on every call — key reuse with a different body is a bug, not a feature.
- For partial failures, prefer making the business write idempotent over stuck-detection heuristics.