Distributed Locks: Do You Actually Need One?
Distributed Locks Are a Last Resort
Whenever a team discovers a race condition in a distributed system, the first instinct is to reach for a distributed lock. Add a Redis lock, grab it before the critical section, release it after. Problem solved.
Except it is not. Distributed locks are fragile in ways that do not show up in development and bite you in production at the worst possible moment. They add latency on the happy path, introduce failure modes on the unhappy path, and require careful tuning that is easy to get wrong.
More often than not, there is a better answer. This post walks through the alternatives — optimistic concurrency, fencing tokens, idempotent operations, and single-writer patterns — and explains when a distributed lock is genuinely the right tool.
Why Distributed Locks Are Harder Than They Look
A simple Redis lock looks like this:
```python
import redis
import uuid

r = redis.Redis()

def acquire_lock(name: str, ttl_seconds: int = 10) -> str | None:
    token = str(uuid.uuid4())
    acquired = r.set(name, token, nx=True, ex=ttl_seconds)
    return token if acquired else None

def release_lock(name: str, token: str) -> bool:
    # Must be atomic: compare-and-delete via a Lua script
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return bool(r.eval(script, 1, name, token))
```

The Lua script for release is already a sign that this is not simple. But the deeper problems are:
Clock skew. The lock TTL is based on wall-clock time. If the clock on your Redis server or your application server is skewed, the TTL behaves unexpectedly.
Process pause. A GC pause, a VM migration, or a kernel stall can pause your process for longer than the lock TTL. The lock expires. Another process acquires it. Your process resumes — unaware that it no longer holds the lock — and proceeds to corrupt shared state.
Network partition. If your application cannot reach Redis to release the lock, the lock holds until TTL. Under load, this stacks up.
Redlock in particular. The Redlock algorithm (locking across N Redis nodes for a majority quorum) was proposed to address single-node Redis failures. Martin Kleppmann's analysis showed that it still fails under process pause scenarios. Redlock does not provide the safety guarantees it claims under asynchronous network conditions.
Alternative 1: Optimistic Concurrency Control
Optimistic concurrency does not lock anything. It reads a version number alongside the data, and when it writes, it includes a condition: "only update if the version is still N."
If two processes race to update the same record, both read version N. Both proceed to write. One succeeds. The other gets a conflict response, re-reads the current state, and retries.
```sql
-- Schema includes a version column
CREATE TABLE accounts (
    id UUID PRIMARY KEY,
    balance_cents BIGINT NOT NULL,
    version INT NOT NULL DEFAULT 0
);

-- Read
SELECT id, balance_cents, version FROM accounts WHERE id = $1;

-- Write with optimistic lock
UPDATE accounts
SET balance_cents = $2,
    version = version + 1
WHERE id = $1 AND version = $3;
-- If 0 rows updated: conflict, retry
```

The same check in application code, with the conflict surfaced as an exception and retried a bounded number of times:

```python
from dataclasses import dataclass

@dataclass
class Account:
    id: str
    balance_cents: int
    version: int

class OptimisticConflictError(Exception):
    pass

def transfer_funds(
    from_id: str, to_id: str, amount_cents: int, max_retries: int = 3
) -> None:
    for attempt in range(max_retries):
        try:
            with db.transaction():
                source = db.get_account(from_id)
                dest = db.get_account(to_id)
                if source.balance_cents < amount_cents:
                    raise InsufficientFundsError()
                rows_updated = db.execute(
                    "UPDATE accounts SET balance_cents = %s, version = version + 1 "
                    "WHERE id = %s AND version = %s",
                    (source.balance_cents - amount_cents, from_id, source.version),
                )
                if rows_updated == 0:
                    raise OptimisticConflictError()
                rows_updated = db.execute(
                    "UPDATE accounts SET balance_cents = %s, version = version + 1 "
                    "WHERE id = %s AND version = %s",
                    (dest.balance_cents + amount_cents, to_id, dest.version),
                )
                if rows_updated == 0:
                    raise OptimisticConflictError()
            return  # both updates succeeded
        except OptimisticConflictError:
            continue  # someone else won the race; re-read and retry
    raise OptimisticConflictError(f"Gave up after {max_retries} attempts")
```

Optimistic concurrency works well when:
- Conflicts are rare (most of the time, nobody is racing)
- Retry cost is low (re-reading and re-computing is cheap)
- The operation has bounded retries (not an infinite loop)
It does not work well for high-contention scenarios where many writers constantly race for the same record. Under high contention, most writes fail and retry, which is CPU waste and latency.
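At moderate contention, bounded retries with jittered backoff soften the thundering-herd effect, since colliding writers stop retrying in lockstep. A minimal sketch; the `apply_update` callable and the conflict exception are stand-ins for whatever your data layer raises:

```python
import random
import time

class OptimisticConflictError(Exception):
    pass

def with_optimistic_retry(apply_update, max_retries: int = 5,
                          base_delay: float = 0.01) -> None:
    """Run apply_update, retrying on version conflicts with jittered backoff."""
    for attempt in range(max_retries):
        try:
            apply_update()  # re-reads current version, writes with version check
            return
        except OptimisticConflictError:
            if attempt == max_retries - 1:
                raise
            # Full jitter: sleep a random fraction of an exponentially growing cap
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The jitter matters: without it, writers that collided once tend to retry at the same instant and collide again.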
Alternative 2: Fencing Tokens
A fencing token is a monotonically increasing number issued by a lock service. When you acquire the lock, you get a token (e.g., 42). You include that token in every write to the protected resource. The resource rejects writes with a token lower than the highest token it has seen.
The fencing token makes the storage layer the arbiter of safety, not the lock service. Even if the lock expires and another process acquires it, the storage layer rejects stale writes.
```python
# Storage layer enforces the fence
def write_with_fence(resource_id: str, data: dict, token: int) -> None:
    with db.transaction():
        current_token = db.get_token(resource_id)
        if token <= current_token:
            raise StaleFenceError(
                f"Token {token} rejected, current is {current_token}"
            )
        db.write(resource_id, data)
        db.set_token(resource_id, token)
```

Fencing tokens work well when:
- The protected resource can store and check the token
- You have a lock service that issues monotonically increasing tokens (ZooKeeper, etcd)
- Process pauses are a realistic concern in your environment
The limitation: the storage system must be aware of and enforce the fencing protocol. A third-party service that does not accept a fencing token in its API cannot be protected this way.
Alternative 3: Idempotent Operations
Many race conditions disappear if you design operations to be idempotent — safe to execute multiple times with the same result.
The canonical example: generating an invoice. If two processes both attempt to generate the invoice for order #123, you do not want two invoices. The naive solution is a lock around invoice generation. The better solution: make invoice generation idempotent.
```python
def generate_invoice(order_id: str, idempotency_key: str) -> Invoice:
    # Fast path: already generated with this key
    existing = db.get_invoice_by_key(idempotency_key)
    if existing:
        return existing  # Return the same result, no duplicate

    invoice = Invoice.from_order(order_id)
    try:
        # The unique constraint on idempotency_key prevents duplicates
        # even when two processes race past the check above
        db.insert_invoice(invoice=invoice, idempotency_key=idempotency_key)
    except UniqueViolationError:
        # Lost the race: return the row the winner inserted
        return db.get_invoice_by_key(idempotency_key)
    return invoice
```

```sql
-- The database enforces uniqueness; no application-level lock needed
CREATE TABLE invoices (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID NOT NULL,
    idempotency_key VARCHAR(255) UNIQUE NOT NULL,
    amount_cents BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

If two processes race to insert with the same idempotency key, one succeeds and one gets a unique constraint violation. The losing process reads the winner's result and returns it. The outcome is correct without any lock.
Stripe's API design is the reference implementation of this pattern. Every mutation accepts an idempotency key header. Clients can safely retry any request without risk of double-charging.
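The client side has one rule: generate the key once, before the first attempt, and reuse it across retries, so that a retry after a timeout looks identical to the server. A sketch with a hypothetical `post_invoice` transport function:

```python
import uuid

def create_invoice_with_retry(order_id: str, post_invoice, max_attempts: int = 3):
    # Generate the key ONCE and reuse it: a retry must carry the same
    # key, or each attempt becomes a new logical request
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            return post_invoice(order_id, idempotency_key)
        except TimeoutError as e:  # ambiguous outcome, safe to retry
            last_error = e
    raise last_error
```

The timeout case is the whole point: the client cannot know whether the first request landed, but with a stable key it does not need to know.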
Alternative 4: Single-Writer Pattern
Some state is genuinely easier to manage if exactly one process handles writes at a time. Instead of locks, you eliminate the contention by design.
The single-writer pattern: route all writes for a given entity (e.g., tenant ID, account ID) to a single process using a partitioned queue or consistent hashing.
Within each writer, operations are sequential. No locking needed — no two writers are processing the same entity simultaneously.
```python
import hashlib

def get_writer_for_entity(entity_id: str, num_writers: int = 3) -> int:
    # Stable hash: the same entity always maps to the same writer
    hash_val = int(hashlib.md5(entity_id.encode()).hexdigest(), 16)
    return hash_val % num_writers

# Kafka: partition by entity_id; Kafka guarantees order within a partition
producer.send(
    topic="account_writes",
    key=account_id.encode(),  # Same key → same partition → same consumer
    value=event_payload,
)
```

Kafka's partition model implements single-writer naturally. Publish writes to a topic partitioned by entity ID. Within a consumer group, each partition is assigned to exactly one consumer, which processes its writes sequentially. No locks, no contention.
The trade-off: writer reassignment during consumer group rebalancing requires care. A rebalance can briefly move a partition to a new consumer while the old consumer has an uncommitted write in flight.
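A toy in-process version of the pattern makes the guarantee concrete: one queue and one worker thread per shard, entities routed by hash, so all writes for a given entity are applied sequentially by the same thread. All names are illustrative:

```python
import hashlib
import queue
import threading

NUM_WRITERS = 3
queues = [queue.Queue() for _ in range(NUM_WRITERS)]
applied: dict[str, list[int]] = {}

def shard_for(entity_id: str) -> int:
    # Same entity always lands on the same queue, hence the same thread
    return int(hashlib.md5(entity_id.encode()).hexdigest(), 16) % NUM_WRITERS

def writer_loop(q: queue.Queue) -> None:
    while True:
        item = q.get()
        if item is None:  # shutdown sentinel
            return
        entity_id, value = item
        # No lock needed: only this thread ever touches this entity's state
        applied.setdefault(entity_id, []).append(value)

threads = [threading.Thread(target=writer_loop, args=(q,)) for q in queues]
for t in threads:
    t.start()

# Route writes: the same entity always goes to the same queue
for i in range(5):
    queues[shard_for("acct-1")].put(("acct-1", i))

for q in queues:
    q.put(None)
for t in threads:
    t.join()

print(applied["acct-1"])  # [0, 1, 2, 3, 4]: per-entity order preserved
```

Kafka plays the role of the queues here, with partitions as shards and consumers as the worker threads; the routing logic is the same.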
When a Distributed Lock Is Actually Right
After all the alternatives, when is a distributed lock the correct choice?
Exactly-once execution of a scheduled job. You have 5 instances of a cron worker. The job should run once. A lock with a TTL longer than the maximum expected job runtime and a single acquisition attempt is a clean solution here. Idempotency works too, but the lock is simpler.
External resources that are not idempotent. If you must call a third-party API that has no idempotency key support and charges per call, a lock prevents duplicate calls. The alternatives require the storage to cooperate — a non-cooperative external API cannot.
Brief critical sections with low contention. A write that takes 50ms with sub-1% contention rate is a reasonable use of a lock. The expected cost of contention is low, and the complexity of the alternatives is higher.
```python
# Acceptable use: prevent duplicate execution of a scheduled job
def run_monthly_billing_job():
    lock_name = f"billing:monthly:{current_month()}"
    token = acquire_lock(lock_name, ttl_seconds=3600)
    if not token:
        logger.info("Another instance is running the billing job, skipping")
        return
    try:
        process_monthly_billing()
    finally:
        release_lock(lock_name, token)
```

Even in this acceptable case, the lock must have a TTL that is longer than the maximum expected job runtime, and the job must be resumable if it is interrupted mid-way (otherwise a partial run followed by a lock expiry means corrupted state).
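Resumability can be as simple as recording each completed unit of work, so a rerun after a lock expiry skips what is already done. A sketch; `charge`, `is_done`, and `mark_done` are hypothetical helpers, with the latter two backed by something like a table keyed on (account_id, billing_month):

```python
def process_monthly_billing(accounts, charge, is_done, mark_done) -> None:
    """Charge each account at most once; safe to re-run after an interruption."""
    for account_id in accounts:
        if is_done(account_id):
            continue  # already billed in an earlier partial run
        charge(account_id)
        # Checkpoint only after the side effect succeeds. If the process
        # dies between charge and mark_done, the charge repeats on the
        # next run, so charge itself should also be idempotent.
        mark_done(account_id)
```

This composes with the lock rather than replacing it: the lock keeps concurrent runs from interleaving, and the checkpoint keeps a second run from redoing finished work.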
A Decision Framework
Work through the alternatives in order before reaching for a lock:
- Can the operation be made idempotent? Use an idempotency key with a unique constraint and let the database arbitrate.
- Do you control the storage, with low contention? Use optimistic concurrency with a version column and bounded retries.
- Can the storage layer store and check a token? Use fencing tokens from a lock service that issues monotonically increasing tokens.
- Is contention concentrated on specific entities? Partition writes by entity ID and use a single writer per partition.
- None of the above — a non-idempotent external API, a scheduled job, a brief low-contention critical section? Use a distributed lock, with a TTL longer than the worst-case runtime and a resumable critical section.
Key Takeaways
- Distributed locks fail silently under process pauses, clock skew, and network partitions — the lock expires while the holder still thinks it holds it.
- Redlock does not provide the safety guarantees it claims under asynchronous network conditions; process pauses invalidate its correctness proof.
- Optimistic concurrency (read version, write with version check) is the right default for low-contention write paths with a database you control.
- Fencing tokens move the enforcement to the storage layer, making it safe even if the lock expires mid-operation; the storage rejects stale writes by token number.
- Idempotency keys with database unique constraints eliminate most race conditions without any locking; Stripe's API is the reference implementation.
- Single-writer patterns remove contention by design: partition writes by entity ID and process each partition sequentially, using Kafka or consistent hashing to route.
- A distributed lock is appropriate for exactly-once job execution, non-idempotent external APIs, and brief critical sections with demonstrably low contention.