
The Cost of Consistency

Ravinder · 6 min read
Database Internals · Architecture · Consistency · Distributed Systems · CAP Theorem

Consistency is not binary. Engineers talk about "strongly consistent" and "eventually consistent" databases as if those are the only two options, but the consistency spectrum has many gradations—and each gradation trades something real. Moving from eventual consistency to strict serializable consistency can multiply your write latency by 5–50x depending on deployment topology. That is not a reason to avoid strong consistency; it is a reason to be deliberate about where you need it and where you do not.

The Consistency Spectrum

From weakest to strongest, the main consistency levels you encounter in practice:

| Level | Guarantee | Example Systems |
|---|---|---|
| Eventual | All replicas converge eventually; individual reads may be stale | DynamoDB (default), Cassandra ONE, DNS |
| Monotonic reads | You never read an older value than one you already read | Session-pinned reads |
| Read-your-writes | You always see your own writes | Many SQL DBs with session affinity |
| Causal | Causally related operations appear in order to all nodes | MongoDB causal sessions, Cosmos DB |
| Sequential | All operations appear in some global order (not necessarily real-time) | Rarely offered directly |
| Linearizable | Operations appear instantaneous at some point in their real-time interval | etcd, Spanner, CockroachDB, ZooKeeper |
| Strict serializable | Linearizability + full transaction serializability | Spanner, CockroachDB with SSI |

Most production systems need different levels for different data. Your product recommendation cache can be eventually consistent; your payment balance cannot.

The Latency Math

Eventual Consistency

A write succeeds as soon as the local replica acknowledges it. Reads return whatever value the nearest replica has.

Write latency = local disk write ≈ 1–5ms
Read latency  = local replica read ≈ 1–5ms
Staleness window = replication lag ≈ 10ms to seconds

This is the minimum achievable latency. Any stronger guarantee costs more.
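
To make the staleness window concrete, here is a toy model of eventual consistency (illustrative only, not any real database's replication protocol): the write acknowledges as soon as the local replica has it and replicates in the background, so a read from another replica inside the lag window returns stale data.

import time
import threading

# Toy model: a primary acks writes immediately and replicates to a
# follower after an artificial lag. All names and numbers are assumptions.
PRIMARY, FOLLOWER = {}, {}
REPLICATION_LAG_S = 0.05  # assumed 50ms replication lag

def write(key, value):
    PRIMARY[key] = value  # ack as soon as the local replica has the value
    threading.Timer(REPLICATION_LAG_S, FOLLOWER.__setitem__,
                    args=(key, value)).start()  # asynchronous replication

def read_nearest(key):
    return FOLLOWER.get(key)  # nearest replica; may lag the primary

write("balance", 100)
print(read_nearest("balance"))  # None: read landed inside the staleness window
time.sleep(0.1)                 # wait out the replication lag
print(read_nearest("balance"))  # 100: replicas have converged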

Quorum Consistency

In a system with N replicas, a write requires W acknowledgments and a read requires R acknowledgments, with W + R > N ensuring overlap.

# For N=3, W=2, R=2:
# Write waits for 2 of 3 replicas → one round trip to the 2nd-fastest replica
# Read contacts 2 of 3 replicas → returns the newest version among them
 
write_latency = max_rtt_to_quorum_member + local_disk_write
# e.g., 5ms network + 5ms disk = ~10ms
 
read_latency  = max_rtt_to_quorum_member_for_read
# e.g., ~5ms network

With W + R > N the read and write quorums overlap, so a quorum read returns the latest acknowledged write under normal operation. At the minimal overlap W + R = N + 1 (as in N=3, W=2, R=2), node failures and Dynamo-style sloppy quorums with hinted handoff can still surface stale reads. Raising W to N (here, W=3) eliminates that staleness, but it adds latency and makes writes unavailable whenever any single replica is down.
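
The latency arithmetic generalizes: an operation completes when the W-th (or R-th) fastest replica responds, so the slowest needed quorum member sets the latency. A sketch of that calculation, with RTT values that are assumptions for illustration:

def quorum_latency_ms(replica_rtts_ms, acks_needed, local_work_ms=0.0):
    # The operation completes when the acks_needed-th fastest replica
    # responds, so sort the RTTs and take that member's round trip.
    return sorted(replica_rtts_ms)[acks_needed - 1] + local_work_ms

rtts = [1, 5, 70]  # assumed: co-located, same-region, cross-region replica

# N=3, W=2, R=2: W + R > N, so read and write quorums overlap
print(quorum_latency_ms(rtts, acks_needed=2, local_work_ms=5))  # 10 ms write
print(quorum_latency_ms(rtts, acks_needed=2))                   # 5 ms read
print(quorum_latency_ms(rtts, acks_needed=3, local_work_ms=5))  # 75 ms if W=3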

Linearizable Reads

A linearizable read must contact the leader (in Raft) or perform a quorum read with fencing to ensure it sees the latest committed value. In a multi-region setup:

Client in us-east-1, leader in us-west-2 (70ms RTT):
Linearizable write = 70ms (send to leader) + commit time
Linearizable read  = 70ms (contact leader) OR read from local replica with lease

Leader leases (used in CockroachDB, etcd) allow reads from the local leader without contacting quorum, reducing linearizable read latency to local disk speed—but leases expire and must be renewed, adding a background cost.
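
A sketch of the lease fast path follows (heavily simplified; real implementations such as etcd's and CockroachDB's must also bound clock skew and renew leases through consensus, both omitted here):

import time

class Leaseholder:
    # Simplified model of lease-based reads; not any real system's API.
    def __init__(self, store, quorum_read, lease_duration_s=9.0):
        self.store = store                # local copy of committed state
        self.quorum_read = quorum_read    # slow path: full quorum round trip
        self.lease_duration_s = lease_duration_s
        self.lease_expiry = 0.0

    def renew_lease(self):
        # In a real system the lease is granted and renewed via quorum
        # consensus; here we just extend the expiry to model the cost.
        self.lease_expiry = time.monotonic() + self.lease_duration_s

    def linearizable_read(self, key):
        if time.monotonic() < self.lease_expiry:
            return self.store[key]        # fast path: local read, no network
        return self.quorum_read(key)      # lease expired: pay the round trip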

The Availability Tradeoff

The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. Since partitions are unavoidable in real networks, the operative choice comes during a partition:

  • CP (Consistency + Partition tolerance): Refuse to serve reads/writes when quorum is unreachable. No stale data; reduced availability. → Raft-based databases.
  • AP (Availability + Partition tolerance): Continue serving reads/writes from available replicas. Possible stale reads; always available. → Cassandra, DynamoDB.
flowchart TD
    P[Network Partition\noccurs] --> Q{System choice}
    Q -->|CP: wait for quorum| U[Unavailable\nduring partition\nNo stale data]
    Q -->|AP: serve local replica| S[Available\nbut may return\nstale data]
    U -->|partition heals| R1[Full consistency\nrestored]
    S -->|partition heals| R2[Replicas converge\nvia reconciliation]

PACELC extends CAP for the non-partition case: even when the network is healthy, there is a tradeoff between latency (L) and consistency (C). A system that requires quorum acknowledgment has higher latency than one that acknowledges locally, network partition or not.

Consistency Levels in Practice: Cassandra

Cassandra exposes tunable consistency per operation. This lets you choose the tradeoff at query time:

from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
 
# Eventual: fastest, may read stale
stmt = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE
)
 
# Quorum: consistent across majority of replicas
stmt = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM
)
 
# All: strongest, slowest, unavailable if any replica is down
stmt = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.ALL
)

The right consistency level depends on the data. Use ONE for user preference reads. Use QUORUM for anything financial. Use LOCAL_QUORUM in multi-region to avoid cross-region latency on every read while still getting consistency within a region.
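
Continuing the driver example above, the multi-region variant looks like this:

# LOCAL_QUORUM: a quorum within the client's own datacenter, so no
# cross-region round trip sits on the request path
stmt = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)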

Where to Apply Each Level

Strict serializable:
  Financial balances, inventory counts, seat reservations
  → Must never show stale values; double-spend / oversell is unacceptable
 
Linearizable:
  Leader election, distributed locks, configuration state
  → Single-object latest-value guarantee is sufficient
 
Read-your-writes / causal:
  Social feed, user profile reads after an update
  → User should see their own changes; other users' lag is acceptable
 
Eventual:
  View counts, recommendation scores, analytics aggregates
  → Approximate values are fine; minimize latency and cost
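
One way to make such a policy explicit in application code is a routing table from data class to consistency level, defaulting to the strongest when a class is unclassified (a hypothetical sketch; the names are illustrative):

# Hypothetical policy table; data classes and level names are illustrative.
CONSISTENCY_BY_DATA_CLASS = {
    "payment_balance":  "strict_serializable",
    "distributed_lock": "linearizable",
    "user_profile":     "read_your_writes",
    "view_count":       "eventual",
}

def consistency_for(data_class):
    # Unclassified data defaults to the strongest level: erring toward
    # safety costs latency, never correctness.
    return CONSISTENCY_BY_DATA_CLASS.get(data_class, "strict_serializable")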

Latency Comparison Table

Assume a 3-node cluster, single region, 1ms intra-region RTT, 5ms disk write:

| Consistency Level | Write p50 | Read p50 | Notes |
|---|---|---|---|
| Eventual (ONE) | 5ms | 2ms | Local only |
| Quorum (QUORUM) | 6ms | 3ms | +1 RTT for acknowledgment |
| Linearizable | 7ms | 3–7ms | Leader required for reads unless leased |
| Strict serializable | 8–15ms | 3–7ms | SSI conflict-detection overhead |

In a 3-region global deployment, replace the 1ms RTT with a cross-region RTT of 50–150ms. The ratios between levels roughly hold, but every network term scales by 50–150x while the disk time stays fixed, so the absolute gap between eventual and strong consistency widens dramatically.
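
The write column of the table follows from a simple cost model, sketched below. The per-level formulas and the 1ms conflict-detection overhead are assumptions chosen to reproduce the numbers above, not measurements:

def write_p50_ms(rtt_ms=1.0, disk_ms=5.0):
    # Assumed cost model: each level adds round trips atop the disk write.
    return {
        "eventual":            disk_ms,                     # local ack only
        "quorum":              rtt_ms + disk_ms,            # 1 RTT to a quorum peer
        "linearizable":        2 * rtt_ms + disk_ms,        # reach leader + replicate
        "strict_serializable": 2 * rtt_ms + disk_ms + 1.0,  # + conflict detection
    }

print(write_p50_ms())             # single region: 5.0 / 6.0 / 7.0 / 8.0 ms
print(write_p50_ms(rtt_ms=70.0))  # cross-region: the network terms dominate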

Key Takeaways

  • Consistency is a spectrum from eventual (no guarantees on staleness) to strict serializable (fully ordered, latest-value transactions); each step up costs measurable latency.
  • The latency cost of strong consistency in a single region is small (milliseconds); in a multi-region deployment it scales with cross-region RTT, making strict global consistency expensive.
  • CAP theorem forces a choice between consistency and availability during network partitions; PACELC extends this: even without partitions, consistency costs latency.
  • Use the weakest consistency level that your correctness requirements permit—eventual for analytics and recommendations, linearizable or serializable for financial and inventory operations.
  • Cassandra-style tunable consistency lets you set the level per query; prefer LOCAL_QUORUM over QUORUM in multi-region deployments to avoid unnecessary cross-region latency.
  • Leader leases reduce linearizable read latency to near-local speed by allowing the leader to serve reads without a quorum round trip during the lease period, at the cost of lease renewal overhead.