Database Internals

Distributed Consensus

Ravinder · 6 min read

When you use CockroachDB, TiDB, etcd, or any distributed database that advertises strong consistency, there is a consensus protocol running under every write. Most engineers interact with the result—reads always return the latest committed value, writes are never lost across node failures—without understanding the mechanism. That is fine until you need to reason about why your p99 latency doubled during a leader election, or why your multi-region write throughput is capped by cross-region round trips. At that point, you need to understand Raft.

The Problem Consensus Solves

In a distributed system with multiple replicas, you need all replicas to agree on the order of writes. Without agreement, two replicas might apply updates in different orders and diverge, giving different clients different views of the world indefinitely.

The consensus problem: given N nodes, some of which may fail or become unreachable, agree on a sequence of values such that:

  • All non-faulty nodes agree on the same value.
  • The agreed value was proposed by some node (no values are invented).
  • Once agreed, the value never changes.

Raft solves this, as does Paxos, the older protocol Raft was designed to be a more understandable alternative to. Both require a quorum: a majority of nodes (⌊N/2⌋ + 1) must acknowledge a value before it is committed.
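To make the arithmetic concrete, a quick sketch of quorum sizes and how many failed nodes each cluster size tolerates:

# Quorum size and fault tolerance for common cluster sizes
def quorum(n: int) -> int:
    return n // 2 + 1

for n in (3, 5, 7):
    print(f"N={n}: quorum={quorum(n)}, tolerates {n - quorum(n)} failure(s)")
# N=3: quorum=2, tolerates 1 failure(s)
# N=5: quorum=3, tolerates 2 failure(s)
# N=7: quorum=4, tolerates 3 failure(s)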

Raft: The Readable Consensus Protocol

Raft decomposes consensus into three sub-problems: leader election, log replication, and safety. At most one leader exists in any given term, and all writes go through the leader.

sequenceDiagram
    participant C as Client
    participant L as Leader (Node 1)
    participant F1 as Follower (Node 2)
    participant F2 as Follower (Node 3)
    C->>L: Write: x = 42
    L->>L: Append to local log (index 7)
    L->>F1: AppendEntries (index 7, x=42)
    L->>F2: AppendEntries (index 7, x=42)
    F1-->>L: ACK
    F2-->>L: ACK
    Note over L: Quorum reached (entry on a majority of nodes)
    L->>L: Commit index 7
    L-->>C: Success
    L->>F1: CommitIndex = 7
    L->>F2: CommitIndex = 7

The write is committed once the entry is on a quorum of nodes; the leader counts its own copy, so in a 3-node cluster a single follower ACK completes the quorum. The client gets a success response only after the commit. This is the source of both the correctness guarantee and the latency cost.
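On the leader side, the commit rule reduces to counting how many nodes hold each log index. The sketch below is a simplified illustration with made-up names, not code from any particular implementation; among other things, real Raft also requires the entry's term to match the current term before the commit index may advance.

# Simplified sketch of the leader-side commit rule (illustrative names only)
class LeaderLog:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.log = []                # replicated entries, in order
        self.commit_index = -1       # highest committed index
        self.match_index = {}        # follower_id -> highest replicated index

    def append(self, entry):
        self.log.append(entry)       # the leader appends locally first
        return len(self.log) - 1     # index sent out in AppendEntries

    def on_ack(self, follower_id, index):
        self.match_index[follower_id] = index
        quorum = self.cluster_size // 2 + 1
        for idx in range(len(self.log) - 1, self.commit_index, -1):
            # the leader's own copy counts toward the quorum
            replicas = 1 + sum(1 for m in self.match_index.values() if m >= idx)
            if replicas >= quorum:
                self.commit_index = idx   # everything up to idx is committed
                break

With a 3-node cluster, the first ACK for a new index is enough to commit it, which is exactly the quorum point in the sequence above.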

Leader Election and Its Impact on Latency

Leaders are elected via terms. Each node has an election timer. When a follower does not hear from the leader within the election timeout (typically 150–300ms), it increments its term, votes for itself, and requests votes from the other nodes. The first candidate to collect votes from a quorum becomes the new leader. Timeouts are randomized so that nodes rarely time out simultaneously and split the vote.

During an election, from the moment the old leader fails to the moment a new leader is elected and starts serving writes, the cluster is unavailable for writes. This is the leader election gap: typically 150–500ms in well-tuned deployments, but potentially seconds if network conditions are unstable.
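A minimal sketch of the election timer and the follower-to-candidate transition; request_vote stands in for the RequestVote RPC and every name here is illustrative:

# Election timer sketch; request_vote(peer_id, term, candidate_id) -> bool
# stands in for the real RequestVote RPC.
import random
import time

class ElectionTimer:
    def __init__(self, node_id, cluster_size, request_vote):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.request_vote = request_vote
        self.current_term = 0
        self.last_heartbeat = time.monotonic()
        self.timeout = random.uniform(0.150, 0.300)   # randomized per node

    def on_heartbeat(self, term):
        # AppendEntries from a live leader (even an empty heartbeat) resets the timer
        if term >= self.current_term:
            self.current_term = term
            self.last_heartbeat = time.monotonic()
            self.timeout = random.uniform(0.150, 0.300)

    def tick(self, peer_ids):
        if time.monotonic() - self.last_heartbeat < self.timeout:
            return False                     # leader is alive; stay a follower
        self.current_term += 1               # timeout expired: become a candidate
        votes = 1                            # a candidate votes for itself
        for peer in peer_ids:
            if self.request_vote(peer, self.current_term, self.node_id):
                votes += 1
        # votes from a quorum (including our own) make this node the new leader
        return votes >= self.cluster_size // 2 + 1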

From an application perspective, this looks like a burst of connection errors or timeouts. Your connection pool and retry logic must handle this gracefully:

# Retry with exponential backoff for transient consensus gaps.
# LeaderNotAvailableError stands in for whatever exception your database
# driver raises while the cluster has no elected leader.
import random
import time

def write_with_retry(db, query, params, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return db.execute(query, params)
        except LeaderNotAvailableError:
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 100ms
            wait = (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")

What Consistency Actually Gives You

Linearizability (the guarantee most consensus-based databases provide) means: every operation appears to take effect instantaneously at some point between its invocation and its response, and all operations are globally ordered.

In practice: if write W1 completes before read R1 starts, R1 must see W1's value. There is no "stale read" from a linearizable system.

This is different from:

  • Sequential consistency: operations appear in some global order, but not necessarily the real-time order. Reads may see old values.
  • Causal consistency: reads see all writes causally preceding them, but concurrent writes may be observed in different orders by different nodes.
  • Eventual consistency: all replicas converge eventually, but any individual read might return a stale value.

For most OLTP applications, linearizability is what you want for critical data (balances, inventory) and eventual consistency is fine for non-critical data (view counts, recommendations).
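A toy illustration of the difference, assuming a replica that applies the leader's log asynchronously (the names are made up for the example):

# A lagging replica can serve a stale read under eventual consistency;
# a linearizable read must observe the latest committed value.
leader_log = [("x", 1), ("x", 42)]           # committed writes, in order

class Replica:
    def __init__(self):
        self.applied = 0
        self.state = {}

    def apply_up_to(self, index):
        for key, value in leader_log[self.applied:index]:
            self.state[key] = value
        self.applied = index

lagging = Replica()
lagging.apply_up_to(1)                       # has only applied x = 1 so far

print(lagging.state["x"])                    # 1 -> a stale, eventually consistent read
# A linearizable read goes through the leader (or a read quorum / lease),
# so it observes x = 42, the latest committed value.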

The Multi-Region Latency Math

Consensus requires a quorum round trip. In a 3-node cluster spread across regions:

us-east-1  ←→  us-west-2:  ~70ms RTT
us-east-1  ←→  eu-west-1:  ~100ms RTT
 
3-node cluster: us-east-1 (leader), us-west-2, eu-west-1
Write latency floor = RTT to the nearest follower that completes a quorum
                    = 70ms (the leader needs an ACK from 1 of its 2 followers)

A write to a 3-node global cluster cannot complete in less than one cross-region RTT. This is physics. If your application issues synchronous writes on the critical path to a globally distributed database, the user-visible latency floor is set by network geography.

Mitigations:

  • Use a leader region close to your write origin.
  • Batch writes; use async acknowledgment where durability windows allow.
  • Consider a 5-node cluster with two followers in the same region as the leader, so a quorum (3 of 5) can be reached locally, as the sketch below shows.
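A back-of-the-envelope sketch of that latency floor, using the illustrative RTT figures from above (the function is a rough model, and the extra node names are made up):

# Rough model: a write commits once the leader hears from enough of the
# nearest followers to complete a quorum.
def write_latency_floor_ms(leader, cluster, rtt_ms):
    quorum = len(cluster) // 2 + 1
    followers = [node for node in cluster if node != leader]
    rtts = sorted(rtt_ms.get((leader, f)) or rtt_ms.get((f, leader)) or 0.0
                  for f in followers)
    acks_needed = quorum - 1                 # the leader counts itself
    return rtts[acks_needed - 1] if acks_needed else 0.0

rtt_3node = {
    ("us-east-1", "us-west-2"): 70.0,
    ("us-east-1", "eu-west-1"): 100.0,
}
rtt_5node = {
    **rtt_3node,
    ("us-east-1", "us-east-1a"): 1.0,        # followers in the leader's region
    ("us-east-1", "us-east-1b"): 1.0,
}

# 3 nodes, one per region: the floor is one cross-region round trip.
print(write_latency_floor_ms(
    "us-east-1", ["us-east-1", "us-west-2", "eu-west-1"], rtt_3node))       # 70.0

# 5 nodes, two followers co-located with the leader: quorum is reached locally.
print(write_latency_floor_ms(
    "us-east-1",
    ["us-east-1", "us-east-1a", "us-east-1b", "us-west-2", "eu-west-1"],
    rtt_5node))                                                             # 1.0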

Paxos vs Raft: The Practical Difference

Both protocols solve the same problem. Raft was designed to be more understandable: it has a strong leader that serializes all decisions, explicit leader election with term numbers, and a straightforward log replication mechanism.

Paxos in its original form is underspecified for practical systems—the "Multi-Paxos" variant used in production is closer to Raft in structure. Google's Chubby and Spanner use Paxos variants; etcd, CockroachDB, and TiDB use Raft.

For an app engineer, the choice of protocol does not change the consistency guarantees—both achieve linearizability. It affects implementation complexity and edge-case handling by the database team, not your application code.

Key Takeaways

  • Consensus protocols (Raft, Paxos) require a quorum (majority) of nodes to agree before a write is committed, ensuring no two nodes can simultaneously commit conflicting values.
  • Leader election creates a brief unavailability window (typically 150–500ms) when the current leader fails; your application must retry writes with backoff during this window.
  • Linearizability guarantees that reads always see the latest committed write; this is stronger than sequential, causal, or eventual consistency and is what "strongly consistent" distributed databases provide.
  • Multi-region deployments incur minimum write latency equal to the round-trip time to the nearest quorum member; this is a physical floor, not a software limitation.
  • For writes, the critical path runs through the leader; placing the leader geographically close to your write-heavy application tier is the primary latency lever.
  • Eventual consistency is appropriate for non-critical reads; use linearizable reads only where correctness requires the latest value, to avoid unnecessary consensus overhead.