Replication Mechanics
Most applications treat database replicas as a black box—you point a connection string at a read replica and assume it is current. That assumption holds until a failover happens and you discover your replica was 30 seconds behind, or until your logical replication slot quietly fills its disk because no consumer was reading it. Replication is not a set-and-forget feature; it is a protocol with precise semantics that directly affects your application's behavior under failure.
Physical vs Logical Replication
These two modes solve different problems and make different tradeoffs.
Physical replication streams raw WAL bytes from the primary to the replica. The replica replays those bytes against its own storage, producing a byte-for-byte copy of the primary's data files. It is fast, simple, and robust—but the replica must run the same PostgreSQL major version and architecture as the primary, and it cannot be used as a source for cross-version upgrades or selective table replication.
Logical replication decodes WAL records into logical change events (INSERT, UPDATE, DELETE) and sends those higher-level events to subscribers. Subscribers can be a different PostgreSQL version, a different database, or an external system like Kafka. The tradeoff: decoding WAL is more CPU-intensive, and logical replication requires a replication slot on the primary that accumulates WAL if the subscriber falls behind.
How WAL Streaming Works
Physical replication opens a persistent connection from the standby to the primary. The primary's WAL sender process sends WAL segments in real time as they are written. The standby's WAL receiver writes them to its own WAL directory and replays them.
```
Primary:
  WAL writer   → pg_wal/000000010000000100000001
  WAL sender   → TCP stream → standby

Standby:
  WAL receiver → pg_wal/000000010000000100000001
  Startup process → replays WAL → updates heap/index files
```

The replication protocol uses Log Sequence Numbers (LSNs) to track position:
```sql
-- On primary: check current WAL position
SELECT pg_current_wal_lsn();

-- On primary: check lag per standby
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_lag,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn)) AS flush_lag,
       pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```

`send_lag` is WAL generated but not yet sent. `flush_lag` is WAL sent but not yet written to the standby's disk. `replay_lag` is WAL written but not yet applied to the standby's heap. For durability guarantees, you care about `flush_lag`. For read consistency, you care about `replay_lag`.
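An LSN prints as two hex numbers separated by a slash (e.g. `0/16B3748`): the high and low 32 bits of a 64-bit byte position in the WAL stream, so `pg_wal_lsn_diff` is just subtraction of those positions. A minimal sketch of that arithmetic (function names are my own, not PostgreSQL APIs):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/16B3748' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits of a 64-bit offset into the WAL stream.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lsn_diff(a: str, b: str) -> int:
    """Equivalent of pg_wal_lsn_diff(a, b): bytes of WAL between two positions."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)


# A replica whose replay position trails the primary's current position:
primary_lsn = "1/4E8B2C0"
replay_lsn = "1/4E80000"
print(wal_lsn_diff(primary_lsn, replay_lsn), "bytes of replay lag")  # 45760
```

The same subtraction underlies all three lag columns in the `pg_stat_replication` query above; only the pair of LSNs changes.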
Synchronous vs Asynchronous Replication
By default, PostgreSQL replication is asynchronous. The primary commits and acknowledges success to the application before any standby confirms receipt. On failover, you can lose committed transactions that had not yet been replicated.
Synchronous replication (synchronous_standby_names) makes COMMIT wait until at least one standby confirms WAL receipt:
```
-- postgresql.conf on primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
-- COMMIT blocks until standby1 OR standby2 confirms flush of WAL
```

The cost: COMMIT latency increases by the round-trip time to the standby plus the standby's disk flush time. On a network with 1 ms RTT, this adds roughly 2–5 ms per commit. For write-heavy workloads, this is significant.
| Mode | Data loss on failover | Commit latency overhead |
|---|---|---|
| Async | Up to a few seconds of commits | None |
| Sync (`remote_write`) | None, unless the standby OS also crashes (WAL written but not flushed) | +1 RTT |
| Sync (`on`, the default sync level) | None (WAL flushed on standby) | +1 RTT + standby flush |
| Sync (`remote_apply`) | None, and reads on the standby see the commit | +1 RTT + flush + replay time |
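The latency arithmetic behind the table can be sketched numerically. A toy model (all timings are illustrative assumptions, not measurements) of per-commit overhead at each `synchronous_commit` level, plus the `FIRST 1` rule that COMMIT returns when the fastest standby confirms:

```python
def sync_commit_overhead_ms(rtt_ms: float, standby_flush_ms: float,
                            replay_ms: float, mode: str) -> float:
    """Rough per-commit overhead added by replication at each sync level."""
    if mode == "async":
        return 0.0                                     # primary never waits
    if mode == "remote_write":
        return rtt_ms                                  # WAL reached the standby
    if mode == "on":
        return rtt_ms + standby_flush_ms               # ...and was flushed to disk
    if mode == "remote_apply":
        return rtt_ms + standby_flush_ms + replay_ms   # ...and was replayed
    raise ValueError(f"unknown mode: {mode}")


def first_1_overhead_ms(standby_overheads: list[float]) -> float:
    """FIRST 1 quorum: COMMIT returns when the fastest standby confirms."""
    return min(standby_overheads)


# Assumed timings: 1 ms RTT, 1.5 ms standby fsync, 0.5 ms replay.
for mode in ("async", "remote_write", "on", "remote_apply"):
    print(mode, sync_commit_overhead_ms(1.0, 1.5, 0.5, mode), "ms")
```

With a second, slower standby (say 3 ms RTT), `first_1_overhead_ms` shows why `FIRST 1 (standby1, standby2)` keeps commit latency at the faster standby's cost.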
Replication Slots and the Bloat Trap
Logical replication requires a replication slot on the primary. The slot records how far the consumer has read in the WAL stream, ensuring the primary retains all newer WAL until the consumer catches up.
```sql
-- Create a logical replication slot
SELECT pg_create_logical_replication_slot('my_consumer', 'pgoutput');

-- Check slot positions and retained WAL
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
       active
FROM pg_replication_slots;
```

If a consumer stops consuming, the slot retains WAL indefinitely. This causes `pg_wal` to grow without bound, eventually filling the disk and crashing the primary. This is one of the most common production incidents with logical replication.
Mitigations:
- Set `max_slot_wal_keep_size` (PostgreSQL 13+) to cap how much WAL a slot can retain; once the cap is exceeded, the slot is invalidated instead of filling the disk.
- Monitor slot lag with alerting; for example, alert at 1 GB and page at 5 GB.
- Drop inactive slots promptly with `pg_drop_replication_slot()`.
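The monitoring rule is easy to encode. A hedged sketch of the thresholding logic (the thresholds and function name are assumptions; in production, `lag_bytes` and `active` would come from the `pg_replication_slots` query shown above):

```python
ALERT_BYTES = 1 * 1024**3   # 1 GB: open an alert (assumed threshold)
PAGE_BYTES = 5 * 1024**3    # 5 GB: page the on-call (assumed threshold)


def slot_severity(lag_bytes: int, active: bool) -> str:
    """Classify a replication slot's retained-WAL lag.

    An inactive slot is suspicious at any lag: nothing is consuming it,
    so its retained WAL can only grow.
    """
    if lag_bytes >= PAGE_BYTES:
        return "page"
    if lag_bytes >= ALERT_BYTES or not active:
        return "alert"
    return "ok"


print(slot_severity(200 * 1024**2, active=True))   # well under 1 GB
print(slot_severity(2 * 1024**3, active=True))     # past the alert line
print(slot_severity(6 * 1024**3, active=False))    # past the page line
```

Treating "inactive" as at least alert-worthy catches the bloat trap early, before the lag number itself looks scary.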
Failover and Promotion
When the primary fails, a standby is promoted to become the new primary. In physical replication, promotion is straightforward: the standby applies all remaining WAL it has received, then starts accepting writes.
```bash
# PostgreSQL 12+: promote standby to primary
pg_ctl promote -D /var/lib/postgresql/data

# Or, from a SQL session on the standby:
#   SELECT pg_promote();

# Or touch the file named by promote_trigger_file, if configured (PostgreSQL 12-15)
```

The critical question is what happened to transactions committed on the primary but not yet received by the standby. In async replication, those transactions are lost and must be handled by the application (e.g., replaying events from an outbox, re-running failed operations).
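When several standbys exist, failover tooling typically promotes the one with the highest replayed LSN, since it has lost the least WAL. A minimal sketch of that comparison (function names are my own; LSN parsing follows PostgreSQL's text format):

```python
def lsn_key(lsn: str) -> int:
    """Order LSNs like '2/A0008000' by their 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(replay_lsns: dict[str, str]) -> str:
    """Return the standby name with the most WAL replayed."""
    return max(replay_lsns, key=lambda name: lsn_key(replay_lsns[name]))


candidates = {
    "standby1": "2/A0001000",
    "standby2": "2/A0008000",  # furthest ahead: loses the least WAL
}
print(pick_promotion_candidate(candidates))  # standby2
```

This is essentially what orchestrators do during leader election: compare candidate positions, fence the old primary, then promote the winner.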
For automated failover, tools like Patroni (leader election via etcd or another consensus store) or AWS RDS Multi-AZ (synchronous replication, so promotion loses no committed data) handle this orchestration. They also redirect clients via a virtual IP or a short-TTL DNS change, so applications reconnect to the new primary transparently.
Key Takeaways
- Physical replication sends raw WAL bytes for a byte-for-byte copy; logical replication decodes WAL into change events for flexibility at the cost of higher overhead.
- Replication lag has three components: send lag, flush lag, and replay lag—each matters for different guarantees.
- Asynchronous replication adds no commit overhead but risks data loss on failover; synchronous replication eliminates that loss at the cost of added commit latency of at least one network round trip (plus standby flush or replay time, depending on the level).
- Logical replication slots accumulate WAL when consumers fall behind; unchecked, they will fill the disk and crash the primary.
- Physical replicas can be promoted to primary; logical subscribers cannot—they are not full database copies.
- Automated failover tools (Patroni, RDS Multi-AZ) handle leader election and client redirection; application-level outbox patterns handle the lost-transaction edge case in async setups.