Database Internals

WAL and Durability

Ravinder·December 29, 2025·6 min read

DatabaseInternalsArchitectureWALDurabilityPostgreSQL

Series

Database Internals for App Engineers

Part 3 of 10

← Part 2

B-trees and LSM

Part 4 →

MVCC and Snapshot Isolation

Your application commits a transaction and receives a success response. You assume the data is safe. But what does "safe" actually mean, and what chain of operations had to complete for the database to make that guarantee? The answer is the Write-Ahead Log (WAL)—and understanding it explains why databases are slower than raw file writes, why synchronous_commit = off is dangerous, and why fsync issues in the kernel can silently corrupt your data.

The Core Problem WAL Solves

Databases store data in pages (heap files, index pages). Modifying a page in memory and writing it to disk takes time. If the system crashes mid-write, you get a partial write: some bytes of the page are new, some are old, and neither version is coherent. Recomputing what should be there is impossible without a record of what the operation was supposed to do.

The WAL is that record. The rule is simple: before any modified page is written to its final location on disk, the log record describing that modification must be flushed to disk first. Hence "write-ahead."

WAL Mechanics Step by Step

sequenceDiagram participant App participant BufferPool participant WALBuffer participant Disk App->>BufferPool: BEGIN; UPDATE users SET name='Bob' WHERE id=1 BufferPool->>WALBuffer: Append WAL record (LSN 1042) App->>BufferPool: COMMIT WALBuffer->>Disk: fsync() WAL segment Disk-->>WALBuffer: fsync confirmed BufferPool-->>App: COMMIT success Note over BufferPool,Disk: Dirty heap page written later\nby background checkpointer

Key points from this flow:

The heap page (the actual data) is modified only in the buffer pool (memory).
A WAL record is written to the WAL buffer.
On COMMIT, the WAL buffer is flushed and fsync() is called on the WAL segment file.
Only after fsync() returns does the database acknowledge the commit to the application.
The dirty heap page is written to disk later—asynchronously—by the background checkpointer process.

This means the heap on disk can be behind what is in memory. On crash, the WAL is replayed from the last checkpoint to reconstruct all committed changes that had not yet been flushed to the heap.

What fsync Actually Does

fsync() tells the OS to flush all in-kernel buffers for a file descriptor to the physical storage device. Without it, "written" data may sit in the OS page cache and be lost on a power failure.

// Simplified: what happens in the Postgres WAL writer
int walfd = open("pg_wal/000000010000000000000001", O_RDWR);
write(walfd, wal_buffer, wal_buffer_len);  // goes to OS page cache
fsync(walfd);                              // flushes cache to disk
// Only now is it safe to return COMMIT to the client

The critical insight: write() returning success does not mean data is on disk. fsync() returning success does (modulo disk firmware bugs, which are real—see the ext4 + barrier discussion below).

Group Commit: Batching fsync Calls

Each fsync() call is expensive—it causes a full rotation or flush cycle on the storage device. If every transaction called fsync() independently, a database serving 10,000 TPS would issue 10,000 fsyncs per second. That is untenable.

Group commit solves this by batching: multiple transactions that commit within a short window share a single fsync(). The WAL writer accumulates pending commits, issues one fsync(), then acknowledges all of them at once.

# Pseudocode: group commit loop (simplified from Postgres WALWriter)
pending_commits = []
 
while True:
    # Collect commits that arrived while we were working
    new_commits = drain_commit_queue()
    pending_commits.extend(new_commits)
 
    if pending_commits:
        # Write all accumulated WAL records
        write_wal_buffer_to_segment()
        # One fsync covers all pending commits
        fsync(wal_segment_fd)
        # Wake up all waiting backends
        for commit in pending_commits:
            commit.signal_success()
        pending_commits = []
 
    sleep(wal_writer_delay)  # default 200ms in Postgres

Group commit is why high-concurrency databases outperform low-concurrency ones even per-transaction: more concurrent writers means larger commit groups and fewer fsyncs per transaction.

The Danger Zones

synchronous_commit = off

PostgreSQL allows you to disable WAL fsync per transaction:

SET synchronous_commit = off;
INSERT INTO events (payload) VALUES ($1);
COMMIT;
-- Returns immediately without waiting for fsync
-- If the server crashes in the next ~200ms, this row is LOST

This trades durability for latency. It is acceptable for truly ephemeral data (session logs, analytics events where occasional loss is tolerable) but catastrophic if misapplied to financial or user-facing data.

Filesystem and Disk Firmware Lies

Some storage controllers and filesystems lie about fsync(). They report success but hold data in a write-back cache that can be lost on power failure. The famous PostgreSQL fsync gate (2019) revealed that on some Linux kernels, I/O errors during fsync were silently swallowed—the database thought data was durable when it was not.

Mitigations:

Use storage with battery-backed write cache (BBWC) or capacitor-backed NVMe.
Enable fsync = on and full_page_writes = on in PostgreSQL.
Use O_DIRECT mode where available to bypass the OS page cache entirely.

Checkpoint Frequency and Recovery Time

The WAL is replayed from the last checkpoint on crash recovery. A checkpoint writes all dirty buffer pool pages to disk and records the WAL position. The farther apart your checkpoints, the longer crash recovery takes.

Checkpoint at LSN 5000 → crash at LSN 9000 → must replay 4000 WAL records
Checkpoint at LSN 8500 → crash at LSN 9000 → must replay 500 WAL records

checkpoint_completion_target (Postgres) spreads checkpoint I/O over time to avoid write spikes. Tightening checkpoint intervals reduces recovery time but increases I/O pressure during normal operation.

WAL in Replication

WAL records are also the unit of physical replication. A standby connects to the primary, receives a stream of WAL records, and replays them. This is why:

Physical replicas are byte-for-byte identical to the primary.
Replication lag is measured in WAL bytes not yet applied to the standby.
synchronous_standby_names in Postgres makes COMMIT wait until the standby confirms WAL receipt before acknowledging to the application—this extends the durability guarantee to the replica.

Key Takeaways

WAL records must be durably written (via fsync) before a transaction's COMMIT is acknowledged—this is what durability in ACID means at the physical level.
write() returning success is not durability; fsync() returning success is (with caveats around hardware lies).
Group commit amortizes the cost of fsync across concurrent transactions, making high-concurrency workloads more efficient than low-concurrency ones on a per-commit basis.
synchronous_commit = off gains latency at the cost of a small durability window—appropriate only for data you can afford to lose.
Checkpoint frequency controls the tradeoff between crash recovery time and normal I/O pressure.
Physical replication is WAL streaming—replication lag is directly correlated with WAL generation rate on the primary.