
B-trees and LSM

Ravinder · 6 min read
Database Internals · Architecture · B-tree · LSM-tree · Indexes

Every primary index in your database is backed by one of two dominant data structures: a B-tree or a Log-Structured Merge-tree (LSM-tree). The choice is baked in at the engine level—PostgreSQL and MySQL use B-trees; Cassandra, RocksDB, LevelDB, and ClickHouse's MergeTree all use LSM variants. If you have ever wondered why Cassandra writes feel fast but reads feel heavier, or why PostgreSQL bloats under heavy update load, the answer lives in these structures.

B-tree: The Read-Optimized Workhorse

A B-tree organizes data in a balanced tree of fixed-size pages (typically 8 KB in PostgreSQL, 16 KB in MySQL). Each internal node holds sorted keys and pointers to children. Leaf nodes hold the actual data or pointers to heap rows.

                    [30 | 60]
                   /    |    \
           [10|20]   [40|50]   [70|80]
           /  |  \   /  |  \   /  |  \
         ... leaf pages, linked left to right ...

A lookup traverses from root to leaf: O(log N) page reads. For a table with 100 million rows, that is roughly 3–4 page reads from a well-cached tree. Reads are predictable and fast.
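
A minimal sketch of that traversal, assuming a toy two-level version of the tree above (sorted keys plus child pointers; nothing like a real engine's on-disk page format):

# Hedged sketch of a B-tree point lookup: toy node layout, not
# PostgreSQL's or InnoDB's actual page format.
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys
        self.children = children    # child nodes (internal node) or None
        self.values = values        # payloads (leaf node only)

def search(node, key):
    while node.children is not None:              # descend one level...
        i = bisect.bisect_right(node.keys, key)   # ...per page read
        node = node.children[i]
    i = bisect.bisect_left(node.keys, key)        # binary search in the leaf
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# Toy version of the diagram, with the second level as leaves:
root = Node(keys=[30, 60], children=[
    Node(keys=[10, 20], values=["a", "b"]),
    Node(keys=[40, 50], values=["c", "d"]),
    Node(keys=[70, 80], values=["e", "f"]),
])
print(search(root, 50))   # -> "d" after visiting 2 nodes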

Updates happen in place: when you UPDATE a row, the engine finds the leaf page, modifies the key/value in place, and later writes the dirty page back to disk (mediated by the WAL). This in-place mutation is what gives B-trees their read advantage—data is always in its final, sorted location. (PostgreSQL layers MVCC on top, writing a new row version rather than overwriting the old one, which is why heavy update traffic bloats tables until VACUUM reclaims the dead versions.)

Read amplification: low. One tree traversal. Write amplification: moderate. Every update touches at least a leaf page, may trigger a page split that cascades upward, and all dirty pages must eventually be flushed. Under heavy random writes, page splits cause fragmentation, and the engine needs periodic maintenance to reclaim space (VACUUM in Postgres, OPTIMIZE TABLE in MySQL).
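
When a leaf overflows, the split itself is simple key movement; the cost is the cascade. A hedged sketch of just the leaf-level step (real engines split fixed-size pages under latches and WAL-log the change):

# Hedged sketch of a leaf split: only the key movement is shown.
def split_leaf(keys, values):
    """The leaf has overflowed: move the upper half to a new right
    sibling and return the separator key to push into the parent."""
    mid = len(keys) // 2
    left = (keys[:mid], values[:mid])
    right = (keys[mid:], values[mid:])
    # The separator is promoted into the parent; if the parent is
    # also full, the split cascades upward, possibly to the root.
    return left, right, keys[mid]

left, right, sep = split_leaf([10, 20, 30, 40, 50], list("abcde"))
print(sep)           # 30 goes up to the parent
print(left, right)   # ([10, 20], ['a', 'b']) ([30, 40, 50], ['c', 'd', 'e'])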

LSM-tree: The Write-Optimized Alternative

LSM-trees never update data in place. All writes go to an in-memory buffer (the memtable). When the memtable fills, it is flushed to disk as a sorted, immutable file called an SSTable (Sorted String Table). Over time, background compaction merges SSTables, eliminating stale versions and maintaining sorted order.

flowchart LR
    W[Write] --> M[Memtable\nin-memory]
    M -->|flush when full| L0[L0 SSTables]
    L0 -->|compaction| L1[L1 SSTables]
    L1 -->|compaction| L2[L2 SSTables]
    R[Read] --> M
    R --> L0
    R --> L1
    R --> L2
    style M fill:#4CAF50,color:#fff
    style L0 fill:#2196F3,color:#fff
    style L1 fill:#2196F3,color:#fff
    style L2 fill:#2196F3,color:#fff

Writes are sequential—memtable flushes and SSTable compactions write large, contiguous blocks. Sequential I/O is 10–100x faster than random I/O on spinning disks and significantly faster even on SSDs.
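
A hedged sketch of that write path, with a dict standing in for the memtable and SSTables modeled as sorted lists (real engines use skip lists and block-formatted files):

# Hedged sketch of the LSM write path. A dict stands in for the
# memtable; SSTables are modeled as immutable sorted lists.
MEMTABLE_LIMIT = 4

memtable = {}
sstables = []    # newest first

def put(key, value):
    memtable[key] = value          # pure in-memory write
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    global memtable
    # One sequential write: dump the memtable as a sorted, immutable run.
    sstables.insert(0, sorted(memtable.items()))
    memtable = {}

for i, k in enumerate("cadbe"):
    put(k, i)
print(sstables)   # [[('a', 1), ('b', 3), ('c', 0), ('d', 2)]]
print(memtable)   # {'e': 4}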

Write amplification: moderate to high. A single logical write may be physically rewritten several times across compaction levels. RocksDB's leveled compaction writes data roughly 10–30x total over its lifetime. Read amplification: higher. A read must check the memtable, then each SSTable level (mitigated by Bloom filters, which let the engine skip SSTables that definitely do not contain a key).
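
The read path and the Bloom-filter shortcut, sketched in the same toy model (an exact set stands in for the probabilistic filter):

# Hedged sketch of the LSM read path. A Python set stands in for
# the Bloom filter: a real filter is probabilistic, a set is exact.
memtable = {"e": 4}
sstables = [[("a", 1), ("b", 3), ("c", 0), ("d", 2)]]   # newest first
filters = [{k for k, _ in sst} for sst in sstables]

def get(key):
    if key in memtable:                 # 1. check the newest data first
        return memtable[key]
    for sst, bloom in zip(sstables, filters):
        if key not in bloom:            # 2. "definitely not here":
            continue                    #    skip the SSTable, no I/O
        for k, v in sst:                # 3. real engines binary-search a
            if k == key:                #    block index instead of scanning
                return v
    return None

print(get("c"))   # 0: one SSTable probed
print(get("z"))   # None: the filter check avoids touching any SSTable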

Amplification Numbers Side by Side

Metric               B-tree (Postgres)       LSM (RocksDB/Cassandra)
Write amplification  ~2–5x                   ~10–30x (leveled)
Read amplification   ~3–4 IOs (cached tree)  ~1–7 IOs + Bloom filter
Space amplification  ~1.1–1.5x               ~1.1–2x (during compaction)
Random write perf    Moderate                High
Point read perf      High                    Moderate

Pseudocode: How Compaction Works

# Simplified leveled compaction (as in RocksDB); the helper calls are
# pseudocode stand-ins for the real file and merge machinery.
def compact_level(level_n, level_n_files, level_n1_files):
    # Pick the oldest/largest file from level N
    chosen = pick_compaction_candidate(level_n_files)
    # Find files in level N+1 whose key ranges overlap the candidate
    overlapping = [f for f in level_n1_files
                   if f.key_range.overlaps(chosen.key_range)]
    # Merge-sort the candidate with the overlapping files, keeping
    # only the newest version of each key
    merged = merge_sort_sst_files([chosen] + overlapping)
    # Write the merged run as new immutable SSTables at level N+1
    new_files = write_sst_files(merged, level=level_n + 1)
    # Delete the now-superseded input files
    delete_files([chosen] + overlapping)
    return new_files

Compaction is the heartbeat of an LSM engine. When compaction falls behind writes (write stall), the database slows down or halts writes temporarily to let compaction catch up—a phenomenon Cassandra operators call a "compaction storm."
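
A hedged sketch of that backpressure, loosely modeled on RocksDB's level0_slowdown_writes_trigger and level0_stop_writes_trigger options (the thresholds below are made up):

# Hedged sketch of write-stall backpressure; thresholds are illustrative.
SLOWDOWN_L0_FILES = 8    # start throttling incoming writes
STOP_L0_FILES = 12       # halt writes until compaction drains L0

def admit_write(l0_file_count):
    if l0_file_count >= STOP_L0_FILES:
        return "stall"      # block writers; compaction must catch up
    if l0_file_count >= SLOWDOWN_L0_FILES:
        return "throttle"   # delay each write so compaction gains ground
    return "accept"

for n in (3, 9, 13):
    print(n, admit_write(n))   # 3 accept / 9 throttle / 13 stall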

Workload Signals: Which to Choose

B-tree wins when:

  • Your workload is read-heavy with frequent point lookups.
  • You need strong, in-place update semantics (UPDATE of existing rows is common).
  • Your write rate is moderate—thousands per second rather than millions.
  • You need range scans over large datasets (B-tree leaf pages are linked for efficient sequential scan).

LSM-tree wins when:

  • You have extremely high write throughput (millions of events/sec, time-series, IoT telemetry).
  • Writes are mostly appends or upserts rather than in-place updates of existing values.
  • You can tolerate slightly higher read latency in exchange for much lower write latency.
  • Your dataset is write-heavy at ingestion time but rarely read afterward (log archives, audit trails).

App-Level Implications

-- PostgreSQL (B-tree): this pattern is efficient because each UPDATE
-- lands on the row's page directly (with MVCC row versioning on top,
-- cleaned up by VACUUM).
UPDATE inventory SET quantity = quantity - 1 WHERE sku_id = $1;

-- On an LSM engine (Cassandra equivalent), every UPDATE is actually
-- a new INSERT with a higher timestamp; the old value is never modified,
-- just shadowed. (DELETEs and null writes additionally create tombstones.)
-- Heavy overwrite patterns scatter row versions across SSTables.
INSERT INTO inventory (sku_id, quantity, updated_at)
VALUES (?, ?, toTimestamp(now()));
-- Cassandra merges versions during compaction, but until then a read
-- must check multiple SSTables and reconcile to the latest timestamp.

This explains a common Cassandra antipattern: storing heavily overwritten data (like a leaderboard) in Cassandra. The row versions that pile up across SSTables (plus tombstones, if rows are deleted or nulled) create read latency that grows over time until compaction clears them.
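
A hedged sketch of why those reads degrade: last-write-wins reconciliation across however many runs hold a version of the row (tombstones modeled as None):

# Hedged sketch of read-time reconciliation on an LSM engine.
def reconcile(versions):
    """versions: (timestamp, value) pairs for one key, gathered from
    the memtable and every SSTable that might contain it."""
    if not versions:
        return None
    ts, value = max(versions)                 # last write wins
    return None if value is None else value   # a tombstone hides the row

# A hot leaderboard row overwritten repeatedly: until compaction merges
# these runs, every read must collect and compare all of them.
versions = [(101, 500), (205, 480), (309, None), (412, 515)]
print(reconcile(versions))   # 515, after examining four versions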

Key Takeaways

  • B-trees store data in sorted, mutable pages and excel at point reads and range scans; LSM-trees buffer writes in memory and flush immutable sorted files, excelling at high write throughput.
  • Read amplification and write amplification trade off against each other between the two structures—no free lunch.
  • Bloom filters are the LSM's primary defense against read amplification; they let the engine skip SSTables that definitely do not contain a key.
  • Compaction is essential to LSM health: falling behind on compaction leads to write stalls and degraded read performance.
  • Heavy UPDATE/DELETE patterns on LSM engines create tombstone accumulation—model your data to favor appends where possible.
  • Choose based on your dominant operation: if reads outnumber writes, B-tree; if writes are the bottleneck, LSM.