Database Internals

Query Planning

Ravinder · 6 min read
Tags: Database Internals · Architecture · Query Planning · PostgreSQL · Performance

You write SQL. The database turns it into a plan. That plan walks the data in some order, uses or ignores your indexes, joins tables in some sequence, and applies filters at some point in the pipeline. If you do not understand how the planner makes those decisions, you are flying blind: you cannot predict when a query will regress, you cannot fix a bad plan without guessing, and you cannot write SQL that stays fast as your data evolves.

How the Cost-Based Optimizer Works

PostgreSQL's query planner is a cost-based optimizer (CBO). It enumerates candidate execution plans, assigns a cost estimate to each, and picks the cheapest one. Cost is expressed in abstract units calibrated to disk page reads (seq_page_cost = 1.0, random_page_cost = 4.0 by default).
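
As a rough sketch of how those units combine, a sequential scan's cost is pages read plus per-row CPU work. This is a simplified model for illustration; PostgreSQL's real formula also charges cpu_operator_cost per filter clause, among other terms:

```python
SEQ_PAGE_COST = 1.0    # default seq_page_cost
CPU_TUPLE_COST = 0.01  # default cpu_tuple_cost

def seq_scan_cost(pages: int, rows: int) -> float:
    # Every page is read once sequentially; every row costs CPU to process.
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

# A 2,400-page table holding 100,000 rows:
cost = seq_scan_cost(2_400, 100_000)  # 2400 + 1000 = 3400.0
```

Note that the result is 3400 abstract units, not milliseconds: the planner compares plans against each other, not against the clock.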

The cost of a plan depends on:

  1. Statistics: how many rows the table has, how they are distributed across column values, how many distinct values exist.
  2. Access methods: sequential scan, index scan, bitmap index scan, index-only scan.
  3. Join algorithms: nested loop, hash join, merge join.
  4. Operator selectivity: what fraction of rows survive each filter.

The planner does not execute queries to estimate these—it consults precomputed statistics stored in pg_statistic and collected by ANALYZE.

Reading an EXPLAIN ANALYZE Output

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, u.email
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'pending'
  AND o.created_at > now() - interval '7 days';

A typical output:

Hash Join  (cost=142.50..3821.30 rows=1240 width=48)
           (actual time=12.3..89.4 rows=1187 loops=1)
  Buffers: shared hit=2340 read=187
  Hash Cond: (o.user_id = u.id)
  ->  Seq Scan on orders o
        (cost=0.00..3410.00 rows=1240 width=24)
        (actual time=0.05..67.2 rows=1187 loops=1)
        Filter: ((status = 'pending') AND (created_at > ...))
        Rows Removed by Filter: 98813
        Buffers: shared hit=2190 read=187
  ->  Hash  (cost=98.00..98.00 rows=3560 width=28)
        (actual time=11.9..11.9 rows=3560 loops=1)
        Buckets: 4096  Batches: 1
        ->  Seq Scan on users u  (cost=0.00..98.00 ...)

Key things to read:

  • cost=X..Y: estimated startup cost (X) and total cost (Y).
  • actual time=X..Y: real elapsed milliseconds (startup..total), averaged per loop; multiply by loops to get a node's true total.
  • rows=N: estimated rows vs actual rows. Large divergence usually means stale statistics, or column correlations the planner cannot see.
  • Buffers: shared hit=N read=M: pages found in cache (hit) vs read from disk (read). High read = cache miss pressure.
  • Rows Removed by Filter: rows scanned but discarded. High number on a seq scan = the filter is not selective enough to justify an index, or an index is missing.
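
The estimated-vs-actual comparison can be turned into a quick numeric signal. This is an illustrative helper, not a PostgreSQL API; the threshold of 10× is a common rule of thumb, not a hard rule:

```python
def estimate_drift(estimated: int, actual: int) -> float:
    # Order-of-magnitude gap between the planner's estimate and reality.
    hi = max(estimated, actual)
    lo = max(min(estimated, actual), 1)  # avoid division by zero
    return hi / lo

# The plan above: estimated 1240, actual 1187 -- a healthy estimate.
ok = estimate_drift(1240, 1187)    # ~1.04
# A 500x gap is a red flag: run ANALYZE or raise the statistics target.
bad = estimate_drift(100, 50_000)  # 500.0
```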

Statistics: The Optimizer's Blind Spot

The optimizer is only as good as its statistics. PostgreSQL's ANALYZE collects:

  • n_distinct: number of distinct values in a column (negative values encode a fraction of the row count).
  • most_common_vals: the most frequent values and their frequencies.
  • histogram_bounds: bucket boundaries for the value distribution.

-- Inspect statistics for a column
SELECT
  attname,
  n_distinct,
  most_common_vals,
  most_common_freqs,
  histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

When statistics are stale or missing, the optimizer guesses. Common symptoms:

  • The planner chooses a seq scan on a large table that has a relevant index—it thinks the filter is not selective, but the actual data is highly skewed.
  • The planner chooses a nested loop join on two large tables—it underestimates one table's row count.

-- Force re-analysis on a table with stale stats
ANALYZE orders;
 
-- For skewed data, increase statistics target for specific columns
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
-- Higher target = more histogram buckets = better estimates for skewed columns
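
The planner's equality-selectivity logic can be sketched as follows. This is a simplified model of how pg_stats feeds the estimate, not PostgreSQL's actual code; the column values and frequencies are invented for illustration:

```python
def eq_selectivity(value, mcv_vals, mcv_freqs, n_distinct):
    # Value is in the most-common-values list: use its measured frequency.
    if value in mcv_vals:
        return mcv_freqs[mcv_vals.index(value)]
    # Otherwise, spread the leftover probability mass evenly across
    # the distinct values that are not in the MCV list.
    remaining = 1.0 - sum(mcv_freqs)
    return remaining / max(n_distinct - len(mcv_vals), 1)

# Skewed status column: 90% 'completed', 6% 'shipped', two rarer values.
sel = eq_selectivity('pending', ['completed', 'shipped'], [0.90, 0.06], 4)
estimated_rows = round(100_000 * sel)  # 0.02 * 100,000 = 2000
```

This is why a higher statistics target helps skewed columns: a longer MCV list captures more of the heavy hitters, leaving less probability mass to be spread uniformly over values that may be rare in reality.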

Join Algorithm Selection

The planner chooses among three join algorithms:

flowchart TD
    J[Join Required] --> S{Estimated\nrow counts}
    S -->|One side small| NL[Nested Loop\nOuter × index lookup on inner\nBest for small × large with index]
    S -->|Both sides large\nno sort order| HJ[Hash Join\nBuild hash table on smaller side\nBest for large unsorted joins]
    S -->|Both sides sorted\nor sorted index available| MJ[Merge Join\nWalk both sorted inputs\nBest when data pre-sorted]

Nested loop cost grows as outer_rows × per-lookup cost on the inner side. Fast when the outer side is small (a few thousand rows or fewer) and the inner side has a good index.

Hash join builds a hash table from the smaller relation in memory, then probes it for each row of the larger relation. Cost is O(N + M). Falls back to disk-based batches if the hash table exceeds work_mem.

Merge join requires both inputs to be sorted on the join key. Fast when the data is already sorted (indexed columns), since it walks both sides simultaneously.
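
The crossover between nested loop and hash join can be illustrated with a toy cost model. The constants and formulas below are invented for illustration (the planner's real formulas are more involved), but they capture the shape of the trade-off:

```python
import math

SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01

def nested_loop_cost(outer_rows, inner_rows):
    # One B-tree descent per outer row; tree depth grows with log of inner size.
    depth = max(1, math.ceil(math.log(inner_rows, 100)))
    return outer_rows * (depth * RANDOM_PAGE_COST + CPU_TUPLE_COST)

def hash_join_cost(outer_rows, inner_rows, rows_per_page=100):
    # Read both sides sequentially once, plus CPU to build and probe the table.
    pages = (outer_rows + inner_rows) / rows_per_page
    return pages * SEQ_PAGE_COST + (outer_rows + inner_rows) * CPU_TUPLE_COST * 2

# 100 outer rows probing 1M inner rows: nested loop wins by a wide margin.
# 1M joining 1M: the hash join is orders of magnitude cheaper.
```

The key asymmetry: nested loop pays random I/O per outer row, so its cost scales linearly with the outer side, while the hash join pays a flat sequential pass over both sides regardless of how the rows match up.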

-- Increase work_mem to let hash joins fit in memory (per-session)
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;
-- If the plan shifts from Batches: 4 to Batches: 1, the hash join
-- was previously spilling to disk. This can 5–10x the join speed.

Common Plan Pathologies and Fixes

Symptom | Likely Cause | Fix
Seq scan on large table | Missing index, stale stats, low selectivity estimate | Create index, ANALYZE, increase statistics target
Nested loop on large × large | Row count underestimate | ANALYZE, check for statistics correlation
Hash join spills to disk | work_mem too low | Increase work_mem, or reduce join breadth
Index scan ignored | random_page_cost too high for SSD | Set random_page_cost = 1.1 for NVMe
Plan changes on different parameter value | Prepared statement switched to a generic cached plan | SET plan_cache_mode = force_custom_plan, or rewrite the query

-- On SSDs, random I/O is almost as fast as sequential
-- Tell the planner this so it stops over-preferring seq scans
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

Key Takeaways

  • The cost-based optimizer enumerates candidate plans and picks the lowest-estimated cost; costs are calibrated to page read units, not wall-clock time.
  • Statistics (pg_statistic, collected by ANALYZE) are the optimizer's input—stale or missing statistics cause bad plan choices that cannot be fixed by indexes alone.
  • EXPLAIN (ANALYZE, BUFFERS) shows both estimated and actual rows; large divergence between the two is the primary diagnostic signal for optimizer misestimates.
  • Join algorithm selection depends on estimated row counts and memory: nested loop for small outer relations with indexed inner, hash join for large unsorted relations, merge join when inputs are pre-sorted.
  • work_mem controls hash join and sort memory; if hash joins spill to disk (Batches > 1), increasing work_mem can dramatically reduce join latency.
  • On NVMe storage, set random_page_cost = 1.1 to prevent the planner from over-preferring sequential scans over index scans based on outdated spinning-disk cost assumptions.