Database Internals

Indexes, End to End

Ravinder·February 9, 2026·6 min read

DatabaseInternalsArchitectureIndexesPostgreSQLPerformance

Series

Database Internals for App Engineers

Part 9 of 10

← Part 8

Query Planning

Part 10 →

The Cost of Consistency

The most common database performance antipattern is not a missing index—it is a table with fifteen indexes, half of which the planner never uses, all of which slow down every write. Indexes are a read/write tradeoff. Understanding that tradeoff end to end—how each index type is structured, when the planner will use it, and what it costs on writes—is what separates engineers who tune databases from engineers who randomly add indexes until things stop being slow.

How B-tree Indexes Are Structured

The default index type in PostgreSQL, MySQL, and most relational databases is a B-tree. An index is a separate data structure that maps column values to heap row locations (TIDs in PostgreSQL: page number + slot offset).

B-tree index on users(email):
 
Root: [m@ex.com | z@ex.com]
       /             |           \
[a@ex.com .. l@ex.com] [m@ex.com .. y@ex.com] [z@ex.com ..]
      |
[TID(3,1) → a@ex.com]
[TID(7,4) → b@ex.com]
[TID(1,2) → c@ex.com]
...

When you query WHERE email = 'a@example.com':

The planner traverses the B-tree from root to leaf: O(log N) page reads.
It finds the matching TID(s).
It fetches the heap row at that TID.

Step 3—the heap fetch—is the cost that index-only scans eliminate. If the query only needs columns present in the index, the heap fetch is skipped entirely.

Index Types and When to Use Each

PostgreSQL supports multiple index access methods, each optimized for different operators:

Index Type	Structure	Best For	Not For
B-tree	Balanced tree	`=`, `<`, `>`, `BETWEEN`, `LIKE 'prefix%'`	Full-text, array containment
Hash	Hash table	`=` only	Range queries
GIN	Inverted index	Array containment, full-text search, JSONB	Ordered access
GiST	Generalized search tree	Geometric types, range overlap, similarity	High write throughput
BRIN	Block range index	Monotonically increasing columns (timestamps)	Random access patterns
Bloom	Bit signature filter	Equality on many columns simultaneously	Range queries

-- GIN index for JSONB queries
CREATE INDEX idx_events_payload ON events USING GIN (payload);
-- Enables: WHERE payload @> '{"type": "click"}'
 
-- BRIN for time-series (tiny index, good enough for appended data)
CREATE INDEX idx_logs_created USING BRIN (created_at);
-- 128-page blocks; works because new rows always have larger timestamps
 
-- Partial index: only index rows that match a condition
CREATE INDEX idx_orders_pending
  ON orders (created_at)
  WHERE status = 'pending';
-- Tiny index, fast for the common query, no overhead for completed orders

The Write Cost of Indexes

Every index is updated on every write that touches the indexed columns. For a table with 10 indexes receiving 50,000 inserts/second:

Each INSERT:
  1 heap page write
+ 10 index page writes (one per index)
= 11 page writes per row
 
At 50,000 rows/sec = 550,000 page writes/sec
vs 50,000 page writes/sec with no indexes

This is write amplification by index count. It compounds with B-tree page splits under random-key inserts.

-- Find unused indexes (safe to drop candidates)
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelid NOT IN (
    SELECT conindid FROM pg_constraint WHERE contype IN ('p','u')
  )
ORDER BY pg_relation_size(indexrelid) DESC;
-- idx_scan = 0 since last stats reset → the planner never used this index

Drop unused indexes. They cost writes, consume disk, and the planner must consider them during planning (adding overhead).

Covering Indexes and Index-Only Scans

A covering index includes all columns a query needs, eliminating the heap fetch:

-- Query: SELECT user_id, total FROM orders WHERE status = 'shipped'
-- Without covering index: index scan → TID → heap fetch for each row
 
-- With covering index (include non-key columns):
CREATE INDEX idx_orders_status_covering
  ON orders (status)
  INCLUDE (user_id, total);
 
-- EXPLAIN now shows "Index Only Scan" — no heap access
EXPLAIN SELECT user_id, total FROM orders WHERE status = 'shipped';
-- Index Only Scan using idx_orders_status_covering on orders
-- (cost=0.43..134.22 rows=4210 width=12)

Index-only scans are the fastest possible read path. The only overhead is the visibility map check (to verify that the heap page is all-visible, meaning VACUUM has processed it). On a well-vacuumed table, index-only scans avoid all heap I/O.

Composite Index Column Order

For a composite index (a, b, c), the index is useful for queries filtering on:

a alone
a, b together
a, b, c together

It is NOT useful for queries filtering on b alone or c alone. Column order in a composite index follows the leftmost prefix rule.

-- Index: (region, status, created_at)
-- Useful for:
WHERE region = 'us-east'
WHERE region = 'us-east' AND status = 'active'
WHERE region = 'us-east' AND status = 'active' AND created_at > '2025-01-01'
 
-- NOT useful for:
WHERE status = 'active'          -- skips leading column
WHERE created_at > '2025-01-01' -- skips two leading columns

Order guideline: equality filters first (high-selectivity columns before low-selectivity), range filter last.

Index Bloat

B-tree indexes accumulate bloat under heavy UPDATE/DELETE load, just like heap tables. Dead index entries remain until VACUUM cleans them. Index bloat degrades scan performance (more pages to read) and wastes disk.

flowchart LR I[INSERT 1M rows] --> B[B-tree fills pages] D[DELETE 500K rows] --> T[Dead entries remain\nin index pages] T --> BL[Index bloat:\nindex reads more pages\nthan necessary] BL --> V[VACUUM reclaims space] V --> R[Index pages compacted]

-- Check index bloat using pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('idx_orders_status_covering');
-- dead_tuple_percent > 10% → consider REINDEX CONCURRENTLY

REINDEX CONCURRENTLY rebuilds the index without locking the table—essential for production systems.

Key Takeaways

Every index is a write tax: each additional index adds one index page write per INSERT/UPDATE/DELETE on the indexed columns; audit and drop unused indexes regularly.
Index type selection matters: B-tree for ordered comparisons, GIN for array/JSONB/full-text, BRIN for monotonically increasing columns, partial indexes for filtered subsets.
Covering indexes with INCLUDE columns enable index-only scans, eliminating heap fetches entirely and providing the fastest possible read path.
Composite index column order follows the leftmost prefix rule: equality-filtered columns first, range-filtered columns last.
Index bloat from DELETE/UPDATE accumulation degrades scan performance; monitor with pgstattuple and rebuild with REINDEX CONCURRENTLY.
The planner will not use an index if it estimates that the filter is not selective enough (seq scan is cheaper)—stale statistics or missing column statistics are the usual cause.