Storage Engines: Row vs Column
Most engineers choose a database the same way they choose a cloud region: pick what the last team used and move on. That works—until a query that should take milliseconds starts taking minutes, or your write throughput collapses under analytic load. The root cause is almost always a mismatch between your workload and the storage engine's physical layout.
Storage engines are not an implementation detail. They are the first decision a database makes about your data, and every subsequent behavior—index efficiency, compression ratio, query parallelism—flows from it.
How Row Storage Organizes Data on Disk
Row-oriented engines (PostgreSQL, MySQL InnoDB, SQLite) write every column of a row contiguously in a heap file or page:

```
Page 0: [ id=1, name="Alice", age=30, balance=1200.00 ]
        [ id=2, name="Bob",   age=25, balance=800.00  ]
        [ id=3, name="Carol", age=28, balance=3400.00 ]
```

When you run `SELECT * FROM users WHERE id = 2`, the engine reads the page containing row 2 and returns everything in one I/O. That is ideal for OLTP: short, primary-key lookups that need all columns.
The penalty arrives when you run analytic queries:

```sql
SELECT AVG(balance) FROM accounts; -- scans every row just to read one column
```

Every page must be read even though only `balance` matters. For a table with 50 columns and 100 million rows, you are pulling 49 columns of irrelevant data off disk on every scan.
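You can watch a row engine make both decisions in SQLite, which ships with Python. The table and data below are made up for illustration; the two planner lines are the point:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT,
        age INTEGER,
        balance REAL
    )
""")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [(1, "Alice", 30, 1200.00), (2, "Bob", 25, 800.00), (3, "Carol", 28, 3400.00)],
)

# Point lookup: the planner searches the primary key -- one page, all columns.
point = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM accounts WHERE id = 2"
).fetchall()
print(point[0][3])  # a SEARCH using the integer primary key

# Aggregate over one column: the planner must still scan every row.
agg = conn.execute(
    "EXPLAIN QUERY PLAN SELECT AVG(balance) FROM accounts"
).fetchall()
print(agg[0][3])    # a full table SCAN
```

The `SEARCH` vs `SCAN` distinction in the plan detail is exactly the OLTP/OLAP split described above.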
How Column Storage Organizes Data on Disk
Columnar engines (DuckDB, ClickHouse, Redshift, Parquet files read by Spark) store each column as its own contiguous region:

```
Column "id":      [ 1, 2, 3, 4, 5, ... ]
Column "name":    [ "Alice", "Bob", "Carol", ... ]
Column "balance": [ 1200.00, 800.00, 3400.00, ... ]
```

`SELECT AVG(balance)` now reads only the balance column—a fraction of the data. Compression ratios are dramatically better too because adjacent values in the same column tend to be similar (e.g., all balances hover around the same order of magnitude), making run-length encoding and dictionary compression highly effective.
The tradeoff: inserting a single row requires writing to every column file. Point lookups that need many columns must stitch values back together from multiple locations.
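A toy byte-count model makes the asymmetry concrete. The record widths below are illustrative choices, not taken from any real engine:

```python
import struct

# Toy dataset: 1,000 rows of (id, name, age, balance).
rows = [(i, f"user{i:04d}", 20 + i % 50, i * 1.5) for i in range(1000)]

# Row layout: all columns of a row packed contiguously
# (8-byte id, 32-byte fixed name, 8-byte age, 8-byte balance = 56 bytes/row).
row_file = b"".join(
    struct.pack("<q32sqd", rid, name.encode(), age, bal)
    for rid, name, age, bal in rows
)

# Column layout: the balance column alone, as its own contiguous region.
balance_file = b"".join(struct.pack("<d", bal) for *_, bal in rows)

# AVG(balance) scans the whole row file, but only the balance region
# in the columnar layout.
print(len(row_file), len(balance_file))  # 56000 8000
print(f"row layout reads {len(row_file) // len(balance_file)}x more bytes")
```

The same model shows the write-side tradeoff in reverse: appending one row touches one contiguous region in the row file, but one region per column in the columnar layout.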
Decision Matrix
App-Level Signals That Tell You Which to Pick
Pick row storage when:
- Your queries are keyed by primary key or a narrow indexed range.
- You need `INSERT`, `UPDATE`, `DELETE` at high frequency.
- Rows are frequently returned whole (e.g., loading a user profile).
- Your schema changes often—wide tables are cheaper to alter without rebuilding column files.
Pick columnar storage when:
- Queries aggregate over millions of rows but touch fewer than 20% of columns.
- Data lands in bulk (nightly ETL, event streams) rather than individual inserts.
- Compression matters—columnar layouts routinely achieve 5–10x better ratios.
- You run GROUP BY, window functions, or time-series rollups as your dominant pattern.
Hybrid Approaches You Will Encounter
Real systems rarely fit cleanly into one bucket.
PostgreSQL TOAST automatically moves wide columns (text, jsonb, bytea) out of the main heap into a secondary TOAST table. This gives you some separation without a true columnar layout.
TimescaleDB hypertables partition time-series data in time chunks, compressing older chunks into a columnar format. You get row storage behavior for recent data and columnar behavior for historical queries—without changing your SQL.
Delta Lake / Iceberg store Parquet (columnar) files on object storage but add a transaction log so you can run ACID updates. Apps write row-by-row via a streaming engine; the file layer compacts into columnar chunks.
```sql
-- PostgreSQL: check what columns are being pulled for a heavy query
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT region, SUM(revenue)
FROM sales
GROUP BY region;
-- Look for "Seq Scan" with high "Buffers: shared hit" — that is your signal
-- that the engine is reading far more data than the query actually needs.
```

Compression as a First-Class Consideration
Columnar compression is not just a storage saving—it directly reduces I/O and therefore query latency. Here is what common schemes do to your data:
| Encoding | Best for | Typical ratio |
|---|---|---|
| Run-length (RLE) | Low-cardinality columns | 10–50x |
| Dictionary | String columns, enums | 3–10x |
| Delta + bit-pack | Timestamps, auto-inc IDs | 4–8x |
| ZSTD frame | General-purpose fallback | 2–4x |
Row engines apply compression at the page level, not the column level, so they cannot exploit column-level value similarity as effectively.
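The first two encodings in the table are simple enough to sketch in a few lines. These are toy implementations for intuition, not what any production engine ships:

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encoding: collapse each run of identical values to (value, count)."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def dict_encode(values):
    """Dictionary encoding: map distinct strings to small integer codes."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), encoded

# Low-cardinality column, sorted so identical values are adjacent:
region = ["EU"] * 500 + ["US"] * 400 + ["APAC"] * 100

# 1,000 strings collapse to three (value, count) pairs.
print(rle_encode(region))  # [('EU', 500), ('US', 400), ('APAC', 100)]

# Or to a 3-entry dictionary plus 1,000 small integers.
dictionary, encoded = dict_encode(region)
print(dictionary)          # ['EU', 'US', 'APAC']
print(len(encoded))        # 1000
```

Note that RLE only pays off when identical values sit next to each other, which is why columnar engines often sort or cluster data on the low-cardinality column first.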
What This Means for Schema Design
If you are on a row engine serving analytic queries, index selection becomes critical. A covering index that includes only the queried columns approximates columnar behavior:

```sql
-- Instead of a full table scan for a reporting query:
CREATE INDEX idx_sales_summary
    ON sales (region, sale_date)
    INCLUDE (revenue);
-- Now the planner can satisfy the query from the index alone (index-only scan)
-- without touching the heap at all.
```

On a columnar engine, wide tables are cheap—adding 20 metadata columns costs almost nothing for queries that do not touch those columns. This inverts the normalization calculus: denormalize aggressively to avoid joins, because joins across large datasets in a columnar engine still require coordination.
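You can see the covering effect without a PostgreSQL instance using SQLite, which has no `INCLUDE` clause; appending `revenue` as a trailing key column gives the same covering property. Table and data here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "2024-01-01", 100.0), ("US", "2024-01-02", 250.0),
     ("EU", "2024-01-03", 80.0)],
)

# SQLite has no INCLUDE, so revenue goes into the key itself;
# the index still contains every column the query needs.
conn.execute(
    "CREATE INDEX idx_sales_summary ON sales (region, sale_date, revenue)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT region, SUM(revenue) FROM sales GROUP BY region"
).fetchall()
print(plan[0][3])  # the plan detail should mention a COVERING INDEX scan
```

The "COVERING INDEX" note in the plan detail is SQLite's equivalent of PostgreSQL's index-only scan: the heap is never touched.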
Key Takeaways
- Row storage is optimized for point lookups and high-frequency writes; columnar storage is optimized for aggregations over narrow column slices across many rows.
- The physical layout is not abstract—it determines how many bytes cross the I/O bus for every query your application runs.
- Compression ratios in columnar engines are significantly better because identical or similar values are stored adjacent, enabling dictionary and RLE encoding.
- Hybrid designs (TimescaleDB compression, Delta Lake, materialized views) let you serve both OLTP and OLAP workloads without committing to a single engine.
- On a row engine, covering indexes are your best lever for approximating columnar behavior on hot reporting queries.
- Schema decisions—normalization depth, column count, data types—have direct physical consequences that differ between row and columnar engines.
Part 2 →
B-trees and LSM