
Sharding Strategies

Ravinder · 6 min read
Database Internals · Architecture · Sharding · Distributed Systems · Scalability

Sharding is the point where a database problem becomes an application architecture problem. Once you split data across multiple database nodes, every query, transaction, and schema migration must be aware of that split. Teams that reach for sharding too early find themselves maintaining distributed transaction logic with no meaningful performance gain. Teams that reach for it too late hit a wall at 3 AM when a single node cannot absorb another Black Friday spike. The decision of when and how to shard is one of the highest-leverage architectural choices you will make.

What Sharding Actually Means

Sharding partitions data across multiple independent database instances (shards), each owning a subset of rows. Unlike replication—where every node has all data—sharding means each node has only a fraction. Queries against data on a single shard are fast and local; queries that span shards require coordination.

The two fundamental questions:

  1. How do you decide which shard a given row lives on? (the shard key and routing function)
  2. What happens when a query or transaction touches multiple shards?

Hash Sharding

Hash sharding applies a hash function to the shard key and maps the result to a shard bucket.

def shard_for(user_id: int, num_shards: int) -> int:
    # Caution: Python's built-in hash() is randomized per process for
    # str/bytes keys; use a stable hash (e.g., hashlib) in production.
    return hash(user_id) % num_shards

Pros: Distributes rows uniformly across shards, preventing hot spots. Simple to implement.

Cons: Range queries (e.g., "all users created in January") scatter across all shards—you must query every shard and merge results (a scatter-gather query). Adding a shard changes the mapping for most keys, requiring a data migration (re-sharding).
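The re-sharding cost is easy to see with the modulo scheme above. Growing from 4 to 5 shards remaps most keys; the quick check below uses uid % n directly, which matches shard_for because CPython's hash(n) == n for small non-negative integers:

# Count how many of 100,000 keys change shards when num_shards
# goes from 4 to 5 under naive modulo routing.
moved = sum(1 for uid in range(100_000) if uid % 4 != uid % 5)
print(f"{moved / 100_000:.0%} of keys move")  # prints: 80% of keys move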

Range Sharding

Range sharding assigns contiguous key ranges to shards.

Shard 0: user_id         1 – 1,000,000
Shard 1: user_id 1,000,001 – 2,000,000
Shard 2: user_id 2,000,001 – 3,000,000
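
Routing is a sorted-boundary lookup. A minimal sketch using Python's bisect, with the boundaries from the table above:

import bisect

# Upper bound of each shard's range: shard i owns keys up to BOUNDS[i].
BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def shard_for_range(user_id: int) -> int:
    # bisect_left finds the first boundary >= user_id, which is the
    # shard whose range contains the key. Keys beyond the last
    # boundary need a new shard appended at the high end.
    return bisect.bisect_left(BOUNDS, user_id)

assert shard_for_range(1_000_000) == 0
assert shard_for_range(1_000_001) == 1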

Pros: Range queries are efficient—they hit at most a few adjacent shards. Easy to add a new shard at the high end of the range.

Cons: If new users skew toward the highest user IDs (monotonically increasing keys), the last shard absorbs all writes—a hot spot. Access patterns that are temporally correlated (e.g., event tables sharded on timestamp) create the same problem.

flowchart LR
    A[Application] --> R{Shard Router}
    R -->|user_id 1-1M| S0[Shard 0]
    R -->|user_id 1M-2M| S1[Shard 1]
    R -->|user_id 2M+| S2[Shard 2\n⚠️ Hot shard\nall new writes]
    style S2 fill:#f44336,color:#fff

Consistent Hashing

Consistent hashing places shards and keys on a virtual ring. Each shard owns the key range between itself and its predecessor on the ring. Adding or removing a shard only moves a fraction of keys—typically 1/n of total data.

Ring (positions 0–360°):
  Shard A at 60°
  Shard B at 180°
  Shard C at 300°
 
Key hashed to 90°  → owned by Shard B (next shard clockwise)
Key hashed to 200° → owned by Shard C
Key hashed to 320° → owned by Shard A (wraps around)

Adding Shard D at 240°: only keys between 180° and 240° move from Shard C to Shard D.

This is how Cassandra, DynamoDB, and Redis Cluster distribute data. Virtual nodes (vnodes) improve balance by assigning each physical shard multiple positions on the ring.
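
A toy ring in Python makes the mechanics concrete. This is a sketch, not any particular system's implementation; the MD5 positions and the vnode count are illustrative choices:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards: list[str], vnodes: int = 8):
        # Each physical shard gets `vnodes` positions on the ring,
        # which smooths out the key distribution.
        self.ring = sorted(
            (self._position(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )

    @staticmethod
    def _position(label: str) -> int:
        # Stable 32-bit ring position. Python's built-in hash() is
        # randomized per process for strings, so use a digest instead.
        return int.from_bytes(hashlib.md5(label.encode()).digest()[:4], "big")

    def shard_for(self, key: str) -> str:
        # Next shard clockwise from the key's position, wrapping at 0.
        i = bisect.bisect_right(self.ring, (self._position(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("user:42"))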

The tradeoff: consistent hashing still scatters range queries. It solves re-sharding cost, not query locality.

Directory-Based Sharding

A lookup service (the shard directory) maps each shard key to its shard. The application queries the directory before routing the request.

-- Shard directory table
SELECT shard_id FROM shard_map WHERE tenant_id = $1;
-- Then connect to the appropriate shard

Pros: Fully flexible—any key can be moved to any shard without a hash function change. Enables per-tenant isolation (each large customer on its own shard, small customers co-located).

Cons: The directory is a single point of failure and must be cached aggressively. Cache invalidation during shard moves requires careful coordination.
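
A minimal routing sketch, with an in-memory dict standing in for the shard_map table (the tenant names are made up; a real directory lives in a database or a coordination service and must be invalidated when tenants move):

from functools import lru_cache

# Stand-in for the shard_map table shown above.
SHARD_MAP = {"acme": 0, "globex": 1, "initech": 0}

@lru_cache(maxsize=100_000)
def shard_for_tenant(tenant_id: str) -> int:
    # Cached aggressively, as noted; moving a tenant requires
    # invalidating the cache (shard_for_tenant.cache_clear()).
    return SHARD_MAP[tenant_id]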

Cross-Shard Operations: The Real Cost

The biggest hidden cost of sharding is operations that span multiple shards.

-- This query on a sharded-by-user_id schema requires
-- querying every shard for orders where seller_id = $1
-- because seller_id is not the shard key
SELECT * FROM orders WHERE seller_id = $1;
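
In application code, that becomes a scatter-gather fan-out. A sketch, where query_shard is a hypothetical helper that runs a query against a single shard and returns its rows:

from concurrent.futures import ThreadPoolExecutor

def orders_for_seller(seller_id, shards, query_shard):
    # Fan the query out to every shard in parallel, then merge.
    sql = "SELECT * FROM orders WHERE seller_id = %s"
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(query_shard, s, sql, (seller_id,)) for s in shards]
        return [row for f in futures for row in f.result()]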

Solutions:

  • Dual-write: write to both the primary shard (by buyer) and a secondary shard (by seller). Doubles write load; consistency requires care.
  • Global tables: small, infrequently changing tables are replicated to all shards (e.g., currency codes, product categories).
  • Denormalization: embed seller data in each order row to avoid cross-shard joins.

Cross-shard transactions (two-phase commit / 2PC) are possible but expensive: each shard must vote, a coordinator must collect votes, and any participant failure blocks the transaction. Most sharded systems avoid 2PC by designing the shard key so that related data co-locates.
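
The coordinator's happy path is short; what makes 2PC expensive is everything around it (durable vote logging, timeouts, recovery after a coordinator crash). A sketch, assuming a hypothetical Shard interface with prepare/commit/abort methods:

def two_phase_commit(shards, txn) -> bool:
    # Phase 1 (prepare): every participant must vote yes.
    prepared = []
    for shard in shards:
        if not shard.prepare(txn):
            for p in prepared:  # a single "no" vote aborts everyone
                p.abort(txn)
            return False
        prepared.append(shard)
    # Phase 2 (commit): all voted yes, so the decision is final.
    for shard in shards:
        shard.commit(txn)
    return True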

App-Level Shard Key Selection

The shard key decision shapes your schema for years. Rules of thumb:

Rule                                                     Rationale
Use the entity that most queries filter by               Minimizes scatter-gather
Avoid monotonically increasing keys for range sharding   Prevents hot spots on inserts
Choose a high-cardinality key                            Low-cardinality keys cannot distribute across many shards
Co-locate related entities on the same shard             Enables local joins and transactions

For a multi-tenant SaaS, tenant_id is almost always the right shard key—it co-locates all data for a tenant, avoids cross-shard queries for tenant-scoped operations, and enables easy tenant isolation.

Key Takeaways

  • Hash sharding distributes load evenly but breaks range queries and makes re-sharding expensive; range sharding enables efficient range queries but risks hot spots on monotonic keys.
  • Consistent hashing minimizes data movement when adding or removing shards by mapping both keys and shards to a virtual ring; it is the dominant approach in distributed key-value and wide-column stores.
  • Cross-shard queries require scatter-gather execution across all shards; design your shard key to minimize how often this happens for your dominant query patterns.
  • Cross-shard transactions via 2PC are possible but costly—most architectures eliminate them by co-locating related data on the same shard.
  • Directory-based sharding trades routing simplicity for flexibility; it is powerful for multi-tenant systems where per-tenant placement matters.
  • Shard key selection is a one-way door: changing it requires a full data migration; get it right during the initial design.