Distributed Search Walkthrough
Search is one of those system design prompts where the surface area looks bounded ("design a search engine") but the depth is enormous. The gap between a search box that works and a search system that serves 50,000 queries per second with sub-100ms latency and relevant results is dozens of architectural decisions deep. Most candidates describe an Elasticsearch cluster and move on. This post excavates what that actually means and why the decisions inside it matter.
Scope First
Before designing, nail down what "search" means in this context:
- Full-text search over structured documents (e.g., job listings, products, articles) — the common case.
- Scale: 1 billion documents. 50K queries/second. Indexing pipeline: 10K new or updated documents/second.
- Latency: < 100ms p99 for query results.
- Ranking: relevance + recency + business signals (promoted listings, personalization).
- Out of scope: autocomplete, spell correction, semantic/vector search (unless interviewer steers toward it).
Derived: 1B documents at ~2KB average = ~2TB of raw document data. The inverted index for this corpus will run roughly 50–100% of the raw document size, so ~1–2TB of index data. That might fit on one machine's disk, but not in its memory, and no single machine can serve 50K QPS against it; the index must be distributed.
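Spelled out as a throwaway calculation (the constants are the assumptions above):

```python
DOCS = 1_000_000_000        # corpus size
AVG_DOC_BYTES = 2 * 1024    # ~2KB per document
INDEX_RATIO = (0.5, 1.0)    # index at 50-100% of raw size

raw_tb = DOCS * AVG_DOC_BYTES / 1024**4
print(f"raw: ~{raw_tb:.1f}TB, "
      f"index: ~{raw_tb * INDEX_RATIO[0]:.1f}-{raw_tb * INDEX_RATIO[1]:.1f}TB")
# raw: ~1.9TB, index: ~0.9-1.9TB
```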
The Inverted Index: What It Is and Why It Is Hard to Distribute
An inverted index maps each term to a posting list — the list of document IDs containing that term, along with term frequency and position information.
"python" → [(doc_7, tf=3, pos=[12,45,98]), (doc_24, tf=1, pos=[5]), ...]
"engineer" → [(doc_3, tf=2, pos=[1,67]), (doc_7, tf=1, pos=[4]), ...]For a query "python engineer", you retrieve both posting lists, intersect them (or score and merge), rank by relevance, and return the top K document IDs. Fetching full document content is a separate step.
The challenge at 1 billion documents: posting lists for common terms ("the", "software", "engineer") contain millions of entries. They must be compressed (delta encoding on sorted doc IDs, variable-length integer encoding), stored efficiently, and merged quickly during query time.
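A sketch of the core compression trick, assuming a doc-ID-sorted posting list (production codecs layer block-level encodings on top of this):

```python
def delta_varint_encode(doc_ids):
    """Delta-encode a sorted posting list, then varint-encode the gaps.
    Each gap is emitted 7 bits at a time, low bits first; the high bit
    of each byte flags whether more bytes follow."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

# Consecutive doc IDs compress to one byte each regardless of magnitude:
ids = [1_000_000, 1_000_001, 1_000_003, 1_000_007]
print(len(delta_varint_encode(ids)))  # 6 bytes vs. 32 for four raw int64s
```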
Sharding Strategy: Document Partitioning vs. Term Partitioning
Two fundamental approaches to distributing an inverted index:
Document partitioning: each shard holds a subset of documents and a complete local index for those documents. A query fans out to all shards in parallel (scatter), collects top-K results from each shard (gather), and merges to produce the global top-K.
Term partitioning: each shard holds the full posting list for a subset of terms. A query routes each term to the appropriate shard, retrieves posting lists in parallel, and merges locally.
Document partitioning is the dominant approach in production systems (Elasticsearch, Solr). Reasons: term partitioning is hard to balance (rare terms are tiny, common terms are enormous), and multi-term queries require fetching from every term shard anyway. Document partitioning distributes evenly and parallelizes naturally.
The scatter-gather overhead is the tradeoff: every query hits every shard. With 50 shards and 50K QPS, that is 2.5M shard-level operations per second. This is manageable but must be accounted for in sizing.
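The gather side is cheap relative to the scatter: each shard returns at most K candidates, so the coordinator merges at most shards × K entries. A sketch, assuming per-shard results arrive as (score, doc_id) pairs sorted descending (real engines also reconcile per-shard term statistics so relevance scores are comparable across shards):

```python
import heapq

def global_top_k(shard_results, k=10):
    """Gather: merge per-shard top-K lists of (score, doc_id) pairs,
    each already sorted descending, into the global top-K."""
    return heapq.nlargest(k, heapq.merge(*shard_results, reverse=True))

shard_a = [(9.1, 42), (7.3, 17)]
shard_b = [(8.8, 99), (2.0, 5)]
print(global_top_k([shard_a, shard_b], k=3))
# [(9.1, 42), (8.8, 99), (7.3, 17)]
```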
Replication: Availability vs. Throughput
Each shard should have at least one replica for read throughput and one for fault tolerance, so typically two replicas per shard (one primary plus two replicas, three copies of every shard).
Replicas serve read queries directly; the primary handles index writes and propagates to replicas asynchronously. This means search results can lag behind the latest index updates by seconds — acceptable for most use cases.
With 50 primary shards and 2 replicas each, you have 150 shard copies. Sizing: ~1.5TB of index split across 50 primaries is roughly 30GB per shard, so 150 copies × ~30GB ≈ 4.5TB of total index storage. With modern NVMe SSDs and plenty of RAM for OS page cache (hot posting lists stay in memory), 15–20 nodes with 32–64GB RAM and a few hundred GB of index each is a reasonable cluster.
The Indexing Pipeline
Documents do not flow directly into the search index. They pass through an ingestion pipeline:
Source (Kafka) → Document Fetcher → Text Processor → Indexer → Shard
Text Processor:
1. Tokenization (split text into terms)
2. Normalization (lowercase, unicode normalization)
3. Stop-word removal ("the", "a", "is")
4. Stemming / lemmatization ("engineers" → "engineer")
5. Field boosting (title terms weighted 3× body terms)
6. Generate a document vector for ML ranking signals

At 10K document updates/second, the text processing pipeline needs to be horizontally scalable. A Kafka-backed pipeline with parallel consumer groups for processing and bulk-indexing is the standard approach; a sketch of the per-document processing (steps 1–4) follows.
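A minimal sketch of steps 1–4, assuming a trivial stop-word list and a deliberately naive suffix-stripping stemmer (production analyzers use Unicode folding and a Porter/Snowball stemmer):

```python
import re

STOP_WORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list

def process(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())            # 1. tokenize + 2. normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]        # 3. stop-word removal
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # 4. (very) naive stemming

print(process("The Senior Engineers of Python"))
# ['senior', 'engineer', 'python']
```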
Near-real-time indexing: Elasticsearch refreshes its index every second by default — committing in-memory writes to a searchable segment. This 1-second lag is acceptable for most document search use cases. For truly real-time search (e.g., live social media monitoring), you reduce the refresh interval at the cost of more I/O overhead.
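For reference, the refresh interval is a per-index Elasticsearch setting. A sketch using `requests`, where the host and the `documents` index name are assumptions:

```python
import requests

# index.refresh_interval is a standard Elasticsearch index setting;
# the host and index name here are placeholders.
resp = requests.put(
    "http://localhost:9200/documents/_settings",
    json={"index": {"refresh_interval": "5s"}},  # trade freshness for less segment-flush I/O
)
resp.raise_for_status()
```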
Ranking: The Layer Most Candidates Skip
Returning a list of matching documents is not the same as returning a relevant list. Ranking is a pipeline:
- Retrieval score (BM25): term frequency + inverse document frequency. This is the default relevance signal in most full-text search engines.
- Freshness signal: a recency decay function that boosts newer documents for time-sensitive queries.
- Business signals: promoted documents, verified sellers, quality scores.
- Personalization: user interaction history (clicked, bookmarked) reweights individual results. Usually applied as a late-stage reranking on the top-100 BM25 results.
```python
def rank(query, candidate_docs, user_context):
    # bm25_score, recency_decay, and personalization_score are the signal
    # functions described above; the weights are illustrative and would be
    # tuned (or learned) in production.
    for doc in candidate_docs:
        doc.score = (
            0.6 * bm25_score(query, doc)
            + 0.2 * recency_decay(doc.published_at)
            + 0.1 * doc.quality_score
            + 0.1 * personalization_score(user_context, doc)
        )
    return sorted(candidate_docs, key=lambda d: d.score, reverse=True)[:10]
```

Full ML-based ranking (learning-to-rank, two-tower models) requires feature pipelines and model serving infrastructure: out of scope for most interviews, but worth mentioning as the production evolution.
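Two of the helper signals can be sketched directly. The BM25 form below is the standard one (with the smoothed IDF variant Lucene uses); the signatures are expanded relative to `rank()` above to make the corpus statistics explicit, and the half-life is an assumed tuning knob:

```python
import math
import time

def bm25_score(query_terms, doc, N, df, k1=1.2, b=0.75, avgdl=200.0):
    """Textbook BM25. N = corpus size, df[term] = document frequency;
    doc.tf (term -> count) and doc.length (token count) are assumed fields."""
    score = 0.0
    for term in query_terms:
        tf = doc.tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc.length / avgdl))
    return score

def recency_decay(published_at, half_life_days=30.0):
    """Exponential decay: the freshness boost halves every half_life_days.
    published_at is assumed to be a unix timestamp."""
    age_days = (time.time() - published_at) / 86400
    return 0.5 ** (age_days / half_life_days)
```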
Capacity Summary
| Component | Scale |
|---|---|
| Query QPS | 50K/sec |
| Shards (document partitioned) | 50 primary + 100 replica = 150 total |
| Scatter-gather per query | 50 shard requests/query → 2.5M shard ops/sec |
| Indexing throughput | 10K documents/sec → Kafka topic with 10+ partitions consumed in parallel |
| Total index storage | ~4.5TB across 150 shard copies (~30GB per copy) |
| Cluster size | ~15–20 nodes, 32–64GB RAM, NVMe SSD |
Key Takeaways
- Document partitioning (scatter-gather) is the standard sharding strategy for distributed search — term partitioning is academically interesting but hard to balance in practice.
- Every query fans out to all shards; this scatter-gather overhead must be designed for explicitly — 50 shards × 50K QPS = 2.5M shard operations per second.
- The indexing pipeline (tokenization, normalization, stemming, field boosting) is as architecturally significant as the query path — 10K doc/sec requires a horizontally scalable Kafka-backed pipeline.
- Replicas serve read throughput and fault tolerance, not just redundancy — route query traffic to replicas to offload primaries.
- Near-real-time indexing (1-second refresh interval) is a deliberate latency tradeoff — reducing it increases I/O overhead.
- Ranking is a multi-signal pipeline (BM25 + recency + quality + personalization), not just a relevance sort — knowing this separates candidates who have shipped search from those who have only read about it.