
Hybrid Search: BM25 + Vector

Ravinder · 8 min read
RAG · AI · LLM · Information Retrieval · BM25 · Hybrid Search

There's a persistent myth in the RAG community that dense vector search supersedes keyword search. It doesn't. BM25 consistently outperforms vector search on queries containing product names, model numbers, error codes, proper nouns, and rare technical terms — all the things your users are most likely to paste from an error log or a support ticket. Building a production RAG pipeline that ignores lexical retrieval is leaving a meaningful chunk of recall on the table.

This post covers how hybrid search works, the three fusion strategies worth knowing, and how to tune the weights without guessing.

Why Each Method Fails Alone

Vector search converts both query and documents to dense representations and finds nearest neighbors by cosine similarity. It excels at semantic matching — "how do I reset my password" finds "account recovery options" even though the words don't overlap. It fails when the signal is in a specific token: "Error E2048" returns everything semantically adjacent to errors rather than the one document that mentions E2048.

BM25 is a term-frequency/inverse-document-frequency scoring function — essentially "how often does this query term appear in this document relative to the corpus baseline." It excels at exact matches and rare terms. It fails when the user asks in different words than the document uses.
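
For reference, the scoring function that rank_bm25's BM25Okapi implements is the standard Okapi formulation. For a query Q and document D:

\[
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
\]

where f(q_i, D) is q_i's term frequency in D, |D| is the document length in tokens, avgdl is the mean document length across the corpus, and k1 and b are free parameters (rank_bm25 defaults: k1 = 1.5, b = 0.75).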

flowchart LR
    Q[User Query] --> BM25[BM25\nLexical Index]
    Q --> VEC[Vector Search\nDense Index]
    BM25 --> BL[BM25 ranked list\ne.g. top-50]
    VEC --> VL[Vector ranked list\ne.g. top-50]
    BL --> FUS[Fusion Layer\nRRF / Linear / Learned]
    VL --> FUS
    FUS --> FINAL[Final top-K\nfor LLM context]

The practical implication: a user querying "configure webhook for Stripe event payment_intent.succeeded" needs both. BM25 finds the document mentioning that exact event name. Vector search finds conceptually adjacent documents about payment webhooks. The fusion step surfaces the most relevant content from both.

Setting Up Both Indexes

Most teams already have a vector database. The BM25 index is the piece that's often missing. If your search stack can serve both lexical and dense retrieval from one system (Weaviate, Qdrant, and Elasticsearch all can), store both there. Otherwise run BM25 separately with rank_bm25 or Whoosh.

from rank_bm25 import BM25Okapi
import re
from typing import Optional
 
class BM25Index:
    def __init__(self, tokenizer=None):
        self._tokenizer = tokenizer or self._default_tokenize
        self._corpus: list[str] = []
        self._ids: list[str] = []
        self._bm25: Optional[BM25Okapi] = None
 
    @staticmethod
    def _default_tokenize(text: str) -> list[str]:
        # Lowercase, preserve hyphens and dots (important for error codes)
        text = text.lower()
        tokens = re.findall(r"[a-z0-9]+(?:[.\-_][a-z0-9]+)*", text)
        return tokens
 
    def add_documents(self, docs: list[dict]) -> None:
        """docs: [{id, text}, ...]"""
        for doc in docs:
            self._ids.append(doc["id"])
            self._corpus.append(doc["text"])
        # BM25Okapi is immutable, so re-tokenize and rebuild over the full corpus
        tokenized = [self._tokenizer(text) for text in self._corpus]
        self._bm25 = BM25Okapi(tokenized)
 
    def search(self, query: str, top_k: int = 50) -> list[dict]:
        if self._bm25 is None:
            raise RuntimeError("Index is empty")
        tokens = self._tokenizer(query)
        scores = self._bm25.get_scores(tokens)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [
            {"id": self._ids[i], "score": float(scores[i]), "rank": rank + 1}
            for rank, i in enumerate(top_indices)
            if scores[i] > 0  # skip zero-score docs
        ]

One tokenization detail that matters: preserve compound tokens like payment_intent.succeeded, text-embedding-3-large, and version strings like v2.3.1. The default tokenizer above does this with a character-class regex rather than whitespace splitting.
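
A quick sanity check of that tokenizer (the input string is illustrative):

print(BM25Index._default_tokenize(
    "Configure the payment_intent.succeeded webhook for v2.3.1 and text-embedding-3-large"
))
# ['configure', 'the', 'payment_intent.succeeded', 'webhook', 'for',
#  'v2.3.1', 'and', 'text-embedding-3-large']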

Fusion Strategy 1: Reciprocal Rank Fusion

RRF is the default choice for hybrid search. It ignores raw scores entirely and fuses on rank position alone: every list a document appears in contributes 1/(k + rank) to its combined score. With the conventional k = 60, a document ranked 1st in BM25 and 5th in vector scores 1/61 + 1/65 ≈ 0.032; a document that appears in only one list simply gets no contribution from the other.

def reciprocal_rank_fusion(
    bm25_results: list[dict],   # [{id, rank}, ...]
    vector_results: list[dict], # [{id, rank}, ...]
    k: int = 60,                # RRF smoothing constant — 60 is Cormack et al.'s default
) -> list[dict]:
    scores: dict[str, float] = {}
 
    for result in bm25_results:
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + result["rank"])
 
    for result in vector_results:
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + result["rank"])
 
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": doc_id, "rrf_score": score} for doc_id, score in ranked]

RRF is robust and effectively parameter-free (the k=60 default holds up empirically across many domains). Start here.
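
A toy run of the function above, with document "a" ranked 1st lexically and 5th semantically (IDs are illustrative):

bm25_list = [{"id": "a", "rank": 1}, {"id": "b", "rank": 2}]
vector_list = [{"id": "c", "rank": 1}, {"id": "a", "rank": 5}]
print(reciprocal_rank_fusion(bm25_list, vector_list))
# "a" leads with 1/61 + 1/65 ≈ 0.0318, ahead of "c" (1/61 ≈ 0.0164) and "b" (1/62 ≈ 0.0161)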

Fusion Strategy 2: Linear Score Combination

If you want to control the balance between lexical and semantic signal, normalize both score distributions and combine them with a tunable weight:

import numpy as np
 
def linear_fusion(
    bm25_results: list[dict],
    vector_results: list[dict],
    alpha: float = 0.5,  # weight for vector; (1-alpha) for BM25
    top_k: int = 10,
) -> list[dict]:
    def normalize(results: list[dict], score_key: str) -> dict[str, float]:
        if not results:
            return {}
        raw = np.array([r[score_key] for r in results])
        min_v, max_v = raw.min(), raw.max()
        if max_v == min_v:
            return {r["id"]: 0.5 for r in results}
        normalized = (raw - min_v) / (max_v - min_v)
        return {r["id"]: float(s) for r, s in zip(results, normalized)}
 
    bm25_norm = normalize(bm25_results, "score")
    vec_norm = normalize(vector_results, "score")
 
    all_ids = set(bm25_norm) | set(vec_norm)
    combined = {
        doc_id: alpha * vec_norm.get(doc_id, 0.0) + (1 - alpha) * bm25_norm.get(doc_id, 0.0)
        for doc_id in all_ids
    }
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return [{"id": doc_id, "score": score} for doc_id, score in ranked[:top_k]]

The alpha parameter is your tuning knob: lower values weight BM25 more heavily and favor exact-match queries; higher values favor semantic matching. Tune it against your golden set by query type.

Fusion Strategy 3: Learned Fusion

For high-traffic systems with enough labeled data, train a lightweight classifier on query features to predict the optimal alpha per query:

from sklearn.linear_model import LogisticRegression
import numpy as np
 
def extract_query_features(query: str) -> list[float]:
    """
    Simple features that correlate with BM25 vs vector preference.
    """
    words = query.lower().split()
    return [
        len(words),                                        # query length
        sum(1 for w in words if len(w) > 10) / max(1, len(words)),  # long-word ratio (jargon signal)
        int(any(c.isdigit() for c in query)),             # contains digits (error codes, versions)
        int(any(c.isupper() for c in query[1:])),         # contains uppercase mid-string (acronyms, names)
        sum(1 for w in words if "_" in w or "." in w),    # dotted/underscored tokens
    ]
 
# Train on labeled queries: label=1 means BM25 was better, 0 means vector was better
# Then at serve time: alpha = 1 - model.predict_proba([features])[0][1]
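
To make that concrete, here is a minimal end-to-end sketch. The training queries and labels below are purely illustrative stand-ins for your logged evaluation data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative labels: 1 = BM25 retrieved the relevant chunk better, 0 = vector did
labeled_queries = [
    ("Error E2048 on checkout", 1),
    ("payment_intent.succeeded webhook setup", 1),
    ("how do I reset my password", 0),
    ("why is my account locked", 0),
]
X = np.array([extract_query_features(q) for q, _ in labeled_queries])
y = np.array([label for _, label in labeled_queries])
model = LogisticRegression().fit(X, y)

def alpha_for(query: str) -> float:
    """Per-query vector weight: low when the model predicts BM25 will win."""
    p_bm25_better = model.predict_proba([extract_query_features(query)])[0][1]
    return 1.0 - p_bm25_better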

In practice, query-feature-based routing often outperforms a fixed alpha by 3–7 recall points, especially on corpora with mixed query types.

Tuning Alpha Against Your Golden Set

Don't pick alpha by intuition. Sweep it:

import numpy as np
from typing import Optional
 
def tune_hybrid_alpha(
    golden_examples: list[dict],          # [{query, relevant_chunk_ids}, ...]
    bm25_results_by_query: list[list[dict]],
    vector_results_by_query: list[list[dict]],
    alpha_range: Optional[list[float]] = None,
    k: int = 5,
) -> dict:
    if alpha_range is None:
        alpha_range = [round(a * 0.1, 1) for a in range(0, 11)]
 
    best_alpha = 0.5
    best_recall = 0.0
    sweep = {}
 
    for alpha in alpha_range:
        recalls = []
        for ex, bm25_res, vec_res in zip(
            golden_examples, bm25_results_by_query, vector_results_by_query
        ):
            fused = linear_fusion(bm25_res, vec_res, alpha=alpha, top_k=k)
            retrieved = {r["id"] for r in fused}
            relevant = set(ex["relevant_chunk_ids"])
            recalls.append(len(retrieved & relevant) / max(1, len(relevant)))
 
        mean_recall = float(np.mean(recalls))
        sweep[alpha] = mean_recall
        if mean_recall > best_recall:
            best_recall = mean_recall
            best_alpha = alpha
 
    return {"best_alpha": best_alpha, "best_recall@k": best_recall, "sweep": sweep}

A common finding: the optimal alpha is around 0.6–0.7 for general knowledge corpora, and drops to 0.3–0.4 for technical/code-heavy corpora where exact matches matter more.

Operational Considerations

Running two indexes doubles your indexing pipeline surface area. A few things to keep in mind:

Sync on update: When you update or delete a document, you must update both the BM25 index and the vector index. Stale BM25 results that reference deleted documents surface bad retrievals that are hard to debug.
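
A minimal sketch of the dual write, assuming a hypothetical vector client that exposes an upsert method; because rank_bm25 indexes are immutable, the lexical side is rebuilt from your source-of-truth document store. For true atomicity, drive both writers from a single outbox or queue rather than calling them inline:

def sync_upsert(
    changed_docs: list[dict],  # [{id, text}, ...] documents added or updated
    all_docs: list[dict],      # full corpus after the change (source of truth)
    vector_index,              # hypothetical client exposing .upsert(docs)
) -> BM25Index:
    # Write the dense side first; if it fails, the lexical index is untouched.
    vector_index.upsert(changed_docs)  # assumed API; adapt to your client
    # rank_bm25 can't update in place, so rebuild the lexical index wholesale.
    rebuilt = BM25Index()
    rebuilt.add_documents(all_docs)
    return rebuilt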

BM25 is not free: For 10M+ chunk corpora, building a BM25 index from scratch can take 20–60 minutes. Keep a serialized version (via pickle for rank_bm25, or a proper search server like Elasticsearch for larger corpora) rather than rebuilding on every startup.
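
A minimal persistence sketch using pickle, as suggested above. This assumes the index pickles cleanly (the default tokenizer does) and is only meant for single-process setups:

import pickle
from pathlib import Path

def save_bm25(index: BM25Index, path: str) -> None:
    Path(path).write_bytes(pickle.dumps(index))

def load_bm25(path: str) -> BM25Index:
    return pickle.loads(Path(path).read_bytes())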

Latency budget: Running both searches in parallel adds minimal latency if implemented correctly. BM25 search is CPU-bound and fast (single-digit milliseconds for a ~1M-document index on a Lucene-backed engine; pure-Python rank_bm25 will be slower at that scale); vector search is usually the latency bottleneck. Use asyncio or threading to run them in parallel, not sequentially.

import asyncio
from concurrent.futures import ThreadPoolExecutor
 
executor = ThreadPoolExecutor(max_workers=4)
 
async def hybrid_search_async(
    query: str,
    bm25_index: BM25Index,
    vector_index,  # your vector DB client
    top_k_each: int = 50,
) -> list[dict]:
    loop = asyncio.get_running_loop()
    bm25_future = loop.run_in_executor(executor, bm25_index.search, query, top_k_each)
    vector_future = loop.run_in_executor(executor, vector_index.search, query, top_k_each)
    bm25_results, vector_results = await asyncio.gather(bm25_future, vector_future)
    return reciprocal_rank_fusion(bm25_results, vector_results)
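
From synchronous code, the entry point is asyncio.run (bm25_index and vector_client here stand in for your own instances):

results = asyncio.run(
    hybrid_search_async("Error E2048 on checkout", bm25_index, vector_client)
)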

Key Takeaways

  • Pure vector search underperforms on exact-match queries containing error codes, product names, API identifiers, and technical jargon — these are often the highest-stakes queries in your system.
  • BM25 + vector hybrid search consistently outperforms either approach alone; the gains are largest on mixed corpora with both semantic and keyword-heavy queries.
  • Start with Reciprocal Rank Fusion — it's parameter-free, robust, and hard to beat without labeled training data.
  • If you use linear score fusion, sweep the alpha parameter against your golden set rather than guessing; optimal values vary substantially by domain.
  • Run both searches in parallel; sequential execution doubles your retrieval latency unnecessarily.
  • Keeping two indexes in sync is the main operational burden — build your update pipeline to write to both atomically before shipping to production.