Re-ranking Architectures
Part of the Building Production RAG series

Retrieval gives you a list of candidate chunks ranked by approximate relevance. Re-ranking re-scores that list with a more expensive, more accurate model that reasons about the query and each candidate jointly — something embedding-based retrieval fundamentally cannot do. The difference in precision is significant enough that for most production systems, re-ranking is not optional.
The question is not whether to re-rank but where the latency ceiling is and which re-ranker architecture fits within it.
The Two-Stage Retrieval Architecture
First-stage retrieval (BM25 + vector, as covered in the previous post) is optimized for recall — cast wide, retrieve fast. Re-ranking is optimized for precision — take the top-50 or top-100 candidates and find the 5–10 that the LLM should actually see.
The re-ranker takes the full query and each candidate chunk as a pair and produces a relevance score. It sees both simultaneously — it can reason about whether the document actually answers this specific question, not just whether the embeddings are nearby in vector space.
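The shape of that two-stage flow can be sketched with stubs. Everything here is a hypothetical placeholder (the function names, the toy relevance rule), not a real library API — the point is only the division of labor between a wide, cheap first stage and a narrow, expensive second stage:

```python
from typing import Callable


def two_stage_search(
    query: str,
    first_stage: Callable[[str, int], list[dict]],  # fast, recall-oriented retriever
    pair_scorer: Callable[[str, str], float],       # slow, precision-oriented re-ranker
    pool_size: int = 50,
    top_k: int = 5,
) -> list[dict]:
    # Stage 1: cast a wide net with cheap approximate retrieval.
    candidates = first_stage(query, pool_size)
    # Stage 2: score each (query, chunk) pair jointly, keep the best few.
    scored = [(c, pair_scorer(query, c["text"])) for c in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [{**c, "rerank_score": s} for c, s in scored[:top_k]]


# Toy run with stubs standing in for BM25+vector and a cross-encoder.
docs = [{"id": i, "text": f"doc {i}"} for i in range(100)]
top = two_stage_search(
    "q",
    first_stage=lambda q, k: docs[:k],
    pair_scorer=lambda q, text: float(text.endswith("7")),  # pretend the 7s are relevant
    pool_size=50,
    top_k=3,
)
# → ids 7, 17, 27
```

The real versions of both stages follow below; the interface (a pool in, a short scored list out) stays the same.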
Cross-Encoder Rerankers
Cross-encoders are BERT-style models fine-tuned on query-document relevance pairs. They're the standard choice when you need accuracy and have a latency budget of under 200ms.
```python
from sentence_transformers import CrossEncoder


class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        # Lighter models: ms-marco-MiniLM-L-6-v2 (~80ms for 50 pairs)
        # Heavier models: ms-marco-electra-base (~200ms), bge-reranker-large (~300ms)
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(
        self,
        query: str,
        candidates: list[dict],  # [{id, text, ...}, ...]
        top_k: int = 5,
    ) -> list[dict]:
        if not candidates:
            return []
        # Score all (query, chunk) pairs in one batched forward pass.
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.model.predict(pairs, batch_size=32, show_progress_bar=False)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True,
        )
        results = []
        for chunk, score in ranked[:top_k]:
            results.append({**chunk, "rerank_score": float(score)})
        return results
```

Model selection: ms-marco-MiniLM-L-6-v2 is the workhorse — small, fast, trained on 500K+ relevance-judged pairs from MS MARCO. bge-reranker-large (BAAI) and jina-reranker-v2-base-multilingual (Jina AI) are better quality but 3–4x slower. cross-encoder/ms-marco-electra-base is a good middle ground.
Batch size matters: if you're scoring 50 candidates, send them in a single batch rather than one pair at a time. The difference between batch_size=1 and batch_size=50 can be 10x in throughput.
API-Based Re-ranking
Cohere, Jina, and VoyageAI offer hosted re-ranking APIs that skip the GPU cold-start problem and are worth considering if you're not running your own inference.
```python
import cohere
from typing import Optional


class CohereReranker:
    def __init__(self, api_key: str, model: str = "rerank-english-v3.0"):
        self.client = cohere.Client(api_key)
        self.model = model

    def rerank(
        self,
        query: str,
        candidates: list[dict],
        top_k: int = 5,
        max_chunks_per_doc: Optional[int] = None,
    ) -> list[dict]:
        texts = [c["text"] for c in candidates]
        response = self.client.rerank(
            query=query,
            documents=texts,
            model=self.model,
            top_n=top_k,
            max_chunks_per_doc=max_chunks_per_doc,
        )
        # The API returns indices into the submitted documents list.
        results = []
        for hit in response.results:
            original = candidates[hit.index]
            results.append({
                **original,
                "rerank_score": hit.relevance_score,
            })
        return results
```

Cost: Cohere charges ~$1 per 1,000 re-rank calls at 100 documents each. For a system with 10K daily queries, that's ~$10/day or ~$300/month. If you're running fewer than 50K queries/day, the API beats owning GPU infrastructure.
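The cost arithmetic above is worth keeping as a one-liner you can re-run against your own traffic numbers (the $1-per-1,000-calls figure is the estimate from this post, not a live rate card):

```python
def monthly_rerank_cost(queries_per_day: int, price_per_1k: float = 1.0, days: int = 30) -> float:
    # Each query = one re-rank call; ~$1 per 1,000 calls per the estimate above.
    return queries_per_day / 1_000 * price_per_1k * days


monthly_rerank_cost(10_000)  # → 300.0 (i.e. ~$10/day)
monthly_rerank_cost(50_000)  # → 1500.0 — around where self-hosting starts to win
```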
LLM Rerankers
Instead of a specialized cross-encoder, you can use the LLM itself to score relevance. The approach: prompt the LLM to produce a relevance judgment for each candidate, or use a listwise approach where it orders all candidates in a single prompt.
Pointwise (one prompt per candidate — most parallelizable):
```python
import asyncio
import openai

RELEVANCE_PROMPT = """\
Query: {query}

Document:
{document}

Is this document relevant to answering the query? Rate relevance from \
0.0 (completely irrelevant) to 1.0 (directly answers the query). \
Respond with only a float, nothing else."""


async def llm_score_candidate(
    client: openai.AsyncOpenAI,
    query: str,
    candidate: dict,
    model: str = "gpt-4o-mini",
) -> tuple[dict, float]:
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RELEVANCE_PROMPT.format(
            query=query, document=candidate["text"][:1000]
        )}],
        temperature=0.0,
        max_tokens=10,
    )
    try:
        score = float(response.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable output counts as irrelevant
    return candidate, max(0.0, min(1.0, score))


async def llm_rerank(
    query: str,
    candidates: list[dict],
    top_k: int = 5,
    model: str = "gpt-4o-mini",
) -> list[dict]:
    client = openai.AsyncOpenAI()
    # Score all candidates concurrently; latency ≈ one LLM round trip.
    tasks = [llm_score_candidate(client, query, c, model) for c in candidates]
    scored = await asyncio.gather(*tasks)
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    return [{**chunk, "rerank_score": score} for chunk, score in ranked[:top_k]]
```

Listwise re-ranking sends a single prompt in which the LLM orders all candidates. It is faster and cheaper than N pointwise calls, but less reliable as the candidate list grows.
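A minimal listwise sketch, assuming the prompt wording and parsing rules shown here (both are illustrative choices, not a standard). The fragile part is parsing the model's output, so the parser drops invalid or duplicate indices and appends anything the model omitted:

```python
LISTWISE_PROMPT = """\
Query: {query}

Candidates:
{numbered}

Rank the candidates from most to least relevant to the query. \
Respond with only the candidate numbers, comma-separated, e.g. "2,0,1"."""


def build_listwise_prompt(query: str, candidates: list[dict]) -> str:
    numbered = "\n".join(f"[{i}] {c['text'][:300]}" for i, c in enumerate(candidates))
    return LISTWISE_PROMPT.format(query=query, numbered=numbered)


def parse_listwise_ranking(raw: str, n: int) -> list[int]:
    # Keep only valid, first-seen indices; append anything the model dropped,
    # so the result is always a full permutation of range(n).
    order: list[int] = []
    for token in raw.replace(" ", "").split(","):
        if token.isdigit() and int(token) < n and int(token) not in order:
            order.append(int(token))
    order.extend(i for i in range(n) if i not in order)
    return order


parse_listwise_ranking("2, 0, 2, 9", n=4)  # → [2, 0, 1, 3]
```

With the parsed order in hand, re-ranking is just `[candidates[i] for i in order[:top_k]]`.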
LLM rerankers are 3–5x slower and 10–50x more expensive per query than cross-encoders. Use them when:
- You're already calling the LLM and can fold in reranking
- Your re-ranking list is small (≤10 candidates)
- You need nuanced relevance that requires language understanding (e.g., reasoning about whether a document contradicts the query)
Where Re-ranking Sits in the Pipeline
The sequencing is important. Re-ranking should always happen after retrieval and before LLM generation. Never re-rank the entire corpus — that defeats the purpose of first-stage retrieval.
The k parameter for first-stage retrieval should be roughly 5–10x your final top-k:
| Final top-k for LLM | First-stage candidate pool |
|---|---|
| 3 | 20–30 |
| 5 | 30–50 |
| 10 | 50–100 |
Larger candidate pools give the re-ranker more material to promote highly relevant documents that ranked low in first-stage retrieval. The optimal ratio depends on your first-stage recall — if recall@20 is already 0.90, a pool of 20 is fine; if it's 0.75, go wider.
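That sizing rule can be captured in a small helper. The 0.90 recall threshold and the 5x/10x multipliers below are illustrative assumptions drawn from the table above, not universal constants:

```python
def first_stage_pool_size(final_top_k: int, first_stage_recall: float = 0.85) -> int:
    """Size the candidate pool at roughly 5-10x the final top-k.

    Lean toward the high end when first-stage recall is weak, since the
    re-ranker can only promote documents that made it into the pool.
    """
    multiplier = 5 if first_stage_recall >= 0.90 else 10
    return final_top_k * multiplier


first_stage_pool_size(5, first_stage_recall=0.92)  # → 25
first_stage_pool_size(5, first_stage_recall=0.75)  # → 50
```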
Latency Budget Reality Check
```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    retrieval_ms: float   # first-stage: typically 5-20ms
    reranking_ms: float   # cross-encoder 50-200ms, LLM 300-800ms
    generation_ms: float  # LLM generation: 500-3000ms depending on output length
    overhead_ms: float = 50  # network, serialization, etc.

    @property
    def total_ms(self) -> float:
        return self.retrieval_ms + self.reranking_ms + self.generation_ms + self.overhead_ms

    def fits_budget(self, budget_ms: float) -> bool:
        return self.total_ms <= budget_ms


# Example: is a cross-encoder viable at p99 < 2 seconds?
budget = LatencyBudget(
    retrieval_ms=20,
    reranking_ms=150,    # MiniLM cross-encoder, 50 candidates
    generation_ms=1200,  # gpt-4o streaming
)
print(f"Total: {budget.total_ms}ms, fits 2s budget: {budget.fits_budget(2000)}")
# Total: 1420ms — yes, comfortably
```

If your p99 generation time is already at 1.8 seconds, a 300ms cross-encoder is a problem. In that case: cache re-ranking results for repeated queries (covered in the next post), use a smaller cross-encoder model, or reduce the candidate pool size.
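The next post covers caching properly; as a sketch of the shape, a cache can key on the query plus the candidate ids, since a different candidate pool invalidates the cached ranking (the class and its interface here are an assumption, not a library API):

```python
import hashlib
import json


class RerankCache:
    """In-memory memoization of re-rank results (sketch of the idea only)."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    def _key(self, query: str, candidates: list[dict]) -> str:
        # Key on query + candidate ids: a different pool means the
        # cached ranking no longer applies. Assumes chunks carry stable ids.
        raw = json.dumps([query, sorted(c["id"] for c in candidates)])
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_compute(self, query, candidates, rerank_fn):
        key = self._key(query, candidates)
        if key not in self._store:
            self._store[key] = rerank_fn(query, candidates)
        return self._store[key]


# Toy run: the expensive re-ranker runs once for a repeated query.
calls = []

def fake_rerank(query, candidates):
    calls.append(query)
    return candidates[:2]

cache = RerankCache()
docs = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
cache.get_or_compute("q", docs, fake_rerank)
cache.get_or_compute("q", docs, fake_rerank)  # cache hit; len(calls) stays 1
```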
Key Takeaways
- Re-ranking is a precision layer on top of recall-optimized first-stage retrieval — treat them as separate responsibilities with different success metrics.
- Cross-encoder models (ms-marco-MiniLM-L-6-v2) are the default choice: fast, accurate, deployable without GPU cold-start issues when served properly.
- API-based re-rankers (Cohere, Jina) are cost-effective under 50K daily queries and eliminate infrastructure overhead for small teams.
- LLM rerankers are 10–50x more expensive per query — reserve them for small candidate sets or cases where nuanced relevance reasoning is worth the cost.
- Size your first-stage candidate pool at roughly 5–10x your final top-k to give the re-ranker enough material to work with.
- Always model your full latency budget — retrieval plus re-ranking plus generation — before committing to a re-ranker architecture.