Re-ranking Architectures
Part of the Building Production RAG series

Retrieval gives you a list of candidate chunks ranked by approximate relevance. Re-ranking re-scores that list with a more expensive, more accurate model that reasons about the query and each candidate jointly — something embedding-based retrieval fundamentally cannot do. The difference in precision is significant enough that for most production systems, re-ranking is not optional.
The question is not whether to re-rank but where the latency ceiling is and which re-ranker architecture fits within it.
The Two-Stage Retrieval Architecture
First-stage retrieval (BM25 + vector, as covered in the previous post) is optimized for recall — cast wide, retrieve fast. Re-ranking is optimized for precision — take the top-50 or top-100 candidates and find the 5–10 that the LLM should actually see.
The re-ranker takes the full query and each candidate chunk as a pair and produces a relevance score. It sees both simultaneously — it can reason about whether the document actually answers this specific question, not just whether the embeddings are nearby in vector space.
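The shape of that two-stage flow can be sketched with stubs. Everything here is a hypothetical placeholder (the function names, the toy relevance rule), not a real library API — the point is only the division of labor between a wide, cheap first stage and a narrow, expensive second stage:

```python
from typing import Callable


def two_stage_search(
    query: str,
    first_stage: Callable[[str, int], list[dict]],  # fast, recall-oriented retriever
    pair_scorer: Callable[[str, str], float],       # slow, precision-oriented re-ranker
    pool_size: int = 50,
    top_k: int = 5,
) -> list[dict]:
    # Stage 1: cast a wide net with cheap approximate retrieval.
    candidates = first_stage(query, pool_size)
    # Stage 2: score each (query, chunk) pair jointly, keep the best few.
    scored = [(c, pair_scorer(query, c["text"])) for c in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [{**c, "rerank_score": s} for c, s in scored[:top_k]]


# Toy run with stubs standing in for BM25+vector and a cross-encoder.
docs = [{"id": i, "text": f"doc {i}"} for i in range(100)]
top = two_stage_search(
    "q",
    first_stage=lambda q, k: docs[:k],
    pair_scorer=lambda q, text: float(text.endswith("7")),  # pretend the 7s are relevant
    pool_size=50,
    top_k=3,
)
# → ids 7, 17, 27
```

The real versions of both stages follow below; the interface (a pool in, a short scored list out) stays the same.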
Cross-Encoder Rerankers
Cross-encoders are BERT-style models fine-tuned on query-document relevance pairs. They're the standard choice when you need accuracy and have a latency budget of under 200ms.
```python
from sentence_transformers import CrossEncoder


class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        # Lighter models: ms-marco-MiniLM-L-6-v2 (~80ms for 50 pairs)
        # Heavier models: ms-marco-electra-base (~200ms), bge-reranker-large (~300ms)
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(
        self,
        query: str,
        candidates: list[dict],  # [{id, text, ...}, ...]
        top_k: int = 5,
    ) -> list[dict]:
        if not candidates:
            return []
        # Score all (query, chunk) pairs in one batched forward pass.
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.model.predict(pairs, batch_size=32, show_progress_bar=False)
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True,
        )
        results = []
        for chunk, score in ranked[:top_k]:
            results.append({**chunk, "rerank_score": float(score)})
        return results
```

Model selection: ms-marco-MiniLM-L-6-v2 is the workhorse — small, fast, trained on 500K+ relevance-judged pairs from MS MARCO. bge-reranker-large (BAAI) and jina-reranker-v2-base-multilingual (Jina AI) are better quality but 3–4x slower. cross-encoder/ms-marco-electra-base is a good middle ground.
Batch size matters: if you're scoring 50 candidates, send them in a single batch rather than one pair at a time. The difference between batch_size=1 and batch_size=50 can be 10x in throughput.
API-Based Re-ranking
Cohere, Jina, and VoyageAI offer hosted re-ranking APIs that skip the GPU cold-start problem and are worth considering if you're not running your own inference.
```python
import cohere
from typing import Optional


class CohereReranker:
    def __init__(self, api_key: str, model: str = "rerank-english-v3.0"):
        self.client = cohere.Client(api_key)
        self.model = model

    def rerank(
        self,
        query: str,
        candidates: list[dict],
        top_k: int = 5,
        max_chunks_per_doc: Optional[int] = None,
    ) -> list[dict]:
        texts = [c["text"] for c in candidates]
        response = self.client.rerank(
            query=query,
            documents=texts,
            model=self.model,
            top_n=top_k,
            max_chunks_per_doc=max_chunks_per_doc,
        )
        # The API returns indices into the submitted documents list.
        results = []
        for hit in response.results:
            original = candidates[hit.index]
            results.append({
                **original,
                "rerank_score": hit.relevance_score,
            })
        return results
```

Cost: Cohere charges ~$1 per 1,000 re-rank calls at 100 documents each. For a system with 10K daily queries, that's ~$10/day or ~$300/month. If you're running fewer than 50K queries/day, the API beats owning GPU infrastructure.
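The cost arithmetic above is worth keeping as a one-liner you can re-run against your own traffic numbers (the $1-per-1,000-calls figure is the estimate from this post, not a live rate card):

```python
def monthly_rerank_cost(queries_per_day: int, price_per_1k: float = 1.0, days: int = 30) -> float:
    # Each query = one re-rank call; ~$1 per 1,000 calls per the estimate above.
    return queries_per_day / 1_000 * price_per_1k * days


monthly_rerank_cost(10_000)  # → 300.0 (i.e. ~$10/day)
monthly_rerank_cost(50_000)  # → 1500.0 — around where self-hosting starts to win
```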
LLM Rerankers
Instead of a specialized cross-encoder, you can use the LLM itself to score relevance. The approach: prompt the LLM to produce a relevance judgment for each candidate, or use a listwise approach where it orders all candidates in a single prompt.
Pointwise (one prompt per candidate — most parallelizable):
```python
import asyncio
import openai

RELEVANCE_PROMPT = """\
Query: {query}

Document:
{document}

Is this document relevant to answering the query? Rate relevance from \
0.0 (completely irrelevant) to 1.0 (directly answers the query). \
Respond with only a float, nothing else."""


async def llm_score_candidate(
    client: openai.AsyncOpenAI,
    query: str,
    candidate: dict,
    model: str = "gpt-4o-mini",
) -> tuple[dict, float]:
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RELEVANCE_PROMPT.format(
            query=query, document=candidate["text"][:1000]
        )}],
        temperature=0.0,
        max_tokens=10,
    )
    try:
        score = float(response.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable output counts as irrelevant
    return candidate, max(0.0, min(1.0, score))


async def llm_rerank(
    query: str,
    candidates: list[dict],
    top_k: int = 5,
    model: str = "gpt-4o-mini",
) -> list[dict]:
    client = openai.AsyncOpenAI()
    # Score all candidates concurrently; latency ≈ one LLM round trip.
    tasks = [llm_score_candidate(client, query, c, model) for c in candidates]
    scored = await asyncio.gather(*tasks)
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    return [{**chunk, "rerank_score": score} for chunk, score in ranked[:top_k]]
```

Listwise re-ranking sends a single prompt in which the LLM orders all candidates. It is faster and cheaper than N pointwise calls, but less reliable as the candidate list grows.
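A minimal listwise sketch, assuming the prompt wording and parsing rules shown here (both are illustrative choices, not a standard). The fragile part is parsing the model's output, so the parser drops invalid or duplicate indices and appends anything the model omitted:

```python
LISTWISE_PROMPT = """\
Query: {query}

Candidates:
{numbered}

Rank the candidates from most to least relevant to the query. \
Respond with only the candidate numbers, comma-separated, e.g. "2,0,1"."""


def build_listwise_prompt(query: str, candidates: list[dict]) -> str:
    numbered = "\n".join(f"[{i}] {c['text'][:300]}" for i, c in enumerate(candidates))
    return LISTWISE_PROMPT.format(query=query, numbered=numbered)


def parse_listwise_ranking(raw: str, n: int) -> list[int]:
    # Keep only valid, first-seen indices; append anything the model dropped,
    # so the result is always a full permutation of range(n).
    order: list[int] = []
    for token in raw.replace(" ", "").split(","):
        if token.isdigit() and int(token) < n and int(token) not in order:
            order.append(int(token))
    order.extend(i for i in range(n) if i not in order)
    return order


parse_listwise_ranking("2, 0, 2, 9", n=4)  # → [2, 0, 1, 3]
```

With the parsed order in hand, re-ranking is just `[candidates[i] for i in order[:top_k]]`.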
LLM rerankers are 3–5x slower and 10–50x more expensive per query than cross-encoders. Use them when:
- You're already calling the LLM and can fold in reranking
- Your re-ranking list is small (≤10 candidates)
- You need nuanced relevance that requires language understanding (e.g., reasoning about whether a document contradicts the query)
Where Re-ranking Sits in the Pipeline
The sequencing is important. Re-ranking should always happen after retrieval and before LLM generation. Never re-rank the entire corpus — that defeats the purpose of first-stage retrieval.
The k parameter for first-stage retrieval should be roughly 5–10x your final top-k:
| Final top-k for LLM | First-stage candidate pool |
|---|---|
| 3 | 20–30 |
| 5 | 30–50 |
| 10 | 50–100 |
Larger candidate pools give the re-ranker more material to promote highly relevant documents that ranked low in first-stage retrieval. The optimal ratio depends on your first-stage recall — if recall@20 is already 0.90, a pool of 20 is fine; if it's 0.75, go wider.
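That sizing rule can be captured in a small helper. The 0.90 recall threshold and the 5x/10x multipliers below are illustrative assumptions drawn from the table above, not universal constants:

```python
def first_stage_pool_size(final_top_k: int, first_stage_recall: float = 0.85) -> int:
    """Size the candidate pool at roughly 5-10x the final top-k.

    Lean toward the high end when first-stage recall is weak, since the
    re-ranker can only promote documents that made it into the pool.
    """
    multiplier = 5 if first_stage_recall >= 0.90 else 10
    return final_top_k * multiplier


first_stage_pool_size(5, first_stage_recall=0.92)  # → 25
first_stage_pool_size(5, first_stage_recall=0.75)  # → 50
```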
Latency Budget Reality Check
```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    retrieval_ms: float   # first-stage: typically 5-20ms
    reranking_ms: float   # cross-encoder 50-200ms, LLM 300-800ms
    generation_ms: float  # LLM generation: 500-3000ms depending on output length
    overhead_ms: float = 50  # network, serialization, etc.

    @property
    def total_ms(self) -> float:
        return self.retrieval_ms + self.reranking_ms + self.generation_ms + self.overhead_ms

    def fits_budget(self, budget_ms: float) -> bool:
        return self.total_ms <= budget_ms


# Example: is a cross-encoder viable at p99 < 2 seconds?
budget = LatencyBudget(
    retrieval_ms=20,
    reranking_ms=150,    # MiniLM cross-encoder, 50 candidates
    generation_ms=1200,  # gpt-4o streaming
)
print(f"Total: {budget.total_ms}ms, fits 2s budget: {budget.fits_budget(2000)}")
# Total: 1420ms — yes, comfortably
```

If your p99 generation time is already at 1.8 seconds, a 300ms cross-encoder is a problem. In that case: cache re-ranking results for repeated queries (covered in the next post), use a smaller cross-encoder model, or reduce the candidate pool size.
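The next post covers caching properly; as a sketch of the shape, a cache can key on the query plus the candidate ids, since a different candidate pool invalidates the cached ranking (the class and its interface here are an assumption, not a library API):

```python
import hashlib
import json


class RerankCache:
    """In-memory memoization of re-rank results (sketch of the idea only)."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    def _key(self, query: str, candidates: list[dict]) -> str:
        # Key on query + candidate ids: a different pool means the
        # cached ranking no longer applies. Assumes chunks carry stable ids.
        raw = json.dumps([query, sorted(c["id"] for c in candidates)])
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_compute(self, query, candidates, rerank_fn):
        key = self._key(query, candidates)
        if key not in self._store:
            self._store[key] = rerank_fn(query, candidates)
        return self._store[key]


# Toy run: the expensive re-ranker runs once for a repeated query.
calls = []

def fake_rerank(query, candidates):
    calls.append(query)
    return candidates[:2]

cache = RerankCache()
docs = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
cache.get_or_compute("q", docs, fake_rerank)
cache.get_or_compute("q", docs, fake_rerank)  # cache hit; len(calls) stays 1
```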
Key Takeaways
- Re-ranking is a precision layer on top of recall-optimized first-stage retrieval — treat them as separate responsibilities with different success metrics.
- Cross-encoder models (ms-marco-MiniLM-L-6-v2) are the default choice: fast, accurate, deployable without GPU cold-start issues when served properly.
- API-based re-rankers (Cohere, Jina) are cost-effective under 50K daily queries and eliminate infrastructure overhead for small teams.
- LLM rerankers are 10–50x more expensive per query — reserve them for small candidate sets or cases where nuanced relevance reasoning is worth the cost.
- Size your first-stage candidate pool at roughly 5–10x your final top-k to give the re-ranker enough material to work with.
- Always model your full latency budget — retrieval plus re-ranking plus generation — before committing to a re-ranker architecture.