Re-rankers in Production
Re-rankers add 150–800ms to your RAG pipeline. Most teams add them and hope the quality improvement justifies the latency. That's not an engineering decision — it's a guess.
The question isn't whether to use a re-ranker. It's which type, where in the pipeline, with what caching strategy, and against which latency budget. Get those four things right and re-ranking is a net win. Get them wrong and you've shipped a slow system that's marginally better on paper.
The Two Re-ranker Architectures
Cross-Encoders
A cross-encoder takes the query and a candidate document as a single input and produces a relevance score. It's not a retriever — it sees both texts simultaneously and models their interaction directly.
query: "How do I reset my password?"
doc: "To reset your account password, navigate to Settings > Security..."
cross-encoder input: [CLS] query [SEP] doc [SEP]
output: relevance_score = 0.94

This is fundamentally different from bi-encoder (embedding) retrieval, where query and document are encoded independently. The cross-encoder sees the full attention interaction between query and document tokens — which is why it's more accurate and more expensive.
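For contrast, here is a minimal sketch of the bi-encoder path: query and document are embedded independently and compared with cosine similarity, which is why document vectors can be precomputed and indexed. It reuses the BAAI/bge-small-en-v1.5 model that appears in the caching section below; the model choice is illustrative, not a recommendation.

from sentence_transformers import SentenceTransformer
import numpy as np

# Bi-encoder: query and document are encoded separately, with no
# query-document attention. Document embeddings can be computed offline.
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

query_emb = bi_encoder.encode("How do I reset my password?")
doc_emb = bi_encoder.encode("To reset your account password, navigate to Settings > Security...")

# Relevance is reduced to vector similarity between the two embeddings.
score = float(np.dot(query_emb, doc_emb) /
              (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)))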
Key characteristics:
- Cannot be pre-indexed (query-document interaction required at inference time)
- O(N) scoring — must score every candidate individually
- Latency: 5–30ms per candidate on GPU, 50–300ms on CPU
from sentence_transformers import CrossEncoder
# Load once at startup, not per request
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank_cross_encoder(
query: str,
candidates: list[dict],
top_n: int = 5,
) -> list[dict]:
pairs = [(query, c["text"]) for c in candidates]
scores = reranker.predict(pairs)
# Sort by score descending
scored = sorted(
zip(candidates, scores),
key=lambda x: x[1],
reverse=True,
)
    return [doc for doc, _ in scored[:top_n]]

LLM-as-Reranker
Use the LLM itself to score relevance. Two variants:
Pointwise: Score each document independently with a prompt.
from openai import OpenAI
import json
client = OpenAI()
POINTWISE_PROMPT = """Rate the relevance of the following document to the query.
Query: {query}
Document: {document}
Return JSON: {{"relevance": <0.0 to 1.0>, "reason": "<one sentence>"}}
Only return valid JSON."""
def score_document_llm(query: str, document: str) -> float:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": POINTWISE_PROMPT.format(
query=query, document=document
)}],
response_format={"type": "json_object"},
temperature=0,
)
result = json.loads(response.choices[0].message.content)
    return result["relevance"]

Listwise: Show the LLM all candidates and ask it to rank them. Higher quality, higher latency, limited by context window.
LISTWISE_PROMPT = """Rank these documents by relevance to the query. Most relevant first.
Query: {query}
Documents:
{documents}
Return JSON: {{"ranking": [<document indices in order of relevance>]}}, e.g. {{"ranking": [2, 0, 4, 1, 3]}}.
Only return valid JSON."""
def listwise_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[int]:
doc_text = "\n\n".join(
f"[{i}] {doc[:500]}" for i, doc in enumerate(candidates)
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": LISTWISE_PROMPT.format(
query=query, documents=doc_text
)}],
response_format={"type": "json_object"},
temperature=0,
)
    ranked_indices = json.loads(response.choices[0].message.content)["ranking"]
    return ranked_indices[:top_n]

Latency Budget: The Honest Numbers
[Sequence diagram: user → API → retriever → re-ranker (top 5) → LLM → response, annotated with rough latencies: LLM pointwise re-ranking ~2000ms, Cohere Rerank API ~300ms, LLM answer ~800ms to first token.]
Realistic latency numbers at p50, assuming 20 candidates:
| Re-ranker type | p50 | p95 | Notes |
|---|---|---|---|
| Cross-encoder (GPU) | 80ms | 150ms | Self-hosted, varies with model size |
| Cross-encoder (CPU) | 400ms | 900ms | Free but slow |
| Cohere Rerank API | 250ms | 500ms | Managed, includes network |
| GPT-4o-mini pointwise | 800ms | 2000ms | 20 sequential API calls |
| GPT-4o listwise | 600ms | 1200ms | 1 API call, larger prompt |
LLM-as-reranker is almost never worth the latency for production serving. It makes sense for offline re-ranking (batch processing, indexing pipelines) or low-volume evaluation workflows.
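For those offline paths, the 20-sequential-calls penalty mostly disappears if the pointwise calls run concurrently. A minimal sketch, reusing POINTWISE_PROMPT from above and assuming the async OpenAI client; rate limits, not latency, become the binding constraint.

import asyncio
import json
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def score_batch_offline(query: str, documents: list[str]) -> list[float]:
    # Fire the pointwise calls concurrently: 20 candidates cost roughly one
    # round-trip instead of twenty.
    async def score_one(doc: str) -> float:
        response = await async_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": POINTWISE_PROMPT.format(
                query=query, document=doc
            )}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        return json.loads(response.choices[0].message.content)["relevance"]

    return await asyncio.gather(*(score_one(doc) for doc in documents))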
Cross-encoder on GPU is the production default. If you don't have a GPU budget, use Cohere Rerank API and plan for ~300ms.
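A sketch of the managed route using the Cohere Python SDK; the model name is an assumption, so verify it against the current rerank model list before shipping.

import cohere

co = cohere.Client("YOUR_API_KEY")  # load from env/secrets in practice

def rerank_cohere(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # Managed alternative to self-hosting a cross-encoder; budget ~300ms.
    response = co.rerank(
        model="rerank-english-v3.0",  # assumption: verify current model names
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
    )
    # Each result carries the index into the original candidate list.
    return [candidates[r.index] for r in response.results]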
Where in the Pipeline to Place the Re-ranker
This is where most implementations go wrong.
Re-ranker after fusion, before LLM. Not before fusion — the fusion output is what the re-ranker should order. Not after the LLM — at that point it's too late.
Common mistakes:
Mistake 1: Re-ranking before fusion. You re-rank BM25 results, then merge with vector results. The merged set has no coherent ordering.
Mistake 2: Re-ranking too few candidates. If the correct chunk is at rank 8 in your vector search and you only pass top-5 to the re-ranker, you've already lost.
Mistake 3: Not re-ranking at all, or re-ranking after the LLM call, which amounts to the same thing: the LLM reads chunks in vector-similarity order, which is not relevance order.
Rule of thumb: Retrieve 20–50 candidates, re-rank to 3–7 for the LLM.
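A minimal sketch of that placement, with hypothetical vector_search, bm25_search, and rrf_fuse helpers standing in for whatever hybrid retrieval you already run; the re-ranker sees only the fused candidate pool.

def retrieve_and_rerank(query: str, top_n: int = 5) -> list[dict]:
    # 1. Retrieve generously from both retrievers (hypothetical helpers).
    vector_hits = vector_search(query, top_k=30)
    bm25_hits = bm25_search(query, top_k=30)

    # 2. Fuse into a single candidate pool (e.g. reciprocal rank fusion).
    candidates = rrf_fuse(vector_hits, bm25_hits)[:40]

    # 3. Re-rank the fused pool and keep a small set for the LLM prompt.
    return rerank_cross_encoder(query, candidates, top_n=top_n)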
Caching Re-ranker Results
Re-ranking is expensive to skip and expensive to run. Caching is the only viable middle ground.
Exact Cache
Cache by (query, candidate_ids) hash. Hit rate on exact matches is low in practice — queries are rarely repeated verbatim.
import hashlib
import json
def make_cache_key(query: str, candidate_ids: list[str]) -> str:
payload = json.dumps({"q": query, "ids": sorted(candidate_ids)})
return hashlib.sha256(payload.encode()).hexdigest()
# In Redis
def get_reranked_from_cache(redis_client, cache_key: str) -> list[str] | None:
cached = redis_client.get(f"rerank:{cache_key}")
if cached:
return json.loads(cached)
return None
def set_reranked_in_cache(redis_client, cache_key: str, result: list[str], ttl: int = 3600):
    redis_client.setex(f"rerank:{cache_key}", ttl, json.dumps(result))

Semantic Cache
Cache by query embedding similarity. If an incoming query is semantically close to a cached query (cosine similarity > 0.95), return the cached re-ranking.
from sentence_transformers import SentenceTransformer
import numpy as np
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
class SemanticRerankerCache:
def __init__(self, similarity_threshold: float = 0.95):
self.threshold = similarity_threshold
self.cache: list[tuple[np.ndarray, list[str]]] = [] # (embedding, result)
def lookup(self, query: str) -> list[str] | None:
query_emb = embedder.encode([query])[0]
for cached_emb, result in self.cache:
sim = float(np.dot(query_emb, cached_emb) /
(np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)))
if sim >= self.threshold:
return result
return None
def store(self, query: str, result: list[str]):
query_emb = embedder.encode([query])[0]
self.cache.append((query_emb, result))
        # In production: use a vector store for the cache, not a list

Semantic cache hit rates of 20–40% are typical for conversational applications. The embedding lookup adds ~5ms, which pays for itself immediately when it avoids a 300ms re-rank.
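Wiring the cache around the re-rank call takes only a few lines. This sketch reuses SemanticRerankerCache and rerank_cross_encoder from above, and assumes each candidate dict carries an "id" key so only the ordering needs to be cached.

semantic_cache = SemanticRerankerCache(similarity_threshold=0.95)

def rerank_with_semantic_cache(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    by_id = {c["id"]: c for c in candidates}

    # Cache hit: reuse the ordering from a semantically similar past query,
    # keeping only IDs still present in the current candidate set.
    cached_ids = semantic_cache.lookup(query)
    if cached_ids is not None:
        hits = [by_id[cid] for cid in cached_ids if cid in by_id]
        if hits:
            return hits[:top_n]

    # Cache miss: pay for the cross-encoder, then store the new ordering.
    reranked = rerank_cross_encoder(query, candidates, top_n=top_n)
    semantic_cache.store(query, [doc["id"] for doc in reranked])
    return reranked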
Re-ranker Evaluation
A re-ranker that improves NDCG but increases p99 latency from 400ms to 1200ms may not be a net improvement in user experience.
Evaluate re-rankers on:
- NDCG@K improvement over baseline — Does re-ranking actually improve rank quality?
- Latency at p50, p95, p99 — Not just average.
- Failure rate — What happens when the re-ranker API is down? Do you have a fallback?
from sklearn.metrics import ndcg_score
import numpy as np
def evaluate_reranker(
test_cases: list[dict],
retriever,
reranker,
k: int = 5,
) -> dict:
"""
test_cases: [{"query": str, "relevance_scores": dict[doc_id, float]}]
"""
baseline_ndcgs = []
reranked_ndcgs = []
for case in test_cases:
candidates = retriever.retrieve(case["query"], top_k=20)
candidate_ids = [c.id for c in candidates]
true_scores = np.array([
case["relevance_scores"].get(cid, 0) for cid in candidate_ids
])
# Baseline: vector search rank order
baseline_pred = np.array([20 - i for i in range(len(candidates))])
# Re-ranked order
reranked = reranker.rerank(case["query"], candidates, top_n=20)
reranked_ids = [r.id for r in reranked]
reranked_pred = np.array([
20 - reranked_ids.index(cid) if cid in reranked_ids else 0
for cid in candidate_ids
])
baseline_ndcgs.append(ndcg_score([true_scores], [baseline_pred], k=k))
reranked_ndcgs.append(ndcg_score([true_scores], [reranked_pred], k=k))
return {
f"baseline_ndcg@{k}": np.mean(baseline_ndcgs),
f"reranked_ndcg@{k}": np.mean(reranked_ndcgs),
"improvement": np.mean(reranked_ndcgs) - np.mean(baseline_ndcgs),
    }

An improvement of less than 0.03 NDCG@5 probably isn't worth the latency and operational complexity of a re-ranker in production.
Fallback Strategy
Re-rankers fail. The API is down, the GPU is saturated, the latency spikes. You need a fallback.
import asyncio
import logging
logger = logging.getLogger(__name__)
async def rerank_with_fallback(
query: str,
candidates: list[dict],
reranker,
timeout_ms: int = 400,
top_n: int = 5,
) -> list[dict]:
try:
result = await asyncio.wait_for(
reranker.arerank(query, candidates, top_n=top_n),
timeout=timeout_ms / 1000,
)
return result
except asyncio.TimeoutError:
logger.warning("Re-ranker timeout; falling back to vector order")
return candidates[:top_n]
except Exception as e:
logger.error(f"Re-ranker error: {e}; falling back to vector order")
        return candidates[:top_n]

The fallback is vector-search order. Users get a slightly worse result, not an error. Log every fallback — a high fallback rate means your re-ranker deployment is underprovisioned.
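If you already export metrics, counting fallbacks next to total attempts makes that rate visible on a dashboard. A sketch using prometheus_client, assuming that is your metrics stack; the metric names are illustrative.

from prometheus_client import Counter

RERANK_REQUESTS = Counter("rerank_requests_total", "Re-rank attempts")
RERANK_FALLBACKS = Counter("rerank_fallbacks_total", "Fallbacks to vector order")

# In rerank_with_fallback: call RERANK_REQUESTS.inc() on entry and
# RERANK_FALLBACKS.inc() in both except branches, then alert when the
# fallback ratio stays above a few percent.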
Key Takeaways
- Cross-encoder on GPU is the production default; LLM-as-reranker is for offline or low-volume workloads.
- Re-rank over 20–50 candidates, not 3–5 — the re-ranker can only recover candidates that made it into the input set.
- Place the re-ranker after fusion, before the LLM prompt — nowhere else.
- Semantic caching cuts re-ranker latency cost by 20–40% in conversational applications.
- Measure NDCG improvement and p99 latency together — a re-ranker that slows you more than it improves you is not a win.
- Always implement a timeout + fallback to vector order; re-rankers fail and your API response should not.