Re-rankers in Production

Ravinder · 8 min read
AI · Search · Re-ranking · RAG · LLM

Re-rankers add 150–800ms to your RAG pipeline. Most teams add them and hope the quality improvement justifies the latency. That's not an engineering decision — it's a guess.

The question isn't whether to use a re-ranker. It's which type, where in the pipeline, with what caching strategy, and against which latency budget. Get those four things right and re-ranking is a net win. Get them wrong and you've shipped a slow system that's marginally better on paper.

The Two Re-ranker Architectures

Cross-Encoders

A cross-encoder takes the query and a candidate document as a single input and produces a relevance score. It's not a retriever — it sees both texts simultaneously and models their interaction directly.

query: "How do I reset my password?"
doc:   "To reset your account password, navigate to Settings > Security..."
 
cross-encoder input: [CLS] query [SEP] doc [SEP]
output: relevance_score = 0.94

This is fundamentally different from bi-encoder (embedding) retrieval, where query and document are encoded independently. The cross-encoder sees the full attention interaction between query and document tokens — which is why it's more accurate and more expensive.
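For contrast, here is the bi-encoder path next to the cross-encoder one — a minimal sketch using sentence-transformers, with the same models this post uses elsewhere. The two texts only meet at a dot product in the bi-encoder case; the cross-encoder sees them in a single forward pass.

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
 
bi_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
query = "How do I reset my password?"
doc = "To reset your account password, navigate to Settings > Security..."
 
# Bi-encoder: encode independently, compare with cosine similarity
q_emb, d_emb = bi_encoder.encode([query, doc], normalize_embeddings=True)
bi_score = float(np.dot(q_emb, d_emb))
 
# Cross-encoder: score the (query, doc) pair jointly
cross_score = float(cross_encoder.predict([(query, doc)])[0])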

Key characteristics:

  • Cannot be pre-indexed (query-document interaction required at inference time)
  • O(N) scoring — must score every candidate individually
  • Latency: 5–30ms per candidate on GPU, 50–300ms on CPU
from sentence_transformers import CrossEncoder
 
# Load once at startup, not per request
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def rerank_cross_encoder(
    query: str,
    candidates: list[dict],
    top_n: int = 5,
) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
 
    # Sort by score descending
    scored = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [doc for doc, _ in scored[:top_n]]
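Note that a single `predict` call runs all pairs through the model in batches; in sentence-transformers the `batch_size` argument to `predict` is the main knob for trading GPU memory against per-request latency.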

LLM-as-Reranker

Use the LLM itself to score relevance. Two variants:

Pointwise: Score each document independently with a prompt.

from openai import OpenAI
import json
 
client = OpenAI()
 
POINTWISE_PROMPT = """Rate the relevance of the following document to the query.
 
Query: {query}
Document: {document}
 
Return JSON: {{"relevance": <0.0 to 1.0>, "reason": "<one sentence>"}}
Only return valid JSON."""
 
def score_document_llm(query: str, document: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": POINTWISE_PROMPT.format(
            query=query, document=document
        )}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(response.choices[0].message.content)
    return result["relevance"]

Listwise: Show the LLM all candidates and ask it to rank them. Higher quality, higher latency, limited by context window.

LISTWISE_PROMPT = """Rank these documents by relevance to the query. Most relevant first.
 
Query: {query}
 
Documents:
{documents}
 
Return a JSON object with the document indices in order of relevance, e.g. {{"ranking": [2, 0, 4, 1, 3]}}.
Only return valid JSON."""
 
def listwise_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[int]:
    doc_text = "\n\n".join(
        f"[{i}] {doc[:500]}" for i, doc in enumerate(candidates)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": LISTWISE_PROMPT.format(
            query=query, documents=doc_text
        )}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    ranked = json.loads(response.choices[0].message.content)
    return ranked["ranking"][:top_n]
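The function returns indices into the candidate list, so the caller maps them back to the documents themselves:

ranked = listwise_rerank(query, candidates, top_n=5)
top_docs = [candidates[i] for i in ranked]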

Latency Budget: The Honest Numbers

Per-request sequence: User → API → Retriever (vector + BM25, 20 candidates, ~30ms) → Re-ranker (cross-encoder ~200ms, LLM pointwise ~2000ms, or Cohere API ~300ms) → top 5 chunks → LLM (~800ms TTFT) → Response.

Realistic latency numbers, assuming 20 candidates:

| Re-ranker type        | p50   | p95    | Notes                                |
|-----------------------|-------|--------|--------------------------------------|
| Cross-encoder (GPU)   | 80ms  | 150ms  | Self-hosted, varies with model size  |
| Cross-encoder (CPU)   | 400ms | 900ms  | Free but slow                        |
| Cohere Rerank API     | 250ms | 500ms  | Managed, includes network            |
| GPT-4o-mini pointwise | 800ms | 2000ms | 20 parallel API calls                |
| GPT-4o listwise       | 600ms | 1200ms | 1 API call, larger prompt            |

LLM-as-reranker is almost never worth the latency for production serving. It makes sense for offline re-ranking (batch processing, indexing pipelines) or low-volume evaluation workflows.

Cross-encoder on GPU is the production default. If you don't have a GPU budget, use Cohere Rerank API and plan for ~300ms.
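If you take the managed route, the call itself is small — a minimal sketch assuming the Cohere Python SDK's `cohere.Client` and rerank endpoint, with an illustrative model name and API-key handling:

import os
import cohere
 
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])
 
def rerank_cohere(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    response = co.rerank(
        model="rerank-english-v3.0",  # example model name
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
    )
    # Each result carries the index of the original document and a relevance score
    return [candidates[r.index] for r in response.results]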

Where in the Pipeline to Place the Re-ranker

This is where most implementations go wrong.

Pipeline: Query → Hybrid Search (top-50) → Re-ranker (top-5 from 50) → LLM (+ top-5 chunks) → Answer.

Re-ranker after fusion, before LLM. Not before fusion — the fusion output is what the re-ranker should order. Not after the LLM — at that point it's too late.

Common mistakes:

Mistake 1: Re-ranking before fusion. You re-rank BM25 results, then merge with vector results. The merged set has no coherent ordering.

Mistake 2: Re-ranking too few candidates. If the correct chunk is at rank 8 in your vector search and you only pass top-5 to the re-ranker, you've already lost.

Mistake 3: Not re-ranking at all. The LLM then reads chunks in vector-similarity order, which is not relevance order.

Rule of thumb: Retrieve 20–50 candidates, re-rank to 3–7 for the LLM.
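Wired together, the retrieval-to-prompt path looks like the sketch below — it assumes a `hybrid_search` helper that returns fused candidates (not defined in this post) and reuses `rerank_cross_encoder` from earlier:

def retrieve_for_llm(query: str) -> list[dict]:
    # 1. Hybrid retrieval: cast a wide net so the right chunk is in the candidate set
    candidates = hybrid_search(query, top_k=50)  # hypothetical fusion helper
 
    # 2. Re-rank the fused set down to what the LLM will actually read
    top_chunks = rerank_cross_encoder(query, candidates, top_n=5)
 
    # 3. Only the re-ranked top-5 goes into the prompt
    return top_chunks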

Caching Re-ranker Results

Re-ranking is expensive to run and costly to skip: you pay in latency if you do it and in quality if you don't. Caching is the only viable middle ground.

Exact Cache

Cache by (query, candidate_ids) hash. Hit rate on exact matches is low in practice — queries are rarely repeated verbatim.

import hashlib
import json
 
def make_cache_key(query: str, candidate_ids: list[str]) -> str:
    payload = json.dumps({"q": query, "ids": sorted(candidate_ids)})
    return hashlib.sha256(payload.encode()).hexdigest()
 
# In Redis
def get_reranked_from_cache(redis_client, cache_key: str) -> list[str] | None:
    cached = redis_client.get(f"rerank:{cache_key}")
    if cached:
        return json.loads(cached)
    return None
 
def set_reranked_in_cache(redis_client, cache_key: str, result: list[str], ttl: int = 3600):
    redis_client.setex(f"rerank:{cache_key}", ttl, json.dumps(result))
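Put together around the re-ranker, the lookup-or-compute path is a few lines — a sketch assuming a `redis_client` is available, each candidate dict carries an `"id"` alongside its `"text"`, and `rerank_cross_encoder` from earlier is the scorer:

def rerank_with_exact_cache(
    redis_client,
    query: str,
    candidates: list[dict],
    top_n: int = 5,
) -> list[str]:
    candidate_ids = [c["id"] for c in candidates]
    cache_key = make_cache_key(query, candidate_ids)
 
    cached = get_reranked_from_cache(redis_client, cache_key)
    if cached is not None:
        return cached  # document ids in re-ranked order
 
    reranked = rerank_cross_encoder(query, candidates, top_n=top_n)
    result = [doc["id"] for doc in reranked]
    set_reranked_in_cache(redis_client, cache_key, result)
    return result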

Semantic Cache

Cache by query embedding similarity. If an incoming query is semantically close to a cached query (cosine similarity > 0.95), return the cached re-ranking.

from sentence_transformers import SentenceTransformer
import numpy as np
 
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
 
class SemanticRerankerCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.cache: list[tuple[np.ndarray, list[str]]] = []  # (embedding, result)
 
    def lookup(self, query: str) -> list[str] | None:
        query_emb = embedder.encode([query])[0]
        for cached_emb, result in self.cache:
            sim = float(np.dot(query_emb, cached_emb) /
                       (np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)))
            if sim >= self.threshold:
                return result
        return None
 
    def store(self, query: str, result: list[str]):
        query_emb = embedder.encode([query])[0]
        self.cache.append((query_emb, result))
        # In production: use a vector store for the cache, not a list

Semantic cache hit rates of 20–40% are typical for conversational applications. The embedding lookup adds ~5ms, which pays for itself immediately when it avoids a 300ms re-rank.

Re-ranker Evaluation

A re-ranker that improves NDCG but increases p99 latency from 400ms to 1200ms may not be a net improvement in user experience.

Evaluate re-rankers on:

  1. NDCG@K improvement over baseline — Does re-ranking actually improve rank quality?
  2. Latency at p50, p95, p99 — Not just average.
  3. Failure rate — What happens when the re-ranker API is down? Do you have a fallback?
from sklearn.metrics import ndcg_score
import numpy as np
 
def evaluate_reranker(
    test_cases: list[dict],
    retriever,
    reranker,
    k: int = 5,
) -> dict:
    """
    test_cases: [{"query": str, "relevance_scores": dict[doc_id, float]}]
    """
    baseline_ndcgs = []
    reranked_ndcgs = []
 
    for case in test_cases:
        candidates = retriever.retrieve(case["query"], top_k=20)
        candidate_ids = [c.id for c in candidates]
 
        true_scores = np.array([
            case["relevance_scores"].get(cid, 0) for cid in candidate_ids
        ])
 
        # Baseline: vector search rank order
        baseline_pred = np.array([20 - i for i in range(len(candidates))])
 
        # Re-ranked order
        reranked = reranker.rerank(case["query"], candidates, top_n=20)
        reranked_ids = [r.id for r in reranked]
        reranked_pred = np.array([
            20 - reranked_ids.index(cid) if cid in reranked_ids else 0
            for cid in candidate_ids
        ])
 
        baseline_ndcgs.append(ndcg_score([true_scores], [baseline_pred], k=k))
        reranked_ndcgs.append(ndcg_score([true_scores], [reranked_pred], k=k))
 
    return {
        f"baseline_ndcg@{k}": np.mean(baseline_ndcgs),
        f"reranked_ndcg@{k}": np.mean(reranked_ndcgs),
        "improvement": np.mean(reranked_ndcgs) - np.mean(baseline_ndcgs),
    }

An improvement of less than 0.03 NDCG@5 probably isn't worth the latency and operational complexity of a re-ranker in production.

Fallback Strategy

Re-rankers fail. The API is down, the GPU is saturated, the latency spikes. You need a fallback.

import asyncio
import logging
 
logger = logging.getLogger(__name__)
 
async def rerank_with_fallback(
    query: str,
    candidates: list[dict],
    reranker,
    timeout_ms: int = 400,
    top_n: int = 5,
) -> list[dict]:
    try:
        result = await asyncio.wait_for(
            reranker.arerank(query, candidates, top_n=top_n),
            timeout=timeout_ms / 1000,
        )
        return result
    except asyncio.TimeoutError:
        logger.warning("Re-ranker timeout; falling back to vector order")
        return candidates[:top_n]
    except Exception as e:
        logger.error(f"Re-ranker error: {e}; falling back to vector order")
        return candidates[:top_n]

The fallback is vector-search order. Users get a slightly worse result, not an error. Log every fallback — a high fallback rate means your re-ranker deployment is underprovisioned.
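To make "log every fallback" actionable, export it as a metric you can alert on — a sketch assuming the prometheus_client library, with hypothetical metric names:

from prometheus_client import Counter
 
# Increment these inside rerank_with_fallback: requests on entry, fallbacks in the except blocks
RERANK_REQUESTS = Counter("rerank_requests_total", "Re-rank attempts")
RERANK_FALLBACKS = Counter("rerank_fallbacks_total", "Fallbacks to vector-search order")
 
# Alert when fallbacks/requests exceeds a few percent over a rolling window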

Key Takeaways

  • Cross-encoder on GPU is the production default; LLM-as-reranker is for offline or low-volume workloads.
  • Re-rank over 20–50 candidates, not 3–5 — the re-ranker can only recover candidates that made it into the input set.
  • Place the re-ranker after fusion, before the LLM prompt — nowhere else.
  • Semantic caching cuts re-ranker latency cost by 20–40% in conversational applications.
  • Measure NDCG improvement and p99 latency together — a re-ranker that slows you more than it improves you is not a win.
  • Always implement a timeout + fallback to vector order; re-rankers fail and your API response should not.