
Search Relevance, Measured

Ravinder · 9 min read

Most search teams discover they have a relevance problem the wrong way: a support ticket, a viral tweet, or a senior exec who can't find something obvious. By then you've already shipped the regression. The teams that catch these problems first are running disciplined offline evaluation pipelines — not vibes-based QA against a handful of queries.

This post covers the metrics that matter, how to build a judgment set without bankrupting yourself on human labelers, and why click models are underused outside academia.


Why Precision@K Is Not Enough

The classic precision-at-K metric counts how many results in the top K are "relevant." It's binary and position-blind: a relevant result at rank 1 and a relevant result at rank 10 both count equally. That does not match how users behave. Users click the first result far more than the tenth. If your metric doesn't model position, your optimization target is wrong.
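
To make the position-blindness concrete, here is a minimal sketch (the function name and toy doc IDs are illustrative, not a standard API):

def precision_at_k(relevant: set[str], ranked: list[str], k: int) -> float:
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# Both rankings score P@3 = 0.33, but the user experiences differ wildly:
relevant = {"d7"}
print(precision_at_k(relevant, ["d7", "d2", "d3"], 3))  # relevant doc at rank 1
print(precision_at_k(relevant, ["d2", "d3", "d7"], 3))  # relevant doc at rank 3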

Enter discounted cumulative gain.


DCG and NDCG

Discounted Cumulative Gain (DCG) weights each result by a logarithmic discount based on its position:

DCG@K = Σ (rel_i / log2(i + 1))  for i = 1 to K

Where rel_i is the relevance grade of the result at position i. Relevance grades are typically 0–3 or 0–4 (not relevant, somewhat relevant, relevant, highly relevant).

The discount 1 / log2(i + 1) gives position 1 full weight, position 2 about 63%, position 3 about 50%, and so on. A highly relevant document buried at position 8 contributes far less than the same document at position 1.

Normalized DCG (NDCG) divides by the ideal DCG — the DCG you'd get if all results were in perfect relevance order:

NDCG@K = DCG@K / IDCG@K

NDCG is bounded 0–1, which makes it comparable across queries with different numbers of relevant documents. This matters: a query with 12 relevant documents in the corpus is structurally different from a query with 2.

import math
 
def dcg_at_k(relevances: list[int], k: int) -> float:
    """Compute DCG@K. relevances[0] is rank-1 result."""
    score = 0.0
    for i, rel in enumerate(relevances[:k], start=1):
        score += rel / math.log2(i + 1)
    return score
 
def ndcg_at_k(relevances: list[int], k: int) -> float:
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / idcg
 
# Example: query returns [3, 2, 0, 1] graded results
actual = [3, 2, 0, 1]
print(f"NDCG@4: {ndcg_at_k(actual, 4):.4f}")  # → 0.9275

A common mistake: computing NDCG over the full result set (K = 100) when users never scroll past page one. Pick K to match your real user behavior — usually 5 or 10.


Building a Judgment Set Without Going Broke

Offline evaluation requires a judgment set: a collection of (query, document, relevance-grade) triples. The naive approach is hiring humans to judge every query-document pair. At scale this is expensive and slow.

Practical shortcuts:

Sample from production traffic. Take the top 500–2000 queries by volume. These cover the bulk of user sessions. Head queries are cheap to judge because they're often navigational (the right answer is obvious).

Use graded relevance, not binary. Binary (relevant/not) throws away signal. Graded (0–3) lets NDCG do more work. Instruct labelers: 0 = off-topic, 1 = tangentially related, 2 = answers the query, 3 = perfect match.

Leverage existing signals for seeding. Clicks, add-to-carts, purchases — these are noisy but cheap. Use them to bootstrap candidates for human review, not as ground truth.

Inter-annotator agreement. For each query, have 3 labelers grade the same results. Compute Fleiss' kappa. Below 0.4 means your guidelines are ambiguous. Fix the guidelines before you collect more labels.
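
As a sketch of that agreement check, here is a from-scratch Fleiss' kappa (statsmodels also ships an implementation in its inter_rater module if you'd rather not hand-roll it; the counts below are toy numbers):

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """
    counts: (n_items, n_grades) matrix; counts[i, j] is how many
    labelers assigned grade j to item i. Assumes equal raters per item.
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # observed per-item agreement, averaged over items
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # chance agreement from the marginal grade distribution
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# 4 (query, doc) pairs, grades 0-3, 3 labelers each
counts = np.array([[3, 0, 0, 0],
                   [0, 2, 1, 0],
                   [1, 1, 1, 0],
                   [0, 0, 0, 3]])
print(f"kappa = {fleiss_kappa(counts):.2f}")  # ≈ 0.43, just clears the 0.4 bar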

flowchart TD
    A[Production Query Log] --> B[Sample Top-N Queries]
    B --> C[Retrieve Candidate Results per Query]
    C --> D[Human Labelers — Graded 0-3]
    D --> E{IAA > 0.4?}
    E -- No --> F[Refine Guidelines]
    F --> D
    E -- Yes --> G[Judgment Set]
    G --> H[Compute NDCG@10 per System Version]
    H --> I[Track Over Time / A/B Compare]

Click Models: Free Signal You're Ignoring

If you have a search product with real users, you have click data. The problem is that raw click-through rate (CTR) is biased by position: users click position 1 more because it's first, not always because it's best. A naive interpretation of CTR produces circular improvements: whatever ranks first gets more clicks, gets reinforced, and ranks first forever.

Click models disentangle examination probability from relevance. The two workhorse models:

Position-Based Model (PBM): Assumes a click happens only if the user examines a result (which depends on position) and finds it relevant. Formally:

P(click | rank=r, doc=d) = P(examine | r) × P(relevant | d)

Parameters are fit via EM on your click log. The output is a per-document relevance estimate that's position-debiased.

Dynamic Bayesian Network (DBN): Models the user as reading down the list, deciding at each result whether to click and whether to continue. It captures a "satisfied" state: if the user clicks a result and never returns to the results page, that's a strong relevance signal.

For most product teams, PBM is sufficient and easy to implement. The point is not to run a dissertation — it's to extract debiased relevance signals from existing data to supplement human labels.

# Simplified EM for Position-Based Model
import numpy as np

def fit_pbm(clicks, n_positions=10, n_docs=1000, iters=20):
    """
    clicks: list of (doc_id, position, clicked) tuples
    Returns P(examine) per position and P(relevant) per doc
    """
    exam = np.full(n_positions, 0.5)   # P(examine | position)
    rel  = np.full(n_docs, 0.5)        # P(relevant | doc)

    for _ in range(iters):
        # E-step: posterior expectations of the latent examination
        # and relevance indicators for each impression
        new_exam = np.zeros(n_positions)
        new_rel  = np.zeros(n_docs)
        exam_count = np.zeros(n_positions)
        rel_count  = np.zeros(n_docs)

        for doc_id, pos, clicked in clicks:
            p_no_click = 1 - exam[pos] * rel[doc_id]
            if clicked:
                # a click implies the result was examined AND relevant
                w_exam = 1.0
                w_rel  = 1.0
            else:
                # no click: split posterior mass between "examined but
                # not relevant" and "relevant but never examined"
                w_exam = exam[pos] * (1 - rel[doc_id]) / max(p_no_click, 1e-9)
                w_rel  = (1 - exam[pos]) * rel[doc_id] / max(p_no_click, 1e-9)
            new_exam[pos] += w_exam
            exam_count[pos] += 1
            new_rel[doc_id] += w_rel
            rel_count[doc_id] += 1

        # M-step: parameters become the averaged posteriors
        exam = new_exam / np.maximum(exam_count, 1)
        rel  = new_rel  / np.maximum(rel_count, 1)

    return exam, rel
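
A quick smoke test on synthetic clicks (the generator below is made up for illustration; in practice you'd feed in your real click log):

rng = np.random.default_rng(0)
true_exam = [0.9, 0.4]        # position 1 examined far more than position 2
true_rel  = {0: 0.8, 1: 0.1}  # doc 0 relevant, doc 1 not
clicks = []
for _ in range(5000):
    # randomize which doc sits at which position to break confounding
    order = [0, 1] if rng.random() < 0.5 else [1, 0]
    for pos, doc_id in enumerate(order):
        clicked = int(rng.random() < true_exam[pos] * true_rel[doc_id])
        clicks.append((doc_id, pos, clicked))

exam_hat, rel_hat = fit_pbm(clicks, n_positions=2, n_docs=2)
print(rel_hat)  # doc 0 should come out well above doc 1

Note that PBM is only identified up to a scale factor between examination and relevance, so compare the recovered relevance estimates to each other, not to the true probabilities.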

Offline vs. Online Evaluation

Neither replaces the other. They answer different questions.

|                | Offline (NDCG on judgment set) | Online (A/B test)              |
|----------------|--------------------------------|--------------------------------|
| Speed          | Hours                          | Days to weeks                  |
| Cost           | Low once set up                | Infrastructure + traffic split |
| Signal quality | Only as good as labels         | Ground truth user behavior     |
| Catches        | Ranking regressions            | Business metric impact         |
| Risk           | Ships changes untested         | Exposes users to bad changes   |

The right workflow: offline eval gates every change. Only changes that improve (or don't regress) NDCG proceed to A/B. This cuts the number of A/B tests you need by 60–80%, because most ranking ideas simply don't work and the offline eval kills them cheaply.

A common failure mode: teams run A/B tests as the primary evaluation loop. A/B tests are expensive — you need traffic, time, and statistical power. Running a bad idea in A/B for two weeks before killing it is waste. Kill it offline first.


Operationalizing the Pipeline

A practical offline evaluation pipeline has three components:

  1. Judgment store — a versioned database of (query, doc, grade) triples. Store the query text, the document ID, the relevance grade, the labeler ID, and the timestamp. You will update grades over time as content changes.

  2. Retrieval harness — a script that, given a judgment set query, fetches the top-K results from a given system version (index snapshot + ranking model). Stores results with positions.

  3. Metric computation — computes NDCG@5, NDCG@10, and any other metrics. Produces a report comparing the candidate system to the baseline.
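
Stitched together, the metric-computation step can be this small (a sketch: the JSONL field names mirror the judgment-store description above, system_results is assumed to come from your retrieval harness, and ndcg_at_k is the function defined earlier):

import json
from collections import defaultdict

def load_judgments(path: str) -> dict[str, dict[str, int]]:
    """query -> {doc_id: grade}, one JSON object per line."""
    grades: dict[str, dict[str, int]] = defaultdict(dict)
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            grades[row["query"]][row["doc_id"]] = row["grade"]
    return grades

def mean_ndcg(system_results: dict[str, list[str]], grades, k: int = 10) -> float:
    """Mean NDCG@k over judged queries; unjudged docs count as grade 0."""
    scores = [
        ndcg_at_k([grades[q].get(doc, 0) for doc in docs], k)
        for q, docs in system_results.items()
    ]
    return sum(scores) / len(scores)

Run it once per system version and diff the two means.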

Run this on every pull request that touches ranking code. Fail the PR if NDCG@10 drops more than 1% relative. This is not a strict threshold — calibrate it based on the noise floor of your judgment set (run the same system twice; the variance is your noise floor).
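
The parenthetical suggests running the same system twice; a bootstrap over per-query scores answers the same question from a single run. A sketch:

import random

def noise_floor(per_query_ndcg: list[float], n_boot: int = 1000) -> float:
    """Std dev of mean NDCG when the query set is resampled with replacement."""
    means = []
    for _ in range(n_boot):
        sample = random.choices(per_query_ndcg, k=len(per_query_ndcg))
        means.append(sum(sample) / len(sample))
    mu = sum(means) / n_boot
    return (sum((m - mu) ** 2 for m in means) / n_boot) ** 0.5

If your candidate-vs-baseline delta is within a couple of noise floors, don't trust it.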

# Example CI step
python eval/run_offline.py \
  --judgment-set data/judgments_v4.jsonl \
  --system-a configs/baseline.yaml \
  --system-b configs/candidate.yaml \
  --metric ndcg@10 \
  --threshold 0.01 \
  --output results/eval_$(date +%Y%m%d).json

Practical Gotchas

Fresh content is under-judged. Your judgment set was built six months ago. New documents that are actually highly relevant have no labels. This creates a systematic bias against freshness. Refresh labels quarterly, or weight recent traffic queries more heavily when sampling.

Query reformulations inflate apparent coverage. "best laptop 2026" and "top laptops 2026" are different queries in your judgment set but the same user intent. Deduplicate by intent cluster before sampling, or you'll over-represent one intent while missing others.

NDCG is not the only metric. For navigational queries, MRR (mean reciprocal rank — where is the first relevant result?) is often more appropriate. For long-tail queries with few relevant documents, ERR (expected reciprocal rank) handles the "user stops after finding one good result" assumption better. Use NDCG as your headline metric but track MRR@1 for navigational intent.
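
MRR is tiny to implement alongside NDCG; here is a sketch over binary relevance lists, rank 1 first:

def mrr(relevance_lists: list[list[int]]) -> float:
    """Mean reciprocal rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for rels in relevance_lists:
        rank = next((i for i, r in enumerate(rels, start=1) if r), None)
        if rank:
            total += 1.0 / rank
    return total / len(relevance_lists)

# First relevant hit at rank 1 and rank 3: (1 + 1/3) / 2 ≈ 0.67
print(mrr([[1, 0, 0], [0, 0, 1]]))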

Beware metric overfitting. If a team optimizes NDCG for long enough, they will find ways to game it that don't improve actual search quality. Rotate your evaluation queries periodically. Keep a held-out test set that the ranking team cannot see.


Key Takeaways

  • NDCG@K is the standard search relevance metric because it weights position and normalizes across queries; use K=5 or K=10 to match real user behavior.
  • Human judgment sets are expensive — sample from production traffic, use graded relevance, and validate with inter-annotator agreement before collecting at scale.
  • Raw click-through rate is position-biased; click models like PBM extract debiased relevance signals from your existing click log.
  • Offline evaluation should gate every ranking change; only improvements that survive offline eval earn an A/B test slot.
  • Keep a versioned judgment store, a retrieval harness, and a metric computation script — three components, none optional.
  • Refresh your judgment set quarterly and watch for metric overfitting; the evaluation pipeline is only as good as its inputs.