Building Production RAG

Evaluation Harness from Scratch

Ravinder · 7 min read
RAG · AI · LLM · Evaluation · Testing · MLOps

Most teams "evaluate" their RAG system by demoing it to stakeholders and calling it good. This works until someone changes the chunking strategy, updates the prompt, or swaps in a new embedding model — and quality degrades in ways nobody notices until users complain. By that point, you don't know which change broke it.

An evaluation harness gives you a number that goes up or down when you make a change. That's the whole point. Without it, you're flying blind.

The Anatomy of a RAG Evaluation Suite

A complete harness covers three layers:

flowchart TD
    A[Evaluation Suite] --> B[Retrieval Metrics\nRetrieval layer only]
    A --> C[Generation Metrics\nEnd-to-end quality]
    A --> D[Regression Detection\nDiff between versions]
    B --> B1[Recall@K]
    B --> B2[Precision@K]
    B --> B3[MRR / NDCG]
    C --> C1[Faithfulness]
    C --> C2[Answer Relevance]
    C --> C3[Context Utilization]
    D --> D1[Score delta against baseline]
    D --> D2[Per-query regression flagging]
    D --> D3[CI gate on merge]

Retrieval metrics tell you whether the right chunks are being surfaced. Generation metrics tell you whether those chunks are being used correctly. Regression detection tells you whether a code change made things better or worse.

Building the Golden Set

You cannot have meaningful evaluation without a golden set. The minimum viable version is 100–200 (query, relevant_chunk_ids, expected_answer) triples built from real user queries:

import json
from pathlib import Path
from dataclasses import dataclass, asdict
 
@dataclass
class GoldenExample:
    example_id: str
    query: str
    relevant_chunk_ids: list[str]      # ground-truth chunks
    expected_answer: str               # reference answer for generation eval
    difficulty: str                    # "easy" | "medium" | "hard"
    query_type: str                    # "factual" | "procedural" | "comparative"
    source_doc_ids: list[str]          # which documents contain the answer
 
class GoldenSet:
    def __init__(self, path: str):
        self.path = Path(path)
        self.examples: list[GoldenExample] = []
        if self.path.exists():
            self._load()
 
    def _load(self):
        with open(self.path) as f:
            raw = json.load(f)
        self.examples = [GoldenExample(**ex) for ex in raw]
 
    def save(self):
        with open(self.path, "w") as f:
            json.dump([asdict(ex) for ex in self.examples], f, indent=2)
 
    def add(self, example: GoldenExample):
        self.examples.append(example)
 
    def split(self, dev_frac: float = 0.6, test_frac: float = 0.2):
        """Returns dev, test, holdout splits."""
        n = len(self.examples)
        dev_end = int(n * dev_frac)
        test_end = dev_end + int(n * test_frac)
        return (
            self.examples[:dev_end],
            self.examples[dev_end:test_end],
            self.examples[test_end:],
        )

Golden set construction is slow. Budget one hour of engineer time per 10 examples. Use LLM assistance to generate candidate answers and chunk annotations, then have a human verify and correct. Never skip the human review step — LLM-generated labels have systematic errors that compound badly in evaluation.
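
To make the assisted step concrete, here is a minimal sketch of drafting an annotation with an LLM before human review. The prompt wording and the `candidates` argument (a list of dicts with `chunk_id` and `text` keys) are assumptions for illustration, not part of the harness above:

import json
import openai

DRAFT_PROMPT = """\
Given the query and the candidate chunks below, write a concise reference answer
and list the ids of the chunks that actually contain it.
Respond with JSON only: {{"expected_answer": "...", "relevant_chunk_ids": ["..."]}}

Query: {query}

Candidate chunks:
{chunks}"""

def draft_golden_example(query: str, candidates: list[dict], model: str = "gpt-4o-mini") -> dict:
    """Draft an annotation for human review -- never add it to the golden set unreviewed."""
    chunks = "\n\n".join(f"[{c['chunk_id']}] {c['text'][:500]}" for c in candidates)
    client = openai.OpenAI()
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DRAFT_PROMPT.format(query=query, chunks=chunks)}],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    draft = json.loads(result.choices[0].message.content)
    draft["query"] = query
    draft["needs_human_review"] = True  # a reviewer corrects chunk ids and answer before it enters the set
    return draft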

Retrieval Metrics

The core retrieval metrics are straightforward, but the devil is in getting the implementation details right:

import numpy as np
 
def recall_at_k(
    relevant: set[str],
    retrieved: list[str],
    k: int,
) -> float:
    """Fraction of relevant chunks found in top-k."""
    if not relevant:
        return 1.0  # vacuously true
    return len(relevant & set(retrieved[:k])) / len(relevant)
 
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of top-k results that are relevant."""
    if k == 0:
        return 0.0
    return len(relevant & set(retrieved[:k])) / k
 
def mean_reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Rank of first relevant result."""
    for i, doc_id in enumerate(retrieved):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0
 
def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Normalized Discounted Cumulative Gain at k."""
    dcg = sum(
        1.0 / np.log2(i + 2)
        for i, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
 
def evaluate_retrieval(
    golden_examples: list[GoldenExample],
    retrieval_fn,  # callable: query -> list[str] (chunk_ids, ordered)
    k: int = 5,
) -> dict:
    recalls, precisions, mrrs, ndcgs = [], [], [], []
 
    for ex in golden_examples:
        retrieved = retrieval_fn(ex.query)
        relevant = set(ex.relevant_chunk_ids)
        recalls.append(recall_at_k(relevant, retrieved, k))
        precisions.append(precision_at_k(retrieved, relevant, k))
        mrrs.append(mean_reciprocal_rank(retrieved, relevant))
        ndcgs.append(ndcg_at_k(retrieved, relevant, k))
 
    return {
        f"recall@{k}": round(float(np.mean(recalls)), 4),
        f"precision@{k}": round(float(np.mean(precisions)), 4),
        "mrr": round(float(np.mean(mrrs)), 4),
        f"ndcg@{k}": round(float(np.mean(ndcgs)), 4),
    }

Track all four, but use recall@k as your primary optimization target. It directly answers the question that matters most: is the relevant information in the context the LLM sees?
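
As a quick illustration, a sweep over several values of k (a sketch reusing evaluate_retrieval above; the output shown in the comment is hypothetical) shows where adding more retrieved context stops paying off:

def recall_sweep(golden_examples, retrieval_fn, ks=(3, 5, 10, 20)) -> dict[int, float]:
    """Recall@k at several cutoffs, to see where extra retrieved context stops helping."""
    return {
        k: evaluate_retrieval(golden_examples, retrieval_fn, k=k)[f"recall@{k}"]
        for k in ks
    }

# Hypothetical output: {3: 0.61, 5: 0.72, 10: 0.81, 20: 0.84} -> diminishing returns past k=10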

Generation Metrics

For generation evaluation, you need a model-based judge or reference-based scoring. The practical approach: a lightweight LLM judge that scores three dimensions.

import openai
import json
 
JUDGE_PROMPT = """\
You are evaluating a RAG system response. Score each dimension from 0.0 to 1.0.
 
Query: {query}
 
Retrieved Context:
{context}
 
System Response:
{response}
 
Reference Answer:
{reference}
 
Score these three dimensions:
1. faithfulness: Does the response only make claims supported by the retrieved context? (0=fabricates, 1=fully grounded)
2. answer_relevance: Does the response actually answer the query? (0=irrelevant, 1=directly answers)
3. context_utilization: Does the response use the most relevant parts of the context? (0=ignores context, 1=uses it well)
 
Respond with JSON only: {{"faithfulness": 0.0, "answer_relevance": 0.0, "context_utilization": 0.0}}"""
 
def judge_response(
    query: str,
    context_chunks: list[str],
    response: str,
    reference_answer: str,
    model: str = "gpt-4o-mini",
) -> dict[str, float]:
    client = openai.OpenAI()
    context = "\n\n---\n\n".join(context_chunks[:5])
    prompt = JUDGE_PROMPT.format(
        query=query,
        context=context[:4000],  # stay within reasonable token budget
        response=response,
        reference=reference_answer,
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    try:
        scores = json.loads(result.choices[0].message.content)
        return {k: float(v) for k, v in scores.items()}
    except (json.JSONDecodeError, KeyError, ValueError):
        return {"faithfulness": 0.0, "answer_relevance": 0.0, "context_utilization": 0.0}

LLM-as-judge has known biases (position bias, verbosity bias, self-preference). Mitigate by using a different model than the one generating responses, randomizing chunk order, and validating the judge against a small human-labeled calibration set.
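
The calibration step can be as small as the sketch below, assuming each calibration item carries a human-assigned faithfulness score alongside the fields judge_response expects:

import numpy as np

def calibrate_judge(calibration_set: list[dict]) -> float:
    """Correlation between judge and human faithfulness scores on a labeled calibration set."""
    judge_scores, human_scores = [], []
    for item in calibration_set:
        scores = judge_response(
            item["query"],
            item["context_chunks"],
            item["response"],
            item["reference_answer"],
        )
        judge_scores.append(scores["faithfulness"])
        human_scores.append(item["human_faithfulness"])  # assumed human label in [0.0, 1.0]
    return float(np.corrcoef(judge_scores, human_scores)[0, 1])

If the correlation comes back low, fix the judge prompt before letting its scores gate anything.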

Regression Detection

The harness is only useful if it gates deploys. Structure your evaluation results to enable diff-based regression detection:

import json
from dataclasses import dataclass, asdict
from datetime import datetime
from pathlib import Path
 
@dataclass
class EvalRun:
    run_id: str
    timestamp: str
    git_sha: str
    pipeline_config: dict
    retrieval_metrics: dict
    generation_metrics: dict
    per_example_scores: list[dict]
 
def compare_runs(baseline: EvalRun, candidate: EvalRun, threshold: float = 0.02) -> dict:
    """
    Returns a regression report. Flags any metric that dropped by more than `threshold`.
    """
    regressions = []
    improvements = []
    report = {}
 
    all_metrics = {
        **baseline.retrieval_metrics,
        **{f"gen_{k}": v for k, v in baseline.generation_metrics.items()},
    }
    candidate_metrics = {
        **candidate.retrieval_metrics,
        **{f"gen_{k}": v for k, v in candidate.generation_metrics.items()},
    }
 
    for metric, baseline_val in all_metrics.items():
        candidate_val = candidate_metrics.get(metric, 0.0)
        delta = candidate_val - baseline_val
        report[metric] = {"baseline": baseline_val, "candidate": candidate_val, "delta": round(delta, 4)}
        if delta < -threshold:
            regressions.append(metric)
        elif delta > threshold:
            improvements.append(metric)
 
    report["_summary"] = {
        "regressions": regressions,
        "improvements": improvements,
        "ci_pass": len(regressions) == 0,
    }
    return report
 
def save_eval_run(run: EvalRun, output_dir: str = "./eval_runs") -> str:
    path = Path(output_dir)
    path.mkdir(exist_ok=True)
    filename = f"{run.timestamp}_{run.run_id}.json"
    with open(path / filename, "w") as f:
        json.dump(asdict(run), f, indent=2)
    return str(path / filename)

Wire this into CI. Every PR that touches retrieval logic, chunking, prompts, or model configuration should run the evaluation suite on the dev split and fail the merge if any metric regresses by more than 2 percentage points.
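
A minimal CI entry point could look like the sketch below; the file paths and how the candidate run gets produced are assumptions specific to your setup:

import sys
import json
from pathlib import Path

def ci_gate(baseline_path: str, candidate_path: str, threshold: float = 0.02) -> None:
    """Load two saved eval runs, diff them, and exit nonzero if anything regressed."""
    baseline = EvalRun(**json.loads(Path(baseline_path).read_text()))
    candidate = EvalRun(**json.loads(Path(candidate_path).read_text()))
    report = compare_runs(baseline, candidate, threshold=threshold)

    for metric, vals in report.items():
        if metric != "_summary":
            print(f"{metric}: {vals['baseline']} -> {vals['candidate']} ({vals['delta']:+})")

    if not report["_summary"]["ci_pass"]:
        print(f"Regressions: {report['_summary']['regressions']}")
        sys.exit(1)  # block the merge

if __name__ == "__main__":
    ci_gate(sys.argv[1], sys.argv[2])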

Automatic Golden Set Expansion

Your golden set should grow as your system is used. Log queries where users signal dissatisfaction (thumbs down, follow-up rephrasing, session abandonment) and route them to a human review queue:

from collections import deque
from datetime import datetime
 
class GoldenSetExpansionQueue:
    """Collect failed queries for human annotation."""
 
    def __init__(self, max_size: int = 500):
        self._queue: deque[dict] = deque(maxlen=max_size)
 
    def add_candidate(
        self,
        query: str,
        retrieved_chunks: list[dict],
        response: str,
        failure_signal: str,  # "thumbs_down" | "rephrased" | "session_abandon"
    ) -> None:
        self._queue.append({
            "query": query,
            "retrieved_chunks": retrieved_chunks,
            "response": response,
            "failure_signal": failure_signal,
            "timestamp": datetime.now().isoformat(),
            "annotated": False,
        })
 
    def pending_annotation(self) -> list[dict]:
        return [item for item in self._queue if not item["annotated"]]

Target expanding the golden set by 10–20 examples per week from real user failures. Over time, this concentrates evaluation coverage on the cases where your system actually struggles.
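
Once a reviewer has corrected the chunk IDs and written the reference answer for a queued item, promotion into the golden set is mechanical. A sketch, assuming the reviewed item carries those corrected fields and using a generated UUID as the example ID:

import uuid

def promote_to_golden(reviewed_item: dict, golden_set: GoldenSet) -> None:
    """Turn a human-reviewed failure case into a GoldenExample and persist it."""
    golden_set.add(GoldenExample(
        example_id=str(uuid.uuid4()),
        query=reviewed_item["query"],
        relevant_chunk_ids=reviewed_item["relevant_chunk_ids"],  # corrected by the reviewer
        expected_answer=reviewed_item["expected_answer"],        # written or verified by the reviewer
        difficulty=reviewed_item.get("difficulty", "hard"),      # real failures skew hard
        query_type=reviewed_item.get("query_type", "factual"),
        source_doc_ids=reviewed_item.get("source_doc_ids", []),
    ))
    golden_set.save()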

Key Takeaways

  • A RAG pipeline without evaluation metrics is a system you can only improve by accident — build the harness before shipping to production, not after.
  • The golden set is the foundation: 100–200 human-annotated (query, relevant chunk IDs, expected answer) triples are the minimum for meaningful evaluation.
  • Track four retrieval metrics (recall@k, precision@k, MRR, NDCG) and use recall@k as your primary optimization signal.
  • LLM-as-judge generation scoring is practical at scale; mitigate bias by using a different model than your generator and calibrating against human labels.
  • Wire regression detection into CI with a hard threshold — any metric drop larger than 2 percentage points blocks merge.
  • Continuously expand the golden set from real user failure signals; your evaluation coverage should track your system's actual weak spots.