
LLM Evals Beyond the Leaderboard

Ravinder · 7 min read
AI · LLM · Evaluation · Testing

Leaderboard Numbers Will Mislead You

MMLU, HellaSwag, HumanEval — every major model release comes with a table of benchmark scores. Marketing teams love them. Engineering teams should be skeptical. Those benchmarks measure generic capability. They say almost nothing about whether a model will perform on your specific task, with your prompt, against your edge cases.

The pattern is familiar: you pick the top model on the leaderboard, wire it into your pipeline, and then discover it routinely botches the domain-specific terminology your users care about. Meanwhile, a model ranked three positions lower handles those cases cleanly.

The only eval that matters is the one you build yourself. This post covers exactly how to do that — task-specific eval sets, LLM-as-judge patterns, golden sets, regression tracking, and drift detection.


Step 1: Define What "Good" Means for Your Task

Before you write a single test, you need a rubric. Not a vague one — a precise, disagreement-resolving rubric that two engineers would apply consistently.

For a customer support summarization task, "good" might mean:

  • All action items are captured
  • Tone matches the customer's sentiment (do not summarize an angry ticket as neutral)
  • No hallucinated resolution claims
  • Under 80 words

Write these criteria down. They become the specification for your eval. Every ambiguous case you encounter while labeling will force you to tighten the rubric further, and that tightening is valuable — it surfaces assumptions your team held implicitly.

flowchart TD
    A[Define task output] --> B[Write rubric v1]
    B --> C[Label 20 examples by hand]
    C --> D{Disagreements?}
    D -- Yes --> E[Resolve + tighten rubric]
    E --> C
    D -- No --> F[Rubric is stable]
    F --> G[Scale labeling]

Do not skip the hand-labeling phase. Labeling 20 examples yourself will reveal rubric gaps that were invisible on paper.
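
One way to check whether the rubric is actually stable is to have two people label the same handful of examples and measure how often they agree per criterion. A minimal sketch, assuming labels are stored as pass/fail per rubric criterion (the label format is an illustrative assumption, not part of any framework):

# Rough rubric-stability check: per-criterion agreement between two labelers
def agreement_rate(labels_a: list[dict[str, bool]], labels_b: list[dict[str, bool]]) -> float:
    """Fraction of (example, criterion) pairs where both labelers agree."""
    matches, total = 0, 0
    for a, b in zip(labels_a, labels_b):
        for criterion in a:
            total += 1
            matches += int(a[criterion] == b.get(criterion))
    return matches / total if total else 0.0

# Two engineers label the same two summaries against the rubric
labels_a = [{"action_items": True, "tone": True}, {"action_items": False, "tone": True}]
labels_b = [{"action_items": True, "tone": False}, {"action_items": False, "tone": True}]
print(agreement_rate(labels_a, labels_b))  # 0.75 -> disagreements to resolve before scaling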


Step 2: Build Your Eval Dataset

A credible eval set has three tiers:

Tier 1 — Golden Set. 50–200 examples with human-verified expected outputs. These are your ground truth. They change slowly and only after deliberate review.

Tier 2 — Adversarial Set. Examples specifically designed to probe failure modes: jailbreaks, ambiguous inputs, long-context tasks, multilingual inputs, inputs with deliberate typos. Curate these from real production failures and from red-teaming sessions.

Tier 3 — Regression Set. Every production bug that you fix should spawn a test case. This set grows automatically as you operate the system.

# Minimal eval dataset structure
from dataclasses import dataclass
from typing import Optional
 
@dataclass
class EvalCase:
    id: str
    input: str
    expected: str
    tier: str          # "golden" | "adversarial" | "regression"
    rubric_version: str
    tags: list[str]    # e.g. ["long_context", "multilingual"]
    notes: Optional[str] = None

Store these in version control, not in a spreadsheet. Spreadsheets drift; git diffs are auditable.
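
A simple way to do that is one JSONL file per tier, checked into the repo next to the eval runner and loaded at run time. A sketch, assuming the EvalCase dataclass above and an evals/data/ directory layout (the paths are illustrative):

# Load version-controlled eval cases; one JSONL file per tier
import json
from pathlib import Path

def load_cases(tier: str, root: Path = Path("evals/data")) -> list[EvalCase]:
    """Read EvalCase records for one tier from a JSONL file in the repo."""
    with open(root / f"{tier}.jsonl") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]

golden_cases = load_cases("golden")        # e.g. evals/data/golden.jsonl
regression_cases = load_cases("regression")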


Step 3: LLM-as-Judge — Use It Carefully

Automated judging with a stronger LLM is genuinely useful, but it has failure modes you must understand before trusting it.

What works well:

  • Checking factual consistency between output and source document
  • Binary quality checks (does the output contain a valid JSON object?)
  • Relative ranking between two outputs (A/B judging)

What breaks:

  • Detecting subtle tone mismatches
  • Judging outputs in domains where the judge model has low coverage
  • Anything requiring knowledge the judge model was not trained on

The most reliable judge pattern is a structured prompt that asks for a score and a rationale, then validates the rationale against your rubric criteria:

import json
from openai import OpenAI

client = OpenAI()  # judge model client; assumes an OpenAI-compatible SDK

JUDGE_PROMPT = """
You are evaluating a customer support ticket summary.
 
Criteria:
1. All action items captured (0-2)
2. Tone matches customer sentiment (0-2)
3. No hallucinated resolution claims (0-2)
4. Under 80 words (0-1)
 
Ticket: {ticket}
Summary: {summary}
 
Return JSON: {{"scores": {{...}}, "rationale": "...", "total": N}}
"""
 
def judge(ticket: str, summary: str, model: str) -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            ticket=ticket, summary=summary
        )}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Always spot-check judge outputs against human labels on a random sample. If judge agreement with humans is below 80%, your judge prompt needs work before you trust it at scale.
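
A sketch of that spot-check, assuming you already have the candidate model's outputs keyed by case id, human pass/fail labels on those same outputs, and a pass threshold on the judge's total score (all three are assumptions layered on top of the judge above):

# Spot-check the judge against human labels on a random sample
import random

def judge_human_agreement(
    cases: list[EvalCase],
    outputs: dict[str, str],      # case id -> generated summary
    human_pass: dict[str, bool],  # case id -> human pass/fail verdict
    model: str,
    sample_size: int = 30,
    pass_threshold: int = 6,      # out of the 7 points in the rubric above
) -> float:
    """Fraction of sampled cases where the judge and the human reach the same verdict."""
    sample = random.sample(cases, min(sample_size, len(cases)))
    agree = 0
    for case in sample:
        verdict = judge(case.input, outputs[case.id], model)  # judge() defined above
        judge_pass = verdict["total"] >= pass_threshold
        agree += int(judge_pass == human_pass[case.id])
    return agree / len(sample)

# Rework the judge prompt until this stays comfortably above 0.8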


Step 4: Running Evals as Regression Tests

Evals should run in CI. Every model version bump, prompt change, or retrieval config change should trigger a full eval run with a pass/fail threshold.

# .github/workflows/evals.yml
name: LLM Eval Suite
on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/llm/**"
 
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run golden set evals
        run: python evals/run.py --tier golden --threshold 0.85
      - name: Run regression evals
        run: python evals/run.py --tier regression --threshold 0.95
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results.json

Set tier-specific thresholds. The golden set threshold can be moderate (0.85) because it includes hard edge cases. The regression threshold should be near-perfect (0.95+) — these are bugs you already fixed and must not reintroduce.
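
The evals/run.py referenced in the workflow is not shown here; a minimal sketch of what it might look like, where --tier selects the case file and --threshold drives the exit code CI acts on (score_case stands in for your generation-plus-judging pipeline and is an assumption):

# Hypothetical evals/run.py: score one tier and fail the build below the threshold
import argparse
import json
import sys

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--tier", choices=["golden", "adversarial", "regression"], required=True)
    parser.add_argument("--threshold", type=float, default=0.85)
    args = parser.parse_args()

    cases = load_cases(args.tier)                  # loader from Step 2
    scores = [score_case(case) for case in cases]  # assumed: generate output, judge it, return 0..1
    mean_score = sum(scores) / len(scores)

    with open("eval_results.json", "w") as f:
        json.dump({"tier": args.tier, "mean_score": mean_score, "n": len(cases)}, f)

    print(f"{args.tier}: {mean_score:.3f} (threshold {args.threshold})")
    sys.exit(0 if mean_score >= args.threshold else 1)

if __name__ == "__main__":
    main()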


Step 5: Drift Detection in Production

Offline evals tell you about the model and prompt. Production drift tells you about the data. User input distributions shift over time. New product features introduce input patterns that were not in your eval set. Seasonal events create vocabulary spikes.

Track these signals continuously:

Input distribution drift. Embed incoming prompts and monitor the centroid distance from your eval set embeddings. A growing gap means users are asking things you have not tested.

Output quality distribution. If you have user feedback signals (thumbs up/down, follow-up corrections), track them by input cluster. A cluster with degrading feedback is a signal to add evals for that region of the input space.

Length and format anomalies. Sudden spikes in output truncation or JSON parse failures are often the first symptom of a model behavior change after a silent API update.

import numpy as np
from scipy.spatial.distance import cosine
 
def drift_score(
    production_embeddings: np.ndarray,
    eval_embeddings: np.ndarray,
) -> float:
    """Centroid cosine distance between production and eval distributions."""
    prod_centroid = production_embeddings.mean(axis=0)
    eval_centroid = eval_embeddings.mean(axis=0)
    return float(cosine(prod_centroid, eval_centroid))
 
# Alert if drift_score > 0.15 — threshold is task-dependent
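
For the length and format anomalies described above, even a rolling JSON parse-failure rate over recent responses is enough to surface a silent behavior change. A sketch, with the window size and alert rate as illustrative defaults:

# Rolling JSON parse-failure rate over recent model responses (thresholds illustrative)
import json
from collections import deque

class ParseFailureMonitor:
    def __init__(self, window: int = 500, alert_rate: float = 0.02):
        self.results = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, raw_output: str) -> None:
        """Record one model response: 0 if it parses as JSON, 1 if it does not."""
        try:
            json.loads(raw_output)
            self.results.append(0)
        except json.JSONDecodeError:
            self.results.append(1)

    def failure_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noisy early readings
        return len(self.results) == self.results.maxlen and self.failure_rate() > self.alert_rate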

Alert on drift, then triage: is the new input class something your system should handle, or is it out of scope? If in scope, add examples to your eval set before writing a fix.


Connecting Evals to Model Selection

When you are choosing between models, run your full eval suite on each candidate — not their public benchmarks. The decision matrix looks like this:

Criterion                 Weight   Model A   Model B
Golden set score            40%      0.91      0.87
Adversarial set score       30%      0.78      0.84
Latency p95 (ms)            20%       320       180
Cost per 1k calls ($)       10%      0.40      0.18
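
To collapse the matrix into a single number, normalize each criterion so that higher is better and apply the weights. A sketch of that arithmetic; the latency and cost normalization bounds are assumptions you would set per project:

# Weighted decision score for the matrix above (normalization bounds are assumptions)
def weighted_score(
    golden: float,
    adversarial: float,
    latency_p95_ms: float,
    cost_per_1k: float,
    max_latency_ms: float = 500.0,
    max_cost: float = 1.0,
) -> float:
    return (
        0.40 * golden
        + 0.30 * adversarial
        + 0.20 * (1 - latency_p95_ms / max_latency_ms)
        + 0.10 * (1 - cost_per_1k / max_cost)
    )

model_a = weighted_score(0.91, 0.78, 320, 0.40)  # ~0.73
model_b = weighted_score(0.87, 0.84, 180, 0.18)  # ~0.81 -> Model B wins despite the lower golden score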

A model that scores lower on MMLU but higher on your adversarial set is almost certainly the better choice for production. Trust your data, not the leaderboard.


Practical Scale: What to Automate First

If you are starting from zero, prioritize in this order:

  1. Build a 50-example golden set with hand-written expected outputs
  2. Wire a judge prompt and validate it against human labels
  3. Add eval execution to your PR checks
  4. Instrument production for format anomaly detection
  5. Add adversarial cases from real production failures

Do not try to build a perfect eval system before shipping. A 50-example golden set running in CI beats a theoretical 10,000-example dataset that never runs.


Key Takeaways

  • Leaderboard benchmarks measure generic capability; only task-specific evals predict production performance.
  • Build three tiers: golden set (ground truth), adversarial set (failure probes), regression set (bug prevention).
  • LLM-as-judge is useful for consistency checks and A/B ranking, but must be validated against human labels before you trust it.
  • Evals belong in CI — every prompt or model change should trigger a scored run with pass/fail thresholds.
  • Track input distribution drift in production; a growing gap between production inputs and your eval set is a reliability risk.
  • Optimize eval infrastructure incrementally: a small golden set running consistently is more valuable than a large dataset that never runs.