Building Production RAG

Problem Framing and Dataset Honesty

Ravinder · 7 min read
RAG · AI · LLM · Data Engineering · System Design

Most RAG failures happen before the first line of code is written. Teams reach for retrieval-augmented generation because it sounds like the right solution — and often it is — but they skip the uncomfortable questions: what exactly are users asking, what data do you actually have, and do those two things overlap in a way that makes retrieval useful?

This post is about getting brutally honest with yourself before you build anything.

What RAG Is Actually For

RAG solves a specific problem: a language model has a fixed knowledge cutoff and finite context, but users need answers grounded in documents the model has never seen. The retrieval step bridges that gap by pulling relevant chunks into the prompt at query time.
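To make that concrete, here's a minimal sketch of the assembly step, assuming retrieval has already happened. The prompt template is illustrative, not a recommendation:

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks.
    Illustrative only; the retriever and model call are out of scope here."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )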

That's it. RAG is not:

  • A way to make hallucinations go away entirely
  • A substitute for fine-tuning when you need consistent behavior or tone
  • Useful when your documents are structured data that SQL can answer
  • The right tool when users are asking questions your documents don't contain

The failure mode I see most often: a team builds a beautiful RAG pipeline over their internal wiki, and users complain that it "doesn't know anything." The wiki is full of process documents, meeting notes, and onboarding guides — none of which contain factual answers to the product questions users actually ask.

Mapping the Problem Space

Start here. Before touching a vector database, map the full question space your users have:

from collections import Counter
import json
 
def analyze_query_log(log_path: str, n_samples: int = 500) -> dict:
    """
    Pull a sample of real queries and cluster them by type.
    Do this before you design anything.
    """
    with open(log_path) as f:
        # skip blank lines so a trailing newline doesn't crash json.loads
        queries = [json.loads(line)["query"] for line in f if line.strip()][:n_samples]
 
    # Crude but useful first pass: question word distribution
    question_words = ["what", "how", "why", "when", "who", "where", "can", "does", "is"]
    distribution = Counter()
 
    for q in queries:
        first_word = q.strip().lower().split()[0] if q.strip() else "other"
        if first_word in question_words:
            distribution[first_word] += 1
        else:
            distribution["other"] += 1
 
    return {
        "total": len(queries),
        "distribution": dict(distribution),
        "samples": queries[:20],  # eyeball these
    }

The output of this exercise tells you which query types dominate. Factual lookups ("what is the return policy?") need different retrieval than procedural queries ("how do I configure the webhook?") or comparative queries ("what's the difference between plan A and plan B?").
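If distinct clusters show up, it can pay to make the routing explicit early. A rough heuristic sketch; the keyword rules are illustrative placeholders, not a real classifier:

def classify_query(q: str) -> str:
    """Tag a query as factual, procedural, or comparative.
    Keyword heuristics only; swap in a real classifier once
    you have labeled examples."""
    q_lower = q.strip().lower()
    if any(kw in q_lower for kw in ("difference between", " vs ", "compare")):
        return "comparative"
    if q_lower.startswith(("how do", "how can", "how to")):
        return "procedural"
    return "factual"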

The Dataset Audit

This is where most teams skip to the fun parts and pay for it later. A proper dataset audit answers four questions:

flowchart TD
    A[Your Document Corpus] --> B{Coverage Audit}
    B --> C[Which query types are covered?]
    B --> D[Which are NOT covered?]
    C --> E[Fully covered]
    C --> F[Partially covered]
    D --> G[Gap — data you need to create or source]
    F --> H[Quality problem — data exists but is stale/incomplete]
    E --> I[Proceed to chunking design]
    H --> J[Data remediation before RAG build]
    G --> J
    J --> I
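In code, the coverage audit reduces to a join between query types and documents. A sketch, assuming you have already labeled each document with the query types it can answer; the labels and the thinness threshold are assumptions:

def coverage_audit(
    query_type_counts: dict[str, int],
    doc_query_types: dict[str, set[str]],  # doc_id -> query types it answers
    thin_threshold: int = 3,
) -> dict[str, str]:
    """Classify each observed query type as covered, partial, or a gap."""
    report = {}
    for qtype in query_type_counts:
        n_docs = sum(1 for types in doc_query_types.values() if qtype in types)
        if n_docs == 0:
            report[qtype] = "gap"
        elif n_docs < thin_threshold:
            report[qtype] = "partial"
        else:
            report[qtype] = "covered"
    return report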

Question 1: What format is your data in?

PDFs with embedded tables, HTML with navigation menus baked in, Word docs with tracked changes — all of these require different preprocessing. Know this upfront.
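Even a trivial inventory makes the preprocessing workload visible. A minimal sketch that counts files by extension (real corpora will still need per-format parsers):

from collections import Counter
from pathlib import Path

def inventory_formats(corpus_dir: str) -> Counter:
    """Count files by extension so preprocessing effort is visible upfront."""
    return Counter(
        p.suffix.lower() or "(no extension)"
        for p in Path(corpus_dir).rglob("*")
        if p.is_file()
    )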

Question 2: How stale is it?

A document last updated 18 months ago is a liability, not an asset. Track update timestamps and flag anything older than your acceptable staleness threshold.

from datetime import datetime, timedelta
from pathlib import Path
 
def audit_corpus_freshness(
    corpus_dir: str,
    stale_threshold_days: int = 180
) -> dict:
    stale = []
    fresh = []
    threshold = datetime.now() - timedelta(days=stale_threshold_days)
 
    for path in Path(corpus_dir).rglob("*.md"):
        mtime = datetime.fromtimestamp(path.stat().st_mtime)
        entry = {"file": str(path), "last_modified": mtime.isoformat()}
        if mtime < threshold:
            stale.append(entry)
        else:
            fresh.append(entry)
 
    return {
        "total": len(stale) + len(fresh),
        "stale_count": len(stale),
        "stale_pct": round(len(stale) / max(1, len(stale) + len(fresh)) * 100, 1),
        "stale_files": stale,
    }

Question 3: Is the content actually answerable?

Sample 50 documents and ask: if a user asked a question whose answer is in this document, would the document contain enough context to answer it standalone? Procedural docs often reference other docs, assume shared context, or use internal jargon without definition.
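The judgment here is human, but the sampling should at least be reproducible so a second reviewer sees the same documents. A minimal sketch; the .md glob is an assumption about your corpus:

import random
from pathlib import Path

def sample_for_review(corpus_dir: str, n: int = 50, seed: int = 42) -> list[str]:
    """Draw a reproducible random sample of documents for manual review."""
    paths = sorted(str(p) for p in Path(corpus_dir).rglob("*.md"))
    return random.Random(seed).sample(paths, min(n, len(paths)))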

Question 4: Who owns this data and how does it change?

RAG pipelines that don't account for data changing underneath them decay over time. You need a re-indexing strategy from day one, which means knowing who updates documents and how frequently.
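One common minimal approach is content hashing: hash each document at index time and re-index only what changed. A sketch, assuming you persist the hashes from the previous indexing run:

import hashlib
from pathlib import Path

def detect_changed_docs(corpus_dir: str, known_hashes: dict[str, str]) -> list[str]:
    """Return paths whose content differs from the last indexed version."""
    changed = []
    for path in Path(corpus_dir).rglob("*.md"):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if known_hashes.get(str(path)) != digest:
            changed.append(str(path))
    return changed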

The Dataset You Don't Have

Here's the uncomfortable truth: the most valuable dataset for building a RAG system is the one that maps user questions to correct answers — and you almost certainly don't have it.

You need a golden dataset of (query, relevant_document_chunks, correct_answer) triples. Without it, you cannot measure retrieval recall, you cannot tune your chunking or embedding strategy, and you cannot detect when a change makes things worse.

How to build one when you're starting from zero:

from typing import NamedTuple
 
class GoldenExample(NamedTuple):
    query: str
    relevant_chunk_ids: list[str]
    expected_answer: str
    difficulty: str  # "easy" | "medium" | "hard"
    query_type: str  # "factual" | "procedural" | "comparative"
 
def bootstrap_golden_set(
    existing_queries: list[str],
    documents: dict[str, str],  # chunk_id -> text
    n_target: int = 200
) -> list[GoldenExample]:
    """
    Strategy: take your best real queries, manually identify
    the correct chunks, and write the expected answer.
    This is slow. Do it anyway.
    """
    # This function is intentionally skeletal — the work is human.
    # Automate the scaffolding, not the judgment.
    examples = []
    for query in existing_queries[:n_target]:
        # You fill these in manually or with LLM assistance + human review
        examples.append(GoldenExample(
            query=query,
            relevant_chunk_ids=[],   # human identifies these
            expected_answer="",      # human writes this
            difficulty="medium",
            query_type="factual",
        ))
    return examples

Aim for 200 examples minimum before you consider your system "evaluatable." Split 60/20/20 into dev/test/holdout.
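Once the golden set has labeled chunks, retrieval recall falls out directly. A minimal sketch of recall@k over the GoldenExample records above; the retrieve function is a placeholder for whatever retriever you're evaluating:

from typing import Callable

def recall_at_k(
    golden: list[GoldenExample],
    retrieve: Callable[[str], list[str]],  # query -> ranked chunk_ids
    k: int = 5,
) -> float:
    """Fraction of labeled examples with at least one relevant
    chunk in the top-k retrieved results."""
    hits = scored = 0
    for ex in golden:
        if not ex.relevant_chunk_ids:
            continue  # skip examples not yet labeled
        scored += 1
        if set(retrieve(ex.query)[:k]) & set(ex.relevant_chunk_ids):
            hits += 1
    return hits / max(1, scored)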

The Retrieval Contract

Once you have your problem mapped and your data audited, write down the retrieval contract. This is a one-page document that specifies:

  • Scope: what questions this system is expected to answer
  • Out of scope: what it explicitly won't answer (and what happens when users ask anyway)
  • Data sources: exactly which document sets are included, with version/date
  • Acceptable latency: p50 and p99 end-to-end response time
  • Acceptable accuracy: what recall@k target you're optimizing for

This document does two things: it forces clarity before you build, and it gives you a reference point when stakeholders later say "why doesn't it know X?" The answer might be "X is out of scope per the contract we agreed on."
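The contract itself is prose, but pinning the measurable parts in a machine-readable form keeps them from drifting. One way to do it, with illustrative field names and targets:

from dataclasses import dataclass

@dataclass
class RetrievalContract:
    """Machine-checkable slice of the retrieval contract."""
    scope: list[str]               # query types the system must answer
    out_of_scope: list[str]        # query types it explicitly refuses
    data_sources: dict[str, str]   # source name -> version/date
    p50_latency_ms: int = 800      # illustrative targets, not recommendations
    p99_latency_ms: int = 3000
    recall_at_5_target: float = 0.85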

What to Validate Before Moving Forward

Before you design your chunking strategy or pick a vector database, validate these:

  1. You have at least 50 real user queries to work from
  2. Your corpus covers the answer space for those queries (even partially)
  3. You know which documents are stale and have a plan for them
  4. You have or can build a golden evaluation set
  5. Latency and cost requirements are defined

If any of these are red, fix them. Building an elaborate pipeline on top of bad data or undefined requirements wastes weeks.

Key Takeaways

  • RAG solves a specific problem — closed-domain question answering over private documents. Make sure that's actually your problem before building.
  • Audit your corpus before touching code: format, freshness, coverage, and ownership.
  • The query distribution shapes every downstream decision — collect real queries first.
  • The most important dataset you don't have is a golden set of (query, relevant chunks, correct answer) triples. Build it manually.
  • Write a retrieval contract that defines scope, data sources, and success metrics before you start chunking.
  • Teams that skip this step ship pipelines that are expensive to fix, because the problems are architectural, not algorithmic.