Why Your RAG Retrieval Is Wrong
Your RAG system returns confident answers. Some of them are wrong — and the LLM isn't the problem. The retrieval layer is.
This happens because retrieval failures are silent. The model doesn't say "I couldn't find anything relevant." It hallucinates, or it answers from a slightly-wrong chunk and the answer looks plausible. By the time a user catches it, you have no idea which retrieval step failed.
Here's where retrieval actually breaks, and what to do about each failure.
The Chunking Tax You're Ignoring
Every tutorial tells you to chunk at 512 tokens. That number is cargo-culted from early embedding model context limits and has nothing to do with your documents.
The right chunk size is a function of:
- Embedding model context window — Most modern models handle 8k tokens. Smaller chunks waste capacity.
- Document structure — Legal contracts and code are paragraph-coherent. Wikipedia articles are section-coherent. Chunking across natural boundaries destroys meaning.
- Query length distribution — Short keyword queries retrieve differently than long natural-language questions.
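Before committing to a number, it helps to look at the actual token-length distribution of your natural units (paragraphs, sections) and of your queries. A quick sketch with tiktoken, assuming `paragraphs` is a list of strings pulled from your corpus:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
lengths = sorted(len(enc.encode(p)) for p in paragraphs)

# Hypothetical readout: if the 95th-percentile paragraph is ~380 tokens, a
# 512-token chunk fits whole paragraphs; a 128-token chunk guarantees cuts.
print("p50:", lengths[len(lengths) // 2])
print("p95:", lengths[int(len(lengths) * 0.95)])
```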
What Actually Goes Wrong
Chunk too small: A sentence fragment embeds fine in isolation but its meaning depends on the surrounding context. The retriever finds it, the LLM reads it, and the answer is wrong because the crucial qualifier was in the previous chunk.
Chunk too large: The chunk has high semantic coverage but the relevant sentence is buried. The embedding averages across the whole chunk. Relevance score dilutes. You miss it.
Fixed-size splits mid-sentence: The classic mistake. RecursiveCharacterTextSplitter with chunk_size=512, chunk_overlap=50 falls back to splitting on whitespace (or raw characters) when no paragraph break lands near the limit. A number, a date, a procedure step — all cut in half.
```python
# What most people do — wrong for structured docs
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
```
```python
# What you should do for structured docs
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,  # keep headers in chunk for context
)
```

The Parent-Child Pattern
For dense technical documents, use parent-child chunking: embed small child chunks for retrieval precision, but return the parent chunk to the LLM for context.
```python
from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, child, grandchild
)
nodes = parser.get_nodes_from_documents(documents)
# Index the 128-token nodes; return the parent 512-token node on a match
```

This gives you the precision of small chunks and the coherence of large ones.
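At query time, llama_index's AutoMergingRetriever handles the child-to-parent swap: it searches over leaf nodes and, when enough children of one parent match, returns the parent instead. A minimal sketch, assuming the `nodes` produced above and a default in-memory docstore:

```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Keep every node (parents included) in the docstore so parents can be looked up
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Only the smallest (leaf) nodes go into the vector index
leaf_index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)

retriever = AutoMergingRetriever(
    leaf_index.as_retriever(similarity_top_k=12),
    storage_context,
)
results = retriever.retrieve("what is the termination notice period?")  # illustrative query
```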
Recall vs Precision — You're Optimizing the Wrong One
Most teams look at whether retrieved chunks are relevant (precision). Almost nobody measures whether relevant chunks were retrieved at all (recall).
| Metric | What it tells you | What it misses |
|---|---|---|
| Precision@K | Are the top-K results good? | You may have missed the best result |
| Recall@K | Did you find the right chunks? | Doesn't care about rank order |
| MRR | Is the first good result early? | Ignores tail coverage |
| NDCG | Graded relevance + rank position | Requires graded labels |
A high-precision retriever that misses the best chunk will cause the LLM to answer from second-best evidence. This produces answers that are technically sourced but subtly wrong.
The practical fix: Retrieve more candidates (top-20 or top-50) and let a re-ranker do precision filtering. Never tune your vector search for precision — it's the wrong layer for that job.
```python
# Don't do this — you're doing the re-ranker's job poorly
results = vector_store.similarity_search(query, k=3)

# Do this — high recall, delegate precision
# (`reranker` here stands for any re-rank interface: a cross-encoder, Cohere Rerank, etc.)
candidates = vector_store.similarity_search(query, k=20)
results = reranker.rerank(query, candidates, top_n=3)
```

The Embedding Model Mismatch
Your embedding model was trained on general text. Your documents are financial contracts, medical records, or internal engineering wikis. The semantic space doesn't match.
This isn't a theoretical problem. On domain-specific retrieval benchmarks, a general embedding model can underperform a fine-tuned one by 15–30% in recall@10.
```python
# Check your embedding model's training domain before committing
from sentence_transformers import SentenceTransformer

# General purpose — good baseline, bad for specialized domains
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Domain-specific alternatives
# Medical: "pritamdeka/S-PubMedBert-MS-MARCO"
# Legal: "nlpaueb/legal-bert-base-uncased"
# Code: "microsoft/codebert-base"
```

Run a quick recall@10 test on 50 representative queries before choosing a model. This takes an afternoon and can save weeks of debugging.
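That test doesn't need a framework. Here is a sketch of a side-by-side recall@10 check with sentence-transformers, assuming you already have `queries` (list of strings), `chunks` (dict of chunk id to text), and `relevant_ids` (one set of ground-truth chunk ids per query):

```python
from sentence_transformers import SentenceTransformer, util

def recall_at_10(model_name: str, queries, chunks, relevant_ids) -> float:
    model = SentenceTransformer(model_name)
    chunk_ids = list(chunks)
    chunk_emb = model.encode([chunks[cid] for cid in chunk_ids], convert_to_tensor=True)
    query_emb = model.encode(queries, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_emb, top_k=10)  # top-10 per query
    scores = []
    for query_hits, relevant in zip(hits, relevant_ids):
        retrieved = {chunk_ids[h["corpus_id"]] for h in query_hits}
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

# "your-domain-candidate" is a placeholder; swap in the models you're comparing
for name in ["BAAI/bge-large-en-v1.5", "your-domain-candidate"]:
    print(name, recall_at_10(name, queries, chunks, relevant_ids))
```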
Where Re-Ranking Actually Belongs
The re-ranker is not a replacement for good retrieval — it's a precision layer on top of high-recall retrieval. If your vector search is only returning 5 candidates, the re-ranker has nothing to work with.
Common misplacement mistakes:
- Re-ranking before fusion — You re-rank vector results, then merge with keyword results. The merged set is now out of order.
- Re-ranking too few candidates — Re-ranking 5 candidates when the right answer was rank 8.
- Skipping re-ranking entirely — Raw cosine similarity is not a good enough relevance signal for production, and a basic cross-encoder re-ranker is only a few lines (see the sketch below).
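Here is a minimal sketch of that precision layer using a public cross-encoder from sentence-transformers. It assumes the `query` and the 20 candidates from the earlier snippet, that candidates are LangChain Document objects, and the model name is one common choice rather than a recommendation:

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_n: int = 3) -> list:
    # Score every (query, chunk) pair jointly, a much stronger signal than cosine similarity
    scores = cross_encoder.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

results = rerank(query, candidates, top_n=3)
```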
The Eval Gap That Hides Everything
Bad retrieval hides behind LLM quality. If your LLM is good enough, it can often compensate for mediocre retrieval. You see "good enough" answers, you ship, and the failure mode stays invisible until a high-stakes query exposes it.
The fix is a dedicated retrieval eval, separate from end-to-end eval.
```python
# Minimal retrieval eval harness
def evaluate_retrieval(retriever, eval_set: list[dict]) -> dict:
    """
    eval_set: [{"query": str, "relevant_doc_ids": list[str]}]
    """
    recall_scores = []
    precision_scores = []
    for item in eval_set:
        results = retriever.retrieve(item["query"], k=10)
        retrieved_ids = [r.doc_id for r in results]
        relevant = set(item["relevant_doc_ids"])
        retrieved = set(retrieved_ids)
        recall = len(relevant & retrieved) / len(relevant) if relevant else 0
        precision = len(relevant & retrieved) / len(retrieved) if retrieved else 0
        recall_scores.append(recall)
        precision_scores.append(precision)
    return {
        "recall@10": sum(recall_scores) / len(recall_scores),
        "precision@10": sum(precision_scores) / len(precision_scores),
    }
```

Build this eval set with 50–100 queries where you know the ground-truth relevant chunks. You don't need thousands. You need enough to catch regressions when you change chunk size, embedding model, or retrieval strategy.
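The eval set itself is just a list of hand-labeled queries. The ids below are placeholders for whatever your document store uses:

```python
eval_set = [
    {"query": "What is the refund window for annual plans?",
     "relevant_doc_ids": ["billing-policy-03"]},
    {"query": "How do we rotate the API signing key?",
     "relevant_doc_ids": ["security-runbook-12", "security-runbook-13"]},
]
metrics = evaluate_retrieval(retriever, eval_set)
print(metrics)  # {"recall@10": ..., "precision@10": ...}
```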
Query-Side Problems
Your retrieval degrades because the query going into the vector store is often not the right query.
Conversational context lost: User asks "what about the refund policy?" in message 5 of a chat. The standalone query is meaningless without context from messages 1–4.
Fix: Query rewriting before retrieval.
```python
from openai import OpenAI

client = OpenAI()

def rewrite_query_for_retrieval(
    conversation_history: list[dict],
    current_query: str,
) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "Given a conversation history and a follow-up question, "
                "rewrite the follow-up as a standalone search query. "
                "Return only the rewritten query, no explanation."
            ),
        },
        *conversation_history[-4:],  # last 4 turns
        {"role": "user", "content": current_query},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

Vocabulary mismatch: The user says "heart attack," the document says "myocardial infarction." Embeddings help here but don't fully solve it. Hybrid search (BM25 + vectors) handles this better — exact keyword match catches the synonym you missed.
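One way to wire that up in LangChain is an EnsembleRetriever that fuses a BM25 retriever with the dense one, reusing the `documents` and `vector_store` from earlier snippets. Import paths vary across LangChain versions (this sketch follows the langchain_community layout), and the weights are a starting point to tune, not a rule:

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse keyword retriever over the same chunks that were embedded
bm25 = BM25Retriever.from_documents(documents, k=20)
dense = vector_store.as_retriever(search_kwargs={"k": 20})

# Weighted reciprocal-rank fusion of the two candidate lists
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
candidates = hybrid.invoke("heart attack treatment guidelines")
```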
Metadata Filtering Is Not Free
Adding metadata filters (where: {category: "finance"}) seems like it tightens results. Often it narrows the candidate pool so aggressively that the correct chunk is filtered out.
Test every metadata filter against your recall metric before deploying it. A filter that improves precision by 5% but drops recall by 20% is a net loss.
```python
# Profile the effect of filters on recall
def compare_recall_with_filter(retriever, eval_set, filter_fn=None):
    baseline = evaluate_retrieval(retriever, eval_set)
    # .with_filter is whatever your retriever wrapper exposes for metadata filters
    filtered_retriever = retriever.with_filter(filter_fn) if filter_fn else retriever
    filtered = evaluate_retrieval(filtered_retriever, eval_set)
    print(f"Baseline recall@10: {baseline['recall@10']:.3f}")
    print(f"Filtered recall@10: {filtered['recall@10']:.3f}")
    print(f"Delta: {filtered['recall@10'] - baseline['recall@10']:+.3f}")
```

Key Takeaways
- Chunk size is a hyperparameter specific to your documents — 512 tokens is not a default, it's a guess.
- Optimize retrieval for recall, not precision; precision is the re-ranker's job.
- Measure retrieval recall separately from end-to-end RAG quality — LLM quality masks retrieval failures.
- Re-rank over a large candidate set (20–50), not over 3–5 results.
- Rewrite conversational queries into standalone search queries before hitting the vector store.
- Metadata filters narrow recall — validate every filter against ground-truth retrieval benchmarks before shipping.