Why Your RAG Retrieval Is Wrong
Your RAG system returns confident answers. Some of them are wrong — and the LLM isn't the problem. The retrieval layer is.
This happens because retrieval failures are silent. The model doesn't say "I couldn't find anything relevant." It hallucinates, or it answers from a slightly-wrong chunk and the answer looks plausible. By the time a user catches it, you have no idea which retrieval step failed.
Here's where retrieval actually breaks, and what to do about each failure.
The Chunking Tax You're Ignoring
Every tutorial tells you to chunk at 512 tokens. That number is cargo-culted from early embedding model context limits and has nothing to do with your documents.
The right chunk size is a function of:
- Embedding model context window — Most modern models handle 8k tokens. Smaller chunks waste capacity.
- Document structure — Legal contracts and code are paragraph-coherent. Wikipedia articles are section-coherent. Chunking across natural boundaries destroys meaning.
- Query length distribution — Short keyword queries retrieve differently than long natural-language questions.
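Before committing to a number, it helps to look at the actual token-length distribution of your natural units (paragraphs, sections) and of your queries. A quick sketch with tiktoken, assuming `paragraphs` is a list of strings pulled from your corpus:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
lengths = sorted(len(enc.encode(p)) for p in paragraphs)

# Hypothetical readout: if the 95th-percentile paragraph is ~380 tokens, a
# 512-token chunk fits whole paragraphs; a 128-token chunk guarantees cuts.
print("p50:", lengths[len(lengths) // 2])
print("p95:", lengths[int(len(lengths) * 0.95)])
```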
What Actually Goes Wrong
Chunk too small: A sentence fragment embeds fine in isolation but its meaning depends on the surrounding context. The retriever finds it, the LLM reads it, and the answer is wrong because the crucial qualifier was in the previous chunk.
Chunk too large: The chunk has high semantic coverage but the relevant sentence is buried. The embedding averages across the whole chunk. Relevance score dilutes. You miss it.
Fixed-size splits mid-sentence: The classic mistake. RecursiveCharacterTextSplitter with chunk_size=512, chunk_overlap=50 falls back to splitting on whitespace (or raw characters) when no paragraph break lands near the limit. A number, a date, a procedure step — all cut in half.
```python
# What most people do — wrong for structured docs
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
```
```python
# What you should do for structured docs
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,  # keep headers in chunk for context
)
```

The Parent-Child Pattern
For dense technical documents, use parent-child chunking: embed small child chunks for retrieval precision, but return the parent chunk to the LLM for context.
```python
from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, child, grandchild
)
nodes = parser.get_nodes_from_documents(documents)
# Index the 128-token nodes; return the parent 512-token node on a match
```

This gives you the precision of small chunks and the coherence of large ones.
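At query time, llama_index's AutoMergingRetriever handles the child-to-parent swap: it searches over leaf nodes and, when enough children of one parent match, returns the parent instead. A minimal sketch, assuming the `nodes` produced above and a default in-memory docstore:

```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Keep every node (parents included) in the docstore so parents can be looked up
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Only the smallest (leaf) nodes go into the vector index
leaf_index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)

retriever = AutoMergingRetriever(
    leaf_index.as_retriever(similarity_top_k=12),
    storage_context,
)
results = retriever.retrieve("what is the termination notice period?")  # illustrative query
```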
Recall vs Precision — You're Optimizing the Wrong One
Most teams look at whether retrieved chunks are relevant (precision). Almost nobody measures whether relevant chunks were retrieved at all (recall).
| Metric | What it tells you | What it misses |
|---|---|---|
| Precision@K | Are the top-K results good? | You may have missed the best result |
| Recall@K | Did you find the right chunks? | Doesn't care about rank order |
| MRR | Is the first good result early? | Ignores tail coverage |
| NDCG | Graded relevance + rank position | Requires graded labels |
A high-precision retriever that misses the best chunk will cause the LLM to answer from second-best evidence. This produces answers that are technically sourced but subtly wrong.
The practical fix: Retrieve more candidates (top-20 or top-50) and let a re-ranker do precision filtering. Never tune your vector search for precision — it's the wrong layer for that job.
```python
# Don't do this — you're doing the re-ranker's job poorly
results = vector_store.similarity_search(query, k=3)

# Do this — high recall, delegate precision
# (`reranker` here stands for any re-rank interface: a cross-encoder, Cohere Rerank, etc.)
candidates = vector_store.similarity_search(query, k=20)
results = reranker.rerank(query, candidates, top_n=3)
```

The Embedding Model Mismatch
Your embedding model was trained on general text. Your documents are financial contracts, medical records, or internal engineering wikis. The semantic space doesn't match.
This isn't a theoretical problem. On domain-specific retrieval benchmarks, a general embedding model can underperform a fine-tuned one by 15–30% in recall@10.
```python
# Check your embedding model's training domain before committing
from sentence_transformers import SentenceTransformer

# General purpose — good baseline, bad for specialized domains
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Domain-specific alternatives
# Medical: "pritamdeka/S-PubMedBert-MS-MARCO"
# Legal: "nlpaueb/legal-bert-base-uncased"
# Code: "microsoft/codebert-base"
```

Run a quick recall@10 test on 50 representative queries before choosing a model. This takes an afternoon and can save weeks of debugging.
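That test doesn't need a framework. Here is a sketch of a side-by-side recall@10 check with sentence-transformers, assuming you already have `queries` (list of strings), `chunks` (dict of chunk id to text), and `relevant_ids` (one set of ground-truth chunk ids per query):

```python
from sentence_transformers import SentenceTransformer, util

def recall_at_10(model_name: str, queries, chunks, relevant_ids) -> float:
    model = SentenceTransformer(model_name)
    chunk_ids = list(chunks)
    chunk_emb = model.encode([chunks[cid] for cid in chunk_ids], convert_to_tensor=True)
    query_emb = model.encode(queries, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, chunk_emb, top_k=10)  # top-10 per query
    scores = []
    for query_hits, relevant in zip(hits, relevant_ids):
        retrieved = {chunk_ids[h["corpus_id"]] for h in query_hits}
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

# "your-domain-candidate" is a placeholder; swap in the models you're comparing
for name in ["BAAI/bge-large-en-v1.5", "your-domain-candidate"]:
    print(name, recall_at_10(name, queries, chunks, relevant_ids))
```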
Where Re-Ranking Actually Belongs
The re-ranker is not a replacement for good retrieval — it's a precision layer on top of high-recall retrieval. If your vector search is only returning 5 candidates, the re-ranker has nothing to work with.
Common misplacement mistakes:
- Re-ranking before fusion — You re-rank vector results, then merge with keyword results. The merged set is now out of order.
- Re-ranking too few candidates — Re-ranking 5 candidates when the right answer was rank 8.
- Skipping re-ranking entirely — Raw cosine similarity is not a good enough relevance signal for production, and a basic cross-encoder re-ranker is only a few lines (see the sketch below).
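Here is a minimal sketch of that precision layer using a public cross-encoder from sentence-transformers. It assumes the `query` and the 20 candidates from the earlier snippet, that candidates are LangChain Document objects, and the model name is one common choice rather than a recommendation:

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_n: int = 3) -> list:
    # Score every (query, chunk) pair jointly, a much stronger signal than cosine similarity
    scores = cross_encoder.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

results = rerank(query, candidates, top_n=3)
```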
The Eval Gap That Hides Everything
Bad retrieval hides behind LLM quality. If your LLM is good enough, it can often compensate for mediocre retrieval. You see "good enough" answers, you ship, and the failure mode stays invisible until a high-stakes query exposes it.
The fix is a dedicated retrieval eval, separate from end-to-end eval.
```python
# Minimal retrieval eval harness
def evaluate_retrieval(retriever, eval_set: list[dict]) -> dict:
    """
    eval_set: [{"query": str, "relevant_doc_ids": list[str]}]
    """
    recall_scores = []
    precision_scores = []
    for item in eval_set:
        results = retriever.retrieve(item["query"], k=10)
        retrieved_ids = [r.doc_id for r in results]
        relevant = set(item["relevant_doc_ids"])
        retrieved = set(retrieved_ids)
        recall = len(relevant & retrieved) / len(relevant) if relevant else 0
        precision = len(relevant & retrieved) / len(retrieved) if retrieved else 0
        recall_scores.append(recall)
        precision_scores.append(precision)
    return {
        "recall@10": sum(recall_scores) / len(recall_scores),
        "precision@10": sum(precision_scores) / len(precision_scores),
    }
```

Build this eval set with 50–100 queries where you know the ground-truth relevant chunks. You don't need thousands. You need enough to catch regressions when you change chunk size, embedding model, or retrieval strategy.
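The eval set itself is just a list of hand-labeled queries. The ids below are placeholders for whatever your document store uses:

```python
eval_set = [
    {"query": "What is the refund window for annual plans?",
     "relevant_doc_ids": ["billing-policy-03"]},
    {"query": "How do we rotate the API signing key?",
     "relevant_doc_ids": ["security-runbook-12", "security-runbook-13"]},
]
metrics = evaluate_retrieval(retriever, eval_set)
print(metrics)  # {"recall@10": ..., "precision@10": ...}
```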
Query-Side Problems
Your retrieval degrades because the query going into the vector store is often not the right query.
Conversational context lost: User asks "what about the refund policy?" in message 5 of a chat. The standalone query is meaningless without context from messages 1–4.
Fix: Query rewriting before retrieval.
```python
from openai import OpenAI

client = OpenAI()

def rewrite_query_for_retrieval(
    conversation_history: list[dict],
    current_query: str,
) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "Given a conversation history and a follow-up question, "
                "rewrite the follow-up as a standalone search query. "
                "Return only the rewritten query, no explanation."
            ),
        },
        *conversation_history[-4:],  # last 4 turns
        {"role": "user", "content": current_query},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

Vocabulary mismatch: The user says "heart attack," the document says "myocardial infarction." Embeddings help here but don't fully solve it. Hybrid search (BM25 + vectors) handles this better — exact keyword match catches the synonym you missed.
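One way to wire that up in LangChain is an EnsembleRetriever that fuses a BM25 retriever with the dense one, reusing the `documents` and `vector_store` from earlier snippets. Import paths vary across LangChain versions (this sketch follows the langchain_community layout), and the weights are a starting point to tune, not a rule:

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse keyword retriever over the same chunks that were embedded
bm25 = BM25Retriever.from_documents(documents, k=20)
dense = vector_store.as_retriever(search_kwargs={"k": 20})

# Weighted reciprocal-rank fusion of the two candidate lists
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
candidates = hybrid.invoke("heart attack treatment guidelines")
```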
Metadata Filtering Is Not Free
Adding metadata filters (where: {category: "finance"}) seems like it tightens results. Often it narrows the candidate pool so aggressively that the correct chunk is filtered out.
Test every metadata filter against your recall metric before deploying it. A filter that improves precision by 5% but drops recall by 20% is a net loss.
```python
# Profile the effect of filters on recall
def compare_recall_with_filter(retriever, eval_set, filter_fn=None):
    baseline = evaluate_retrieval(retriever, eval_set)
    # .with_filter is whatever your retriever wrapper exposes for metadata filters
    filtered_retriever = retriever.with_filter(filter_fn) if filter_fn else retriever
    filtered = evaluate_retrieval(filtered_retriever, eval_set)
    print(f"Baseline recall@10: {baseline['recall@10']:.3f}")
    print(f"Filtered recall@10: {filtered['recall@10']:.3f}")
    print(f"Delta: {filtered['recall@10'] - baseline['recall@10']:+.3f}")
```

Key Takeaways
- Chunk size is a hyperparameter specific to your documents — 512 tokens is not a default, it's a guess.
- Optimize retrieval for recall, not precision; precision is the re-ranker's job.
- Measure retrieval recall separately from end-to-end RAG quality — LLM quality masks retrieval failures.
- Re-rank over a large candidate set (20–50), not over 3–5 results.
- Rewrite conversational queries into standalone search queries before hitting the vector store.
- Metadata filters narrow recall — validate every filter against ground-truth retrieval benchmarks before shipping.