Building Production RAG

Chunking Strategies That Actually Move Recall

Ravinder · 7 min read
RAG · AI · LLM · Text Processing · Information Retrieval

Chunking is the unglamorous foundation of every RAG system, and it's the decision teams get wrong most often. Pick chunks too large and your retrieval becomes imprecise — you return a page when you needed a paragraph. Pick chunks too small and you strip the context that makes an answer coherent. Neither failure mode shows up dramatically; they both just slowly erode your recall numbers until users stop trusting the system.

This post covers the four chunking approaches that actually matter, when each wins, and the specific parameters to tune.

Why Chunking Drives Recall

Embedding models have a limited context window; for many production models it's around 512 tokens. Embed a 2,000-token document as a single vector and the model truncates what doesn't fit and averages over what remains — the dense representation drifts toward "everything" and becomes specific to nothing.

Retrieval works by comparing the query embedding to chunk embeddings. The closer the match, the higher the score. A chunk that's tightly scoped to a single concept will score much higher against a relevant query than a chunk that spans three loosely related concepts.

graph LR
    Q[User Query] --> QE[Query Embedding]
    D[Document] --> C1[Chunk 1\n~150 tokens]
    D --> C2[Chunk 2\n~150 tokens]
    D --> C3[Chunk 3\n~150 tokens]
    C1 --> E1[Embedding 1]
    C2 --> E2[Embedding 2]
    C3 --> E3[Embedding 3]
    QE -->|cosine sim| E1
    QE -->|cosine sim| E2
    QE -->|cosine sim| E3
    E2 -->|highest score| R[Retrieved Chunk]

The retrieval system can only return what's been indexed. If the answer to a user's question spans chunk boundaries, you're likely to miss it. That's the core tension chunking strategy must resolve.
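
To make the scoring concrete, here's a minimal sketch of that comparison, assuming normalized embeddings from a sentence-transformers model (the model name and the example strings are placeholders, not part of the original pipeline):

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical example: score one query against three chunk embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; use whatever you deploy

chunks = [
    "How to rotate API keys for the billing service.",
    "Quarterly revenue recognition policy, billing overview, and refund rules.",
    "Office seating chart and parking assignments.",
]
query = "How do I rotate my API key?"

chunk_emb = model.encode(chunks, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_emb @ query_emb
best = int(np.argmax(scores))
print(best, float(scores[best]))  # the tightly scoped first chunk should score highest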

Strategy 1: Fixed-Size with Overlap

The baseline. Split every document into N-token windows with a K-token overlap between consecutive chunks. Simple, predictable, and often good enough for homogenous corpora.

from langchain.text_splitter import RecursiveCharacterTextSplitter
 
def fixed_chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,  # counts characters; pass a token-based length function for true token sizing
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)
 
# Compare chunk counts and lengths across overlap values before running
# retrieval metrics; sample_doc is any representative document from your corpus.
results = {}
for overlap in [0, 32, 64, 128]:
    chunks = fixed_chunk(sample_doc, chunk_size=512, overlap=overlap)
    results[f"overlap_{overlap}"] = {
        "n_chunks": len(chunks),
        "avg_len": sum(len(c) for c in chunks) / len(chunks),
    }

When it wins: Uniform documents like support tickets, short articles, or product reviews where content doesn't have strong structural boundaries.

Where it breaks: Technical documentation with tables, API references, or any document where the logical unit doesn't align with character count.

The overlap parameter matters more than most teams realize. An overlap of 10–15% of chunk size is the practical default. Go below 5% and you lose cross-boundary context. Go above 25% and you index redundant content that inflates storage and confuses retrieval scores.

Strategy 2: Recursive Semantic Splitting

Instead of splitting on token count alone, split on structural boundaries — paragraphs first, then sentences, falling back to raw characters only when necessary. The RecursiveCharacterTextSplitter above already works this way; its separator list is what drives the behavior.

A better version builds document-type awareness into the separator hierarchy:

def make_splitter(doc_type: str, chunk_size: int = 512) -> RecursiveCharacterTextSplitter:
    separators_by_type = {
        "markdown": ["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
        "code": ["\nclass ", "\ndef ", "\n\n", "\n", " "],
        "prose": ["\n\n", "\n", ". ", "? ", "! ", " "],
        "html": ["</p>", "<br>", "\n\n", "\n", ". ", " "],
    }
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.12),  # ~12% overlap, within the 10–15% default range
        separators=separators_by_type.get(doc_type, separators_by_type["prose"]),
    )

When it wins: Mixed corpora where documents have different structures. Preserves paragraph and section integrity better than fixed-size.
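
Usage mirrors the fixed-size case: pick the splitter by document type at indexing time. The doc_type value and the markdown_doc variable below are assumptions about your ingestion pipeline, not part of the splitter itself:

# Hypothetical ingestion step: doc_type comes from your pipeline's metadata.
md_splitter = make_splitter("markdown", chunk_size=512)
chunks = md_splitter.split_text(markdown_doc)  # markdown_doc: any Markdown string from your corpus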

Strategy 3: Semantic Chunking with Embedding-Based Splits

This is the strategy that actually surprises people with recall improvements. Instead of splitting on structural markers, embed consecutive sentences and split when the cosine similarity between adjacent sentences drops below a threshold. Splits happen at topic transitions.

import numpy as np
from sentence_transformers import SentenceTransformer
 
def semantic_chunk(
    text: str,
    model: SentenceTransformer,
    threshold: float = 0.75,
    min_chunk_sentences: int = 3,
    max_chunk_sentences: int = 12,
) -> list[str]:
    # Naive sentence split on ". "; swap in a real sentence tokenizer for production text
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) < 2:
        return [text]
 
    embeddings = model.encode(sentences, normalize_embeddings=True)
 
    # Compute consecutive cosine similarities
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]
 
    # Split where similarity drops below threshold
    split_points = [0]
    current_chunk_size = 0
    for i, sim in enumerate(similarities):
        current_chunk_size += 1
        should_split = (
            sim < threshold and current_chunk_size >= min_chunk_sentences
        ) or current_chunk_size >= max_chunk_sentences
        if should_split:
            split_points.append(i + 1)
            current_chunk_size = 0
    split_points.append(len(sentences))
 
    chunks = []
    for start, end in zip(split_points, split_points[1:]):
        chunk_text = ". ".join(sentences[start:end])
        if chunk_text:
            chunks.append(chunk_text)
    return chunks

When it wins: Long-form documents that cover multiple topics — blog posts, whitepapers, legal documents. Recall improvements of 8–15 percentage points over fixed-size are common in these corpora.

Cost: You're running the embedding model at indexing time to determine splits, so indexing takes ~2x longer. Worth it if your documents are rich and heterogeneous.
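
A quick way to try it is the sketch below, assuming the sentence-transformers model shown; swap in whatever embedder you already run and treat whitepaper_text as a stand-in for one of your long-form documents:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
chunks = semantic_chunk(whitepaper_text, model, threshold=0.75)
print(len(chunks), [len(c) for c in chunks[:3]])  # sanity-check chunk count and sizes

The 0.75 default is a starting point, not a tuned value; sweep it against your golden set the same way you would sweep overlap.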

Strategy 4: Hierarchical / Parent-Child Chunking

Index small chunks for retrieval precision, but when you retrieve a match, return its parent chunk for LLM context. This separates the retrieval unit from the generation unit.

from dataclasses import dataclass, field
 
@dataclass
class HierarchicalChunk:
    chunk_id: str
    parent_id: str | None
    text: str
    level: int  # 0 = parent, 1 = child
    children: list["HierarchicalChunk"] = field(default_factory=list)
 
def build_hierarchical_index(
    document: str,
    parent_size: int = 1024,
    child_size: int = 256,
) -> list[HierarchicalChunk]:
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=parent_size, chunk_overlap=0
    )
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=child_size, chunk_overlap=32
    )
 
    parents = parent_splitter.split_text(document)
    result = []
 
    for p_idx, parent_text in enumerate(parents):
        parent_id = f"parent_{p_idx}"
        parent_chunk = HierarchicalChunk(
            chunk_id=parent_id,
            parent_id=None,
            text=parent_text,
            level=0,
        )
        children = child_splitter.split_text(parent_text)
        for c_idx, child_text in enumerate(children):
            child = HierarchicalChunk(
                chunk_id=f"{parent_id}_child_{c_idx}",
                parent_id=parent_id,
                text=child_text,
                level=1,
            )
            parent_chunk.children.append(child)
        result.append(parent_chunk)
 
    return result
 
# At retrieval time: find the best child, return its parent text to the LLM

When it wins: Documents where precision and context are both important. API documentation, legal contracts, technical specs. Typical improvement: retrieval precision stays high (small child chunks) while answer quality improves (parent context).
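
The retrieval-time step mentioned in the comment above might look like the sketch below. The search callable is a placeholder for your vector store's query method; the only real work is mapping child hits back to parent text:

from typing import Callable

def build_parent_lookup(parents: list[HierarchicalChunk]) -> dict[str, str]:
    # Map each child chunk_id to its parent's text so a child hit can be expanded.
    return {
        child.chunk_id: parent.text
        for parent in parents
        for child in parent.children
    }

def retrieve_context(
    query: str,
    search: Callable[..., list[str]],  # placeholder: returns best-matching child chunk_ids
    parent_lookup: dict[str, str],
    k: int = 3,
) -> list[str]:
    child_ids = search(query, k=k)
    # Deduplicate parents while preserving rank order before sending to the LLM.
    seen: set[str] = set()
    contexts = []
    for cid in child_ids:
        parent_text = parent_lookup[cid]
        if parent_text not in seen:
            seen.add(parent_text)
            contexts.append(parent_text)
    return contexts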

Choosing the Right Strategy

The decision isn't purely about document type — it's also about your retrieval latency budget and how much indexing pipeline complexity you can maintain.

| Strategy | Index Speed | Retrieval Precision | Context Quality | Best For |
| --- | --- | --- | --- | --- |
| Fixed-size | Fast | Moderate | Moderate | Homogenous, short docs |
| Recursive semantic | Fast | Good | Good | Mixed-structure docs |
| Embedding-based semantic | Slow | Very good | Good | Long-form, topic-rich docs |
| Hierarchical | Moderate | Very good | Very good | Technical reference docs |

Measuring Impact on Your Golden Set

Don't pick a strategy by intuition. Measure it with the golden set you built in post 1:

import numpy as np
 
def evaluate_chunking_strategy(
    golden_examples: list[dict],
    retrieved_results: list[list[str]],  # chunk_ids per query
    k: int = 5,
) -> dict:
    recalls = []
    precisions = []
 
    for example, retrieved in zip(golden_examples, retrieved_results):
        relevant = set(example["relevant_chunk_ids"])
        retrieved_at_k = set(retrieved[:k])
 
        recall = len(relevant & retrieved_at_k) / max(1, len(relevant))
        precision = len(relevant & retrieved_at_k) / max(1, len(retrieved_at_k))
        recalls.append(recall)
        precisions.append(precision)
 
    return {
        f"recall@{k}": round(np.mean(recalls), 4),
        f"precision@{k}": round(np.mean(precisions), 4),
        "f1": round(
            2 * np.mean(recalls) * np.mean(precisions) /
            max(1e-9, np.mean(recalls) + np.mean(precisions)), 4
        ),
    }

Run this for each strategy, compare numbers, pick the winner. Anything less is guessing.
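
A comparison harness can stay small. The index_and_retrieve function below is a placeholder for however your pipeline rebuilds the index with a given chunking function and runs the golden queries against it:

# Hypothetical harness: assumes index_and_retrieve returns retrieved
# chunk_ids per golden query for a given chunking function.
strategies = {
    "fixed_512_64": lambda text: fixed_chunk(text, chunk_size=512, overlap=64),
    "recursive_prose": lambda text: make_splitter("prose").split_text(text),
    "semantic_0.75": lambda text: semantic_chunk(text, model, threshold=0.75),
}

report = {}
for name, chunk_fn in strategies.items():
    retrieved = index_and_retrieve(golden_examples, chunk_fn)
    report[name] = evaluate_chunking_strategy(golden_examples, retrieved, k=5)

for name, metrics in report.items():
    print(name, metrics)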

Key Takeaways

  • Chunk size and overlap are the single highest-leverage parameters in a RAG pipeline — tune them against your actual data before moving on.
  • Fixed-size chunking is the right starting point; it fails when document structure doesn't align with character boundaries.
  • Embedding-based semantic chunking can improve recall by 8–15 points on heterogeneous corpora but costs more at indexing time.
  • Hierarchical (parent-child) chunking decouples retrieval precision from generation context — worth the complexity for technical documentation.
  • Overlap of 10–15% of chunk size is a reliable default; test variations on your golden set before settling.
  • Never choose a chunking strategy without measuring recall@k on a representative sample of real queries.