Embedding Choice and Dimensionality
Series: Building Production RAG

Teams spend weeks debating vector database options and almost no time on the thing that determines whether their retrieval actually works: which embedding model they use and at what dimensionality. The embedding model is the translation layer between language and geometry — get it wrong and no amount of tuning downstream will compensate.
This post covers how to pick a model, what dimensionality actually costs you, and when fine-tuning is worth the effort.
Why This Decision Is Hard to Reverse
Your embedding model is baked into every vector in your index. Change it and you re-embed everything. For a corpus of 5 million chunks averaging roughly 1,000 tokens each, at $0.10 per million tokens, that's $500 in API fees and several hours of pipeline time — every time you switch. That cost means teams get anchored on their first choice even when it's wrong.
The right way to handle this: treat embedding selection as a first-class architectural decision, evaluate it rigorously before indexing your production corpus, and design your pipeline to make re-embedding possible without a full outage.
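To make the switching cost concrete before you commit, a back-of-the-envelope helper is enough. This is a sketch; the chunk count, tokens-per-chunk, and pricing below are illustrative assumptions to replace with your own numbers:

```python
def reembedding_cost_usd(
    n_chunks: int,
    avg_tokens_per_chunk: int,
    price_per_million_tokens: float,
) -> float:
    """Estimate the API cost of re-embedding an entire corpus."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# 5M chunks at ~1,000 tokens each and $0.10/M tokens comes to roughly $500
cost = reembedding_cost_usd(5_000_000, 1_000, 0.10)
```

Run this for your corpus before arguing about model quality; it tells you how expensive it is to be wrong.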
The Model Landscape in 2025
Three tiers worth knowing:
Tier 1 — General-purpose, high quality: text-embedding-3-large (OpenAI, 3072d), embed-english-v3.0 (Cohere, 1024d), gte-Qwen2-7B-instruct (Alibaba/HuggingFace, 3584d). These win on general benchmarks but may underperform on specialized domains.
Tier 2 — Fast, deployable, good enough for most: text-embedding-3-small (OpenAI, 1536d), bge-large-en-v1.5 (BAAI, 1024d), e5-large-v2 (Microsoft, 1024d). These are the workhorses of production RAG at companies where latency and cost matter.
Tier 3 — Small, fast, on-premise: all-MiniLM-L6-v2 (384d), bge-small-en-v1.5 (384d). Useful when you can't send data to an API or when your embedding throughput requirements are extreme.
The MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) is your starting point, but filter it: look at retrieval task scores specifically, filter for your language, and distrust any model that tops the aggregate leaderboard only by performing well on STS tasks that don't reflect real retrieval.
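That filtering step can be mechanized. Here's a sketch that re-ranks candidates by their mean score on retrieval tasks alone; the `"Retrieval/"` key prefix and every score below are made up for illustration, not real MTEB numbers:

```python
def retrieval_focused_ranking(
    mteb_scores: dict[str, dict[str, float]],
) -> list[tuple[str, float]]:
    """Rank models by mean score on retrieval tasks only, ignoring STS
    and other task types that inflate the aggregate leaderboard."""
    ranking = []
    for model, tasks in mteb_scores.items():
        retrieval = [s for t, s in tasks.items() if t.startswith("Retrieval/")]
        if retrieval:
            ranking.append((model, sum(retrieval) / len(retrieval)))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

# Illustrative (made-up) scores: model-b wins the aggregate via STS,
# but model-a wins on retrieval, which is what matters for RAG.
scores = {
    "model-a": {"Retrieval/nfcorpus": 0.38, "Retrieval/scifact": 0.72, "STS/sts12": 0.70},
    "model-b": {"Retrieval/nfcorpus": 0.33, "Retrieval/scifact": 0.68, "STS/sts12": 0.85},
}
```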
Dimensionality Tradeoffs
Higher dimensionality is not strictly better. The tradeoffs are concrete:
| Dimension | Storage per vector (float32) | Index RAM (10M vecs) | Approx. query latency | MTEB retrieval avg |
|---|---|---|---|---|
| 384 | 1.5 KB | ~15 GB | ~2 ms | ~52 |
| 768 | 3 KB | ~30 GB | ~4 ms | ~55 |
| 1024 | 4 KB | ~40 GB | ~5 ms | ~56 |
| 1536 | 6 KB | ~60 GB | ~8 ms | ~57 |
| 3072 | 12 KB | ~120 GB | ~15 ms | ~59 |
The jump from 384 to 1024 dimensions often buys real quality gains. The jump from 1024 to 3072 usually doesn't — you're paying 3x storage for 1–3 MTEB points. Unless your domain is highly specialized and benefits from the additional representational capacity, 768–1024d is the practical sweet spot for production.
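The table's storage numbers fall out of simple arithmetic: each float32 dimension costs 4 bytes. A small helper to reproduce the footprint for your own corpus size (raw vectors only; real ANN indexes add graph and metadata overhead on top):

```python
def index_footprint(dim: int, n_vectors: int) -> dict[str, float]:
    """Raw float32 vector footprint; ANN index structures add overhead."""
    bytes_per_vector = dim * 4  # 4 bytes per float32 dimension
    return {
        "kb_per_vector": bytes_per_vector / 1024,
        "index_gb": bytes_per_vector * n_vectors / 1e9,  # decimal GB
    }

# 1024d at 10M vectors: 4.0 KB per vector, roughly 40 GB of raw vectors
footprint = index_footprint(1024, 10_000_000)
```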
OpenAI's text-embedding-3 family supports Matryoshka truncation — you embed at full dimensionality and truncate to a lower dimension without re-embedding. Useful if you want to experiment with precision/cost tradeoffs without re-running your indexing pipeline.
```python
import openai
import numpy as np

client = openai.OpenAI()

def embed_with_truncation(
    texts: list[str],
    full_dim: int = 1536,
    truncated_dim: int = 512,
) -> np.ndarray:
    """
    Embed at full dimensionality, then truncate + renormalize.
    Only works correctly for Matryoshka-trained models.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=full_dim,
    )
    vectors = np.array([r.embedding for r in response.data])
    # Truncate to lower dim, then L2-normalize
    truncated = vectors[:, :truncated_dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-10)

def compare_truncation_recall(
    golden_set: list[dict],
    dims_to_test: list[int] = [512, 768, 1024, 1536],
) -> dict[int, float]:
    """
    Measure recall@5 at each truncation level on your golden set.
    The curve tells you where quality drops off.
    """
    results = {}
    for dim in dims_to_test:
        # ... retrieval and scoring logic against golden set
        # plug in your evaluate_chunking_strategy from post 2
        results[dim] = 0.0  # placeholder for measured recall
    return results
```

Domain Mismatch: The Silent Killer
General-purpose embeddings fail when your corpus uses terminology outside their training distribution. Common failure domains:
- Legal: contract clause semantics are radically different from conversational English
- Biomedical: drug names, gene symbols, clinical abbreviations
- Code-heavy: queries mixing natural language with API names, variable names, error messages
- Internal enterprise jargon: product names, team abbreviations, proprietary processes
You can detect domain mismatch with a simple test: take 20 queries from your golden set, embed them, find their nearest neighbors in the corpus, and manually inspect the top-3 for each. If the results are semantically bizarre — clearly wrong documents ranking high — you have a domain mismatch problem.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def domain_mismatch_audit(
    queries: list[str],
    corpus_texts: list[str],
    model_name: str = "BAAI/bge-large-en-v1.5",
    top_k: int = 3,
) -> list[dict]:
    model = SentenceTransformer(model_name)
    q_embs = model.encode(queries, normalize_embeddings=True)
    c_embs = model.encode(corpus_texts, normalize_embeddings=True)
    scores = q_embs @ c_embs.T  # shape: (n_queries, n_corpus)
    results = []
    for i, query in enumerate(queries):
        top_indices = np.argsort(scores[i])[::-1][:top_k]
        results.append({
            "query": query,
            "top_matches": [
                {"text": corpus_texts[j], "score": float(scores[i][j])}
                for j in top_indices
            ],
        })
    return results

# Eyeball these. If the top match for "what is the MAC period?"
# is a document about Apple computers, you have a domain problem.
```

Fine-Tuning ROI
Fine-tuning is the solution to domain mismatch, but it has real costs. The honest math:
Minimum viable fine-tuning dataset: ~1,000 (query, positive passage, hard negative) triples. Building these manually takes 20–40 hours. LLM-assisted generation can get you there faster but needs human review to avoid label noise.
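One way to cut the manual effort is to mine hard negatives from the base model's own mistakes: for each (query, positive) pair, the highest-ranked chunk that is *not* a labeled positive is a natural hard negative. A minimal numpy sketch, assuming you already have L2-normalized query and corpus embeddings:

```python
import numpy as np

def mine_hard_negatives(
    q_embs: np.ndarray,          # (n_queries, dim), L2-normalized
    c_embs: np.ndarray,          # (n_corpus, dim), L2-normalized
    positives: list[set[int]],   # labeled relevant corpus indices per query
) -> list[int]:
    """
    For each query, return the highest-scoring corpus index that is NOT
    a labeled positive -- a "hard" negative the base model confuses with
    the real answer.
    """
    scores = q_embs @ c_embs.T
    negatives = []
    for i, pos in enumerate(positives):
        for j in np.argsort(scores[i])[::-1]:
            if int(j) not in pos:
                negatives.append(int(j))
                break
    return negatives
```

Mined triples still need a human pass: some "hard negatives" will turn out to be unlabeled positives, which is exactly the label noise you're trying to avoid.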
Infrastructure cost: a single fine-tuning run on bge-large-en-v1.5 takes ~2 hours on a single A100. If you use a cloud provider, budget $10–30 per run. Expect 3–5 runs before you're happy with the result.
Recall uplift: on in-domain specialized corpora, fine-tuning typically improves recall@5 by 10–20 percentage points over the base model. On general corpora, uplift is often 2–5 points — not worth it.
The ROI calculation: if your golden set shows recall@5 of 0.65 with a general model, and you have a budget for 2 weeks of engineering time, fine-tuning will likely get you to 0.78–0.82. If you're already at 0.82, spend that time on re-ranking instead.
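Encoded as a rule of thumb (the 0.75 threshold is this post's heuristic, not a universal constant):

```python
def next_investment(baseline_recall_at_5: float, threshold: float = 0.75) -> str:
    """Rule of thumb: fine-tune when baseline recall is weak,
    otherwise spend the engineering time on re-ranking."""
    return "fine-tune embedder" if baseline_recall_at_5 < threshold else "add re-ranking"
```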
```python
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

def fine_tune_embedder(
    base_model: str,
    train_examples: list[tuple[str, str, str]],  # (query, positive, negative)
    output_path: str,
    epochs: int = 3,
    batch_size: int = 32,
) -> SentenceTransformer:
    model = SentenceTransformer(base_model)
    examples = [
        InputExample(texts=[q, pos, neg])
        for q, pos, neg in train_examples
    ]
    loader = DataLoader(examples, batch_size=batch_size, shuffle=True)
    loss = losses.TripletLoss(model=model)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=int(0.1 * len(loader) * epochs),
        output_path=output_path,
        show_progress_bar=True,
        save_best_model=True,
    )
    return model
```

The Evaluation You Must Run
Before locking in a model, run this comparison on your golden set:
```python
def embedding_model_shootout(
    golden_examples: list[dict],
    corpus_chunks: list[dict],  # [{id, text}, ...]
    candidate_models: list[str],
    k: int = 5,
) -> dict[str, dict]:
    from sentence_transformers import SentenceTransformer
    import numpy as np

    corpus_texts = [c["text"] for c in corpus_chunks]
    corpus_ids = [c["id"] for c in corpus_chunks]
    queries = [ex["query"] for ex in golden_examples]
    results = {}
    for model_name in candidate_models:
        model = SentenceTransformer(model_name)
        c_embs = model.encode(corpus_texts, normalize_embeddings=True, batch_size=64)
        q_embs = model.encode(queries, normalize_embeddings=True, batch_size=64)
        scores = q_embs @ c_embs.T
        recalls = []
        for i, ex in enumerate(golden_examples):
            top_k_ids = {corpus_ids[j] for j in np.argsort(scores[i])[::-1][:k]}
            relevant = set(ex["relevant_chunk_ids"])
            recalls.append(len(top_k_ids & relevant) / max(1, len(relevant)))
        results[model_name] = {
            "recall@k": round(float(np.mean(recalls)), 4),
            "model_dim": model.get_sentence_embedding_dimension(),
        }
    return results
```

Run this. The output table will tell you whether you're leaving retrieval quality on the table or paying for dimensions you don't need.
Key Takeaways
- Embedding model choice is one of the hardest decisions to reverse — evaluate multiple models against your golden set before indexing your production corpus.
- The 768–1024 dimension range gives the best quality-to-cost ratio for most production workloads; going beyond 1536d rarely pays off.
- Matryoshka models let you experiment with dimensionality tradeoffs without re-embedding — use this when you're still calibrating your cost/quality operating point.
- Domain mismatch is detectable with a simple nearest-neighbor inspection; if results look semantically wrong, no amount of downstream tuning will fix it.
- Fine-tuning is worth it when your domain is specialized and your baseline recall@5 is below 0.75; expect 10–20 point gains and budget 1–2 weeks of engineering time.
- Always run a model shootout on your actual golden set — MTEB benchmarks are a starting point, not a decision.