Embedding Choice and Dimensionality
Series: Building Production RAG

Teams spend weeks debating vector database options and almost no time on the thing that determines whether their retrieval actually works: which embedding model they use and at what dimensionality. The embedding model is the translation layer between language and geometry — get it wrong and no amount of tuning downstream will compensate.
This post covers how to pick a model, what dimensionality actually costs you, and when fine-tuning is worth the effort.
Why This Decision Is Hard to Reverse
Your embedding model is baked into every vector in your index. Change it and you re-embed everything. For a corpus of 5 million chunks averaging roughly 1,000 tokens each, at $0.10 per million tokens, that's $500 in API fees and several hours of pipeline time — every time you switch. That cost means teams get anchored on their first choice even when it's wrong.
The right way to handle this: treat embedding selection as a first-class architectural decision, evaluate it rigorously before indexing your production corpus, and design your pipeline to make re-embedding possible without a full outage.
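To make the switching cost concrete before you commit, a back-of-the-envelope helper is enough. This is a sketch; the chunk count, tokens-per-chunk, and pricing below are illustrative assumptions to replace with your own numbers:

```python
def reembedding_cost_usd(
    n_chunks: int,
    avg_tokens_per_chunk: int,
    price_per_million_tokens: float,
) -> float:
    """Estimate the API cost of re-embedding an entire corpus."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# 5M chunks at ~1,000 tokens each and $0.10/M tokens comes to roughly $500
cost = reembedding_cost_usd(5_000_000, 1_000, 0.10)
```

Run this for your corpus before arguing about model quality; it tells you how expensive it is to be wrong.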
The Model Landscape in 2025
Three tiers worth knowing:
Tier 1 — General-purpose, high quality: text-embedding-3-large (OpenAI, 3072d), embed-english-v3.0 (Cohere, 1024d), gte-Qwen2-7B-instruct (Alibaba/HuggingFace, 3584d). These win on general benchmarks but may underperform on specialized domains.
Tier 2 — Fast, deployable, good enough for most: text-embedding-3-small (OpenAI, 1536d), bge-large-en-v1.5 (BAAI, 1024d), e5-large-v2 (Microsoft, 1024d). These are the workhorses of production RAG at companies where latency and cost matter.
Tier 3 — Small, fast, on-premise: all-MiniLM-L6-v2 (384d), bge-small-en-v1.5 (384d). Useful when you can't send data to an API or when your embedding throughput requirements are extreme.
The MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) is your starting point, but filter it: look at retrieval task scores specifically, filter for your language, and distrust any model that tops the aggregate leaderboard only by performing well on STS tasks that don't reflect real retrieval.
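That filtering step can be mechanized. Here's a sketch that re-ranks candidates by their mean score on retrieval tasks alone; the `"Retrieval/"` key prefix and every score below are made up for illustration, not real MTEB numbers:

```python
def retrieval_focused_ranking(
    mteb_scores: dict[str, dict[str, float]],
) -> list[tuple[str, float]]:
    """Rank models by mean score on retrieval tasks only, ignoring STS
    and other task types that inflate the aggregate leaderboard."""
    ranking = []
    for model, tasks in mteb_scores.items():
        retrieval = [s for t, s in tasks.items() if t.startswith("Retrieval/")]
        if retrieval:
            ranking.append((model, sum(retrieval) / len(retrieval)))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

# Illustrative (made-up) scores: model-b wins the aggregate via STS,
# but model-a wins on retrieval, which is what matters for RAG.
scores = {
    "model-a": {"Retrieval/nfcorpus": 0.38, "Retrieval/scifact": 0.72, "STS/sts12": 0.70},
    "model-b": {"Retrieval/nfcorpus": 0.33, "Retrieval/scifact": 0.68, "STS/sts12": 0.85},
}
```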
Dimensionality Tradeoffs
Higher dimensionality is not strictly better. The tradeoffs are concrete:
| Dimension | Storage per vector (float32) | Index RAM (10M vecs) | Approx. query latency | MTEB retrieval avg |
|---|---|---|---|---|
| 384 | 1.5 KB | ~15 GB | ~2 ms | ~52 |
| 768 | 3 KB | ~30 GB | ~4 ms | ~55 |
| 1024 | 4 KB | ~40 GB | ~5 ms | ~56 |
| 1536 | 6 KB | ~60 GB | ~8 ms | ~57 |
| 3072 | 12 KB | ~120 GB | ~15 ms | ~59 |
The jump from 384 to 1024 dimensions often buys real quality gains. The jump from 1024 to 3072 usually doesn't — you're paying 3x storage for 1–3 MTEB points. Unless your domain is highly specialized and benefits from the additional representational capacity, 768–1024d is the practical sweet spot for production.
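The table's storage numbers fall out of simple arithmetic: each float32 dimension costs 4 bytes. A small helper to reproduce the footprint for your own corpus size (raw vectors only; real ANN indexes add graph and metadata overhead on top):

```python
def index_footprint(dim: int, n_vectors: int) -> dict[str, float]:
    """Raw float32 vector footprint; ANN index structures add overhead."""
    bytes_per_vector = dim * 4  # 4 bytes per float32 dimension
    return {
        "kb_per_vector": bytes_per_vector / 1024,
        "index_gb": bytes_per_vector * n_vectors / 1e9,  # decimal GB
    }

# 1024d at 10M vectors: 4.0 KB per vector, roughly 40 GB of raw vectors
footprint = index_footprint(1024, 10_000_000)
```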
OpenAI's text-embedding-3 family supports Matryoshka truncation — you embed at full dimensionality and truncate to a lower dimension without re-embedding. Useful if you want to experiment with precision/cost tradeoffs without re-running your indexing pipeline.
```python
import openai
import numpy as np

client = openai.OpenAI()

def embed_with_truncation(
    texts: list[str],
    full_dim: int = 1536,
    truncated_dim: int = 512,
) -> np.ndarray:
    """
    Embed at full dimensionality, then truncate + renormalize.
    Only works correctly for Matryoshka-trained models.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=full_dim,
    )
    vectors = np.array([r.embedding for r in response.data])
    # Truncate to lower dim, then L2-normalize
    truncated = vectors[:, :truncated_dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-10)

def compare_truncation_recall(
    golden_set: list[dict],
    dims_to_test: list[int] = [512, 768, 1024, 1536],
) -> dict[int, float]:
    """
    Measure recall@5 at each truncation level on your golden set.
    The curve tells you where quality drops off.
    """
    results = {}
    for dim in dims_to_test:
        # ... retrieval and scoring logic against golden set
        # plug in your evaluate_chunking_strategy from post 2
        results[dim] = 0.0  # placeholder for measured recall
    return results
```

Domain Mismatch: The Silent Killer
General-purpose embeddings fail when your corpus uses terminology outside their training distribution. Common failure domains:
- Legal: contract clause semantics are radically different from conversational English
- Biomedical: drug names, gene symbols, clinical abbreviations
- Code-heavy: queries mixing natural language with API names, variable names, error messages
- Internal enterprise jargon: product names, team abbreviations, proprietary processes
You can detect domain mismatch with a simple test: take 20 queries from your golden set, embed them, find their nearest neighbors in the corpus, and manually inspect the top-3 for each. If the results are semantically bizarre — clearly wrong documents ranking high — you have a domain mismatch problem.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def domain_mismatch_audit(
    queries: list[str],
    corpus_texts: list[str],
    model_name: str = "BAAI/bge-large-en-v1.5",
    top_k: int = 3,
) -> list[dict]:
    model = SentenceTransformer(model_name)
    q_embs = model.encode(queries, normalize_embeddings=True)
    c_embs = model.encode(corpus_texts, normalize_embeddings=True)
    scores = q_embs @ c_embs.T  # shape: (n_queries, n_corpus)
    results = []
    for i, query in enumerate(queries):
        top_indices = np.argsort(scores[i])[::-1][:top_k]
        results.append({
            "query": query,
            "top_matches": [
                {"text": corpus_texts[j], "score": float(scores[i][j])}
                for j in top_indices
            ],
        })
    return results

# Eyeball these. If the top match for "what is the MAC period?"
# is a document about Apple computers, you have a domain problem.
```

Fine-Tuning ROI
Fine-tuning is the solution to domain mismatch, but it has real costs. The honest math:
Minimum viable fine-tuning dataset: ~1,000 (query, positive passage, hard negative) triples. Building these manually takes 20–40 hours. LLM-assisted generation can get you there faster but needs human review to avoid label noise.
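One way to cut the manual effort is to mine hard negatives from the base model's own mistakes: for each (query, positive) pair, the highest-ranked chunk that is *not* a labeled positive is a natural hard negative. A minimal numpy sketch, assuming you already have L2-normalized query and corpus embeddings:

```python
import numpy as np

def mine_hard_negatives(
    q_embs: np.ndarray,          # (n_queries, dim), L2-normalized
    c_embs: np.ndarray,          # (n_corpus, dim), L2-normalized
    positives: list[set[int]],   # labeled relevant corpus indices per query
) -> list[int]:
    """
    For each query, return the highest-scoring corpus index that is NOT
    a labeled positive -- a "hard" negative the base model confuses with
    the real answer.
    """
    scores = q_embs @ c_embs.T
    negatives = []
    for i, pos in enumerate(positives):
        for j in np.argsort(scores[i])[::-1]:
            if int(j) not in pos:
                negatives.append(int(j))
                break
    return negatives
```

Mined triples still need a human pass: some "hard negatives" will turn out to be unlabeled positives, which is exactly the label noise you're trying to avoid.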
Infrastructure cost: a single fine-tuning run on bge-large-en-v1.5 takes ~2 hours on a single A100. If you use a cloud provider, budget $10–30 per run. Expect 3–5 runs before you're happy with the result.
Recall uplift: on in-domain specialized corpora, fine-tuning typically improves recall@5 by 10–20 percentage points over the base model. On general corpora, uplift is often 2–5 points — not worth it.
The ROI calculation: if your golden set shows recall@5 of 0.65 with a general model, and you have a budget for 2 weeks of engineering time, fine-tuning will likely get you to 0.78–0.82. If you're already at 0.82, spend that time on re-ranking instead.
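Encoded as a rule of thumb (the 0.75 threshold is this post's heuristic, not a universal constant):

```python
def next_investment(baseline_recall_at_5: float, threshold: float = 0.75) -> str:
    """Rule of thumb: fine-tune when baseline recall is weak,
    otherwise spend the engineering time on re-ranking."""
    return "fine-tune embedder" if baseline_recall_at_5 < threshold else "add re-ranking"
```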
```python
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

def fine_tune_embedder(
    base_model: str,
    train_examples: list[tuple[str, str, str]],  # (query, positive, negative)
    output_path: str,
    epochs: int = 3,
    batch_size: int = 32,
) -> SentenceTransformer:
    model = SentenceTransformer(base_model)
    examples = [
        InputExample(texts=[q, pos, neg])
        for q, pos, neg in train_examples
    ]
    loader = DataLoader(examples, batch_size=batch_size, shuffle=True)
    loss = losses.TripletLoss(model=model)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=int(0.1 * len(loader) * epochs),
        output_path=output_path,
        show_progress_bar=True,
        save_best_model=True,
    )
    return model
```

The Evaluation You Must Run
Before locking in a model, run this comparison on your golden set:
```python
def embedding_model_shootout(
    golden_examples: list[dict],
    corpus_chunks: list[dict],  # [{id, text}, ...]
    candidate_models: list[str],
    k: int = 5,
) -> dict[str, dict]:
    from sentence_transformers import SentenceTransformer
    import numpy as np

    corpus_texts = [c["text"] for c in corpus_chunks]
    corpus_ids = [c["id"] for c in corpus_chunks]
    queries = [ex["query"] for ex in golden_examples]
    results = {}
    for model_name in candidate_models:
        model = SentenceTransformer(model_name)
        c_embs = model.encode(corpus_texts, normalize_embeddings=True, batch_size=64)
        q_embs = model.encode(queries, normalize_embeddings=True, batch_size=64)
        scores = q_embs @ c_embs.T
        recalls = []
        for i, ex in enumerate(golden_examples):
            top_k_ids = {corpus_ids[j] for j in np.argsort(scores[i])[::-1][:k]}
            relevant = set(ex["relevant_chunk_ids"])
            recalls.append(len(top_k_ids & relevant) / max(1, len(relevant)))
        results[model_name] = {
            "recall@k": round(float(np.mean(recalls)), 4),
            "model_dim": model.get_sentence_embedding_dimension(),
        }
    return results
```

Run this. The output table will tell you whether you're leaving retrieval quality on the table or paying for dimensions you don't need.
Key Takeaways
- Embedding model choice is one of the hardest decisions to reverse — evaluate multiple models against your golden set before indexing your production corpus.
- The 768–1024 dimension range gives the best quality-to-cost ratio for most production workloads; going beyond 1536d rarely pays off.
- Matryoshka models let you experiment with dimensionality tradeoffs without re-embedding — use this when you're still calibrating your cost/quality operating point.
- Domain mismatch is detectable with a simple nearest-neighbor inspection; if results look semantically wrong, no amount of downstream tuning will fix it.
- Fine-tuning is worth it when your domain is specialized and your baseline recall@5 is below 0.75; expect 10–20 point gains and budget 1–2 weeks of engineering time.
- Always run a model shootout on your actual golden set — MTEB benchmarks are a starting point, not a decision.