Fine-Tuning vs Prompting vs RAG: A Decision Tree for the Right LLM Strategy
The most common mistake I see teams make is choosing their LLM strategy based on what sounds most technically sophisticated rather than what solves their actual problem. Fine-tuning gets chosen because it sounds rigorous. RAG gets chosen because the blog posts made it look easy. Prompting gets dismissed as "just prompting." The result is months of training runs and infrastructure build-out for a problem that a well-written system prompt would have solved in a day.
There is a right tool for each job. Here is the decision tree.
The Decision Tree
Work through this honestly before writing code. Most teams stop at "needs knowledge" and jump straight to RAG. Many of those cases are better served by prompting with a well-structured context window.
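The tree reduces to a few questions: does the task need new behavior, new knowledge, or neither? As a sketch, that branching can be written down as a small routing function. The thresholds below (context budget, update cadence) are illustrative assumptions, not canonical numbers; tune them to your provider's context window and pricing.

```python
def choose_strategy(
    knowledge_tokens: int,
    updates_per_week: float,
    needs_strict_format: bool,
    context_budget: int = 100_000,
) -> str:
    """Illustrative routing logic for the decision tree.

    The thresholds are assumptions to make the trade-offs concrete.
    """
    if needs_strict_format:
        # Behavioral problems (format, tone, guardrails) point to fine-tuning,
        # combined with RAG only if knowledge scale is also a problem.
        if knowledge_tokens <= context_budget:
            return "fine-tuning"
        return "fine-tuning + RAG"
    if knowledge_tokens <= context_budget and updates_per_week <= 1:
        return "prompting"  # stuff the knowledge into the (cached) context
    return "RAG"  # too big or too fresh for context stuffing
```

For example, a 40k-token knowledge base that changes monthly, with no strict format requirement, routes to plain prompting.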
When Prompting Wins
Prompting is underrated because it is invisible. When it works, nobody notices. When it fails, teams blame "LLM limitations" and reach for heavier solutions.
Prompting wins when:
- Your knowledge base fits in a context window (< 50 pages of relevant material)
- Your task is well-defined and the base model has general capability for it
- You need results this week, not next quarter
- You are still discovering what the system needs to do
The cost advantage is dramatic. A well-crafted system prompt costs nothing to deploy and nothing to maintain infrastructure-wise. The only cost is token usage on each call.
```python
# This system prompt replaced a 6-week RAG build for an internal legal Q&A tool.
# The entire legal policy manual fit in 40k tokens.
SYSTEM_PROMPT = """You are a legal compliance assistant for Acme Corp.
You answer questions about our internal policies based only on the policy text provided.

Rules:
- Answer only from the provided policy context. Never guess or extrapolate.
- If the policy does not address the question, say exactly: "This is not covered by current policy. Please contact legal@acme.com."
- Cite the specific policy section (e.g., "Section 4.2 — Data Retention") for every claim.
- Use plain language. Avoid legal jargon unless quoting directly.

<policy_context>
{POLICY_TEXT}
</policy_context>
"""
```

If your knowledge changes daily or your documents exceed your context window budget, prompting alone breaks down. That is when RAG earns its complexity cost.
When RAG Wins
RAG is the right choice when your knowledge base is too large for context stuffing, or when it changes frequently enough that you need retrieval to stay current without redeployment.
The critical design decision in RAG is not the vector database. It is the chunking strategy. Most teams spend weeks on embedding models and miss that their chunks are the problem.
```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_id: str
    section_title: str
    chunk_index: int
    metadata: dict


def chunk_document_with_overlap(
    text: str,
    source_id: str,
    chunk_size: int = 512,  # in words, a rough proxy for tokens
    overlap: int = 64,
    section_title: str = "",
) -> list[Chunk]:
    """
    Overlapping chunks preserve context across boundaries.
    Critical for tables and numbered lists that span chunk edges.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i, start in enumerate(range(0, len(words), step)):
        chunk_words = words[start:start + chunk_size]
        if len(chunk_words) < 50:  # Skip tiny trailing chunks
            break
        chunks.append(Chunk(
            text=" ".join(chunk_words),
            source_id=source_id,
            section_title=section_title,
            chunk_index=i,
            metadata={"word_start": start, "word_end": start + len(chunk_words)},
        ))
    return chunks
```

RAG has real operational costs that teams underestimate:
- Embedding pipeline for new documents (latency + API cost)
- Vector database infrastructure and scaling
- Re-embedding when you switch embedding models
- Retrieval quality monitoring and reranking tuning
- Handling documents that don't chunk well (tables, code, PDFs)
If your knowledge base has fewer than 500 documents and changes less than weekly, you might be better served by a simpler approach.
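For that small, slow-moving regime, here is a sketch of the "simpler approach": rank whole documents by naive keyword overlap with the query and stuff the top hits into the prompt. No embeddings, no vector database. The scoring function is deliberately crude and only an illustration; swap in BM25 if it ever becomes the bottleneck.

```python
def top_documents(query: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Rank documents by naive keyword overlap with the query.

    Good enough for a few hundred documents that change rarely;
    the winners get pasted into the prompt as context.
    """
    query_terms = set(query.lower().split())

    def score(text: str) -> int:
        # Count how many query terms appear in the document at all.
        return len(query_terms & set(text.lower().split()))

    ranked = sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)
    return ranked[:k]
```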
When Fine-Tuning Wins
Fine-tuning is the right choice when the base model's behavior is wrong in a structural way — not just its knowledge, but how it responds. Specifically:
Format consistency at scale. If you need every response in a specific JSON schema, at a fixed output length, or in a proprietary tone or style, fine-tuning bakes this in. Prompting can achieve this but adds tokens on every call and drifts under adversarial inputs.
Specialized domain vocabulary. Medical coding, legal citation formats, financial calculation schemas — if the base model consistently fails at domain-specific tasks even with good prompting, it may not have seen enough domain examples in pretraining.
Latency optimization. A fine-tuned smaller model (e.g., GPT-4o-mini or Llama 3.1 8B) often outperforms a larger general model with a long system prompt — for less cost and lower latency.
Hardcoded behavioral guardrails. Fine-tuning can make certain behaviors extremely consistent in ways that system prompt instructions cannot, because the behavior is in the weights rather than the context.
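For the format-consistency case above, the training data itself is the spec. A hedged sketch of what one chat-format training record might look like; the `messages` structure follows the common JSONL chat schema, but the task, fields, and values here are entirely illustrative, so check your provider's exact format before building a dataset.

```python
import json

# One training example teaching a strict JSON output schema.
# The invoice task and its field names are made up for illustration.
record = {
    "messages": [
        {"role": "system", "content": "Extract invoice fields as JSON."},
        {"role": "user", "content": "Invoice #4417 from Acme, total $1,250.00, due 2024-07-01."},
        {
            "role": "assistant",
            "content": json.dumps({
                "invoice_id": "4417",
                "vendor": "Acme",
                "total_usd": 1250.00,
                "due_date": "2024-07-01",
            }),
        },
    ]
}

line = json.dumps(record)  # one line per example in the JSONL training file
```

A few hundred records in this shape, with the assistant turn always emitting valid schema-conformant JSON, is what "baking the format in" actually means.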
What fine-tuning does NOT fix: hallucination on facts not in the training data, lack of current information, and tasks the model architecture is fundamentally bad at.
Total Cost of Ownership Comparison
This is where most analyses fail. They compare token costs but ignore engineering time and ongoing maintenance.
| Strategy | Upfront Engineering | Monthly Infra Cost | Ongoing Maintenance | Update Frequency |
|---|---|---|---|---|
| Prompting | 1–5 days | $0 (token cost only) | Low | Immediate |
| RAG | 2–6 weeks | $100–$2000 (vector DB, embedding) | Medium | Hours–Days |
| Fine-tuning | 4–12 weeks | $500–$5000 (training compute) + serving | High | Weeks per iteration |
| Fine-tuning + RAG | 8–20 weeks | $600–$7000 | Very High | Complex |
These ranges are wide because they depend on scale, but the ordering is consistent. Fine-tuning compounds: every time your requirements change, you re-train and re-evaluate.
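A quick back-of-envelope makes the trade-off concrete. Every number below is a placeholder, not a real price; plug in your provider's actual rates and your actual call volume.

```python
def monthly_prompt_overhead_usd(
    system_prompt_tokens: int,
    calls_per_month: int,
    input_price_per_mtok: float = 3.0,  # placeholder $/1M input tokens
) -> float:
    """Cost of re-sending a long system prompt on every call."""
    return system_prompt_tokens * calls_per_month * input_price_per_mtok / 1_000_000


# A 3k-token system prompt at 1M calls/month:
overhead = monthly_prompt_overhead_usd(3_000, 1_000_000)

# Compare against a one-time (placeholder) fine-tuning cost:
training_cost = 2_000.0
months_to_break_even = training_cost / overhead
```

At those made-up numbers the prompt overhead alone is $9,000/month and fine-tuning pays for itself in under a month; at 10,000 calls/month the overhead is $90/month and it effectively never does. Volume, not sophistication, decides this.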
Hybrid Approaches
The most powerful production systems combine strategies. The combination that works most often is fine-tuning for style/format + RAG for knowledge:
The query rewriter is fine-tuned to expand and clarify user queries before retrieval (this dramatically improves recall). The response model is fine-tuned for your specific output format and tone. The retrieval layer stays updatable without retraining.
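The wiring is simple even though the components are not. A minimal sketch of that three-stage pipeline, where each injected callable stands in for a real model or retrieval call:

```python
from typing import Callable


def answer(
    query: str,
    rewrite: Callable[[str], str],              # fine-tuned query rewriter
    retrieve: Callable[[str], list[str]],       # RAG retrieval layer
    generate: Callable[[str, list[str]], str],  # fine-tuned response model
) -> str:
    """Hybrid pipeline: rewrite, then retrieve, then generate.

    Keeping the stages as injected callables means the retrieval
    layer can be swapped or re-indexed without touching either
    fine-tuned model.
    """
    expanded = rewrite(query)
    chunks = retrieve(expanded)
    return generate(expanded, chunks)
```

In tests you can exercise the pipeline with lambdas before any model exists, which is also the easiest way to validate each component independently.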
A simpler hybrid that is often overlooked: prompt caching + long context. If your knowledge base is 200k tokens and your provider supports prompt caching (Anthropic and OpenAI both do), you can stuff the entire knowledge base into a cached system prompt; after the first call, the cached prefix is billed at a steep discount, so subsequent calls pay full price only for the new input and output tokens.
Drift and Maintenance
Fine-tuned models drift in ways that are hard to detect. The training distribution shifts relative to production inputs. What worked at evaluation time starts failing six months later as your data changes.
Build eval suites that run on every model version and on a schedule against your fine-tuned models:
```python
def evaluate_strategy(strategy: str, eval_dataset: list[dict], model_fn) -> dict:
    # validate_format and semantic_similarity are project-specific helpers,
    # e.g. a JSON-schema check and an embedding cosine similarity.
    results = []
    for example in eval_dataset:
        predicted = model_fn(example["input"])
        results.append({
            "exact_match": predicted == example["expected"],
            "format_valid": validate_format(predicted, example["expected_format"]),
            "semantic_score": semantic_similarity(predicted, example["expected"]),
        })
    return {
        "strategy": strategy,
        "n": len(results),
        "exact_match_rate": sum(r["exact_match"] for r in results) / len(results),
        "format_valid_rate": sum(r["format_valid"] for r in results) / len(results),
        "avg_semantic_score": sum(r["semantic_score"] for r in results) / len(results),
    }
```

RAG systems drift too: when documents are updated, when the query distribution shifts, when the embedding model is changed. Build scheduled eval runs into your MLOps pipeline regardless of which strategy you choose.
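A minimal sketch of turning those scheduled runs into an alert: compare the latest eval metrics against a stored baseline and flag any metric that regressed by more than a tolerance. The 5% tolerance is an arbitrary starting point.

```python
def detect_drift(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return the metrics that dropped by more than `tolerance`.

    Run after each scheduled eval; a non-empty result should page
    a human, not silently land in a log file.
    """
    return [
        metric
        for metric, base_value in baseline.items()
        if base_value - current.get(metric, 0.0) > tolerance
    ]
```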
Key Takeaways
- Start with prompting; reach for RAG or fine-tuning only when you have confirmed that prompting cannot meet your requirements, not as a default first choice.
- RAG solves knowledge scale and freshness problems; fine-tuning solves behavioral consistency problems — they address different failure modes.
- Total cost of ownership for fine-tuning includes re-training cycles, eval time, and serving infrastructure — often 10–20x the apparent upfront cost.
- Prompt caching on long-context models is an underused middle path between full RAG and per-call context stuffing.
- Hybrid systems (fine-tuning for style + RAG for knowledge) are powerful but compound complexity — only go there after validating each component independently.
- Drift affects all three strategies; scheduled eval runs on production inputs are not optional if you care about reliability over time.