Building Production RAG

Failure Modes and Runbooks

Ravinder · 8 min read
RAG · AI · LLM · Reliability · Incident Response · MLOps

A RAG system in production will fail. Not if — when. The question is whether the failure is a 3-hour incident with a postmortem or a 15-minute recovery because someone wrote a runbook. Two years of running RAG systems at scale have taught me which failure modes actually recur, which symptoms surface them before users notice, and which resolution steps work at 2am when you're not at your sharpest.

The Failure Taxonomy

RAG failures fall into four categories with very different signatures:

flowchart TD
    F[RAG System Failure] --> IR[Infrastructure Failures\nFast, noisy, obvious]
    F --> QD[Quality Degradation\nSlow, silent, insidious]
    F --> DF[Data Failures\nSudden, scoped, confusing]
    F --> SC[Scale Failures\nGradual then sudden]
    IR --> IR1[Vector DB down]
    IR --> IR2[LLM API unavailable]
    IR --> IR3[Cache eviction storm]
    QD --> QD1[Embedding model version drift]
    QD --> QD2[Corpus staleness]
    QD --> QD3[Prompt regression]
    DF --> DF1[Bad document ingested]
    DF --> DF2[Index/document mismatch]
    DF --> DF3[Tenant data leak attempt]
    SC --> SC1[Vector DB index memory exhaustion]
    SC --> SC2[Embedding API rate limit]
    SC --> SC3[Re-ranker GPU saturation]

Incident 1: Vector Database Latency Spike

Symptom: p99 retrieval latency jumps from 20ms to 800ms+. Total query latency exceeds SLA. Users experience timeouts.

Cause: most often, the vector DB index has been evicted from memory (after a restart or OOM), forcing disk reads on every query. Less commonly: too many simultaneous writes competing with reads.

Detection: p99_retrieval_latency_ms > 500 alert triggers.

Resolution:

# Step 1: Check vector DB process health
curl http://vectordb-host:6333/healthz
 
# Step 2: Check if index is in memory (Qdrant example)
curl http://vectordb-host:6333/collections/your_collection | \
  python3 -c "import sys,json; d=json.load(sys.stdin); print(d['result']['optimizer_status'])"
 
# Step 3: Trigger index warming (Qdrant)
curl -X POST http://vectordb-host:6333/collections/your_collection/index \
  -H 'Content-Type: application/json' \
  -d '{"wait": true}'
 
# Step 4: While warming, enable fallback to BM25-only retrieval
# Set feature flag: VECTOR_SEARCH_ENABLED=false
# BM25 can serve queries at degraded quality while the vector index warms
# Fallback handler: graceful degradation to BM25 when vector DB is unhealthy
import httpx
from enum import Enum
 
class RetrievalMode(Enum):
    HYBRID = "hybrid"
    BM25_ONLY = "bm25_only"
    VECTOR_ONLY = "vector_only"
 
async def get_retrieval_mode() -> RetrievalMode:
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get("http://vectordb-host:6333/healthz")
            if resp.status_code == 200:
                return RetrievalMode.HYBRID
    except (httpx.TimeoutException, httpx.ConnectError):
        pass
    return RetrievalMode.BM25_ONLY
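
At query time, the retriever can branch on this mode so the BM25 fallback kicks in without waiting for a human. A minimal sketch, where hybrid_search and bm25_search are hypothetical stand-ins for your own retrievers:

# Degradation-aware retrieval entry point (hybrid_search / bm25_search
# are hypothetical stand-ins for your retrievers)
async def retrieve(query: str, top_k: int = 10) -> list[dict]:
    mode = await get_retrieval_mode()
    if mode == RetrievalMode.HYBRID:
        return await hybrid_search(query, top_k=top_k)
    # Degraded path: lexical-only results, but the system stays up
    return await bm25_search(query, top_k=top_k)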

Prevention: configure minimum RAM allocation to prevent index eviction; set up a scheduled warm-up job that runs after any restart.
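
The warm-up job doesn't need to be clever: replay a handful of representative queries after every restart so the index pages back into memory before user traffic arrives. A rough sketch, assuming a hypothetical vector_search helper and a query list you maintain:

import asyncio

# Hypothetical post-restart warm-up: replay representative queries so the
# HNSW index is paged into memory before real traffic hits it
WARMUP_QUERIES = ["how do I reset my password", "pricing tiers", "api rate limits"]

async def warm_index() -> None:
    for query in WARMUP_QUERIES:
        try:
            await vector_search(query, top_k=10)  # stand-in for your retriever
        except Exception as exc:
            print(f"warm-up query failed: {exc}")

if __name__ == "__main__":
    asyncio.run(warm_index())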


Incident 2: LLM API Rate Limit Cascade

Symptom: sudden spike in 429 errors from your LLM provider. Queries start failing or returning empty responses. Alert: error rate > 5%.

Cause: a batch job (re-indexing, offline evaluation) is running concurrently with user traffic and consuming token quota.

Resolution:

import time
import openai
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
 
@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
    reraise=True,
)
async def call_llm_with_backoff(client: openai.AsyncOpenAI, **kwargs) -> str:
    response = await client.chat.completions.create(**kwargs)
    return response.choices[0].message.content
 
# Separate token budget pools for user traffic vs. batch jobs
class TokenBudgetManager:
    def __init__(self, total_tpm: int):
        self._user_pool = int(total_tpm * 0.80)   # 80% reserved for user queries
        self._batch_pool = int(total_tpm * 0.20)  # 20% for batch jobs
        self._user_used = 0
        self._batch_used = 0
        self._window_start = time.monotonic()

    def _maybe_reset_window(self) -> None:
        # Approximate the provider's per-minute quota with a rolling 60s window
        if time.monotonic() - self._window_start > 60:
            self._user_used = 0
            self._batch_used = 0
            self._window_start = time.monotonic()

    def can_proceed(self, estimated_tokens: int, is_batch: bool = False) -> bool:
        self._maybe_reset_window()
        pool = self._batch_pool if is_batch else self._user_pool
        used = self._batch_used if is_batch else self._user_used
        return (used + estimated_tokens) < pool

    def record(self, tokens: int, is_batch: bool = False) -> None:
        # Call after each request so the window tracks actual consumption
        if is_batch:
            self._batch_used += tokens
        else:
            self._user_used += tokens
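
Wiring it in is two calls: can_proceed before the request, record after. A sketch with illustrative values (the model name, quota, and token estimate are examples, not recommendations):

# Example: gating a user-facing LLM call on the shared budget
budget = TokenBudgetManager(total_tpm=500_000)  # illustrative quota

async def answer(client: openai.AsyncOpenAI, prompt: str) -> str:
    estimated = len(prompt) // 4 + 500  # rough estimate: ~4 chars/token plus output headroom
    if not budget.can_proceed(estimated):
        raise RuntimeError("user token budget exhausted for this window")
    content = await call_llm_with_backoff(
        client,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    budget.record(estimated)
    return content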

Prevention: separate API keys for user traffic and batch jobs; configure rate limit alerts at 70% of quota, not 100%.


Incident 3: Silent Hallucination Surge After Prompt Change

Symptom: no infrastructure alerts. Latency is normal. But user satisfaction drops and support tickets mention "wrong answers." Faithfulness score in your eval dashboard drops from 0.91 to 0.73.

Cause: a prompt change intended to make responses more concise accidentally removed the "only answer based on the provided context" instruction.

Resolution: this is a rollback incident, not a fix-forward incident.

# Prompt versioning — every change is a new version, rollback is instant
PROMPT_REGISTRY = {
    "v1.0.0": {
        "system": "You are a helpful assistant. Answer only using the provided context. If the context does not contain the answer, say so explicitly.",
        "user_template": "Context:\n{context}\n\nQuestion: {query}",
        "deployed_at": "2025-01-15",
    },
    "v1.1.0": {
        "system": "You are a concise assistant. Answer using the context below.",
        # REGRESSION: removed "only" and "say so explicitly" — caused hallucination surge
        "user_template": "Context:\n{context}\n\nQuestion: {query}",
        "deployed_at": "2025-03-10",
    },
}
 
ACTIVE_PROMPT_VERSION = "v1.0.0"  # rolled back
 
def get_active_prompt() -> dict:
    return PROMPT_REGISTRY[ACTIVE_PROMPT_VERSION]

Prevention: version every prompt change; run generation eval (faithfulness score) before every prompt deploy; set a minimum faithfulness threshold in CI.
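
The CI gate itself can be a few lines: evaluate the candidate prompt against the golden set and block the deploy below a floor. A minimal sketch, where run_faithfulness_eval is a hypothetical stand-in for your eval harness and the threshold is an example:

import sys

FAITHFULNESS_FLOOR = 0.85  # example threshold; calibrate against your golden set

def gate_prompt_deploy(candidate_version: str) -> None:
    # run_faithfulness_eval is a hypothetical stand-in for your eval harness;
    # it should return the mean faithfulness score over the golden set
    score = run_faithfulness_eval(PROMPT_REGISTRY[candidate_version])
    if score < FAITHFULNESS_FLOOR:
        print(f"{candidate_version}: faithfulness {score:.2f} below floor -- blocking deploy")
        sys.exit(1)
    print(f"{candidate_version}: faithfulness {score:.2f} -- ok to deploy")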


Incident 4: Index-Document Mismatch After Partial Re-index

Symptom: users report answers from documents that were deleted or updated weeks ago. The retrieved chunks are stale.

Cause: a re-indexing job failed partway through due to an API timeout. Some documents were deleted from storage but their chunks were not removed from the vector index.

Resolution:

import asyncio
from typing import AsyncGenerator
 
async def audit_index_vs_storage(
    vector_index,
    document_store,
    batch_size: int = 1000,
) -> AsyncGenerator[dict, None]:
    """Yield chunks in the vector index that have no corresponding document."""
    offset = None
    while True:
        # Scroll through all indexed chunk IDs
        chunks, offset = vector_index.scroll(limit=batch_size, offset=offset)
        if not chunks:
            break
 
        doc_ids = {c["doc_id"] for c in chunks}
        existing_docs = await document_store.get_many(list(doc_ids))
        existing_doc_ids = {d["id"] for d in existing_docs}
 
        orphaned_chunk_ids = [
            c["id"] for c in chunks if c["doc_id"] not in existing_doc_ids
        ]
        if orphaned_chunk_ids:
            yield {"orphaned_chunk_ids": orphaned_chunk_ids, "count": len(orphaned_chunk_ids)}
 
        if offset is None:
            break
 
# Run this audit, then delete orphaned chunks
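
Putting the two steps together might look like this, assuming your vector DB client exposes a delete-by-ID call (the exact API varies by store):

async def repair_orphans(vector_index, document_store) -> int:
    """Delete every orphaned chunk found by the audit; returns count removed."""
    deleted = 0
    async for batch in audit_index_vs_storage(vector_index, document_store):
        # delete-by-ID is a stand-in; use your vector DB's actual delete API
        vector_index.delete(ids=batch["orphaned_chunk_ids"])
        deleted += batch["count"]
    return deleted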

Prevention: implement transactional indexing — write to the vector index and document store atomically, or use a job queue with at-least-once delivery and idempotent upserts.
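
Idempotent upserts are the cheaper half of that advice: derive each chunk ID deterministically from the document ID and chunk position, so a retried ingestion job overwrites in place instead of duplicating. A sketch:

import hashlib

def chunk_id(doc_id: str, chunk_index: int) -> str:
    # Deterministic ID: the same document + position always yields the same
    # chunk ID, so re-running a failed job upserts rather than duplicates
    return hashlib.sha256(f"{doc_id}:{chunk_index}".encode()).hexdigest()[:32]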


Incident 5: Memory Exhaustion from Index Growth

Symptom: vector DB OOM errors, container restarts. Happens 6 weeks after a large corpus expansion.

Cause: the index grew beyond the instance's RAM. Nobody tracked index size against available memory.

Resolution: immediate mitigation is to reduce ef_search (the HNSW search-time quality parameter), which cuts per-query memory and CPU pressure, then scale up the instance.

# Monitor index size and project when you'll hit memory limits
def project_memory_exhaustion(
    current_vectors: int,
    vector_dim: int,
    growth_per_day: int,
    available_ram_gb: float,
    overhead_factor: float = 2.5,  # HNSW graph overhead vs raw vector storage
) -> dict:
    bytes_per_vector = vector_dim * 4  # float32
    bytes_per_vector_with_index = bytes_per_vector * overhead_factor
    current_gb = (current_vectors * bytes_per_vector_with_index) / 1e9
    days_until_full = (available_ram_gb - current_gb) * 1e9 / (growth_per_day * bytes_per_vector_with_index)
    return {
        "current_index_gb": round(current_gb, 2),
        "available_ram_gb": available_ram_gb,
        "headroom_gb": round(available_ram_gb - current_gb, 2),
        "growth_rate_vectors_per_day": growth_per_day,
        "projected_days_to_exhaustion": round(max(0, days_until_full), 0),
    }
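
As a worked example, 5M vectors at 768 dimensions with the 2.5x HNSW overhead occupy roughly 38 GB, so a 64 GB instance ingesting 50,000 vectors a day has about 67 days of runway (illustrative numbers, not benchmarks):

# Worked example with illustrative numbers
print(project_memory_exhaustion(
    current_vectors=5_000_000,
    vector_dim=768,
    growth_per_day=50_000,
    available_ram_gb=64.0,
))
# {'current_index_gb': 38.4, 'available_ram_gb': 64.0, 'headroom_gb': 25.6,
#  'growth_rate_vectors_per_day': 50000, 'projected_days_to_exhaustion': 67.0}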

Prevention: run this projection weekly; set an alert at 70% RAM utilization; plan capacity upgrades 4 weeks in advance.


The On-Call Playbook

Keep this checklist accessible during incidents:

## RAG System On-Call Checklist
 
### First 5 minutes
[ ] Identify which layer is failing: infra / retrieval quality / generation quality / data
[ ] Check: Is there a deployment in the last 2 hours? (Most incidents are deploy-related)
[ ] Check: Is there an ongoing LLM provider status event? (status.openai.com / status.cohere.com)
[ ] Check: Is the vector DB process running and healthy?
 
### Retrieval quality drop (no infra alert)
[ ] Pull last 100 spans from log store, check median top-1 rerank score (see the triage sketch after this checklist)
[ ] Check if a corpus update ran in the last 24 hours
[ ] Check if embedding model or prompt version changed
[ ] If prompt changed: rollback to previous version
 
### Infrastructure failure
[ ] Enable BM25-only fallback (feature flag VECTOR_SEARCH_ENABLED=false)
[ ] Page vector DB or LLM provider if SLA breach is imminent
[ ] Document timeline in incident channel
 
### Post-incident
[ ] File postmortem within 48 hours
[ ] Add a new alert or test that would have caught this earlier
[ ] Add the failure scenario to the evaluation golden set
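
For the rerank-score check above, a small triage script saves fumbling at 2am. A sketch, assuming a hypothetical fetch_recent_spans that returns spans carrying a rerank_scores list per query:

import statistics

def median_top1_rerank(spans: list[dict]) -> float:
    # A falling median top-1 rerank score is the earliest quantitative
    # signal that retrieval quality is degrading
    top1 = [max(s["rerank_scores"]) for s in spans if s.get("rerank_scores")]
    return statistics.median(top1)

spans = fetch_recent_spans(limit=100)  # hypothetical trace/log store query
print(f"median top-1 rerank score: {median_top1_rerank(spans):.3f}")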

Key Takeaways

  • The most dangerous RAG failures are the silent ones — quality degradation that produces no infrastructure alerts but steadily erodes user trust; instrument your semantic layer, not just your servers.
  • Build graceful degradation for every external dependency: BM25-only fallback when the vector DB is unhealthy, cached responses when the LLM API is rate-limited.
  • Version every prompt change and run generation eval before deployment; prompt regressions are the most common cause of silent quality drops.
  • Implement transactional indexing from day one — partial re-index failures that leave the index and document store out of sync are a recurring incident cause.
  • Project memory exhaustion weekly using your current vector count, dimensionality, and growth rate; a 4-week runway for capacity upgrades is the minimum safe margin.
  • Every incident should produce at least one new alert rule or evaluation test case — your observability and evaluation coverage should improve with each failure.