
Building a Prompt Registry: Versioning, A/B Testing, and Rollback for LLM Prompts

Ravinder · 7 min read
AI · LLM · Prompts · DevOps

Three months into production, your customer-support LLM starts giving subtly wrong refund policy answers. You trace it back to a prompt edit made two weeks ago — but nobody remembers who changed it, what it used to say, or why the change was made. There is no diff. There is no rollback. There is just a Slack message from a panicked engineer asking "does anyone have the old prompt saved somewhere?"

This is what happens when prompts are treated as config strings instead of production code. The fix is a prompt registry.

What a Prompt Registry Actually Is

A prompt registry is a versioned, queryable store for prompt templates that provides the same guarantees you already expect from your application code: history, authorship, rollback, staged rollout, and observability.

It is not a fancy prompt editor. It is not a playground. It is infrastructure.

The registry has four core concerns, sketched as a client interface after this list:

  • Versioning: every change creates a new immutable version
  • Ownership: every prompt has an accountable team or individual
  • Experiment surface: A/B test prompt versions with traffic splitting
  • Rollback path: revert to any prior version in seconds, not hours
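
Concretely, those concerns reduce to a small client surface. Here is a minimal sketch of what that API could look like; the names (RegistryClient, get_active, deploy) are illustrative, not a specific library:

from typing import Protocol

class RegistryClient(Protocol):
    """Illustrative client surface for a prompt registry."""

    def create_version(self, slug: str, template: str, bump: str,
                       change_summary: str, created_by: str) -> str:
        """Append an immutable version; returns the new semver."""

    def get_active(self, slug: str, environment: str = "prod") -> str:
        """Fetch the template currently deployed to an environment."""

    def deploy(self, slug: str, semver: str, environment: str,
               deployed_by: str) -> None:
        """Point an environment at a specific existing version."""

    def rollback(self, slug: str, target_semver: str,
                 initiated_by: str) -> None:
        """Deploy pointed backwards: re-activate a prior version."""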

Schema Design

Start with a schema that captures everything you will eventually need to query. Migrating the schema later, once you have thousands of versions in production, is painful.

-- prompts table: one row per logical prompt
CREATE TABLE prompts (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  slug          TEXT UNIQUE NOT NULL,        -- e.g. "support/refund-policy"
  description   TEXT,
  owner_team    TEXT NOT NULL,
  owner_email   TEXT NOT NULL,
  created_at    TIMESTAMPTZ DEFAULT now(),
  archived_at   TIMESTAMPTZ
);
 
-- prompt_versions: immutable, append-only
CREATE TABLE prompt_versions (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  prompt_id       UUID REFERENCES prompts(id),
  semver          TEXT NOT NULL,             -- "1.3.2"
  template        TEXT NOT NULL,             -- the actual prompt text
  variables       JSONB NOT NULL DEFAULT '[]', -- ["customer_name", "order_id"]
  model_hint      TEXT,                      -- "gpt-4o", "claude-3-5-sonnet"
  max_tokens      INT,
  temperature     NUMERIC(3,2),
  created_by      TEXT NOT NULL,
  created_at      TIMESTAMPTZ DEFAULT now(),
  change_summary  TEXT NOT NULL,             -- why the change was made
  UNIQUE(prompt_id, semver)
);
 
-- active_deployments: which version is live per environment
CREATE TABLE active_deployments (
  prompt_id    UUID REFERENCES prompts(id),
  environment  TEXT NOT NULL,               -- "prod", "staging", "canary"
  version_id   UUID REFERENCES prompt_versions(id),
  deployed_at  TIMESTAMPTZ DEFAULT now(),
  deployed_by  TEXT NOT NULL,
  PRIMARY KEY (prompt_id, environment)
);

The change_summary field is NOT NULL deliberately. Engineers hate filling it in, but six months later they love having it.
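
The database constraint stops NULLs but not empty strings. A small guard at the API layer closes that gap; this is a sketch, and the minimum length is an arbitrary choice:

MIN_SUMMARY_LEN = 10  # arbitrary floor, tune to your team's tolerance

def validate_change_summary(change_summary: str) -> None:
    """Reject summaries that satisfy NOT NULL but say nothing."""
    if len(change_summary.strip()) < MIN_SUMMARY_LEN:
        raise ValueError(
            "change_summary must describe why the prompt changed "
            f"(need at least {MIN_SUMMARY_LEN} characters)"
        )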

Semver for Prompts

Semantic versioning maps cleanly to prompt changes once you define what each level means:

Bump    Meaning                                              Example
MAJOR   Output format changes (breaks downstream parsers)    JSON → Markdown, added/removed required fields
MINOR   Behavior change (same format, different output)      Tone shift, new instructions, persona change
PATCH   Wording fix, typo, whitespace cleanup                No behavioral change expected

This gives reviewers a signal about risk before they approve a PR. A PATCH bump needs a spot-check. A MAJOR bump needs a full eval run.
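
You can encode that mapping so CI knows what to demand before merge. A sketch; the check names are placeholders for whatever your eval pipeline actually runs:

# Placeholder check names; wire these to your real eval pipeline.
REQUIRED_CHECKS = {
    "major": ["full_eval_run", "downstream_parser_tests", "owner_approval"],
    "minor": ["targeted_eval_subset", "owner_approval"],
    "patch": ["spot_check"],
}

def checks_for_bump(bump: str) -> list[str]:
    """Gates a prompt PR must pass before merge, by bump type."""
    return REQUIRED_CHECKS[bump]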

Enforce this at the API layer:

import re
from enum import Enum
 
class BumpType(Enum):
    MAJOR = "major"
    MINOR = "minor"
    PATCH = "patch"
 
def next_version(current: str, bump: BumpType) -> str:
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", current)  # reject trailing garbage
    if not match:
        raise ValueError(f"Invalid semver: {current}")
    major, minor, patch = map(int, match.groups())
    if bump == BumpType.MAJOR:
        return f"{major + 1}.0.0"
    elif bump == BumpType.MINOR:
        return f"{major}.{minor + 1}.0"
    else:
        return f"{major}.{minor}.{patch + 1}"
 
def create_version(prompt_slug: str, template: str, bump: BumpType,
                   change_summary: str, created_by: str, db) -> dict:
    prompt = db.fetch_one("SELECT * FROM prompts WHERE slug = %s", [prompt_slug])
    if not prompt:
        raise ValueError(f"Unknown prompt slug: {prompt_slug}")
    latest = db.fetch_one(
        "SELECT semver FROM prompt_versions WHERE prompt_id = %s ORDER BY created_at DESC LIMIT 1",
        [prompt["id"]]
    )
    current_ver = latest["semver"] if latest else "0.0.0"
    new_ver = next_version(current_ver, bump)
 
    version = db.insert("prompt_versions", {
        "prompt_id": prompt["id"],
        "semver": new_ver,
        "template": template,
        "change_summary": change_summary,
        "created_by": created_by,
    })
    return version
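
A call then reads like a commit message. In this usage sketch the slug, summary, and author are made up for illustration; new_template and db come from your surrounding code:

version = create_version(
    prompt_slug="support/refund-policy",
    template=new_template,            # the edited prompt text
    bump=BumpType.MINOR,              # behavior change, same output format
    change_summary="Clarified the 30-day window for opened items",
    created_by="alice@example.com",
    db=db,
)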

The A/B Harness

Traffic splitting happens at the experiment layer, not in the registry schema. Keep them separate — the registry stores facts, experiments are ephemeral.

flowchart TD
    A[Application Request] --> B{Experiment Active?}
    B -- No --> C[Fetch active_deployments prod version]
    B -- Yes --> D{Hash user_id % 100}
    D -- < split_pct --> E[Fetch Variant B version]
    D -- >= split_pct --> F[Fetch Variant A version]
    C --> G[Render template with variables]
    E --> G
    F --> G
    G --> H[LLM API Call]
    H --> I[Log: prompt_version_id, user_id, latency, tokens]
    I --> J[Eval pipeline reads logs]
import hashlib
from dataclasses import dataclass
 
@dataclass
class Experiment:
    id: str
    prompt_id: str
    variant_a_version_id: str
    variant_b_version_id: str
    split_pct: int          # % of traffic going to variant B
    metric: str             # "thumbs_up_rate", "task_completion", "cost_per_session"
    active: bool
 
def resolve_prompt_version(prompt_slug: str, user_id: str,
                           db, experiment_store) -> tuple[dict, str]:
    prompt = db.fetch_one("SELECT id FROM prompts WHERE slug = %s", [prompt_slug])
    experiment = experiment_store.get_active(prompt["id"])
 
    if experiment:
        # Deterministic bucketing — same user always sees same variant
        bucket = int(hashlib.sha256(
            f"{experiment.id}:{user_id}".encode()
        ).hexdigest(), 16) % 100
        version_id = (
            experiment.variant_b_version_id
            if bucket < experiment.split_pct
            else experiment.variant_a_version_id
        )
    else:
        deployment = db.fetch_one(
            "SELECT version_id FROM active_deployments WHERE prompt_id = %s AND environment = 'prod'",
            [prompt["id"]]
        )
        version_id = deployment["version_id"]
 
    version = db.fetch_one(
        "SELECT template, variables FROM prompt_versions WHERE id = %s",
        [version_id]
    )
    return version, version_id

Log version_id on every LLM call. Without it, your eval results are useless.
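
A thin wrapper makes that log line hard to forget. This is a sketch assuming a generic llm_client.complete call and the standard logging module; substitute your own client and log sink:

import time
import logging

logger = logging.getLogger("llm_calls")

def call_llm(prompt_slug: str, user_id: str, variables: dict,
             llm_client, db, experiment_store) -> str:
    version, version_id = resolve_prompt_version(
        prompt_slug, user_id, db, experiment_store
    )
    rendered = version["template"].format(**variables)

    start = time.monotonic()
    response = llm_client.complete(rendered)  # hypothetical client call
    latency_ms = (time.monotonic() - start) * 1000

    # The field that makes eval attribution possible
    logger.info("llm_call", extra={
        "prompt_version_id": version_id,
        "user_id": user_id,
        "latency_ms": round(latency_ms, 1),
    })
    return response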

Rollback Procedure

Rollback is an UPDATE to active_deployments. The old version is still in the database — you are just pointing prod back at it.

def rollback(prompt_slug: str, target_semver: str,
             initiated_by: str, db, audit_log) -> None:
    prompt = db.fetch_one("SELECT id FROM prompts WHERE slug = %s", [prompt_slug])
    target = db.fetch_one(
        "SELECT id FROM prompt_versions WHERE prompt_id = %s AND semver = %s",
        [prompt["id"], target_semver]
    )
    if not target:
        raise ValueError(f"Version {target_semver} not found for {prompt_slug}")
 
    current = db.fetch_one(
        "SELECT version_id FROM active_deployments WHERE prompt_id = %s AND environment = 'prod'",
        [prompt["id"]]
    )
 
    db.execute(
        """UPDATE active_deployments
           SET version_id = %s, deployed_at = now(), deployed_by = %s
           WHERE prompt_id = %s AND environment = 'prod'""",
        [target["id"], initiated_by, prompt["id"]]
    )
 
    audit_log.record({
        "action": "rollback",
        "prompt_slug": prompt_slug,
        "from_version_id": current["version_id"],
        "to_semver": target_semver,
        "initiated_by": initiated_by,
    })

Write the audit log to a separate, append-only table or stream (Kafka, Kinesis). You will need it during incident reviews.
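
A minimal table-backed version, assuming Postgres and enforcing append-only writes by revoking UPDATE and DELETE from the application role (the table name and grant scheme are illustrative):

import json
from datetime import datetime, timezone

class AuditLog:
    """Append-only sink. Enforce immutability at the grant level:
    REVOKE UPDATE, DELETE ON audit_events FROM app_role;
    """

    def __init__(self, db):
        self.db = db

    def record(self, event: dict) -> None:
        self.db.execute(
            "INSERT INTO audit_events (recorded_at, payload) VALUES (%s, %s)",
            [datetime.now(timezone.utc), json.dumps(event)],
        )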

Owner Accountability

Ownership without teeth is theater. Wire the registry into your incident runbook:

  1. PagerDuty / OpsGenie routing: when an LLM-related alert fires, the on-call rule queries the registry for owner_team and routes accordingly.
  2. Change review gate: PRs that modify a prompt template in the registry automatically request a review from owner_email.
  3. SLO tracking per prompt: error rates and latency SLOs live on the prompt slug, not the endpoint. When support/refund-policy degrades, the owner gets the alert.
# Helper called from a GitHub Actions step — auto-request review on prompt changes
def get_owners_for_changed_prompts(changed_files: list[str], db) -> list[str]:
    owners = set()
    for file_path in changed_files:
        # Assume prompts are stored as files mirroring slug structure
        # e.g. prompts/support/refund-policy.yaml → slug "support/refund-policy"
        slug = file_path.removeprefix("prompts/").removesuffix(".yaml")
        prompt = db.fetch_one(
            "SELECT owner_email FROM prompts WHERE slug = %s", [slug]
        )
        if prompt:
            owners.add(prompt["owner_email"])
    return list(owners)

Key Takeaways

  • Prompts are production artifacts — they deserve the same versioning, ownership, and rollback guarantees as application code.
  • Semver bump types (major/minor/patch) give reviewers a risk signal before approving prompt changes.
  • Deterministic user bucketing makes A/B experiments reproducible: the same user sees the same variant for the lifetime of the experiment, not just within a session.
  • Log prompt_version_id on every LLM call — without it, you cannot attribute eval outcomes to specific prompt versions.
  • Rollback is an O(1) database update, not a deployment — design for it from day one.
  • Owner accountability requires integration with your alerting and review tooling, not just a column in a database table.