Testing MCP Servers: Contract Tests, Fixtures, and Replay Harnesses
Most MCP servers ship without a single test that actually verifies protocol compliance. I know because I've audited a dozen of them, and the pattern is always the same: a few unit tests on helper functions, maybe a smoke test that calls a tool and checks the return type, and then a vague promise that "we test it manually in Claude Desktop." That is not a test suite. That is wishful thinking dressed up in a CI badge.
The problem is that MCP servers have a layered contract. There is the JSON-RPC framing, the capability negotiation handshake, each tool's declared inputSchema, and finally the business logic inside the handler. Each layer can fail independently. A unit test on your handler doesn't tell you whether a real client can actually call it, because the client discovers your tool via tools/list and validates it against its own expectations before it ever sends tools/call. If your schema is wrong, the client silently skips the tool. You will never know from a handler unit test.
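To make the discovery step concrete, here is roughly what those two frames look like on the wire (abbreviated, with illustrative values):
// 1. The client reads the tool's name and inputSchema from the tools/list response:
{"jsonrpc":"2.0","id":2,"result":{"tools":[{"name":"search_files","description":"Search files by text","inputSchema":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}]}}
// 2. Only after validating against that schema does it send tools/call:
{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"search_files","arguments":{"query":"database connection"}}}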
This post builds a complete test strategy from the ground up: protocol contract tests that catch schema drift, fixture-based unit tests for fast feedback, and a replay harness that records real client sessions and plays them back in CI.
Why the Standard Test Pyramid Breaks for MCP
The classic test pyramid assumes you have a stable API contract enforced by something outside your tests — a type system, an OpenAPI validator, a generated client. MCP gives you a JSON schema for tool inputs, but nothing enforces that your server actually honors it at the transport layer. The schema is declared in tools/list, but the handler can accept anything it wants and nobody checks consistency.
This creates a specific failure mode: you change the handler to require a new field but forget to update the inputSchema in the tool registration. The schema still says the field is optional. A real client sends a request without it. Your handler throws. You blame the client.
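A minimal sketch of that drift, with hypothetical registerTool and search helpers standing in for whatever framework you use:
// Hypothetical stand-ins so the sketch is self-contained.
declare function registerTool(def: {
  name: string;
  inputSchema: object;
  handler: (args: { query: string; maxResults?: number }) => Promise<unknown>;
}): void;
declare function search(query: string, limit: number): Promise<unknown>;

registerTool({
  name: "search_files",
  inputSchema: {
    type: "object",
    properties: { query: { type: "string" }, maxResults: { type: "number" } },
    required: ["query"], // nobody remembered to add "maxResults" here
  },
  handler: async ({ query, maxResults }) => {
    // The handler was updated to demand the field the schema calls optional:
    if (typeof maxResults !== "number") {
      throw new Error("maxResults is required"); // real clients hit this
    }
    return search(query, maxResults);
  },
});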
The fix is to treat the inputSchema declaration itself as the thing under test, not an afterthought.
Setting Up the Test Environment
Before writing a single test, you need a transport-level client that speaks JSON-RPC over stdio or SSE. Don't use the MCP SDK's high-level Client for contract tests — it abstracts too much. You want to send raw JSON-RPC frames and inspect raw responses.
// test-utils/raw-client.ts
import { spawn, ChildProcess } from "node:child_process";
import { createInterface } from "node:readline";
interface JsonRpcRequest {
jsonrpc: "2.0";
id: number | string;
method: string;
params?: unknown;
}
interface JsonRpcResponse {
jsonrpc: "2.0";
id: number | string;
result?: unknown;
error?: { code: number; message: string; data?: unknown };
}
export class RawMcpClient {
private proc: ChildProcess;
private pending = new Map<number | string, (r: JsonRpcResponse) => void>();
private seq = 1;
constructor(command: string, args: string[] = []) {
this.proc = spawn(command, args, {
stdio: ["pipe", "pipe", "inherit"],
env: { ...process.env, NODE_ENV: "test" },
});
const rl = createInterface({ input: this.proc.stdout! });
rl.on("line", (line) => {
try {
const msg = JSON.parse(line) as JsonRpcResponse;
const resolve = this.pending.get(msg.id);
if (resolve) {
this.pending.delete(msg.id);
resolve(msg);
}
} catch {
// notifications or malformed — ignore in contract tests
}
});
}
send(method: string, params?: unknown): Promise<JsonRpcResponse> {
return new Promise((resolve) => {
const id = this.seq++;
const req: JsonRpcRequest = { jsonrpc: "2.0", id, method, params };
this.pending.set(id, resolve);
this.proc.stdin!.write(JSON.stringify(req) + "\n");
});
}
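  // Notifications carry no id and receive no response; send them
  // fire-and-forget rather than through send(), which assigns an id and
  // registers a pending request. (Used below for notifications/initialized.)
  notify(method: string, params?: unknown): void {
    this.proc.stdin!.write(JSON.stringify({ jsonrpc: "2.0", method, params }) + "\n");
  }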
async close(): Promise<void> {
this.proc.stdin!.end();
await new Promise((r) => this.proc.on("close", r));
}
}
This gives you a client that operates at exactly the same level as a real MCP host. You send text lines, you get text lines back. No SDK magic hiding the protocol.
Layer 1: Protocol Contract Tests
Contract tests verify that your server speaks valid MCP — correct JSON-RPC framing, correct capability advertisement, and correct tool list shape. These tests should run in under two seconds and never touch a database or external service.
// tests/contract/initialize.test.ts
import { describe, it, beforeEach, afterEach, expect } from "vitest";
import { RawMcpClient } from "../test-utils/raw-client";
describe("MCP initialize handshake", () => {
let client: RawMcpClient;
beforeEach(() => {
client = new RawMcpClient("node", ["dist/server.js"]);
});
afterEach(() => client.close());
it("responds with 2.0 protocol version", async () => {
const resp = await client.send("initialize", {
protocolVersion: "2024-11-05",
capabilities: {},
clientInfo: { name: "test", version: "0.0.1" },
});
expect(resp.error).toBeUndefined();
expect(resp.result).toMatchObject({
protocolVersion: expect.stringMatching(/^\d{4}-\d{2}-\d{2}$/),
capabilities: expect.any(Object),
serverInfo: {
name: expect.any(String),
version: expect.any(String),
},
});
});
it("tools/list returns array with required shape", async () => {
await client.send("initialize", {
protocolVersion: "2024-11-05",
capabilities: {},
clientInfo: { name: "test", version: "0.0.1" },
});
await client.send("notifications/initialized", {});
const resp = await client.send("tools/list", {});
expect(resp.error).toBeUndefined();
const tools = (resp.result as any).tools as unknown[];
expect(Array.isArray(tools)).toBe(true);
expect(tools.length).toBeGreaterThan(0);
for (const tool of tools) {
expect(tool).toMatchObject({
name: expect.any(String),
description: expect.any(String),
inputSchema: {
type: "object",
properties: expect.any(Object),
},
});
}
});
});
Layer 2: Schema Validation Tests — The Critical Layer
This is where most projects fall short. You need to verify that the inputSchema your server advertises is actually what the handler enforces. The approach is to extract the schema from tools/list at test time and then validate your fixture inputs against it using a JSON Schema validator. Any mismatch is a contract violation.
// tests/contract/schema-consistency.test.ts
import Ajv from "ajv";
import { describe, it, beforeAll, afterAll, expect } from "vitest";
import { RawMcpClient } from "../test-utils/raw-client";
const ajv = new Ajv({ strict: false });
describe("tool schema consistency", () => {
let client: RawMcpClient;
let toolSchemas: Record<string, object> = {};
beforeAll(async () => {
client = new RawMcpClient("node", ["dist/server.js"]);
await client.send("initialize", {
protocolVersion: "2024-11-05",
capabilities: {},
clientInfo: { name: "test", version: "0.0.1" },
});
await client.send("notifications/initialized", {});
const resp = await client.send("tools/list", {});
const tools = (resp.result as any).tools as Array<{
name: string;
inputSchema: object;
}>;
for (const t of tools) {
toolSchemas[t.name] = t.inputSchema;
}
});
afterAll(() => client.close());
it("search_files: valid input passes schema", () => {
const schema = toolSchemas["search_files"];
const validate = ajv.compile(schema);
const valid = validate({ query: "hello", maxResults: 10 });
expect(valid).toBe(true);
});
it("search_files: missing required field fails schema", () => {
const schema = toolSchemas["search_files"];
const validate = ajv.compile(schema);
const valid = validate({ maxResults: 10 }); // missing query
expect(valid).toBe(false);
});
});
Run these schema tests in your pre-commit hook. The one-time server spawn in beforeAll dominates the cost; the validations themselves take milliseconds, and they catch drift before it ever reaches a real agent.
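A minimal hook sketch (plain git hooks shown here; the paths assume this post's layout, and husky or lefthook work just as well):
#!/bin/sh
# .git/hooks/pre-commit — build first so the schema tests exercise dist/server.js
# (remember: chmod +x .git/hooks/pre-commit)
npm run build --silent || exit 1
npx vitest run tests/contract/schema-consistency.test.ts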
Layer 3: Fixture-Based Handler Tests
Now you test business logic with injected fixtures. The key is that fixtures are stable, checked-in data (JSON or, as below, plain TypeScript objects) that represent realistic rather than toy inputs. Rotate them quarterly so they keep pace with what production data actually looks like.
// tests/fixtures/search_files.fixtures.ts
export const searchFixtures = {
basic_text_search: {
input: { query: "database connection", maxResults: 5 },
expected: {
content: [{ type: "text" }],
isError: false,
},
},
empty_results: {
input: { query: "xyzzy_nonexistent_8472", maxResults: 5 },
expected: {
content: [{ type: "text" }],
isError: false,
},
},
oversized_request: {
input: { query: "a", maxResults: 10000 },
expected: { isError: true },
},
};
// tests/handlers/search-files.test.ts
import { describe, it, expect, vi } from "vitest";
import { searchFilesHandler } from "../../src/handlers/search-files";
import { searchFixtures } from "../fixtures/search_files.fixtures";
// Inject a deterministic file system adapter
const fakeFs = {
search: vi.fn().mockResolvedValue([
{ path: "/src/db.ts", snippet: "database connection pool", score: 0.92 },
]),
};
describe("searchFilesHandler", () => {
it("basic text search returns content array", async () => {
const result = await searchFilesHandler(
searchFixtures.basic_text_search.input,
{ fs: fakeFs }
);
expect(result.isError).toBe(false);
expect(result.content.length).toBeGreaterThan(0);
expect(result.content[0].type).toBe("text");
});
it("oversized maxResults returns error result", async () => {
const result = await searchFilesHandler(
searchFixtures.oversized_request.input,
{ fs: fakeFs }
);
expect(result.isError).toBe(true);
});
});
Notice that isError: true in the result is different from throwing an exception. MCP handlers should almost never throw; they should return error content. A throw surfaces as a JSON-RPC error (or crashes the server entirely), and clients treat that as a protocol failure instead of passing the error text back to the model, which could otherwise read it and recover.
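Here is one way the handler might implement that rule. This is a sketch, not the post's actual searchFilesHandler, and the 1000 cap is an assumption chosen to fail the oversized_request fixture:
export async function searchFilesHandler(
  input: { query: string; maxResults: number },
  deps: { fs: { search(q: string, n: number): Promise<Array<{ path: string; snippet: string }>> } }
) {
  if (input.maxResults > 1000) {
    // Expected failure: report it in-band so the client (and the model
    // driving it) can read the message and retry with a smaller value.
    return {
      isError: true,
      content: [{ type: "text", text: "maxResults must be 1000 or less" }],
    };
  }
  const hits = await deps.fs.search(input.query, input.maxResults);
  return {
    isError: false,
    content: [{ type: "text", text: hits.map((h) => `${h.path}: ${h.snippet}`).join("\n") }],
  };
}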
Layer 4: The Replay Harness
The replay harness is the most valuable tool in this stack and the least commonly built. The idea: record a real session between Claude Desktop (or any MCP client) and your server, capture every JSON-RPC frame in both directions, and save it as a fixture. In CI, replay the client side of that recording against a fresh server instance and assert that responses match within a defined tolerance.
// tools/record-session.ts — run this locally to capture sessions
import { createWriteStream } from "node:fs";
import { spawn } from "node:child_process";
import { createInterface } from "node:readline";
const logFile = createWriteStream(`session-${Date.now()}.ndjson`);
function record(direction: "C>S" | "S>C", line: string) {
logFile.write(JSON.stringify({ direction, ts: Date.now(), line }) + "\n");
}
// Intercept between the real client (stdio) and your server process
const server = spawn("node", ["dist/server.js"], { stdio: ["pipe", "pipe", "inherit"] });
// Stdin → server (client-to-server)
const stdinRl = createInterface({ input: process.stdin });
stdinRl.on("line", (line) => {
record("C>S", line);
server.stdin!.write(line + "\n");
});
// Server → stdout (server-to-client)
const serverRl = createInterface({ input: server.stdout! });
serverRl.on("line", (line) => {
record("S>C", line);
process.stdout.write(line + "\n");
});
// tests/replay/replay.test.ts
import { readFileSync } from "node:fs";
import { describe, it, expect } from "vitest";
import { RawMcpClient } from "../test-utils/raw-client";
interface Frame {
direction: "C>S" | "S>C";
ts: number;
line: string;
}
async function replaySession(sessionFile: string) {
  const frames: Frame[] = readFileSync(sessionFile, "utf-8")
    .trim()
    .split("\n")
    .map((l) => JSON.parse(l));
  const client = new RawMcpClient("node", ["dist/server.js"]);
  const failures: string[] = [];
  // Index recorded responses by request id. Pairing by array position breaks
  // as soon as the session contains notifications, which never get a response.
  const recorded = new Map<number | string, { error?: { message: string } }>();
  for (const f of frames) {
    if (f.direction !== "S>C") continue;
    const msg = JSON.parse(f.line);
    if (msg.id !== undefined) recorded.set(msg.id, msg);
  }
  for (const f of frames) {
    if (f.direction !== "C>S") continue;
    const req = JSON.parse(f.line);
    if (req.id === undefined) {
      // Notifications are fire-and-forget on replay too.
      client.notify(req.method, req.params);
      continue;
    }
    const expected = recorded.get(req.id) ?? {};
    const actual = await client.send(req.method, req.params);
    // For non-deterministic fields (timestamps, ids), compare structure not values
    if (actual.error && !expected.error) {
      failures.push(`Request ${req.id}: got error "${actual.error.message}", expected success`);
    }
    if (!actual.error && expected.error) {
      failures.push(`Request ${req.id}: expected error but got success`);
    }
  }
  await client.close();
  return failures;
}
describe("session replay", () => {
it("replays recorded session without new errors", async () => {
const failures = await replaySession("tests/replay/fixtures/baseline-session.ndjson");
expect(failures).toEqual([]);
});
});
Record a new baseline session whenever you intentionally change behavior. Treat session fixtures like migration files — they document what changed and when.
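The replay above compares only error versus success. When you want to diff full results, normalize non-deterministic fields first; a possible helper (the volatile key names are assumptions about your payloads):
// Replace values that legitimately differ between runs before deep-comparing.
const VOLATILE_KEYS = new Set(["timestamp", "requestId", "durationMs"]);

function normalizeResult(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(normalizeResult);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) =>
        VOLATILE_KEYS.has(k) ? [k, "<volatile>"] : [k, normalizeResult(v)]
      )
    );
  }
  return value;
}

// Then, inside the replay loop:
// expect(normalizeResult(actual.result)).toEqual(normalizeResult(expected.result));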
Wiring It Into CI
# .github/workflows/mcp-tests.yml
name: MCP Server Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
- run: npm ci
- run: npm run build
- name: Contract tests (no external deps)
run: npx vitest run tests/contract
- name: Handler unit tests
run: npx vitest run tests/handlers
- name: Replay tests
        run: npx vitest run tests/replay
Keep contract tests in a separate Vitest workspace config so they can run in parallel without sharing state. Each contract test spawns its own server process — they're not cheap to start but they're honest.
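A sketch of that split using Vitest's defineWorkspace (project names and globs are mine, not from this repo):
// vitest.workspace.ts
import { defineWorkspace } from "vitest/config";

export default defineWorkspace([
  // These suites spawn real server processes; keep them in their own projects.
  { test: { name: "contract", include: ["tests/contract/**/*.test.ts"] } },
  { test: { name: "replay", include: ["tests/replay/**/*.test.ts"] } },
  // Handler tests are pure functions plus fixtures and can run anywhere.
  { test: { name: "handlers", include: ["tests/handlers/**/*.test.ts"] } },
]);
With that in place, npx vitest run --project contract runs just the contract suite.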
Key Takeaways
- The inputSchema in tools/list is a first-class contract, not documentation — test it against real inputs using a JSON Schema validator.
- Build a raw JSON-RPC client for contract tests, not an SDK client; abstraction hides the protocol details you need to verify.
- MCP handlers should return { isError: true, content: [...] } for expected failures — throwing exceptions is a protocol violation.
- Record real client sessions and check them into version control as regression fixtures; replay them in CI to catch subtle behavior changes.
- Run contract tests in pre-commit hooks — they start a real server process but finish in seconds and catch schema drift before it ships.
- A four-layer pyramid (protocol contract → schema consistency → handler unit → replay) gives you confidence at every level without the overhead of end-to-end browser automation.