MCP Deep Dive

Testing and Contract Verification

Ravinder · 5 min read
MCP · Model Context Protocol · AI · Testing · Contract Testing · TypeScript

Most MCP servers ship with zero tests. The author calls the tool manually, it returns something that looks right, and it goes to production. Three weeks later an agent calls the tool with an unexpected argument combination and the server throws an unhandled exception that bubbles up as a cryptic model error. The fix takes a day. The test would have taken an hour.

Testing MCP servers is not exotic. The protocol is deterministic JSON-RPC — that means every interaction is inspectable, recordable, and replayable. You just need to know where to draw the test boundaries.
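To make "inspectable and replayable" concrete, here is the shape of a single tools/call exchange, sketched as TypeScript object literals. The envelope follows JSON-RPC 2.0; the tool name and field values are illustrative, matching the examples later in this article.

```typescript
// A tools/call request and its paired response. The `id` links them: any
// exchange you capture from a transport can be replayed and diffed later.
const request = {
  jsonrpc: "2.0",
  id: 7,
  method: "tools/call",
  params: {
    name: "create_task",
    arguments: { project_id: "P-1", title: "Fix login bug" }
  }
};

const response = {
  jsonrpc: "2.0",
  id: 7, // must echo the request id
  result: {
    content: [{ type: "text", text: "Created task T-42" }]
  }
};
```

Because every message has this deterministic structure, "record and replay" is just saving and re-sending JSON.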

Three Testing Layers

graph TD
    L1["Layer 1: Unit tests\nAdapter / business logic in isolation"]
    L2["Layer 2: Protocol tests\nJSON-RPC request → expected response"]
    L3["Layer 3: Contract tests\nServer advertised schema vs actual handler behaviour"]
    L1 --> L2
    L2 --> L3

Each layer catches different failure modes. Unit tests catch logic bugs. Protocol tests catch shape mismatches. Contract tests catch drift between what tools/list advertises and what tools/call actually accepts.

Layer 1: Unit Testing Handlers

Extract business logic from the MCP handler. Test the logic directly — no transport, no server, no JSON-RPC overhead.

// src/handlers/__tests__/tools.test.ts
import { describe, it, expect, vi } from "vitest";
import { taskflow } from "../../lib/taskflow-api.js";
import { handleCreateTask } from "../tools.js"; // pure function, not the handler
 
vi.mock("../../lib/taskflow-api.js", () => ({
  taskflow: {
    createTask: vi.fn().mockResolvedValue({ id: "T-42", title: "Fix login bug", status: "todo" })
  }
}));
 
describe("handleCreateTask", () => {
  it("creates a task and returns its ID", async () => {
    const result = await handleCreateTask({ project_id: "P-1", title: "Fix login bug" });
    expect(result.content[0].text).toContain("T-42");
    expect(taskflow.createTask).toHaveBeenCalledWith(expect.objectContaining({ title: "Fix login bug" }));
  });
 
  it("rejects missing required fields", async () => {
    await expect(handleCreateTask({ project_id: "P-1" } as any)).rejects.toThrow();
  });
});

The key refactor is that your tool handler calls a pure handle* function that you can test without standing up any server infrastructure.
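What that pure function might look like is sketched below. This is an assumption about the handler's internals, not the article's actual code: the TaskFlow client is injected as a defaulted parameter so unit tests can pass a stub directly, as an alternative to the vi.mock approach shown above.

```typescript
// Hypothetical shape of the pure handler. No SDK types, no transport: just
// arguments in, MCP-style content out.
interface TaskApi {
  createTask(args: { project_id: string; title: string }): Promise<{ id: string; title: string }>;
}

// Placeholder default standing in for the real taskflow client import.
const defaultApi: TaskApi = {
  createTask: async (a) => ({ id: "T-0", title: a.title })
};

export async function handleCreateTask(
  args: { project_id: string; title: string },
  api: TaskApi = defaultApi
): Promise<{ content: { type: "text"; text: string }[] }> {
  // Validate before touching the backend: this is what the "rejects missing
  // required fields" unit test exercises.
  if (!args.project_id || !args.title) {
    throw new Error("project_id and title are required");
  }
  const task = await api.createTask(args);
  return { content: [{ type: "text", text: `Created task ${task.id}: ${task.title}` }] };
}
```

The registered MCP tool handler then becomes a one-line wrapper around this function, so the SDK wiring carries no logic worth testing.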

Layer 2: Protocol-Level Tests

Here we exercise the actual JSON-RPC layer using an in-process client connected via a memory transport. The MCP TypeScript SDK provides InMemoryTransport for exactly this purpose.

// src/__tests__/protocol.test.ts
import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { InMemoryTransport } from "@modelcontextprotocol/sdk/inMemory.js";
import { registerTools } from "../handlers/tools.js";
 
let client: Client;
 
beforeAll(async () => {
  const server = new McpServer({ name: "test-server", version: "1.0.0" });
  registerTools(server);
 
  const [clientTransport, serverTransport] = InMemoryTransport.createLinkedPair();
  client = new Client({ name: "test-client", version: "1.0.0" });
  await Promise.all([
    server.connect(serverTransport),
    client.connect(clientTransport)
  ]);
});
 
afterAll(() => client.close());
 
describe("tools protocol", () => {
  it("lists expected tools", async () => {
    const { tools } = await client.listTools();
    const names = tools.map(t => t.name);
    expect(names).toContain("create_task");
    expect(names).toContain("update_task_status");
  });
 
  it("create_task returns a task ID", async () => {
    const result = await client.callTool({
      name: "create_task",
      arguments: { project_id: "P-1", title: "Test task" }
    });
    expect(result.content[0]).toMatchObject({ type: "text" });
    expect((result.content[0] as any).text).toMatch(/T-\d+/);
  });
 
  it("unknown tool returns an error", async () => {
    await expect(
      client.callTool({ name: "nonexistent_tool", arguments: {} })
    ).rejects.toThrow();
  });
});

No HTTP, no port, no process — just the protocol messages flowing through an in-memory pipe.

Layer 3: Contract Verification

Contract tests ensure the JSON Schema advertised in tools/list matches what the handler actually accepts and returns. Schema drift is insidious: you update the handler but forget to update the schema declaration, or vice versa.

// src/__tests__/contracts.test.ts
import Ajv from "ajv";
import { describe, it, expect } from "vitest";
import { client } from "./setup.js"; // shared in-process client, wired up as in Layer 2
 
const ajv = new Ajv({ allErrors: true });
 
describe("tool schema contracts", () => {
  it("all advertised tools have valid input schemas", async () => {
    const { tools } = await client.listTools();
    for (const tool of tools) {
      expect(() => ajv.compile(tool.inputSchema), `${tool.name} has invalid JSON Schema`).not.toThrow();
    }
  });
 
  it("create_task rejects extra unknown properties", async () => {
    await expect(
      client.callTool({
        name: "create_task",
        arguments: { project_id: "P-1", title: "T", totally_unknown_field: true }
      })
    ).rejects.toThrow();
  });
 
  it("create_task enforces required fields", async () => {
    await expect(
      client.callTool({ name: "create_task", arguments: { project_id: "P-1" } })
    ).rejects.toThrow(/title/i);
  });
});
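For the extra-property test above to fail the way you want, the advertised schema must actually forbid unknown keys — in JSON Schema terms, `additionalProperties: false` (or `.strict()` if you build schemas with zod). A minimal hand-rolled check, purely illustrative of what that setting enforces (real servers would rely on Ajv or the SDK's own validation):

```typescript
// Illustrative validator mirroring `required` + `additionalProperties: false`.
type MiniSchema = { required: string[]; properties: Record<string, unknown> };

export function validateArgs(schema: MiniSchema, args: Record<string, unknown>): string[] {
  const errors: string[] = [];
  // `required` check: every listed key must be present.
  for (const key of schema.required) {
    if (!(key in args)) errors.push(`missing required field: ${key}`);
  }
  // `additionalProperties: false` check: no key outside `properties` allowed.
  for (const key of Object.keys(args)) {
    if (!(key in schema.properties)) errors.push(`unknown field: ${key}`);
  }
  return errors;
}
```

If your handler silently strips unknown keys instead of rejecting them, the contract test above will catch the mismatch between schema and behaviour.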

Fixture Replay for External Dependencies

When your tools call external APIs or legacy systems, use fixture replay rather than live calls in CI. Record a real response once, commit it, replay it in every test run.

// src/__tests__/fixtures/taskflow-create-task.json
{
  "id": "T-42",
  "title": "Fix login bug",
  "status": "todo",
  "priority": "medium",
  "created_at": "2025-04-01T10:00:00Z"
}
// In your test setup
import { readFileSync } from "node:fs";
import { vi } from "vitest";
import * as api from "../../lib/taskflow-api.js";
 
const fixture = JSON.parse(
  readFileSync(new URL("./fixtures/taskflow-create-task.json", import.meta.url), "utf8")
);
 
vi.spyOn(api.taskflow, "createTask").mockResolvedValue(fixture);

The replay strategy for legacy systems (SOAP/CLI) follows the same pattern — capture the raw response in a fixture file, mock the transport, assert on the parsed result.
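A sketch of that pattern for a CLI-backed tool, using a hypothetical `parseCliOutput` helper: the raw stdout was captured once (shown inline here; in practice it would live in a committed fixture file), and tests exercise the parser without ever shelling out.

```typescript
// Captured stdout of a hypothetical `taskcli show T-42` invocation, committed
// as a fixture so CI never needs the binary.
const fixtureStdout = `ID: T-42
TITLE: Fix login bug
STATUS: todo`;

// Parse "KEY: value" lines into a record with lowercased keys.
export function parseCliOutput(raw: string): Record<string, string> {
  const record: Record<string, string> = {};
  for (const line of raw.split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines with no key/value separator
    record[line.slice(0, idx).trim().toLowerCase()] = line.slice(idx + 1).trim();
  }
  return record;
}
```

Assertions then run against the parsed result, exactly as they would for the JSON fixture above.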

CI Pipeline Integration

flowchart LR
    PR["Pull Request"] --> Unit["Unit tests\nvitest --run"]
    Unit --> Protocol["Protocol tests\nvitest --run"]
    Protocol --> Contract["Contract tests\nvitest --run"]
    Contract --> Lint["Schema lint\najv compile"]
    Lint --> Gate["Merge gate\npasses / fails"]

A complete package.json test script:

{
  "scripts": {
    "test":          "vitest --run",
    "test:watch":    "vitest",
    "test:coverage": "vitest --run --coverage"
  }
}

Run npm test in CI. No special MCP test runner needed — standard unit test tooling works because the in-memory transport removes all network dependencies.

Key Takeaways

  • Three testing layers cover different failure modes: unit (logic), protocol (JSON-RPC shape), contract (schema drift).
  • InMemoryTransport from the MCP SDK enables full protocol-level testing with no HTTP or ports.
  • Extract handler logic into pure functions so unit tests do not need to boot a server.
  • Contract tests comparing tools/list schemas against actual handler behaviour catch the most common production bugs.
  • Fixture replay makes legacy-adapter tests deterministic and runnable offline in CI.
  • Standard test runners (Vitest, Jest, pytest) work without modification — MCP adds no special testing infrastructure.