MCP Deep Dive

Observability and Tracing

Ravinder · 5 min read
MCP · Model Context Protocol · AI · Observability · OpenTelemetry · Tracing · Monitoring

When an agent calls your MCP tool and something goes wrong, you need to answer: which user session triggered this, which model call originated it, what did the server do, and how long did each step take? Without a trace ID that crosses the agent boundary, you are looking at disconnected log entries in three different systems and guessing at causality. This post covers the instrumentation pattern that makes agent interactions as traceable as any other distributed system.

The Observability Gap at the Agent Boundary

Traditional distributed tracing assumes a service calls another service over HTTP, propagating a trace context header. MCP introduces a new boundary: an AI model reasoning step that produces a tools/call request. The model itself is not instrumentable the way a microservice is. But the boundary between the model and the MCP server is — that is where your trace starts.

sequenceDiagram
  participant User
  participant Agent as Agent / Client
  participant LLM as Language Model
  participant MCP as MCP Server
  participant DB as Downstream Systems
  User ->> Agent: Request (user-session-id: abc)
  Agent ->> LLM: Messages + tool definitions
  LLM -->> Agent: tool_call: get_customer {id: 123}
  Note over Agent: Generate trace-id, start root span
  Agent ->> MCP: POST /mcp (traceparent: 00-{trace-id}-{span-id}-01)
  Note over MCP: Extract context, start child span
  MCP ->> DB: Query (trace propagated)
  DB -->> MCP: Result
  MCP -->> Agent: Tool result
  Agent -->> User: Answer

The agent generates a traceparent header (W3C Trace Context format) before the MCP call. The server extracts it, starts a child span, and propagates it to any downstream calls.
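For reference, a traceparent value is four dash-separated lowercase-hex fields: version, 128-bit trace ID, 64-bit parent span ID, and flags. The sketch below shows the shape of the header; it is illustrative only — in practice the OpenTelemetry SDK builds and parses this header for you via `propagation.inject` and `propagation.extract`, and a spec-compliant parser is more lenient about future versions than this simplified regex.

```typescript
// Simplified sketch of the W3C traceparent format: version-traceid-spanid-flags.
// A real parser should tolerate versions above "00"; this one only accepts "00".
function buildTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

function parseTraceparent(
  header: string
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null; // malformed — caller should start a fresh trace
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 0x01) === 1 };
}

const header = buildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
// header === "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

The trailing flags byte carries the sampled bit: `01` means the upstream party recorded this trace, so the server should record its child spans too.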

Instrumentation: Server Side

Use the OpenTelemetry SDK. Add it once at startup and all subsequent spans inherit the context automatically.

// src/telemetry.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";
 
export function initTelemetry() {
  const sdk = new NodeSDK({
    resource: new Resource({ [ATTR_SERVICE_NAME]: "taskflow-mcp" }),
    traceExporter: new OTLPTraceExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces"
    })
  });
  sdk.start();
  process.on("SIGTERM", () => sdk.shutdown());
}
// src/server.ts — import telemetry before anything else
import "./telemetry.js";  // must be first so instrumentation hooks are in place
import { trace, context, propagation } from "@opentelemetry/api";
import express from "express";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import { server } from "./mcp-server.js";  // your MCP Server instance (module path illustrative)
 
const tracer = trace.getTracer("taskflow-mcp");
const app = express();
app.use(express.json());
 
app.post("/mcp", async (req, res) => {
  // Extract the W3C traceparent from incoming headers; if absent, a new root trace starts here
  const parentContext = propagation.extract(context.active(), req.headers);
 
  await context.with(parentContext, async () => {
    const span = tracer.startSpan("mcp.request");
    try {
      const transport = new StreamableHTTPServerTransport({ sessionIdGenerator: undefined });
      await server.connect(transport);
      await transport.handleRequest(req, res, req.body);
    } finally {
      span.end();
    }
  });
});

Instrumentation: Tool Handlers

Wrap each tool call in its own span so you can see per-tool latency.

// src/handlers/tools.ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
 
const tracer = trace.getTracer("taskflow-mcp");
 
server.setRequestHandler(CallToolRequestSchema, async (req) => {
  const span = tracer.startSpan(`tool.${req.params.name}`, {
    attributes: {
      "mcp.tool.name":     req.params.name,
      "mcp.tool.args_len": JSON.stringify(req.params.arguments ?? {}).length
    }
  });
 
  try {
    const result = await dispatchTool(req.params.name, req.params.arguments);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
    throw err;
  } finally {
    span.end();
  }
});

Structured Logging with Trace Correlation

Logs are useful but only if you can join them to a trace. Use the trace and span IDs as log fields.

import { trace } from "@opentelemetry/api";
import pino from "pino";
 
const logger = pino({ level: "info" });
 
function log(level: "info" | "error" | "warn", msg: string, extra: Record<string, unknown> = {}) {
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;
  const spanId  = span?.spanContext().spanId;
  logger[level]({ traceId, spanId, ...extra }, msg);
}
 
// Usage inside a tool handler:
log("info", "Creating task", { project_id, title });

This gives you a traceId on every log line that matches the span in your tracing backend — Jaeger, Tempo, or any OTLP-compatible collector.
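To make the join concrete, here is the shape of a correlated record as pino would emit it (field values are hypothetical, and pino's numeric level 30 corresponds to "info"); finding every log line for a trace is then a plain equality filter on traceId:

```typescript
// Hypothetical trace-correlated log record, matching the log() helper above.
const record = {
  level: 30,                                      // pino numeric level for "info"
  time: 1700000000000,
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",    // illustrative IDs
  spanId: "00f067aa0ba902b7",
  project_id: "proj_42",                          // hypothetical extra field
  msg: "Creating task"
};

// Joining logs to a trace is an equality match on the traceId field:
const forTrace = (records: Array<{ traceId?: string }>, traceId: string) =>
  records.filter((r) => r.traceId === traceId);
```

Any log backend that indexes JSON fields (Loki, Elasticsearch, CloudWatch Logs Insights) can run the same filter server-side.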

Metrics: What to Measure

graph LR
  subgraph "Metrics to emit"
    M1["mcp.tool.calls.total\n(counter, labelled by tool name)"]
    M2["mcp.tool.duration_ms\n(histogram, labelled by tool name)"]
    M3["mcp.tool.errors.total\n(counter, labelled by tool + error type)"]
    M4["mcp.resource.reads.total\n(counter, labelled by resource URI prefix)"]
  end
import { metrics } from "@opentelemetry/api";
 
const meter = metrics.getMeter("taskflow-mcp");
 
const toolCallsCounter = meter.createCounter("mcp.tool.calls.total");
const toolDuration     = meter.createHistogram("mcp.tool.duration_ms", { unit: "ms" });
const toolErrors       = meter.createCounter("mcp.tool.errors.total");
 
// In your tool dispatch wrapper:
const start = Date.now();
toolCallsCounter.add(1, { tool: req.params.name });
try {
  return await dispatchTool(req.params.name, req.params.arguments);
} catch (err) {
  toolErrors.add(1, { tool: req.params.name, error: (err as Error).constructor.name });
  throw err;
} finally {
  // Record in finally so failed calls contribute to the latency histogram too
  toolDuration.record(Date.now() - start, { tool: req.params.name });
}

Session and User Attribution

For multi-tenant servers you need to know whose agent made the call. Pull the user's sub (subject) claim from the validated JWT and attach it to the span.

app.post("/mcp", async (req, res) => {
  const claims = await validateToken(req.headers.authorization);
  const parentContext = propagation.extract(context.active(), req.headers);
 
  await context.with(parentContext, async () => {
    const span = tracer.startSpan("mcp.request");
    span.setAttributes({
      "user.id":    claims.sub ?? "unknown",
      "client.id":  claims.azp ?? "unknown",  // authorized party
    });
    // ...
    span.end();
  });
});

Now every span in your tracing backend is annotated with the user and client identities, making incident investigation a search query rather than a log archaeology project.

Key Takeaways

  • The agent boundary is a distributed system boundary — treat it with the same tracing discipline as any microservice call.
  • W3C traceparent headers are the standard mechanism for propagating trace context into MCP servers.
  • OpenTelemetry's context propagation API handles context extraction and injection without manual plumbing.
  • Instrument at two levels: a root span per MCP request, and child spans per tool call, for granular latency visibility.
  • Correlate logs to traces by attaching traceId and spanId as structured log fields.
  • Emit counter and histogram metrics per tool name so dashboards and alerts can be tool-specific, not just server-wide.