
Observability for MCP Servers: Tracing Across the Agent Boundary

Ravinder · 10 min read
MCP · Observability · Tracing · AI

The hardest debugging session I have had in the past year involved an agent that was silently failing about 3% of the time. The agent would call a tool, get an error response, and quietly decide to try a different approach — so from the user's perspective, the interaction still "worked," just slower and with a subtly wrong result. We had logs on the MCP server. We had logs on the application. We had no way to correlate them because nothing shared a request ID across the agent boundary.

That is the observability gap that MCP creates and that almost nobody talks about. The agent sits between your application and your MCP server. Every tool call crosses two boundaries: application → agent (usually an LLM API call), and agent → MCP server (a JSON-RPC call). Standard distributed tracing covers neither of those transitions by default.

This post builds a complete observability stack for MCP servers: trace context propagation through the agent boundary, structured log correlation, metrics collection, and cost attribution per tool call.

The Observability Gap Visualized

Before building anything, it helps to be precise about where observability breaks down.

sequenceDiagram
    participant App as Application
    participant LLM as LLM API
    participant MCP as MCP Server
    participant DB as Database / External API
    App->>LLM: messages + tools list (trace ID: A)
    LLM->>App: assistant message with tool_use block
    App->>MCP: tools/call (no trace ID propagated)
    Note over MCP: New span starts here — no parent
    MCP->>DB: downstream call (orphaned trace)
    DB-->>MCP: response
    MCP-->>App: tool result
    App->>LLM: messages + tool result (trace ID: A — middle of the chain lost)

The LLM API call from your application has a trace ID. The MCP tool call is a separate process with its own trace context. The downstream call from the MCP server to a database or external API has yet another context. These three chains are invisible to each other unless you explicitly wire them together.

The fix requires work on both sides: the application must inject trace context into tool call metadata, and the MCP server must extract and propagate it.

Propagating Trace Context Through Tool Calls

MCP tool calls support a _meta field in the request parameters. This is the right place to carry trace context — it is part of the protocol, it is ignored by handlers that do not read it, and it does not pollute the tool's actual input schema.

On the application side, when you are about to execute a tool call returned by the LLM, inject your current trace context:

// application/src/agent-loop.ts
import { trace, context, propagation } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";
import { Client as McpClient } from "@modelcontextprotocol/sdk/client/index.js";
 
const client = new Anthropic();
 
async function executeToolCall(
  toolUse: Anthropic.ToolUseBlock,
  mcpClient: McpClient
): Promise<string> {
  const currentSpan = trace.getActiveSpan();
  const traceHeaders: Record<string, string> = {};
 
  // Extract W3C trace context into a plain object
  propagation.inject(context.active(), traceHeaders);
 
  const result = await mcpClient.callTool({
    name: toolUse.name,
    arguments: toolUse.input as Record<string, unknown>,
    _meta: {
      traceParent: traceHeaders["traceparent"],
      traceState: traceHeaders["tracestate"],
      agentRunId: currentSpan?.spanContext().traceId,
      toolCallId: toolUse.id,
    },
  });
 
  return JSON.stringify(result.content);
}

On the MCP server side, extract and restore that context at the top of every request handler:

// mcp-server/src/middleware/trace-context.ts
import { propagation, context, trace, SpanStatusCode } from "@opentelemetry/api";
import { CallToolRequest, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
 
export function extractTraceContext(request: CallToolRequest) {
  const meta = request.params._meta as
    | {
        traceParent?: string;
        traceState?: string;
        agentRunId?: string;
        toolCallId?: string;
      }
    | undefined;
 
  if (!meta?.traceParent) return context.active();
 
  const carrier = {
    traceparent: meta.traceParent,
    tracestate: meta.traceState ?? "",
  };
 
  return propagation.extract(context.active(), carrier);
}
 
// Usage in your request handler
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const traceCtx = extractTraceContext(request);
 
  return context.with(traceCtx, async () => {
    const tracer = trace.getTracer("mcp-server");
    return tracer.startActiveSpan(`tool/${request.params.name}`, async (span) => {
      try {
        span.setAttributes({
          "mcp.tool.name": request.params.name,
          "mcp.tool_call.id": (request.params._meta as any)?.toolCallId ?? "unknown",
          "mcp.agent_run.id": (request.params._meta as any)?.agentRunId ?? "unknown",
        });
 
        const result = await dispatchTool(request.params.name, request.params.arguments);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err) {
        span.recordException(err as Error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    });
  });
});

Now every span on the MCP server is a child of the span in your application. Jaeger, Honeycomb, or Grafana Tempo will show the full chain as a single trace.
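
All of this assumes an OpenTelemetry SDK is actually registered in the server process; the @opentelemetry/api calls above are no-ops without one. A minimal bootstrap sketch, assuming the standard Node SDK packages and an OTLP collector (the endpoint and the SIGTERM handling are assumptions; adjust for your backend):

// mcp-server/src/telemetry.ts (load this before any other module)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
 
const sdk = new NodeSDK({
  serviceName: "mcp-server",
  traceExporter: new OTLPTraceExporter({
    // Assumed collector endpoint; point this at your own collector or vendor backend
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
  }),
});
 
sdk.start();
 
// Flush pending spans and metrics on shutdown so the tail of a run is not lost
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});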

Structured Log Correlation

Traces are for timing and causality. Logs are for content — what data was in the request, what the handler decided, what the external API returned. The two systems are only useful together if they share an ID you can join on.

The pattern is to write structured logs that include trace_id and span_id as top-level fields. Most log aggregators (Loki, Elasticsearch, CloudWatch) can then join log events to trace spans.

// mcp-server/src/logger.ts
import { trace } from "@opentelemetry/api";
import pino from "pino";
 
const baseLogger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    log(obj) {
      const span = trace.getActiveSpan();
      const ctx = span?.spanContext();
      return {
        ...obj,
        trace_id: ctx?.traceId ?? "none",
        span_id: ctx?.spanId ?? "none",
        trace_flags: ctx?.traceFlags ?? 0,
      };
    },
  },
});
 
export function getLogger(tool: string) {
  return baseLogger.child({ tool });
}
 
// mcp-server/src/handlers/query-orders.ts
import { getLogger } from "../logger";
 
const log = getLogger("query_orders");
 
export async function queryOrdersHandler(args: unknown) {
  const { customerId, status } = parseArgs(args);
 
  log.info({ customerId, status, event: "handler.start" }, "query_orders called");
 
  try {
    const orders = await db.query(
      "SELECT * FROM orders WHERE customer_id = $1 AND ($2::text IS NULL OR status = $2)",
      [customerId, status ?? null]
    );
 
    log.info(
      { customerId, count: orders.rows.length, event: "handler.success" },
      "query_orders returned results"
    );
 
    return formatOrdersResult(orders.rows);
  } catch (err) {
    log.error(
      { customerId, err, event: "handler.error" },
      "query_orders database error"
    );
    return {
      isError: true,
      content: [{ type: "text", text: "Database error — please retry" }],
    };
  }
}

Every log line now carries trace_id and span_id. In your log aggregator, you can filter trace_id = "abc123" and see exactly what happened inside the MCP server for a specific agent run, correlated with the application-side logs that share the same trace.

Metrics: What to Actually Measure

Most teams instrument the obvious things (request count, error rate, latency) and miss the MCP-specific metrics that actually tell you whether your server is working correctly from an agent's perspective.

// mcp-server/src/metrics.ts
import { metrics } from "@opentelemetry/api";
 
const meter = metrics.getMeter("mcp-server");
 
export const toolCallCounter = meter.createCounter("mcp.tool.calls", {
  description: "Number of tool calls by name and outcome",
});
 
export const toolCallDuration = meter.createHistogram("mcp.tool.duration_ms", {
  description: "Tool call end-to-end duration in milliseconds",
  unit: "ms",
  advice: { explicitBucketBoundaries: [10, 50, 100, 250, 500, 1000, 2500, 5000] },
});
 
export const toolResultSize = meter.createHistogram("mcp.tool.result_bytes", {
  description: "Size of tool result content in bytes",
  unit: "By",
});
 
// Observable gauges only report through a registered callback; call
// toolErrorRate.addCallback(...) with a rolling-window calculation during startup.
export const toolErrorRate = meter.createObservableGauge("mcp.tool.error_rate_1m", {
  description: "1-minute rolling error rate per tool",
});
 
// Call this around every handler invocation
export function recordToolCall(
  tool: string,
  outcome: "success" | "error" | "validation_error",
  durationMs: number,
  resultBytes: number
) {
  toolCallCounter.add(1, { tool, outcome });
  toolCallDuration.record(durationMs, { tool, outcome });
  if (outcome === "success") {
    toolResultSize.record(resultBytes, { tool });
  }
}
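
To feed these instruments, you can wrap the dispatcher that the tracing handler already calls. A sketch, reusing the hypothetical dispatchTool from earlier (the import paths and the text-only byte count are assumptions):

// mcp-server/src/instrumented-dispatch.ts (sketch)
import { recordToolCall } from "./metrics";
import { dispatchTool } from "./dispatch"; // assumed location of the dispatcher shown earlier
 
export async function instrumentedDispatch(name: string, args: unknown) {
  const startedAt = performance.now();
  try {
    const result = await dispatchTool(name, args);
    const resultBytes = Buffer.byteLength(JSON.stringify(result.content ?? []), "utf8");
    // A result with isError: true counts as an error even though the call succeeded at the protocol level
    recordToolCall(name, result.isError ? "error" : "success", performance.now() - startedAt, resultBytes);
    return result;
  } catch (err) {
    // Thrown exceptions surface as JSON-RPC errors; record them with zero result size
    recordToolCall(name, "error", performance.now() - startedAt, 0);
    throw err;
  }
}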

The result_bytes metric is particularly important. If an LLM agent is calling a tool that returns 50KB of text for a simple query, that is a context budget problem. You want to see that in a dashboard before users start reporting that the agent is slow or hitting context limits.

Cost Attribution Per Tool Call

LLM API costs are dominated by token counts. When your agent calls tools, the tool results are injected into the next request as context. A tool that returns verbose output is directly costing you money in subsequent LLM calls. You need to track this.

The pattern is to count tokens in tool results at the MCP server level and emit them as metrics or log fields:

// mcp-server/src/cost-attribution.ts
import { encode } from "gpt-tokenizer"; // exact for GPT models; only an approximation for Claude
 
interface CostAttribution {
  toolName: string;
  agentRunId: string;
  inputTokens: number;
  outputTokens: number;
  estimatedOutputCostUsd: number;
}
 
// Approximate Claude Sonnet pricing — update when rates change
const OUTPUT_COST_PER_1K_TOKENS = 0.015;
 
export function attributeToolCost(
  toolName: string,
  agentRunId: string,
  inputArgs: unknown,
  resultContent: Array<{ type: string; text?: string }>
): CostAttribution {
  const inputText = JSON.stringify(inputArgs);
  const outputText = resultContent
    .filter((c) => c.type === "text")
    .map((c) => c.text ?? "")
    .join("");
 
  const inputTokens = encode(inputText).length;
  const outputTokens = encode(outputText).length;
  const estimatedOutputCostUsd = (outputTokens / 1000) * OUTPUT_COST_PER_1K_TOKENS;
 
  return { toolName, agentRunId, inputTokens, outputTokens, estimatedOutputCostUsd };
}

Log this attribution data as a structured event and aggregate it in your data warehouse. After a week of production traffic, you will know exactly which tools are the most expensive and whether that cost is proportional to the value they provide.
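
For example, one way to emit that event through the pino logger from earlier (a sketch; the event name and file path are assumptions, and trace_id/span_id are added by the logger's formatter):

// mcp-server/src/log-tool-cost.ts (sketch)
import { attributeToolCost } from "./cost-attribution";
import { getLogger } from "./logger";
 
const log = getLogger("cost_attribution");
 
export function logToolCost(
  toolName: string,
  agentRunId: string,
  args: unknown,
  content: Array<{ type: string; text?: string }>
) {
  const attribution = attributeToolCost(toolName, agentRunId, args, content);
  // One structured event per tool call; join on trace_id in the warehouse
  log.info({ event: "tool.cost_attribution", ...attribution }, "tool cost attribution");
}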

graph LR
    MCP[MCP Server] -->|structured logs with trace_id| Loki[Loki / CloudWatch]
    MCP -->|OTLP spans| Tempo[Grafana Tempo]
    MCP -->|OTLP metrics| Mimir[Prometheus / Mimir]
    Loki -->|join on trace_id| Dashboard[Grafana Dashboard]
    Tempo --> Dashboard
    Mimir --> Dashboard
    Dashboard -->|alert on error_rate > 1%| PagerDuty[PagerDuty]

Alerting on Agent-Specific Failure Modes

Standard HTTP alerting rules do not translate directly to MCP. The most important failure modes to alert on:

Silent error rate. MCP handlers return isError: true in the result content rather than HTTP 4xx/5xx. Your HTTP error rate metric will show 0% errors while the agent is silently getting failure responses. Build a metric that counts isError: true results separately from JSON-RPC errors.

Result size creep. Set an alert for when the P95 result_bytes for any tool exceeds 10KB. This almost always means the tool is returning too much context, which will eventually cause context limit errors.

Latency spikes correlated with LLM calls. If tool latency spikes at the same time as LLM API latency, that usually means you are calling the LLM from inside your MCP server — which you should not be. The MCP server should be fast data retrieval, not another inference loop.
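
For the silent error rate specifically, here is a sketch of counting isError: true results separately from thrown JSON-RPC errors (the counter name is an assumption; it complements the outcome label defined earlier):

// mcp-server/src/failure-metrics.ts (sketch)
import { metrics } from "@opentelemetry/api";
 
const meter = metrics.getMeter("mcp-server");
 
export const toolFailures = meter.createCounter("mcp.tool.failures", {
  description: "Tool failures by type: silent (isError result) vs protocol (thrown error)",
});
 
// Call with "silent" when a handler returns isError: true,
// and with "protocol" when the handler throws and the error surfaces as a JSON-RPC error.
export function recordFailure(tool: string, kind: "silent" | "protocol") {
  toolFailures.add(1, { tool, failure_type: kind });
}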

# prometheus/alerts/mcp-server.yml
groups:
  - name: mcp_server
    rules:
      - alert: ToolSilentErrorRateHigh
        expr: |
          sum by (tool) (rate(mcp_tool_calls_total{outcome="error"}[5m]))
          / sum by (tool) (rate(mcp_tool_calls_total[5m])) > 0.01
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool }} silent error rate > 1%"
 
      - alert: ToolResultSizeP95High
        expr: |
          histogram_quantile(0.95, sum by (le, tool) (rate(mcp_tool_result_bytes_bucket[10m]))) > 10240
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool }} P95 result size > 10KB"

Key Takeaways

  • The agent boundary is an observability gap by default — trace context must be explicitly injected into the _meta field of tool call requests and extracted on the MCP server side.
  • Use W3C traceparent headers carried in _meta for propagation; this is framework-agnostic and compatible with all major tracing backends.
  • Structured logs must include trace_id and span_id as top-level fields so they can be joined to spans in your tracing backend.
  • Track isError: true results as a separate metric from JSON-RPC protocol errors — agents swallow silent tool failures and your error rate dashboard will lie to you otherwise.
  • Tool result size in bytes is one of the most important metrics to monitor — verbose tool outputs directly increase LLM inference costs in subsequent calls.
  • Cost attribution at the MCP server level (input tokens, output tokens, estimated cost per tool call) is the only way to understand which tools are expensive relative to the value they provide.