Observability for MCP Servers: Tracing Across the Agent Boundary
The hardest debugging session I have had in the past year involved an agent that was silently failing about 3% of the time. The agent would call a tool, get an error response, and quietly decide to try a different approach — so from the user's perspective, the interaction still "worked," just slower and with a subtly wrong result. We had logs on the MCP server. We had logs on the application. We had no way to correlate them because nothing shared a request ID across the agent boundary.
That is the observability gap that MCP creates and that almost nobody talks about. The agent sits between your application and your MCP server. Every tool call crosses two boundaries: application → agent (usually an LLM API call), and agent → MCP server (a JSON-RPC call). Standard distributed tracing covers neither of those transitions by default.
This post builds a complete observability stack for MCP servers: trace context propagation through the agent boundary, structured log correlation, metrics collection, and cost attribution per tool call.
The Observability Gap Visualized
Before building anything, it helps to be precise about where observability breaks down.
The LLM API call from your application has a trace ID. The MCP tool call is a separate process with its own trace context. The downstream call from the MCP server to a database or external API has yet another context. These three chains are invisible to each other unless you explicitly wire them together.
The fix requires work on both sides: the application must inject trace context into tool call metadata, and the MCP server must extract and propagate it.
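Everything below also assumes both processes have an OpenTelemetry SDK initialized — without it, `trace.getActiveSpan()` returns nothing and `propagation.inject` is a no-op. A minimal Node bootstrap might look like this (the OTLP endpoint and service names are placeholders for your setup, not anything the post prescribes):

```typescript
// tracing.ts — import this before any other module, in both the application
// and the MCP server process. Endpoint and service name are placeholders.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "mcp-server", // use e.g. "agent-app" on the application side
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
});

sdk.start();

// Flush pending spans on shutdown so short-lived processes do not drop data
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```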
Propagating Trace Context Through Tool Calls
MCP tool calls support a _meta field in the request parameters. This is the right place to carry trace context — it is part of the protocol, it is ignored by handlers that do not read it, and it does not pollute the tool's actual input schema.
On the application side, when you are about to execute a tool call returned by the LLM, inject your current trace context:
// application/src/agent-loop.ts
import { trace, context, propagation } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
async function executeToolCall(
toolUse: Anthropic.ToolUseBlock,
mcpClient: McpClient
): Promise<string> {
const currentSpan = trace.getActiveSpan();
const traceHeaders: Record<string, string> = {};
// Extract W3C trace context into a plain object
propagation.inject(context.active(), traceHeaders);
const result = await mcpClient.callTool({
name: toolUse.name,
arguments: toolUse.input as Record<string, unknown>,
_meta: {
traceParent: traceHeaders["traceparent"],
traceState: traceHeaders["tracestate"],
agentRunId: currentSpan?.spanContext().traceId,
toolCallId: toolUse.id,
},
});
return JSON.stringify(result.content);
}

On the MCP server side, extract and restore that context at the top of every request handler:
// mcp-server/src/middleware/trace-context.ts
import { propagation, context, trace, SpanStatusCode } from "@opentelemetry/api";
import { CallToolRequest, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";
export function extractTraceContext(request: CallToolRequest) {
const meta = request.params._meta as
| {
traceParent?: string;
traceState?: string;
agentRunId?: string;
toolCallId?: string;
}
| undefined;
if (!meta?.traceParent) return context.active();
const carrier = {
traceparent: meta.traceParent,
tracestate: meta.traceState ?? "",
};
return propagation.extract(context.active(), carrier);
}
// Usage in your request handler
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const traceCtx = extractTraceContext(request);
return context.with(traceCtx, async () => {
const tracer = trace.getTracer("mcp-server");
return tracer.startActiveSpan(`tool/${request.params.name}`, async (span) => {
try {
span.setAttributes({
"mcp.tool.name": request.params.name,
"mcp.tool_call.id": (request.params._meta as any)?.toolCallId ?? "unknown",
"mcp.agent_run.id": (request.params._meta as any)?.agentRunId ?? "unknown",
});
const result = await dispatchTool(request.params.name, request.params.arguments);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
span.recordException(err as Error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw err;
} finally {
span.end();
}
});
});
});

Now every span on the MCP server is a child of the span in your application. Jaeger, Honeycomb, or Grafana Tempo will show the full chain as a single trace.
Structured Log Correlation
Traces are for timing and causality. Logs are for content — what data was in the request, what the handler decided, what the external API returned. The two systems are only useful together if they share an ID you can join on.
The pattern is to write structured logs that include trace_id and span_id as top-level fields. Most log aggregators (Loki, Elasticsearch, CloudWatch) can then join log events to trace spans.
// mcp-server/src/logger.ts
import { trace } from "@opentelemetry/api";
import pino from "pino";
const baseLogger = pino({
level: process.env.LOG_LEVEL ?? "info",
formatters: {
log(obj) {
const span = trace.getActiveSpan();
const ctx = span?.spanContext();
return {
...obj,
trace_id: ctx?.traceId ?? "none",
span_id: ctx?.spanId ?? "none",
trace_flags: ctx?.traceFlags ?? 0,
};
},
},
});
export function getLogger(tool: string) {
return baseLogger.child({ tool });
}

// mcp-server/src/handlers/query-orders.ts
import { getLogger } from "../logger";
const log = getLogger("query_orders");
export async function queryOrdersHandler(args: unknown) {
const { customerId, status } = parseArgs(args);
log.info({ customerId, status, event: "handler.start" }, "query_orders called");
try {
const orders = await db.query(
"SELECT * FROM orders WHERE customer_id = $1 AND ($2::text IS NULL OR status = $2)",
[customerId, status ?? null]
);
log.info(
{ customerId, count: orders.rows.length, event: "handler.success" },
"query_orders returned results"
);
return formatOrdersResult(orders.rows);
} catch (err) {
log.error(
{ customerId, err, event: "handler.error" },
"query_orders database error"
);
return {
isError: true,
content: [{ type: "text", text: "Database error — please retry" }],
};
}
}

Every log line now carries trace_id and span_id. In your log aggregator, you can filter trace_id = "abc123" and see exactly what happened inside the MCP server for a specific agent run, correlated with the application-side logs that share the same trace.
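With trace IDs as structured fields, correlation queries become one-liners. In Loki's LogQL, for instance (the stream label and trace ID here are assumptions about your ingestion setup, not fixed names):

```logql
{app="mcp-server"} | json | trace_id="abc123"
```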
Metrics: What to Actually Measure
Most teams instrument the obvious things (request count, error rate, latency) and miss the MCP-specific metrics that actually tell you whether your server is working correctly from an agent's perspective.
// mcp-server/src/metrics.ts
import { metrics } from "@opentelemetry/api";
const meter = metrics.getMeter("mcp-server");
export const toolCallCounter = meter.createCounter("mcp.tool.calls", {
description: "Number of tool calls by name and outcome",
});
export const toolCallDuration = meter.createHistogram("mcp.tool.duration_ms", {
description: "Tool call end-to-end duration in milliseconds",
unit: "ms",
advice: { explicitBucketBoundaries: [10, 50, 100, 250, 500, 1000, 2500, 5000] },
});
export const toolResultSize = meter.createHistogram("mcp.tool.result_bytes", {
description: "Size of tool result content in bytes",
unit: "By",
});
export const toolErrorRate = meter.createObservableGauge("mcp.tool.error_rate_1m", {
description: "1-minute rolling error rate per tool",
});
// Note: observable instruments report nothing until you register a callback,
// e.g. toolErrorRate.addCallback((result) => { /* observe per-tool rates */ });
// Call this around every handler invocation
export function recordToolCall(
tool: string,
outcome: "success" | "error" | "validation_error",
durationMs: number,
resultBytes: number
) {
toolCallCounter.add(1, { tool, outcome });
toolCallDuration.record(durationMs, { tool, outcome });
if (outcome === "success") {
toolResultSize.record(resultBytes, { tool });
}
}

The result_bytes metric is particularly important. If an LLM agent is calling a tool that returns 50KB of text for a simple query, that is a context budget problem. You want to see that in a dashboard before users start reporting that the agent is slow or hitting context limits.
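The "call this around every handler invocation" comment implies a wrapper so that timing and size measurement are not repeated in every handler. A sketch of one (the record function is injected to keep the wrapper testable; counting only text content is a choice that mirrors what actually lands in the model's context):

```typescript
// Sketch: wrap tool dispatch with duration and result-size measurement.
// `RecordFn` matches the recordToolCall signature from the metrics module.
type Outcome = "success" | "error" | "validation_error";
type RecordFn = (tool: string, outcome: Outcome, durationMs: number, resultBytes: number) => void;

interface ToolResult {
  isError?: boolean;
  content?: Array<{ type: string; text?: string }>;
}

// Bytes of text content — the part of the result that consumes context budget
export function resultBytes(result: ToolResult): number {
  const text = (result.content ?? [])
    .filter((c) => c.type === "text")
    .map((c) => c.text ?? "")
    .join("");
  return new TextEncoder().encode(text).length;
}

export async function withToolMetrics<T extends ToolResult>(
  tool: string,
  record: RecordFn,
  fn: () => Promise<T>
): Promise<T> {
  const start = performance.now();
  try {
    const result = await fn();
    record(tool, result.isError ? "error" : "success", performance.now() - start, resultBytes(result));
    return result;
  } catch (err) {
    // Thrown errors surface as JSON-RPC errors; size is unknown, record zero
    record(tool, "error", performance.now() - start, 0);
    throw err;
  }
}
```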
Cost Attribution Per Tool Call
LLM API costs are dominated by token counts. When your agent calls tools, the tool results are injected into the next request as context. A tool that returns verbose output is directly costing you money in subsequent LLM calls. You need to track this.
The pattern is to count tokens in tool results at the MCP server level and emit them as metrics or log fields:
// mcp-server/src/cost-attribution.ts
import { encode } from "gpt-tokenizer"; // GPT tokenizer; close enough for an approximate Claude count
interface CostAttribution {
toolName: string;
agentRunId: string;
inputTokens: number;
outputTokens: number;
estimatedOutputCostUsd: number;
}
// A tool's output re-enters the next LLM call as *input* tokens, so bill it
// at the input rate. Approximate Claude Sonnet input pricing; update when rates change.
const OUTPUT_COST_PER_1K_TOKENS = 0.003;
export function attributeToolCost(
toolName: string,
agentRunId: string,
inputArgs: unknown,
resultContent: Array<{ type: string; text?: string }>
): CostAttribution {
const inputText = JSON.stringify(inputArgs);
const outputText = resultContent
.filter((c) => c.type === "text")
.map((c) => c.text ?? "")
.join("");
const inputTokens = encode(inputText).length;
const outputTokens = encode(outputText).length;
const estimatedOutputCostUsd = (outputTokens / 1000) * OUTPUT_COST_PER_1K_TOKENS;
return { toolName, agentRunId, inputTokens, outputTokens, estimatedOutputCostUsd };
}

Log this attribution data as a structured event and aggregate it in your data warehouse. After a week of production traffic, you will know exactly which tools are the most expensive and whether that cost is proportional to the value they provide.
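In the warehouse this rollup is a SQL GROUP BY; an in-memory sketch over logged CostAttribution events shows the shape of the aggregation (the CostAttribution interface is the one defined above):

```typescript
// Sketch: roll up per-call cost events into per-tool totals.
interface CostAttribution {
  toolName: string;
  agentRunId: string;
  inputTokens: number;
  outputTokens: number;
  estimatedOutputCostUsd: number;
}

export function costByTool(
  events: CostAttribution[]
): Map<string, { calls: number; totalUsd: number }> {
  const totals = new Map<string, { calls: number; totalUsd: number }>();
  for (const e of events) {
    const entry = totals.get(e.toolName) ?? { calls: 0, totalUsd: 0 };
    entry.calls += 1;
    entry.totalUsd += e.estimatedOutputCostUsd;
    totals.set(e.toolName, entry);
  }
  return totals;
}
```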
Alerting on Agent-Specific Failure Modes
Standard HTTP alerting rules do not translate directly to MCP. The most important failure modes to alert on:
Silent error rate. MCP handlers return isError: true in the result content rather than HTTP 4xx/5xx. Your HTTP error rate metric will show 0% errors while the agent is silently getting failure responses. Build a metric that counts isError: true results separately from JSON-RPC errors.
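The classification has to look inside the result body, not at the transport status. A sketch (the outcome label names are a choice, not an MCP convention):

```typescript
// Distinguish silent tool failures (isError: true in the result body) from
// JSON-RPC protocol errors (which surface as thrown exceptions).
type WireOutcome = "success" | "silent_error" | "protocol_error";

export function classifyOutcome(
  result: { isError?: boolean } | null,
  thrown: boolean
): WireOutcome {
  if (thrown) return "protocol_error";
  return result?.isError ? "silent_error" : "success";
}

// Usage around dispatch, with the counter from the metrics module:
//   try {
//     const result = await dispatchTool(name, args);
//     toolCallCounter.add(1, { tool: name, outcome: classifyOutcome(result, false) });
//   } catch (err) {
//     toolCallCounter.add(1, { tool: name, outcome: classifyOutcome(null, true) });
//     throw err;
//   }
```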
Result size creep. Set an alert for when the P95 result_bytes for any tool exceeds 10KB. This almost always means the tool is returning too much context, which will eventually cause context limit errors.
Latency spikes correlated with LLM calls. If tool latency spikes at the same time as LLM API latency, that usually means you are calling the LLM from inside your MCP server — which you should not be. The MCP server should be fast data retrieval, not another inference loop.
# prometheus/alerts/mcp-server.yml
groups:
- name: mcp_server
rules:
- alert: ToolSilentErrorRateHigh
expr: |
rate(mcp_tool_calls_total{outcome="error"}[5m])
/ rate(mcp_tool_calls_total[5m]) > 0.01
for: 2m
labels:
severity: warning
annotations:
summary: "Tool {{ $labels.tool }} silent error rate > 1%"
- alert: ToolResultSizeP95High
expr: |
histogram_quantile(0.95, rate(mcp_tool_result_bytes_bucket[10m])) > 10240
for: 5m
labels:
severity: warning
annotations:
summary: "Tool {{ $labels.tool }} P95 result size > 10KB"

Key Takeaways
- The agent boundary is an observability gap by default — trace context must be explicitly injected into the `_meta` field of tool call requests and extracted on the MCP server side.
- Use W3C `traceparent` headers carried in `_meta` for propagation; this is framework-agnostic and compatible with all major tracing backends.
- Structured logs must include `trace_id` and `span_id` as top-level fields so they can be joined to spans in your tracing backend.
- Track `isError: true` results as a separate metric from JSON-RPC protocol errors — agents swallow silent tool failures, and your error rate dashboard will lie to you otherwise.
- Tool result size in bytes is one of the most important metrics to monitor — verbose tool outputs directly increase LLM inference costs in subsequent calls.
- Cost attribution at the MCP server level (input tokens, output tokens, estimated cost per tool call) is the only way to understand which tools are expensive relative to the value they provide.