The Model Context Protocol is elegant. It gives you a clean abstraction for tool discovery, invocation, and response — a typed contract between AI agents and the systems they interact with. If you've read the specification, you'll appreciate the design: minimal, composable, and deliberately unopinionated about what sits above it.
That last part is where things get interesting.
A protocol spec tells you what the messages look like. It doesn't tell you what happens when a tool returns 40,000 tokens into a 128k context window, or when the downstream service is down, or when your compliance team wants to know exactly what the agent did and why it did it.
This isn't a criticism. HTTP didn't tell you how to build resilient microservices either. The gap between protocol and production is where architecture lives — and with MCP, that gap is significant enough to be worth writing about.
I've seen this from both sides. As a maintainer of the MCP TypeScript SDK, I've helped shape the protocol. As an enterprise architect, I've had to build the production layers around it. This article is about those layers: four problems the specification stays intentionally silent on, and what I've learned about solving each one.
Token budgeting
MCP tools return data. LLMs have finite context windows. Nothing in the protocol manages the relationship between the two.
This sounds obvious, but the failure mode is subtle. A tool that returns a full order history, a product catalogue, or a customer record can consume your token budget in a single call — crowding out the reasoning space the model needs to actually use the data. The model doesn't throw an error. It degrades. It starts ignoring parts of the response, loses track of earlier context, or hallucinates rather than admitting it can't see the full picture.
In practice, this means building middleware that wraps tool responses before they reach the model. The simplest version looks something like this:
```typescript
async function withTokenBudget(
  toolCall: ToolCall,
  budget: number,
  execute: (call: ToolCall) => Promise<ToolResponse>
): Promise<ToolResponse> {
  const response = await execute(toolCall);
  const tokenCount = estimateTokens(response.content);
  if (tokenCount <= budget) return response;
  return {
    ...response,
    content: truncateWithSummary(response.content, budget),
    metadata: {
      truncated: true,
      originalTokens: tokenCount,
      returnedTokens: budget,
      totalResults: response.metadata?.totalResults,
    },
  };
}
```

The pattern extends beyond simple truncation. Tools themselves should be designed with pagination in mind — accepting offset and limit parameters so the agent can request more data if the first page isn't sufficient. Response budgets at the orchestration layer let you set a maximum token allocation per tool call and enforce it before anything reaches the model.
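The two helpers the middleware relies on aren't defined by the protocol — they're yours to supply. As a minimal sketch, assuming the common rough heuristic of about four characters per token (a real deployment would use the model's own tokenizer):

```typescript
// Hypothetical helpers for the budgeting middleware. The 4-chars-per-token
// ratio is a rough heuristic, not a real tokenizer.
function estimateTokens(content: string): number {
  return Math.ceil(content.length / 4);
}

// Truncate to the budget, reserving room for a notice that tells the
// model the response is incomplete and how to get the rest.
function truncateWithSummary(content: string, budgetTokens: number): string {
  const notice =
    "\n[Response truncated to fit the token budget. " +
    "Request a smaller page (e.g. a lower limit) for the rest.]";
  const budgetChars = budgetTokens * 4 - notice.length;
  if (content.length <= budgetChars) return content;
  return content.slice(0, Math.max(0, budgetChars)) + notice;
}
```

The notice matters as much as the cut: an explicitly flagged truncation gives the model a reason to paginate instead of reasoning over silently missing data.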
The trade-off is real: too little context and the model can't reason accurately; too much and it drowns. There's no universal answer. But having the mechanism to control it — and the instrumentation to see when you're getting it wrong — is the baseline.
Why doesn't the spec handle this? Because it shouldn't. MCP defines the shape of tool responses, not their size. That's the right design decision for an interoperable protocol — different models have different context windows, different use cases have different tolerance for truncation, and the “right” budget depends on what else is in the conversation. But it means every production deployment needs to own this layer. Most teams don't realise it until they're debugging hallucinations that trace back to “the tool returned too much data.”
Graceful tool failure
In production, tools fail. The downstream API times out. The service returns a 500. The data comes back malformed. Authentication expires mid-session.
MCP defines error responses — the protocol has a way to signal that a tool call didn't succeed. But the spec doesn't prescribe what the agent should do when that happens. Retry? Fall back to a different tool? Ask the user for guidance? Abort the workflow? That's your problem.
In a demo, you skip this. In production, this is most of what you build.
The answer is failure policies at the orchestration layer. Not every failure is the same, and the right response depends on context. An agent summarising data can work with partial results. An agent placing an order cannot. You need to express that distinction somewhere, and the protocol isn't the place for it.
Patterns that work: retry with backoff, but with a budget — you can't retry indefinitely when a user is waiting. Fallback tool chains, where a degraded alternative exists (a cached dataset, a simpler query). And critically, signalling failure to the model in a way that lets it adjust its plan rather than hallucinate an answer.
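The "retry with a budget" pattern can be sketched in a few lines. This is illustrative, not from the SDK — the function name and defaults are assumptions:

```typescript
// Retry a tool call with exponential backoff, but give up once a total
// latency budget is spent -- a user is waiting on the other end.
async function retryWithBudget<T>(
  attempt: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 200, budgetMs = 2000 } = {}
): Promise<T> {
  const deadline = Date.now() + budgetMs;
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** i; // 200ms, 400ms, 800ms, ...
      if (Date.now() + delay > deadline) break; // budget exhausted
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

The deadline check is the part demos skip: without it, three "reasonable" retries against a slow upstream can stack into a ten-second wait.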
That last point matters more than people expect. When a tool fails, the model needs enough information to reason about what to do next. A generic error message isn't sufficient. Something like this is closer to useful:
```typescript
interface ToolFailureResponse {
  status: "error";
  errorType: "timeout" | "upstream_error" | "auth_expired" | "malformed";
  canRetry: boolean;
  retryAfterMs?: number;
  fallbackAvailable: boolean;
  fallbackToolName?: string;
  userMessage: string; // What to tell the end user if recovery fails
}
```

The model can now decide: retry if canRetry is true and the budget allows it, try the fallback tool if one exists, or inform the user and stop. Without this structure, the model guesses — and guessing is where production agents break down.
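That decision can also live in the orchestrator rather than the model's head. A sketch, assuming the failure shape above (the policy ordering here is one reasonable choice, not the only one):

```typescript
// Assumes the ToolFailureResponse shape from the article.
interface ToolFailureResponse {
  status: "error";
  errorType: "timeout" | "upstream_error" | "auth_expired" | "malformed";
  canRetry: boolean;
  retryAfterMs?: number;
  fallbackAvailable: boolean;
  fallbackToolName?: string;
  userMessage: string;
}

type NextAction =
  | { kind: "retry"; afterMs: number }
  | { kind: "fallback"; tool: string }
  | { kind: "inform_user"; message: string };

// Illustrative policy: retry while the budget allows, then fall back,
// then surface the failure to the user instead of letting the model guess.
function decideNextAction(
  failure: ToolFailureResponse,
  retriesRemaining: number
): NextAction {
  if (failure.canRetry && retriesRemaining > 0) {
    return { kind: "retry", afterMs: failure.retryAfterMs ?? 0 };
  }
  if (failure.fallbackAvailable && failure.fallbackToolName) {
    return { kind: "fallback", tool: failure.fallbackToolName };
  }
  return { kind: "inform_user", message: failure.userMessage };
}
```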
The parallel to microservices is direct. The early HTTP specs didn't include circuit breakers or bulkheads. Those patterns emerged from production pain. The same is happening with MCP. The spec gives you the building blocks for error signalling. The resilience strategy is yours to design — and right now, you're designing it from first principles.
Observability and tracing
An AI agent makes a series of tool calls, receives responses, reasons over them, and produces output. When something goes wrong — or when an auditor asks “why did the agent do that?” — you need a trace.
Not a log. A trace. You need to answer: which tools were called, with what parameters, what came back, what the model did with that information, and why it chose one path over another. In a multi-turn agent session, this is a complex causal chain spread across multiple tool calls, model inferences, and decision points.
The good news: this isn't a new discipline. If you've built observability for distributed systems — correlation IDs, structured logging, trace propagation — you already know the patterns. Agent observability is an extension of existing practice, not a replacement for it.
A well-instrumented agent session produces structured traces that look something like this:
```json
{
  "sessionId": "sess_8f2a91c",
  "turn": 3,
  "toolCall": {
    "name": "getOrderHistory",
    "parameters": { "customerId": "cust_441", "limit": 20 },
    "latencyMs": 340,
    "tokenCount": 1847,
    "status": "success"
  },
  "modelReasoning": "Customer asked about recent orders. Retrieved last 20 orders to identify the one they're referring to.",
  "budgetRemaining": {
    "tokensUsed": 4210,
    "tokensAvailable": 11790,
    "toolCallsThisTurn": 2
  }
}
```

Correlation IDs tie tool calls together within a session. The model's reasoning — captured alongside the tool interactions, not just the inputs and outputs — lets you reconstruct the decision chain after the fact. Dashboards and alerts answer the operational question: what does “healthy” look like for an agent workflow? Latency per tool call, error rates, token usage per session, and cost per interaction are the starting metrics.
MCP's structured format actually makes this easier than most people expect. The protocol already gives you typed tool calls and typed responses with clear boundaries. You're not parsing unstructured logs — you're instrumenting a well-defined request/response chain. The raw material is good.
The hard part is the orchestration-layer instrumentation: correlating tool calls across turns, capturing the model's intermediate reasoning, and making all of this queryable at scale. That's architecture work, not protocol work — and it's the layer that turns “the agent works” into “the business trusts the agent.”
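The wrapper that produces records like the one above is straightforward to sketch. Everything here is illustrative — the record shape mirrors the example, and `emit` stands in for whatever trace sink you use (stdout, OpenTelemetry, a log pipeline):

```typescript
// Hypothetical trace record, mirroring the example trace above.
interface ToolTrace {
  sessionId: string;
  turn: number;
  toolCall: {
    name: string;
    parameters: Record<string, unknown>;
    latencyMs: number;
    status: "success" | "error";
  };
}

// Wrap a tool call so every invocation -- success or failure -- emits a
// trace record correlated to the session and turn.
async function traced<T>(
  sessionId: string,
  turn: number,
  name: string,
  parameters: Record<string, unknown>,
  run: () => Promise<T>,
  emit: (t: ToolTrace) => void
): Promise<T> {
  const start = Date.now();
  const record = (status: "success" | "error") =>
    emit({
      sessionId,
      turn,
      toolCall: { name, parameters, latencyMs: Date.now() - start, status },
    });
  try {
    const result = await run();
    record("success");
    return result;
  } catch (err) {
    record("error");
    throw err; // trace it, but let the failure policy handle it
  }
}
```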
Governance: who decides what the agent can do?
In enterprise environments, you cannot ship an AI agent that can call any tool with any parameters. Full stop.
You need policy boundaries. Which tools are available to which agents? Under what conditions can an agent execute a high-impact action — placing an order, modifying a record, sending a message on behalf of a user? Who approves the agent's tool access, and how is that approval recorded? What does the audit trail look like when compliance asks?
The spec doesn't answer these questions. Nor should it — governance is context-dependent. A healthcare deployment and a commerce deployment need fundamentally different policy boundaries. But every production deployment needs some answer.
The pattern that works is policy-as-code: governance rules defined in configuration, not buried in application logic, so they can be reviewed, versioned, and changed independently of the agent's behaviour.
```yaml
agentPolicies:
  customerServiceAgent:
    allowedTools:
      - getOrderHistory
      - getCustomerProfile
      - createSupportTicket
    deniedTools:
      - modifyBillingRecord
      - issueRefund
    constraints:
      createSupportTicket:
        maxPerSession: 3
        requiresFields: ["category", "description"]
    approvalGates:
      issueRefund:
        required: true
        approver: "human_supervisor"
        maxAmount: 50000  # in smallest currency unit
    auditLevel: full  # log every tool call with parameters and response
```

Tool whitelisting per agent role is the baseline — an agent handling customer queries doesn't need access to the billing API. Parameter validation at the orchestration layer constrains not just which tools, but what inputs they accept in a given context. Human-in-the-loop gates for high-impact actions let the agent propose and a human approve. And audit logging that captures every tool call, every parameter, and every response — with timestamps and session context — is what satisfies the compliance team when they come asking.
The model can call the tool. The question is whether it should — and who decided.
MCP enables governance. Structured tool definitions mean you can build policy checks against a known schema. Every tool has a typed contract — name, parameters, expected response — which makes whitelisting, parameter validation, and audit logging tractable. That's a significant advantage over unstructured API calls.
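The enforcement side of that configuration is a small, testable function. A sketch, assuming an in-memory form of the policy YAML above (the shapes and names are illustrative):

```typescript
// Illustrative in-memory form of one agent's policy from the YAML above.
interface AgentPolicy {
  allowedTools: string[];
  deniedTools: string[];
  constraints?: Record<string, { requiresFields?: string[] }>;
}

// Check a proposed tool call against policy before anything executes.
// Denials take precedence over the whitelist.
function checkPolicy(
  policy: AgentPolicy,
  tool: string,
  parameters: Record<string, unknown>
): { allowed: boolean; reason?: string } {
  if (policy.deniedTools.includes(tool)) {
    return { allowed: false, reason: `tool ${tool} is explicitly denied` };
  }
  if (!policy.allowedTools.includes(tool)) {
    return { allowed: false, reason: `tool ${tool} is not whitelisted` };
  }
  for (const field of policy.constraints?.[tool]?.requiresFields ?? []) {
    if (parameters[field] === undefined) {
      return { allowed: false, reason: `missing required field: ${field}` };
    }
  }
  return { allowed: true };
}
```

Because every MCP tool has a typed contract, the `parameters` object this function inspects is already structured — which is exactly the advantage over policing free-form API calls.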
But the protocol doesn't prescribe governance, and that's the right decision. The foundation is there. The architecture is yours.
The spec is the foundation, not the building
MCP gives you a solid protocol for agent-tool interaction. But a protocol is infrastructure, not architecture. Token budgets, failure resilience, observability, governance — these are architecture decisions that depend on your business context, your risk tolerance, and your team's capability.
The spec is intentionally silent on these problems. That's good protocol design. It's also why the architecture layer above the protocol is where most of the production work lives.
As MCP adoption grows, some of these patterns will get codified into shared libraries, reference architectures, and possibly protocol extensions. But right now, if you're taking MCP to production, these are the problems you'll solve yourself. And solving them well is the difference between an AI demo and an AI system your business can actually trust.