Multi-MCP Architecture

MCP grew from a Claude-only feature into a Linux Foundation hosted project in roughly thirteen months — donated to the new Agentic AI Foundation in December 2025. The patterns that emerged around composing multiple servers are different from what the spec anticipated — and the hidden cost of connecting everything is real.

Brain Drip Editors · Many-small vs one-big · May 2026

MCP (Model Context Protocol): the JSON-RPC standard for connecting AI agents to external tools and data sources. Current spec revision 2025-11-25; hosted by the Linux Foundation's Agentic AI Foundation since December 2025.

The bottom line. By early 2026, MCP (now hosted by the Linux Foundation’s Agentic AI Foundation) was seeing on the order of 97 million monthly SDK downloads, with well over 10,000 servers in production across public registries and most usage shifting to remote endpoints. The architectural patterns that emerged: many small servers per domain beat one big monolithic server (Particula Tech’s MCP developer guide, corroborated by Anthropic’s MCP engineering writing); remote MCP via Streamable HTTP + OAuth 2.1 is the platform shape, not local stdio; and connecting every server you have costs you context tokens and planner accuracy. Even at a conservative ~800 tokens of schemas per server, ten servers burn ~8K tokens before your prompt even starts. The fix is lazy-loading by task class. The lab below lets you watch the token ledger move.

§ 00 · FROM LOCAL PROTOTYPE TO A STANDARD

~13 months from Claude-only to Linux Foundation

Anthropic introduced MCP in late 2024 as a way for Claude Desktop to talk to local tools through a JSON-RPC server. That’s the same shape almost every text-editor plugin had been using for years, lifted into the LLM era. It worked, and it spread fast.

On December 9, 2025 the protocol moved out of Anthropic’s repository and into the Linux Foundation’s new Agentic AI Foundation. By early 2026 it was seeing on the order of 97 million monthly downloads of MCP SDK packages and well over ten thousand servers in production across public registries. More important than any single count: most production MCP usage is now remote, not stdio. The protocol has outgrown its local-only prototype phase.

Plotted out, the chronology is tighter than the “industry standard overnight” framing suggests: four real milestones, about thirteen months end to end, with no “April 2026 spec” anywhere on the line.

Fig 1 · From a Claude-only prototype to Linux Foundation governance in ~13 months, Nov 2024 to Dec 2025. There is no April 2026 spec; the current revision is dated 2025-11-25.

The journey from local prototype to a Linux Foundation hosted project took about thirteen months and concluded in December 2025 — not eighteen months, and not via an April 2026 spec.

Two architectural questions matter for any team building on top of this. How do you carve up your servers? And how do you avoid drowning your agent in their combined tool inventory? The rest of this drip is the answers production teams converged on.

§ 01 · MANY SMALL SERVERS BEAT ONE BIG ONE

Microservices for agents

The first design instinct most teams have is to build one MCP server that exposes every tool the agent might need. CRM, billing, inventory, GitHub, Slack — all one server, 50 tools. It works in a demo. It doesn’t work in production.

Particula Tech’s MCP developer guide documents the failure mode, and the empirical evidence backs it up: agents get measurably worse at routing decisions as the tool inventory grows. In the RAG-MCP study, tool-selection accuracy fell from ~43% with a retrieved, relevant subset to ~14% once the full inventory was loaded — with the triage cost rising steeply past roughly 15-20 distinct tools. The model has to figure out which tool fits the current intent; the more tools in play, the more of that triage it has to do. Even Sonnet-class models start making sloppy calls — picking the wrong tool, mis-ordering arguments, invoking unrelated tools “just in case.”

The fix is microservices for agents. One server per business domain. Each independently deployable. The orchestrator composes them at the host. Concretely:

Draw the domains. Model your business the way you’d model microservices: by who owns what, what changes together, what fails together.
One domain, one server. CRM is its own server; billing is its own server; inventory is its own server. Each exposes 4-8 tools tightly scoped to its domain.
Compose at the host. The orchestrator (Claude Code, Claude Desktop, your own agent harness) connects to multiple servers and routes calls based on intent.
Deploy independently. Each server has its own release cycle, its own deploy schedule, its own on-call rotation. Failure isolation is the prize.

The benefits read like the standard microservices argument applied at a new layer — and that’s the right frame. MCP servers are the microservice abstraction for AI tooling. Treat them that way.
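
To make the carve-up concrete, here is a minimal Python sketch of a host-side registry for domain servers. The `DomainServer` type, the server names, URLs, and tool names are all hypothetical, and the 8-tool ceiling simply encodes the 4-8 guideline above; none of this is part of the MCP spec or any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainServer:
    """One MCP server per business domain (every name here is hypothetical)."""
    name: str
    url: str
    tools: tuple


def oversized(servers, max_tools=8):
    """Names of servers that expose more tools than the 4-8 guideline allows."""
    return [s.name for s in servers if len(s.tools) > max_tools]


def routing_table(servers):
    """Map each tool name to the single server that owns it."""
    table = {}
    for server in servers:
        for tool in server.tools:
            if tool in table:
                raise ValueError(
                    f"{tool!r} is exposed by both {table[tool]} and {server.name}"
                )
            table[tool] = server.name
    return table


SERVERS = [
    DomainServer("crm", "https://crm.example.com/mcp",
                 ("lookup_contact", "update_contact", "list_accounts", "log_activity")),
    DomainServer("billing", "https://billing.example.com/mcp",
                 ("get_invoice", "issue_refund", "list_payments", "update_plan")),
]

TABLE = routing_table(SERVERS)  # e.g. TABLE["issue_refund"] is "billing"
```

Rejecting duplicate tool names at startup is a design choice here, not a protocol requirement: it keeps ownership unambiguous, so the host never has to guess which domain a call belongs to.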

But splitting alone is a trap if you stop there. The lab below puts one 50-tool monolith next to the same capabilities split into ten domain servers — then lets you toggle lazy-loading and watch the token ledger and planner accuracy move.

Lab · Architecture decision: one 50-tool monolith vs. ten domain servers, and what lazy-loading actually buys you.

State shown (4 tools needed by the task, 160 tokens per tool schema, one monolithic server): 8.1K schema tokens loaded for 50 tools, 56% planner accuracy on routing decisions, 180K of the 200K window left as task budget. One monolithic server loads all 50 tool schemas on every call: maximum token cost, minimum routing accuracy, no matter how few tools the task needs. Token and accuracy figures are illustrative.

Splitting a monolith into domain servers does not by itself save tokens — flip lazy-load off in microservices mode and you pay for all 50 tools again. The win comes from lazy-loading only the domains a task needs, which simultaneously slashes schema tokens and restores planner accuracy.
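
The arithmetic behind that claim is short enough to write out. This sketch assumes ten hypothetical domain servers with five tools each and the lab's illustrative 160 tokens per tool schema, and it ignores any fixed per-server overhead.

```python
TOKENS_PER_TOOL = 160  # illustrative per-schema cost, taken from the lab

# Ten hypothetical domain servers, five tools each (50 tools in total).
DOMAINS = {
    f"domain_{i}": [f"domain_{i}.tool_{j}" for j in range(5)]
    for i in range(10)
}


def schema_tokens(loaded_domains):
    """Schema tokens paid when every tool of the given domains is loaded."""
    return sum(len(DOMAINS[d]) for d in loaded_domains) * TOKENS_PER_TOOL


def domains_for(needed_tools):
    """The domains that own at least one of the tools the task needs."""
    return sorted(
        d for d, tools in DOMAINS.items()
        if any(t in needed_tools for t in tools)
    )


# A task that needs four tools, spread over two domains.
needed = {"domain_0.tool_1", "domain_0.tool_3", "domain_4.tool_0", "domain_4.tool_2"}

monolith = 50 * TOKENS_PER_TOOL                    # 8000
split_eager = schema_tokens(DOMAINS)               # 8000: splitting alone saves nothing
split_lazy = schema_tokens(domains_for(needed))    # 1600: only two domains load
```

Under these assumptions the split only pays off once the host stops loading the eight domains the task never touches.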

§ 02 · REMOTE MCP — STREAMABLE HTTP + OAUTH

Stdio was the prototype. HTTP is the platform.

The MCP spec (revision 2025-11-25), now hosted by the Linux Foundation’s Agentic AI Foundation, normalizes two transports: stdio (the local original) and Streamable HTTP (the platform shape). Stdio was the right primitive for the demo era — your IDE spawns a subprocess, talks to it over stdin/stdout, kills it on shutdown. Streamable HTTP is the right primitive for production — your server runs in the cloud, the client speaks HTTPS + Server-Sent Events to it, and OAuth 2.1 with PKCE handles auth.

The migration is small in code, big in implications. Four moves:

Switch to HTTP transport. The SDK provides a single drop-in change from StdioServerTransport to StreamableHTTPServerTransport. The handler code is identical.
Add OAuth 2.1 with PKCE. Validate the bearer token on every request, against the right issuer and audience. RFC 8707 audience binding is the part most teams miss; without it, a token a user obtained for some other MCP server can be replayed against yours.
Ship to the edge. Cloudflare Workers, Vercel Functions, Supabase Edge Functions — anywhere with low cold-start latency and global presence. The model client wants the server fast and close.
Register with the client. Add the URL to the client’s MCP config; the OAuth flow handles the rest.
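
The audience check in the second move is the easiest one to get wrong, so here is a minimal sketch of the claim checks alone. It assumes the token's signature has already been verified by a JWT library of your choice and that you are holding the decoded claims as a dict; `claim_problems`, the issuer URL, and the server URLs are hypothetical names for illustration.

```python
import time
from typing import Optional

ISSUER = "https://auth.example.com"              # hypothetical authorization server
THIS_SERVER = "https://billing.example.com/mcp"  # hypothetical URL of this MCP server


def claim_problems(claims: dict, issuer: str, audience: str,
                   now: Optional[float] = None) -> list:
    """Return the reasons to reject a token; an empty list means the claims pass.

    Assumes the signature was verified before this function is called.
    """
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("unexpected issuer")
    # The JWT "aud" claim may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        problems.append("token was not issued for this server")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        problems.append("token expired or exp missing")
    return problems


# A token minted for a different MCP server, replayed against this one.
replayed = {"iss": ISSUER, "aud": "https://crm.example.com/mcp", "exp": 2_000}
verdict = claim_problems(replayed, ISSUER, THIS_SERVER, now=1_000)
# verdict == ["token was not issued for this server"]
```

The issuer and expiry are both fine in that replayed token; only the audience comparison catches it, which is why skipping that check leaves the replay path open.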

Once the server is remote, the “works on my laptop” discount goes away. Authentication is real. Rate limiting is real. Per-user data isolation is real. RLS-shaped policies in the database become the actual security boundary, with the OAuth-extracted auth.uid() as the discriminator. The work is the work, but the payoff is that your MCP server runs in the same place your team’s other production services do.

§ 03 · THE HIDDEN COST OF MULTI-MCP

More MCPs. Worse agents.

Here’s the counterintuitive result. Once you have the many-small server pattern in place and remote MCP working, the next failure mode is the one nobody warned you about: connecting more servers makes the agent worse.

Each MCP server wires its tool descriptions into every model call. Even at a conservative ~800 tokens of schemas per server, ten servers mean roughly 8,000 tokens of tool inventory loaded before your prompt even starts — and real servers run far heavier: Anthropic measured ~55K tokens for a five-server setup, and GitHub’s MCP server alone is around 42K tokens of definitions. The model has less budget for the actual task. And worse — even with a long context window, large tool inventories degrade planner accuracy regardless of whether you have token budget left.

Four things to put in place:

The token ledger. Count what each MCP costs you per call. Most clients don’t surface this number; measure it yourself by diffing call sizes with vs without each server connected.
Understand why it tanks accuracy. The planner has to triage tools by intent on every turn. A 50-tool inventory is a 50-element ranking problem. Even with tool descriptions perfectly written, the planner spends attention on triage that should go to the actual task.
Lazy-load servers based on task classification. Classify each incoming request into a task class; load only the servers relevant to that class. A support ticket doesn’t need warehouse SQL access; loading the warehouse server’s schemas only hurts.
Trace per server. Track which servers each task class actually invokes. Drop the dead weight monthly — servers connected but never called are pure overhead.
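
The ledger and the trace are both a few lines of bookkeeping. This is a rough sketch, assuming you can already read a prompt-token count for a request from your client or provider; the function names, server names, and every number below are made up for illustration.

```python
def ledger_from_diffs(baseline_tokens, with_server):
    """Per-server cost: request size with that server connected, minus the baseline."""
    return {name: total - baseline_tokens for name, total in with_server.items()}


def dead_weight(connected, calls):
    """Connected servers that no traced call ever invoked.

    `calls` is a list of (task_class, server) pairs collected from traces.
    """
    used = {server for _, server in calls}
    return [name for name in connected if name not in used]


# Hypothetical measurements: the same request sent with no servers connected,
# then with exactly one server connected at a time.
BASELINE = 1_200
MEASURED = {"crm": 2_050, "billing": 1_980, "warehouse": 4_300}

LEDGER = ledger_from_diffs(BASELINE, MEASURED)
# LEDGER == {"crm": 850, "billing": 780, "warehouse": 3100}

# Hypothetical month of traces: which server each task class actually called.
TRACES = [("support", "crm"), ("billing", "billing"), ("support", "crm")]
UNUSED = dead_weight(list(MEASURED), TRACES)
# UNUSED == ["warehouse"]
```

In this made-up month the most expensive server is also the one nobody called, which is exactly the kind of entry the monthly cleanup is meant to surface.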

The retrieval-based version of this fix is what the RAG-MCP paper measured directly. Instead of loading every tool, retrieve only the top-k relevant ones per query. Sweep the slider below from “load all” down to a tight top-k and watch the published endpoints emerge.

Lab · Retrieve vs. load-all: sweep the retriever from “load all 30 tools” down to top-k and watch tokens fall and selection accuracy climb.

State shown (query with 3 truly relevant tools, retriever at top 3): 2.3K prompt tokens for 3 tool schemas, 58.1% tool-selection accuracy on the retrieved subset. Drag to load all and accuracy collapses to the RAG-MCP load-all baseline (13.6%) while the relevant tools drown in noise; retrieve just the top few and accuracy jumps past the paper’s ~43% retrieved-subset figure while prompt tokens more than halve. Endpoints anchored on RAG-MCP (arXiv:2505.03275); intermediate points are illustrative.

Retrieving only the top-k relevant tools per query is the same insight as lazy-loading, generalized: smaller, relevant tool sets more than triple selection accuracy (RAG-MCP: ~14% → ~43%) while cutting prompt tokens by half-plus.
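
The mechanics of top-k selection fit in a short sketch. One loud assumption: the scoring here is plain word overlap between the query and each tool description, a deliberately crude stand-in for the semantic retriever a real system would use, and the tool names and descriptions are invented.

```python
import re


def words(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_tools(query, tools, k):
    """Rank tools by word overlap with the query and keep at most k of them."""
    q = words(query)
    scored = [(len(q & words(desc)), name) for name, desc in tools.items()]
    # Highest overlap first; ties break alphabetically so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical tool inventory: name -> description.
TOOLS = {
    "get_invoice": "Fetch an invoice by id for a customer",
    "issue_refund": "Refund a payment on a customer invoice",
    "run_warehouse_query": "Run a SQL query against the warehouse",
    "open_pull_request": "Open a pull request on a repository",
    "post_message": "Post a message to a chat channel",
}

selected = top_k_tools("refund the customer for invoice 1042", TOOLS, k=2)
# selected == ["get_invoice", "issue_refund"]
```

Only the schemas of the selected tools would then go into the prompt. Note that the warehouse tool still scores a point because its description shares the word "the" with the query, which is the kind of noise a semantic retriever is meant to avoid.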

§ 04 · LAZY-LOAD BY TASK CLASS

Classify first. Load second. Trace always.

The lab below makes the dynamics concrete. Toggle ten servers on or off, pick the current task class (support / data-ops / billing / code-review), and switch lazy-load on. The token ledger updates, the task budget moves, and the synthetic planner accuracy responds to tool count.

Lab · MCP token ledger: toggle servers, pick the task, switch on lazy-load, and see what each MCP costs you per call.

State shown (10 servers connected, lazy-load off): 9.0K tokens of tool schemas loaded, 179K of the 200K window left as task budget, 84% planner accuracy on routing decisions. With lazy-load off, every connected server's tool descriptions load into the context regardless of whether the current task needs them. Planner accuracy degrades with tool count, even before the token budget runs out.

Two things to feel in the lab. Without lazy-load, every connected server costs you regardless of relevance — the warehouse server is loaded even for a support ticket where nothing in it matters. With lazy-load on, the same set of servers can stay connected without paying their full cost on every turn; the agent only sees the slice it needs.

Implementation-wise, the classifier doesn’t have to be sophisticated. A cheap Haiku-class call (“classify this request into one of: support / data-ops / billing / code-review”) can route requests into a handful of task classes at high accuracy for a fraction of a cent per call — typically well under the token savings it unlocks on the main model call.
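
A classifier along those lines might look like the sketch below. The `call_model` parameter is a hypothetical stand-in for whatever cheap model client you use (it takes a prompt string and returns the reply text), and the prompt wording is an assumption, not a tested template.

```python
from typing import Callable, Optional

TASK_CLASSES = ("support", "data-ops", "billing", "code-review")


def build_prompt(request: str) -> str:
    """Prompt for a cheap classifier model; the wording is an assumption."""
    options = " / ".join(TASK_CLASSES)
    return (
        f"Classify this request into one of: {options}.\n"
        "Reply with the label only.\n\n"
        f"Request: {request}"
    )


def classify(request: str, call_model: Callable[[str], str]) -> Optional[str]:
    """Return a known task class, or None when the reply is not a valid label."""
    reply = call_model(build_prompt(request)).strip().lower()
    return reply if reply in TASK_CLASSES else None


def fake_model(prompt: str) -> str:
    """Stub standing in for the real model call so the sketch runs offline."""
    return " Billing\n"


label = classify("I was charged twice this month", fake_model)
# label == "billing"
```

Returning `None` for anything outside the known labels lets the caller fall back to loading every server, so a confused classifier costs tokens rather than hiding a tool the task needed.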

§ 05 · A REFERENCE TOPOLOGY

How a production multi-MCP stack looks

Diagram: classifier → host → lazy-loaded MCP servers

Three things to take away from the topology. The classifier runs before the planner; the planner runs after the relevant servers are loaded. The host is doing real work — it knows about your task taxonomy, your server registry, and the mapping between them. The MCP servers themselves are dumb — they expose tools, validate OAuth, and execute. Smartness lives one layer up.
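
The ordering in that topology can be pinned down in a few lines. This is a sketch under assumptions: the registry contents, the class-to-server mapping, and the `classify` and `plan` callables are all hypothetical placeholders for your own taxonomy, classifier, and planner.

```python
# Server registry: name -> URL (hypothetical entries).
SERVER_REGISTRY = {
    "crm": "https://crm.example.com/mcp",
    "helpdesk": "https://helpdesk.example.com/mcp",
    "billing": "https://billing.example.com/mcp",
    "warehouse": "https://warehouse.example.com/mcp",
    "github": "https://github.example.com/mcp",
}

# Task taxonomy mapped onto the registry (hypothetical mapping).
SERVERS_FOR_CLASS = {
    "support": ["crm", "helpdesk"],
    "billing": ["billing", "crm"],
    "data-ops": ["warehouse"],
    "code-review": ["github"],
}


def handle(request, classify, plan):
    """Host loop: classify first, load second, plan last."""
    task_class = classify(request)                      # 1. classifier runs before the planner
    names = SERVERS_FOR_CLASS.get(task_class) or sorted(SERVER_REGISTRY)
    loaded = {name: SERVER_REGISTRY[name] for name in names}  # 2. host picks the servers
    return plan(request, loaded)                        # 3. planner sees only that slice


# Stub planner that just reports which servers it was shown.
seen = handle("reset my password", lambda r: "support", lambda r, loaded: sorted(loaded))
# seen == ["crm", "helpdesk"]
```

An unrecognized task class falls back to the full registry here, trading tokens for safety; the servers themselves never appear in this logic beyond a name and a URL.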

CHECK · Your agent has 8 MCP servers connected. The average user request needs tools from only 2 of them. Without changing anything else, which intervention has the highest expected impact on production quality?

§ · FURTHER READING

References & deeper sources

Tiantian Gan & Qiyao Sun (2025). RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation · arXiv:2505.03275
Anthropic (2025). Code execution with MCP: building more efficient AI agents · Anthropic Engineering
Nelson F. Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts · arXiv:2307.03172 (TACL 2024)
Anthropic (2024). Introducing the Model Context Protocol · Anthropic News
Linux Foundation (2025). Linux Foundation Announces the Formation of the Agentic AI Foundation · Linux Foundation Press
Model Context Protocol (2025). Specification 2025-11-25 — Changelog (transports + authorization) · modelcontextprotocol.io
Particula Tech (2026). MCP Developer Guide: Build Servers, Connect Tools, Ship Agents · Particula Engineering
Cloudflare (2025). Build and deploy Remote Model Context Protocol (MCP) servers to Cloudflare · Cloudflare Blog
Brain Drip Editors (2026). Blueprint: Build a Shared-Skills MCP Server (Supabase + OAuth 2.1) · Brain Drip Blueprints

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.