Agent Design with MCP and KV Cache
How agentic workflows — long contexts, tool loops, multi-turn dialogue, MCP server fan-out — exploit KV cache reuse to cut latency and cost. Architecture, prompt layout, branching, and cross-request prefix sharing.
1. Why Cache Matters for Agents
An agent is not a single LLM call — it is a loop of calls. Each iteration appends a model response, tool invocation, or tool result to the context, then re-invokes the model. The prefix grows monotonically, and most of it is identical from one call to the next.
Naively, every iteration re-encodes the entire prompt, paying full prefill cost on tokens it processed seconds earlier. With KV cache reuse, the inference server skips re-encoding the unchanged prefix and only processes the new suffix.
2. MCP in 60 Seconds
Model Context Protocol (MCP) is an open standard for connecting LLM applications to external tools, data, and prompts. An MCP server exposes:
- Tools — callable functions with JSON schemas
- Resources — readable content (files, DB rows, URLs)
- Prompts — parameterized prompt templates
The agent host runs an MCP client that discovers servers, lists their tools, and forwards tool invocations. From the LLM's perspective, MCP tools look identical to native tool definitions — they are just merged into the model's tool list at request time.
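Concretely, a tool as returned by an MCP server's tools/list call carries a name, a description, and a JSON Schema for its arguments; mapping it onto an Anthropic-style native tool definition is little more than a field rename. (The weather tool below is a made-up example.)

```python
# A made-up MCP tool, in the shape a server returns from tools/list:
# a name, a human-readable description, and a JSON Schema for arguments.
mcp_tool = {
    "name": "get_forecast",
    "description": "Fetch the weather forecast for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Merging it into an Anthropic-style tool list is a field rename away:
native_tool = {
    "name": mcp_tool["name"],
    "description": mcp_tool["description"],
    "input_schema": mcp_tool["inputSchema"],
}
```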
3. Anatomy of an Agent Prompt
Every request an agent sends to the model has a layered structure. Understanding which layers are stable vs volatile is the key to designing for cache reuse.
| Layer | Volatility | Cache strategy |
|---|---|---|
| System prompt | Static | Cache forever, share across users |
| Tool definitions | Static (per agent) | Cache; invalidate on tool registry change |
| Few-shot examples | Static | Cache as part of stable prefix |
| Retrieved context (RAG) | Per-query | Generally not worth caching |
| Conversation history | Append-only | Cache up to turn N-1; reuse next turn |
| New turn | Always fresh | Always prefilled |
4. The Cache Opportunity
Consider the same agent making three sequential model calls. Without a prefix cache, every call pays full prefill on the entire prompt. With one, only the new suffix is prefilled each time.
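A back-of-the-envelope sketch with made-up numbers (the 2,000-token stable prefix and ~300-token deltas are assumptions for illustration):

```python
# Illustrative only: tokens prefilled across three agent calls,
# assuming a 2,000-token stable prefix and ~300 new tokens per call.
PREFIX = 2_000                      # system prompt + tools + few-shots
deltas = [300, 350, 280]            # new suffix per call (turn, tool results)

# Without a cache, call i re-encodes the prefix plus all suffixes so far.
no_cache = sum(PREFIX + sum(deltas[: i + 1]) for i in range(len(deltas)))
# With one, the prefix is encoded once; each call prefills only its suffix.
with_cache = PREFIX + sum(deltas)

print(f"prefill without cache: {no_cache} tokens")   # 7880
print(f"prefill with cache:    {with_cache} tokens") # 2930
```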
5. Reference Architecture
A KV-cache-aware agent system has three planes:
- Agent plane — runs the loop, decides next action, manages history
- Tool plane — MCP servers exposing capabilities
- Inference plane — model server with prefix-aware KV cache (paged blocks, radix tree, etc.)
6. Cache Across Turns
Inside one user session, the conversation grows monotonically. The model server can keep the KV blocks for everything up to turn N-1 warm; turn N only prefills the new user message and any tool results.
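The reuse condition is just an append-only invariant on the token stream; a sketch of the check, with placeholder token IDs:

```python
# Sketch: turn N's prompt reuses cached KV exactly as far as turn N-1's
# prompt is a byte-identical prefix of it. Any edit above that boundary
# (reordered tools, rewritten history) shrinks the reusable length to
# the first differing token.
def reusable_prefix_len(prev_tokens: list[int], next_tokens: list[int]) -> int:
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n  # KV for tokens [0, n) is reused; [n, ...) is prefilled

prev = [1, 2, 3, 4]          # turn N-1 prompt (placeholder token IDs)
nxt = [1, 2, 3, 4, 5, 6]     # turn N: same prefix + new user message
assert reusable_prefix_len(prev, nxt) == 4
```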
7. Branching & Parallel Tool Calls
Many agent frameworks let the model emit several tool calls in one turn that execute in parallel (e.g. list_files and read_config at once). After all tool results arrive, the agent re-prompts the model with all results appended.
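Parallel fan-out on the agent side is ordinary concurrency. Below is a minimal sketch of the run_tools_in_parallel helper referenced in the loop of section 11; the mcp_client.call_tool(name, args) coroutine is an assumed interface, and the tool_result shape follows Anthropic's convention:

```python
import asyncio

def run_tools_in_parallel(tool_uses, mcp_client):
    # Execute all tool_use blocks from one model turn concurrently and
    # return Anthropic-style tool_result content blocks in order.
    # mcp_client.call_tool(name, args) is an assumed async interface.
    async def gather_all():
        async def one(tu):
            result = await mcp_client.call_tool(tu.name, tu.input)
            return {"type": "tool_result", "tool_use_id": tu.id,
                    "content": result}
        return await asyncio.gather(*(one(tu) for tu in tool_uses))
    return asyncio.run(gather_all())
```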
Some agent designs go further and explore multiple branches in parallel — speculative tool calls, tree-of-thought, or N-best sampling. Each branch shares a common parent prefix but diverges at the leaves. Cache implementations like vLLM use copy-on-write on KV blocks so branches share physical blocks until they diverge.
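To see why copy-on-write makes branching cheap, here is a toy block pool with refcounts; a heavily simplified model of what vLLM does, not its actual code:

```python
class BlockPool:
    # Toy model of copy-on-write KV blocks: branches share physical
    # blocks via refcounts and copy a block only when they write to it.
    def __init__(self):
        self.refcount: dict[int, int] = {}   # block id -> sequences using it
        self.next_id = 0

    def alloc(self) -> int:
        self.next_id += 1
        self.refcount[self.next_id] = 1
        return self.next_id

    def fork(self, blocks: list[int]) -> list[int]:
        # A new branch shares every parent block: bump refcounts, copy nothing.
        for b in blocks:
            self.refcount[b] += 1
        return list(blocks)

    def write(self, blocks: list[int], i: int) -> None:
        # Writing into a shared block triggers the copy; unshared blocks
        # are written in place.
        if self.refcount[blocks[i]] > 1:
            self.refcount[blocks[i]] -= 1
            blocks[i] = self.alloc()

pool = BlockPool()
parent = [pool.alloc(), pool.alloc()]   # two cached prefix blocks
branch = pool.fork(parent)              # branch shares both, zero copies
pool.write(branch, 1)                   # branch diverges: block 2 is copied
assert parent[0] == branch[0] and parent[1] != branch[1]
```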
8. Cross-Request Prefix Sharing
The biggest win in multi-tenant agent serving comes from sharing prefix KV blocks across different sessions that happen to use the same system prompt and tool set. Most production agents use the same configured prompt for every user — those tokens only need to be encoded once for the whole fleet.
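One way paged caches realize this is content-addressed blocks: each fixed-size block of prompt tokens is keyed by a hash chained through everything before it, so two sessions with byte-identical prefixes resolve to the same physical blocks. A simplified model (the chaining scheme here is illustrative, though 16 tokens per block matches vLLM's default):

```python
import hashlib

BLOCK = 16  # tokens per KV block (vLLM's default block size)

def block_keys(token_ids: list[int]) -> list[str]:
    # Chain each block's hash with the hash of everything before it,
    # so a block is shared only when its entire prefix matches too.
    keys, prefix = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prefix = hashlib.sha256(
            prefix + str(token_ids[i:i + BLOCK]).encode()
        ).digest()
        keys.append(prefix.hex()[:12])
    return keys

# Two sessions: same 64-token system prompt + tools, different user turns.
a = block_keys(list(range(64)) + [901, 902])
b = block_keys(list(range(64)) + [777, 778])
assert a[:4] == b[:4]  # the shared prefix maps to the same physical blocks
```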
9. RadixAttention — A Tree of Prefixes
RadixAttention (introduced by SGLang) generalizes prefix caching to a radix tree of token sequences. Every prompt seen by the server is inserted into the tree; common prefixes are shared at every depth, not just at the system-prompt boundary.
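A toy version of the matching logic, using a plain per-token trie rather than a compressed radix tree, and omitting the KV block references and LRU eviction a real implementation carries at each node:

```python
class RadixNode:
    def __init__(self):
        self.children: dict[int, "RadixNode"] = {}  # token id -> child

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens: list[int]) -> int:
        # Longest prefix already in the tree = reusable KV length.
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, n = node.children[t], n + 1
        return n

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

cache = RadixCache()
cache.insert([1, 2, 3, 4, 5])                   # first prompt seen
assert cache.match_prefix([1, 2, 3, 9]) == 3    # shares a 3-token prefix
```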
10. MCP-Specific Patterns
Stable tool ordering
When the agent enumerates MCP servers and merges their tool lists into the prompt, the order must be deterministic. If servers come up in different order or tools are sorted by a non-stable key, the prompt prefix changes byte-for-byte and the cache misses.
Rule: sort tools by (server_name, tool_name) at registration time and keep that order across requests.
Tool definition versioning
An MCP server can re-publish its tool list (e.g. after a hot reload). When this happens, every cached prefix that included the old definitions becomes stale. Track a tools_hash per request — when it changes, the inference server should evict matching prefix blocks.
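A sketch of such a hash: canonicalize the serialized tool list so the value changes only when a definition really changes. (This also serves as the hash_tools helper used in the registry sketch in section 11.)

```python
import hashlib
import json

def hash_tools(tools) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash changes
    # only when a name, description, or schema actually changes, never
    # because of whitespace or key-order drift.
    canonical = json.dumps(
        [t.to_schema() for t in tools],
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()
```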
Resource content vs tool result
MCP resources (e.g. file contents read via resources/read) are generally large and changeable; including them inline in the prefix usually defeats caching. Two patterns work:
- Lazy: only the tool definition is in the prefix; the agent reads the resource on demand and includes the content as a tool result after the cache boundary.
- Pinned: for read-mostly resources (a project README, schema), include in the cached prefix and invalidate when the source changes.
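A sketch of the lazy pattern; the read_resource tool and the message shapes follow Anthropic's tool-result convention, but the names are illustrative:

```python
# Lazy pattern sketch: only this small, stable definition lives in the
# cached prefix. The (large, changeable) resource body arrives later as
# a tool result, below the cache boundary.
READ_RESOURCE_TOOL = {
    "name": "read_resource",
    "description": "Read an MCP resource by URI.",
    "input_schema": {
        "type": "object",
        "properties": {"uri": {"type": "string"}},
        "required": ["uri"],
    },
}

def resource_as_tool_result(tool_use_id: str, content: str) -> dict:
    # Appended after history, so the stable prefix stays byte-identical.
    return {"role": "user", "content": [{
        "type": "tool_result", "tool_use_id": tool_use_id,
        "content": content,
    }]}
```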
Don't put dynamic data in the system prompt
It is tempting to include the current time, user ID, or session UUID at the top of the system prompt. Don't — every request becomes a unique prefix and nothing is cacheable. Pass dynamic context as a separate user message after the cache boundary.
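A before/after sketch:

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()
user_id = "u_123"  # placeholder

# Anti-pattern: timestamp and user ID baked into the system prompt make
# every request a unique prefix; nothing is shareable.
bad_system = f"You are a helpful agent. Time: {now}. User: {user_id}"

# Cache-friendly: keep the system prompt byte-stable and pass dynamic
# context as its own message, below the cache boundary.
good_system = "You are a helpful agent."
context_msg = {"role": "user",
               "content": f"[context] time={now} user={user_id}"}
```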
| Pattern | Cache friendly? | Why |
|---|---|---|
| Sort MCP tools deterministically | Yes | Keeps prefix bytes stable |
| Inject timestamp in system prompt | No | Every request → unique prefix |
| Include user ID in system prompt | No | Prevents cross-session sharing |
| Tool result appended after history | Yes | Below cache boundary |
| Re-fetch tool list every request | Maybe | OK if hash matches |
| Compress / summarize old turns | No | Rewrites history → invalidates cache |
11. Cache-Aware Agent Loop
A simplified loop using Anthropic-style explicit cache markers (cache_control). The same shape works for any provider that supports prefix caching.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def build_prompt(system, mcp_tools, history, new_input):
    # MCP tools sorted deterministically (keeps prefix bytes stable)
    tools = sorted(mcp_tools, key=lambda t: (t.server, t.name))
    return {
        "system": [
            {"type": "text", "text": system,
             "cache_control": {"type": "ephemeral"}}  # ← cache boundary #1
        ],
        "tools": [t.to_schema() for t in tools],  # tools also cached
        "messages": [
            *history,  # everything up to N-1
            # cache boundary #2 — insert marker on the LAST history message
            {"role": "user", "content": new_input},  # fresh, prefilled fully
        ],
    }

def agent_step(state, user_msg):
    state.history.append({"role": "user", "content": user_msg})
    while True:
        prompt = build_prompt(SYSTEM, state.mcp_tools,
                              state.history[:-1],
                              state.history[-1]["content"])
        resp = client.messages.create(**prompt, model="claude-…",
                                      max_tokens=4096)
        state.history.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return resp  # done
        # collect the tool_use blocks from the assistant turn
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        tool_results = run_tools_in_parallel(tool_uses, state.mcp_client)
        state.history.append({"role": "user", "content": tool_results})
```
Two practical notes:
- The cache_control marker tells the inference server "everything up to and including this content block is cacheable." On Anthropic, you can place up to 4 markers — typically one after the system prompt, one after tools, and one on the last assistant turn before each new tool result.
- Most modern providers (OpenAI, Anthropic, vLLM, SGLang) auto-detect prefix matches even without explicit markers. Markers mainly let you control eviction and pin hot prefixes.
MCP client side
```python
import logging

log = logging.getLogger(__name__)

class McpToolRegistry:
    def __init__(self, servers):
        self.servers = servers
        self.tools = []
        self.tools_hash = None

    async def refresh(self):
        all_tools = []
        # Deterministic order: sort servers, then tools within each server,
        # so the serialized tool list is byte-stable across requests.
        for srv in sorted(self.servers, key=lambda s: s.name):
            tools = await srv.list_tools()
            for t in sorted(tools, key=lambda t: t["name"]):
                all_tools.append(McpTool(server=srv.name, **t))
        new_hash = hash_tools(all_tools)
        if new_hash != self.tools_hash:
            log.info("tool set changed — prefix cache will miss once")
        self.tools_hash = new_hash
        self.tools = all_tools
```
12. Pitfalls & Anti-Patterns
- Per-request UUIDs in system prompt. Kills sharing. Move to a separate message.
- "Current time is …" preamble. Same problem. Either drop it or pass as user-turn metadata.
- Reordering tool definitions. A non-stable sort (e.g. dict ordering in older Python) silently invalidates prefixes.
- Inline-summarizing old history. Rewrites the prefix. Either accept the cache miss as a one-time cost when summarizing, or summarize after the cache boundary.
- Mutating tool descriptions per user. Personalizing tool docstrings prevents cross-session sharing — keep tool defs identical and pass user context separately.
- Forgetting the cache TTL. Provider caches expire: Anthropic's ephemeral cache lives about 5 minutes, refreshed on each hit. An idle agent will pay full prefill on its next call. Pin or refresh hot prefixes if traffic is bursty (see the keep-warm sketch after this list).
- Mixing model versions. KV blocks are model-specific. Switching between Sonnet and Opus mid-session means a fresh prefill each switch.
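A keep-warm sketch for bursty traffic, assuming a provider where a cache hit refreshes the TTL (Anthropic's ephemeral cache behaves this way); the 4-minute interval and the prompt_builder helper are illustrative:

```python
import threading

KEEPALIVE_SECS = 240  # refresh comfortably inside a ~5-minute TTL

def keep_prefix_warm(client, prompt_builder, stop: threading.Event):
    # Periodically send a minimal request that reuses the cached prefix.
    # The cache hit refreshes the TTL; max_tokens=1 keeps the cost small.
    while not stop.wait(KEEPALIVE_SECS):
        prompt = prompt_builder(new_input="ping")  # same prefix, tiny suffix
        client.messages.create(**prompt, model="claude-…", max_tokens=1)
```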
13. Vendor Implementations
| System | Mechanism | API surface |
|---|---|---|
| Anthropic API | Explicit prefix cache, paged blocks | cache_control: {type: "ephemeral"} on content blocks |
| OpenAI API | Automatic prefix caching | None — happens on prompts ≥1024 tokens |
| vLLM | PagedAttention + prefix cache | --enable-prefix-caching flag |
| SGLang | RadixAttention | Default; tree of prefixes |
| TensorRT-LLM | KV cache reuse | enable_kv_cache_reuse in builder |
| llama.cpp | Per-session KV cache | --prompt-cache |
14. Summary
- Agentic workflows are loops of LLM calls — most of each prompt is identical to the last call. Prefix-cached KV is the dominant inference-time optimization.
- MCP fan-out concentrates a large, stable block of tool definitions in the prompt prefix. That block is the highest-value caching target.
- Order prompt content from most stable → most volatile, and place cache boundaries at the natural seams: after the system prompt, after the tool list, after each completed turn.
- Within a session, KV blocks accumulate. Across sessions, blocks for the system prompt and tool list are physically shared via paged or radix-tree KV caches.
- For parallel tool calls and N-best sampling, copy-on-write of KV blocks lets branches share their parent prefix until they diverge.
- Anti-patterns: per-request UUIDs/timestamps in the prefix, non-deterministic tool ordering, in-place history summarization, personalized tool descriptions.