Agent Design with MCP and KV Cache

How agentic workflows — long contexts, tool loops, multi-turn dialogue, MCP server fan-out — exploit KV cache reuse to cut latency and cost. Architecture, prompt layout, branching, and cross-request prefix sharing.

1. Why Cache Matters for Agents

An agent is not a single LLM call — it is a loop of calls. Each iteration appends a model response, tool invocation, or tool result to the context, then re-invokes the model. The prefix grows monotonically, and most of it is identical from one call to the next.

Naively, every iteration re-encodes the entire prompt, paying full prefill cost on tokens it processed seconds earlier. With KV cache reuse, the inference server skips re-encoding the unchanged prefix and only processes the new suffix.

Magnitude: A typical agentic call has a 4–10K token system prompt (instructions + tool definitions + few-shot examples) and grows by hundreds of tokens per turn. Reusing the prefix can reduce time-to-first-token by 5–10× and cut input cost by up to ~90% (Anthropic prices cached input reads at roughly 10% of the normal input rate; OpenAI also discounts cached input tokens).

2. MCP in 60 Seconds

Model Context Protocol (MCP) is an open standard for connecting LLM applications to external tools, data, and prompts. An MCP server exposes:

  • Tools — callable functions with JSON schemas
  • Resources — readable content (files, DB rows, URLs)
  • Prompts — parameterized prompt templates

The agent host runs an MCP client that discovers servers, lists their tools, and forwards tool invocations. From the LLM's perspective, MCP tools look identical to native tool definitions — they are just merged into the model's tool list at request time.
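
For concreteness, here is roughly the shape the merged tool list takes by the time it reaches the model. The server and tool names below are made up, and the schema format mirrors common tool-calling APIs; real MCP servers supply the same pieces (name, description, input schema).

# Hypothetical merged tool list as it might appear in the request body.
# Server prefixes keep names unique across MCP servers.
merged_tools = [
    {
        "name": "github__list_prs",
        "description": "List open pull requests for a repository.",
        "input_schema": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
            "required": ["repo"],
        },
    },
    {
        "name": "postgres__query",
        "description": "Run a read-only SQL query.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
]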

Figure 1 — MCP fan-out from the agent host. The agent host (LLM + agent loop + MCP client) talks to MCP servers over stdio or HTTP (e.g. GitHub: list_prs, create_issue; Postgres: query, schema; Filesystem: read, write, glob), merges their tool definitions into one prompt with cache markers, and sends it to the inference server, where the KV cache lives and TTFT depends on the cache hit.
Implication for caching: The set of MCP tools attached to an agent forms a large but stable chunk of every prompt. That chunk is the most valuable thing to cache.

3. Anatomy of an Agent Prompt

Every request an agent sends to the model has a layered structure. Understanding which layers are stable vs volatile is the key to designing for cache reuse.

Figure 2 — A typical agent prompt, ordered top-to-bottom from most stable to most volatile: system prompt + role/persona (~1–3K tokens, identical across every call), tool definitions including MCP tools (~3–8K tokens, stable until the tool set changes), optional few-shot examples (stable per agent config), retrieved context / memory (changes per query, usually not cacheable), prior turns of user/assistant/tool results (grows monotonically, cacheable up to turn N-1), and the new input (always fresh, full prefill required). The stable prefix and the conversation prefix are the two cache targets; cache boundaries belong at those seams.
Layer                   | Volatility        | Cache strategy
------------------------|-------------------|------------------------------------------
System prompt           | Static            | Cache forever, share across users
Tool definitions        | Static (per agent)| Cache; invalidate on tool registry change
Few-shot examples       | Static            | Cache as part of stable prefix
Retrieved context (RAG) | Per-query         | Generally not worth caching
Conversation history    | Append-only       | Cache up to turn N-1; reuse next turn
New turn                | Always fresh      | Always prefilled

4. The Cache Opportunity

Visualize the same agent making three sequential model calls. Without a prefix cache, every call re-encodes everything. With one, only the suffix is new each time.

Figure 3 — Without/with prefix cache across three agent turns. Without a cache, call 2 re-prefills sys + tools plus msg₁ + reply₁ before processing msg₂, and call 3 re-prefills sys + tools and msg₁..reply₂ yet again before msg₃; all of that re-prefill is paid-for waste. With a cache, calls 2 and 3 hit the blocks for sys + tools and the accumulated history (charged at roughly 10% of normal input cost) and only the new delta is real prefill work. Cumulative savings grow with conversation length.
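
A back-of-the-envelope version of the same comparison, with illustrative sizes (prefix_tokens stands in for the system prompt plus tool definitions):

# Rough token accounting for three agent calls (illustrative numbers).
prefix_tokens = 8_000          # system prompt + tool definitions
turns = [300, 500, 400]        # new tokens per call: messages, replies, tool results

without_cache = 0
with_cache = 0
history = 0
for i, new in enumerate(turns):
    history += new
    without_cache += prefix_tokens + history   # re-prefill everything, every call
    with_cache += new                          # prefill only the delta
    if i == 0:
        with_cache += prefix_tokens            # call 1 still pays the full prefix once

print(f"prefilled without cache: {without_cache:,} tokens")
print(f"prefilled with cache:    {with_cache:,} tokens")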

5. Reference Architecture

A KV-cache-aware agent system has three planes:

  1. Agent plane — runs the loop, decides next action, manages history
  2. Tool plane — MCP servers exposing capabilities
  3. Inference plane — model server with prefix-aware KV cache (paged blocks, radix tree, etc.)
Figure 4 — Three-plane reference architecture. The agent runtime (loop/planner, history manager, prompt assembler, MCP client) serves users and sessions, calls out to MCP servers (github, postgres, filesystem over stdio), and sends prompts with cache markers to the inference server. The inference plane owns the prefix KV cache (paged blocks or a radix tree, LRU eviction plus pinned hot prefixes); on a hit the model prefills only the suffix, then runs the decode loop.

6. Cache Across Turns

Inside one user session, the conversation grows monotonically. The model server can keep the KV blocks for everything up to turn N-1 warm; turn N only prefills the new user message and any tool results.

Figure 5 — KV blocks accumulate within a session. Turn 1 builds blocks [B0,B1,B2] for sys + tools + user₁; turn 2 hits [B0..B3] and appends B4 for the tool result; turn 3 hits [B0..B5] and appends B6 for user₂; turn 4 hits [B0..B7]. Each turn extends the cached prefix and prefills only the delta.
Practical effect: over a 50-turn agent conversation, roughly 95% of input tokens end up served from cache; the remaining ~5% of prefill work is the new tool result or user message on each turn.

7. Branching & Parallel Tool Calls

Many agent frameworks let the model emit several tool calls in one turn that execute in parallel (e.g. list_files and read_config at once). After all tool results arrive, the agent re-prompts the model with all results appended.
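
A minimal sketch of that fan-out step, in the shape of the run_tools_in_parallel helper used in the loop in section 11. The mcp_client.call_tool coroutine is an assumption standing in for whatever your MCP client SDK exposes; the tool_use fields (id, name, input) follow the Anthropic block shape.

import asyncio

def run_tools_in_parallel(tool_uses, mcp_client):
    # Execute every tool call from one model turn concurrently.
    # Assumes mcp_client.call_tool(name, arguments) is an async method that
    # routes the call to the right MCP server (a stand-in for your SDK's API).
    async def run_one(tu):
        result = await mcp_client.call_tool(tu.name, tu.input)
        # Anthropic-style tool_result block, keyed back to the tool_use id.
        return {"type": "tool_result", "tool_use_id": tu.id, "content": result}

    async def run_all():
        return await asyncio.gather(*(run_one(tu) for tu in tool_uses))

    # In an already-async host you would simply await the gather directly.
    return asyncio.run(run_all())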

Some agent designs go further and explore multiple branches in parallel — speculative tool calls, tree-of-thought, or N-best sampling. Each branch shares a common parent prefix but diverges at the leaves. Cache implementations like vLLM use copy-on-write on KV blocks so branches share physical blocks until they diverge.

Figure 6 — Parallel tool calls fan out from a shared prefix (sys + tools + history) and re-merge. The model emits three tool calls; their results come back; the re-prompt with all three results appended reuses the shared prefix blocks, so only the result suffix is new. For speculative N-best sampling, branches share parent KV blocks via copy-on-write until they emit divergent tokens.

8. Cross-Request Prefix Sharing

The biggest win in multi-tenant agent serving comes from sharing prefix KV blocks across different sessions that happen to use the same system prompt and tool set. Most production agents use the same configured prompt for every user — those tokens only need to be encoded once for the whole fleet.

Figure 7 — Many sessions point at the same physical KV blocks for the system prompt and tool definitions. The shared prefix blocks [sys + tools + few-shot] are refcounted and serve sessions A–D, each of which keeps only its own tail blocks. One physical copy of the system-prompt KV serves all sessions; VRAM saved ≈ (number of concurrent sessions − 1) × prefix size.
Why this needs PagedAttention: Because KV blocks are addressed via a per-sequence block table (not contiguous tensors), two sequences can simply hold the same physical block ID for their shared prefix. Without paging, sharing requires copying.
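
A toy model of that mechanism, not vLLM's actual block manager: each sequence holds a block table of physical block IDs, so a shared prefix is just the same IDs appearing in several tables, tracked with a refcount.

from collections import defaultdict

refcount = defaultdict(int)        # physical KV block id -> number of sequences using it

def fork_from_prefix(prefix_blocks):
    # A new session copies the *table*, not the blocks: the shared prefix
    # stays as one physical copy, just referenced one more time.
    for block_id in prefix_blocks:
        refcount[block_id] += 1
    return list(prefix_blocks)

def append_private_block(block_table, block_id):
    # Divergence point: new tokens go into a block owned only by this sequence.
    block_table.append(block_id)
    refcount[block_id] += 1

shared_prefix = [0, 1, 2, 3]                       # [sys + tools + few-shot] KV blocks
sessions = [fork_from_prefix(shared_prefix) for _ in range(4)]
for i, table in enumerate(sessions):
    append_private_block(table, 100 + i)           # each session's own conversation tail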

9. RadixAttention — A Tree of Prefixes

RadixAttention (introduced by SGLang) generalizes prefix caching to a radix tree of token sequences. Every prompt seen by the server is inserted into the tree; common prefixes are shared at every depth, not just at the system-prompt boundary.

Figure 8 — Radix tree of prompt prefixes. The root holds [sys + tool defs]; children diverge per request ("List my open PRs", "Search docs for X", "Run query SELECT…") and fork again at tool results or added context. Each node owns a contiguous run of KV blocks; insertion finds the longest matching path, then forks; LRU eviction happens at the leaves.
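
A stripped-down sketch of the data structure, tracking token paths only; real RadixAttention hangs KV block ranges, refcounts, and LRU timestamps off each node.

class RadixNode:
    def __init__(self):
        self.children = {}      # first divergent token -> child RadixNode
        self.tokens = []        # run of tokens owned by this node (maps to KV blocks)

    def insert(self, tokens):
        """Insert a token sequence, sharing the longest common prefix."""
        node = self
        while tokens:
            head = tokens[0]
            if head not in node.children:
                leaf = RadixNode()
                leaf.tokens = list(tokens)          # new leaf owns the whole remainder
                node.children[head] = leaf
                return
            child = node.children[head]
            # length of the common prefix with this child's token run
            n = 0
            while n < len(child.tokens) and n < len(tokens) and child.tokens[n] == tokens[n]:
                n += 1
            if n < len(child.tokens):               # partial match: split the child
                split = RadixNode()
                split.tokens = child.tokens[n:]
                split.children = child.children
                child.tokens = child.tokens[:n]
                child.children = {split.tokens[0]: split}
            tokens = tokens[n:]
            node = child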

10. MCP-Specific Patterns

Stable tool ordering

When the agent enumerates MCP servers and merges their tool lists into the prompt, the order must be deterministic. If servers come up in different order or tools are sorted by a non-stable key, the prompt prefix changes byte-for-byte and the cache misses.

Rule: sort tools by (server_name, tool_name) at registration time and keep that order across requests.

Tool definition versioning

An MCP server can re-publish its tool list (e.g. after a hot reload). When this happens, every cached prefix that included the old definitions becomes stale. Track a tools_hash per request — when it changes, the inference server should evict matching prefix blocks.
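
One way to compute that hash, and a plausible definition for the hash_tools helper assumed in the registry sketch in section 11 (it reuses the same to_schema() serialization the prompt is built from):

import hashlib, json

def hash_tools(tools):
    # Stable digest of the tool set; any change to names, descriptions, or
    # schemas produces a new hash (and therefore one expected cache miss).
    canonical = json.dumps(
        [t.to_schema() for t in tools],
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()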

Resource content vs tool result

MCP resources (e.g. file contents read via resources/read) are generally large and changeable; including them inline in the prefix usually defeats caching. Two patterns work:

  • Lazy: only the tool definition is in the prefix; the agent reads the resource on demand and includes the content as a tool result after the cache boundary.
  • Pinned: for read-mostly resources (a project README, schema), include in the cached prefix and invalidate when the source changes.

Don't put dynamic data in the system prompt

It is tempting to include the current time, user ID, or session UUID at the top of the system prompt. Don't — every request becomes a unique prefix and nothing is cacheable. Pass dynamic context as a separate user message after the cache boundary.
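
A minimal before/after with made-up values, using the same message shape as the rest of this post:

from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()
user_id, user_question = "u_123", "What changed in the repo today?"   # illustrative values

# BAD: dynamic values baked into the system prompt make every request a unique prefix
system_bad = f"You are a helpful agent. Current time: {now}. User: {user_id}."

# GOOD: the system prompt stays byte-identical for everyone; dynamic context
# travels below the cache boundary as part of the user turn
system_good = "You are a helpful agent."
messages = [{"role": "user",
             "content": f"(session context: time={now}, user={user_id})\n\n{user_question}"}]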

Pattern                            | Cache friendly? | Why
-----------------------------------|-----------------|------------------------------------------
Sort MCP tools deterministically   | Yes             | Keeps prefix bytes stable
Inject timestamp in system prompt  | No              | Every request → unique prefix
Include user ID in system prompt   | No              | Prevents cross-session sharing
Tool result appended after history | Yes             | Below cache boundary
Re-fetch tool list every request   | Maybe           | OK if hash matches
Compress / summarize old turns     | No              | Rewrites history → invalidates cache

11. Cache-Aware Agent Loop

A simplified loop using Anthropic-style explicit cache markers (cache_control). The same shape works for any provider that supports prefix caching.

def build_prompt(system, mcp_tools, history, new_input):
    # MCP tools sorted deterministically
    tools = sorted(mcp_tools, key=lambda t: (t.server, t.name))

    return {
        "system": [
            {"type": "text", "text": system,
             "cache_control": {"type": "ephemeral"}}      # ← cache boundary #1
        ],
        "tools": [t.to_schema() for t in tools],          # tools also cached
        "messages": [
            *history,                                     # everything up to N-1
            # cache boundary #2 — insert marker on the LAST history message
            {"role": "user", "content": new_input}        # fresh, prefilled fully
        ]
    }

def agent_step(state, user_msg):
    state.history.append({"role": "user", "content": user_msg})
    while True:
        prompt = build_prompt(SYSTEM, state.mcp_tools,
                              state.history[:-1],
                              state.history[-1]["content"])   # pass content, not the message dict
        resp = anthropic.messages.create(**prompt, model="claude-…",
                                         max_tokens=4096)     # max_tokens is required by the API
        state.history.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason != "tool_use":
            return resp                                       # done

        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        tool_results = run_tools_in_parallel(tool_uses, state.mcp_client)
        state.history.append({"role": "user", "content": tool_results})

Two practical notes:

  • The cache_control marker tells the inference server "everything up to and including this content block is cacheable." On Anthropic, you can place up to 4 markers — typically one after the system prompt, one after tools, and one on the last assistant turn before each new tool result (sketched just below).
  • Most modern providers (OpenAI, Anthropic, vLLM, SGLang) auto-detect prefix matches even without explicit markers. Markers mainly let you control eviction and pin hot prefixes.
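
For the first note, a sketch of what "marker on the last history message" looks like in code, assuming dict-shaped messages whose content is a list of blocks (SDK response objects would need converting to plain dicts first):

def mark_history_boundary(history):
    # Cache boundary #2: mark the final content block of the last prior
    # message with cache_control, so everything up to turn N-1 is cacheable.
    if not history:
        return history
    *head, last = history
    blocks = last["content"]
    if isinstance(blocks, str):                       # normalize plain-string content
        blocks = [{"type": "text", "text": blocks}]
    marked = [*blocks[:-1],
              {**blocks[-1], "cache_control": {"type": "ephemeral"}}]
    return [*head, {**last, "content": marked}]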

MCP client side

class McpToolRegistry:
    def __init__(self, servers):
        self.servers = servers
        self.tools = []
        self.tools_hash = None

    async def refresh(self):
        all_tools = []
        # deterministic (server_name, tool_name) order keeps the prompt prefix byte-stable
        for srv in sorted(self.servers, key=lambda s: s.name):
            tools = await srv.list_tools()
            for t in sorted(tools, key=lambda t: t.name):
                all_tools.append(McpTool(server=srv.name, name=t.name,
                                         description=t.description,
                                         input_schema=t.inputSchema))
        new_hash = hash_tools(all_tools)
        if self.tools_hash is not None and new_hash != self.tools_hash:
            log.info("tool set changed — prefix cache will miss once")
        self.tools_hash = new_hash
        self.tools = all_tools
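
Wiring the registry into the loop is then mostly a matter of refreshing once at startup (and again on a tools/list_changed notification) and feeding the same ordered list to build_prompt on every request. The names below are illustrative.

async def start_agent(servers, state):
    registry = McpToolRegistry(servers)
    await registry.refresh()             # a tools_hash change here means one expected full prefill
    state.mcp_tools = registry.tools     # identical ordered list every request keeps the prefix stable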

12. Pitfalls & Anti-Patterns

  • Per-request UUIDs in system prompt. Kills sharing. Move to a separate message.
  • "Current time is …" preamble. Same problem. Either drop it or pass as user-turn metadata.
  • Reordering tool definitions. A non-stable sort (e.g. dict ordering in older Python) silently invalidates prefixes.
  • Inline-summarizing old history. Rewrites the prefix. Either accept the cache miss as a one-time cost when summarizing, or summarize after the cache boundary.
  • Mutating tool descriptions per user. Personalizing tool docstrings prevents cross-session sharing — keep tool defs identical and pass user context separately.
  • Forgetting the cache TTL. Most provider caches have a 5-minute TTL. An idle agent will pay full prefill on its next call. Pin or refresh hot prefixes if traffic is bursty.
  • Mixing model versions. KV blocks are model-specific. Switching between Sonnet and Opus mid-session means a fresh prefill each switch.

13. Vendor Implementations

System        | Mechanism                           | API surface
--------------|-------------------------------------|------------------------------------------------------
Anthropic API | Explicit prefix cache, paged blocks | cache_control: {type: "ephemeral"} on content blocks
OpenAI API    | Automatic prefix caching            | None — happens on prompts ≥1024 tokens
vLLM          | PagedAttention + prefix cache       | --enable-prefix-caching flag
SGLang        | RadixAttention                      | Default; tree of prefixes
TensorRT-LLM  | KV cache reuse                      | enable_kv_cache_reuse in builder
llama.cpp     | Per-session KV cache                | --prompt-cache
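
For self-hosted serving the switch is usually a single engine argument. A sketch with vLLM's offline API (model name and prompts are illustrative, and recent vLLM versions may enable prefix caching by default):

from vllm import LLM, SamplingParams

SYSTEM_AND_TOOLS = "You are an agent…\n<several thousand tokens of tool definitions>"  # shared stable prefix

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
prompts = [SYSTEM_AND_TOOLS + "\n\nUser: list my open PRs",
           SYSTEM_AND_TOOLS + "\n\nUser: search the docs for X"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
# The second prompt's shared prefix reuses KV blocks built while prefilling the first.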

14. Summary

  • Agentic workflows are loops of LLM calls — most of each prompt is identical to the last call. Prefix-cached KV is the dominant inference-time optimization.
  • MCP fan-out concentrates a large, stable block of tool definitions in the prompt prefix. That block is the highest-value caching target.
  • Order prompt content from most stable → most volatile, and place cache boundaries at the natural seams: after the system prompt, after the tool list, after each completed turn.
  • Within a session, KV blocks accumulate. Across sessions, blocks for the system prompt and tool list are physically shared via paged or radix-tree KV caches.
  • For parallel tool calls and N-best sampling, copy-on-write of KV blocks lets branches share their parent prefix until they diverge.
  • Anti-patterns: per-request UUIDs/timestamps in the prefix, non-deterministic tool ordering, in-place history summarization, personalized tool descriptions.
Bottom line: Designing an agent for cache reuse is mostly about discipline in prompt construction — keep the prefix byte-stable, keep dynamic data below the cache line, and let the inference server do the rest. The savings (5–10× faster, ~90% cheaper input) compound with every turn.