Agent Design with MCP and KV Cache

How agentic workflows — long contexts, tool loops, multi-turn dialogue, MCP server fan-out — exploit KV cache reuse to cut latency and cost. Architecture, prompt layout, branching, and cross-request prefix sharing.

1. Why Cache Matters for Agents

An agent is not a single LLM call — it is a loop of calls. Each iteration appends a model response, tool invocation, or tool result to the context, then re-invokes the model. The prefix grows monotonically, and most of it is identical from one call to the next.

Naively, every iteration re-encodes the entire prompt, paying full prefill cost on tokens it processed seconds earlier. With KV cache reuse, the inference server skips re-encoding the unchanged prefix and only processes the new suffix.

Magnitude: A typical agentic call has a 4–10K token system prompt (instructions + tool definitions + few-shot examples) and grows by hundreds of tokens per turn. Reusing the prefix can reduce time-to-first-token by 5–10× and cut input cost by up to ~90% (Anthropic prices cached input reads at roughly 10% of the normal input rate; OpenAI also discounts cached input tokens).

2. MCP in 60 Seconds

Model Context Protocol (MCP) is an open standard for connecting LLM applications to external tools, data, and prompts. An MCP server exposes:

  • Tools — callable functions with JSON schemas
  • Resources — readable content (files, DB rows, URLs)
  • Prompts — parameterized prompt templates

The agent host runs an MCP client that discovers servers, lists their tools, and forwards tool invocations. From the LLM's perspective, MCP tools look identical to native tool definitions — they are just merged into the model's tool list at request time.
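
For concreteness, here is roughly the shape the merged tool list takes by the time it reaches the model. The server and tool names below are made up, and the schema format mirrors common tool-calling APIs; real MCP servers supply the same pieces (name, description, input schema).

# Hypothetical merged tool list as it might appear in the request body.
# Server prefixes keep names unique across MCP servers.
merged_tools = [
    {
        "name": "github__list_prs",
        "description": "List open pull requests for a repository.",
        "input_schema": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
            "required": ["repo"],
        },
    },
    {
        "name": "postgres__query",
        "description": "Run a read-only SQL query.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
]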

Figure 1 — MCP fan-out from the agent host. The agent host (LLM + agent loop + MCP client) talks to MCP servers over stdio or HTTP (e.g. GitHub: list_prs, create_issue; Postgres: query, schema; Filesystem: read, write, glob), merges their tool definitions into one prompt with cache markers, and sends it to the inference server, where the KV cache lives and TTFT depends on the cache hit.
Implication for caching: The set of MCP tools attached to an agent forms a large but stable chunk of every prompt. That chunk is the most valuable thing to cache.

3. Anatomy of an Agent Prompt

Every request an agent sends to the model has a layered structure. Understanding which layers are stable vs volatile is the key to designing for cache reuse.

Figure 2 — A typical agent prompt, ordered top-to-bottom from most stable to most volatile: system prompt + role/persona (~1–3K tokens, identical across every call), tool definitions including MCP tools (~3–8K tokens, stable until the tool set changes), optional few-shot examples (stable per agent config), retrieved context / memory (changes per query, usually not cacheable), prior turns of user/assistant/tool results (grows monotonically, cacheable up to turn N-1), and the new input (always fresh, full prefill required). The stable prefix and the conversation prefix are the two cache targets; cache boundaries belong at those seams.
Layer                   | Volatility        | Cache strategy
------------------------|-------------------|------------------------------------------
System prompt           | Static            | Cache forever, share across users
Tool definitions        | Static (per agent)| Cache; invalidate on tool registry change
Few-shot examples       | Static            | Cache as part of stable prefix
Retrieved context (RAG) | Per-query         | Generally not worth caching
Conversation history    | Append-only       | Cache up to turn N-1; reuse next turn
New turn                | Always fresh      | Always prefilled

4. The Cache Opportunity

Visualize the same agent making three sequential model calls. Without a prefix cache, every call re-encodes everything. With one, only the suffix is new each time.

Figure 3 — Without/with prefix cache across three agent turns. Without a cache, call 2 re-prefills sys + tools plus msg₁ + reply₁ before processing msg₂, and call 3 re-prefills sys + tools and msg₁..reply₂ yet again before msg₃; all of that re-prefill is paid-for waste. With a cache, calls 2 and 3 hit the blocks for sys + tools and the accumulated history (charged at roughly 10% of normal input cost) and only the new delta is real prefill work. Cumulative savings grow with conversation length.
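
A back-of-the-envelope version of the same comparison, with illustrative sizes (prefix_tokens stands in for the system prompt plus tool definitions):

# Rough token accounting for three agent calls (illustrative numbers).
prefix_tokens = 8_000          # system prompt + tool definitions
turns = [300, 500, 400]        # new tokens per call: messages, replies, tool results

without_cache = 0
with_cache = 0
history = 0
for i, new in enumerate(turns):
    history += new
    without_cache += prefix_tokens + history   # re-prefill everything, every call
    with_cache += new                          # prefill only the delta
    if i == 0:
        with_cache += prefix_tokens            # call 1 still pays the full prefix once

print(f"prefilled without cache: {without_cache:,} tokens")
print(f"prefilled with cache:    {with_cache:,} tokens")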

5. Reference Architecture

A KV-cache-aware agent system has three planes:

  1. Agent plane — runs the loop, decides next action, manages history
  2. Tool plane — MCP servers exposing capabilities
  3. Inference plane — model server with prefix-aware KV cache (paged blocks, radix tree, etc.)
Figure 4 — Three-plane reference architecture. The agent runtime (loop/planner, history manager, prompt assembler, MCP client) serves users and sessions, calls out to MCP servers (github, postgres, filesystem over stdio), and sends prompts with cache markers to the inference server. The inference plane owns the prefix KV cache (paged blocks or a radix tree, LRU eviction plus pinned hot prefixes); on a hit the model prefills only the suffix, then runs the decode loop.

6. Cache Across Turns

Inside one user session, the conversation grows monotonically. The model server can keep the KV blocks for everything up to turn N-1 warm; turn N only prefills the new user message and any tool results.

Figure 5 — KV blocks accumulate within a session. Turn 1 builds blocks [B0,B1,B2] for sys + tools + user₁; turn 2 hits [B0..B3] and appends B4 for the tool result; turn 3 hits [B0..B5] and appends B6 for user₂; turn 4 hits [B0..B7]. Each turn extends the cached prefix and prefills only the delta.
Practical effect: over a 50-turn agent conversation, roughly 95% of input tokens end up served from cache; the remaining ~5% of prefill work is the new tool result or user message on each turn.

7. Branching & Parallel Tool Calls

Many agent frameworks let the model emit several tool calls in one turn that execute in parallel (e.g. list_files and read_config at once). After all tool results arrive, the agent re-prompts the model with all results appended.
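
A minimal sketch of that fan-out step, in the shape of the run_tools_in_parallel helper used in the loop in section 11. The mcp_client.call_tool coroutine is an assumption standing in for whatever your MCP client SDK exposes; the tool_use fields (id, name, input) follow the Anthropic block shape.

import asyncio

def run_tools_in_parallel(tool_uses, mcp_client):
    # Execute every tool call from one model turn concurrently.
    # Assumes mcp_client.call_tool(name, arguments) is an async method that
    # routes the call to the right MCP server (a stand-in for your SDK's API).
    async def run_one(tu):
        result = await mcp_client.call_tool(tu.name, tu.input)
        # Anthropic-style tool_result block, keyed back to the tool_use id.
        return {"type": "tool_result", "tool_use_id": tu.id, "content": result}

    async def run_all():
        return await asyncio.gather(*(run_one(tu) for tu in tool_uses))

    # In an already-async host you would simply await the gather directly.
    return asyncio.run(run_all())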

Some agent designs go further and explore multiple branches in parallel — speculative tool calls, tree-of-thought, or N-best sampling. Each branch shares a common parent prefix but diverges at the leaves. Cache implementations like vLLM use copy-on-write on KV blocks so branches share physical blocks until they diverge.

Figure 6 — Parallel tool calls fan out from a shared prefix (sys + tools + history) and re-merge. The model emits three tool calls; their results come back; the re-prompt with all three results appended reuses the shared prefix blocks, so only the result suffix is new. For speculative N-best sampling, branches share parent KV blocks via copy-on-write until they emit divergent tokens.

8. Cross-Request Prefix Sharing

The biggest win in multi-tenant agent serving comes from sharing prefix KV blocks across different sessions that happen to use the same system prompt and tool set. Most production agents use the same configured prompt for every user — those tokens only need to be encoded once for the whole fleet.

Figure 7 — Many sessions point at the same physical KV blocks for the system prompt and tool definitions. The shared prefix blocks [sys + tools + few-shot] are refcounted and serve sessions A–D, each of which keeps only its own tail blocks. One physical copy of the system-prompt KV serves all sessions; VRAM saved ≈ (number of concurrent sessions − 1) × prefix size.
Why this needs PagedAttention: Because KV blocks are addressed via a per-sequence block table (not contiguous tensors), two sequences can simply hold the same physical block ID for their shared prefix. Without paging, sharing requires copying.
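
A toy model of that mechanism, not vLLM's actual block manager: each sequence holds a block table of physical block IDs, so a shared prefix is just the same IDs appearing in several tables, tracked with a refcount.

from collections import defaultdict

refcount = defaultdict(int)        # physical KV block id -> number of sequences using it

def fork_from_prefix(prefix_blocks):
    # A new session copies the *table*, not the blocks: the shared prefix
    # stays as one physical copy, just referenced one more time.
    for block_id in prefix_blocks:
        refcount[block_id] += 1
    return list(prefix_blocks)

def append_private_block(block_table, block_id):
    # Divergence point: new tokens go into a block owned only by this sequence.
    block_table.append(block_id)
    refcount[block_id] += 1

shared_prefix = [0, 1, 2, 3]                       # [sys + tools + few-shot] KV blocks
sessions = [fork_from_prefix(shared_prefix) for _ in range(4)]
for i, table in enumerate(sessions):
    append_private_block(table, 100 + i)           # each session's own conversation tail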

9. RadixAttention — A Tree of Prefixes

RadixAttention (introduced by SGLang) generalizes prefix caching to a radix tree of token sequences. Every prompt seen by the server is inserted into the tree; common prefixes are shared at every depth, not just at the system-prompt boundary.

Figure 8 — Radix tree of prompt prefixes. The root holds [sys + tool defs]; children diverge per request ("List my open PRs", "Search docs for X", "Run query SELECT…") and fork again at tool results or added context. Each node owns a contiguous run of KV blocks; insertion finds the longest matching path, then forks; LRU eviction happens at the leaves.
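
A stripped-down sketch of the data structure, tracking token paths only; real RadixAttention hangs KV block ranges, refcounts, and LRU timestamps off each node.

class RadixNode:
    def __init__(self):
        self.children = {}      # first divergent token -> child RadixNode
        self.tokens = []        # run of tokens owned by this node (maps to KV blocks)

    def insert(self, tokens):
        """Insert a token sequence, sharing the longest common prefix."""
        node = self
        while tokens:
            head = tokens[0]
            if head not in node.children:
                leaf = RadixNode()
                leaf.tokens = list(tokens)          # new leaf owns the whole remainder
                node.children[head] = leaf
                return
            child = node.children[head]
            # length of the common prefix with this child's token run
            n = 0
            while n < len(child.tokens) and n < len(tokens) and child.tokens[n] == tokens[n]:
                n += 1
            if n < len(child.tokens):               # partial match: split the child
                split = RadixNode()
                split.tokens = child.tokens[n:]
                split.children = child.children
                child.tokens = child.tokens[:n]
                child.children = {split.tokens[0]: split}
            tokens = tokens[n:]
            node = child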

10. MCP-Specific Patterns

Stable tool ordering

When the agent enumerates MCP servers and merges their tool lists into the prompt, the order must be deterministic. If servers come up in different order or tools are sorted by a non-stable key, the prompt prefix changes byte-for-byte and the cache misses.

Rule: sort tools by (server_name, tool_name) at registration time and keep that order across requests.

Tool definition versioning

An MCP server can re-publish its tool list (e.g. after a hot reload). When this happens, every cached prefix that included the old definitions becomes stale. Track a tools_hash per request — when it changes, the inference server should evict matching prefix blocks.
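
One way to compute that hash, and a plausible definition for the hash_tools helper assumed in the registry sketch in section 11 (it reuses the same to_schema() serialization the prompt is built from):

import hashlib, json

def hash_tools(tools):
    # Stable digest of the tool set; any change to names, descriptions, or
    # schemas produces a new hash (and therefore one expected cache miss).
    canonical = json.dumps(
        [t.to_schema() for t in tools],
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()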

Resource content vs tool result

MCP resources (e.g. file contents read via resources/read) are generally large and changeable; including them inline in the prefix usually defeats caching. Two patterns work:

  • Lazy: only the tool definition is in the prefix; the agent reads the resource on demand and includes the content as a tool result after the cache boundary.
  • Pinned: for read-mostly resources (a project README, schema), include in the cached prefix and invalidate when the source changes.

Don't put dynamic data in the system prompt

It is tempting to include the current time, user ID, or session UUID at the top of the system prompt. Don't — every request becomes a unique prefix and nothing is cacheable. Pass dynamic context as a separate user message after the cache boundary.
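
A minimal before/after with made-up values, using the same message shape as the rest of this post:

from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()
user_id, user_question = "u_123", "What changed in the repo today?"   # illustrative values

# BAD: dynamic values baked into the system prompt make every request a unique prefix
system_bad = f"You are a helpful agent. Current time: {now}. User: {user_id}."

# GOOD: the system prompt stays byte-identical for everyone; dynamic context
# travels below the cache boundary as part of the user turn
system_good = "You are a helpful agent."
messages = [{"role": "user",
             "content": f"(session context: time={now}, user={user_id})\n\n{user_question}"}]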

Pattern                            | Cache friendly? | Why
-----------------------------------|-----------------|------------------------------------------
Sort MCP tools deterministically   | Yes             | Keeps prefix bytes stable
Inject timestamp in system prompt  | No              | Every request → unique prefix
Include user ID in system prompt   | No              | Prevents cross-session sharing
Tool result appended after history | Yes             | Below cache boundary
Re-fetch tool list every request   | Maybe           | OK if hash matches
Compress / summarize old turns     | No              | Rewrites history → invalidates cache

11. Cache-Aware Agent Loop

A simplified loop using Anthropic-style explicit cache markers (cache_control). The same shape works for any provider that supports prefix caching.

def build_prompt(system, mcp_tools, history, new_input):
    # MCP tools sorted deterministically
    tools = sorted(mcp_tools, key=lambda t: (t.server, t.name))

    return {
        "system": [
            {"type": "text", "text": system,
             "cache_control": {"type": "ephemeral"}}      # ← cache boundary #1
        ],
        "tools": [t.to_schema() for t in tools],          # tools also cached
        "messages": [
            *history,                                     # everything up to N-1
            # cache boundary #2 — insert marker on the LAST history message
            {"role": "user", "content": new_input}        # fresh, prefilled fully
        ]
    }

def agent_step(state, user_msg):
    state.history.append({"role": "user", "content": user_msg})
    while True:
        prompt = build_prompt(SYSTEM, state.mcp_tools,
                              state.history[:-1],
                              state.history[-1]["content"])   # pass content, not the message dict
        resp = anthropic.messages.create(**prompt, model="claude-…",
                                         max_tokens=4096)     # max_tokens is required by the API
        state.history.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason != "tool_use":
            return resp                                       # done

        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        tool_results = run_tools_in_parallel(tool_uses, state.mcp_client)
        state.history.append({"role": "user", "content": tool_results})

Two practical notes:

  • The cache_control marker tells the inference server "everything up to and including this content block is cacheable." On Anthropic, you can place up to 4 markers — typically one after the system prompt, one after tools, and one on the last assistant turn before each new tool result (sketched just below).
  • Most modern providers (OpenAI, Anthropic, vLLM, SGLang) auto-detect prefix matches even without explicit markers. Markers mainly let you control eviction and pin hot prefixes.
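
For the first note, a sketch of what "marker on the last history message" looks like in code, assuming dict-shaped messages whose content is a list of blocks (SDK response objects would need converting to plain dicts first):

def mark_history_boundary(history):
    # Cache boundary #2: mark the final content block of the last prior
    # message with cache_control, so everything up to turn N-1 is cacheable.
    if not history:
        return history
    *head, last = history
    blocks = last["content"]
    if isinstance(blocks, str):                       # normalize plain-string content
        blocks = [{"type": "text", "text": blocks}]
    marked = [*blocks[:-1],
              {**blocks[-1], "cache_control": {"type": "ephemeral"}}]
    return [*head, {**last, "content": marked}]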

MCP client side

class McpToolRegistry:
    def __init__(self, servers):
        self.servers = servers
        self.tools = []
        self.tools_hash = None

    async def refresh(self):
        all_tools = []
        # deterministic (server_name, tool_name) order keeps the prompt prefix byte-stable
        for srv in sorted(self.servers, key=lambda s: s.name):
            tools = await srv.list_tools()
            for t in sorted(tools, key=lambda t: t.name):
                all_tools.append(McpTool(server=srv.name, name=t.name,
                                         description=t.description,
                                         input_schema=t.inputSchema))
        new_hash = hash_tools(all_tools)
        if self.tools_hash is not None and new_hash != self.tools_hash:
            log.info("tool set changed — prefix cache will miss once")
        self.tools_hash = new_hash
        self.tools = all_tools
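
Wiring the registry into the loop is then mostly a matter of refreshing once at startup (and again on a tools/list_changed notification) and feeding the same ordered list to build_prompt on every request. The names below are illustrative.

async def start_agent(servers, state):
    registry = McpToolRegistry(servers)
    await registry.refresh()             # a tools_hash change here means one expected full prefill
    state.mcp_tools = registry.tools     # identical ordered list every request keeps the prefix stable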

12. Pitfalls & Anti-Patterns

  • Per-request UUIDs in system prompt. Kills sharing. Move to a separate message.
  • "Current time is …" preamble. Same problem. Either drop it or pass as user-turn metadata.
  • Reordering tool definitions. A non-stable sort (e.g. dict ordering in older Python) silently invalidates prefixes.
  • Inline-summarizing old history. Rewrites the prefix. Either accept the cache miss as a one-time cost when summarizing, or summarize after the cache boundary.
  • Mutating tool descriptions per user. Personalizing tool docstrings prevents cross-session sharing — keep tool defs identical and pass user context separately.
  • Forgetting the cache TTL. Most provider caches have a 5-minute TTL. An idle agent will pay full prefill on its next call. Pin or refresh hot prefixes if traffic is bursty.
  • Mixing model versions. KV blocks are model-specific. Switching between Sonnet and Opus mid-session means a fresh prefill each switch.

13. Vendor Implementations

System        | Mechanism                           | API surface
--------------|-------------------------------------|------------------------------------------------------
Anthropic API | Explicit prefix cache, paged blocks | cache_control: {type: "ephemeral"} on content blocks
OpenAI API    | Automatic prefix caching            | None — happens on prompts ≥1024 tokens
vLLM          | PagedAttention + prefix cache       | --enable-prefix-caching flag
SGLang        | RadixAttention                      | Default; tree of prefixes
TensorRT-LLM  | KV cache reuse                      | enable_kv_cache_reuse in builder
llama.cpp     | Per-session KV cache                | --prompt-cache
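
For self-hosted serving the switch is usually a single engine argument. A sketch with vLLM's offline API (model name and prompts are illustrative, and recent vLLM versions may enable prefix caching by default):

from vllm import LLM, SamplingParams

SYSTEM_AND_TOOLS = "You are an agent…\n<several thousand tokens of tool definitions>"  # shared stable prefix

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
prompts = [SYSTEM_AND_TOOLS + "\n\nUser: list my open PRs",
           SYSTEM_AND_TOOLS + "\n\nUser: search the docs for X"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
# The second prompt's shared prefix reuses KV blocks built while prefilling the first.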

14. Summary

  • Agentic workflows are loops of LLM calls — most of each prompt is identical to the last call. Prefix-cached KV is the dominant inference-time optimization.
  • MCP fan-out concentrates a large, stable block of tool definitions in the prompt prefix. That block is the highest-value caching target.
  • Order prompt content from most stable → most volatile, and place cache boundaries at the natural seams: after the system prompt, after the tool list, after each completed turn.
  • Within a session, KV blocks accumulate. Across sessions, blocks for the system prompt and tool list are physically shared via paged or radix-tree KV caches.
  • For parallel tool calls and N-best sampling, copy-on-write of KV blocks lets branches share their parent prefix until they diverge.
  • Anti-patterns: per-request UUIDs/timestamps in the prefix, non-deterministic tool ordering, in-place history summarization, personalized tool descriptions.
Bottom line: Designing an agent for cache reuse is mostly about discipline in prompt construction — keep the prefix byte-stable, keep dynamic data below the cache line, and let the inference server do the rest. The savings (5–10× faster, ~90% cheaper input) compound with every turn.