
Prompt & KV Caching Research

Summaries and synthesis of four arxiv papers that shape how production LLM systems reuse prompt prefixes, precomputed attention states, and persistent KV caches. The papers span the full stack: a foundational reuse mechanism (Prompt Cache), a RAG-specific application (TurboRAG), an edge multi-agent persistence system (Persistent Q4 KV Cache), and an empirical study of cache hygiene in agent loops (Don't Break the Cache).

Compiled: April 2026 · Papers: 4 (2023-2026) · Audience: LLM platform & agent runtime engineers

1. Overview

Prompt and KV caching has matured from a single optimization (cache the system prompt's prefill) into a layered discipline that spans application code, inference engines, model runtimes, and provider APIs. The four papers below each tackle a different layer of that stack:

Prompt Cache (2023) Foundational mechanism: a schema language plus position-ID assignment that makes precomputed attention states reusable across requests, even when modules appear in different combinations.
TurboRAG (2024) Applies precomputed KV caches to RAG. Each retrieved chunk's KV state is computed once offline; an "Independent Attention" mask plus reordered RoPE positions stitches them at inference.
Persistent Q4 KV Cache (2026) Persists each agent's KV cache to disk in 4-bit quantized form so that multi-agent workflows on memory-constrained edge devices avoid full re-prefill on every context switch.
Don't Break the Cache (2026) An empirical study of how agent prompt construction patterns interact with provider-side caching. Quantifies how naive full-context caching can paradoxically slow agents down.

Together they describe a progression: how reuse can be made mathematically safe (Prompt Cache), how to apply it to a workload that historically resisted caching (TurboRAG), how to make caches survive across processes and agents (Persistent KV), and how application-level prompt design either unlocks or sabotages cache benefits (Don't Break the Cache).

2. At-a-glance

Paper | Where it lives | What it caches | Headline result | Year
Prompt Cache | Inference engine | Per-module attention KV states (schema-defined) | Up to ~60x TTFT on GPU; ~8x on CPU | 2023
TurboRAG | RAG application + engine | Per-chunk KV caches for retrieved documents | 8.6x average, 9.4x peak TTFT on LongBench | 2024
Persistent Q4 KV Cache | Edge runtime | 4-bit quantized per-agent KV caches on disk | Up to 136x TTFT vs cold prefill; 4x more agent contexts in fixed RAM | 2026
Don't Break the Cache | Agent application code | Provider prompt-cache hit rates | 41-80% cost reduction; 13-31% TTFT improvement when used correctly | 2026

Where each paper sits in the stack

[Diagram: caching surfaces in a modern LLM stack. Agent application (prompt construction, message ordering, tool result handling): Don't Break the Cache. Provider API (automatic / explicit prefix cache: Anthropic, OpenAI, Gemini). Retrieval / RAG layer (per-chunk KV precomputation, independent attention): TurboRAG. Inference engine (modular attention reuse, position-ID handling, batch scheduling): Prompt Cache. Runtime & storage (SSD persistence, quantization, cross-process sharing): Persistent Q4 KV Cache.]
Each paper targets a different layer. The benefits compose — none of them substitute for the others.

3. Prompt Cache: Modular Attention Reuse for Low-Latency Inference

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Yale University (with Google collaborators) · arxiv:2311.04934 · MLSys 2024

Problem

Transformer LLMs recompute self-attention K/V states for every input token on every request, even when prompts repeatedly share large reusable chunks: system instructions, few-shot examples, retrieved documents, templates. Time-to-first-token is dominated by this redundant prefill, which scales with prompt length.

Approach

Prompt Cache precomputes and reuses self-attention KV states for recurring text segments across requests. Reusable segments are declared in a Prompt Markup Language (PML) schema, which names "modules" (a system prompt, a document, a few-shot block) and assigns each one deterministic position-ID ranges within a parent template. At inference time the runtime concatenates the precomputed KV tensors for cached modules with freshly computed KV for the new spans, then runs the decoder. Because position IDs are baked into the schema, a module's stored attention states stay positionally consistent regardless of which other modules surround it.

[Diagram: PML schema -> precomputed modules -> assembly at inference. The schema declares modules with fixed position ranges, e.g. <system> pos[0..512] (policy + role), <tools> pos[512..1536] (tool schemas), <doc id="A"> pos[1536..3584] (retrieved chunk), <user> pos[3584..] (dynamic input). Each cacheable module's K/V (K_sys/V_sys, K_tools/V_tools, K_docA/V_docA) is precomputed once; at inference a request reuses whichever cached modules it references (no prefill cost) plus a small live prefill for the user query, and the decoder reads the merged KV with positions still valid: ~60x TTFT on GPU vs full prefill.]
A schema names the modules; each module's KV is computed once and reused across any request that references it.
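A minimal sketch of that assembly step, assuming a toy single-layer runtime where each module's K/V tensors were precomputed at the schema-assigned positions. The module names, tensor shapes, and the prefill_live helper are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

# Hypothetical precomputed module cache: name -> (positions, K, V), where the
# position range was fixed by the PML schema when the module was prefilled.
# Shapes: K, V are [num_tokens, num_heads, head_dim] for a single layer.
MODULE_KV = {
    "system": (np.arange(0, 512),     np.zeros((512, 8, 64)),  np.zeros((512, 8, 64))),
    "tools":  (np.arange(512, 1536),  np.zeros((1024, 8, 64)), np.zeros((1024, 8, 64))),
    "doc_A":  (np.arange(1536, 3584), np.zeros((2048, 8, 64)), np.zeros((2048, 8, 64))),
}

def prefill_live(tokens, start_pos):
    """Stand-in for running the model's prefill over only the new span."""
    n = len(tokens)
    return np.arange(start_pos, start_pos + n), np.zeros((n, 8, 64)), np.zeros((n, 8, 64))

def assemble_request_kv(module_names, user_tokens, user_start_pos):
    """Concatenate cached module KV with freshly computed KV for the user span.

    Because each module keeps the position IDs it was prefilled with, its stored
    attention states stay valid no matter which other modules appear alongside it."""
    pos_parts, k_parts, v_parts = [], [], []
    for name in module_names:                       # reused modules: no prefill cost
        pos, k, v = MODULE_KV[name]
        pos_parts.append(pos); k_parts.append(k); v_parts.append(v)
    pos, k, v = prefill_live(user_tokens, user_start_pos)  # only the live span is prefilled
    pos_parts.append(pos); k_parts.append(k); v_parts.append(v)
    return np.concatenate(pos_parts), np.concatenate(k_parts), np.concatenate(v_parts)

# Request that skips the <tools> module: positions 512..1536 are simply absent.
# The paper's design tolerates such gaps, which is what makes modules composable.
positions, K, V = assemble_request_kv(["system", "doc_A"],
                                      user_tokens=[101, 102, 103], user_start_pos=3584)
print(K.shape)  # (2563, 8, 64): 512 + 2048 cached tokens + 3 live tokens
```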

Key innovations

  • PML schema declares reusable text modules and their structural relationships, making cache-eligible regions explicit and composable.
  • Modular attention (KV) state reuse across requests, generalizing prior single-prefix KV caching to arbitrary reusable substructures.
  • Schema-assigned position IDs so a module's precomputed attention remains valid in different module orderings.
  • On-the-fly assembly of a request's KV cache by concatenating module-level precomputed states with newly computed states for user-specific spans.
  • Cross-module attention reconstruction at decode time so only new tokens need full prefill.

Results

Evaluated on Llama 2 (7B), CodeLlama, MPT, and Falcon. Reported TTFT improvements of roughly ~8x on CPU and up to ~60x on GPU for long, modular prompts (LongBench-style document QA), with negligible accuracy loss on benchmarks such as LongBench and HumanEval. The output is essentially equivalent to standard inference because only prefill computation is reused, not approximated.

Limitations

  • Requires authoring a PML schema; benefits depend on prompts genuinely sharing modular structure.
  • Cached KV tensors are model-, layer-, and precision-specific and consume significant memory; ad-hoc, highly variable prompts see little benefit.
Why it matters: Prompt Cache is the conceptual foundation that the later papers build on. The idea that attention states can be reused as long as positional consistency is preserved is what makes per-chunk RAG caching (TurboRAG) and persistent per-agent caches (Persistent Q4) mathematically defensible.

4. TurboRAG: Accelerating RAG with Precomputed KV Caches for Chunked Text

Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang
Moore Threads AI · arxiv:2410.07590 · October 2024 · EMNLP 2025

Problem

Conventional RAG concatenates many retrieved document chunks and recomputes their KV caches online during prefill. This is quadratic in input length, produces large TTFT, wastes compute on text that does not change between requests, and limits per-device batch size.

Approach

TurboRAG splits prefill into an offline phase (per-chunk KV caches are precomputed once with each document and stored alongside its embedding) and an online phase (the retrieved documents' KV caches are fetched and concatenated to form the context KV cache; prefill runs only over the user query). Two adjustments make the result mathematically equivalent to standard prefill:

  • Independent Attention mask zeros cross-chunk attention, motivated by the empirical finding that inter-document attention is exceedingly sparse in RAG.
  • Reordered Positions re-index per-chunk RoPE position IDs into a contiguous sequence [0..l, l+1..2l, ...], exploiting the fact that RoPE depends only on relative offsets, so K and V can be saved without baked-in absolute positions.

A pretrained Qwen2-7B is then SFT-fine-tuned (50% doc-QA + 50% general tasks) under this mask and position regime so accuracy is preserved.
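A minimal sketch of the two adjustments, assuming each chunk's KV was stored with positions starting at 0. The chunk lengths, mask convention, and helper names are illustrative rather than the paper's code:

```python
import numpy as np

def reordered_positions(chunk_lengths, query_length):
    """Re-index per-chunk RoPE positions into one contiguous sequence.

    Each chunk was prefilled with positions [0..len); because RoPE depends only on
    relative offsets, every chunk can be shifted to its slot in the final sequence."""
    positions, offset = [], 0
    for length in chunk_lengths:
        positions.append(np.arange(length) + offset)
        offset += length
    positions.append(np.arange(query_length) + offset)   # query goes at the tail
    return np.concatenate(positions)

def independent_attention_mask(chunk_lengths, query_length):
    """Build the Independent Attention mask.

    Tokens inside a chunk attend causally within that chunk only (cross-chunk
    attention is zeroed); query tokens attend causally over everything."""
    total = sum(chunk_lengths) + query_length
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in chunk_lengths:                          # block-diagonal causal blocks
        mask[start:start + length, start:start + length] = np.tril(np.ones((length, length), dtype=bool))
        start += length
    q0 = sum(chunk_lengths)
    for i in range(query_length):                         # query sees all chunks + itself
        mask[q0 + i, : q0 + i + 1] = True
    return mask

print(independent_attention_mask([4, 3], query_length=2).astype(int))
print(reordered_positions([4, 3], query_length=2))        # [0 1 2 3 4 5 6 7 8]
```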

[Diagram: offline indexing (top) vs online inference (bottom). Offline: the document corpus is chunked, each chunk runs LLM prefill once to produce K/V tensors stored RoPE-relative (no absolute positions) in a per-chunk KV store, with the vector index holding the embedding plus a KV pointer. Online, per query: retrieve top-k chunks and fetch their KV blobs, stitch the context KV under the Independent Attention mask with reordered RoPE positions, prefill only the query (~98% of prefill TFLOPs avoided), then decode (8.6x TTFT). Cross-chunk attention is masked out; the paper shows it is empirically near-zero in RAG.]
Indexing pays the prefill cost once per chunk; queries only prefill their own tokens.

Results

9.4x peak TTFT speedup
8.6x average TTFT speedup
98.46% TFLOPs reduced
~4x end-to-end speedup with PCIe transfer

On LongBench multi-doc QA, TTFT drops from 1165 ms to 134 ms. On the RGB benchmark across noise ratios 0.2-0.8, TurboRAG-reordered averages 95.7 (Chinese) and 96.8 (English) versus Naive RAG's 95.3 / 98.2, within about 1.5 points. OpenCompass regression checks: MMLU 70.73 vs 69.57; GSM-8K 79.45 vs 79.12. No architecture or inference-engine changes are required.

"TurboRAG reduces TTFT by up to 9.4x compared to the conventional RAG systems (on an average of 8.6x), but reserving comparable performance to the standard RAG systems." - Abstract
"cross-attention among different documents is exceedingly sparse in RAG models and the text contents between most documents are actually independent." - Section 1 (Introduction)

Limitations

  • Storing per-chunk KV caches and shipping them CPU->GPU incurs storage and PCIe bandwidth costs (the paper explicitly measures the host-to-device penalty).
  • Requires fine-tuning the LLM to recover accuracy under independent attention; without fine-tuning accuracy can drop ~20% at high noise ratios.
  • Cached KV tensors are model-specific and invalidated on model upgrades.

5. Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent Inference

Yakov Pyotr Shkolnikov
Independent · arxiv:2603.04428 · February 2026
Code: github.com/yshk-mxim/agent-memory
Open paper →

Problem

Multi-agent LLM workflows (CrewAI, AutoGen, debate architectures) typically run 5-20 agents per task. Each agent needs its own KV cache because concatenating histories causes "lost-in-the-middle" position bias. On an edge device with 24 GB unified memory (~10.2 GB cache budget on an M4 Pro), only 3 agents fit at 8K context in FP16, forcing constant evict-and-reload cycles. Each eviction triggers a full O(n) re-prefill - 15.7 s per agent at 4K, 78.5 s of dead time after a server restart with 5 agents.

Approach

Each agent's KV cache is persisted to SSD in 4-bit quantized safetensors and reloaded directly into the attention layer, bypassing prefill entirely. Three components:

  1. Block pool - per-agent isolated Q4 caches that survive server restarts.
  2. BatchQuantizedKVCache with a ConcurrentScheduler that interleaves prefill chunks (default 512 tokens) and decode steps across agents in a single Metal kernel dispatch (Orca-style iteration-level scheduling adapted for quantized caches).
  3. Cross-phase context injection using character-level prefix matching (EXACT / EXTEND / DIVERGE, sketched below) - rather than token-ID matching like vLLM or SGLang - so the cache accumulates monotonically across conversation phases even when BPE retokenization shifts boundaries.
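A minimal sketch of that prefix classification, assuming the runtime tracks the raw character string each agent's cache was built from (the function name and return convention are hypothetical, not the paper's API):

```python
def classify_prefix(cached_text: str, new_text: str) -> tuple[str, int]:
    """Compare the text a cache was built from against the incoming context.

    Matching on characters rather than token IDs means a BPE retokenization that
    shifts token boundaries (but not the underlying text) still counts as a hit.

    Returns (verdict, reusable_chars):
      EXACT   - new context is identical; reuse the whole cache, no prefill
      EXTEND  - new context appends to the cached text; prefill only the suffix
      DIVERGE - contexts differ within the cached span; reuse the longest common prefix
    """
    if new_text == cached_text:
        return "EXACT", len(cached_text)
    if new_text.startswith(cached_text):
        return "EXTEND", len(cached_text)
    common = 0
    for a, b in zip(cached_text, new_text):   # longest common character prefix
        if a != b:
            break
        common += 1
    return "DIVERGE", common

print(classify_prefix("system: be terse\nuser: hi", "system: be terse\nuser: hi"))  # ('EXACT', 25)
print(classify_prefix("system: be terse\n", "system: be terse\nuser: hi"))          # ('EXTEND', 17)
print(classify_prefix("system: be terse\n", "system: be brief\nuser: hi"))          # ('DIVERGE', 11)
```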

The ~500 ms reload latency is hidden behind the previous agent's decode phase, since multi-agent workflows naturally interleave one agent generating while the next loads.
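A minimal sketch of the persist-and-reload path, assuming a simple symmetric 4-bit group quantization over numpy tensors. The file layout, group size, and helper names are illustrative; the paper's implementation targets Metal with a richer per-agent block-pool format:

```python
import numpy as np
from safetensors.numpy import save_file, load_file

def quantize_q4(x: np.ndarray, group: int = 32):
    """Symmetric 4-bit quantization with a per-group scale over flattened values."""
    flat = x.reshape(-1, group).astype(np.float32)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_q4(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float32) * scale).reshape(shape).astype(np.float16)

def persist_agent_cache(path: str, layer_kv: dict[str, np.ndarray]) -> None:
    """Write one agent's per-layer K/V tensors to disk as quantized safetensors."""
    tensors = {}
    for name, t in layer_kv.items():                  # e.g. "layer0.k", "layer0.v"
        q, scale = quantize_q4(t)
        tensors[f"{name}.q4"] = q
        tensors[f"{name}.scale"] = scale
        tensors[f"{name}.shape"] = np.array(t.shape, dtype=np.int64)
    save_file(tensors, path)

def load_agent_cache(path: str) -> dict[str, np.ndarray]:
    """Reload a persisted cache straight into attention-ready tensors (no prefill)."""
    raw = load_file(path)
    names = {k.rsplit(".", 1)[0] for k in raw if k.endswith(".q4")}
    return {n: dequantize_q4(raw[f"{n}.q4"], raw[f"{n}.scale"], tuple(raw[f"{n}.shape"]))
            for n in names}

# One layer of a toy agent cache: [tokens, kv_heads, head_dim]
kv = {"layer0.k": np.random.randn(4096, 8, 64).astype(np.float16),
      "layer0.v": np.random.randn(4096, 8, 64).astype(np.float16)}
persist_agent_cache("cache_A.safetensors", kv)        # survives a server restart
restored = load_agent_cache("cache_A.safetensors")    # warm reload replaces re-prefill
print(restored["layer0.k"].shape)                     # (4096, 8, 64)
```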

[Diagram: persistent KV cache + interleaved agent scheduling. Agents (planner, researcher, critic; 5-20 per workflow) each own an isolated Q4 cache file (cache_A.safetensors, cache_B.safetensors, cache_C.safetensors) in a per-agent block pool on SSD, durable across restarts with ~500 ms reload. The ConcurrentScheduler interleaves decode and reload across agents Orca-style, so each ~500 ms reload hides behind the previous agent's decode: 136x peak TTFT speedup, 4x more agents in the same RAM, 15.7 s cold prefill -> 577 ms warm reload at 4K. Cross-phase context injection uses character-level prefix matching (EXACT / EXTEND / DIVERGE) and survives BPE retokenization; vLLM and SGLang match token IDs, so this scheme is more robust to phase boundaries.]
Each agent owns a quantized cache on disk; the scheduler hides reload latency behind another agent's decode.

Results

136x peak TTFT speedup (Gemma)
577 ms warm-disk TTFT (was 15.7 s cold)
4x more agent contexts in fixed RAM
±3% perplexity impact

Speedups hold across Gemma (22-136x at 4K-32K contexts), DeepSeek (11-76x), and Llama (24-111x). Persistence alone contributes a 27x TTFT reduction at 4K - the largest single component in the ablation. The design is validated across three architectures (dense GQA, MoE MLA, hybrid attention) via a model-agnostic ModelCacheSpec, and the server exposes an OpenAI-compatible API.

"We eliminate re-prefill by persisting each agent's KV cache to disk." - Abstract
"Virtual memory for attention state: agents see unbounded context." - Section 1

Limitations

  • Single device only - no cache transfer over Thunderbolt or network.
  • KV caches are model-specific and invalidated by model updates (RAG text chunks survive model swaps; caches do not).
  • Tested only at 8B-16B parameters; no 70B+ validation.
  • Speedup measured at fixed 64-token output; the relative win shrinks as outputs lengthen.

6. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah
arxiv:2601.06007 · January 2026
Open paper →

Problem

LLM agents now run multi-turn tasks spanning dozens of tool calls and very large context windows. Provider-side prompt caching exists across OpenAI, Anthropic, and Google, but the paper observes that "the benefits of prompt caching for these agentic workloads remain underexplored in the research literature" - and that naive use can actively hurt performance.

Approach

Empirical study rather than a new system. The authors run 500 agent sessions with 10,000-token system prompts across three providers and four production models. They compare three caching strategies and measure both API cost and TTFT, with an ablation across prompt sizes (500-50,000 tokens) and tool-call counts (3-50).

[Diagram: cacheable prefix vs dynamic tail across the three strategies (cached/stable vs dynamic vs cache-invalidated blocks). 1. Full context caching: everything is cached, including tool results scattered through the prompt; on call N+1 only the system block hits and everything after the first dynamic block is re-prefilled. 2. System prompt only: safe but a small cache window; all turns and tool results are re-prefilled on every call. 3. Exclude dynamic tool results (recommended): dynamic content lives strictly at the tail, so only the appended tool results and user message re-prefill. Rule: never put dynamic bytes in front of cacheable bytes; move them to the tail and the prefix stays valid call after call. Strategy 3 yields the paper's 41-80% cost reduction and 13-31% TTFT improvement.]
A single dynamic block in the middle of the prompt invalidates everything after it. Moving dynamic content to the tail preserves the cached prefix.
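A provider-agnostic sketch of strategy 3, assuming a chat-completions-style message list where the provider's prefix cache keys on the leading bytes of the request. The builder and message shapes are illustrative, not any specific vendor SDK:

```python
def build_messages(system_prompt: str,
                   tool_schemas: list[str],
                   history: list[dict],
                   tool_results: list[str],
                   user_msg: str) -> list[dict]:
    """Keep everything stable at the front; append everything dynamic at the tail.

    The cacheable prefix (system prompt, tool schemas in a fixed order, prior turns)
    stays byte-identical across calls, so the provider's prefix cache keeps hitting.
    Tool results and the new user message only ever extend the end of the request."""
    messages = [{"role": "system",
                 "content": system_prompt + "\n\n" + "\n".join(sorted(tool_schemas))}]
    messages += history                                  # stable prefix: grows append-only
    for result in tool_results:                          # dynamic content stays at the tail
        messages.append({"role": "tool", "content": result})
    messages.append({"role": "user", "content": user_msg})
    return messages

# Anti-pattern (strategy 1): splicing a fresh tool result into the middle of the
# system prompt changes the leading bytes and invalidates everything after it.
```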

Strategies compared

Full context caching: Cache the entire prompt including dynamic tool results. Maximum theoretical hit area, but high cache invalidation rate.
System prompt only: Cache only the static system prefix. Lower hit area but very stable invalidation profile.
Exclude dynamic tool results: Cache the system prompt and stable conversation prefix; place dynamic content (tool outputs) at the end so the cacheable prefix never mutates.
Cross-provider scope: OpenAI, Anthropic, Google evaluated head-to-head with their respective cache mechanisms.

Results

41-80% API cost reduction
13-31% TTFT improvement
500 agent sessions evaluated
3 providers, 4 models

All four models tested showed statistically significant cost reductions when prompt caching was enabled. The more interesting finding is qualitative: strategic block placement beats naive caching.

"Strategic prompt cache block control... provides more consistent benefits than naive full-context caching, which can paradoxically increase latency." - Abstract

Concrete recommendations from the paper

  • Place dynamic content at the end of the system prompt, not interleaved.
  • Avoid traditional dynamic function-calling patterns that mutate the front of the message stream.
  • Exclude dynamic tool results from the cached prefix; let them live as appended messages instead.
  • Treat full-context caching as the suspicious option - it can increase latency through churn.

Limitations

  • Provider-side caching only; the study does not implement its own cache layer.
  • Findings depend on each provider's current cache TTL and pricing, which evolve.
  • Five hundred sessions is enough for headline numbers but smaller than a production-traffic study.

7. Cross-cutting themes

Position handling is the universal hard part

Every paper has a paragraph on positions. Prompt Cache assigns position IDs through the PML schema. TurboRAG reorders RoPE positions per chunk and exploits relative-offset invariance. The Persistent Q4 paper uses character-level prefix matching to survive BPE retokenization. "Don't Break the Cache" boils down to do not change the bytes that precede a cached region, which is the application-layer expression of the same constraint. If you want to cache attention state, the position scheme has to be your first design decision.
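The property the RoPE-based schemes lean on can be stated in one line: the attention score between a query at absolute position m and a key at position n depends only on the offset n - m, so a cached key can be re-seated at a new absolute position as long as every relative offset it will be compared against is preserved. This is the standard rotary-embedding identity, not a result specific to these papers:

```latex
% RoPE rotates query/key vectors by an angle proportional to their absolute position;
% rotations compose, so the attention score only sees the relative offset:
q_m^\top k_n
  = \bigl(R_\Theta(m)\,W_q x_m\bigr)^\top \bigl(R_\Theta(n)\,W_k x_n\bigr)
  = (W_q x_m)^\top R_\Theta(n-m)\,(W_k x_n)
% TurboRAG's reordered positions and Prompt Cache's schema-assigned ranges both
% amount to keeping these offsets consistent when cached KV is reassembled.
```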

Independence assumptions unlock the largest wins

TurboRAG's independent-attention mask is the most aggressive case - it explicitly drops cross-chunk attention - and it earns the largest TFLOP reduction (~98.46%). Prompt Cache makes a milder version of the same bet by treating modules as composable units. The further you push toward "this segment can be analyzed in isolation," the more compute you get to skip. The cost is that you have to verify the assumption empirically, and re-verify when models or workloads change.

Quality verification is non-negotiable

TurboRAG runs RGB and OpenCompass to show no regression; Persistent Q4 reports perplexity deltas across three models; "Don't Break the Cache" measures cost and latency simultaneously to detect cases where caching was actively harmful. Reuse mechanisms create new opportunities for silent quality drift, so quality benchmarks must run alongside cost dashboards.

The bottleneck moves, it does not disappear

TurboRAG eliminates online prefill but introduces a CPU->GPU PCIe transfer; the paper measures it explicitly and the end-to-end speedup is closer to 4x than the headline 8.6x. Persistent Q4 turns prefill into a disk read and exploits agent interleaving to hide the ~500 ms reload. Once you cache attention state, the next bottleneck is moving that state to where the math happens. Plan for it.
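A quick back-of-envelope shows why the data movement is non-trivial, using Llama-2-7B's public dimensions (32 layers, 32 KV heads, head dim 128, no GQA). The script below is plain arithmetic, not tied to any of the papers' codebases; the Q4 overhead figure is an assumption based on group-of-32 fp16 scales:

```python
# KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
layers, kv_heads, head_dim = 32, 32, 128          # Llama-2-7B dimensions
for name, bytes_per_value in [("FP16", 2), ("Q4 (+scales, approx.)", 0.5625)]:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    ctx_4k = per_token * 4096 / 2**30              # GiB for a 4K-token context
    print(f"{name:>22}: {per_token / 2**20:5.2f} MiB/token, {ctx_4k:5.2f} GiB @ 4K tokens")
    # FP16: ~0.50 MiB/token -> ~2 GiB at 4K; shipping that over PCIe or reading it
    # from SSD is the new cost once the prefill compute itself has been cached away.
```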

Application-level discipline beats raw infrastructure

"Don't Break the Cache" is the most pragmatic of the four because it does not propose a new mechanism - it measures what happens when the mechanism is already there and the application uses it badly. The 41-80% cost reduction it reports comes entirely from prompt construction discipline. If the agent layer is sloppy, none of the deeper systems work matters.

8. Practitioner takeaways

  • Start at the application layer. Audit how your agent constructs prompts. Is dynamic content at the end? Are tool schemas in stable order? Are you mutating the system prompt mid-session? You can capture half the benefit of all four papers without changing inference infrastructure.
  • Characterize your reuse profile before building infrastructure. Prompt Cache's PML schema is overkill if your prompts share a single static prefix; provider prefix caching is enough. TurboRAG-style per-chunk caching only pays off if retrieved chunks are themselves repeated across queries.
  • Treat position handling as a first-class design decision. Every reuse strategy lives or dies on whether positions stay valid when modules are reassembled. Pick one mechanism (schema-assigned IDs, relative-offset RoPE, character-prefix matching) and apply it consistently.
  • Pair every cache rollout with a quality benchmark. Cost and latency dashboards lie about quality. Run a golden-task replay set on every cache config change and alert on quality drift even when cost improves.
  • Plan for the next bottleneck. Removing prefill exposes data movement. Budget for storage IO, PCIe / Thunderbolt, and (in distributed settings) the network cost of shipping KV tensors.
  • Cache invalidation discipline matters more than cache size. The Persistent Q4 paper notes that KV caches must be invalidated on model upgrades; "Don't Break the Cache" notes that subtle prompt edits invalidate provider caches silently. Tag every cached artifact with its model version, schema version, and source commit (see the sketch below).
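A minimal sketch of such tagging; the field names are illustrative, the point being that any field change must map to a different artifact so stale KV can never be reloaded:

```python
from dataclasses import dataclass, asdict
import hashlib, json

@dataclass(frozen=True)
class CacheKey:
    """Everything that, if it changes, must invalidate a stored KV artifact."""
    model_id: str          # e.g. "llama-2-7b@rev-abc123"
    precision: str         # "fp16", "q4", ...
    position_scheme: str   # "schema-assigned", "rope-relative", ...
    schema_version: str    # PML / prompt-template version
    source_sha256: str     # hash of the exact text the cache was prefilled from

def cache_filename(key: CacheKey) -> str:
    digest = hashlib.sha256(json.dumps(asdict(key), sort_keys=True).encode()).hexdigest()[:16]
    return f"kv-{digest}.safetensors"

key = CacheKey(model_id="llama-2-7b@rev-abc123", precision="q4",
               position_scheme="rope-relative", schema_version="pml-3",
               source_sha256=hashlib.sha256(b"...system prompt text...").hexdigest())
print(cache_filename(key))   # any field change -> different file -> no silent stale reuse
```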

9. References

  1. Gim, In; Chen, Guojun; Lee, Seung-seob; Sarda, Nikhil; Khandelwal, Anurag; Zhong, Lin. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. MLSys 2024. arxiv:2311.04934.
  2. Lu, Songshuo; Wang, Hua; Rong, Yutian; Chen, Zhi; Tang, Yaohua. TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. EMNLP 2025. arxiv:2410.07590.
  3. Shkolnikov, Yakov Pyotr. Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices. arxiv:2603.04428 (February 2026). Code: github.com/yshk-mxim/agent-memory.
  4. Lumer, Elias; Nizar, Faheem; Jangiti, Akshaya; Frank, Kevin; Gulati, Anmol; Phadate, Mandar; Subbiah, Vamse Kumar. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. arxiv:2601.06007 (January 2026).

Related work referenced in the synthesis

  • Gao, Bin; et al. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (AttentionStore). arxiv:2403.19708. A complementary system that persists per-session KV caches across multi-turn conversations.
Note: Section references for the verbatim quotes are approximate; cite the canonical PDF on arXiv when reusing them in formal documents.