1. Overview
Prompt and KV caching have matured from a single optimization (cache the system prompt's prefill) into a layered discipline that spans application code, inference engines, model runtimes, and provider APIs. The four papers below each tackle a different layer of that stack.
Together they describe a progression: how reuse can be made mathematically safe (Prompt Cache), how to apply it to a workload that historically resisted caching (TurboRAG), how to make caches survive across processes and agents (Persistent KV), and how application-level prompt design either unlocks or sabotages cache benefits (Don't Break the Cache).
2. At-a-glance
| Paper | Where it lives | What it caches | Headline result | Year |
|---|---|---|---|---|
| Prompt Cache | Inference engine | Per-module attention KV states (schema-defined) | Up to ~60x TTFT on GPU; ~8x on CPU | 2023 |
| TurboRAG | RAG application + engine | Per-chunk KV caches for retrieved documents | 8.6x average, 9.4x peak TTFT on LongBench | 2024 |
| Persistent Q4 KV Cache | Edge runtime | 4-bit quantized per-agent KV caches on disk | Up to 136x TTFT vs cold prefill; 4x more agent contexts in fixed RAM | 2026 |
| Don't Break the Cache | Agent application code | Provider prompt-cache hit rates | 41-80% cost reduction; 13-31% TTFT improvement when used correctly | 2026 |
Where each paper sits in the stack
3. Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Yale University (with Google collaborators) · arxiv:2311.04934 · MLSys 2024
Problem
Transformer LLMs recompute self-attention K/V states for every input token on every request, even when prompts repeatedly share large reusable chunks: system instructions, few-shot examples, retrieved documents, templates. Time-to-first-token is dominated by this redundant prefill, which scales with prompt length.
Approach
Prompt Cache precomputes and reuses self-attention KV states for recurring text segments across requests. Reusable segments are declared in a Prompt Markup Language (PML) schema, which names "modules" (a system prompt, a document, a few-shot block) and assigns each one deterministic position-ID ranges within a parent template. At inference time the runtime concatenates the precomputed KV tensors for cached modules with freshly computed KV for the new spans, then runs the decoder. Because position IDs are baked into the schema, a module's stored attention states stay positionally consistent regardless of which other modules surround it.
Key innovations
- PML schema declares reusable text modules and their structural relationships, making cache-eligible regions explicit and composable.
- Modular attention (KV) state reuse across requests, generalizing prior single-prefix KV caching to arbitrary reusable substructures.
- Schema-assigned position IDs so a module's precomputed attention remains valid in different module orderings.
- On-the-fly assembly of a request's KV cache by concatenating module-level precomputed states with newly computed states for user-specific spans.
- Cross-module attention reconstruction at decode time so only new tokens need full prefill.
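The assembly step can be sketched in a few lines. This is a toy illustration of the idea, not the paper's implementation: `prefill_kv` stands in for a real transformer prefill, and the module names and token ids are invented. The point it demonstrates is that schema-assigned position ranges make a module's cached KV rows bit-identical across requests, no matter which sibling modules appear.

```python
import numpy as np

D = 4  # toy head dimension

def prefill_kv(tokens, positions):
    """Stand-in for transformer prefill: deterministic per-token K/V rows
    that depend on token id and absolute position (as real KV states do)."""
    k = np.array([[hash((t, p, i)) % 97 for i in range(D)]
                  for t, p in zip(tokens, positions)], dtype=float)
    return k, k * 0.5  # toy V derived from K

# The schema assigns each module a fixed position range, so its KV states
# stay valid regardless of which other modules surround it.
schema = {"system": range(0, 8), "doc_a": range(8, 24), "doc_b": range(24, 40)}

# Offline: precompute KV per module at its schema-assigned positions.
cache = {}
for name, pos in schema.items():
    toks = list(range(len(pos)))  # toy token ids for the module
    cache[name] = prefill_kv(toks, list(pos))

def assemble(modules, user_tokens, user_start):
    """Concatenate cached module KV with freshly computed KV for user spans."""
    ks = [cache[m][0] for m in modules]
    vs = [cache[m][1] for m in modules]
    user_pos = list(range(user_start, user_start + len(user_tokens)))
    uk, uv = prefill_kv(user_tokens, user_pos)
    return np.concatenate(ks + [uk]), np.concatenate(vs + [uv])

k1, v1 = assemble(["system", "doc_a"], [5, 6], user_start=40)
k2, v2 = assemble(["system", "doc_b"], [5, 6], user_start=40)
# The system module's rows are identical across both requests: reuse is safe.
assert np.array_equal(k1[:8], k2[:8])
```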
Results
Evaluated on Llama 2 (7B), CodeLlama, MPT, and Falcon. Reported TTFT improvements of roughly ~8x on CPU and up to ~60x on GPU for long, modular prompts (LongBench-style document QA), with negligible accuracy loss on benchmarks such as LongBench and HumanEval. The output is essentially equivalent to standard inference because only prefill computation is reused, not approximated.
Limitations
- Requires authoring a PML schema; benefits depend on prompts genuinely sharing modular structure.
- Cached KV tensors are model-, layer-, and precision-specific and consume significant memory; ad-hoc, highly variable prompts see little benefit.
4. TurboRAG: Accelerating RAG with Precomputed KV Caches for Chunked Text
Moore Threads AI · arxiv:2410.07590 · October 2024 · EMNLP 2025
Problem
Conventional RAG concatenates many retrieved document chunks and recomputes their KV caches online during prefill. This is quadratic in input length, produces large TTFT, wastes compute on text that does not change between requests, and limits per-device batch size.
Approach
TurboRAG splits prefill into an offline phase (per-chunk KV caches are precomputed once with each document and stored alongside its embedding) and an online phase (the retrieved documents' KV caches are fetched and concatenated to form the context KV cache; prefill runs only over the user query). Two adjustments make the result mathematically equivalent to standard prefill:
- Independent Attention mask zeros cross-chunk attention, motivated by the empirical finding that inter-document attention is exceedingly sparse in RAG.
- Reordered Positions re-index per-chunk RoPE position IDs into a contiguous sequence [0..l, l+1..2l, ...], exploiting the fact that RoPE depends only on relative offsets, so K and V can be saved without baked-in absolute positions.
A pretrained Qwen2-7B is then SFT-fine-tuned (50% doc-QA + 50% general tasks) under this mask and position regime so accuracy is preserved.
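The relative-offset property that Reordered Positions relies on can be checked numerically. The sketch below is a minimal GPT-NeoX-style RoPE (not TurboRAG's code): rotating q and k to new absolute positions that preserve their offset leaves the attention score unchanged, which is what lets a chunk cached at positions [0..l) be re-based to any contiguous window.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding (half-split pairing) for one vector."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The q.k score depends only on the relative offset pos_q - pos_k, so a
# chunk's K can be cached position-free and later assigned reordered,
# contiguous position IDs without changing attention scores.
score_at_origin = rope(q, 5) @ rope(k, 0)      # chunk cached at offset 0
score_rebased = rope(q, 105) @ rope(k, 100)    # same chunk re-based to 100
assert np.isclose(score_at_origin, score_rebased)
```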
Results
On LongBench multi-doc QA, TTFT drops from 1165 ms to 134 ms. On the RGB benchmark across noise ratios 0.2-0.8, TurboRAG-reordered averages 95.7 (Chinese) and 96.8 (English) versus Naive RAG 95.3 / 98.2 - within ~1%. OpenCompass regression: MMLU 70.73 vs 69.57; GSM-8K 79.45 vs 79.12. No architecture or inference-engine changes required.
Limitations
- Storing per-chunk KV caches and shipping them CPU->GPU incurs storage and PCIe bandwidth costs (the paper explicitly measures the host-to-device penalty).
- Requires fine-tuning the LLM to recover accuracy under independent attention; without fine-tuning accuracy can drop ~20% at high noise ratios.
- Cached KV tensors are model-specific and invalidated on model upgrades.
5. Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent Inference
Independent · arxiv:2603.04428 · February 2026
Code: github.com/yshk-mxim/agent-memory
Problem
Multi-agent LLM workflows (CrewAI, AutoGen, debate architectures) typically run 5-20 agents per task. Each agent needs its own KV cache because concatenating histories causes "lost-in-the-middle" position bias. On an edge device with 24 GB unified memory (~10.2 GB cache budget on an M4 Pro), only 3 agents fit at 8K context in FP16, forcing constant evict-and-reload cycles. Each eviction triggers a full O(n) re-prefill - 15.7 s per agent at 4K, 78.5 s of dead time after a server restart with 5 agents.
Approach
Each agent's KV cache is persisted to SSD in 4-bit quantized safetensors and reloaded directly into the attention layer, bypassing prefill entirely. Three components:
- Block pool - per-agent isolated Q4 caches that survive server restarts.
- BatchQuantizedKVCache with a ConcurrentScheduler that interleaves prefill chunks (default 512 tokens) and decode steps across agents in a single Metal kernel dispatch (Orca-style iteration-level scheduling adapted for quantized caches).
- Cross-phase context injection using character-level prefix matching (EXACT / EXTEND / DIVERGE) rather than token-ID matching like vLLM or SGLang, so the cache accumulates monotonically across conversation phases even when BPE retokenization shifts boundaries.
The ~500 ms reload latency is hidden behind the previous agent's decode phase, since multi-agent workflows naturally interleave one agent generating while the next loads.
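The EXACT / EXTEND / DIVERGE verdicts can be sketched as a plain character comparison. The labels come from the paper; the implementation below is an illustrative guess, not the released code. Matching on characters rather than token IDs means the verdict survives BPE retokenization, since the underlying bytes of the shared history do not change even when token boundaries shift.

```python
def classify_prefix(cached_text: str, new_text: str) -> tuple[str, int]:
    """Character-level prefix match between an agent's cached context and
    the incoming prompt. Returns (verdict, chars_reusable)."""
    n = min(len(cached_text), len(new_text))
    common = 0
    while common < n and cached_text[common] == new_text[common]:
        common += 1
    if common == len(cached_text) == len(new_text):
        return "EXACT", common    # reuse the whole cache, skip prefill
    if common == len(cached_text):
        return "EXTEND", common   # reuse cache, prefill only the new suffix
    return "DIVERGE", common      # cache invalid past `common`

assert classify_prefix("sys prompt. turn 1.", "sys prompt. turn 1.")[0] == "EXACT"
assert classify_prefix("sys prompt.", "sys prompt. turn 1.")[0] == "EXTEND"
assert classify_prefix("sys prompt. A", "sys prompt. B")[0] == "DIVERGE"
```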
Results
TTFT speedups hold across Gemma (22-136x at 4K-32K contexts), DeepSeek (11-76x), and Llama (24-111x). Persistence alone contributes a 27x TTFT reduction at 4K, the largest single component in the ablation. Validated across three architectures (dense GQA, MoE MLA, hybrid attention) via a model-agnostic ModelCacheSpec. OpenAI-compatible API.
Limitations
- Single device only - no cache transfer over Thunderbolt or network.
- KV caches are model-specific and invalidated by model updates (RAG text chunks survive model swaps; caches do not).
- Tested only at 8B-16B parameters; no 70B+ validation.
- Speedup measured at fixed 64-token output; the relative win shrinks as outputs lengthen.
6. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks
arxiv:2601.06007 · January 2026
Problem
LLM agents now run multi-turn tasks spanning dozens of tool calls and very large context windows. Provider-side prompt caching exists across OpenAI, Anthropic, and Google, but the paper observes that "the benefits of prompt caching for these agentic workloads remain underexplored in the research literature" - and that naive use can actively hurt performance.
Approach
Empirical study rather than a new system. The authors run 500 agent sessions with 10,000-token system prompts across three providers and four production models. They compare three caching strategies and measure both API cost and TTFT, with an ablation across prompt sizes (500-50,000 tokens) and tool-call counts (3-50).
Strategies compared
Results
All four models tested showed statistically significant cost reductions when prompt caching was enabled. The more interesting finding is qualitative: strategic block placement beats naive caching.
Concrete recommendations from the paper
- Place dynamic content at the end of the system prompt, not interleaved.
- Avoid traditional dynamic function-calling patterns that mutate the front of the message stream.
- Exclude dynamic tool results from the cached prefix; let them live as appended messages instead.
- Treat full-context caching as the suspicious option - it can increase latency through churn.
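These recommendations reduce to one rule: keep the cacheable prefix byte-stable and append everything dynamic after it. The helper below is a hypothetical illustration of our own (the paper evaluates provider APIs, not this code); message shapes and field names are assumptions.

```python
import json

STATIC_SYSTEM = "You are a coding agent. Follow the policies below.\n..."
TOOL_SCHEMAS = [{"name": "read_file"}, {"name": "run_tests"}]  # fixed order!

def build_messages(history, new_tool_results, today):
    # Cache-friendly: the first message is byte-identical across turns and
    # sessions, so the provider's prefix cache keeps hitting.
    system = STATIC_SYSTEM + "\n" + json.dumps(TOOL_SCHEMAS, sort_keys=True)
    msgs = [{"role": "system", "content": system}]
    msgs += history
    # Dynamic content (tool output, dates) goes at the END, never interleaved
    # into the system prompt; tool results live as appended messages.
    for r in new_tool_results:
        msgs.append({"role": "tool", "content": r})
    msgs.append({"role": "user", "content": f"(today: {today}) continue"})
    return msgs

a = build_messages([], [], "2026-03-01")
b = build_messages([], [], "2026-03-02")
# The cached prefix is identical across days; only the tail differs.
assert a[0] == b[0] and a[-1] != b[-1]
```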
Limitations
- Provider-side caching only; the study does not implement its own cache layer.
- Findings depend on each provider's current cache TTL and pricing, which evolve.
- Five hundred sessions is enough for headline numbers but smaller than a production-traffic study.
7. Cross-cutting themes
Position handling is the universal hard part
Every paper has a paragraph on positions. Prompt Cache assigns position IDs through the PML schema. TurboRAG reorders RoPE positions per chunk and exploits relative-offset invariance. The Persistent Q4 paper uses character-level prefix matching to survive BPE retokenization. "Don't Break the Cache" boils down to do not change the bytes that precede a cached region, which is the application-layer expression of the same constraint. If you want to cache attention state, the position scheme has to be your first design decision.
Independence assumptions unlock the largest wins
TurboRAG's independent-attention mask is the most aggressive case - it explicitly drops cross-chunk attention - and it earns the largest TFLOP reduction (~98.46%). Prompt Cache makes a milder version of the same bet by treating modules as composable units. The further you push toward "this segment can be analyzed in isolation," the more compute you get to skip. The cost is that you have to verify the assumption empirically, and re-verify when models or workloads change.
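The independence bet is easy to see as a mask. The sketch below is a toy version of TurboRAG-style masking under our own simplifications: every zero below the diagonal in the chunk rows is a block of attention compute that is never paid, which is where the TFLOP reduction comes from.

```python
import numpy as np

def independent_attention_mask(chunk_lens, query_len):
    """Tokens in a retrieved chunk attend causally within their own chunk
    only; the user query attends over the full sequence. 1 = attend."""
    n = sum(chunk_lens) + query_len
    mask = np.zeros((n, n), dtype=int)
    start = 0
    for l in chunk_lens:
        # Causal attention restricted to the chunk's own block.
        mask[start:start + l, start:start + l] = np.tril(np.ones((l, l), dtype=int))
        start += l
    # Query rows: ordinary causal attention over chunks + earlier query tokens.
    mask[start:, :] = np.tril(np.ones((n, n), dtype=int))[start:, :]
    return mask

m = independent_attention_mask([3, 3], query_len=2)
assert m[4, 1] == 0  # chunk 2 cannot see chunk 1: cross-chunk attention dropped
assert m[6, 1] == 1  # the query still sees every chunk
```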
Quality verification is non-negotiable
TurboRAG runs RGB and OpenCompass to show no regression; Persistent Q4 reports perplexity deltas across three models; "Don't Break the Cache" measures cost and latency simultaneously to detect cases where caching was actively harmful. Reuse mechanisms create new opportunities for silent quality drift, so quality benchmarks must run alongside cost dashboards.
The bottleneck moves, it does not disappear
TurboRAG eliminates online prefill but introduces a CPU->GPU PCIe transfer; the paper measures it explicitly and the end-to-end speedup is closer to 4x than the headline 8.6x. Persistent Q4 turns prefill into a disk read and exploits agent interleaving to hide the ~500 ms reload. Once you cache attention state, the next bottleneck is moving that state to where the math happens. Plan for it.
Application-level discipline beats raw infrastructure
"Don't Break the Cache" is the most pragmatic of the four because it does not propose a new mechanism - it measures what happens when the mechanism is already there and the application uses it badly. The 41-80% cost reduction it reports comes entirely from prompt construction discipline. If the agent layer is sloppy, none of the deeper systems work matters.
8. Practitioner takeaways
- Start at the application layer. Audit how your agent constructs prompts. Is dynamic content at the end? Are tool schemas in stable order? Are you mutating the system prompt mid-session? You can capture half the benefit of all four papers without changing inference infrastructure.
- Characterize your reuse profile before building infrastructure. Prompt Cache's PML schema is overkill if your prompts share a single static prefix; provider prefix caching is enough. TurboRAG-style per-chunk caching only pays off if retrieved chunks are themselves repeated across queries.
- Treat position handling as a first-class design decision. Every reuse strategy lives or dies on whether positions stay valid when modules are reassembled. Pick one mechanism (schema-assigned IDs, relative-offset RoPE, character-prefix matching) and apply it consistently.
- Pair every cache rollout with a quality benchmark. Cost and latency dashboards lie about quality. Run a golden-task replay set on every cache config change and alert on quality drift even when cost improves.
- Plan for the next bottleneck. Removing prefill exposes data movement. Budget for storage IO, PCIe / Thunderbolt, and (in distributed settings) the network cost of shipping KV tensors.
- Cache invalidation discipline matters more than cache size. The Persistent Q4 paper notes that KV caches must be invalidated on model upgrades; "Don't Break the Cache" notes that subtle prompt edits invalidate provider caches silently. Tag every cached artifact with its model version, schema version, and source commit.
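The tagging discipline in the last point can be made mechanical by folding every invalidating input into the cache key itself. This is a minimal sketch of our own devising (none of the papers prescribes this layout): a content hash over model version, schema version, source commit, and segment text, so a change to any of them silently retires every stale artifact.

```python
import hashlib

def cache_key(model_version: str, schema_version: str,
              source_commit: str, segment_text: str) -> str:
    """Derive a cache key that changes whenever any invalidating input
    changes, so stale KV artifacts can never be served after an upgrade."""
    h = hashlib.sha256()
    for part in (model_version, schema_version, source_commit, segment_text):
        h.update(part.encode())
        h.update(b"\x00")  # delimiter so ("ab","c") never collides with ("a","bc")
    return h.hexdigest()[:16]

k1 = cache_key("llama-3-8b", "pml-v2", "abc123", "You are a helpful agent.")
k2 = cache_key("llama-3-8b-v2", "pml-v2", "abc123", "You are a helpful agent.")
assert k1 != k2  # a model upgrade alone invalidates every cached artifact
```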
9. References
- Gim, In; Chen, Guojun; Lee, Seung-seob; Sarda, Nikhil; Khandelwal, Anurag; Zhong, Lin. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. MLSys 2024. arxiv:2311.04934.
- Lu, Songshuo; Wang, Hua; Rong, Yutian; Chen, Zhi; Tang, Yaohua. TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. EMNLP 2025. arxiv:2410.07590.
- Shkolnikov, Yakov Pyotr. Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices. arxiv:2603.04428 (February 2026). Code: github.com/yshk-mxim/agent-memory.
- Lumer, Elias; Nizar, Faheem; Jangiti, Akshaya; Frank, Kevin; Gulati, Anmol; Phadate, Mandar; Subbiah, Vamse Kumar. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks. arxiv:2601.06007 (January 2026).
Related work referenced in the synthesis
- Gao, Bin; et al. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (AttentionStore). arxiv:2403.19708. A complementary system that persists per-session KV caches across multi-turn conversations.