1. Executive Overview
Cache-Augmented Generation (CAG) is an emerging alternative to traditional retrieval-augmented generation (RAG) for mostly static enterprise knowledge. Instead of embedding a query, searching a vector database, retrieving chunks, and rebuilding a prompt for every request, the system pre-tokenizes static knowledge once, precomputes the transformer KV attention states, stores those states, and reloads them at inference time.
2. Names and Meaning
| Name | Meaning |
|---|---|
| Cache-Augmented Generation | KV cache replaces most retrieval-time context construction. |
| Vectorless RAG | The architecture avoids embeddings and vector databases on the normal path. |
| Precompiled Prompt Architecture | Static prompt and knowledge modules are compiled ahead of runtime. |
| Persistent KV Cache | Transformer key-value attention states are saved and restored across requests. |
| Prompt Cache Architecture | Reusable prompt modules are stored as attention states instead of raw text. |
| Memory-Prefilled Inference | The model begins generation with preloaded context memory. |
These terms overlap. In this document, CAG means a production architecture where static knowledge is represented as reusable KV cache modules and loaded deterministically at runtime.
3. Traditional RAG vs Cache-Augmented Generation
Traditional RAG
user_query
-> embed_query()
-> vector_search()
-> rerank_and_format_chunks()
-> assemble_prompt()
-> prefill_full_context()
-> decode()
Cache-Augmented Generation
user_query
-> classify_intent()
-> load_kv_cache(module_id)
-> append_query_tokens()
-> decode()
The architectural shift is simple but important: context selection and attention-state computation happen before production traffic arrives. Runtime no longer depends on approximate nearest-neighbor retrieval for common cases.
4. Offline and Runtime Architecture
Offline phase
static_data
-> canonicalize()
-> tokenize(model_tokenizer)
-> prefill_transformer()
-> persist_kv_cache()
-> route_table[module_id]
Runtime phase
user_query
-> classify_intent()
-> load_kv_cache(module_id)
-> append_query_tokens()
-> decode()
5. Why This Is Fast
Long-context systems spend substantial time in prefill: the model must process every input token before it can emit the first answer token. Traditional RAG adds embedding, vector search, reranking, chunk formatting, and then a large prefill. CAG removes most of that hot-path work for static knowledge.
- No embedding call: fewer network hops and no embedding model capacity requirement.
- No vector DB query: no approximate nearest-neighbor latency or index tuning on the hot path.
- No reranking: avoids secondary model calls for common questions.
- No chunk assembly: avoids brittle context stitching and duplicated prompt formatting.
- No repeated static prefill: attention states for static knowledge are restored instead of recomputed.
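As a rough illustration (the numbers below are assumptions, not measurements): if a compiled knowledge module is 40,000 tokens and the serving stack prefills about 8,000 tokens per second per GPU, every request that resends that module pays roughly five seconds of prefill before the first output token, while restoring the precomputed KV module reduces prefill to the dynamic suffix alone.

// Back-of-the-envelope prefill savings; all constants are illustrative assumptions.
fn main() {
    let static_tokens: f64 = 40_000.0;     // precompiled knowledge module
    let dynamic_tokens: f64 = 300.0;       // user query + turn-specific state
    let prefill_tok_per_s: f64 = 8_000.0;  // assumed prefill throughput per GPU

    let rag_prefill_s = (static_tokens + dynamic_tokens) / prefill_tok_per_s;
    let cag_prefill_s = dynamic_tokens / prefill_tok_per_s;

    println!("prefill without cache: {:.2}s", rag_prefill_s); // ~5.04s
    println!("prefill with cache:    {:.2}s", cag_prefill_s); // ~0.04s
}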
6. Best Fit and Poor Fit
| Ideal Use Cases | Why CAG Fits |
|---|---|
| Enterprise policy engines | Policies are versioned, reviewed, and mostly static. |
| Code assistants | Repository maps, standards, APIs, and style guides can be compiled per branch or release. |
| API documentation | Specs and examples are static within a release version. |
| Workflow agents | Operating procedures and tool contracts are stable and routable by intent. |
| Product catalogs | Catalog subsets can be compiled by region, brand, product line, or tenant. |
| Compliance systems | Strong need for deterministic grounding and auditability. |
| MCP tool registries | Tool definitions and capability descriptions are stable prompt modules. |
Poor fits are the inverse: rapidly changing sources, long-tail or unknown-domain questions, and exploratory requests, which should stay on the retrieval fallback path described in section 9.
7. Hierarchical Cache Design
The runtime appends only the dynamic user query and small turn-specific state. Stable layers are restored as KV modules. This mirrors how advanced coding agents can reuse system instructions, tool definitions, repository context, and session state separately.
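One way to represent that hierarchy in the serving layer, as a sketch with illustrative layer names:

// Illustrative cache layer hierarchy; layer names and ordering are assumptions.
enum CacheLayer {
    SystemInstructions, // global, shared across tenants
    ToolDefinitions,    // stable tool and schema contracts
    RepositoryContext,  // per-branch or per-release compiled knowledge
    SessionState,       // optional per-session summary
}

// A request restores the stable layers as KV modules in a fixed order,
// then appends only the dynamic query tokens.
struct RequestContext {
    restored_layers: Vec<CacheLayer>, // loaded from the KV cache pool
    dynamic_suffix: String,           // user query + turn-specific state
}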
8. Deterministic Knowledge Routing
CAG replaces probabilistic semantic retrieval with deterministic module selection. A router maps the user request to a cache module by explicit intent, product area, tenant, permission scope, or workflow state.
- Billing refund policy? -> billing.kvcache
- Shipping SLA for region? -> shipping.kvcache
- Rust API usage? -> rust_docs.kvcache
Each lookup also runs an ACL check before the module is loaded.
// Deterministic route table: each intent maps to a versioned cache module;
// load(), fallback_retrieval(), and CacheHandle are the cache-pool and
// retrieval services and their shared return type.
fn route(intent: &str) -> CacheHandle {
    match intent {
        "billing_question" => load("billing.v12.kvcache"),
        "shipping_question" => load("shipping.us.v7.kvcache"),
        "rust_question" => load("rust_docs.1_78.kvcache"),
        _ => fallback_retrieval(),
    }
}
This approach is enterprise-friendly because routing decisions can be audited, tested, versioned, and tied to access control.
9. Best Hybrid Design
The strongest production pattern is not pure CAG everywhere. Use deterministic KV cache modules for the common, static, high-volume path and keep retrieval as a rare fallback for unsupported, stale, or exploratory requests.
- Use cache modules for stable policies, tools, schemas, product areas, and docs.
- Use retrieval for long-tail, recently changed, or unknown-domain queries.
- Promote repeated fallback results into new cache modules after review.
- Route by explicit metadata first, classifier second, semantic fallback last.
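A minimal sketch of the hybrid decision, assuming the router emits a module id plus a confidence score; the threshold and type names are illustrative:

// Hybrid routing sketch: confident routes use a cache module, everything else
// falls back to retrieval and is logged for later module promotion review.
struct RouteDecision {
    module_id: Option<String>,
    confidence: f32,
}

enum ServingPath {
    CacheModule(String),
    FallbackRetrieval,
}

fn choose_path(decision: RouteDecision, min_confidence: f32) -> ServingPath {
    if decision.confidence >= min_confidence {
        if let Some(id) = decision.module_id {
            return ServingPath::CacheModule(id);
        }
    }
    // Low confidence or no matching module: fall back to retrieval and record
    // the query so repeated misses can be promoted into a new cache module.
    ServingPath::FallbackRetrieval
}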
10. Technical Details
What the KV cache contains
A transformer KV cache stores key and value tensors produced by attention layers for a specific token sequence. It is not portable raw text. It is the model's intermediate attention memory for that exact model, tokenizer, prompt layout, and cache position scheme.
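Its size follows directly from the model shape: two tensors (keys and values) per layer, each num_kv_heads x head_dim per token. A rough estimate, with dimensions chosen only for illustration:

// Rough KV cache size estimate; the model dimensions are illustrative assumptions.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, tokens: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * tokens * bytes_per_elem // keys + values
}

fn main() {
    // e.g. an 80-layer model with 8 KV heads of dim 128, fp16 (2 bytes),
    // caching a 40,000-token knowledge module:
    let bytes = kv_cache_bytes(80, 8, 128, 40_000, 2);
    println!("{:.1} GiB", bytes as f64 / (1024.0 * 1024.0 * 1024.0)); // ~12.2 GiB
}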
Model specificity
- A Gemma cache cannot be loaded into Llama.
- A model version change invalidates cache modules unless compatibility is explicitly guaranteed.
- A tokenizer, RoPE scaling, attention implementation, or layer layout change can invalidate cache modules.
- Quantized caches must be validated for answer quality, not only load speed.
Cache metadata
interface KVCacheModule {
module_id: string;
tenant_id: string;
route: string;
model_name: string;
model_revision: string;
tokenizer_revision: string;
context_checksum: string;
token_count: number;
dtype: "fp16" | "bf16" | "q8" | "q4";
storage_uri: string;
acl_scope: "public" | "tenant" | "workspace" | "user";
created_at: string;
expires_at?: string;
}
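Before a module is restored, the runtime should refuse any provenance mismatch. A minimal compatibility gate mirroring the fields above (the ServingModel struct is an assumed stand-in for the serving process's identity record):

// Compatibility gate before restoring a module; fail closed on any mismatch.
struct ServingModel {
    model_name: String,
    model_revision: String,
    tokenizer_revision: String,
}

fn is_compatible(
    module_model_name: &str,
    module_model_revision: &str,
    module_tokenizer_revision: &str,
    serving: &ServingModel,
) -> bool {
    // The module's context_checksum should also be re-verified before serving.
    module_model_name == serving.model_name
        && module_model_revision == serving.model_revision
        && module_tokenizer_revision == serving.tokenizer_revision
}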
Do not break the cache
Cache-aware prompt construction matters. Adding timestamps, random IDs, reordered tool schemas, or inconsistent separators can invalidate reusable prefixes and destroy hit rates. Treat prompt layout as an interface contract.
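A sketch of cache-safe prefix construction, with illustrative helper names; the point is that everything in the precompiled prefix is deterministic and everything volatile lands in the dynamic suffix:

// Cache-safe prefix construction: the precompiled prefix must be byte-stable.
fn build_static_prefix(system_prompt: &str, mut tool_schemas: Vec<String>) -> String {
    // Sort tool schemas so upstream reordering never changes the token stream.
    tool_schemas.sort();
    let mut prefix = String::new();
    prefix.push_str(system_prompt);
    prefix.push_str("\n\n");
    for schema in &tool_schemas {
        prefix.push_str(schema);
        prefix.push('\n');
    }
    // Timestamps, request IDs, and per-user values belong in the dynamic
    // suffix appended after the cache is restored, never in this prefix.
    prefix
}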
11. Recommended Rust Architecture
A practical implementation can use Rust for routing, cache metadata, storage orchestration, and serving control, with vLLM or TensorRT-LLM handling optimized model execution where persistent cache support is available or can be integrated.
Core services
- Router service: maps request to cache module by intent, tenant, product area, and ACL.
- Cache compiler: builds KV modules from reviewed static knowledge and model revisions.
- Cache registry: stores metadata, checksums, versions, TTLs, and invalidation state.
- GPU cache pool: manages resident KV modules, eviction, paging, and reuse across requests.
- Fallback RAG service: handles uncommon or stale queries and feeds module promotion workflows.
axum_gateway
-> authorize_tenant()
-> classify_intent()
-> redis_route_lookup()
-> load_kv_module()
-> append_query_tokens()
-> stream_generation()
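A minimal gateway handler shape, assuming axum, serde, and tokio; the request and response types and the stubbed handler body stand in for the services listed above:

// Minimal axum gateway sketch; AskRequest/AskResponse and the stubbed body are
// placeholders for the router, registry, cache pool, and generation services.
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct AskRequest {
    tenant_id: String,
    query: String,
}

#[derive(Serialize)]
struct AskResponse {
    answer: String,
    module_id: String,
}

async fn ask(Json(req): Json<AskRequest>) -> Json<AskResponse> {
    // authorize_tenant(), classify_intent(), redis_route_lookup(),
    // load_kv_module(), append_query_tokens(), and stream_generation()
    // would run here; stubbed so only the gateway shape is shown.
    Json(AskResponse {
        answer: format!("stub answer for tenant {}", req.tenant_id),
        module_id: "billing.v12".to_string(),
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/v1/ask", post(ask));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}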
12. Cache Compiler Pipeline
The cache compiler is the build system for precompiled knowledge. It turns reviewed static sources into versioned, model-specific KV cache modules. Treat it like a production compiler: deterministic inputs, reproducible outputs, test fixtures, metadata, and rollback.
Compiler artifacts
cache_compile --model llama-3.1-70b \
--tokenizer rev_2026_04 \
--source ./knowledge/billing \
--layout ./layouts/policy_agent.v3.json \
--acl tenant:acme:billing \
--out s3://kv-cache/acme/billing/v12/
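A sketch of what the compiler records per module, using hypothetical type and field names; the engine prefill itself is stubbed, and the checksum uses the standard library hasher as a placeholder for a real content hash such as SHA-256:

// Sketch of the compile step's outputs; prefill and persistence are stubbed.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct CompiledModule {
    module_id: String,
    kv_cache_uri: String,     // where the attention states are written (--out)
    context_checksum: String, // recorded in the registry for invalidation
    token_count: usize,
}

fn compile_module(module_id: &str, canonical_text: &str, out_uri: &str) -> CompiledModule {
    // Deterministic checksum over the canonicalized source.
    let mut hasher = DefaultHasher::new();
    canonical_text.hash(&mut hasher);

    // The real pipeline would tokenize with the target model's tokenizer and
    // run a single prefill pass through the serving engine, persisting the KV
    // tensors to out_uri; both steps are omitted in this sketch.
    CompiledModule {
        module_id: module_id.to_string(),
        kv_cache_uri: out_uri.to_string(),
        context_checksum: format!("{:x}", hasher.finish()),
        token_count: canonical_text.split_whitespace().count(), // tokenizer stand-in
    }
}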
13. Storage and Paging Design
KV cache modules can be large. Production systems need storage tiers and paging policies instead of assuming every module is always resident in GPU memory. The goal is to keep the hottest modules close to the decoder while preserving cheaper cold storage for less common domains.
Paging policy
- Pin global system and tool-definition modules in GPU memory when capacity allows.
- Promote domain modules from NVMe to RAM based on request rate and route confidence.
- Evict large tenant modules by cost-aware LFU or weighted recency, not plain LRU.
- Keep checksums next to each tier so corrupted modules fail closed before generation.
eviction_score =
size_gb * reload_cost_ms
/ max(1, hits_last_15m * route_priority)
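A direct translation of that score, assuming the resident module with the highest score is evicted first when GPU memory is needed:

// Cost-aware eviction score: large, rarely hit, low-priority modules score highest.
fn eviction_score(size_gb: f64, reload_cost_ms: f64, hits_last_15m: f64, route_priority: f64) -> f64 {
    size_gb * reload_cost_ms / (hits_last_15m * route_priority).max(1.0)
}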
14. Cache-Aware Scheduler
A normal inference scheduler batches by arrival time and token budget. A cache-aware scheduler also considers which KV modules are already resident, which requests can share modules, and whether loading a module will block higher-priority work.
Scheduling signals
- Module residency: GPU, RAM, NVMe, or object store.
- Module size: load time, transfer cost, and GPU memory pressure.
- Route confidence: whether a request should use CAG or fallback retrieval.
- Tenant priority: per-tenant SLOs, budget, and queue fairness.
- Suffix length: dynamic query and session-state tokens appended after cache restore.
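One illustrative way to combine these signals into a single priority, where every weight and per-tier penalty is an assumption rather than a tuned value:

// Illustrative scheduling priority combining the signals above; all constants
// are assumptions, not tuned numbers.
enum Residency { Gpu, Ram, Nvme, ObjectStore }

fn load_penalty_ms(residency: &Residency, module_size_gb: f64) -> f64 {
    // Rough transfer-time penalty per storage tier.
    match residency {
        Residency::Gpu => 0.0,
        Residency::Ram => module_size_gb * 50.0,
        Residency::Nvme => module_size_gb * 300.0,
        Residency::ObjectStore => module_size_gb * 2_000.0,
    }
}

fn schedule_priority(residency: &Residency, size_gb: f64, route_confidence: f64,
                     tenant_priority: f64, suffix_tokens: u32) -> f64 {
    let penalty = load_penalty_ms(residency, size_gb);
    // Resident modules with confident routes and short dynamic suffixes batch
    // first; cold, low-confidence, long-suffix requests wait or fall back.
    (route_confidence * tenant_priority) / (1.0 + penalty + suffix_tokens as f64)
}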
15. Versioning, Invalidation, and Governance
CAG is deterministic only when cache lifecycle is deterministic. Every module needs source provenance, model provenance, access scope, test results, and a clear invalidation path.
Invalidation triggers
| Trigger | Invalidates | Required Action |
|---|---|---|
| Source checksum change | Domain module | Recompile and rerun golden tests. |
| Model revision change | All modules for that model | Full rebuild unless compatibility is proven. |
| Tokenizer change | All tokenized modules | Full rebuild and route table update. |
| ACL policy change | Private modules by scope | Reissue manifests and deny old routes. |
| Prompt layout change | Modules using that layout | Recompile and compare outputs. |
16. Evaluation and Benchmarking
CAG should be evaluated against both standard RAG and full-prompt long-context inference. Measure speed, cost, correctness, grounding, route accuracy, and rollback safety. A fast cache that answers from the wrong module is worse than a slower retrieval path.
Benchmark matrix
| Baseline | Purpose | Expected CAG Advantage |
|---|---|---|
| Traditional RAG | Compare against embedding, retrieval, reranking, prompt assembly, and prefill. | Lower time to first token (TTFT) and lower GPU prefill cost on static domains. |
| Full long-context prompt | Compare against resending all static context every request. | Much lower repeated prefill work. |
| Pure classifier routing | Check whether routing alone is enough without precompiled context. | Better grounded answers with static knowledge loaded. |
| Fallback retrieval only | Validate fallback quality and identify module gaps. | CAG handles common paths, fallback handles unknown paths. |
17. Limitations and Risks
| Risk | Mitigation |
|---|---|
| Model-specific cache files | Version every module by model, tokenizer, attention implementation, and prompt layout. |
| Stale knowledge | Use build pipelines, source checksums, TTLs, and invalidation events. |
| Large cache storage | Use quantization, module boundaries, paging, hot/cold tiers, and cache admission policies. |
| Router errors | Use deterministic rules where possible, confidence thresholds, tests, and fallback retrieval. |
| Security leakage | Namespace by tenant and ACL; never share private cache modules across scopes. |
| Prompt layout fragility | Treat module layout as an API and run regression tests before cache publication. |
18. Research Direction
This field is moving quickly. The most relevant work focuses on persistent KV cache, modular attention reuse, cache-aware schedulers, memory-paged inference, SSD-backed cache restoration, prefix trees for prompt reuse, and quantized KV storage.
| Work | Relevance |
|---|---|
| TurboRAG | Precomputes KV caches for retrieved text chunks and reports large TTFT reductions versus standard RAG. |
| Prompt Cache | Defines reusable prompt modules and reuses attention states for low-latency inference. |
| Agent Memory Below the Prompt | Explores persistent quantized KV cache for multi-agent inference and direct cache restoration. |
A useful mental model is compiled inference context: similar to compiled query plans, bytecode, or precomputed indexes in databases, but applied to LLM attention state.
19. Implementation Roadmap
Phase 1: Static module candidates
Identify stable knowledge domains such as policies, API docs, product catalog slices, tool definitions, and workflow manuals.
Phase 2: Cache compiler prototype
Build a pipeline that canonicalizes source content, tokenizes it for one model, computes KV cache files, and records metadata.
Phase 3: Deterministic router
Start with explicit rule-based routing and ACL checks. Add classifier-assisted routing only after route labels are stable.
Phase 4: Runtime integration
Integrate cache restoration into the model serving layer and measure TTFT, throughput, GPU memory pressure, and answer quality.
Phase 5: Hybrid fallback and promotion
Keep retrieval for low-confidence or stale requests. Promote repeated fallback patterns into reviewed cache modules.