1. Executive Overview
Cache-Augmented Generation (CAG) is an emerging alternative to traditional retrieval-augmented generation (RAG) for mostly static enterprise knowledge. Instead of embedding a query, searching a vector database, retrieving chunks, and rebuilding a prompt for every request, the system pre-tokenizes static knowledge once, precomputes the transformer KV attention states, stores those states, and reloads them at inference time.
2. Names and Meaning
| Name | Meaning |
|---|---|
| Cache-Augmented Generation | KV cache replaces most retrieval-time context construction. |
| Vectorless RAG | The architecture avoids embeddings and vector databases on the normal path. |
| Precompiled Prompt Architecture | Static prompt and knowledge modules are compiled ahead of runtime. |
| Persistent KV Cache | Transformer key-value attention states are saved and restored across requests. |
| Prompt Cache Architecture | Reusable prompt modules are stored as attention states instead of raw text. |
| Memory-Prefilled Inference | The model begins generation with preloaded context memory. |
These terms overlap. In this document, CAG means a production architecture where static knowledge is represented as reusable KV cache modules and loaded deterministically at runtime.
3. Traditional RAG vs Cache-Augmented Generation
Traditional RAG
user_query
-> embed_query()
-> vector_search()
-> rerank_and_format_chunks()
-> assemble_prompt()
-> prefill_full_context()
-> decode()
Cache-Augmented Generation
user_query
-> classify_intent()
-> load_kv_cache(module_id)
-> append_query_tokens()
-> decode()
The architectural shift is simple but important: context selection and attention-state computation happen before production traffic arrives. Runtime no longer depends on approximate nearest-neighbor retrieval for common cases.
4. Offline and Runtime Architecture
Offline phase
static_data
-> canonicalize()
-> tokenize(model_tokenizer)
-> prefill_transformer()
-> persist_kv_cache()
-> route_table[module_id]
Runtime phase
user_query
-> classify_intent()
-> load_kv_cache(module_id)
-> append_query_tokens()
-> decode()
5. Why This Is Fast
Long-context systems spend substantial time in prefill: the model must process every input token before it can emit the first answer token. Traditional RAG adds embedding, vector search, reranking, chunk formatting, and then a large prefill. CAG removes most of that hot-path work for static knowledge.
- No embedding call: fewer network hops and no embedding model capacity requirement.
- No vector DB query: no approximate nearest-neighbor latency or index tuning on the hot path.
- No reranking: avoids secondary model calls for common questions.
- No chunk assembly: avoids brittle context stitching and duplicated prompt formatting.
- No repeated static prefill: attention states for static knowledge are restored instead of recomputed.
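As a rough illustration (the numbers below are assumptions, not measurements): if a compiled knowledge module is 40,000 tokens and the serving stack prefills about 8,000 tokens per second per GPU, every request that resends that module pays roughly five seconds of prefill before the first output token, while restoring the precomputed KV module reduces prefill to the dynamic suffix alone.

// Back-of-the-envelope prefill savings; all constants are illustrative assumptions.
fn main() {
    let static_tokens: f64 = 40_000.0;     // precompiled knowledge module
    let dynamic_tokens: f64 = 300.0;       // user query + turn-specific state
    let prefill_tok_per_s: f64 = 8_000.0;  // assumed prefill throughput per GPU

    let rag_prefill_s = (static_tokens + dynamic_tokens) / prefill_tok_per_s;
    let cag_prefill_s = dynamic_tokens / prefill_tok_per_s;

    println!("prefill without cache: {:.2}s", rag_prefill_s); // ~5.04s
    println!("prefill with cache:    {:.2}s", cag_prefill_s); // ~0.04s
}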
6. Best Fit and Poor Fit
| Ideal Use Cases | Why CAG Fits |
|---|---|
| Enterprise policy engines | Policies are versioned, reviewed, and mostly static. |
| Code assistants | Repository maps, standards, APIs, and style guides can be compiled per branch or release. |
| API documentation | Specs and examples are static within a release version. |
| Workflow agents | Operating procedures and tool contracts are stable and routable by intent. |
| Product catalogs | Catalog subsets can be compiled by region, brand, product line, or tenant. |
| Compliance systems | Strong need for deterministic grounding and auditability. |
| MCP tool registries | Tool definitions and capability descriptions are stable prompt modules. |
Poor fits are the inverse: rapidly changing sources, long-tail or unknown-domain questions, and exploratory requests, which should stay on the retrieval fallback path described in section 9.
7. Hierarchical Cache Design
The runtime appends only the dynamic user query and small turn-specific state. Stable layers are restored as KV modules. This mirrors how advanced coding agents can reuse system instructions, tool definitions, repository context, and session state separately.
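One way to represent that hierarchy in the serving layer, as a sketch with illustrative layer names:

// Illustrative cache layer hierarchy; layer names and ordering are assumptions.
enum CacheLayer {
    SystemInstructions, // global, shared across tenants
    ToolDefinitions,    // stable tool and schema contracts
    RepositoryContext,  // per-branch or per-release compiled knowledge
    SessionState,       // optional per-session summary
}

// A request restores the stable layers as KV modules in a fixed order,
// then appends only the dynamic query tokens.
struct RequestContext {
    restored_layers: Vec<CacheLayer>, // loaded from the KV cache pool
    dynamic_suffix: String,           // user query + turn-specific state
}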
8. Deterministic Knowledge Routing
CAG replaces probabilistic semantic retrieval with deterministic module selection. A router maps the user request to a cache module by explicit intent, product area, tenant, permission scope, or workflow state.
- Billing refund policy? -> billing.kvcache
- Shipping SLA for region? -> shipping.kvcache
- Rust API usage? -> rust_docs.kvcache
Each lookup also runs an ACL check before the module is loaded.
// Deterministic route table: each intent maps to a versioned cache module;
// load(), fallback_retrieval(), and CacheHandle are the cache-pool and
// retrieval services and their shared return type.
fn route(intent: &str) -> CacheHandle {
    match intent {
        "billing_question" => load("billing.v12.kvcache"),
        "shipping_question" => load("shipping.us.v7.kvcache"),
        "rust_question" => load("rust_docs.1_78.kvcache"),
        _ => fallback_retrieval(),
    }
}
This approach is enterprise-friendly because routing decisions can be audited, tested, versioned, and tied to access control.
9. Best Hybrid Design
The strongest production pattern is not pure CAG everywhere. Use deterministic KV cache modules for the common, static, high-volume path and keep retrieval as a rare fallback for unsupported, stale, or exploratory requests.
- Use cache modules for stable policies, tools, schemas, product areas, and docs.
- Use retrieval for long-tail, recently changed, or unknown-domain queries.
- Promote repeated fallback results into new cache modules after review.
- Route by explicit metadata first, classifier second, semantic fallback last.
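A minimal sketch of the hybrid decision, assuming the router emits a module id plus a confidence score; the threshold and type names are illustrative:

// Hybrid routing sketch: confident routes use a cache module, everything else
// falls back to retrieval and is logged for later module promotion review.
struct RouteDecision {
    module_id: Option<String>,
    confidence: f32,
}

enum ServingPath {
    CacheModule(String),
    FallbackRetrieval,
}

fn choose_path(decision: RouteDecision, min_confidence: f32) -> ServingPath {
    if decision.confidence >= min_confidence {
        if let Some(id) = decision.module_id {
            return ServingPath::CacheModule(id);
        }
    }
    // Low confidence or no matching module: fall back to retrieval and record
    // the query so repeated misses can be promoted into a new cache module.
    ServingPath::FallbackRetrieval
}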
10. Technical Details
What the KV cache contains
A transformer KV cache stores key and value tensors produced by attention layers for a specific token sequence. It is not portable raw text. It is the model's intermediate attention memory for that exact model, tokenizer, prompt layout, and cache position scheme.
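Its size follows directly from the model shape: two tensors (keys and values) per layer, each num_kv_heads x head_dim per token. A rough estimate, with dimensions chosen only for illustration:

// Rough KV cache size estimate; the model dimensions are illustrative assumptions.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, tokens: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * tokens * bytes_per_elem // keys + values
}

fn main() {
    // e.g. an 80-layer model with 8 KV heads of dim 128, fp16 (2 bytes),
    // caching a 40,000-token knowledge module:
    let bytes = kv_cache_bytes(80, 8, 128, 40_000, 2);
    println!("{:.1} GiB", bytes as f64 / (1024.0 * 1024.0 * 1024.0)); // ~12.2 GiB
}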
Model specificity
- A Gemma cache cannot be loaded into Llama.
- A model version change invalidates cache modules unless compatibility is explicitly guaranteed.
- A tokenizer, RoPE scaling, attention implementation, or layer layout change can invalidate cache modules.
- Quantized caches must be validated for answer quality, not only load speed.
Cache metadata
interface KVCacheModule {
module_id: string;
tenant_id: string;
route: string;
model_name: string;
model_revision: string;
tokenizer_revision: string;
context_checksum: string;
token_count: number;
dtype: "fp16" | "bf16" | "q8" | "q4";
storage_uri: string;
acl_scope: "public" | "tenant" | "workspace" | "user";
created_at: string;
expires_at?: string;
}
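Before a module is restored, the runtime should refuse any provenance mismatch. A minimal compatibility gate mirroring the fields above (the ServingModel struct is an assumed stand-in for the serving process's identity record):

// Compatibility gate before restoring a module; fail closed on any mismatch.
struct ServingModel {
    model_name: String,
    model_revision: String,
    tokenizer_revision: String,
}

fn is_compatible(
    module_model_name: &str,
    module_model_revision: &str,
    module_tokenizer_revision: &str,
    serving: &ServingModel,
) -> bool {
    // The module's context_checksum should also be re-verified before serving.
    module_model_name == serving.model_name
        && module_model_revision == serving.model_revision
        && module_tokenizer_revision == serving.tokenizer_revision
}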
Do not break the cache
Cache-aware prompt construction matters. Adding timestamps, random IDs, reordered tool schemas, or inconsistent separators can invalidate reusable prefixes and destroy hit rates. Treat prompt layout as an interface contract.
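A sketch of cache-safe prefix construction, with illustrative helper names; the point is that everything in the precompiled prefix is deterministic and everything volatile lands in the dynamic suffix:

// Cache-safe prefix construction: the precompiled prefix must be byte-stable.
fn build_static_prefix(system_prompt: &str, mut tool_schemas: Vec<String>) -> String {
    // Sort tool schemas so upstream reordering never changes the token stream.
    tool_schemas.sort();
    let mut prefix = String::new();
    prefix.push_str(system_prompt);
    prefix.push_str("\n\n");
    for schema in &tool_schemas {
        prefix.push_str(schema);
        prefix.push('\n');
    }
    // Timestamps, request IDs, and per-user values belong in the dynamic
    // suffix appended after the cache is restored, never in this prefix.
    prefix
}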
11. Recommended Rust Architecture
A practical implementation can use Rust for routing, cache metadata, storage orchestration, and serving control, with vLLM or TensorRT-LLM handling optimized model execution where persistent cache support is available or can be integrated.
Core services
- Router service: maps request to cache module by intent, tenant, product area, and ACL.
- Cache compiler: builds KV modules from reviewed static knowledge and model revisions.
- Cache registry: stores metadata, checksums, versions, TTLs, and invalidation state.
- GPU cache pool: manages resident KV modules, eviction, paging, and reuse across requests.
- Fallback RAG service: handles uncommon or stale queries and feeds module promotion workflows.
axum_gateway
-> authorize_tenant()
-> classify_intent()
-> redis_route_lookup()
-> load_kv_module()
-> append_query_tokens()
-> stream_generation()
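A minimal gateway handler shape, assuming axum, serde, and tokio; the request and response types and the stubbed handler body stand in for the services listed above:

// Minimal axum gateway sketch; AskRequest/AskResponse and the stubbed body are
// placeholders for the router, registry, cache pool, and generation services.
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct AskRequest {
    tenant_id: String,
    query: String,
}

#[derive(Serialize)]
struct AskResponse {
    answer: String,
    module_id: String,
}

async fn ask(Json(req): Json<AskRequest>) -> Json<AskResponse> {
    // authorize_tenant(), classify_intent(), redis_route_lookup(),
    // load_kv_module(), append_query_tokens(), and stream_generation()
    // would run here; stubbed so only the gateway shape is shown.
    Json(AskResponse {
        answer: format!("stub answer for tenant {}", req.tenant_id),
        module_id: "billing.v12".to_string(),
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/v1/ask", post(ask));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}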
12. Cache Compiler Pipeline
The cache compiler is the build system for precompiled knowledge. It turns reviewed static sources into versioned, model-specific KV cache modules. Treat it like a production compiler: deterministic inputs, reproducible outputs, test fixtures, metadata, and rollback.
Compiler artifacts
cache_compile --model llama-3.1-70b \
--tokenizer rev_2026_04 \
--source ./knowledge/billing \
--layout ./layouts/policy_agent.v3.json \
--acl tenant:acme:billing \
--out s3://kv-cache/acme/billing/v12/
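A sketch of what the compiler records per module, using hypothetical type and field names; the engine prefill itself is stubbed, and the checksum uses the standard library hasher as a placeholder for a real content hash such as SHA-256:

// Sketch of the compile step's outputs; prefill and persistence are stubbed.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct CompiledModule {
    module_id: String,
    kv_cache_uri: String,     // where the attention states are written (--out)
    context_checksum: String, // recorded in the registry for invalidation
    token_count: usize,
}

fn compile_module(module_id: &str, canonical_text: &str, out_uri: &str) -> CompiledModule {
    // Deterministic checksum over the canonicalized source.
    let mut hasher = DefaultHasher::new();
    canonical_text.hash(&mut hasher);

    // The real pipeline would tokenize with the target model's tokenizer and
    // run a single prefill pass through the serving engine, persisting the KV
    // tensors to out_uri; both steps are omitted in this sketch.
    CompiledModule {
        module_id: module_id.to_string(),
        kv_cache_uri: out_uri.to_string(),
        context_checksum: format!("{:x}", hasher.finish()),
        token_count: canonical_text.split_whitespace().count(), // tokenizer stand-in
    }
}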
13. Storage and Paging Design
KV cache modules can be large. Production systems need storage tiers and paging policies instead of assuming every module is always resident in GPU memory. The goal is to keep the hottest modules close to the decoder while preserving cheaper cold storage for less common domains.
Paging policy
- Pin global system and tool-definition modules in GPU memory when capacity allows.
- Promote domain modules from NVMe to RAM based on request rate and route confidence.
- Evict large tenant modules by cost-aware LFU or weighted recency, not plain LRU.
- Keep checksums next to each tier so corrupted modules fail closed before generation.
eviction_score =
size_gb * reload_cost_ms
/ max(1, hits_last_15m * route_priority)
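A direct translation of that score, assuming the resident module with the highest score is evicted first when GPU memory is needed:

// Cost-aware eviction score: large, rarely hit, low-priority modules score highest.
fn eviction_score(size_gb: f64, reload_cost_ms: f64, hits_last_15m: f64, route_priority: f64) -> f64 {
    size_gb * reload_cost_ms / (hits_last_15m * route_priority).max(1.0)
}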
14. Cache-Aware Scheduler
A normal inference scheduler batches by arrival time and token budget. A cache-aware scheduler also considers which KV modules are already resident, which requests can share modules, and whether loading a module will block higher-priority work.
Scheduling signals
- Module residency: GPU, RAM, NVMe, or object store.
- Module size: load time, transfer cost, and GPU memory pressure.
- Route confidence: whether a request should use CAG or fallback retrieval.
- Tenant priority: per-tenant SLOs, budget, and queue fairness.
- Suffix length: dynamic query and session-state tokens appended after cache restore.
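One illustrative way to combine these signals into a single priority, where every weight and per-tier penalty is an assumption rather than a tuned value:

// Illustrative scheduling priority combining the signals above; all constants
// are assumptions, not tuned numbers.
enum Residency { Gpu, Ram, Nvme, ObjectStore }

fn load_penalty_ms(residency: &Residency, module_size_gb: f64) -> f64 {
    // Rough transfer-time penalty per storage tier.
    match residency {
        Residency::Gpu => 0.0,
        Residency::Ram => module_size_gb * 50.0,
        Residency::Nvme => module_size_gb * 300.0,
        Residency::ObjectStore => module_size_gb * 2_000.0,
    }
}

fn schedule_priority(residency: &Residency, size_gb: f64, route_confidence: f64,
                     tenant_priority: f64, suffix_tokens: u32) -> f64 {
    let penalty = load_penalty_ms(residency, size_gb);
    // Resident modules with confident routes and short dynamic suffixes batch
    // first; cold, low-confidence, long-suffix requests wait or fall back.
    (route_confidence * tenant_priority) / (1.0 + penalty + suffix_tokens as f64)
}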
15. Versioning, Invalidation, and Governance
CAG is deterministic only when cache lifecycle is deterministic. Every module needs source provenance, model provenance, access scope, test results, and a clear invalidation path.
Invalidation triggers
| Trigger | Invalidates | Required Action |
|---|---|---|
| Source checksum change | Domain module | Recompile and rerun golden tests. |
| Model revision change | All modules for that model | Full rebuild unless compatibility is proven. |
| Tokenizer change | All tokenized modules | Full rebuild and route table update. |
| ACL policy change | Private modules by scope | Reissue manifests and deny old routes. |
| Prompt layout change | Modules using that layout | Recompile and compare outputs. |
16. Evaluation and Benchmarking
CAG should be evaluated against both standard RAG and full-prompt long-context inference. Measure speed, cost, correctness, grounding, route accuracy, and rollback safety. A fast cache that answers from the wrong module is worse than a slower retrieval path.
Benchmark matrix
| Baseline | Purpose | Expected CAG Advantage |
|---|---|---|
| Traditional RAG | Compare against embedding, retrieval, reranking, prompt assembly, and prefill. | Lower time to first token (TTFT) and lower GPU prefill cost on static domains. |
| Full long-context prompt | Compare against resending all static context every request. | Much lower repeated prefill work. |
| Pure classifier routing | Check whether routing alone is enough without precompiled context. | Better grounded answers with static knowledge loaded. |
| Fallback retrieval only | Validate fallback quality and identify module gaps. | CAG handles common paths, fallback handles unknown paths. |
17. Limitations and Risks
| Risk | Mitigation |
|---|---|
| Model-specific cache files | Version every module by model, tokenizer, attention implementation, and prompt layout. |
| Stale knowledge | Use build pipelines, source checksums, TTLs, and invalidation events. |
| Large cache storage | Use quantization, module boundaries, paging, hot/cold tiers, and cache admission policies. |
| Router errors | Use deterministic rules where possible, confidence thresholds, tests, and fallback retrieval. |
| Security leakage | Namespace by tenant and ACL; never share private cache modules across scopes. |
| Prompt layout fragility | Treat module layout as an API and run regression tests before cache publication. |
18. Research Direction
This field is moving quickly. The most relevant work focuses on persistent KV cache, modular attention reuse, cache-aware schedulers, memory-paged inference, SSD-backed cache restoration, prefix trees for prompt reuse, and quantized KV storage.
| Work | Relevance |
|---|---|
| TurboRAG | Precomputes KV caches for retrieved text chunks and reports large TTFT reductions versus standard RAG. |
| Prompt Cache | Defines reusable prompt modules and reuses attention states for low-latency inference. |
| Agent Memory Below the Prompt | Explores persistent quantized KV cache for multi-agent inference and direct cache restoration. |
A useful mental model is compiled inference context: similar to compiled query plans, bytecode, or precomputed indexes in databases, but applied to LLM attention state.
19. Implementation Roadmap
Phase 1: Static module candidates
Identify stable knowledge domains such as policies, API docs, product catalog slices, tool definitions, and workflow manuals.
Phase 2: Cache compiler prototype
Build a pipeline that canonicalizes source content, tokenizes it for one model, computes KV cache files, and records metadata.
Phase 3: Deterministic router
Start with explicit rule-based routing and ACL checks. Add classifier-assisted routing only after route labels are stable.
Phase 4: Runtime integration
Integrate cache restoration into the model serving layer and measure TTFT, throughput, GPU memory pressure, and answer quality.
Phase 5: Hybrid fallback and promotion
Keep retrieval for low-confidence or stale requests. Promote repeated fallback patterns into reviewed cache modules.