
Advanced RAG Systems: Cache-Augmented Generation

A technical architecture for precompiled enterprise knowledge using persistent KV caches, deterministic context loading, and cache-aware inference instead of embeddings, vector databases, chunk retrieval, reranking, and runtime prompt assembly.

Version 1.0 | April 2026 | Scope: static enterprise knowledge and low-latency AI agents

1. Executive Overview

Cache-Augmented Generation is an emerging alternative to traditional retrieval-augmented generation for mostly static enterprise knowledge. Instead of embedding a query, searching a vector database, retrieving chunks, and rebuilding a prompt for every request, the system pre-tokenizes static knowledge once, precomputes transformer KV attention states, stores those states, and reloads them at inference time.

No retrieval path: no embeddings, chunking, vector search, reranking, or dynamic chunk assembly on the hot path.
Precompiled context: static knowledge is compiled into model-specific attention memory ahead of time.
Low latency: runtime work is limited to loading the cache, appending the user query, and generating the answer.
Core idea: move the expensive prefill phase from request time to build time. The LLM starts each query with attention memory already loaded.

2. Names and Meaning

Name | Meaning
Cache-Augmented Generation | KV cache replaces most retrieval-time context construction.
Vectorless RAG | The architecture avoids embeddings and vector databases on the normal path.
Precompiled Prompt Architecture | Static prompt and knowledge modules are compiled ahead of runtime.
Persistent KV Cache | Transformer key-value attention states are saved and restored across requests.
Prompt Cache Architecture | Reusable prompt modules are stored as attention states instead of raw text.
Memory-Prefilled Inference | The model begins generation with preloaded context memory.

These terms overlap. In this document, CAG means a production architecture where static knowledge is represented as reusable KV cache modules and loaded deterministically at runtime.

3. Traditional RAG vs Cache-Augmented Generation

Traditional RAG

1. Query
2. Embed query
3. Vector search
4. Retrieve chunks
5. Rerank and assemble prompt
6. Full prefill plus generation

Cache-Augmented Generation

1. Static data compiled offline
2. Load KV cache module
3. Append user question
4. Generate response
5. Fallback retrieval only for misses

The architectural shift is simple but important: context selection and attention-state computation happen before production traffic arrives. Runtime no longer depends on approximate nearest-neighbor retrieval for common cases.

4. Offline and Runtime Architecture

Offline phase

1. Static Knowledge: policies, APIs, schemas, workflows, catalog, docs
2. Normalize: stable ordering, access scope, model target, versions
3. Tokenize Once: model-specific tokenizer and prompt layout
4. Prefill Model: run the transformer over the static context
5. Save KV Cache: store attention states to SSD, RAM, or object storage
6. Index Module: record route, ACL, model, checksum, TTL

Runtime phase

1. User Query: natural language request
2. Intent Router: select a cache module deterministically
3. Load KV Cache: restore attention memory
4. Append Query: only dynamic tokens are new
5. Generate: answer with precompiled context
static_data
  -> canonicalize()
  -> tokenize(model_tokenizer)
  -> prefill_transformer()
  -> persist_kv_cache()
  -> route_table[module_id]

user_query
  -> classify_intent()
  -> load_kv_cache(module_id)
  -> append_query_tokens()
  -> decode()

5. Why This Is Fast

Long-context systems spend substantial time in prefill: the model must process every input token before it can emit the first answer token. Traditional RAG adds embedding, vector search, reranking, chunk formatting, and then a large prefill. CAG removes most of that hot-path work for static knowledge.

  • No embedding call: fewer network hops and no embedding model capacity requirement.
  • No vector DB query: no approximate nearest-neighbor latency or index tuning on the hot path.
  • No reranking: avoids secondary model calls for common questions.
  • No chunk assembly: avoids brittle context stitching and duplicated prompt formatting.
  • No repeated static prefill: attention states for static knowledge are restored instead of recomputed.
Important distinction: the cache stores transformer attention states, not just raw text. Runtime is loading attention memory, not rereading a document.
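
As a back-of-envelope illustration (every constant below is a hypothetical placeholder, not a measurement), the prefill work avoided grows linearly with the size of the static context:

// Rough TTFT comparison. All numbers are assumed values; real figures
// depend on model, hardware, batch size, and serving configuration.
fn main() {
    let static_tokens = 60_000.0;    // precompiled knowledge module
    let query_tokens = 200.0;        // dynamic user question
    let prefill_ms_per_token = 0.05; // assumed amortized prefill cost
    let cache_load_ms = 40.0;        // assumed SSD/RAM-to-GPU restore time

    // Traditional path: prefill static context plus query on every request.
    let full_prefill_ttft = (static_tokens + query_tokens) * prefill_ms_per_token;
    // CAG path: restore attention states, prefill only the query suffix.
    let cag_ttft = cache_load_ms + query_tokens * prefill_ms_per_token;

    println!("full prefill ≈ {full_prefill_ttft:.0} ms, CAG ≈ {cag_ttft:.0} ms");
}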

6. Best Fit and Poor Fit

Ideal Use Cases | Why CAG Fits
Enterprise policy engines | Policies are versioned, reviewed, and mostly static.
Code assistants | Repository maps, standards, APIs, and style guides can be compiled per branch or release.
API documentation | Specs and examples are static within a release version.
Workflow agents | Operating procedures and tool contracts are stable and routable by intent.
Product catalogs | Catalog subsets can be compiled by region, brand, product line, or tenant.
Compliance systems | Strong need for deterministic grounding and auditability.
MCP tool registries | Tool definitions and capability descriptions are stable prompt modules.
Not ideal: massive internet-scale corpora, rapidly changing documents, billions of objects, or queries that require open-world discovery.

7. Hierarchical Cache Design

Layer | Contents | Scope
L1 System Prompt | Core agent behavior, safety rules, response format, platform policies. | global
L2 Tool Definitions | Function schemas, MCP tools, API contracts, execution constraints. | agent type
L3 Domain Knowledge | Policies, docs, schemas, product data, workflows, codebase summaries. | tenant/domain
L4 Session Memory | User preferences, active task state, recent conversation summary. | session

The runtime appends only the dynamic user query and small turn-specific state. Stable layers are restored as KV modules. This mirrors how advanced coding agents can reuse system instructions, tool definitions, repository context, and session state separately.
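
A minimal sketch of the layer hierarchy as data, assuming Rust for orchestration; the type names, fields, and restore-order convention are illustrative, not a fixed schema:

// The four cache layers and their reuse scope.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CacheScope {
    Global,    // L1: shared by every request
    AgentType, // L2: shared by agents with the same tool set
    Tenant,    // L3: shared within one tenant or domain
    Session,   // L4: private to one conversation
}

struct CacheLayer {
    name: &'static str,
    scope: CacheScope,
    // Lower numbers are restored first so stable prefixes stay stable.
    restore_order: u8,
}

fn default_layers() -> Vec<CacheLayer> {
    vec![
        CacheLayer { name: "system_prompt", scope: CacheScope::Global, restore_order: 1 },
        CacheLayer { name: "tool_definitions", scope: CacheScope::AgentType, restore_order: 2 },
        CacheLayer { name: "domain_knowledge", scope: CacheScope::Tenant, restore_order: 3 },
        CacheLayer { name: "session_memory", scope: CacheScope::Session, restore_order: 4 },
    ]
}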

8. Deterministic Knowledge Routing

CAG replaces probabilistic semantic retrieval with deterministic module selection. A router maps the user request to a cache module by explicit intent, product area, tenant, permission scope, or workflow state.

User query examples:
  • Billing refund policy?
  • Shipping SLA for region?
  • Rust API usage?

The intent router, plus an ACL check, maps each query to a module:
  • billing.kvcache
  • shipping.kvcache
  • rust_docs.kvcache
match intent {
  "billing_question"  => load("billing.v12.kvcache"),
  "shipping_question" => load("shipping.us.v7.kvcache"),
  "rust_question"     => load("rust_docs.1_78.kvcache"),
  _                   => fallback_retrieval()
}

This design is enterprise-friendly because routing decisions can be audited, tested, versioned, and tied to access control.
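
A sketch of that audit-friendly routing, extending the match above with an explicit ACL gate; the Route type, scope strings, and helper names are assumptions for illustration:

// Deterministic route table plus ACL gate. Module IDs reuse the
// examples above; everything else is illustrative.
use std::collections::HashMap;

struct Route {
    module_id: &'static str,
    required_scope: &'static str, // ACL scope the caller must hold
}

fn route_table() -> HashMap<&'static str, Route> {
    HashMap::from([
        ("billing_question", Route { module_id: "billing.v12.kvcache", required_scope: "tenant:billing" }),
        ("shipping_question", Route { module_id: "shipping.us.v7.kvcache", required_scope: "tenant:shipping" }),
        ("rust_question", Route { module_id: "rust_docs.1_78.kvcache", required_scope: "public" }),
    ])
}

// Returns the module to load, or None to signal fallback retrieval.
fn select_module(intent: &str, caller_scopes: &[&str], table: &HashMap<&'static str, Route>) -> Option<&'static str> {
    let route = table.get(intent)?;
    // Fail closed: an intent match without the required scope is a miss.
    if route.required_scope == "public" || caller_scopes.contains(&route.required_scope) {
        Some(route.module_id)
    } else {
        None
    }
}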

9. Best Hybrid Design

The strongest production pattern is not pure CAG everywhere. Use deterministic KV cache modules for the common, static, high-volume path and keep retrieval as a rare fallback for unsupported, stale, or exploratory requests.

1. Query: user request
2. Intent Router: route plus confidence score
3. KV Cache Store: hot path with a 95 percent hit-rate target
4. Fallback Retrieval: rare edge cases
5. Answer: validate and cite if needed
  • Use cache modules for stable policies, tools, schemas, product areas, and docs.
  • Use retrieval for long-tail, recently changed, or unknown-domain queries.
  • Promote repeated fallback results into new cache modules after review.
  • Route by explicit metadata first, classifier second, semantic fallback last.
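
A minimal sketch of the hybrid decision point, assuming the router emits a confidence score; the threshold value and type names are illustrative policy knobs:

// Cache hit when the router is confident and a module exists;
// otherwise take the slow, safe retrieval path.
enum Path {
    CacheModule(String),
    FallbackRetrieval,
}

fn choose_path(routed_module: Option<String>, route_confidence: f32) -> Path {
    const MIN_CONFIDENCE: f32 = 0.9; // assumed hot-path threshold
    match routed_module {
        Some(module) if route_confidence >= MIN_CONFIDENCE => Path::CacheModule(module),
        // Low confidence or no route: fall back, and record the miss so
        // repeated patterns can be promoted into new cache modules.
        _ => Path::FallbackRetrieval,
    }
}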

10. Technical Details

What the KV cache contains

A transformer KV cache stores key and value tensors produced by attention layers for a specific token sequence. It is not portable raw text. It is the model's intermediate attention memory for that exact model, tokenizer, prompt layout, and cache position scheme.
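
Conceptually, a persisted module holds one key tensor and one value tensor per transformer layer, as in the sketch below; the field names are illustrative and real layouts are engine-specific:

// Conceptual shape of a persisted KV cache module.
struct LayerKv {
    // Keys, conceptually shaped [num_heads, num_tokens, head_dim],
    // flattened here for simplicity.
    keys: Vec<f32>,
    // Values, same conceptual shape as keys.
    values: Vec<f32>,
}

struct KvCacheModule {
    model_revision: String,
    tokenizer_revision: String,
    token_count: usize,
    // One entry per transformer layer, in layer order.
    layers: Vec<LayerKv>,
}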

Model specificity

  • A Gemma cache cannot be loaded into Llama.
  • A model version change invalidates cache modules unless compatibility is explicitly guaranteed.
  • A tokenizer, RoPE scaling, attention implementation, or layer layout change can invalidate cache modules.
  • Quantized caches must be validated for answer quality, not only load speed.

Cache metadata

interface KVCacheModule {
  module_id: string;
  tenant_id: string;
  route: string;
  model_name: string;
  model_revision: string;
  tokenizer_revision: string;
  context_checksum: string;
  token_count: number;
  dtype: "fp16" | "bf16" | "q8" | "q4";
  storage_uri: string;
  acl_scope: "public" | "tenant" | "workspace" | "user";
  created_at: string;
  expires_at?: string;
}

Do not break the cache

Cache-aware prompt construction matters. Adding timestamps, random IDs, reordered tool schemas, or inconsistent separators can invalidate reusable prefixes and destroy hit rates. Treat prompt layout as an interface contract.
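
A small sketch of that contract, assuming a simple text envelope; the section markers are illustrative:

// Cache-aware layout: keep volatile values out of the reusable prefix.
fn build_prompt(static_prefix: &str, user_query: &str, request_id: &str, now: &str) -> String {
    // Good: the static prefix is byte-identical across requests, so its
    // precomputed attention states stay valid.
    // Cache-breaking alternative: putting {now} or {request_id} before
    // the static prefix would make every request a unique prefix and
    // force a full prefill.
    format!("{static_prefix}\n### query\n{user_query}\n### meta\nid={request_id} ts={now}")
}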

11. Recommended Rust Architecture

A practical implementation can use Rust for routing, cache metadata, storage orchestration, and serving control, with vLLM or TensorRT-LLM handling optimized model execution where persistent cache support is available or can be integrated.

Rust API Gateway: auth, tenant, request policy
Intent Router: deterministic module selection
Redis Index: module metadata and route table
KV Cache Loader: SSD/RAM to GPU cache pool
LLM Runtime: vLLM or TensorRT-LLM generation

Core services

  • Router service: maps request to cache module by intent, tenant, product area, and ACL.
  • Cache compiler: builds KV modules from reviewed static knowledge and model revisions.
  • Cache registry: stores metadata, checksums, versions, TTLs, and invalidation state.
  • GPU cache pool: manages resident KV modules, eviction, paging, and reuse across requests.
  • Fallback RAG service: handles uncommon or stale queries and feeds module promotion workflows.
axum_gateway
  -> authorize_tenant()
  -> classify_intent()
  -> redis_route_lookup()
  -> load_kv_module()
  -> append_query_tokens()
  -> stream_generation()

12. Cache Compiler Pipeline

The cache compiler is the build system for precompiled knowledge. It turns reviewed static sources into versioned, model-specific KV cache modules. Treat it like a production compiler: deterministic inputs, reproducible outputs, test fixtures, metadata, and rollback.

1. Source Intake: docs, policies, APIs, schemas, workflows, tool definitions
2. Normalize: stable order, markup cleanup, metadata, ACL scope
3. Layout: prompt envelope, separators, module boundaries
4. Tokenize: tokenizer revision and position strategy
5. Prefill: run the target model once over the static context
6. Publish: KV tensors, manifest, checksum, route table

Compiler artifacts

Source Manifest: exact source files, versions, approvals, ACLs, and checksums.
Prompt Layout: system envelope, module separators, route labels, and guardrails.
KV Tensor File: layer-by-layer key and value states for the static context.
Route Metadata: intent labels, tenant scope, model revision, token count, and TTL.
cache_compile --model llama-3.1-70b \
  --tokenizer rev_2026_04 \
  --source ./knowledge/billing \
  --layout ./layouts/policy_agent.v3.json \
  --acl tenant:acme:billing \
  --out s3://kv-cache/acme/billing/v12/
Compiler invariant: the same input manifest, model revision, tokenizer, and layout must produce the same module checksum.
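
A sketch of that invariant as code. It uses std's hasher for brevity; a production pipeline would use a cryptographic hash, since std's default hasher is not guaranteed stable across Rust releases:

// The checksum is a pure function of the manifest, model revision,
// tokenizer revision, and layout. Inputs are hashed in a stable order.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn module_checksum(
    source_manifest: &[(&str, &str)], // (path, per-file checksum)
    model_revision: &str,
    tokenizer_revision: &str,
    layout_id: &str,
) -> u64 {
    let mut h = DefaultHasher::new();
    // Sort sources so file order never changes the module checksum.
    let mut sources = source_manifest.to_vec();
    sources.sort();
    sources.hash(&mut h);
    model_revision.hash(&mut h);
    tokenizer_revision.hash(&mut h);
    layout_id.hash(&mut h);
    h.finish()
}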

13. Storage and Paging Design

KV cache modules can be large. Production systems need storage tiers and paging policies instead of assuming every module is always resident in GPU memory. The goal is to keep the hottest modules close to the decoder while preserving cheaper cold storage for less common domains.

Tier | Role | Typical Latency
GPU Resident | Very hot modules already loaded in the serving process. | microseconds
Host RAM | Warm cache modules staged for fast transfer to GPU memory. | milliseconds
Local NVMe | Large hot set with predictable low-latency reads and checksums. | low milliseconds
Object Store | Durable source of truth for module artifacts and rollback versions. | cold path

Paging policy

  • Pin global system and tool-definition modules in GPU memory when capacity allows.
  • Promote domain modules from NVMe to RAM based on request rate and route confidence.
  • Evict large tenant modules by cost-aware LFU or weighted recency, not plain LRU.
  • Keep checksums next to each tier so corrupted modules fail closed before generation.
eviction_score =
  size_gb * reload_cost_ms
  / max(1, hits_last_15m * route_priority)
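
The same heuristic as a function, with illustrative field names:

// Higher score = better eviction candidate, per the formula above.
struct ResidentModule {
    size_gb: f64,
    reload_cost_ms: f64, // cost to restore from the next tier down
    hits_last_15m: u32,
    route_priority: f64, // higher for SLO-critical routes
}

fn eviction_score(m: &ResidentModule) -> f64 {
    m.size_gb * m.reload_cost_ms / f64::max(1.0, m.hits_last_15m as f64 * m.route_priority)
}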

14. Cache-Aware Scheduler

A normal inference scheduler batches by arrival time and token budget. A cache-aware scheduler also considers which KV modules are already resident, which requests can share modules, and whether loading a module will block higher-priority work.

Incoming requests:
  • billing query - module B
  • billing query - module B
  • shipping query - module S

Scheduler decisions:
  • group by module
  • prefer resident cache
  • load warm modules asynchronously

Execution:
  • batch the shared prefix
  • append per-query suffixes
  • decode independently

Scheduling signals

  • Module residency: GPU, RAM, NVMe, or object store.
  • Module size: load time, transfer cost, and GPU memory pressure.
  • Route confidence: whether a request should use CAG or fallback retrieval.
  • Tenant priority: per-tenant SLOs, budget, and queue fairness.
  • Suffix length: dynamic query and session-state tokens appended after cache restore.
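
A minimal sketch of the residency-first grouping step, assuming modules are identified by string IDs; the types and queue handling are illustrative:

// Batch requests that share a resident module so the shared prefix is
// restored once and only per-query suffixes differ.
use std::collections::HashMap;

struct Request {
    id: u64,
    module_id: String,
    suffix_tokens: usize, // dynamic query plus session-state tokens
}

fn group_by_module(queue: Vec<Request>, resident: &[String]) -> (Vec<Vec<Request>>, Vec<Request>) {
    let mut batches: HashMap<String, Vec<Request>> = HashMap::new();
    let mut needs_load = Vec::new();
    for req in queue {
        if resident.contains(&req.module_id) {
            // Resident module: schedule now, sharing one cache restore.
            batches.entry(req.module_id.clone()).or_default().push(req);
        } else {
            // Non-resident: defer and prefetch the module asynchronously
            // so loading never blocks higher-priority resident work.
            needs_load.push(req);
        }
    }
    (batches.into_values().collect(), needs_load)
}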

15. Versioning, Invalidation, and Governance

CAG is deterministic only when cache lifecycle is deterministic. Every module needs source provenance, model provenance, access scope, test results, and a clear invalidation path.

1. Source Update: policy, schema, docs, workflow, or tool contract changes.
2. Compile vNext: build a new KV module from the reviewed source manifest.
3. Evaluate: golden questions, ACL tests, latency and quality checks.
4. Canary Route: small traffic slice or internal tenants only.
5. Promote: update the route table and keep the previous module for rollback.

Invalidation triggers

Trigger | Invalidates | Required Action
Source checksum change | Domain module | Recompile and rerun golden tests.
Model revision change | All modules for that model | Full rebuild unless compatibility is proven.
Tokenizer change | All tokenized modules | Full rebuild and route table update.
ACL policy change | Private modules by scope | Reissue manifests and deny old routes.
Prompt layout change | Modules using that layout | Recompile and compare outputs.
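
The same table as executable policy, with illustrative variant and action names:

// Invalidation triggers and required actions, mirroring the table above.
enum Trigger {
    SourceChecksumChange { module_id: String },
    ModelRevisionChange { model: String },
    TokenizerChange,
    AclPolicyChange { scope: String },
    PromptLayoutChange { layout_id: String },
}

enum Action {
    RecompileModule(String),
    RebuildAllForModel(String),
    RebuildEverything,
    ReissueManifestsAndDenyRoutes(String),
    RecompileLayoutUsers(String),
}

fn on_trigger(t: Trigger) -> Action {
    match t {
        Trigger::SourceChecksumChange { module_id } => Action::RecompileModule(module_id),
        Trigger::ModelRevisionChange { model } => Action::RebuildAllForModel(model),
        Trigger::TokenizerChange => Action::RebuildEverything,
        Trigger::AclPolicyChange { scope } => Action::ReissueManifestsAndDenyRoutes(scope),
        Trigger::PromptLayoutChange { layout_id } => Action::RecompileLayoutUsers(layout_id),
    }
}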

16. Evaluation and Benchmarking

CAG should be evaluated against both standard RAG and full-prompt long-context inference. Measure speed, cost, correctness, grounding, route accuracy, and rollback safety. A fast cache that answers from the wrong module is worse than a slower retrieval path.

Latency: TTFT, p50/p95/p99, cache load time, GPU transfer time, decode throughput.
Quality: golden answer match, citation faithfulness, refusal correctness, policy compliance.
Routing: intent accuracy, fallback rate, wrong-module rate, ACL denial correctness.
Cost: prefill avoided, GPU occupancy, storage cost, cache rebuild cost.
Freshness: stale-answer rate, source-to-cache lag, invalidation propagation time.
Reliability: cache corruption handling, rollback success, fallback behavior under cache outage.

Benchmark matrix

Baseline | Purpose | Expected CAG Advantage
Traditional RAG | Compare against embedding, retrieval, reranking, prompt assembly, and prefill. | Lower TTFT and lower GPU prefill cost on static domains.
Full long-context prompt | Compare against resending all static context every request. | Much lower repeated prefill work.
Pure classifier routing | Check whether routing alone is enough without precompiled context. | Better grounded answers with static knowledge loaded.
Fallback retrieval only | Validate fallback quality and identify module gaps. | CAG handles common paths, fallback handles unknown paths.

17. Limitations and Risks

Risk | Mitigation
Model-specific cache files | Version every module by model, tokenizer, attention implementation, and prompt layout.
Stale knowledge | Use build pipelines, source checksums, TTLs, and invalidation events.
Large cache storage | Use quantization, module boundaries, paging, hot/cold tiers, and cache admission policies.
Router errors | Use deterministic rules where possible, confidence thresholds, tests, and fallback retrieval.
Security leakage | Namespace by tenant and ACL; never share private cache modules across scopes.
Prompt layout fragility | Treat module layout as an API and run regression tests before cache publication.
Critical rule: never load a cache module unless the caller is authorized for every source document used to build that module.

18. Research Direction

This field is moving quickly. The most relevant work focuses on persistent KV cache, modular attention reuse, cache-aware schedulers, memory-paged inference, SSD-backed cache restoration, prefix trees for prompt reuse, and quantized KV storage.

Work | Relevance
TurboRAG | Precomputes KV caches for retrieved text chunks and reports large TTFT reductions versus standard RAG.
Prompt Cache | Defines reusable prompt modules and reuses attention states for low-latency inference.
Agent Memory Below the Prompt | Explores persistent quantized KV cache for multi-agent inference and direct cache restoration.

A useful mental model is compiled inference context: similar to compiled query plans, bytecode, or precomputed indexes in databases, but applied to LLM attention state.

19. Implementation Roadmap

Phase 1: Static module candidates

Identify stable knowledge domains such as policies, API docs, product catalog slices, tool definitions, and workflow manuals.

Phase 2: Cache compiler prototype

Build a pipeline that canonicalizes source content, tokenizes it for one model, computes KV cache files, and records metadata.

Phase 3: Deterministic router

Start with explicit rule-based routing and ACL checks. Add classifier-assisted routing only after route labels are stable.

Phase 4: Runtime integration

Integrate cache restoration into the model serving layer and measure TTFT, throughput, GPU memory pressure, and answer quality.

Phase 5: Hybrid fallback and promotion

Keep retrieval for low-confidence or stale requests. Promote repeated fallback patterns into reviewed cache modules.

Starting recommendation: use Rust for routing and cache orchestration, Redis for module metadata, SSD/RAM for cache storage tiers, and a single model family until the cache lifecycle is proven.