Precompiled RAG & Cache-Augmented Generation
Run RAG once, offline, to compile your static knowledge into verified answers, contexts, and prompt blocks. Serve production traffic from a deterministic cache with the LLM only doing final formatting — no embeddings, no vector search, no retrieval at request time.
01 Overview
If your knowledge is mostly static (policies, FAQs, product specs, compliance docs, support playbooks), there is no reason to re-do retrieval on every request. Pre-compile the answers and serve them.
The pattern goes by several names — precompiled RAG, cache-augmented generation (CAG) — pick whichever your team prefers.
What you get in exchange for giving up request-time retrieval:
- Sub-100ms responses for cache hits — no vector search, no rerank, no retrieval RTT.
- Order-of-magnitude lower cost per query — most requests bypass the LLM entirely or use a tiny formatting prompt.
- Deterministic answers — the same question gets the same vetted answer every time, which is a hard requirement in regulated, support, and compliance contexts.
- Trivial governance — every served answer is reviewable, versioned, and rollback-able. No surprise hallucinations from a fresh retrieval.
02 The mental shift
Classic RAG treats every request as a fresh retrieval problem. Precompiled RAG flips it: most requests are answer-lookups, and retrieval only happens when the corpus changes.
| Classic RAG | Precompiled RAG |
|---|---|
| Embed query at request time | Run RAG once during build |
| Vector search every request | Store compiled answers / contexts |
| Rerank candidates | Lookup by cache key at request time |
| Stuff context into LLM | LLM only formats (or skip entirely) |
| Answer is non-deterministic | Answer is reviewable and stable |
| Cost scales with QPS | Cost scales with build frequency, not QPS |
| Latency 1–5 seconds typical | Latency 50–200ms typical |
03 Architecture
Two phases, one cache between them. The "compiler" runs offline; the "server" runs online; the cache is the contract.
In production: no vector DB. No embedding. No retrieval. All of that lives in the offline phase.
04 Three cache layers
Layered cache — try the cheapest, fastest path first; fall through to richer layers if it misses. Each layer has a different shape and SLA.
- L1 — exact answer cache. Keyed by canonical question or hash. Returns the final answer text directly — the LLM is bypassed entirely.
- L2 — intent template cache. Keyed by classified intent + entity slots. Maps thousands of paraphrased questions to one approved answer template; small prompt fills slots.
- L3 — topic prompt-block cache. Keyed by topic. Returns a curated static context block (the "knowledge brief") that the LLM uses to compose a fresh answer.
How the layers compose
- Normalize the query.
- Hash the canonical form → check L1. Hit? Return. Done. (Key construction is sketched after this list.)
- Classify intent & extract slots → check L2. Hit? Render the template (small prompt) and return.
- Map intent to topic → fetch L3 prompt block, send small completion request, return.
- Miss everything? Fall back to a deterministic safe answer ("I'm not sure, here's how to reach support"), log the miss for the next compile.
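A minimal sketch of the first two rungs' plumbing — canonicalize, hash, build the L1 key. The exact normalization rules here are an assumption; whatever you choose must match what the compiler used when it wrote the keys:

```python
# Illustrative L1 key construction — normalization must mirror the compiler's.
import hashlib
import re

def normalize(q: str) -> str:
    q = q.lower().strip()
    q = re.sub(r"[^\w\s]", "", q)   # strip punctuation
    return " ".join(q.split())      # collapse whitespace

def l1_key(q: str, version: str, lang: str = "en") -> str:
    h = hashlib.sha256(normalize(q).encode()).hexdigest()[:12]
    return f"knowledge:{version}:{lang}:exact:{h}"
```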
05 Offline compiler
The build pipeline. Runs on a schedule (daily / weekly) or on knowledge-source change. Output is a versioned, immutable cache snapshot.
Stages
- Source ingest. Pull raw documents (docs, KB, tickets, policies, transcripts) into object storage.
- Parse & chunk. Layout-aware parsers; structural chunking (heading-based) preserves boundaries.
- Embed & index. Build a temporary vector index — used only during this build.
- Topic discovery. Cluster chunks; identify FAQ-worthy topics, common entity sets, decision points.
- Question synthesis. For each topic, an LLM generates the N most-likely user questions.
- Answer generation. For each question: retrieve top-K, rerank, generate a grounded answer with citations.
- Verification. A second LLM (or rule-based check) verifies the answer is grounded in the retrieved context. Drop unverified answers (see the sketch after this list).
- Human review. Sample or full review of generated answers for high-stakes domains.
- Compile artifacts. Emit three artifact families: exact Q&A pairs (L1), intent-templated answers (L2), topic prompt blocks (L3).
- Snapshot & publish. Tag with a version (e.g. `knowledge:v23:2026-04-26`), publish atomically to the cache store.
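Stage 7 is the one most worth automating early. A minimal sketch of an LLM-as-judge groundedness check, assuming an OpenAI-style client; the judge prompt, model choice, and record fields are illustrative:

```python
# Sketch: verification stage — drop answers not grounded in their context.
from openai import OpenAI

client = OpenAI()

JUDGE = """Reply YES if every claim in ANSWER is supported by SOURCES, else NO.

SOURCES:
{sources}

ANSWER:
{answer}"""

def is_grounded(answer: str, sources: str) -> bool:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE.format(sources=sources, answer=answer)}],
        temperature=0,
        max_tokens=2,
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")

# candidates -> snapshot: keep only verified artifacts
# artifacts = [a for a in candidates if is_grounded(a["answer"], a["context"])]
```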
Compiled artifacts schema
```
# L1 — exact answer record
{
  "id": "refund.30day_window",
  "version": "v23",
  "canonical_question": "Can I get a refund after 30 days?",
  "answer": "Refunds are available within 30 days of purchase ...",
  "sources": ["policies/refunds.md#30day"],
  "approved_by": "support-lead",
  "approved_at": "2026-04-26T10:00:00Z"
}

# L2 — intent template
{
  "intent": "refund.eligibility",
  "version": "v23",
  "slots": ["order_age_days", "product_category"],
  "template": "For {product_category} purchased {order_age_days} days ago, ...",
  "static_context": "Refunds policy v3.2 ..."
}

# L3 — topic prompt block
{
  "topic": "refunds",
  "version": "v23",
  "context": "Refund policy summary, exception list, escalation paths ...",
  "tokens": 1240,
  "freshness": "2026-04-26"
}
```
06 Cache store design
The cache is the contract between compile time and serve time. Pick storage by access pattern, durability, and how the snapshot is published.
| Store | Best for | Notes |
|---|---|---|
| Redis | L1 / L2 — sub-ms reads, high QPS | Use as the hot tier; expire by version key prefix on rollout. |
| Postgres | L2 / L3 — relational lookups, slot filling | Good when you need joins on entity slots or to tie into existing app DB. |
| JSON file in object storage | L3 prompt blocks; low-QPS L1 | Versioned snapshot is just an immutable S3 object — trivial rollback. |
| CDN edge | Anonymous / personalization-free L1 | Cloudflare KV / Workers, Fastly Edge Dictionary — single-digit ms global. |
| Provider prompt cache | L3 — large static contexts | OpenAI / Anthropic / Gemini prompt caching reuses tokens across requests. |
| SQLite (embedded) | Small deployments, edge / mobile | Ship the snapshot with the app binary; perfect for offline. |
Recommended hot/cold split
The hot tier handles nearly all production reads; warm and cold tiers exist for cases where Redis can't answer alone (slot joins, very large prompt blocks). The provider tier (Anthropic `cache_control`, Gemini `cachedContent`) is orthogonal — it caches the part of the prompt the LLM sees, not what the cache layer returns.
07 Runtime serving
The serving path is intentionally boring. No retrieval, no embedding, no vector store. Just normalize → classify → look up → format.
Steps
- Normalize. Trim, lowercase, strip punctuation, expand contractions, language-detect.
- Hash. Compute the canonical-form hash → L1 exact lookup.
- Classify (only if L1 misses). Run intent classifier (small fine-tuned model, regex grammar, or LLM-based) → returns `{domain, intent, slots}`.
- Build cache key. `domain + intent + version + language [+ tenant]`.
- L2 lookup. Hit? Render template with slots; small completion call (formatting only).
- L3 lookup. Otherwise fetch the topic prompt block and call the LLM with that as the entire context.
- Fallback. If everything misses, return a deterministic safe answer + escalate / log.
- Audit. Log every served answer with the layer hit, version, and cache key. Sample for review.
Production guardrails on the runtime path
- Per-layer SLO. L1 <30ms, L2 <300ms, L3 <2s. If you blow SLO, drop to fallback (see the sketch after this list).
- Cap LLM tokens. A formatting prompt should never need 8K input — set hard limits.
- Force determinism. `temperature=0`, fixed seed where supported, structured output.
- Strip the retrieval surface. The runtime container should not carry vector libraries or embedding clients — there's nothing for them to do.
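One way to wire the SLO and token caps together, assuming the openai v1 client (its per-request `timeout` argument does the auto-degrade); the budget value mirrors the L2 SLO above:

```python
# Sketch: budgeted formatting call — blow the SLO and the caller falls
# through to the deterministic fallback instead of waiting.
from openai import OpenAI, APITimeoutError

LLM = OpenAI()

def format_answer(prompt: str, budget_s: float = 0.3) -> str | None:
    try:
        out = LLM.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,       # determinism lever
            max_tokens=300,      # formatting never needs a long output
            timeout=budget_s,    # per-request budget = the layer SLO
        )
        return out.choices[0].message.content
    except APITimeoutError:
        return None              # caller drops to the fallback rung
```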
08 Provider prompt caching
For API-hosted models you cannot persist the model's KV cache yourself, but every major provider exposes server-side prompt caching that does the equivalent — reusing the same large static prefix across requests at large discounts.
| Provider | Mechanism | Typical effect |
|---|---|---|
| OpenAI | Automatic prompt caching on prefixes ≥ 1024 tokens | Up to ~80% lower latency, ~50–90% lower input-token cost on cached portion. Cache TTL ~5–60 minutes. |
| Anthropic (Claude) | Explicit cache_control markers on prompt blocks | Cached input tokens billed at ~10% of normal; up to 90% cost saving on the cached prefix. 5-min default TTL, 1-hour extended. |
| Google (Gemini) | Explicit context caching API; reuse caches across calls | Cached tokens billed separately at lower rates; you control TTL. Designed for very large reusable contexts. |
| Azure OpenAI | Same as OpenAI prompt caching, plus PTU reservations | PTU + prompt cache combine for predictable latency at scale. |
| Self-hosted (vLLM, SGLang, TensorRT-LLM) | Real KV cache + RadixAttention prefix sharing | You can hold full KV state across requests — biggest savings, but you operate the infra. |
Anatomy of a cacheable prompt
Providers cache contiguous prefixes only. Anything that varies per request (timestamps, request IDs, session counters) breaks the cache if it appears before the static portion ends — keep volatile tokens at the very end.
How to structure prompts for max cache hits
- Stable content first. System prompt → tool catalog → static knowledge → conversation history → user input. The provider can only cache a contiguous prefix.
- Avoid timestamps (or any per-request varying token) in the cached region. They invalidate the prefix every call.
- Pin big knowledge blocks via the provider's caching mechanism (Anthropic `cache_control`, Gemini `cachedContent`).
- Refresh the cache just before TTL expiry on hot prefixes — a single keep-alive request preserves the discount (see the sketch after the code block below).
```python
# Anthropic — explicit cache_control on a static knowledge block
import anthropic

client = anthropic.Anthropic()

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": KNOWLEDGE_BRIEF,  # 8K tokens of compiled context
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": "You are a concise support agent."},
    ],
    messages=[{"role": "user", "content": user_question}],
)
```
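And the keep-alive tip in code — re-send the identical prefix shortly before TTL expiry. The 240-second interval targets Anthropic's 5-minute ephemeral TTL and is an assumption; `client` and `KNOWLEDGE_BRIEF` are the names from the block above:

```python
# Sketch: refresh the provider-side prompt cache with a 1-token ping.
import time

def keep_cache_warm(interval_s: int = 240) -> None:
    while True:
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,  # minimal completion — we only need the prefix touched
            system=[
                {
                    "type": "text",
                    "text": KNOWLEDGE_BRIEF,
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "You are a concise support agent."},
            ],
            messages=[{"role": "user", "content": "ping"}],
        )
        time.sleep(interval_s)
```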
09 Cache key design
The cache key is the contract. Wrong key = collisions, leaks, or zero hit rate. Design deliberately and version it like an API.
Recommended key shape
`knowledge:{version}:{language}[:{tenant}]:{layer-specific id}` — every key lives under the fixed namespace prefix (`knowledge`).
Examples
```
# L1 exact answer
knowledge:v23:en:refund.30day_window

# L2 intent template (multi-tenant + slot hash)
knowledge:v23:en:tenant_acme:refund.eligibility:slot_a8f71c

# L3 topic prompt block
knowledge:v23:en:topic:refunds
```
Rules
- Always include the version. Otherwise rollout becomes a flush-and-pray.
- Include language. Or at minimum, route per-language to per-language caches.
- Include tenant when answers vary. Per-tenant policies must not leak across tenants.
- Hash the entity slots — don't put raw user data in a key (PII risk); a sketch follows this list.
- Avoid timestamps in keys — destroys hit rate.
- Treat the key namespace as a public API. Document it, version it, deprecate it intentionally.
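A sketch of the slot-hash rule, assuming JSON-serializable slots; the truncation length is arbitrary but must be consistent between compiler and runtime:

```python
# Sketch: hash entity slots so raw user data (PII) never lands in a key.
import hashlib
import json

def slot_hash(slots: dict) -> str:
    canonical = json.dumps(slots, sort_keys=True, separators=(",", ":"))
    return "slot_" + hashlib.sha256(canonical.encode()).hexdigest()[:6]

# -> a "slot_xxxxxx"-shaped token for keys like
#    knowledge:v23:en:tenant_acme:refund.eligibility:{slot_hash(slots)}
```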
10 Versioning & rollout
Snapshots are immutable. You publish a new version side-by-side, shift traffic gradually, and keep the old version warm for instant rollback.
The publish flow
- Compiler produces snapshot `v24`; uploads to S3 as `knowledge/v24/...`.
- Cache loader warms Redis under the `knowledge:v24:*` namespace (writes only — no read traffic yet).
- Smoke tests: regression suite of 200+ canonical questions runs against v24; fail = abort.
- Canary: route 1% → 10% → 50% → 100% via a feature flag on the runtime `active_version`.
- Bake: keep `v23` warm in Redis for 24–72 hours so rollback is instant.
- GC: after the bake window, evict old versions by namespace prefix.
The runtime resolves the active version pointer (`knowledge:active`) at request entry — that's the indirection point for canary, rollback, and per-tenant pinning. Flipping it from v23 → v24 is the entire rollout step; rollback is the same flip in reverse, executable in seconds while v23 is still warm.
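The flip itself can be a single Redis operation — a minimal sketch, assuming the pointer lives at `knowledge:active`:

```python
# Sketch: promote and rollback are one pointer write each.
import redis

r = redis.Redis(decode_responses=True)

def promote(new_version: str) -> str | None:
    """Flip the active pointer; return the old version for instant rollback."""
    return r.getset("knowledge:active", new_version)

previous = promote("v24")
# rollback while v23 is still warm:
# r.set("knowledge:active", previous)
```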
11 Determinism & fallback rules
A cache-only system is only as good as its deterministic fallback. When the cache can't answer, what happens must be predictable, safe, and auditable.
Fallback ladder
Try each rung top-to-bottom. Unconstrained "just let the model handle it" generation is deliberately not a rung — it's the most tempting mistake, so make it unreachable in code, not just in policy.
Determinism levers
- `temperature=0` on every formatting call.
- Fixed seed where the provider supports it.
- Structured outputs (JSON schema) for any answer that gets parsed downstream.
- Stop sequences to bound length.
- Snapshot tests against the regression suite — any output drift across LLM versions blocks the deploy.
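The snapshot-test lever in miniature — a sketch assuming a JSONL suite of `{"question", "expected"}` records and the `serve()` handler from the end-to-end example in section 14:

```python
# Sketch: block the deploy on any output drift against the canonical suite.
import json

def snapshot_failures(suite_path: str = "regression_suite.jsonl") -> list[str]:
    failures = []
    with open(suite_path) as f:
        for line in f:
            case = json.loads(line)
            if serve(case["question"]) != case["expected"]:
                failures.append(case["question"])
    return failures  # publish gate: a non-empty list fails the build
```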
12 vs Classic RAG
Side-by-side decision matrix. Both patterns are valid; pick by the shape of your data and the answers you must serve.
Per-request flow timing
L2 hits add a small formatting LLM call (~200ms total). L3 hits look much like classic RAG end-to-end but skip embedding + vector + rerank. The big latency win is concentrated on L1, which is also where the bulk of production traffic lives in well-tuned systems.
Decision matrix
| Dimension | Classic RAG | Precompiled RAG |
|---|---|---|
| When retrieval runs | Every request | Build time, once per snapshot |
| p99 latency | 1–5 s | 30 ms (L1) – 2 s (L3) |
| Cost per query | Embedding + vector + rerank + LLM | Cache lookup + small LLM call (or none) |
| Determinism | Variable per request | Same input → same approved output |
| Reviewability | Audit individual generations | Audit the snapshot once, cover all queries |
| Freshness | Live | Bound by recompile cadence |
| Best for | Open-ended Q&A on changing corpora | FAQ, policy, support, compliance, regulated answers |
| Operational complexity | Vector DB + indexer + retriever + reranker live | Compiler runs offline; runtime is a key-value lookup |
| Failure surface | Bad chunk, bad rerank, bad LLM = bad answer | Bad snapshot is reverted in seconds |
13 When to use this pattern
Not every workload should be precompiled. Match the pattern to the data shape.
Strong fit
- Customer support FAQ & policies
- Compliance and regulatory Q&A
- Product documentation chatbot
- HR / legal / finance internal helpdesk
- Onboarding and how-to assistants
- Voice IVR with bounded intents
- Mobile / edge assistants (offline)
- Healthcare patient-education answers
Weak fit (use classic or hybrid)
- Open-ended research over a live corpus
- Real-time data (prices, balances, sensors)
- Per-user personalized synthesis
- Large unbounded long-tail queries
- Conversational agents that mutate state
- Code generation against changing repos
- Search-as-you-type (use embeddings)
Hybrid: when to mix
The most common production shape is hybrid — precompiled for the head of the distribution (the 80% of repeated questions) and live retrieval for the long tail. Route based on whether the cache layer hits; record long-tail questions for the next compile cycle so the head grows over time.
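A sketch of that router; `serve_cached` (the L1→L2→L3 ladder returning `None` on a full miss), `live_rag`, and `log_longtail` are assumed helpers:

```python
# Sketch: precompiled head, live-RAG tail, misses feed the next compile.
def answer(question: str) -> str:
    cached = serve_cached(question)   # L1 -> L2 -> L3; None on full miss
    if cached is not None:
        return cached
    log_longtail(question)            # grows the head at the next rebuild
    return live_rag(question)         # classic retrieval for the tail
```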
14 End-to-end example
A minimal but production-shaped cache-only serving path. Each box maps onto the layers above.
```python
# serve.py — cache-only request handler
import hashlib
import json

import redis
from openai import OpenAI

R = redis.Redis(decode_responses=True)
LLM = OpenAI()
ACTIVE = R.get("knowledge:active") or "v23"

def normalize(q: str) -> str:
    return " ".join(q.lower().strip().split())

def key_l1(q: str, lang: str) -> str:
    h = hashlib.sha256(q.encode()).hexdigest()[:12]
    return f"knowledge:{ACTIVE}:{lang}:exact:{h}"

def classify_intent(q: str) -> dict:
    # small fine-tuned classifier or rule grammar
    ...
    return {"domain": "support", "intent": "refund.eligibility", "slots": {...}}

def serve(question: str, lang: str = "en") -> str:
    nq = normalize(question)

    # L1 — exact answer
    hit = R.get(key_l1(nq, lang))
    if hit:
        log_serve("L1", key_l1(nq, lang))
        return hit

    # L2 — intent template
    intent = classify_intent(nq)
    k2 = f"knowledge:{ACTIVE}:{lang}:intent:{intent['intent']}"
    record = R.get(k2)
    if record:
        rec = json.loads(record)
        prompt = rec["template"].format(**intent["slots"])
        out = LLM.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=300,
        )
        log_serve("L2", k2)
        return out.choices[0].message.content

    # L3 — topic prompt block
    topic = intent_to_topic(intent["intent"])
    k3 = f"knowledge:{ACTIVE}:{lang}:topic:{topic}"
    block = R.get(k3)
    if block:
        out = LLM.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": block},
                {"role": "user", "content": question},
            ],
            temperature=0,
            max_tokens=600,
        )
        log_serve("L3", k3)
        return out.choices[0].message.content

    # Fallback — deterministic safe answer + escalate
    log_miss(question, intent)
    return SAFE_ANSWER[intent["domain"]]
```
Notice what's not here: no embedding model, no vector store client, no reranker. The runtime container is small and boring — exactly the goal.
15 Refresh & invalidation
When source content changes, you compile a new snapshot — never patch the live cache in place. Atomic publish via the active-version pointer.
Triggers
- Scheduled. Nightly / weekly recompile from the latest source content.
- Source change. Webhook from CMS / docs repo / Confluence triggers a partial recompile of affected topics.
- Miss-driven. When L1/L2/L3 miss rate exceeds a threshold for any topic, queue a recompile (sketched after this list).
- Manual. Compliance / legal-driven hot fix to a specific answer — fast lane bypasses the full compile.
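A sketch of the miss-driven trigger, assuming per-topic hit/miss counters in Redis; the threshold and minimum-sample values are illustrative:

```python
# Sketch: queue a partial recompile when a topic's miss rate spikes.
import redis

r = redis.Redis(decode_responses=True)

def maybe_recompile(topic: str, threshold: float = 0.05, min_total: int = 100) -> None:
    hits = int(r.get(f"stats:{topic}:hits") or 0)
    misses = int(r.get(f"stats:{topic}:misses") or 0)
    total = hits + misses
    if total >= min_total and misses / total > threshold:
        r.lpush("compile:queue", topic)   # picked up by the offline compiler
```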
Partial vs full recompile
- Partial: Recompile only the affected topics. Faster, cheaper, but the new snapshot still gets a new version label and goes through the full canary.
- Full: Rebuild the whole snapshot. Slower but guaranteed consistent — required when policies change or a global verifier is updated.
16 Eval & quality gates
The eval surface for precompiled RAG is much friendlier than for live RAG — you evaluate the snapshot once, and the result holds for every query that hits it.
Build-time evals
- Groundedness per generated answer — does it follow from the retrieved source?
- Citation correctness — does every claim cite a real source span?
- Coverage — what fraction of the regression query suite is answerable from the new snapshot?
- Drift — diff against the previous snapshot; flag any answer that changed materially for human review.
- Safety — content filter on every generated artifact; reject before publish.
- Bias / fairness for high-stakes domains — automated probes across protected attributes.
Runtime evals
- Layer hit rate per domain (L1 / L2 / L3 / fallback).
- Miss log classification — what kind of questions are we failing to answer?
- Latency p50/p95/p99 per layer.
- User satisfaction signal (thumbs / resolved-without-escalation rate).
Quality gates before publish
- Regression suite ≥ 99% pass.
- Coverage on production miss log ≥ target (e.g. 90%).
- Drift report manually approved by domain owner.
- Safety scan — zero critical findings.
- Canary smoke at 1% — error rate within tolerance.
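The gates compose into one boolean the publish job can enforce — a sketch where every check function stands in for a real eval job (all names and the `TOLERANCE` constant are assumptions):

```python
# Sketch: the publish gate as a single all-or-nothing check.
def publish_gate(snapshot: str) -> bool:
    return all([
        regression_pass_rate(snapshot) >= 0.99,      # regression suite
        miss_log_coverage(snapshot) >= 0.90,         # coverage target
        drift_report_approved(snapshot),             # human sign-off
        safety_critical_findings(snapshot) == 0,     # safety scan
        canary_error_rate(snapshot, traffic=0.01) <= TOLERANCE,
    ])
```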
17 Cost analysis
The cost shape changes fundamentally — from "scales with QPS" to "scales with rebuild frequency + cache size."
Where the dollars go
| Cost line | Classic RAG | Precompiled RAG |
|---|---|---|
| Per-query embedding | 1× per query | 0 at runtime |
| Per-query vector search | 1× per query | 0 at runtime |
| Per-query rerank | 1× per query | 0 at runtime |
| Per-query LLM tokens | large input + output | tiny (formatting only) |
| Vector DB infra | always-on, scales with QPS | none at runtime |
| Compiler runs | n/a | per-rebuild fixed cost |
| Cache storage | n/a | small, predictable |
Worked example
Suppose 1M queries/month, 70% hit L1, 25% hit L2, 5% hit L3:
- Classic RAG — every query: ~3K input tokens, ~300 output tokens. ~$0.005 each → $5,000/mo, plus vector DB infra.
- Precompiled RAG — L1 free, L2 ~50 input/200 output ($0.0001), L3 ~2K input/300 output ($0.0015). Weighted: ~$100/mo + ~$300/mo compiler runs + cache infra — roughly 12× cheaper on these numbers, and the gap widens once the always-on vector DB is counted.
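The weighted arithmetic, spelled out:

```python
# Worked example: 1M queries/month at the stated hit rates and unit costs.
q = 1_000_000
precompiled = q * (0.70 * 0 + 0.25 * 0.0001 + 0.05 * 0.0015)  # = $100/mo
classic = q * 0.005                                            # = $5,000/mo
print(classic / (precompiled + 300))                           # ~12.5x on LLM spend
```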
The bigger the head of your query distribution, the bigger the savings. For a long-tailed corpus, hybrid splits the difference. These are representative numbers — your mileage varies with model choice, hit-rate distribution, and provider prompt-cache discounts (which can drop the precompiled cost further).
18 Anti-patterns
Mistakes that look reasonable but undermine the whole pattern.
| Anti-pattern | Why it hurts | Do this instead |
|---|---|---|
| Shipping a vector store to runtime "just in case" | Defeats the simplicity and operational cost wins. | Strip retrieval from the runtime container; route misses to a deterministic fallback. |
| Patching cache entries in place | Torn reads, inconsistent answers across users, no audit. | Compile new snapshot, publish atomically by switching the active pointer. |
| Cache key without version | Rollouts and rollbacks become destructive flushes. | Always include version in the key; keep ≥2 versions warm. |
| Letting the LLM answer freely on miss | Reintroduces hallucinations into a "deterministic" system. | Deterministic safe answer + escalation, never free generation. |
| Time-of-day in the cached prefix | Invalidates the provider prompt cache every request. | Move time-varying tokens after the cached region or out of the prompt entirely. |
| Compiling without a regression suite | You can't prove a new snapshot is at least as good. | Maintain a labeled regression query set; CI it into the publish gate. |
| Skipping the miss log | The cache stagnates; long tail never gets folded back into the head. | Log every miss with intent + slots; review before each rebuild. |
| Cross-tenant cache without tenant in the key | One tenant's policy answer leaks to another. | Per-tenant key prefix or per-tenant cache namespace. |
| Generating without verification | Hallucinated artifacts get baked into the cache and served to all users. | Second-pass verifier + sample human review before publish. |
| One giant L3 prompt block per topic | Costs more tokens per call; cache eviction churn. | Split topics finely; let intent classifier route to the smallest relevant block. |
19 Production checklist
Before declaring a precompiled-RAG system production-ready.
Compiler
- Idempotent — same source produces the same snapshot.
- Deterministic versioning (date or monotonic counter).
- Verifier rejects ungrounded artifacts.
- Snapshots stored immutably in object storage.
- Runs on schedule + on source-change webhooks.
Cache
- Keyed by `domain + intent + version + language [+ tenant]`.
- ≥ 2 versions kept warm for instant rollback.
- Active version controlled by a single small config key.
- Per-tenant isolation verified by automated test.
Runtime
- L1 → L2 → L3 → fallback ladder implemented.
- No vector libraries / embedding clients in the runtime image.
- SLOs enforced per layer with auto-degrade.
- Determinism: `temperature=0`, structured outputs.
- Every served answer is audit-logged with layer hit + version.
Eval & ops
- Regression suite of canonical queries gates every publish.
- Drift report against previous snapshot reviewed before promotion.
- Miss log monitored; long-tail questions feed back into next rebuild.
- Per-layer hit-rate dashboards live.
- Provider prompt-cache hit rate tracked (if applicable).
- Documented runbook for emergency rollback (single config key flip).