Pattern Reference

Precompiled RAG & Cache-Augmented Generation

Run RAG once, offline, to compile your static knowledge into verified answers, contexts, and prompt blocks. Serve production traffic from a deterministic cache with the LLM only doing final formatting — no embeddings, no vector search, no retrieval at request time.

// Use RAG as a compiler, not as a runtime pipeline.

01 Overview

If your knowledge is mostly static (policies, FAQs, product specs, compliance docs, support playbooks), there is no reason to re-do retrieval on every request. Pre-compile the answers and serve them.

The pattern goes by several names — pick whichever your team prefers:

  • Precompiled RAG
  • Cache-Augmented Generation
  • Static Knowledge Compilation
  • Deterministic Knowledge Cache
  • RAG-as-Compiler

What you get in exchange:

RAG is a build step. Inference is a lookup.
Your pipeline gets sliced into "compile-time" and "serve-time," with very different SLAs, costs, and review processes. Treat them as separate systems.

02 The mental shift

Classic RAG treats every request as a fresh retrieval problem. Precompiled RAG flips it: most requests are answer-lookups, and retrieval only happens when the corpus changes.

Classic RAG (runtime retrieval)
  • Embed query at request time
  • Vector search every request
  • Rerank candidates
  • Stuff context into LLM
  • Answer is non-deterministic
  • Cost scales with QPS
  • Latency 1–5 seconds typical
Precompiled RAG (build-time retrieval)
  • Run RAG once during build
  • Store compiled answers / contexts
  • Lookup by cache key at request time
  • LLM only formats (or skip entirely)
  • Answer is reviewable and stable
  • Cost scales with build frequency, not QPS
  • Latency 50–200ms typical
This pattern is for static / slowly-changing knowledge.
If answers depend on real-time data (account balance, inventory, weather), this isn't the pattern — at minimum you'll need a hybrid approach where dynamic facts are looked up live and only the explanatory wrapper comes from the cache.

03 Architecture

Two phases, one cache between them. The "compiler" runs offline; the "server" runs online; the cache is the contract.

Offline · Build time
Source: raw static data (docs, policies, FAQs, transcripts)
RAG pipeline: chunk → clean → embed → retrieve → rank → summarize → verify
Generate: canonical Q&A · facts · policies · topic prompt blocks
Review: human / automated quality gates · versioned snapshot
Publish: cache store (Redis · Postgres · JSON · CDN · provider prompt cache)
Online · Runtime
Request: user question
Normalize: lowercase · trim · canonicalize · language detect
Classify: intent classifier → domain + intent + entity slots
Lookup: cache key = domain + intent + version + language
Cached: L1 exact answer · L2 intent answer · L3 prompt block
Format: LLM applies tone / persona / variable slots — small prompt only
Response: final answer to user

In production: no vector DB. No embedding. No retrieval. All of that lives in the offline phase.

04 Three cache layers

Layered cache — try the cheapest, fastest path first; fall through to richer layers if it misses. Each layer has a different shape and SLA.

L1 · Exact answer cache

Keyed by canonical question or hash. Returns the final answer text directly — the LLM is bypassed entirely.

p99: < 30 ms · cost: ~$0
L2 · Intent answer cache

Keyed by classified intent + entity slots. Maps thousands of paraphrased questions to one approved answer template; small prompt fills slots.

p99: 100–300 ms · cost: ~$0.0001
L3 · Prompt & context cache

Keyed by topic. Returns a curated static context block (the "knowledge brief") that the LLM uses to compose a fresh answer.

p99: 800 ms – 2 s · cost: ~$0.001

How the layers compose

  1. Normalize the query.
  2. Hash the canonical form → check L1. Hit? Return. Done.
  3. Classify intent & extract slots → check L2. Hit? Render the template (small prompt) and return.
  4. Map intent to topic → fetch L3 prompt block, send small completion request, return.
  5. Miss everything? Fall back to a deterministic safe answer ("I'm not sure, here's how to reach support"), log the miss for the next compile.
Cache misses are training data.
Every L1/L2/L3 miss is a question your knowledge cache should have been able to answer. Log them, review periodically, and feed them back into the offline compiler as new canonical Q&A.
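
A minimal sketch of that feedback loop, assuming Redis is the serve-time store; the stream name and field layout here are illustrative, not part of the pattern:

# miss_log.py — record every cache miss for the next compile cycle
import json, time
import redis

R = redis.Redis(decode_responses=True)

def log_miss(question: str, intent: dict | None, deepest_layer: str) -> None:
    # Append to a capped stream; the offline compiler drains this before
    # each rebuild and promotes frequent misses to new canonical Q&A.
    R.xadd(
        "knowledge:miss_log",
        {
            "question": question,
            "intent": json.dumps(intent or {}),
            "deepest_layer_tried": deepest_layer,
            "ts": str(time.time()),
        },
        maxlen=100_000,  # bound the stream; old misses age out
    )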

05 Offline compiler

The build pipeline. Runs on a schedule (daily / weekly) or on knowledge-source change. Output is a versioned, immutable cache snapshot.

01 Source ingest · docs · KB · tickets
02 Parse & chunk · layout-aware
03 Embed & index · temporary vectors
04 Topic discovery · cluster · classify
05 Question synthesis · LLM generates
06 Answer generation · retrieve · ground · cite
07 Verification · groundedness check
08 Human review · sample / full
09 Compile artifacts · L1 · L2 · L3
10 Snapshot & publish · versioned · atomic

Stages

  1. Source ingest. Pull raw documents (docs, KB, tickets, policies, transcripts) into object storage.
  2. Parse & chunk. Layout-aware parsers; structural chunking (heading-based) preserves boundaries.
  3. Embed & index. Build a temporary vector index — used only during this build.
  4. Topic discovery. Cluster chunks; identify FAQ-worthy topics, common entity sets, decision points.
  5. Question synthesis. For each topic, an LLM generates the N most-likely user questions.
  6. Answer generation. For each question: retrieve top-K, rerank, generate a grounded answer with citations.
  7. Verification. A second LLM (or rule-based check) verifies the answer is grounded in the retrieved context. Drop unverified answers.
  8. Human review. Sample or full review of generated answers for high-stakes domains.
  9. Compile artifacts. Emit three artifact families: exact Q&A pairs (L1), intent-templated answers (L2), topic prompt blocks (L3).
  10. Snapshot & publish. Tag with a version (e.g. knowledge:v23:2026-04-26), publish atomically to the cache store.
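
Stage 7 is the one most worth seeing in code. A self-contained sketch, assuming an LLM-as-judge check; the prompt wording, model choice, and GROUNDED/UNGROUNDED protocol are assumptions, and a rule-based verifier can stand in for the call:

# verify.py — stage 7: drop any answer not grounded in its retrieved context
from openai import OpenAI

LLM = OpenAI()

def is_grounded(answer: str, context: str) -> bool:
    # Second-pass LLM check over the compiled answer and its source context.
    out = LLM.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Does the CONTEXT fully support every claim in the ANSWER? "
                "Reply with exactly GROUNDED or UNGROUNDED.\n\n"
                f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
            ),
        }],
        temperature=0,
        max_tokens=5,
    )
    return out.choices[0].message.content.strip().upper().startswith("GROUNDED")

Unverified answers are dropped rather than repaired: a smaller, fully verified snapshot beats a larger, partially trusted one.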

Compiled artifacts schema

# L1 — exact answer record
{
  "id": "refund.30day_window",
  "version": "v23",
  "canonical_question": "Can I get a refund after 30 days?",
  "answer": "Refunds are available within 30 days of purchase ...",
  "sources": ["policies/refunds.md#30day"],
  "approved_by": "support-lead",
  "approved_at": "2026-04-26T10:00:00Z",
}

# L2 — intent template
{
  "intent": "refund.eligibility",
  "version": "v23",
  "slots": ["order_age_days", "product_category"],
  "template": "For {product_category} purchased {order_age_days} days ago, ...",
  "static_context": "Refunds policy v3.2 ...",
}

# L3 — topic prompt block
{
  "topic": "refunds",
  "version": "v23",
  "context": "Refund policy summary, exception list, escalation paths ...",
  "tokens": 1240,
  "freshness": "2026-04-26",
}

06 Cache store design

The cache is the contract between compile time and serve time. Pick storage by access pattern, durability, and how the snapshot is published.

Store · Best for · Notes
Redis · L1 / L2 — sub-ms reads, high QPS · Use as the hot tier; expire by version key prefix on rollout
Postgres · L2 / L3 — relational lookups, slot filling · Good when you need joins on entity slots or to tie into existing app DB
JSON file in object storage · L3 prompt blocks; low-QPS L1 · Versioned snapshot is just an immutable S3 object — trivial rollback
CDN edge · Anonymous / personalization-free L1 · Cloudflare KV / Workers, Fastly Edge Dictionary — single-digit ms global
Provider prompt cache · L3 — large static contexts · OpenAI / Anthropic / Gemini prompt caching reuses tokens across requests
SQLite (embedded) · Small deployments, edge / mobile · Ship the snapshot with the app binary; perfect for offline

Recommended hot/cold split

Hot · Redis — L1 exact answers + L2 intent records · sub-ms reads, scale-out, per-version namespace · ~95% of reads · p99 < 5 ms
Warm · Postgres — L2 entity-slot resolution + audit log · slot joins, click-through tracking, serve log · ~4% of reads · p99 ~ 50 ms
Cold · S3 + CDN — L3 prompt blocks + canonical snapshots · immutable versioned objects, trivial rollback · ~1% of reads · p99 ~ 200 ms
Provider · LLM prompt cache — pinned static contexts on the model side · OpenAI auto, Anthropic cache_control, Gemini cachedContent · cuts input cost 50–90%

The hot tier handles nearly all production reads; warm and cold exist for cases where Redis can't answer alone (slot joins, very large prompt blocks). The provider tier is orthogonal — it caches the part of the prompt the LLM sees, not what the cache layer returns.

07 Runtime serving

The serving path is intentionally boring. No retrieval, no embedding, no vector store. Just normalize → classify → look up → format.

Steps

  1. Normalize. Trim, lowercase, strip punctuation, expand contractions, language-detect.
  2. Hash. Compute the canonical-form hash → L1 exact lookup.
  3. Classify (only if L1 misses). Run intent classifier (small fine-tuned model, regex grammar, or LLM-based) → returns {domain, intent, slots}.
  4. Build cache key. domain + intent + version + language [+ tenant].
  5. L2 lookup. Hit? Render template with slots; small completion call (formatting only).
  6. L3 lookup. Otherwise fetch the topic prompt block and call the LLM with that as the entire context.
  7. Fallback. If everything misses, return a deterministic safe answer + escalate / log.
  8. Audit. Log every served answer with the layer hit, version, and cache key. Sample for review.

Production guardrails on the runtime path

  • temperature=0 and a hard max_tokens cap on every formatting call
  • Free generation from world knowledge unreachable in code (see the fallback ladder in section 11)
  • Active snapshot version read per request, so canary and rollback apply instantly
  • Every serve logged with layer, version, and cache key; samples routed to review

08 Provider prompt caching

For API-hosted models you cannot persist the model's KV cache yourself, but every major provider exposes server-side prompt caching that does the equivalent — reusing the same large static prefix across requests at large discounts.

Provider · Mechanism · Typical effect
OpenAI · Automatic prompt caching on prefixes ≥ 1024 tokens · Up to ~80% lower latency, ~50–90% lower input-token cost on the cached portion; cache TTL ~5–60 minutes
Anthropic (Claude) · Explicit cache_control markers on prompt blocks · Cached input tokens billed at ~10% of normal; up to 90% cost saving on the cached prefix; 5-min default TTL, 1-hour extended
Google (Gemini) · Explicit context caching API; reuse caches across calls · Cached tokens billed separately at lower rates; you control TTL; designed for very large reusable contexts
Azure OpenAI · Same as OpenAI prompt caching, plus PTU reservations · PTU + prompt cache combine for predictable latency at scale
Self-hosted (vLLM, SGLang, TensorRT-LLM) · Real KV cache + RadixAttention prefix sharing · You can hold full KV state across requests — biggest savings, but you operate the infra

Anatomy of a cacheable prompt

CACHED PREFIX · ~8,000 tokens · stable: system + tool catalog + knowledge brief + history snapshot
USER · ~50 tokens · varies: the question

~90% of tokens reused via cache · ~80% latency reduction · ~50–90% input-token cost savings

Providers cache contiguous prefixes only. Anything that varies per request (timestamps, request IDs, session counters) breaks the cache if it appears before the static portion ends — keep volatile tokens at the very end.

How to structure prompts for max cache hits

# Anthropic — explicit cache_control on a static knowledge block
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": KNOWLEDGE_BRIEF,         # 8K tokens of compiled context
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": "You are a concise support agent."},
    ],
    messages=[{"role": "user", "content": user_question}],
)
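
With OpenAI-style automatic caching there is nothing to mark: the provider caches any byte-identical prefix, so the only job is ordering. A sketch, where the timestamp placement is the illustrative part:

# OpenAI — automatic prefix caching: keep the stable prefix byte-identical
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

def answer(user_question: str, knowledge_brief: str) -> str:
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Identical across requests -> eligible for automatic prefix caching
            {"role": "system",
             "content": "You are a concise support agent.\n\n" + knowledge_brief},
            # Anything that varies per request stays after the stable prefix
            {"role": "user",
             "content": f"[asked {datetime.now(timezone.utc).isoformat()}]\n{user_question}"},
        ],
        temperature=0,
        max_tokens=300,
    )
    return out.choices[0].message.content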

09 Cache key design

The cache key is the contract. Wrong key = collisions, leaks, or zero hit rate. Design deliberately and version it like an API.

Recommended key shape

prefix · knowledge:
version · v23:
language · en:
tenant · acme:
domain.intent · refund.eligibility:
slot hash · a8f71c
prefix — namespace (always knowledge)
version — snapshot label, mandatory
language — per-language variants
tenant — multi-tenant isolation
domain.intent — what the user is asking
slot hash — entity values (hashed, not raw)

Examples

# L1 exact answer
knowledge:v23:en:refund.30day_window

# L2 intent template (multi-tenant + slot hash)
knowledge:v23:en:tenant_acme:refund.eligibility:slot_a8f71c

# L3 topic prompt block
knowledge:v23:en:topic:refunds

Rules

  • Version is mandatory in every key; without it, rollouts and rollbacks become destructive flushes.
  • Hash entity slot values; never put raw user data (or PII) in a key.
  • Include the tenant segment in multi-tenant deployments, or one tenant's answers leak to another.
  • Keep language as its own segment so per-language variants never collide.
  • Treat the key shape as an API: changing segments is a breaking change, versioned like one.
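
A minimal key-builder following these rules; the function names are illustrative, the shape matches the examples above:

# keys.py — build L2 keys that follow the rules above
import hashlib
import json

def slot_hash(slots: dict) -> str:
    # Hash canonicalized slot values so raw entity data (possibly PII)
    # never appears in the key itself.
    blob = json.dumps(slots, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:6]

def l2_key(version: str, lang: str, tenant: str, intent: str, slots: dict) -> str:
    return f"knowledge:{version}:{lang}:tenant_{tenant}:{intent}:slot_{slot_hash(slots)}"

# l2_key("v23", "en", "acme", "refund.eligibility", {"order_age_days": 45})
# -> "knowledge:v23:en:tenant_acme:refund.eligibility:slot_<6 hex chars>"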

10 Versioning & rollout

Snapshots are immutable. You publish a new version side-by-side, shift traffic gradually, and keep the old version warm for instant rollback.

The publish flow

  1. Compiler produces snapshot v24; uploads to S3 as knowledge/v24/....
  2. Cache loader warms Redis under the knowledge:v24:* namespace (writes only — no read traffic yet).
  3. Smoke tests: regression suite of 200+ canonical questions runs against v24; fail = abort.
  4. Canary: route 1% → 10% → 50% → 100% via a feature flag on the runtime active_version.
  5. Bake: keep v23 warm in Redis for 24–72 hours so rollback is instant.
  6. GC: after the bake window, evict old versions by namespace prefix.
Timeline (traffic share by version):

  T0 · publish: v23 100% · v24 warming
  T+5m · smoke: v23 100% · v24 test
  T+15m · canary: v23 90% · v24 10%
  T+1h · full cutover: v23 warm · v24 100%
  T+72h · GC: v23 evicted · v24 100%
The runtime reads a single config key knowledge:active at request entry — flipping it from v23 → v24 is the entire rollout step. Rollback is the same flip in reverse, executable in seconds while v23 is still warm.
Active version is a single small key.
The runtime reads one tiny config key (e.g. knowledge:active) at request entry — that's the indirection point for canary, rollback, and per-tenant pinning.
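
The flip itself is a single write. A sketch against Redis, matching the key names used above:

# rollout.py — promote or roll back a snapshot via the active-version pointer
import redis

R = redis.Redis(decode_responses=True)

def promote(version: str) -> None:
    # One atomic write; every request that reads knowledge:active afterwards
    # serves the new snapshot.
    R.set("knowledge:active", version)

def rollback(previous: str) -> None:
    # Instant only while the previous namespace is still warm (the bake window).
    R.set("knowledge:active", previous)

# promote("v24")    # full cutover
# rollback("v23")   # seconds later, if the canary misbehaves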

11 Determinism & fallback rules

A cache-only system is only as good as its deterministic fallback. When the cache can't answer, what happens must be predictable, safe, and auditable.

Fallback ladder

1 · L1 · Exact answer cache hit: return verbatim approved answer · LLM not called (~30 ms · ~$0)
2 · L2 · Intent template + slot filling: render approved template; tiny formatting LLM call (~200 ms · ~$0.0001)
3 · L3 · Topic prompt block: send curated context to LLM; bounded output (~1.5 s · ~$0.001)
4 · SAFE · Approved fallback answer: "I'm not sure — here's how to reach a human." Per-domain, vetted (~5 ms · ~$0)
5 · ESC · Escalate to human / open ticket: carry original question + intent for next compile cycle (async · logged)
× · DENY · Free LLM generation from world knowledge: reintroduces hallucinations — defeats the whole pattern (disabled)

Try each rung top-to-bottom. The deny row exists in the diagram precisely because it's the most tempting "just let the model handle it" mistake — make it unreachable in code, not just in policy.

Determinism levers

  • L1 answers are served verbatim; the LLM never touches them.
  • Runtime LLM calls run at temperature=0 with bounded max_tokens.
  • Templates and prompt blocks come from an approved, versioned snapshot, never from live retrieval.
  • The safe answer and escalation path are fixed, vetted strings per domain.

12 vs Classic RAG

Side-by-side decision matrix. Both patterns are valid; pick by the shape of your data and the answers you must serve.

Per-request flow timing

Classic RAG · per request
  Receive query · 5 ms
  Embed query · 80 ms
  Vector search top-K · 120 ms
  Rerank candidates · 200 ms
  Assemble context · 15 ms
  LLM generate (3K in / 300 out) · 2,200 ms
  Post-filter / safety · 90 ms
  Total · ~2,710 ms

Precompiled RAG · L1 hit
  Receive query · 5 ms
  Normalize · 2 ms
  Hash → Redis lookup · 3 ms
  Total · ~10 ms

L2 hits add a small formatting LLM call (~200 ms total). L3 hits look much like classic RAG end-to-end but skip embedding + vector + rerank. The big latency win is concentrated on L1, which is also where the bulk of production traffic lives in well-tuned systems.

Decision matrix

Dimension · Classic RAG · Precompiled RAG
When retrieval runs · Every request · Build time, once per snapshot
p99 latency · 1–5 s · 30 ms (L1) – 2 s (L3)
Cost per query · Embedding + vector + rerank + LLM · Cache lookup + small LLM call (or none)
Determinism · Variable per request · Same input → same approved output
Reviewability · Audit individual generations · Audit the snapshot once, cover all queries
Freshness · Live · Bound by recompile cadence
Best for · Open-ended Q&A on changing corpora · FAQ, policy, support, compliance, regulated answers
Operational complexity · Vector DB + indexer + retriever + reranker live · Compiler runs offline; runtime is a key-value lookup
Failure surface · Bad chunk, bad rerank, bad LLM = bad answer · Bad snapshot is reverted in seconds

13 When to use this pattern

Not every workload should be precompiled. Match the pattern to the data shape.

Strong fit

  • Customer support FAQ & policies
  • Compliance and regulatory Q&A
  • Product documentation chatbot
  • HR / legal / finance internal helpdesk
  • Onboarding and how-to assistants
  • Voice IVR with bounded intents
  • Mobile / edge assistants (offline)
  • Healthcare patient-education answers

Weak fit (use classic or hybrid)

  • Open-ended research over a live corpus
  • Real-time data (prices, balances, sensors)
  • Per-user personalized synthesis
  • Large unbounded long-tail queries
  • Conversational agents that mutate state
  • Code generation against changing repos
  • Search-as-you-type (use embeddings)

Hybrid: when to mix

The most common production shape is hybrid — precompiled for the head of the distribution (the 80% of repeated questions) and live retrieval for the long tail. Route based on whether the cache layer hits; record long-tail questions for the next compile cycle so the head grows over time.
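
A sketch of that routing shell; the three helpers are stubs standing in for the section 14 cache path, a classic RAG pipeline, and the miss log:

# hybrid.py — precompiled head, live-retrieval tail
from typing import Optional

def cached_lookup(question: str) -> Optional[str]:
    """L1/L2/L3 path from section 14, returning None on a full miss."""
    ...

def live_rag_answer(question: str) -> str:
    """Classic retrieve -> rerank -> generate pipeline for the long tail."""
    ...

def log_for_next_compile(question: str) -> None:
    """Misses recorded here become canonical Q&A in the next snapshot."""
    ...

def answer(question: str) -> str:
    cached = cached_lookup(question)
    if cached is not None:
        return cached                      # head: deterministic, ~free
    log_for_next_compile(question)         # grow the head over time
    return live_rag_answer(question)       # tail: live retrieval, classic cost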

14 End-to-end example

A minimal but production-shaped cache-only serving path. Each box maps onto the layers above.

# serve.py — cache-only request handler
import hashlib, json, redis
from openai import OpenAI

R = redis.Redis(decode_responses=True)
LLM = OpenAI()
# log_serve, log_miss, intent_to_topic and SAFE_ANSWER are app-level helpers (not shown)

def active_version() -> str:
    # Read on every request so a canary flip or rollback takes effect immediately.
    return R.get("knowledge:active") or "v23"

def normalize(q: str) -> str:
    return " ".join(q.lower().strip().split())

def key_l1(q: str, lang: str, ver: str) -> str:
    h = hashlib.sha256(q.encode()).hexdigest()[:12]
    return f"knowledge:{ver}:{lang}:exact:{h}"

def classify_intent(q: str) -> dict:
    # small fine-tuned classifier or rule grammar (stubbed with illustrative slots)
    return {
        "domain": "support",
        "intent": "refund.eligibility",
        "slots": {"order_age_days": 45, "product_category": "electronics"},
    }

def serve(question: str, lang: str = "en") -> str:
    ver = active_version()
    nq = normalize(question)

    # L1 — exact answer
    k1 = key_l1(nq, lang, ver)
    hit = R.get(k1)
    if hit:
        log_serve("L1", k1)
        return hit

    # L2 — intent template
    intent = classify_intent(nq)
    k2 = f"knowledge:{ver}:{lang}:intent:{intent['intent']}"
    record = R.get(k2)
    if record:
        rec = json.loads(record)
        prompt = rec["template"].format(**intent["slots"])
        out = LLM.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0, max_tokens=300,
        )
        log_serve("L2", k2)
        return out.choices[0].message.content

    # L3 — topic prompt block
    topic = intent_to_topic(intent["intent"])
    k3 = f"knowledge:{ACTIVE}:{lang}:topic:{topic}"
    block = R.get(k3)
    if block:
        out = LLM.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": block},
                {"role": "user", "content": question},
            ],
            temperature=0, max_tokens=600,
        )
        log_serve("L3", k3)
        return out.choices[0].message.content

    # Fallback — deterministic safe answer + escalate
    log_miss(question, intent)
    return SAFE_ANSWER[intent["domain"]]

Notice what's not here: no embedding model, no vector store client, no reranker. The runtime container is small and boring — exactly the goal.

15 Refresh & invalidation

When source content changes, you compile a new snapshot — never patch the live cache in place. Atomic publish via the active-version pointer.

Triggers

  • Scheduled rebuild: daily or weekly, matching the compiler cadence.
  • Source change: a doc edit, policy update, or new KB article lands.
  • Miss-log pressure: the review queue shows questions the snapshot should cover.

Partial vs full recompile

An edit that touches one topic only needs that topic's artifacts recompiled (the "topic refreshes" in section 17's cost example); structural changes such as a new taxonomy, new chunking, or new templates warrant a full rebuild. Either way, the output is a new versioned snapshot, never an in-place edit.

Don't mutate live cache entries.
Editing in place across a fleet of cache nodes leads to torn reads and inconsistent answers between users. Always publish a new version and switch the active pointer.

16 Eval & quality gates

The eval surface for precompiled RAG is much friendlier than for live RAG — you evaluate the snapshot once, and the result holds for every query that hits it.

Build-time evals

  • Groundedness verification on every generated answer (compiler stage 7)
  • Regression suite of canonical questions run against the candidate snapshot
  • Coverage check against the production miss log
  • Drift report against the previous snapshot
  • Safety scan across all compiled artifacts

Runtime evals

  • Per-layer hit rate (L1/L2/L3) and overall miss rate
  • Sampled review of served answers from the audit log
  • Fallback and escalation rates per domain

Quality gates before publish

  1. Regression suite ≥ 99% pass.
  2. Coverage on production miss log ≥ target (e.g. 90%).
  3. Drift report manually approved by domain owner.
  4. Safety scan — zero critical findings.
  5. Canary smoke at 1% — error rate within tolerance.
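
Gate 1 as a publish-blocking check, sketched under the assumption that the snapshot loads as a dict from canonical question to L1 record and the suite is a JSONL file; both the file layout and the threshold wiring are illustrative:

# gate_regression.py — abort publish unless the snapshot answers the labeled set
import json

def regression_pass_rate(snapshot: dict, suite_path: str) -> float:
    hits = total = 0
    with open(suite_path) as f:
        for line in f:
            case = json.loads(line)  # {"question": ..., "expected_id": ...}
            total += 1
            record = snapshot.get(case["question"])
            if record is not None and record["id"] == case["expected_id"]:
                hits += 1
    return hits / max(total, 1)

# Publish gate: regression_pass_rate(snapshot, "regression.jsonl") >= 0.99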

17 Cost analysis

The cost shape changes fundamentally — from "scales with QPS" to "scales with rebuild frequency + cache size."

Where the dollars go

Cost line · Classic RAG · Precompiled RAG
Per-query embedding · 1× per query · 0 at runtime
Per-query vector search · 1× per query · 0 at runtime
Per-query rerank · 1× per query · 0 at runtime
Per-query LLM tokens · large input + output · tiny (formatting only)
Vector DB infra · always-on, scales with QPS · none at runtime
Compiler runs · n/a · per-rebuild fixed cost
Cache storage · n/a · small, predictable

Worked example

Suppose 1M queries/month, 70% hit L1, 25% hit L2, 5% hit L3:

Classic RAG
  LLM (per-query LLM + vector + rerank) · $5,000 / mo
  Vector DB infra (always-on, scales with QPS) · $1,200 / mo
  Monthly total · $6,200

Precompiled RAG
  LLM (only L2 + L3 calls) · $100 / mo
  Compiler runs (weekly rebuild + topic refreshes) · $300 / mo
  Cache infra (Redis + S3 + small Postgres) · $150 / mo
  Monthly total · ~$550

≈ 11× cost reduction

The bigger the head of your query distribution, the bigger the savings. For a long-tailed corpus, hybrid splits the difference. The chart uses representative numbers — your mileage varies with model choice, hit-rate distribution, and provider prompt-cache discounts (which can drop the precompiled column further).
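
The arithmetic behind the LLM line, as a function you can rerun with your own hit rates and the per-call costs from section 04; all numbers are the section's representative figures, not quotes:

# cost_model.py — blended monthly LLM cost from layer hit rates
def monthly_llm_cost(queries: int, hit_rates: dict, cost_per_call: dict) -> float:
    return sum(queries * hit_rates[l] * cost_per_call[l] for l in hit_rates)

# Section 17's example: 1M queries/month, 70% L1, 25% L2, 5% L3
cost = monthly_llm_cost(
    queries=1_000_000,
    hit_rates={"L1": 0.70, "L2": 0.25, "L3": 0.05},
    cost_per_call={"L1": 0.0, "L2": 0.0001, "L3": 0.001},
)
print(f"${cost:,.0f} / month")  # $75; the chart's $100 LLM line, rounded up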

18 Anti-patterns

Mistakes that look reasonable but undermine the whole pattern.

Anti-pattern · Why it hurts · Do this instead
Shipping a vector store to runtime "just in case" · Defeats the simplicity and operational cost wins · Strip retrieval from the runtime container; route misses to a deterministic fallback
Patching cache entries in place · Torn reads, inconsistent answers across users, no audit · Compile new snapshot, publish atomically by switching the active pointer
Cache key without version · Rollouts and rollbacks become destructive flushes · Always include version in the key; keep ≥2 versions warm
Letting the LLM answer freely on miss · Reintroduces hallucinations into a "deterministic" system · Deterministic safe answer + escalation, never free generation
Time-of-day in the cached prefix · Invalidates the provider prompt cache every request · Move time-varying tokens after the cached region or out of the prompt entirely
Compiling without a regression suite · You can't prove a new snapshot is at least as good · Maintain a labeled regression query set; CI it into the publish gate
Skipping the miss log · The cache stagnates; long tail never gets folded back into the head · Log every miss with intent + slots; review before each rebuild
Cross-tenant cache without tenant in the key · One tenant's policy answer leaks to another · Per-tenant key prefix or per-tenant cache namespace
Generating without verification · Hallucinated artifacts get baked into the cache and served to all users · Second-pass verifier + sample human review before publish
One giant L3 prompt block per topic · Costs more tokens per call; cache eviction churn · Split topics finely; let intent classifier route to the smallest relevant block

19 Production checklist

Before declaring a precompiled-RAG system production-ready.

Compiler

  • Snapshot output is versioned and immutable
  • Every generated answer passed the groundedness verifier
  • Human review completed (sample or full) for high-stakes domains
  • Production miss log folded into this build

Cache

  • Version present in every key; ≥2 versions kept warm
  • Tenant and language segments present where they apply
  • Atomic publish via the active-version pointer; rollback tested
  • Slot values hashed; no raw user data in keys

Runtime

  • No embedding model, vector store, or reranker in the serving container
  • temperature=0 and bounded max_tokens on all formatting calls
  • Free generation on miss unreachable in code
  • Every serve logged with layer, version, and cache key

Eval & ops

  • Regression suite wired into the publish gate (≥ 99% pass)
  • Per-layer hit-rate and miss-rate dashboards
  • Canary rollout and instant rollback rehearsed
  • Miss-log review cadence owned by a named domain owner

If you ship one thing first, ship the cache-key contract.
Get the key shape right and everything downstream — rollouts, multi-tenant safety, audit, eval — falls into place. Get it wrong and you'll be migrating cache schemas under traffic.