Precompiled RAG & Cache-Augmented Generation
Run RAG once, offline, to compile your static knowledge into verified answers, contexts, and prompt blocks. Serve production traffic from a deterministic cache with the LLM only doing final formatting — no embeddings, no vector search, no retrieval at request time.
01 Overview
If your knowledge is mostly static (policies, FAQs, product specs, compliance docs, support playbooks), there is no reason to re-do retrieval on every request. Pre-compile the answers and serve them.
The pattern goes by several names — precompiled RAG, cache-augmented generation (CAG) — pick whichever your team prefers.
What you get in exchange for giving up request-time retrieval:
- Sub-100ms responses for cache hits — no vector search, no rerank, no retrieval RTT.
- Order-of-magnitude lower cost per query — most requests bypass the LLM entirely or use a tiny formatting prompt.
- Deterministic answers — the same question gets the same vetted answer every time, which is a hard requirement in regulated, support, and compliance contexts.
- Trivial governance — every served answer is reviewable, versioned, and rollback-able. No surprise hallucinations from a fresh retrieval.
02 The mental shift
Classic RAG treats every request as a fresh retrieval problem. Precompiled RAG flips it: most requests are answer-lookups, and retrieval only happens when the corpus changes.
| Classic RAG | Precompiled RAG |
|---|---|
| Embed query at request time | Run RAG once during build |
| Vector search every request | Store compiled answers / contexts |
| Rerank candidates | Lookup by cache key at request time |
| Stuff context into LLM | LLM only formats (or skip entirely) |
| Answer is non-deterministic | Answer is reviewable and stable |
| Cost scales with QPS | Cost scales with build frequency, not QPS |
| Latency 1–5 seconds typical | Latency 50–200ms typical |
03 Architecture
Two phases, one cache between them. The "compiler" runs offline; the "server" runs online; the cache is the contract.
In production: no vector DB. No embedding. No retrieval. All of that lives in the offline phase.
04 Three cache layers
Layered cache — try the cheapest, fastest path first; fall through to richer layers if it misses. Each layer has a different shape and SLA.
- L1 — exact answer cache. Keyed by canonical question or hash. Returns the final answer text directly — the LLM is bypassed entirely.
- L2 — intent template cache. Keyed by classified intent + entity slots. Maps thousands of paraphrased questions to one approved answer template; small prompt fills slots.
- L3 — topic prompt-block cache. Keyed by topic. Returns a curated static context block (the "knowledge brief") that the LLM uses to compose a fresh answer.
How the layers compose
- Normalize the query.
- Hash the canonical form → check L1. Hit? Return. Done. (Key construction is sketched after this list.)
- Classify intent & extract slots → check L2. Hit? Render the template (small prompt) and return.
- Map intent to topic → fetch L3 prompt block, send small completion request, return.
- Miss everything? Fall back to a deterministic safe answer ("I'm not sure, here's how to reach support"), log the miss for the next compile.
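A minimal sketch of the first two rungs' plumbing — canonicalize, hash, build the L1 key. The exact normalization rules here are an assumption; whatever you choose must match what the compiler used when it wrote the keys:

```python
# Illustrative L1 key construction — normalization must mirror the compiler's.
import hashlib
import re

def normalize(q: str) -> str:
    q = q.lower().strip()
    q = re.sub(r"[^\w\s]", "", q)   # strip punctuation
    return " ".join(q.split())      # collapse whitespace

def l1_key(q: str, version: str, lang: str = "en") -> str:
    h = hashlib.sha256(normalize(q).encode()).hexdigest()[:12]
    return f"knowledge:{version}:{lang}:exact:{h}"
```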
05 Offline compiler
The build pipeline. Runs on a schedule (daily / weekly) or on knowledge-source change. Output is a versioned, immutable cache snapshot.
Stages
- Source ingest. Pull raw documents (docs, KB, tickets, policies, transcripts) into object storage.
- Parse & chunk. Layout-aware parsers; structural chunking (heading-based) preserves boundaries.
- Embed & index. Build a temporary vector index — used only during this build.
- Topic discovery. Cluster chunks; identify FAQ-worthy topics, common entity sets, decision points.
- Question synthesis. For each topic, an LLM generates the N most-likely user questions.
- Answer generation. For each question: retrieve top-K, rerank, generate a grounded answer with citations.
- Verification. A second LLM (or rule-based check) verifies the answer is grounded in the retrieved context. Drop unverified answers (see the sketch after this list).
- Human review. Sample or full review of generated answers for high-stakes domains.
- Compile artifacts. Emit three artifact families: exact Q&A pairs (L1), intent-templated answers (L2), topic prompt blocks (L3).
- Snapshot & publish. Tag with a version (e.g. `knowledge:v23:2026-04-26`), publish atomically to the cache store.
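Stage 7 is the one most worth automating early. A minimal sketch of an LLM-as-judge groundedness check, assuming an OpenAI-style client; the judge prompt, model choice, and record fields are illustrative:

```python
# Sketch: verification stage — drop answers not grounded in their context.
from openai import OpenAI

client = OpenAI()

JUDGE = """Reply YES if every claim in ANSWER is supported by SOURCES, else NO.

SOURCES:
{sources}

ANSWER:
{answer}"""

def is_grounded(answer: str, sources: str) -> bool:
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE.format(sources=sources, answer=answer)}],
        temperature=0,
        max_tokens=2,
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")

# candidates -> snapshot: keep only verified artifacts
# artifacts = [a for a in candidates if is_grounded(a["answer"], a["context"])]
```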
Compiled artifacts schema
```
# L1 — exact answer record
{
  "id": "refund.30day_window",
  "version": "v23",
  "canonical_question": "Can I get a refund after 30 days?",
  "answer": "Refunds are available within 30 days of purchase ...",
  "sources": ["policies/refunds.md#30day"],
  "approved_by": "support-lead",
  "approved_at": "2026-04-26T10:00:00Z"
}

# L2 — intent template
{
  "intent": "refund.eligibility",
  "version": "v23",
  "slots": ["order_age_days", "product_category"],
  "template": "For {product_category} purchased {order_age_days} days ago, ...",
  "static_context": "Refunds policy v3.2 ..."
}

# L3 — topic prompt block
{
  "topic": "refunds",
  "version": "v23",
  "context": "Refund policy summary, exception list, escalation paths ...",
  "tokens": 1240,
  "freshness": "2026-04-26"
}
```
06 Cache store design
The cache is the contract between compile time and serve time. Pick storage by access pattern, durability, and how the snapshot is published.
| Store | Best for | Notes |
|---|---|---|
| Redis | L1 / L2 — sub-ms reads, high QPS | Use as the hot tier; expire by version key prefix on rollout. |
| Postgres | L2 / L3 — relational lookups, slot filling | Good when you need joins on entity slots or to tie into existing app DB. |
| JSON file in object storage | L3 prompt blocks; low-QPS L1 | Versioned snapshot is just an immutable S3 object — trivial rollback. |
| CDN edge | Anonymous / personalization-free L1 | Cloudflare KV / Workers, Fastly Edge Dictionary — single-digit ms global. |
| Provider prompt cache | L3 — large static contexts | OpenAI / Anthropic / Gemini prompt caching reuses tokens across requests. |
| SQLite (embedded) | Small deployments, edge / mobile | Ship the snapshot with the app binary; perfect for offline. |
Recommended hot/cold split
The hot tier handles nearly all production reads; warm and cold tiers exist for cases where Redis can't answer alone (slot joins, very large prompt blocks). The provider tier (Anthropic `cache_control`, Gemini `cachedContent`) is orthogonal — it caches the part of the prompt the LLM sees, not what the cache layer returns.
07 Runtime serving
The serving path is intentionally boring. No retrieval, no embedding, no vector store. Just normalize → classify → look up → format.
Steps
- Normalize. Trim, lowercase, strip punctuation, expand contractions, language-detect.
- Hash. Compute the canonical-form hash → L1 exact lookup.
- Classify (only if L1 misses). Run intent classifier (small fine-tuned model, regex grammar, or LLM-based) → returns `{domain, intent, slots}`.
- Build cache key. `domain + intent + version + language [+ tenant]`.
- L2 lookup. Hit? Render template with slots; small completion call (formatting only).
- L3 lookup. Otherwise fetch the topic prompt block and call the LLM with that as the entire context.
- Fallback. If everything misses, return a deterministic safe answer + escalate / log.
- Audit. Log every served answer with the layer hit, version, and cache key. Sample for review.
Production guardrails on the runtime path
- Per-layer SLO. L1 <30ms, L2 <300ms, L3 <2s. If you blow SLO, drop to fallback (see the sketch after this list).
- Cap LLM tokens. A formatting prompt should never need 8K input — set hard limits.
- Force determinism. `temperature=0`, fixed seed where supported, structured output.
- Strip the retrieval surface. The runtime container should not carry vector libraries or embedding clients — there's nothing for them to do.
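One way to wire the SLO and token caps together, assuming the openai v1 client (its per-request `timeout` argument does the auto-degrade); the budget value mirrors the L2 SLO above:

```python
# Sketch: budgeted formatting call — blow the SLO and the caller falls
# through to the deterministic fallback instead of waiting.
from openai import OpenAI, APITimeoutError

LLM = OpenAI()

def format_answer(prompt: str, budget_s: float = 0.3) -> str | None:
    try:
        out = LLM.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,       # determinism lever
            max_tokens=300,      # formatting never needs a long output
            timeout=budget_s,    # per-request budget = the layer SLO
        )
        return out.choices[0].message.content
    except APITimeoutError:
        return None              # caller drops to the fallback rung
```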
08 Provider prompt caching
For API-hosted models you cannot persist the model's KV cache yourself, but every major provider exposes server-side prompt caching that does the equivalent — reusing the same large static prefix across requests at large discounts.
| Provider | Mechanism | Typical effect |
|---|---|---|
| OpenAI | Automatic prompt caching on prefixes ≥ 1024 tokens | Up to ~80% lower latency, ~50–90% lower input-token cost on cached portion. Cache TTL ~5–60 minutes. |
| Anthropic (Claude) | Explicit cache_control markers on prompt blocks | Cached input tokens billed at ~10% of normal; up to 90% cost saving on the cached prefix. 5-min default TTL, 1-hour extended. |
| Google (Gemini) | Explicit context caching API; reuse caches across calls | Cached tokens billed separately at lower rates; you control TTL. Designed for very large reusable contexts. |
| Azure OpenAI | Same as OpenAI prompt caching, plus PTU reservations | PTU + prompt cache combine for predictable latency at scale. |
| Self-hosted (vLLM, SGLang, TensorRT-LLM) | Real KV cache + RadixAttention prefix sharing | You can hold full KV state across requests — biggest savings, but you operate the infra. |
Anatomy of a cacheable prompt
Providers cache contiguous prefixes only. Anything that varies per request (timestamps, request IDs, session counters) breaks the cache if it appears before the static portion ends — keep volatile tokens at the very end.
How to structure prompts for max cache hits
- Stable content first. System prompt → tool catalog → static knowledge → conversation history → user input. The provider can only cache a contiguous prefix.
- Avoid timestamps (or any per-request varying token) in the cached region. They invalidate the prefix every call.
- Pin big knowledge blocks via the provider's caching mechanism (Anthropic `cache_control`, Gemini `cachedContent`).
- Refresh the cache just before TTL expiry on hot prefixes — a single keep-alive request preserves the discount (see the sketch after the code block below).
```python
# Anthropic — explicit cache_control on a static knowledge block
import anthropic

client = anthropic.Anthropic()

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": KNOWLEDGE_BRIEF,  # 8K tokens of compiled context
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": "You are a concise support agent."},
    ],
    messages=[{"role": "user", "content": user_question}],
)
```
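And the keep-alive tip in code — re-send the identical prefix shortly before TTL expiry. The 240-second interval targets Anthropic's 5-minute ephemeral TTL and is an assumption; `client` and `KNOWLEDGE_BRIEF` are the names from the block above:

```python
# Sketch: refresh the provider-side prompt cache with a 1-token ping.
import time

def keep_cache_warm(interval_s: int = 240) -> None:
    while True:
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,  # minimal completion — we only need the prefix touched
            system=[
                {
                    "type": "text",
                    "text": KNOWLEDGE_BRIEF,
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "You are a concise support agent."},
            ],
            messages=[{"role": "user", "content": "ping"}],
        )
        time.sleep(interval_s)
```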
09 Cache key design
The cache key is the contract. Wrong key = collisions, leaks, or zero hit rate. Design deliberately and version it like an API.
Recommended key shape
`knowledge:{version}:{language}[:{tenant}]:{layer-specific id}` — every key lives under the fixed namespace prefix (`knowledge`).
Examples
```
# L1 exact answer
knowledge:v23:en:refund.30day_window

# L2 intent template (multi-tenant + slot hash)
knowledge:v23:en:tenant_acme:refund.eligibility:slot_a8f71c

# L3 topic prompt block
knowledge:v23:en:topic:refunds
```
Rules
- Always include the version. Otherwise rollout becomes a flush-and-pray.
- Include language. Or at minimum, route per-language to per-language caches.
- Include tenant when answers vary. Per-tenant policies must not leak across tenants.
- Hash the entity slots — don't put raw user data in a key (PII risk); a sketch follows this list.
- Avoid timestamps in keys — destroys hit rate.
- Treat the key namespace as a public API. Document it, version it, deprecate it intentionally.
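A sketch of the slot-hash rule, assuming JSON-serializable slots; the truncation length is arbitrary but must be consistent between compiler and runtime:

```python
# Sketch: hash entity slots so raw user data (PII) never lands in a key.
import hashlib
import json

def slot_hash(slots: dict) -> str:
    canonical = json.dumps(slots, sort_keys=True, separators=(",", ":"))
    return "slot_" + hashlib.sha256(canonical.encode()).hexdigest()[:6]

# -> a "slot_xxxxxx"-shaped token for keys like
#    knowledge:v23:en:tenant_acme:refund.eligibility:{slot_hash(slots)}
```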
10 Versioning & rollout
Snapshots are immutable. You publish a new version side-by-side, shift traffic gradually, and keep the old version warm for instant rollback.
The publish flow
- Compiler produces snapshot `v24`; uploads to S3 as `knowledge/v24/...`.
- Cache loader warms Redis under the `knowledge:v24:*` namespace (writes only — no read traffic yet).
- Smoke tests: regression suite of 200+ canonical questions runs against v24; fail = abort.
- Canary: route 1% → 10% → 50% → 100% via a feature flag on the runtime `active_version`.
- Bake: keep `v23` warm in Redis for 24–72 hours so rollback is instant.
- GC: after the bake window, evict old versions by namespace prefix.
The runtime resolves the active version pointer (`knowledge:active`) at request entry — that's the indirection point for canary, rollback, and per-tenant pinning. Flipping it from v23 → v24 is the entire rollout step; rollback is the same flip in reverse, executable in seconds while v23 is still warm.
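The flip itself can be a single Redis operation — a minimal sketch, assuming the pointer lives at `knowledge:active`:

```python
# Sketch: promote and rollback are one pointer write each.
import redis

r = redis.Redis(decode_responses=True)

def promote(new_version: str) -> str | None:
    """Flip the active pointer; return the old version for instant rollback."""
    return r.getset("knowledge:active", new_version)

previous = promote("v24")
# rollback while v23 is still warm:
# r.set("knowledge:active", previous)
```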
11 Determinism & fallback rules
A cache-only system is only as good as its deterministic fallback. When the cache can't answer, what happens must be predictable, safe, and auditable.
Fallback ladder
Try each rung top-to-bottom. Unconstrained "just let the model handle it" generation is deliberately not a rung — it's the most tempting mistake, so make it unreachable in code, not just in policy.
Determinism levers
- `temperature=0` on every formatting call.
- Fixed seed where the provider supports it.
- Structured outputs (JSON schema) for any answer that gets parsed downstream.
- Stop sequences to bound length.
- Snapshot tests against the regression suite — any output drift across LLM versions blocks the deploy.
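The snapshot-test lever in miniature — a sketch assuming a JSONL suite of `{"question", "expected"}` records and the `serve()` handler from the end-to-end example in section 14:

```python
# Sketch: block the deploy on any output drift against the canonical suite.
import json

def snapshot_failures(suite_path: str = "regression_suite.jsonl") -> list[str]:
    failures = []
    with open(suite_path) as f:
        for line in f:
            case = json.loads(line)
            if serve(case["question"]) != case["expected"]:
                failures.append(case["question"])
    return failures  # publish gate: a non-empty list fails the build
```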
12 vs Classic RAG
Side-by-side decision matrix. Both patterns are valid; pick by the shape of your data and the answers you must serve.
Per-request flow timing
L2 hits add a small formatting LLM call (~200ms total). L3 hits look much like classic RAG end-to-end but skip embedding + vector + rerank. The big latency win is concentrated on L1, which is also where the bulk of production traffic lives in well-tuned systems.
Decision matrix
| Dimension | Classic RAG | Precompiled RAG |
|---|---|---|
| When retrieval runs | Every request | Build time, once per snapshot |
| p99 latency | 1–5 s | 30 ms (L1) – 2 s (L3) |
| Cost per query | Embedding + vector + rerank + LLM | Cache lookup + small LLM call (or none) |
| Determinism | Variable per request | Same input → same approved output |
| Reviewability | Audit individual generations | Audit the snapshot once, cover all queries |
| Freshness | Live | Bound by recompile cadence |
| Best for | Open-ended Q&A on changing corpora | FAQ, policy, support, compliance, regulated answers |
| Operational complexity | Vector DB + indexer + retriever + reranker live | Compiler runs offline; runtime is a key-value lookup |
| Failure surface | Bad chunk, bad rerank, bad LLM = bad answer | Bad snapshot is reverted in seconds |
13 When to use this pattern
Not every workload should be precompiled. Match the pattern to the data shape.
Strong fit
- Customer support FAQ & policies
- Compliance and regulatory Q&A
- Product documentation chatbot
- HR / legal / finance internal helpdesk
- Onboarding and how-to assistants
- Voice IVR with bounded intents
- Mobile / edge assistants (offline)
- Healthcare patient-education answers
Weak fit (use classic or hybrid)
- Open-ended research over a live corpus
- Real-time data (prices, balances, sensors)
- Per-user personalized synthesis
- Large unbounded long-tail queries
- Conversational agents that mutate state
- Code generation against changing repos
- Search-as-you-type (use embeddings)
Hybrid: when to mix
The most common production shape is hybrid — precompiled for the head of the distribution (the 80% of repeated questions) and live retrieval for the long tail. Route based on whether the cache layer hits; record long-tail questions for the next compile cycle so the head grows over time.
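A sketch of that router; `serve_cached` (the L1→L2→L3 ladder returning `None` on a full miss), `live_rag`, and `log_longtail` are assumed helpers:

```python
# Sketch: precompiled head, live-RAG tail, misses feed the next compile.
def answer(question: str) -> str:
    cached = serve_cached(question)   # L1 -> L2 -> L3; None on full miss
    if cached is not None:
        return cached
    log_longtail(question)            # grows the head at the next rebuild
    return live_rag(question)         # classic retrieval for the tail
```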
14 End-to-end example
A minimal but production-shaped cache-only serving path. Each box maps onto the layers above.
```python
# serve.py — cache-only request handler
import hashlib
import json

import redis
from openai import OpenAI

R = redis.Redis(decode_responses=True)
LLM = OpenAI()
ACTIVE = R.get("knowledge:active") or "v23"

def normalize(q: str) -> str:
    return " ".join(q.lower().strip().split())

def key_l1(q: str, lang: str) -> str:
    h = hashlib.sha256(q.encode()).hexdigest()[:12]
    return f"knowledge:{ACTIVE}:{lang}:exact:{h}"

def classify_intent(q: str) -> dict:
    # small fine-tuned classifier or rule grammar
    ...
    return {"domain": "support", "intent": "refund.eligibility", "slots": {...}}

def serve(question: str, lang: str = "en") -> str:
    nq = normalize(question)

    # L1 — exact answer
    hit = R.get(key_l1(nq, lang))
    if hit:
        log_serve("L1", key_l1(nq, lang))
        return hit

    # L2 — intent template
    intent = classify_intent(nq)
    k2 = f"knowledge:{ACTIVE}:{lang}:intent:{intent['intent']}"
    record = R.get(k2)
    if record:
        rec = json.loads(record)
        prompt = rec["template"].format(**intent["slots"])
        out = LLM.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=300,
        )
        log_serve("L2", k2)
        return out.choices[0].message.content

    # L3 — topic prompt block
    topic = intent_to_topic(intent["intent"])
    k3 = f"knowledge:{ACTIVE}:{lang}:topic:{topic}"
    block = R.get(k3)
    if block:
        out = LLM.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": block},
                {"role": "user", "content": question},
            ],
            temperature=0,
            max_tokens=600,
        )
        log_serve("L3", k3)
        return out.choices[0].message.content

    # Fallback — deterministic safe answer + escalate
    log_miss(question, intent)
    return SAFE_ANSWER[intent["domain"]]
```
Notice what's not here: no embedding model, no vector store client, no reranker. The runtime container is small and boring — exactly the goal.
15 Refresh & invalidation
When source content changes, you compile a new snapshot — never patch the live cache in place. Atomic publish via the active-version pointer.
Triggers
- Scheduled. Nightly / weekly recompile from the latest source content.
- Source change. Webhook from CMS / docs repo / Confluence triggers a partial recompile of affected topics.
- Miss-driven. When L1/L2/L3 miss rate exceeds a threshold for any topic, queue a recompile (sketched after this list).
- Manual. Compliance / legal-driven hot fix to a specific answer — fast lane bypasses the full compile.
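A sketch of the miss-driven trigger, assuming per-topic hit/miss counters in Redis; the threshold and minimum-sample values are illustrative:

```python
# Sketch: queue a partial recompile when a topic's miss rate spikes.
import redis

r = redis.Redis(decode_responses=True)

def maybe_recompile(topic: str, threshold: float = 0.05, min_total: int = 100) -> None:
    hits = int(r.get(f"stats:{topic}:hits") or 0)
    misses = int(r.get(f"stats:{topic}:misses") or 0)
    total = hits + misses
    if total >= min_total and misses / total > threshold:
        r.lpush("compile:queue", topic)   # picked up by the offline compiler
```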
Partial vs full recompile
- Partial: Recompile only the affected topics. Faster, cheaper, but the new snapshot still gets a new version label and goes through the full canary.
- Full: Rebuild the whole snapshot. Slower but guaranteed consistent — required when policies change or a global verifier is updated.
16 Eval & quality gates
The eval surface for precompiled RAG is much friendlier than for live RAG — you evaluate the snapshot once, and the result holds for every query that hits it.
Build-time evals
- Groundedness per generated answer — does it follow from the retrieved source?
- Citation correctness — does every claim cite a real source span?
- Coverage — what fraction of the regression query suite is answerable from the new snapshot?
- Drift — diff against the previous snapshot; flag any answer that changed materially for human review.
- Safety — content filter on every generated artifact; reject before publish.
- Bias / fairness for high-stakes domains — automated probes across protected attributes.
Runtime evals
- Layer hit rate per domain (L1 / L2 / L3 / fallback).
- Miss log classification — what kind of questions are we failing to answer?
- Latency p50/p95/p99 per layer.
- User satisfaction signal (thumbs / resolved-without-escalation rate).
Quality gates before publish
- Regression suite ≥ 99% pass.
- Coverage on production miss log ≥ target (e.g. 90%).
- Drift report manually approved by domain owner.
- Safety scan — zero critical findings.
- Canary smoke at 1% — error rate within tolerance.
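The gates compose into one boolean the publish job can enforce — a sketch where every check function stands in for a real eval job (all names and the `TOLERANCE` constant are assumptions):

```python
# Sketch: the publish gate as a single all-or-nothing check.
def publish_gate(snapshot: str) -> bool:
    return all([
        regression_pass_rate(snapshot) >= 0.99,      # regression suite
        miss_log_coverage(snapshot) >= 0.90,         # coverage target
        drift_report_approved(snapshot),             # human sign-off
        safety_critical_findings(snapshot) == 0,     # safety scan
        canary_error_rate(snapshot, traffic=0.01) <= TOLERANCE,
    ])
```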
17 Cost analysis
The cost shape changes fundamentally — from "scales with QPS" to "scales with rebuild frequency + cache size."
Where the dollars go
| Cost line | Classic RAG | Precompiled RAG |
|---|---|---|
| Per-query embedding | 1× per query | 0 at runtime |
| Per-query vector search | 1× per query | 0 at runtime |
| Per-query rerank | 1× per query | 0 at runtime |
| Per-query LLM tokens | large input + output | tiny (formatting only) |
| Vector DB infra | always-on, scales with QPS | none at runtime |
| Compiler runs | n/a | per-rebuild fixed cost |
| Cache storage | n/a | small, predictable |
Worked example
Suppose 1M queries/month, 70% hit L1, 25% hit L2, 5% hit L3:
- Classic RAG — every query: ~3K input tokens, ~300 output tokens. ~$0.005 each → $5,000/mo, plus vector DB infra.
- Precompiled RAG — L1 free, L2 ~50 input/200 output ($0.0001), L3 ~2K input/300 output ($0.0015). Weighted: ~$100/mo + ~$300/mo compiler runs + cache infra — roughly 12× cheaper on these numbers, and the gap widens once the always-on vector DB is counted.
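The weighted arithmetic, spelled out:

```python
# Worked example: 1M queries/month at the stated hit rates and unit costs.
q = 1_000_000
precompiled = q * (0.70 * 0 + 0.25 * 0.0001 + 0.05 * 0.0015)  # = $100/mo
classic = q * 0.005                                            # = $5,000/mo
print(classic / (precompiled + 300))                           # ~12.5x on LLM spend
```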
The bigger the head of your query distribution, the bigger the savings. For a long-tailed corpus, hybrid splits the difference. These are representative numbers — your mileage varies with model choice, hit-rate distribution, and provider prompt-cache discounts (which can drop the precompiled cost further).
18 Anti-patterns
Mistakes that look reasonable but undermine the whole pattern.
| Anti-pattern | Why it hurts | Do this instead |
|---|---|---|
| Shipping a vector store to runtime "just in case" | Defeats the simplicity and operational cost wins. | Strip retrieval from the runtime container; route misses to a deterministic fallback. |
| Patching cache entries in place | Torn reads, inconsistent answers across users, no audit. | Compile new snapshot, publish atomically by switching the active pointer. |
| Cache key without version | Rollouts and rollbacks become destructive flushes. | Always include version in the key; keep ≥2 versions warm. |
| Letting the LLM answer freely on miss | Reintroduces hallucinations into a "deterministic" system. | Deterministic safe answer + escalation, never free generation. |
| Time-of-day in the cached prefix | Invalidates the provider prompt cache every request. | Move time-varying tokens after the cached region or out of the prompt entirely. |
| Compiling without a regression suite | You can't prove a new snapshot is at least as good. | Maintain a labeled regression query set; CI it into the publish gate. |
| Skipping the miss log | The cache stagnates; long tail never gets folded back into the head. | Log every miss with intent + slots; review before each rebuild. |
| Cross-tenant cache without tenant in the key | One tenant's policy answer leaks to another. | Per-tenant key prefix or per-tenant cache namespace. |
| Generating without verification | Hallucinated artifacts get baked into the cache and served to all users. | Second-pass verifier + sample human review before publish. |
| One giant L3 prompt block per topic | Costs more tokens per call; cache eviction churn. | Split topics finely; let intent classifier route to the smallest relevant block. |
19 Production checklist
Before declaring a precompiled-RAG system production-ready.
Compiler
- Idempotent — same source produces the same snapshot.
- Deterministic versioning (date or monotonic counter).
- Verifier rejects ungrounded artifacts.
- Snapshots stored immutably in object storage.
- Runs on schedule + on source-change webhooks.
Cache
- Keyed by `domain + intent + version + language [+ tenant]`.
- ≥ 2 versions kept warm for instant rollback.
- Active version controlled by a single small config key.
- Per-tenant isolation verified by automated test.
Runtime
- L1 → L2 → L3 → fallback ladder implemented.
- No vector libraries / embedding clients in the runtime image.
- SLOs enforced per layer with auto-degrade.
- Determinism: `temperature=0`, structured outputs.
- Every served answer is audit-logged with layer hit + version.
Eval & ops
- Regression suite of canonical queries gates every publish.
- Drift report against previous snapshot reviewed before promotion.
- Miss log monitored; long-tail questions feed back into next rebuild.
- Per-layer hit-rate dashboards live.
- Provider prompt-cache hit rate tracked (if applicable).
- Documented runbook for emergency rollback (single config key flip).