
Advanced Production-Grade
RAG Pipeline Implementation

Building enterprise-ready retrieval-augmented generation systems with semantic search, adaptive policies, and self-correction loops

LLM Orchestration · Vector Search · Semantic Retrieval · Embedding Models · Self-Correction Loops · MLOps

22 comprehensive sections covering architecture, implementation, and production deployment

01 / Cover

What is RAG?

RAG is not a single model but a complete AI application architecture that combines retrieval systems with language models to ground responses in external knowledge.

RAG = AI Application Architecture:
  • Knowledge Sources — docs, databases, APIs, vector stores
  • Retrieval System — search, rank, rerank, filter
  • LLM Generator — answer synthesis, grounding, citations
  • Self-Correction / Reflection

Three Pillars

  • Multi-stage Retrieval — Hybrid, sparse, dense, and ranked retrieval
  • Adaptive Policies — Context-aware retrieval strategies
  • Self-Correction Loops — Reflection and iterative refinement

Enterprise Risks

Beyond hallucinations, consider:

  • ⚠ Permission leakage
  • ⚠ Prompt injection attacks
  • ⚠ Data poisoning
  • ⚠ Unbounded cost/latency
  • ⚠ Silent quality regressions

Use Cases

  • 📄 Employee Knowledge Work — Internal docs, wikis
  • 🤝 Customer Support — FAQs, tickets, logs
  • 📊 Structured+Unstructured — Reports, databases, forms
  • 🎨 Multimodal Knowledge — PDFs, images, videos

Indexing Plane

Off-line: Data ingestion, parsing, chunking, embedding, and vector storage. Built once, queried many times.

Serving Plane

On-line: Query processing, retrieval, reranking, LLM inference, and safety checks. Low-latency, high-throughput.

OWASP LLM Top 10: This presentation addresses major threats including prompt injection (LLM01), insecure output handling (LLM02), training data poisoning (LLM03), and model denial of service (LLM04).
02 / Overview

Full Architecture: Two Planes + Governance

Complete end-to-end RAG architecture with indexing, serving, and governance layers

INDEXING PLANE (Offline): Data Sources → Connectors → Parser/OCR (Unstructured) → Chunking + Metadata → Embedding Jobs → Vector Store / Object Store
SERVING PLANE (Online): User Query → API Gateway (Auth + Query Processing) → Sparse + Dense Retrieval → Fusion (RRF) → Reranker → Context Builder → LLM Orchestrator (Prompt) → LLM Inference (Generation) → Response + Citations + Safety Checks
GOVERNANCE: Tracing + Metrics · Policies (PII/ACL) · Eval (RAGAS)

IndexingPipeline

class IndexingPipeline:
    def ingest(self, source):
        docs = self.connector.fetch(source)
        chunks = self.chunker.split(docs)
        embeddings = self.model.embed(chunks)
        self.vector_db.upsert(embeddings)
        # Update metadata, manage versions

QueryPipeline

class QueryPipeline:
    def query(self, user_q):
        sparse = bm25(user_q)
        dense = vector_search(user_q)
        fused = rrf_fusion(sparse, dense)
        ranked = reranker.score(fused)
        return llm.generate(ranked)
03 / Architecture

Document Ingestion: Connectors + Contracts

Reliable data onboarding with standardized contracts, multi-format support, and three-speed indexing

Canonical Document Contract

Every document must conform to this schema for consistent retrieval and governance:

{ "id": "doc-uuid", # Unique identifier "tenant_id": "org-123", # Multi-tenancy support "acl": ["user-1", "group-2"], # Access control list "source": "salesforce", # Provenance "timestamp": "2025-03-17", # Ingestion time "version": 2, # Document version "content": "...", # Raw/parsed text "metadata": {...} # Custom fields }

Connector Types & Data Sources

Enterprise Document Parsing

Apache Tika, Unstructured.io — Parse PDFs, DOCX, images with layout preservation and OCR support

parsed = tika.extract(
    filename,
    ocr=True,
    extract_tables=True,
)

Structured Data

CRM, ERP, Databases — Direct queries or treat as "tool use" for on-demand retrieval; knowledge views

records = fetch_from_salesforce(
    "Contact",
    filters={"updated_at": last_sync},
)

Streaming & CDC

Debezium → Kafka — Real-time event streams from databases; capture inserts, updates, deletes

stream = kafka.subscribe(
    "postgres.public.documents"
)

Web Content

Crawlers with Compliance — Respect robots.txt, rate limits, GDPR; extract HTML/JSON with link tracking

docs = crawl(
    seed_urls,
    max_depth=3,
    respect_robots=True,
)

Multimodal Sources

OCR + Image Embeddings — Extract text from images, create vision embeddings; preserve layout

text, img_emb = extract_image(
    img_path,
    vision_model="CLIP",
)

Custom Connectors

Plugin API — Implement standardized interface for proprietary systems, internal APIs, legacy apps

class MyConnector(Connector):
    def fetch(...): pass

Three-Speed Indexing Model

1

Batch Rebuilds

Full reindexing of large datasets weekly/monthly; highest throughput, controlled resources. Use for bulk imports, historical data.

2

Incremental Upserts

Append new chunks, update modified docs via change detection; moderate latency (seconds). Triggered by scheduled jobs or webhooks.

3

Real-Time Streams

Event-driven CDC or message queue ingestion; sub-second latency for hot data. Use for live chat logs, sensor feeds, user events.

Frequency spectrum: Batch Rebuilds (full re-embed · nightly/weekly) → Incremental Upserts (delta updates · minutes to hours) → Real-Time Streams (CDC/events · seconds)
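To make the incremental tier concrete, here is a minimal sketch of hash-based change detection; the doc_store, chunker, embedder, and vector_db interfaces are illustrative and do not come from a specific library:

import hashlib

def incremental_upsert(docs, doc_store, chunker, embedder, vector_db):
    """Tier 2: re-embed only documents whose content changed since the last sync."""
    for doc in docs:
        digest = hashlib.sha256(doc.content.encode("utf-8")).hexdigest()
        if doc_store.get_hash(doc.id) == digest:
            continue                                    # unchanged; skip re-embedding
        vector_db.delete(filter={"doc_id": doc.id})     # drop stale chunks first
        chunks = chunker.split(doc)
        vector_db.upsert(chunks, embedder.embed(chunks))
        doc_store.set_hash(doc.id, digest)              # record the new version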

Element-Aware PDF Parsing

Extract with positional metadata (page, bbox, reading order). Preserve table structure, preserve images. Enables citation anchoring.

Dead Letter Queue Pattern

Send unparseable docs to DLQ for manual inspection; enable retry with fallback parsers or human review. Never silently drop data.

ProductionIngester Example

class ProductionIngester:
    def __init__(self, config):
        self.connectors = config.connectors   # Multi-source
        self.parser = TikaParser(ocr=True)
        self.chunker = SemanticChunker()
        self.vector_db = PineconeDB()
        self.dlq = DeadLetterQueue()
        self.metrics = PrometheusMetrics()

    def ingest_batch(self, source, docs):
        for doc in docs:
            try:
                parsed = self.parser.extract(doc)
                chunks = self.chunker.split(parsed)
                self.vector_db.upsert(chunks)
                self.metrics.inc("ingested_docs")
            except ParsingError as e:
                self.dlq.enqueue(doc, error=e)
                self.metrics.inc("dlq_docs")
04 / Ingestion
Preprocessing

Chunking Strategies

Chunk for retrieval (findability) and store separate representations for generation (readability)

Strategy | Description | Best For | Trade-offs
Fixed-Size | Split at token/word boundary | Predictable, simple baseline | May split sentences; low semantic coherence
Recursive | Split recursively by delimiters (newline, paragraph, sentence) | Structured documents, code | Still may cut semantically important boundaries
Semantic | Embed sentences, split at embedding distance threshold | Narrative text, research papers | Expensive; latency + cost for embedding all chunks
Document-Structure | Respect sections, headings, tables, code blocks | Mixed-format documents (PDFs, Markdown) | Requires parser awareness
Agentic/LLM | Use LLM to decide breaks and chunk metadata | Complex domain logic, multilingual | High cost and latency; not real-time
Sliding Window | Overlapping fixed-size chunks with stride | Preserve local context, boundary queries | Higher storage; redundant retrieval
Parent-Child (Sentence-Window) | Store fine-grained chunks; expand with surrounding context at retrieval | Precision + context balance | Requires two-stage retrieval; complex indexing

SemanticChunker: Embedding Similarity Breakpoints

class SemanticChunker:
    def __init__(self, embedding_model, threshold=0.5):
        self.embed = embedding_model
        self.threshold = threshold

    def split(self, text):
        sentences = self.sent_tokenize(text)
        embeddings = self.embed.batch_embed(sentences)
        chunks, current = [], []
        for i, (sent, emb) in enumerate(zip(sentences, embeddings)):
            if i > 0:
                # Cosine similarity to previous sentence
                sim = cosine_similarity(emb, embeddings[i - 1])
                if sim < self.threshold and current:
                    # Semantic break detected
                    chunks.append(" ".join(current))
                    current = []
            current.append(sent)
        if current:
            chunks.append(" ".join(current))
        return chunks
Production Best Practices:
  • Chunk size: 256–512 tokens (optimal for retrieval + generation trade-off)
  • Overlap: 10–15% to preserve boundary context
  • Metadata inheritance: Propagate doc_id, section, source to every chunk
  • Context enrichment: Prepend section headers or document title to chunk
  • Element-aware parsing: Preserve tables, code blocks, images as intact units in PDFs
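As a baseline illustration of the size and overlap guidance above, a minimal sliding-window chunker; whitespace tokens stand in for a real tokenizer, and 384 tokens with 48-token overlap is roughly 12%:

def sliding_window_chunks(text: str, chunk_size: int = 384, overlap: int = 48) -> list[str]:
    """Fixed-size chunks with overlap; sizes are in (approximate) tokens."""
    tokens = text.split()                 # stand-in for a real tokenizer
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
    return chunks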

Recommended: Hybrid Multi-Layer Chunking for Production

No single chunking strategy works for all document types. Production systems use a document-type router that selects the best chunking strategy per document, combined with a parent-child indexing pattern that stores small chunks for precise retrieval but returns larger context windows for generation.

Recommended: Hybrid Multi-Layer Chunking Pipeline
  • Document Router — classify doc type: PDF → structure-aware, Markdown → heading split, Code → AST-aware
  • Element Parser — extract typed elements: Title, NarrativeText, Table, CodeBlock, Image, ListItem
  • Semantic Splitter — embedding breakpoints; 256–512 token target; cosine-distance splits; keep tables/code intact
  • Parent-Child Index — dual storage: child → embed + search, parent → full section; return parent on match
  • Enrich + Store — prepend heading, add source metadata, embed → vector DB

Query time (parent-child retrieval pattern): embed query → search child chunks (precise semantic match, top-K) → expand to parent (retrieve full section context) → dedup + rerank (remove overlaps, score) → LLM context (coherent, complete).

Why this works: small chunks give precise retrieval (high recall@K); parent expansion gives coherent context (high faithfulness). Child chunk = 128–256 tokens (search unit) → parent window = 512–1024 tokens (context unit) → sent to the LLM with full surrounding context. Result: 15–25% higher faithfulness vs flat chunking at the same recall, per RAGAS benchmarks.

Production Chunking Pipeline

class ProductionChunkingPipeline:
    """Route-aware, parent-child, metadata-enriched."""

    def __init__(self):
        self.router = DocTypeRouter()
        self.parsers = {
            "pdf": UnstructuredParser(strategy="hi_res"),
            "markdown": MarkdownHeaderSplitter(),
            "html": HTMLSectionSplitter(),
            "code": ASTChunker(),              # tree-sitter
            "plaintext": SemanticChunker(),
        }
        self.semantic = SemanticChunker(
            model="all-MiniLM-L6-v2",
            max_tokens=384,                    # child chunk size
            threshold=0.5,
        )

    def process(self, doc: Document) -> list[Chunk]:
        # Step 1: Route to parser
        doc_type = self.router.classify(doc)
        elements = self.parsers[doc_type].parse(doc)
        # Step 2: Create parent sections
        parents = self.group_into_sections(elements)
        # Step 3: Split parents into child chunks
        all_chunks = []
        for parent in parents:
            children = self.semantic.split(parent.text)
            for i, child_text in enumerate(children):
                chunk = Chunk(
                    text=child_text,
                    parent_id=parent.id,
                    parent_text=parent.text,   # stored separately
                    position=i,
                    metadata=self.enrich(doc, parent, child_text),
                )
                all_chunks.append(chunk)
        return all_chunks

    def enrich(self, doc, parent, text):
        """Prepend context + propagate metadata."""
        return {
            "doc_id": doc.id,
            "source": doc.source,
            "section": parent.heading,
            "page": parent.page_num,
            "doc_type": doc.content_type,
            "indexed_at": datetime.utcnow(),
            # Prepended for better retrieval:
            "enriched_text": (
                f"{doc.title} > {parent.heading}\n"
                f"{text}"
            ),
        }

Parent-Child Retrieval at Query Time

class ParentChildRetriever:
    def search(self, query, top_k=5):
        # 1. Search CHILD chunks (precise)
        children = self.vector_db.search(
            query, top_k=top_k * 3   # over-fetch
        )
        # 2. Expand to PARENT sections
        parent_ids = set(c.parent_id for c in children)
        parents = self.doc_store.get_parents(parent_ids)
        # 3. Deduplicate + rank parents by best child match score
        scored = {}
        for child in children:
            pid = child.parent_id
            if pid not in scored or child.score > scored[pid]:
                scored[pid] = child.score
        ranked = sorted(
            parents,
            key=lambda p: scored[p.id],
            reverse=True,
        )[:top_k]
        return ranked   # full parent context

Chunk Size Guide by Document Type

Doc Type | Child (Search) | Parent (Context) | Strategy
Product docs | 128–256 tok | 512–1024 tok | Heading-based + semantic
Legal / Policy | 256–384 tok | 1024–2048 tok | Section-based, keep clauses intact
Research papers | 256–512 tok | 1024–2048 tok | Semantic breakpoints
FAQ / KB | Whole Q&A pair | Same (no parent) | Question-Answer as unit
Code | Function/class | File or module | AST-aware (tree-sitter)
Chat logs | Single turn | Full conversation | Turn-based splitting
Tables / CSV | Row group | Full table + header | Keep header with every chunk

Why Parent-Child Wins

Problem: Small chunks retrieve precisely but lose context. Large chunks give context but pollute retrieval with irrelevant text.

Solution: Index small (128–256 tok) for search precision. At retrieval time, expand to the parent section (512–1024 tok) for coherent LLM context. Best of both worlds.

Context Enrichment (Prepending)

Prepend the document title and section heading to each chunk before embedding. This dramatically improves retrieval for ambiguous queries.

# Without enrichment:
"Returns are accepted within 30 days."

# With enrichment:
"Product Policy > Returns & Refunds\n"
"Returns are accepted within 30 days."
# Now retrieves for "return policy" queries

Tools for Production

Parsing: unstructured.io (hi_res), LlamaParse, Docling
Splitting: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceWindowNodeParser
Semantic: Sentence Transformers + custom breakpoint
Code: tree-sitter (AST), CodeSplitter
Parent-Child: LlamaIndex ParentDocumentRetriever, custom doc_store + vector_db combo

Production Recommendation: Start with recursive character splitting (LangChain default) as your baseline — it's simple and works surprisingly well. Add parent-child retrieval once you have evaluation metrics showing context gaps. Only add semantic chunking when recursive splitting provably fails on narrative/research content. Measure every change against your eval set — chunking changes can regress retrieval quality in unexpected ways.
05 / Chunking Strategies
Core ML

Embedding Models & Strategies

[Figure: 2-D embedding space — the query vector sits near relevant documents and far from irrelevant ones]
Model Family | Type | Notable Capabilities | Operational Considerations | Cost/Latency
OpenAI text-embedding-3 | API | small/large variants; dimension shortening; multilingual | Quota limits; regional latency | $0.02/M tokens (small); higher for large
Cohere Embed v3/v4 | API | Multilingual + multimodal (text+image); fine-tuning available | Document and query encoding modes | $0.10/1M tokens
BGE-M3 | Open-source (HuggingFace) | Multilingual, multi-function (dense+sparse+multi-vector) | 8192-token context; self-hosted overhead | Free; requires GPU infrastructure
Multilingual E5 | Open-source | Strong multilingual; published training/eval methodology | Community-maintained; good reproducibility | Free; 2–5ms per chunk on A100
GTE-Qwen2 (7B) | Open-source | State-of-the-art; 131K context window | Larger model; requires more VRAM | Free; ~20ms/chunk on A100
voyage-3-large | API | Long-context (128K) + code understanding | Premium pricing; excellent for code RAG | $0.15/1M tokens
nomic-embed-text-v1.5 | Open-source | Matryoshka embeddings; dimension flexibility | Efficient storage; truncation-stable | Free; 3–4ms latency on CPU

EmbeddingService: Caching, Rate-Limiting, Batch Processing

class EmbeddingService:
    def __init__(self, model, redis_cache, rate_limiter):
        self.model = model
        self.cache = redis_cache      # Cache embeddings by text hash
        self.limiter = rate_limiter   # Token/sec quota

    def embed_batch(self, texts: List[str]) -> List[ndarray]:
        missing = [t for t in texts if not self.cache.get(hash(t))]
        if missing:
            self.limiter.wait(len(missing))   # Rate limit
            embeds = self.model.encode(missing, batch_size=32)
            for text, emb in zip(missing, embeds):
                self.cache.set(hash(text), emb, ttl=7 * 24 * 3600)
        return [self.cache.get(hash(t)) for t in texts]

Matryoshka Embeddings

Truncate high-dimensional embeddings to lower dimensions without retraining. Trade-off: smaller vectors (storage shrinks in proportion to the truncation, e.g. 1536 → 512 cuts vector storage by ~3x) vs. slight accuracy loss.
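A minimal sketch of Matryoshka-style truncation, assuming the embedding model was trained with Matryoshka representation learning (e.g. nomic-embed-text-v1.5 or text-embedding-3); truncated vectors must be re-normalized before cosine search:

import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dims: int = 512) -> np.ndarray:
    """Keep the first `dims` components and re-normalize for cosine search."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# full = model.encode(chunks)              # e.g. shape (N, 1536)
# small = truncate_matryoshka(full, 512)   # shape (N, 512), ~3x less storage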

Fine-Tuning with Contrastive Learning

Train embeddings on domain-specific relevance pairs using triplet loss. Improves domain-specific retrieval by 15–30% with 5K–50K labeled pairs.
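A minimal training sketch with the sentence-transformers fit API, shown here with in-batch negatives (MultipleNegativesRankingLoss) rather than an explicit triplet loss; the base model, batch size, and domain_pairs data are placeholders:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")   # base model: placeholder

# domain_pairs: 5K-50K labeled (query, relevant_passage) tuples from your domain
train_examples = [InputExample(texts=[query, passage]) for query, passage in domain_pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Every other passage in the batch acts as a negative for each query
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=100,
    output_path="./models/domain-embedder",
)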

Instruction-Tuned Embeddings

Prepend task instructions ("Retrieve document for query: ") to asymmetrically encode queries vs. documents. Boosts retrieval by leveraging prompt tuning.
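A small sketch of asymmetric encoding; the exact prefix or instruction string is model-specific (check the model card), and the examples below follow the E5 and BGE conventions:

# E5-style: both sides get a role prefix
q_emb = model.encode("query: how do I reset my password", normalize_embeddings=True)
p_emb = model.encode("passage: To reset your password, open Settings > Security...", normalize_embeddings=True)

# BGE-style: instruction on the query side only; documents are encoded as-is
instruction = "Represent this sentence for searching relevant passages: "
q_emb = model.encode(instruction + "how do I reset my password", normalize_embeddings=True)
p_emb = model.encode("To reset your password, open Settings > Security...", normalize_embeddings=True)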

How Semantic Search Works

Semantic search retrieves content by meaning rather than keyword overlap. Both the query and every document chunk are encoded into high-dimensional vectors using the same embedding model. Similar meanings map to nearby points in that vector space, so ranking by vector distance surfaces conceptually relevant chunks — even when they share no words with the query.

1. Encode

The embedding model transforms text into a fixed-length dense vector (typically 384–3072 dims). Each dimension captures a latent semantic feature learned during pre-training on billions of text pairs.

2. Index

Document vectors are stored in an ANN index (HNSW, IVF-PQ, ScaNN). The index trades a small amount of recall for sub-linear search across millions to billions of vectors.

3. Score & Rank

At query time, the query vector is compared against candidates using cosine similarity, dot product, or Euclidean distance. Top-K nearest neighbors are returned as the retrieval set.

Similarity Metrics at a Glance

Metric | Formula (intuition) | When to Use | Notes
Cosine Similarity | angle between vectors; magnitude-invariant | Default for most text embeddings (OpenAI, BGE, E5) | Robust to varying text length; values in [-1, 1]
Dot Product | sum of element-wise products | Models trained with normalized vectors; fastest on GPU | Equivalent to cosine when vectors are L2-normalized
Euclidean (L2) | straight-line distance in vector space | Image embeddings; some classical IR models | Sensitive to magnitude; rarely optimal for text
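A quick NumPy check of the relationships the table relies on; the vectors here are random and purely illustrative:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q, d = np.random.rand(384), np.random.rand(384)
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)

# Dot product equals cosine similarity once vectors are L2-normalized
assert abs(cosine(q, d) - float(qn @ dn)) < 1e-9

# Squared Euclidean distance on normalized vectors is 2 - 2*cosine,
# so ranking by L2 distance matches ranking by cosine in that case
assert abs(float(np.sum((qn - dn) ** 2)) - (2 - 2 * float(qn @ dn))) < 1e-9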

Minimal Semantic Search Loop

class SemanticSearcher:
    def __init__(self, encoder, index):
        self.encoder = encoder   # Embedding model (e.g. BGE-M3)
        self.index = index       # ANN index (HNSW / IVF-PQ)

    def index_corpus(self, docs):
        vectors = self.encoder.encode(docs, normalize=True)
        self.index.add(vectors, ids=[d.id for d in docs])

    def search(self, query: str, top_k=10):
        q_vec = self.encoder.encode([query], normalize=True)[0]
        # Cosine == dot product on normalized vectors
        scores, ids = self.index.search(q_vec, k=top_k)
        return [(id_, score) for id_, score in zip(ids, scores)]
Semantic vs Lexical vs Hybrid: Lexical search (BM25) excels at exact terms, codes, and rare tokens. Semantic search excels at paraphrase, synonyms, and cross-lingual intent. Production RAG systems typically run both in parallel and fuse results with Reciprocal Rank Fusion (RRF) to get the best of both worlds.

Different Embedding Model Families

Not all embeddings are created equal. Model choice depends on modality, context window, language coverage, latency budget, and deployment constraints. The landscape breaks down into a handful of architectural families.

Dense Bi-Encoders

Encode query and document independently into a single dense vector. Fast retrieval via ANN. Examples: text-embedding-3, BGE-large, E5, GTE, nomic-embed.

Sparse / Learned Sparse

Produce high-dimensional sparse vectors over vocabulary terms with learned term weights. Combines keyword precision with neural context. Examples: SPLADE++, BGE-M3 sparse, uniCOIL.

Multi-Vector (ColBERT-style)

Emit one vector per token and score with MaxSim late-interaction. Higher recall on fine-grained queries at the cost of storage. Examples: ColBERTv2, Jina-ColBERT, BGE-M3 multi-vec.

Cross-Encoders (Rerankers)

Jointly encode (query, document) pairs and output a relevance score. Too slow for first-stage retrieval but ideal for reranking top-100 candidates. Examples: bge-reranker-v2, Cohere Rerank 3, Jina Reranker.

Multilingual Models

Trained on 100+ languages so queries in one language retrieve documents in another. Examples: multilingual-e5-large, BGE-M3, Cohere embed-multilingual-v3, LaBSE.

Multimodal & Code

Share a vector space across text, images, audio, or source code for cross-modal retrieval. Examples: CLIP, SigLIP, Cohere Embed v4, voyage-code-3, jina-embeddings-v3.

Choosing an Embedding Model — Decision Checklist

Requirement | Recommended Family | Concrete Options
Fastest time-to-value, managed API | Dense bi-encoder | OpenAI text-embedding-3-small, Cohere embed-v4, voyage-3
On-prem / data residency | Open-source dense | BGE-large-en, E5-large-v2, GTE-Qwen2, nomic-embed-v1.5
Multilingual corpus (50+ languages) | Multilingual dense / hybrid | BGE-M3, multilingual-e5, Cohere embed-multilingual-v3
Keyword-heavy (legal, medical codes) | Sparse + dense hybrid | SPLADE++ + BGE, BGE-M3 (dense+sparse+multi-vec)
Highest accuracy, storage available | Multi-vector + reranker | ColBERTv2 / BGE-M3 + bge-reranker-v2
Source code retrieval | Code-tuned dense | voyage-code-3, jina-embeddings-v3-code, CodeSage
Images + text together | Multimodal bi-encoder | CLIP, SigLIP, Cohere Embed v4, Nomic Embed Vision
Very long context (>32K tokens) | Long-context dense | voyage-3-large (128K), GTE-Qwen2 (131K), jina-v3 (8K+)
Compliance Note: Where data is processed matters. API-based embeddings (OpenAI, Cohere) send data to third-party servers. Self-hosted models (BGE, E5, GTE-Qwen) keep data in-house. Evaluate data residency, privacy, and contractual requirements before choosing.
06 / Embedding Models
Storage Layer

Vector Database Selection

FAISS is a similarity search library, not a networked vector database. Production RAG requires distributed, replicable systems.

[Figure: HNSW index layers — the query enters at Layer 2, descends through Layer 1, and searches the dense base Layer 0 for nearest neighbors]
Database | Architecture | Key Features | Scaling Model | Ops Burden
FAISS | In-memory library | Highest performance; no persistence | Single-node only | High (build/rebuild cycles)
Milvus | Distributed (K8s native) | Multi-replica, auto-sharding, metadata filtering | Horizontal (scale nodes) | High (K8s expertise required)
Pinecone | Managed SaaS | Serverless, metadata filtering, pod-type scaling | Serverless (auto) | Low (fully managed)
Weaviate | Hybrid (vector+BM25) | Combined dense/sparse search, replication controls | Cluster-based | Medium
Chroma | Lightweight SQLite/in-memory | Simple API; good for prototypes | Single-node | Low (dev only)
Elasticsearch | Existing infra (if already deployed) | Dense vectors + BM25 + analytics | Cluster-based | Medium
pgvector | PostgreSQL extension | SQL + vectors; ACID transactions | Postgres replication | Medium

Qdrant/Milvus Production Config: HNSW + Quantization + Replication

# Qdrant production config (YAML)
collection:
  name: rag_documents
  vectors:
    size: 1536            # OpenAI embedding dim
    distance: "cosine"
  hnsw_config:
    m: 16                 # graph connectivity
    ef_construct: 500     # trade-off precision/speed
    ef: 100
  quantization_config:
    scalar:
      type: "int8"        # 8-bit quantization
      quantile: 0.99
  replication_factor: 3   # high availability
  sharding: "auto"
  ttl: 2592000            # 30-day retention

Key Design Decisions

  • HNSW vs IVF: HNSW faster recall, IVF better for billion-scale; prefer HNSW for sub-100M datasets
  • Quantization: 8-bit scalar quantization saves 4x memory with <2% recall loss; essential for cost control
  • Namespaces/Partitions: Isolate indices by tenant, project, or time period for multi-tenancy and retention
  • Replication: RF=3 minimum for production SLA; prevents single-point failures
  • TTL & Garbage Collection: Auto-expire old chunks; configure cleanup policies for cost
  • Backup & Point-in-Time Recovery: Daily snapshots; test restore procedures quarterly
07 / Vector Database
Query Pipeline

Query Transformation — From One Query to Many

Users ask vague, ambiguous, or narrowly-worded questions. A single embedding of that raw query often misses relevant chunks. Query transformation rewrites, decomposes, and expands the user's query into multiple targeted search queries — dramatically improving chunk filtering and retrieval quality.

Query Transformation Pipeline — Recommended Production Strategy
  ① Classify intent — user query "how do I fix auth?": simple → skip transform; complex → full pipeline
  ② Transform strategies (parallel) — Original: "how do I fix auth?" · Rewrite: "troubleshoot authentication" · Technical: "401 403 OAuth token error" · Step-back: "auth system architecture" · HyDE: [hypothetical answer doc] · Decompose: Q1 + Q2 + Q3 sub-queries
  ③ Parallel retrieve — 6 queries × top-K each via asyncio.gather() → RRF merge → deduplicate → unique chunks pool
  ④ Rerank — against the ORIGINAL query (not the variants) → top-5 to the LLM
Key insight: variants improve RECALL (find more relevant chunks); reranking against the original restores PRECISION (filter to the best). Typical improvement: Recall@5 goes from 62% (single query) → 85% (multi-query) → 94% (multi-query + rerank).

Six Query Transformation Strategies

1. Query Rewriting

LLM rewrites the query to be clearer and more search-friendly. Fixes typos, expands abbreviations, makes implicit context explicit.

# Input: "how 2 fix auth" # Output: "How to troubleshoot and fix # authentication errors" prompt = f"""Rewrite this query to be clearer for a search engine. Fix typos, expand abbreviations. Query: {query}"""

When: Always. First step in every pipeline. Cheap and fast (~50ms with Haiku).

2. Multi-Query Expansion

Generate 3–5 diverse reformulations targeting different vocabulary, specificity levels, and perspectives.

# Input: "fix auth errors" # Output: # - "authentication failure troubleshoot" # - "401 403 OAuth token expired" # - "login session invalid API key" # - "how to debug access denied"

When: Ambiguous or broad queries. Biggest recall improvement (15–30%). See deep-dive in Retrieval section.

3. Step-Back Prompting

Generate a higher-level abstract query to retrieve foundational context, then the specific query for details.

# Input: "why does JWT expire in 15min" # Step-back: "JWT token lifecycle and # security best practices" # Then search BOTH queries: # → foundational + specific chunks

When: "Why" questions, conceptual queries. Provides background context the LLM needs to reason.

4. HyDE (Hypothetical Document)

Ask the LLM to generate a hypothetical answer, embed THAT, and search for similar real documents. Bridges the query-document embedding gap.

# Input: "fix auth errors" # LLM generates hypothetical doc: hypo = "To fix authentication errors, first check if your OAuth token has expired. Refresh using the /auth/refresh endpoint..." # Embed hypo → search → find real docs # that are SIMILAR to this answer

When: Technical queries where query language differs from document language. Adds ~300ms latency.

5. Query Decomposition

Break multi-part or complex questions into atomic sub-queries, retrieve for each independently, then merge.

# Input: "compare pricing of Plan A # vs Plan B and which has # better support" # Decompose into: # Q1: "Plan A pricing details" # Q2: "Plan B pricing details" # Q3: "Plan A support features" # Q4: "Plan B support features"

When: Compound questions, comparisons, multi-entity queries. Critical for completeness.

6. Metadata Filter Extraction

Extract structured filters (date, category, product, region) from the query to narrow the search pool BEFORE vector search.

# Input: "2024 return policy for EU" # Extract: # - filter: year=2024 # - filter: region=EU # - query: "return policy" # → pre-filter chunks THEN embed search

When: Queries with temporal, geographic, or categorical constraints. Dramatically reduces search pool.

Recommended Production Strategy: Adaptive Query Transform

Don't apply all strategies to every query — that's wasteful and slow. Instead, classify the query complexity and apply the minimum transformation needed. Simple factual queries need only rewriting; complex multi-part queries need decomposition + expansion.

AdaptiveQueryTransformer — Production Implementation

class AdaptiveQueryTransformer:
    """Classify query → apply minimum transform.
    Simple queries: just rewrite (50ms).
    Complex queries: full pipeline (200-400ms)."""

    def __init__(self, llm, fast_llm):
        self.llm = llm          # strong model
        self.fast = fast_llm    # Haiku / mini
        self.classifier = QueryClassifier()
        self.cache = TransformCache(ttl=3600)

    async def transform(self, query: str) -> TransformResult:
        # Check cache first
        cached = self.cache.get(query)
        if cached:
            return cached

        # Step 1: Classify query complexity
        qtype = self.classifier.classify(query)

        # Step 2: Route to appropriate strategy
        if qtype == "simple_factual":
            # "What's the return policy?" → just rewrite
            queries = [await self.rewrite(query)]
            filters = self.extract_filters(query)
        elif qtype == "ambiguous":
            # "fix auth" → rewrite + expand
            rewritten = await self.rewrite(query)
            expanded = await self.expand(query, n=3)
            queries = [rewritten] + expanded
            filters = self.extract_filters(query)
        elif qtype == "compound":
            # "compare A vs B pricing + support"
            sub_queries = await self.decompose(query)
            queries = sub_queries
            filters = self.extract_filters(query)
        elif qtype == "conceptual":
            # "why does X happen?" → step-back + specific
            abstract = await self.step_back(query)
            queries = [query, abstract]
            filters = {}
        elif qtype == "technical":
            # Technical jargon → HyDE + expand
            hyde_doc = await self.generate_hyde(query)
            expanded = await self.expand(query, n=2)
            queries = [query] + expanded
            hyde_queries = [hyde_doc]   # separate embed
            filters = self.extract_filters(query)
        else:
            # Fallback: rewrite + 2 expansions
            queries = [query] + await self.expand(query, 2)
            filters = {}

        result = TransformResult(
            original=query,
            queries=queries,
            filters=filters,
            strategy=qtype,
        )
        self.cache.set(query, result)
        return result

Query Classification — Route to Strategy

Query Type | Example | Strategy | Latency
Simple factual | "What's the return policy?" | Rewrite only | ~50ms
Ambiguous | "fix auth errors" | Rewrite + Expand(3) | ~200ms
Compound | "compare A vs B pricing + support" | Decompose into sub-Qs | ~250ms
Conceptual | "why does JWT expire?" | Step-back + specific | ~150ms
Technical | "CORS preflight 403 nginx" | HyDE + Expand(2) | ~400ms
Lookup | "order #12345 status" | Extract ID → direct DB | ~5ms

Query Classifier Implementation

class QueryClassifier:
    """Fast classifier: embedding + rules. ~5ms.
    No LLM call needed."""

    def classify(self, query: str) -> str:
        # Rule-based fast path
        if re.match(r"(order|tracking|#)\s*\d+", query):
            return "lookup"
        if "vs" in query or "compare" in query:
            return "compound"
        if query.startswith(("why", "how does", "explain")):
            return "conceptual"
        if len(query.split()) <= 6:
            return "ambiguous"
        # Embedding-based classifier for the rest
        emb = self.encoder.encode(query)
        pred = self.classifier_model.predict(emb)
        return pred   # SetFit / fine-tuned
Production Tip: Use a simple rule-based classifier for 70% of queries (lookups, simple factual, "why" questions). Only call the embedding classifier for the remaining 30%. This keeps classification under 5ms for most queries.

Metadata Filter Extraction — Pre-Filter Before Vector Search

Extract structured constraints from the query to narrow the chunk pool BEFORE embedding search. This dramatically improves precision for queries with temporal, categorical, or entity-specific constraints.

class FilterExtractor:
    """Extract structured filters from query.
    Runs in parallel with query expansion."""

    def extract(self, query: str) -> dict:
        filters = {}
        # Temporal: "2024", "last month", "recent"
        date = self.parse_date(query)
        if date:
            filters["date_after"] = date
        # Category: "pricing", "support", "API"
        category = self.classify_topic(query)
        if category:
            filters["doc_type"] = category
        # Entity: product names, plan names
        entities = self.ner.extract(query)
        if entities:
            filters["entities"] = entities
        # Region: "EU", "US", "APAC"
        region = self.detect_region(query)
        if region:
            filters["region"] = region
        return filters

# Applied to vector search:
#   db.search(query_emb, filters=filters)
#   → searches ONLY chunks matching filters

Why this matters:

Without filters, "2024 EU return policy" searches ALL chunks and relies on the embedding to distinguish 2024 EU docs from 2023 US docs. Embeddings are bad at temporal and geographic precision. Pre-filtering narrows the pool from 10M chunks to maybe 50K — making vector search both faster and more accurate.

Filter Type | Example | Extraction Method
Temporal | "2024", "this week", "latest" | Regex + dateparser
Category | "pricing", "API docs", "FAQ" | Topic classifier
Entity | Product names, plan names | NER (spaCy / custom)
Region | "EU", "US", "Germany" | Regex + geo lookup
Language | Query language detection | langdetect / fasttext
Access level | User's role / permissions | Session context (ACL)
Benchmark: Query Transform Impact on Retrieval Quality
Raw single query: Recall@5 = 62%  |  + Rewrite: 68% (+6%)  |  + Multi-Query Expand: 82% (+20%)  |  + Metadata Filters: 87% (+5%)  |  + Cross-Encoder Rerank: 94% (+7%)  |  Total lift: +32 percentage points
Production Recommendation: Start with rewrite + 3-query expansion as your default pipeline — it gives the best cost/quality tradeoff (~200ms, ~$0.001/query). Add HyDE only for technical domains where query-document vocabulary gap is large. Add decomposition only if your eval shows multi-part questions are a common failure mode. Always cache transformations (1h TTL) — the same query pattern produces the same expansions.

Latency Strategy — Generating 5 Queries in <50ms

The naive approach — call an LLM to generate 5 queries — takes 200–400ms. That's unacceptable for real-time voice agents or low-latency search. Here are four production strategies to get multi-query expansion down to <50ms.

Latency Comparison — 5 Query Generation Strategies
Strategy | Latency | Quality
Naive: single LLM call | 300–500ms (❌ too slow) | Best quality
Cached LLM expansion | 0ms hit / 300ms on cache miss | Best quality
Template + synonym expansion | 5–15ms | Good quality
Fine-tuned small model (local) | 10–30ms | Very good
★ Hybrid: template + cache + async LLM | 5–15ms P95 | Best tradeoff
✓ Recommended: serve template-generated queries instantly, then async-upgrade with LLM-expanded queries on cache miss.

Strategy 1: Template-Based Expansion (5ms)

No LLM call at all. Use rule-based templates that generate query variants from the original query using synonym dictionaries, regex patterns, and structural transformations.

class TemplateExpander:
    """Zero-LLM query expansion. ~5ms.
    Generates 5 variants using rules."""

    def __init__(self):
        self.synonyms = SynonymDict.load("domain_synonyms.json")
        # lowercase to match the lowered tokens below
        self.stopwords = set(["the", "a", "is", "how", "do", "i"])

    def expand(self, query: str) -> list[str]:
        tokens = query.lower().split()
        keywords = [t for t in tokens if t not in self.stopwords]
        variants = [query]   # always include original
        # V1: Synonym swap (most impactful)
        for kw in keywords:
            if kw in self.synonyms:
                syn = self.synonyms[kw][0]
                variants.append(query.replace(kw, syn))
                break   # one swap per variant
        # V2: Keyword-only (drop question words)
        variants.append(" ".join(keywords))
        # V3: Reversed keyword order
        variants.append(" ".join(reversed(keywords)))
        # V4: Add domain context prefix
        variants.append(f"documentation: {query}")
        return variants[:5]

# Example:
# Input: "how do I fix auth errors"
# Output: [
#   "how do I fix auth errors",                  # original
#   "how do I fix authentication errors",        # synonym
#   "fix auth errors",                           # keywords-only
#   "errors auth fix",                           # reversed
#   "documentation: how do I fix auth errors",   # prefixed
# ]

Pros: Zero latency, zero cost, deterministic. Cons: Limited diversity, no semantic understanding. Best for: First-pass expansion while LLM results are pending.

Strategy 2: Fine-Tuned Small Model (10–30ms)

Distill a large LLM's query expansion capability into a small local model (T5-small, FLAN-T5-base, or a 60M-param custom model). Runs on CPU in 10–30ms.

from transformers import AutoTokenizer, T5ForConditionalGeneration

class LocalQueryExpander:
    """Fine-tuned T5-small for query expansion.
    ~15ms on CPU. No API call."""

    def __init__(self):
        self.model = T5ForConditionalGeneration.from_pretrained(
            "./models/query-expander-t5-small"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            "./models/query-expander-t5-small"
        )

    def expand(self, query: str, n=5) -> list[str]:
        prompt = f"expand query: {query}"
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(
            **inputs,
            num_return_sequences=n,
            num_beams=n,
            max_new_tokens=64,
            do_sample=False,
        )
        return [
            self.tokenizer.decode(o, skip_special_tokens=True)
            for o in outputs
        ]

# Training data: 50K (query, expansion) pairs
# generated by GPT-4/Claude from prod logs.
# Fine-tune T5-small for 3 epochs. ~2hrs on 1 GPU.

Pros: Fast, free at inference, semantic-aware. Cons: Requires training, model maintenance. Best for: High-QPS production systems.

Strategy 3: Pre-Computed Cache (0ms hit / 300ms miss)

Cache LLM-generated expansions by normalized query. First request is slow; all subsequent identical or near-identical queries are instant. Use semantic similarity for fuzzy cache matching.

import hashlib
import json

class SemanticExpansionCache:
    """Cache LLM expansions. 0ms on hit.
    Semantic fuzzy matching for near-dupes."""

    def __init__(self, redis, encoder, llm):
        self.redis = redis         # exact cache
        self.encoder = encoder     # for fuzzy match
        self.index = FAISSIndex()  # query embedding index
        self.llm = llm             # fallback generator

    async def get_expansions(self, query: str) -> list[str]:
        # L1: Exact match (Redis, ~0.1ms)
        key = hashlib.md5(query.lower().encode()).hexdigest()
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        # L2: Semantic fuzzy match (~2ms)
        q_emb = self.encoder.encode(query)
        hits = self.index.search(q_emb, top_k=1)
        if hits and hits[0].score > 0.95:
            # "fix auth errors" ≈ "fix authentication errors"
            return json.loads(self.redis.get(hits[0].id))
        # L3: Cache miss → generate (async, don't block)
        expansions = await self.llm.expand(query)
        self.redis.setex(key, 3600, json.dumps(expansions))
        self.index.add(q_emb, key)
        return expansions

Hit rate: 40–70% for most production systems (users ask similar questions). Semantic matching pushes this to 60–85%.

★ Strategy 4: Hybrid — The Recommended Approach

Combine all three: serve template-generated queries instantly (5ms), check cache for LLM-quality expansions (0ms if hit), and fire-and-forget an async LLM call to upgrade the cache for next time.

★ Hybrid Query Expansion — Recommended Production Flow
User query: "how do I fix auth errors?" — the synchronous phases all run in parallel (~5ms total):
  • Phase 1: Template expand — synonym swap + keyword extract + domain prefix + reversed (~2ms → 5 template variants)
  • Phase 2: Cache lookup — L1: Redis exact match (0.1ms); L2: FAISS semantic fuzzy (2ms) → HIT or MISS
  • Phase 2b: Filter extract — date, category, entity (NER), region, language, ACL scope (~5ms → metadata filters)
Cache hit (60–85%): merge template + cached LLM variants → 5 best variants (0ms extra).
Cache miss: return template variants immediately (the user gets results in 5ms, no waiting), then asyncio.create_task(backfill) fires an LLM call in the background (~300ms) that writes to the cache so the next identical request is 0ms.
Return 5 queries → start parallel retrieve. Total user-facing: 5ms (hit) | 7ms (miss). First request gets template quality (5ms); a second identical request gets LLM quality from cache (0ms). Quality improves over time as the cache fills; latency never degrades — always ≤7ms user-facing.
class HybridQueryExpander:
    """5ms P95 response. Best quality over time.
    Template → Cache → Async LLM backfill."""

    def __init__(self):
        self.template = TemplateExpander()      # 5ms
        self.cache = SemanticExpansionCache()   # 0ms hit
        self.llm = LLMExpander()                # 300ms

    async def expand(self, query: str) -> list[str]:
        # Phase 1: Instant (5ms) — always available
        template_variants = self.template.expand(query)

        # Phase 2: Cache check (0–2ms)
        cached = await self.cache.get(query)
        if cached:
            # Merge template + cached LLM variants
            return self.dedupe(cached + template_variants)[:5]

        # Phase 3: Return templates NOW,
        # fire async LLM to backfill cache
        asyncio.create_task(self._async_backfill(query))
        return template_variants   # 5ms total

    async def _async_backfill(self, query):
        """Runs in background. Next identical query
        will get LLM-quality expansions."""
        try:
            expansions = await self.llm.expand(query)
            await self.cache.set(query, expansions)
        except Exception:
            pass   # template fallback is fine

Result: First request gets template variants in 5ms. Second request gets LLM-quality variants from cache in 0ms. No user ever waits for the LLM.

Complete Latency Breakdown — Query Transform Pipeline

Step | Operation | Latency | Runs | Can Parallelize?
Classify | Rule-based + embedding classifier | ~3ms | Always | –
Template expand | Synonym swap, keyword extract, prefix | ~2ms | Always | –
Cache lookup | Redis exact + FAISS semantic | ~2ms | Always | ✓ parallel with templates
Filter extract | Regex + NER for metadata | ~5ms | Always | ✓ parallel with above
LLM expand | Haiku/mini generate 5 variants | ~300ms | Cache miss only | Async (fire-and-forget)
HyDE generate | Hypothetical doc generation | ~400ms | Technical queries only | Async (fire-and-forget)
Total user-facing latency (hybrid): 5–15ms P95 (LLM runs async; result cached for next request)

Warm-Up Strategy

Pre-populate the expansion cache by running your top 10,000 queries from production logs through the LLM expander offline. This gives instant cache hits for the most common queries from day one.

# Offline warm-up script
for query in top_10k_queries:
    expansions = await llm.expand(query)
    cache.set(query, expansions)
# Run nightly. ~$3 for 10K queries.

Batch LLM Calls

If you must call an LLM synchronously, batch multiple queries into a single request. Generate all 5 variants in one prompt (not 5 separate calls). This cuts 5×300ms to 1×350ms.

# One call, 5 variants:
prompt = f"""Generate 5 diverse search queries for: "{query}"
Return as JSON array."""
# → 1 API call ≈ 300ms
# NOT: 5 calls × 300ms = 1.5s ❌

Streaming + Speculative

Start retrieval with template queries immediately. If LLM expansions arrive (from cache or async), merge them into the result set before reranking. The LLM expansions enrich, never block.

# Speculative parallel execution
template_results = retrieve(template_qs)
# If LLM expansions arrive in time:
llm_results = retrieve(llm_qs)   # bonus
merged = rrf_merge(template_results, llm_results)
# If not: template results alone are fine
The 5ms Query Transform — Summary Recipe:
① Classify query type (3ms, rule-based)
② Generate template variants (2ms, synonym + keyword)
③ Check semantic cache for LLM variants (2ms, Redis + FAISS) — in parallel with ②
④ Extract metadata filters (5ms, regex + NER) — in parallel with ②③
⑤ If cache miss: fire-and-forget async LLM call to backfill cache for next time
⑥ Return template+cached variants immediately → start retrieval
Total: 5–15ms P95. User never waits for LLM. Quality improves over time as cache fills.
08 / Query Transformation
Query Time

Advanced Retrieval Strategies

Sparse, dense, and hybrid retrieval each encode different failure modes; hybrid retrieval fuses signals.

Query Transform (rewrite/expand) → BM25 + Vector search → Fusion (RRF/weighted) → Reranker (cross-encoder) → Context Builder → LLM

Retrieval Strategy Patterns

Hybrid Search (Dense + Sparse)

Run BM25 and vector search in parallel; fuse results via Reciprocal Rank Fusion (RRF) or weighted sum.

# Weighted-sum fusion
score = (bm25_score * w1) + (vector_score * w2)

# Reciprocal Rank Fusion (RRF)
rank_bm25 = 1 / (k + position_bm25)
rank_vector = 1 / (k + position_vector)
final = rank_bm25 + rank_vector

Multi-Query Expansion

LLM generates 3–5 diverse rephrased queries targeting different aspects of the user's question. Retrieve for each, then merge and deduplicate. Detailed deep-dive below.

HyDE (Hypothetical Document Embeddings)

LLM generates hypothetical document for query; embed it; search nearest neighbors. Bridges intent-execution gap.

Query Routing

Classify query intent; route to specialized indices (e.g., FAQ vs. technical docs). Faster and more precise.

Parent Document Retrieval

Retrieve fine-grained child chunks; expand with parent (full section). Balance precision + context.

Step-Back Prompting

Ask "What high-level concept does this question ask?"; retrieve abstract info first; then detailed.

Metadata Filtering

Pre-filter chunks by date, source, or category before vector search. Reduce retrieval pool; improve relevance.

Contextual Compression

Retrieve top-K; use LLM to extract relevant sentences. Reduce context window; increase token efficiency.
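A minimal sketch of the extraction step, assuming an async llm.generate helper rather than a specific SDK; the prompt wording and the NONE sentinel are illustrative:

COMPRESS_PROMPT = """From the passage below, copy only the sentences that are
relevant to the question. If nothing is relevant, return NONE.

Question: {query}

Passage:
{passage}"""

async def compress_context(query: str, chunks: list, llm) -> list[str]:
    """Keep only query-relevant sentences from each retrieved chunk."""
    compressed = []
    for chunk in chunks:
        extract = await llm.generate(
            COMPRESS_PROMPT.format(query=query, passage=chunk.text),
            max_tokens=300,
        )
        if extract.strip() != "NONE":
            compressed.append(extract)
    return compressed   # fewer tokens forwarded to the generator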

Learned Sparse (SPLADE)

SPLADE-family models: learned sparse vectors; interpretable term weights; combines dense + sparse strengths.

HybridRetriever Example

class HybridRetriever:
    def retrieve(self, query: str, top_k=5):
        # Sparse: BM25
        bm25_hits = self.bm25.retrieve(query, top_k=top_k * 2)
        # Dense: Vector search
        query_emb = self.embed.encode(query)
        vector_hits = self.vector_db.search(query_emb, top_k=top_k * 2)
        # Reciprocal Rank Fusion (RRF)
        fused = self.reciprocal_rank_fusion(
            bm25_hits, vector_hits, weights=(0.4, 0.6)
        )
        return fused[:top_k]

Multi-Query Expansion — Deep Dive

A single user query often captures only one perspective of what they need. Multi-Query Expansion uses an LLM to generate diverse reformulations that target different angles, vocabulary, and levels of specificity — then retrieves for each and merges results. This typically improves Recall@K by 15–30%.

Multi-Query Expansion Flow
  • Original query: "How do I fix auth errors?"
  • LLM expander generates 4 diverse reformulations — Q0: "How do I fix auth errors?" · Q1: "authentication failure troubleshoot" · Q2: "401 403 unauthorized error resolve" · Q3: "OAuth token expired refresh flow" · Q4: "login session invalid API key fix"
  • Parallel retrieve: 5 queries × top-K each → merge + deduplicate → RRF rank fusion → merged results with higher recall and broader coverage
Each query variant captures different vocabulary (auth/401/OAuth/token/login), specificity levels, and error types. Documents matching ANY variant are surfaced — dramatically improving recall for ambiguous or technical queries.

Production MultiQueryExpander

class MultiQueryExpander:
    PROMPT = """Generate {n} diverse search queries for the user question below.
Each query should target a DIFFERENT aspect:
- One using technical terms / error codes
- One using simple plain language
- One asking the "why" behind the issue
- One focused on the solution / fix

User question: {query}

Return as JSON array of strings."""

    def __init__(self, llm, n_queries=4):
        self.llm = llm
        self.n = n_queries
        self.cache = QueryExpansionCache(ttl=3600)

    async def expand(self, query: str) -> list[str]:
        # Check cache first (same query = same expansions)
        cached = self.cache.get(query)
        if cached:
            return cached
        result = await self.llm.generate(
            self.PROMPT.format(n=self.n, query=query),
            model="claude-haiku-4-5-20251001",   # fast + cheap
            temperature=0.7,                     # some diversity
        )
        variants = json.loads(result)
        # Always include original query
        all_queries = [query] + variants[:self.n]
        self.cache.set(query, all_queries)
        return all_queries

Multi-Query Retriever with RRF Fusion

class MultiQueryRetriever:
    def __init__(self, expander, retriever, reranker):
        self.expander = expander
        self.retriever = retriever
        self.reranker = reranker

    async def search(self, query, top_k=5):
        # Step 1: Expand query
        queries = await self.expander.expand(query)
        # Step 2: Parallel retrieval for all variants
        all_results = await asyncio.gather(*[
            self.retriever.search(q, top_k=top_k * 2)
            for q in queries
        ])
        # Step 3: Reciprocal Rank Fusion
        fused = self.rrf_merge(all_results, k=60)
        # Step 4: Rerank against ORIGINAL query (not the variants!)
        reranked = self.reranker.rerank(
            query=query,                  # original intent
            candidates=fused[:top_k * 3],
            top_k=top_k,
        )
        return reranked

    def rrf_merge(self, result_lists, k=60):
        """Reciprocal Rank Fusion across all query variants."""
        scores = {}
        for results in result_lists:
            for rank, doc in enumerate(results):
                if doc.id not in scores:
                    scores[doc.id] = 0
                scores[doc.id] += 1.0 / (k + rank)
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Expansion Strategies

Synonym expansion: Replace key terms with alternatives ("auth" → "authentication", "login")
Specificity ladder: Abstract ("security issue") + specific ("OAuth 2.0 token expired 401")
Perspective shift: Problem ("auth fails") + solution ("fix authentication") + cause ("why does token expire")
Domain injection: Add domain context ("in Kubernetes" or "for REST API")

When NOT to Use

Exact-match queries: Order ID lookups, SKU searches, specific error codes — expansion adds noise
Low-latency paths: Adds ~200–400ms for LLM expansion. Use only when retrieval quality matters more than speed
Small corpus (<1K docs): Expansion just returns the same docs repeatedly. Not worth the cost

Production Optimizations

Cache expansions: Same query → same variants. 1-hour TTL covers repeated queries
Use cheapest LLM: Haiku/GPT-4o-mini for expansion (~$0.001 per query)
Parallel retrieval: Run all variant searches simultaneously with asyncio.gather
Rerank against original: Always rerank using the ORIGINAL query, not variants — variants help recall, reranking restores precision

Benchmark Impact: Multi-Query Expansion typically improves Recall@5 by 15–30% and Recall@20 by 10–20% compared to single-query retrieval. Combined with cross-encoder reranking, the full pipeline achieves ~94% Recall@5 vs ~62% for vector-search-only. The cost is ~200ms added latency (cacheable) and ~$0.001/query for the expansion LLM call.

Low-Latency Retrieval — Hitting <30ms P95

For real-time chat and voice agents, the entire retrieval pipeline (query → embed → search → filter → return chunks) must complete in under 30ms P95. Here's how production systems achieve this.

Retrieval Latency Budget — Target: <30ms P95
Operation | Time
Embed query | ~3ms
HNSW vector search (top-50) | ~5ms
BM25 sparse search (top-50) | runs in parallel with vector search
Metadata filter | ~2ms
RRF merge + dedup | ~1ms
FlashRank (top-20 → 10) | ~5ms
Return top-5 — Total: ~16ms P50 | ~28ms P95 | ~45ms P99 ✓ under 30ms

Embedding Latency — <5ms

The query embedding step is on the critical path. Every millisecond counts.

# Strategy: Pre-warm + GPU + small model
class FastEmbedder:
    def __init__(self):
        # Use small model: 384d, ~3ms on GPU
        self.model = SentenceTransformer(
            "all-MiniLM-L6-v2", device="cuda"
        )
        # Pre-warm: run dummy inference
        self.model.encode("warmup")
        # ONNX quantized for CPU-only deploys:
        #   self.model = ORTModel("model.onnx")
        #   → ~5ms on CPU vs ~15ms PyTorch

    def encode(self, text: str):
        return self.model.encode(
            text, normalize_embeddings=True
        )

Options: all-MiniLM-L6 (3ms GPU), ONNX quantized (5ms CPU), Matryoshka 256d (2ms, -1% quality), API (10-20ms + network).

Vector Search — <10ms at Scale

HNSW indexes deliver sub-10ms search even at 10M+ vectors. Key: tune ef_search, keep quantized index in RAM.

# Qdrant: tuned for low latency
collection_config = {
    "vectors": {
        "size": 384,               # small = faster
        "distance": "Cosine",
    },
    "hnsw_config": {
        "m": 16,                   # graph density
        "ef_construct": 200,       # build quality
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",        # 4x smaller
            "always_ram": True,    # no disk IO
        }
    },
    "on_disk_payload": True,       # metadata on disk
}

# Search params:
search_params = {
    "hnsw_ef": 64,    # lower = faster (vs 128)
    "exact": False,   # ANN, not brute force
}
# Result: ~5ms for 10M vectors, int8 quantized

Parallel Hybrid Search

Run BM25 and vector search simultaneously. Both return in ~5ms. RRF merge takes ~1ms. Total hybrid: ~6ms vs 10ms serial.

# Parallel hybrid: 6ms total
dense, sparse = await asyncio.gather(
    vector_db.search(q_emb, top_k=50),
    bm25_index.search(q_text, top_k=50),
)
fused = rrf_merge(dense, sparse)

# NOT: dense = await ...; sparse = await ...
# That's serial: 5+5 = 10ms ❌

Retrieval Result Cache

Cache the final retrieved chunks by normalized query hash. 30–50% hit rate for production systems. 0ms on hit.

# Redis retrieval cache
key = md5(normalize(query) + user_acl)
cached = redis.get(key)
if cached:
    return json.loads(cached)   # 0ms

# ACL in key prevents cross-user leakage
# TTL: 15min (balance freshness vs speed)

Connection Pooling

Cold connections to vector DB add 20–50ms. Pool connections and keep them warm. Use gRPC over HTTP for lower overhead.

# Qdrant gRPC connection pool
client = QdrantClient(
    url="qdrant:6334",
    prefer_grpc=True,   # not REST
    grpc_options={
        "grpc.keepalive_time_ms": 10000,
    },
)
# Pre-warm: send dummy search on startup
Low-Latency Retrieval Checklist:
Small embedding model (384d, GPU or ONNX quantized) — saves 10ms vs large model
Int8 quantized HNSW index, always in RAM — saves 5–20ms vs disk
Parallel BM25 + vector search with asyncio.gather — saves 5ms vs serial
gRPC connection pooling to vector DB — saves 20–50ms cold start
Retrieval result cache with ACL-aware keys (15min TTL) — 0ms on 30–50% of queries
FlashRank fast reranker instead of cross-encoder for first pass — 5ms vs 50ms
Metadata pre-filtering to reduce search pool before vector search
Lower ef_search (64 vs 128) for HNSW — ~2ms savings, <1% recall drop

HyDE (Hypothetical Document Embeddings) — Deep Dive

The core insight behind HyDE: user queries and documents live in different embedding spaces. A question like "fix auth errors" embeds very differently from a document paragraph that explains how to fix auth errors. HyDE bridges this gap by generating a hypothetical answer first, then using THAT as the search query — because a hypothetical answer embeds much closer to the real answer documents.

HyDE — Hypothetical Document Embeddings Flow
Problem — direct query embedding: query "fix auth errors" → embed(query) → vec_q → vector DB search with cosine(vec_q, doc_vecs) → ❌ low similarity. Query language ≠ document language: "fix auth errors" embeds far from "To resolve authentication failures, navigate to Settings > OAuth..."
Solution — HyDE (generate a hypothetical document, then search): user query "fix auth errors" → LLM generates a hypothetical answer ("To fix auth errors, check your OAuth token expiry. Use /auth/refresh...") → embed(hypo_doc) → vec_h → vector search with cosine(vec_h, doc_vecs) → ✓ high match (sim=0.92).
Why it works: the hypothetical answer uses the SAME language and structure as real docs, so it embeds in the same neighborhood. The LLM doesn't need to be correct — it just needs to sound like the documents in your corpus. Even a wrong answer finds the right docs.

Production HyDE Implementation

class HyDERetriever:
    """Hypothetical Document Embeddings.
    Generates a fake answer, embeds it,
    searches for real docs that match."""

    PROMPT = """Write a short paragraph that directly answers this question.
Write as if it's from a technical doc. Do NOT say "I don't know."

Question: {query}

Answer paragraph:"""

    def __init__(self, llm, embedder, vector_db):
        self.llm = llm            # cheap/fast model
        self.embedder = embedder
        self.db = vector_db
        self.cache = HyDECache(ttl=3600)

    async def search(self, query, top_k=10):
        # Check cache (same query = same hypo doc)
        cached = self.cache.get(query)
        if cached:
            hypo_emb = cached
        else:
            # Step 1: Generate hypothetical answer
            hypo_doc = await self.llm.generate(
                self.PROMPT.format(query=query),
                model="claude-haiku-4-5-20251001",
                max_tokens=150,      # short paragraph
                temperature=0.0,     # deterministic
            )
            # Step 2: Embed the hypothetical doc
            hypo_emb = self.embedder.encode(hypo_doc)
            self.cache.set(query, hypo_emb)
        # Step 3: Search using hypo embedding
        results = self.db.search(vector=hypo_emb, top_k=top_k)
        return results

When HyDE Helps vs Hurts

Scenario | HyDE Impact | Why
Technical jargon query | +15–25% recall | Query uses informal terms; docs use formal language. HyDE bridges the gap.
Short/vague query | +10–20% recall | "fix auth" → hypothetical doc expands to "authentication, OAuth, token, refresh"
Cross-lingual | +20–30% recall | Query in English, docs in mixed languages. HyDE generates in target language.
Simple factual query | ~0% change | "What's the return policy?" already matches doc language. No gap to bridge.
Exact-match lookup | −5–10% recall | Order IDs, error codes — HyDE adds noise. Skip it for lookups.
Multi-part query | Mixed | HyDE generates one doc; may miss second topic. Combine with decomposition.

HyDE + Multi-Query: Best of Both

In production, don't choose between HyDE and Multi-Query — combine them. Use the original query + 3 expansions + 1 HyDE embedding. Five search queries total, fused with RRF.

async def hybrid_retrieve(query, top_k=5): # Run ALL in parallel orig, expanded, hyde = await asyncio.gather( vector_search(embed(query), top_k=20), multi_query_search(query, n=3, top_k=20), hyde_search(query, top_k=20), ) # Fuse all results via RRF fused = rrf_merge([orig, *expanded, hyde]) # Rerank against ORIGINAL query return rerank(query, fused[:top_k*3])[:top_k]

LLM Choice for HyDE

Claude Haiku / GPT-4o-mini: Best cost/quality. ~$0.001/query. 100–200ms.
Llama 3.1 8B (local): Zero API cost. ~50ms on GPU. Slightly lower quality.
T5-small fine-tuned: ~10ms CPU. Train on (query → doc paragraph) pairs from your corpus. Best latency.

Prompt Design Matters

DO: "Write as if from a technical document." This makes the output style match your corpus.
DO: "Do NOT say I don't know." Force the LLM to generate content even if unsure.
DON'T: Ask for long answers. 1–2 paragraphs max. More text = more embedding noise.

Latency Optimization

Cache aggressively: Same query → same hypothetical doc. 1h TTL. 50–70% hit rate.
Async generation: Start HyDE in parallel with template-based retrieval. If HyDE finishes in time, merge results. If not, template results are fine alone.
Conditional: Only run HyDE for queries classified as "technical" or "ambiguous" (~20% of traffic). Skip for simple factual queries.
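A minimal sketch combining the async and conditional strategies above. classify_query, vector_search, hyde_search, and rrf_merge are assumed helpers not defined here, and the latency budget is illustrative.

import asyncio

async def retrieve_with_optional_hyde(query, top_k=10, hyde_budget_s=0.25):
    # Baseline retrieval always starts immediately
    baseline_task = asyncio.create_task(vector_search(query, top_k=top_k))

    # Only spend HyDE latency/cost on technical or ambiguous queries (~20% of traffic)
    hyde_task = None
    if classify_query(query) in ("technical", "ambiguous"):
        hyde_task = asyncio.create_task(hyde_search(query, top_k=top_k))

    baseline = await baseline_task
    if hyde_task is None:
        return baseline

    try:
        # If HyDE beats its budget, fuse the result sets; otherwise drop it
        hyde = await asyncio.wait_for(hyde_task, timeout=hyde_budget_s)
        return rrf_merge([baseline, hyde])[:top_k]
    except asyncio.TimeoutError:
        hyde_task.cancel()
        return baseline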

Key Insight: HyDE doesn't need to generate a correct answer — it needs to generate an answer that sounds like your documents. Even a factually wrong hypothetical will embed near the right topic cluster, because it uses the same vocabulary and sentence structure as real docs. This is why HyDE works even with small, cheap models.
09 / Retrieval
Precision

Reranking & Relevance Scoring

100 Candidates FlashRank ~5ms 30 Candidates Cross-Encoder ~50ms 10 Candidates MMR Diversity ~2ms 5 Final Results Ranked & Diverse Ready to generate Stage 1 Stage 2 Stage 3 Output

Quality Improvement Pipeline

Vector Only 62%
+Hybrid Search 74%
+Cross-Encoder Reranker 89%
+Multi-Query + Rerank 94%
Research Note (BEIR Benchmark): Cross-encoder reranking is powerful but expensive (tens to hundreds of milliseconds per query). Use it selectively when retrieval confidence is low, and avoid reranking thousands of results; pre-filter to the top 50.
Reranker | Type | Latency | Accuracy | Cost / Deployment
Cohere Rerank v3.5 | API cross-encoder | 50–150ms | SOTA | $0.001 / 1000 queries
Jina Reranker v2 | API | 100–200ms | Excellent | $0.0005 / query
cross-encoder/ms-marco | Open-source HF | 5–20ms (A100) | Good (BERT-base) | Free; self-hosted
BGE Reranker v2.5 | Open-source HF | 10–30ms (A100) | Very Good | Free; self-hosted
RankGPT (LLM-based) | LLM proxy | 200ms–1s | SOTA (model-dependent) | API cost; slow
FlashRank (tiny) | Open-source distilled | 2–5ms (CPU) | Acceptable (70–80%) | Free; ultra-fast

MultiStageReranker: Cascade Strategy

class MultiStageReranker: def __init__(self): self.fast = FlashRank() self.strong = CohereRerank() self.diversity = MMRDiversifier() def rerank(self, query, candidates, top_k=5): # Stage 1: Fast filter (FlashRank) stage1 = self.fast.rerank(query, candidates, top_k=20) # Stage 2: Cross-encoder (Cohere) stage2 = self.strong.rerank(query, stage1, top_k=10) # Stage 3: Diversity (MMR) final = self.diversity.diversify(stage2, top_k=top_k) return final
10 / Reranking
Confidence

Cross-Encoder Confidence Scoring

Cross-encoders don't just rerank — they produce calibrated relevance scores that serve as the foundation for retrieval confidence. These scores drive critical downstream decisions: should the LLM answer or refuse? Should we retrieve more chunks? Is the context sufficient?

CROSS-ENCODER CONFIDENCE SCORING — FROM RERANKING TO DECISION MAKING Query + Chunks 20 candidates from hybrid retrieval Cross-Encoder Model model([query, chunk]) → score Processes each (query, chunk) pair jointly Outputs calibrated relevance probability Relevance Scores chunk_1: 0.94 ✓ chunk_2: 0.87 ✓ chunk_3: 0.52 ~ chunk_4: 0.18 ✗ Confidence-Driven Decisions Top score >0.8 → Answer confidently Top score 0.5–0.8 → Answer with caveat Top score <0.5 → "I don't know" Score gap large → Trim low chunks All scores low → Re-retrieve/expand SCORE CALIBRATION — RAW LOGITS TO USABLE PROBABILITIES Raw Logits [-2.1, 1.8, 0.3, -1.5] Uncalibrated — not comparable Sigmoid / Softmax σ(logit) → [0, 1] probability Sigmoid for per-chunk, softmax for ranking Temperature Scaling σ(logit / T) with T=1.5 Calibrate on held-out eval set Without calibration, a score of 0.7 from Model A means something different than 0.7 from Model B. Temperature scaling on a held-out eval set makes scores comparable and threshold-friendly. Cross-encoder scores are the MOST reliable signal in the entire RAG pipeline — more reliable than embedding similarity or LLM self-assessment

CrossEncoderScorer — Production Implementation

from sentence_transformers import CrossEncoder import numpy as np class CrossEncoderScorer: """Calibrated cross-encoder confidence scorer. Extracts per-chunk relevance + aggregate retrieval confidence for downstream decisions.""" def __init__(self, model_name, temperature=1.5): self.model = CrossEncoder(model_name) self.T = temperature # calibrated on eval set def score_chunks(self, query: str, chunks: list) -> list: # Score each (query, chunk) pair pairs = [[query, c.text] for c in chunks] raw_logits = self.model.predict(pairs) # Calibrate: sigmoid with temperature scores = 1 / (1 + np.exp(-raw_logits / self.T)) # Attach scores to chunks for chunk, score in zip(chunks, scores): chunk.relevance_score = float(score) chunk.is_relevant = score > 0.5 # Sort by score descending return sorted(chunks, key=lambda c: c.relevance_score, reverse=True) def retrieval_confidence(self, scored_chunks: list) -> RetrievalConfidence: """Aggregate chunk scores into a single retrieval confidence signal.""" scores = [c.relevance_score for c in scored_chunks] return RetrievalConfidence( # Best chunk score — primary signal top_score=scores[0], # Mean of top-3 — stability signal top3_mean=np.mean(scores[:3]), # Score gap: top vs 4th — diversity signal score_gap=scores[0] - scores[3] if len(scores) > 3 else 0, # Count above threshold — coverage signal relevant_count=sum(1 for s in scores if s > 0.5), # Overall retrieval quality tier tier=self._classify_tier(scores), ) def _classify_tier(self, scores): top = scores[0] if top > 0.85 and sum(1 for s in scores if s > 0.7) >= 2: return "high" # confident answer elif top > 0.5: return "medium" # answer with caveat else: return "low" # refuse / re-retrieve

Confidence Signals Explained

Signal | What It Measures | How to Use
top_score | Best single chunk relevance | >0.85 = answer confidently. <0.5 = refuse.
top3_mean | Consistency of top results | If top1=0.9 but top3_mean=0.5 → only one good chunk. Context may be thin.
score_gap | Drop from best to 4th | Large gap (>0.3) = clear winner. Small gap = ambiguous topic, may need more context.
relevant_count | How many chunks are useful | 0 = can't answer. 1–2 = thin context. 3–5 = good coverage.
tier | Aggregate quality class | Drives LLM prompt strategy: high→concise, medium→cautious, low→refuse.

Cross-Encoder Models for Scoring

Model | Latency | Quality | Best For
cross-encoder/ms-marco-MiniLM-L-6 | ~8ms | Good | High-QPS, latency-critical
cross-encoder/ms-marco-MiniLM-L-12 | ~15ms | Better | Balanced speed/quality
BAAI/bge-reranker-v2-m3 | ~30ms | Very good | Multi-lingual
cross-encoder/nli-deberta-v3-large | ~50ms | Excellent | NLI + grounding check
Cohere Rerank v3.5 | ~60ms | Excellent | API-based, no GPU needed
Jina Reranker v2 | ~40ms | Very good | Long context support

Confidence-Driven RAG — Adapting Behavior by Score

The most powerful use of cross-encoder scores is dynamically adapting the RAG pipeline behavior based on retrieval confidence — not just ranking chunks.

ConfidenceDrivenRAG — Adaptive Pipeline

class ConfidenceDrivenRAG: """Adapts RAG behavior based on cross-encoder confidence. High confidence → fast answer. Low confidence → expand search or refuse.""" async def answer(self, query: str) -> Response: # Step 1: Retrieve + Rerank + Score chunks = await self.retriever.search(query, top_k=20) scored = self.cross_encoder.score_chunks(query, chunks) confidence = self.cross_encoder.retrieval_confidence(scored) # Step 2: Adapt strategy by confidence tier if confidence.tier == "high": # ✓ Strong context — answer directly context = scored[:3] # top 3 only (less noise) prompt = self.prompts.confident(query, context) return await self.llm.generate(prompt) elif confidence.tier == "medium": # ~ Partial context — try harder first # Strategy A: Expand retrieval expanded = await self.multi_query.expand_and_retrieve(query) re_scored = self.cross_encoder.score_chunks(query, expanded) new_conf = self.cross_encoder.retrieval_confidence(re_scored) if new_conf.tier == "high": # Expanded search worked context = re_scored[:5] prompt = self.prompts.confident(query, context) return await self.llm.generate(prompt) else: # Answer cautiously with hedge context = re_scored[:5] prompt = self.prompts.cautious(query, context) # "Based on available information..." return await self.llm.generate(prompt) else: # tier == "low" # ✗ No good context — refuse gracefully if confidence.top_score < 0.2: # Completely off-topic return Response( text="I don't have information on this topic.", confidence=confidence.top_score, action="refused" ) else: # Some relevance but not enough return Response( text="I found some related information but " "can't give a confident answer. " "Here's what I found: ...", confidence=confidence.top_score, action="hedged", sources=scored[:2] )

Dynamic Chunk Filtering

Instead of always sending top-5 chunks, use scores to decide how many. If top-3 are all >0.8 but chunks 4–5 are <0.3, drop them. Including low-relevance chunks actually hurts faithfulness.

# Adaptive chunk count relevant = [c for c in scored if c.relevance_score > 0.5] context = relevant[:5] # max 5, but only relevant # If 0 relevant → refuse/expand # If 1–2 → thin context warning # If 3–5 → good coverage

Prompt Strategy Switching

Use confidence tier to select different prompt templates. High confidence → concise, direct answer. Medium → "Based on available docs..." Low → "I don't have enough info to..."

PROMPTS = { "high": "Answer directly from context.", "medium": "Based on available info, " "answer carefully. Note gaps.", "low": "Context is limited. State " "what you found and what's missing.", }

Feedback to Retrieval

If cross-encoder scores are consistently low for a topic, it signals a gap in your knowledge base — not just a bad query. Log and alert on repeated low-confidence topics.

# Track low-confidence topics if confidence.tier == "low": self.topic_tracker.record( query=query, top_score=confidence.top_score ) # Weekly: report topics with >10 # low-confidence queries → content gap

Score Calibration — Making Thresholds Reliable

Raw cross-encoder logits are NOT probabilities. A score of 0.7 doesn't mean "70% chance this is relevant." You must calibrate scores so that your thresholds (0.5, 0.85) actually mean what you think they mean.

class ScoreCalibrator: """Learn temperature T on held-out eval set so that score=0.5 means 50% of chunks with that score are actually relevant.""" def calibrate(self, eval_set): # eval_set: [(query, chunk, is_relevant)] logits = [] labels = [] for q, c, rel in eval_set: logit = self.model.predict([(q, c)])[0] logits.append(logit) labels.append(rel) # Optimize temperature T from scipy.optimize import minimize_scalar def nll(T): probs = 1 / (1 + np.exp(-np.array(logits) / T)) return -np.mean( np.array(labels) * np.log(probs + 1e-8) + (1 - np.array(labels)) * np.log(1 - probs + 1e-8) ) result = minimize_scalar(nll, bounds=(0.1, 5.0)) self.T = result.x print(f"Calibrated T={self.T:.2f}") # Recalibrate monthly or when model changes

Why calibration matters:

Without calibration, the same threshold (0.5) behaves differently across models. MiniLM-L-6 might output 0.8 for a mediocre match, while DeBERTa-v3 outputs 0.6 for a great match. Temperature scaling normalizes this.

Without Calibration | With Calibration (T=1.5)
Score 0.7 = maybe relevant? | Score 0.7 = 70% are truly relevant
Threshold 0.5 = different per model | Threshold 0.5 = consistent meaning
Can't compare models fairly | Apples-to-apples comparison
Must tune per deployment | One threshold works across models

How often to recalibrate: Monthly, or whenever you change the cross-encoder model, update the embedding model, or significantly change the corpus. Use 500+ labeled (query, chunk, relevant?) pairs.

Key Insight: Cross-encoder confidence is the single most valuable signal in a RAG system. It answers the question every user implicitly asks: "Should I trust this answer?" A well-calibrated confidence score lets you build a system that says "I don't know" when appropriate — which is far more trustworthy than one that confidently hallucinates.
11 / Cross-Encoder Confidence
LLM Layer

Prompt Engineering & Generation

"Prompting is a contract between retrieval and generation" — context discipline, citations, and answer modes matter.

Production RAG Prompt Template

"""You are a helpful assistant. Answer ONLY from the provided context. Context: {context} Question: {question} Rules: 1. Base your answer ONLY on the provided context. 2. Cite sources using [Source N] for each fact. 3. If the context does not contain the answer, say: "I don't have information on this." 4. Do NOT guess, speculate, or add outside knowledge. 5. Be concise; use bullet points for clarity. Answer: """

RAGGenerator: Streaming, Fallback, Confidence Gating

class RAGGenerator: def __init__(self, primary_model, fallback_model): self.primary = primary_model # GPT-4 self.fallback = fallback_model # GPT-3.5 (cheaper) def generate(self, context, query, stream=True): retrieval_confidence = self.assess_confidence(context) if retrieval_confidence < 0.5: return "Insufficient context. Please refine your query." model = self.primary if retrieval_confidence > 0.8 else self.fallback for chunk in model.stream(prompt=self.prompt.format(context, query)): yield chunk def assess_confidence(self, context): # Score based on retrieval rank, citation density, recency return (0.7 * avg_rank) + (0.2 * citation_count) + (0.1 * recency)

Streaming (SSE/WebSocket)

Return tokens as they arrive, not end-to-end. Target <500ms TTFT (time-to-first-token). Improves perceived latency and UX.
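A minimal streaming sketch. It assumes llm.stream(prompt) is an async iterator of text chunks (provider interfaces differ) and frames each chunk as a Server-Sent Event while recording TTFT.

import time

async def stream_answer_sse(llm, prompt):
    start = time.perf_counter()
    ttft = None
    async for chunk in llm.stream(prompt):        # assumed async iterator of text
        if ttft is None:
            ttft = time.perf_counter() - start    # time-to-first-token, target <0.5s
        yield f"data: {chunk}\n\n"                # one SSE frame per chunk
    yield "data: [DONE]\n\n"

Wire a generator like this into your web framework's streaming response (for example, FastAPI's StreamingResponse with media_type="text/event-stream"), and log the measured TTFT so the <500ms target can be tracked per request.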

Citation Extraction

Parse [Source N] references; validate against retrieved chunks. Enable user verification; prevent hallucinated citations.
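A minimal sketch of this citation pass. The [Source N] pattern follows the prompt template above; the helper and field names are illustrative.

import re

def extract_citations(answer, num_sources):
    """Find [Source N] markers and flag any index that does not correspond
    to a retrieved chunk (a hallucinated citation)."""
    citations = []
    for match in re.finditer(r"\[Source\s+(\d+)\]", answer):
        n = int(match.group(1))
        citations.append({
            "marker": match.group(0),
            "index": n,
            "valid": 1 <= n <= num_sources,
        })
    return citations

# Example: with 3 retrieved chunks, "[Source 7]" is flagged invalid and can be
# stripped from the answer or used to trigger a regeneration.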

Fallback Strategy

Route to cheaper/faster model if retrieval confidence is low. Use strong model only when context is rich. Optimize cost/quality.

Research: Self-RAG
Self-RAG decides whether retrieval is needed per token; critiques its own outputs. Studies show 10–15% improvements in factuality by selective retrieval. Implement using token-level confidence scores from the LLM.

LLM Orchestration Policies

  • Model Routing: Classify query complexity; route simple queries to fast model (GPT-3.5), complex to strong model (GPT-4). Save 50%+ on inference cost. (See the routing sketch after this list.)
  • Caching Layers: Cache responses by normalized query + context hash. ACL-sensitive keys (per-user); 24-72h TTL. Reduces latency and cost for repeated queries.
  • Hallucination Mitigation Toolbox: Use retrieval-augmented verification (CTRL), confidence thresholds, structured output format (JSON schema), and post-generation fact-checking against context.
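A minimal sketch of the model-routing policy from the first bullet above. The complexity heuristic, thresholds, and model labels are assumptions; production routers often use a small trained classifier instead.

def route_model(query, retrieval_confidence):
    """Send cheap/simple traffic to the fast model and analytical or
    low-confidence traffic to the strong model."""
    complex_markers = ("compare", "why", "explain", "difference", "versus")
    is_complex = (
        len(query.split()) > 25
        or any(m in query.lower() for m in complex_markers)
        or retrieval_confidence < 0.6   # weak context needs the stronger model
    )
    return "strong-model" if is_complex else "fast-model"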
12 / Generation
Validators

Response Evaluation Layer

In production, the LLM alone is NOT trusted. Every response passes through a parallel evaluation layer — grounding verification, intent alignment, safety moderation, and confidence scoring — all within 50–200ms.

LLM Response Parallel Evaluation Layer (50–200ms) Grounding Check Answer supported by context? Embedding Similarity Cross-Encoder Verify LLM-Based Grounding Intent Check Response matches user intent? Intent Classification Embedding Similarity threshold > 0.8 Safety Check Content policy compliance? Moderation Models Rule Engine NeMo / Guardrails AI Confidence Score Weighted aggregate 0.4×ground + 0.3×retrieval + 0.2×intent + 0.1×safety threshold: 0.85 Decision Engine ✓ PASS ↻ RETRY ⚠ FALLBACK

1. Grounding Check — Deep Dive

The grounding check is the single most important validator in a production RAG system. It verifies that every claim in the LLM's response is actually supported by the retrieved context — catching hallucinations before they reach the user.

GROUNDING VERIFICATION PIPELINE — CASCADING TIERS TIER 1: Embedding ~5–15ms | All responses cosine_sim(answer, ctx) >0.75 → PASS (skip T2/T3) 0.5–0.75 → escalate to T2 TIER 2: Cross-Encoder ~30–80ms | Ambiguous only NLI(premise=ctx, hyp=claim) entailment → PASS neutral/contradiction → T3 TIER 3: LLM-as-Judge ~300–800ms | Disputed only claim_by_claim verify supported → PASS unsupported → REJECT Grounded → Deliver Hallucinated → Retry

Tier 1: Embedding Similarity

Fastest check (~5–15ms). Runs on every response. Converts answer + context chunks to embeddings, measures cosine similarity.

from sentence_transformers import SentenceTransformer import numpy as np class EmbeddingGrounder: def __init__(self): self.model = SentenceTransformer( "all-MiniLM-L6-v2" # 384d, fast ) def check(self, answer, chunks): a_emb = self.model.encode(answer) c_embs = self.model.encode( [c.text for c in chunks] ) # Max similarity across chunks scores = np.dot(c_embs, a_emb) / ( np.linalg.norm(c_embs, axis=1) * np.linalg.norm(a_emb) ) score = float(scores.max()) if score > 0.75: return Grounded(score) elif score > 0.5: return Ambiguous(score) # → T2 else: return Hallucinated(score)

Tools: FAISS, pgvector, Sentence Transformers, HuggingFace Embeddings, OpenAI text-embedding-3-small

Tier 2: Cross-Encoder / NLI

More accurate (~30–80ms). Only runs on ambiguous T1 results. Uses Natural Language Inference to classify each claim as entailed, neutral, or contradicted by context.

from transformers import pipeline class NLIGrounder: def __init__(self): self.nli = pipeline( "text-classification", model="cross-encoder/" "nli-deberta-v3-large" ) def check(self, answer, context): # Split answer into claims claims = self.extract_claims(answer) results = [] for claim in claims: pred = self.nli( f"{context} [SEP] {claim}" ) label = pred[0]["label"] # entailment/neutral/contradiction results.append((claim, label)) contradictions = [ c for c, l in results if l == "contradiction" ] return NLIResult( grounded=len(contradictions)==0, flagged_claims=contradictions )

Models: DeBERTa-v3-large-NLI, cross-encoder/nli-MiniLM, BART-large-MNLI, Cohere Rerank v3.5

Tier 3: LLM-as-Judge

Most flexible (~300–800ms). Only runs on disputed claims from T2. Performs claim-by-claim verification with explicit reasoning.

class LLMGrounder: PROMPT = """Verify each claim against the context. For each claim, respond: SUPPORTED / NOT SUPPORTED / PARTIAL Context: {context} Claims to verify: {claims} Respond as JSON: [{{"claim": "...", "verdict": "...", "evidence": "...", "confidence": 0.0}}] """ async def check(self, claims, ctx): result = await self.llm.generate( self.PROMPT.format( context=ctx, claims="\n".join(claims) ), model="claude-haiku-4-5-20251001", # Use cheap fast model temperature=0, ) verdicts = json.loads(result) unsupported = [ v for v in verdicts if v["verdict"] != "SUPPORTED" ] return LLMVerdict( grounded=len(unsupported)==0, unsupported_claims=unsupported )

Models: Claude Haiku (cheapest), GPT-4o-mini, Gemini Flash, Llama 3.1 8B (self-hosted)

Production Grounding Service (Cascading)

class ProductionGroundingService: """Cascading grounding: fast → accurate → LLM P95 latency: ~20ms (80% exit at T1)""" def __init__(self): self.t1 = EmbeddingGrounder() # ~10ms self.t2 = NLIGrounder() # ~50ms self.t3 = LLMGrounder() # ~500ms self.metrics = GroundingMetrics() async def verify(self, answer, chunks, query): # Tier 1: Embedding (always runs) t1 = self.t1.check(answer, chunks) self.metrics.record("t1", t1.score) if t1.score > 0.75: return GroundingResult( grounded=True, tier=1, score=t1.score ) if t1.score < 0.4: return GroundingResult( grounded=False, tier=1, score=t1.score, action="regenerate" ) # Tier 2: NLI (ambiguous zone 0.4–0.75) claims = self.extract_claims(answer) t2 = self.t2.check(answer, chunks) self.metrics.record("t2", t2) if t2.grounded: return GroundingResult( grounded=True, tier=2 ) if len(t2.flagged_claims) == 0: return GroundingResult( grounded=True, tier=2 ) # Tier 3: LLM judge (disputed claims only) t3 = await self.t3.check( t2.flagged_claims, "\n".join(c.text for c in chunks) ) self.metrics.record("t3", t3) return GroundingResult( grounded=t3.grounded, tier=3, unsupported=t3.unsupported_claims, action="regenerate" if not t3.grounded else None )

Tools & Libraries Comparison

Tool | Type | Latency | Best For
Sentence Transformers | Embedding | ~5ms | T1 — fast similarity
FAISS | Vector index | ~1ms | Batch embedding lookup
pgvector | Postgres ext | ~5ms | SQL-native similarity
DeBERTa-v3 NLI | Cross-encoder | ~50ms | T2 — NLI classification
BART-large-MNLI | NLI model | ~40ms | T2 — zero-shot NLI
Cohere Rerank | API reranker | ~60ms | T2 — relevance scoring
Claude Haiku | LLM API | ~400ms | T3 — claim verification
GPT-4o-mini | LLM API | ~500ms | T3 — claim verification
Guardrails AI | Framework | varies | Orchestrate all tiers
RAGAS | Eval framework | offline | Measure faithfulness
TruLens | Eval+trace | offline | Groundedness monitoring
DeepEval | CI eval | offline | Hallucination CI gate

How the 80 / 15 / 5 Cascading Exit Works

In production, you do NOT run all three tiers on every response. Instead, you cascade: the fast cheap check runs first, and only ambiguous results escalate to the next tier. This is why 80% of requests cost ~10ms and only 5% ever hit the expensive LLM judge.

CASCADING EXIT — DECISION FLOW WITH TRAFFIC DISTRIBUTION 100% of LLM Responses TIER 1: Embedding Similarity (~10ms) cosine_sim(answer_emb, context_emb) score > 0.75 80% EXIT ✓ score < 0.4 ~5% REJECT ✗ ~15% ambiguous (0.4–0.75) TIER 2: Cross-Encoder NLI (~50ms) NLI(premise=context, hypothesis=claim) entailment ~10% EXIT ✓ ~5% disputed claims TIER 3: LLM-as-Judge (~500ms) claim-by-claim verdict: SUPPORTED / NOT ~3% EXIT ✓ ~2% REJECT → Retry Weighted avg latency: (0.80 × 10ms) + (0.15 × 50ms) + (0.05 × 500ms) = 40.5ms P50 — vs 500ms if every response hit LLM judge

Tier 1 Exit (80%) — Clear Match

Most RAG answers closely paraphrase the retrieved context. Embedding similarity catches these trivially.

# Example: clear grounding Context: "Returns accepted within 30 days of purchase with original receipt." Answer: "You can return items within 30 days if you have the original receipt." cosine_similarity = 0.91 # > 0.75 # → PASS at Tier 1. No further checks. # Latency: ~10ms. Cost: $0.00.

This covers: direct paraphrasing, factual restatement, simple summarization, exact quotes, and minor rewording. The embedding model captures semantic equivalence without needing deeper reasoning.

Tier 2 Escalation (15%) — Ambiguous Zone

When the answer uses different vocabulary or adds inference, embeddings give a middling score. NLI resolves the ambiguity.

# Example: inference from context Context: "Premium members get free shipping on orders over $50." Answer: "As a premium member, your $75 order qualifies for free shipping." cosine_similarity = 0.62 # ambiguous zone # → Escalate to Tier 2 NLI("Premium members get free shipping on orders over $50", "$75 order qualifies for free shipping") # → entailment (0.94 confidence) # → PASS at Tier 2. Latency: ~60ms.

This covers: logical inference, numerical reasoning ("$75 > $50"), conditional application, combining info from multiple chunks, and contextual deduction.

Tier 3 Escalation (5%) — Disputed Claims

When NLI returns "neutral" (neither entailed nor contradicted) or there are mixed verdicts across claims, the LLM judge arbitrates.

# Example: mixed/complex claim Context: "The product is available in blue and red. Ships within 3-5 days." Answer: "The product comes in blue, red, and green. Usually arrives in a week." T1 cosine_similarity = 0.58 # ambiguous T2 NLI: "blue and red" → entailment ✓ "green"neutral ⚠️ # not in ctx "arrives in a week"neutral ⚠️ # → Escalate disputed claims to Tier 3 LLM Judge: "green": NOT SUPPORTED # hallucination! "week": PARTIAL # 3-5 days ≈ week # → REJECT "green", accept "week" # → Strip hallucinated claim, regenerate

Why This Works — The Math

The cascade works because most RAG answers are well-grounded (the retrieval pipeline already found relevant context). Only edge cases need expensive verification.

Metric | All T3 | Cascade | Savings
Avg latency | 500ms | 40ms | 12.5x faster
P50 latency | 500ms | 10ms | 50x faster
P95 latency | 800ms | 60ms | 13x faster
Cost / 1K queries | $0.50 | $0.03 | 16x cheaper
Hallucination catch | ~98% | ~96% | -2% (acceptable)

Key insight: You trade ~2% hallucination detection rate for a 12x latency reduction and 16x cost reduction. For the remaining 2%, user feedback loops and offline evaluation catch regressions.

Tuning the Thresholds — Production Guidance

T1 Pass Threshold (default: 0.75)

Raise to 0.80–0.85 for high-stakes domains (medical, legal, financial). Lower to 0.65–0.70 for casual Q&A where speed matters more. Tune by measuring T2/T3 escalation rate — if <5% escalate, threshold is too low.

T1 Reject Threshold (default: 0.4)

Below this, the answer is clearly unrelated to context — skip T2/T3 and regenerate immediately. Raise to 0.5 for stricter domains. Monitor false-rejection rate via user feedback.

T2→T3 Escalation (default: any contradiction)

Only escalate if T2 finds "contradiction" (not just "neutral"). Neutral means the context doesn't address the claim — which might be acceptable for partial answers. Tune per use case.

# Threshold config per use case GROUNDING_CONFIG = { "default": { "t1_pass": 0.75, "t1_reject": 0.40, "t2_escalate_on": ["contradiction"], }, "medical": { "t1_pass": 0.85, "t1_reject": 0.50, # stricter "t2_escalate_on": ["contradiction", "neutral"], # always verify }, "casual_qa": { "t1_pass": 0.65, "t1_reject": 0.35, # faster "t2_escalate_on": ["contradiction"], # only clear issues }, }
What to Monitor: Track the tier exit distribution over time. If T2 escalation rises above 25%, your embedding model may be drifting (retrain or upgrade). If T3 rejects rise above 5%, your retrieval pipeline quality may be degrading. Dashboard these in Grafana/Datadog alongside your standard RAG metrics.

2. Intent Check — Response Matches User Intent

Verifies the response actually addresses what the user asked. Catches drift where the model answers a different question entirely.

Example Problem:
User: "Track my order" → Answer: "Here are some shoes you may like" — intent mismatch!
# Intent alignment pipeline class IntentAlignmentChecker: def check(self, query, response): # Classify both through intent model query_intent = self.intent_model.predict(query) response_intent = self.intent_model.predict(response) # Or use embedding similarity q_emb = self.encoder.encode(query) r_emb = self.encoder.encode(response) similarity = cosine_similarity(q_emb, r_emb) if similarity < 0.8: return IntentResult( aligned=False, query_intent=query_intent, response_intent=response_intent )

Common intent models: Rasa, SetFit, fine-tuned classifiers. For production voice agents, embedding-based intent similarity with threshold >0.8 is fastest.

3. Safety Check — Content Moderation

Prevents unsafe or policy-violating responses: illegal instructions, abusive content, financial advice risks, policy violations.

A. Moderation Models

# Dedicated safety classifiers result = moderation_api.classify(response) # Output: {"violence": false, "hate": false, "self_harm": false} if any(result.values()): return block_response()

B. Rule Engine

Hard rules for regulated domains: refund policies, medical/financial advice, guaranteed outcomes. Example: if answer contains "guaranteed profit" → reject.
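A minimal rule-engine sketch for the example above. The patterns and actions are illustrative placeholders, not a complete policy set for any regulated domain.

import re

POLICY_RULES = [
    {"name": "guaranteed_returns", "pattern": r"guaranteed (profit|returns?)", "action": "reject"},
    {"name": "refund_promise",     "pattern": r"full refund guaranteed",       "action": "reject"},
    {"name": "dosage_advice",      "pattern": r"\b\d+\s?mg\b",                 "action": "flag"},
]

def apply_policy_rules(response):
    # First matching rule wins; no match means the response passes this check
    for rule in POLICY_RULES:
        if re.search(rule["pattern"], response, flags=re.IGNORECASE):
            return {"verdict": rule["action"], "rule": rule["name"]}
    return {"verdict": "pass", "rule": None}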

C. Guardrail Frameworks

Production libraries: Guardrails AI, NeMo Guardrails. Enforce content policies, structured outputs, and safe responses declaratively.

4. Confidence Score — Final Decision Engine

Aggregates all evaluator scores into a weighted confidence signal. Detailed deep-dive below.

Confidence Score — How It's Calculated

The confidence engine is the final gate before a response reaches the user. It takes raw scores from every evaluator, normalizes them, applies domain-specific weights, and produces a single decision: pass, retry, or fallback.

CONFIDENCE SCORE CALCULATION — WEIGHTED AGGREGATION + VETO LOGIC RAW SIGNALS grounding_score: 0.91 retrieval_score: 0.85 intent_score: 0.95 safety_score: 1.00 citation_score: 0.88 freshness_score: 1.00 WEIGHTED AGGREGATION 0.30 × grounding = 0.273 0.25 × retrieval = 0.213 0.15 × intent = 0.143 0.10 × safety = 0.100 0.10 × citation = 0.088 0.10 × freshness = 0.100 weighted_sum = 0.917 + VETO CHECK (any hard fail?) VETO OVERRIDES (any one triggers immediate reject) × safety_score < 0.5 × grounding_score < 0.3 × pii_detected == True × citation_fraud == True × blocked_content == True Veto = instant REJECT regardless > 0.85 PASS ✓ 0.60 – 0.85 RETRY ↻ < 0.60 or VETO FALLBACK ⚠ Weights configurable per domain / use case

Production ConfidenceEngine Implementation

class ConfidenceEngine: def __init__(self, config: DomainConfig): self.weights = config.weights self.thresholds = config.thresholds self.veto_rules = config.veto_rules def calculate(self, scores: EvalScores) -> Decision: # Step 1: Check hard veto rules first for rule in self.veto_rules: if rule.triggered(scores): return Decision( action="REJECT", reason=rule.name, confidence=0.0, vetoed=True ) # Step 2: Weighted aggregation raw_score = sum( self.weights[k] * getattr(scores, k) for k in self.weights ) # Step 3: Apply penalty for low-scoring # individual signals (even if weighted # average is high) penalty = 0.0 for k, threshold in self.thresholds.min_per_signal.items(): val = getattr(scores, k) if val < threshold: gap = threshold - val penalty += gap * 0.5 # 50% of gap final = max(0.0, raw_score - penalty) # Step 4: Map to decision if final >= self.thresholds.pass_threshold: return Decision("PASS", final) elif final >= self.thresholds.retry_threshold: return Decision("RETRY", final) else: return Decision("FALLBACK", final)

Why These Weights?

Signal | Weight | Rationale
Grounding | 0.30 | Highest — a hallucinated answer is the #1 failure mode. If grounding fails, nothing else matters.
Retrieval | 0.25 | If retrieval quality is low, the LLM is working with bad context. Garbage in → garbage out.
Intent | 0.15 | Answering the wrong question is bad but less dangerous than hallucinating facts.
Safety | 0.10 | Low weight in formula BUT has a hard veto — any safety flag = instant reject regardless of score.
Citation | 0.10 | Verifies source attribution. Important for trust but not critical for correctness.
Freshness | 0.10 | Only matters for temporal queries. Many questions are time-independent.

Veto Rules — Hard Overrides

Certain conditions bypass the weighted score entirely and force an immediate reject. No amount of high scores elsewhere can compensate.

VETO_RULES = [ VetoRule("unsafe_content", lambda s: s.safety < 0.5), VetoRule("severe_hallucination", lambda s: s.grounding < 0.3), VetoRule("pii_leakage", lambda s: s.pii_detected), VetoRule("citation_fraud", lambda s: s.citation_valid_pct < 0.5), VetoRule("blocked_topic", lambda s: s.blocked_content), ] # If ANY veto fires → instant REJECT # regardless of weighted score

Worked Examples — Three Scenarios

Scenario A — PASS

"What's your return policy?"

Grounding: 0.91 × 0.30 = 0.273
Retrieval: 0.88 × 0.25 = 0.220
Intent: 0.95 × 0.15 = 0.143
Safety: 1.00 × 0.10 = 0.100
Citation: 0.90 × 0.10 = 0.090
Fresh: 1.00 × 0.10 = 0.100
─────────────────
Total: 0.926 → Penalty: 0
Decision: PASS ✓

Scenario B — RETRY

"Compare Plan A vs Plan B pricing"

Grounding: 0.62 × 0.30 = 0.186
Retrieval: 0.70 × 0.25 = 0.175
Intent: 0.90 × 0.15 = 0.135
Safety: 1.00 × 0.10 = 0.100
Citation: 0.40 × 0.10 = 0.040
Fresh: 1.00 × 0.10 = 0.100
─────────────────
Raw: 0.736 | Penalty: -0.05
Final: 0.686 → RETRY ↻
Retry with more context chunks

Scenario C — VETO REJECT

"Show me other users' orders"

Grounding: 0.85 × 0.30 = 0.255
Retrieval: 0.80 × 0.25 = 0.200
Intent: 0.92 × 0.15 = 0.138
Safety: 0.20 × 0.10 = 0.020
─── VETO TRIGGERED ───
safety < 0.5 → unsafe_content
Decision: REJECT ⚠
Even if the weighted sum were high enough to pass,
the veto overrides. Response blocked.

Domain-Specific Weight Profiles

Different use cases need different weight distributions. A medical chatbot prioritizes grounding above all else; a casual Q&A bot prioritizes speed and intent alignment.

Domain | Grounding | Retrieval | Intent | Safety | Citation | Fresh | Pass | Retry
General Q&A | 0.30 | 0.25 | 0.15 | 0.10 | 0.10 | 0.10 | >0.85 | >0.60
Medical / Legal | 0.40 | 0.20 | 0.10 | 0.15 | 0.10 | 0.05 | >0.90 | >0.70
E-commerce | 0.25 | 0.20 | 0.20 | 0.10 | 0.10 | 0.15 | >0.82 | >0.55
Voice Agent | 0.30 | 0.25 | 0.20 | 0.10 | 0.05 | 0.10 | >0.80 | >0.55
Internal Docs | 0.25 | 0.30 | 0.15 | 0.05 | 0.15 | 0.10 | >0.80 | >0.55
Financial | 0.35 | 0.20 | 0.10 | 0.15 | 0.10 | 0.10 | >0.92 | >0.75
# Config per domain DOMAIN_CONFIGS = { "medical": DomainConfig( weights={"grounding": 0.40, "retrieval": 0.20, "intent": 0.10, "safety": 0.15, "citation": 0.10, "freshness": 0.05}, thresholds=Thresholds(pass_threshold=0.90, retry_threshold=0.70), min_per_signal={"grounding": 0.7, "safety": 0.8}, # strict mins veto_rules=VETO_RULES + [ VetoRule("medical_disclaimer_missing", lambda s: s.has_medical_claim and not s.has_disclaimer), ] ), "ecommerce": DomainConfig( weights={"grounding": 0.25, "retrieval": 0.20, "intent": 0.20, "safety": 0.10, "citation": 0.10, "freshness": 0.15}, thresholds=Thresholds(pass_threshold=0.82, retry_threshold=0.55), min_per_signal={"grounding": 0.5}, veto_rules=VETO_RULES # standard vetos ), }
Calibration Tip: Don't guess at weights — measure them. Run your eval dataset through each evaluator independently, then use logistic regression on user feedback (thumbs up/down) to learn optimal weights for your domain. Recalibrate quarterly as your corpus and model evolve.
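A minimal sketch of the weight-learning step described in the tip. It assumes you have logged per-response evaluator scores alongside binary user feedback; scikit-learn's LogisticRegression does the fitting, and veto rules still apply regardless of the learned weights.

import numpy as np
from sklearn.linear_model import LogisticRegression

SIGNALS = ["grounding", "retrieval", "intent", "safety", "citation", "freshness"]

def learn_weights(rows):
    """rows: list of (scores_dict, thumbs_up_bool) collected from production logs."""
    X = np.array([[r[0][s] for s in SIGNALS] for r in rows])
    y = np.array([1 if r[1] else 0 for r in rows])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    coefs = np.clip(clf.coef_[0], 0, None)       # keep only positive influence
    weights = coefs / coefs.sum()                # normalize so weights sum to 1
    return dict(zip(SIGNALS, np.round(weights, 3)))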

Production Microservice Architecture

Many companies deploy the evaluation layer as separate microservices for scalability and independent deployment.

class ResponseEvaluationService: """Runs all checks in parallel. Target: 50-200ms.""" async def evaluate(self, query, response, context): # Run all checks in parallel grounding, intent, safety = await asyncio.gather( self.grounding_svc.check(response, context), self.intent_svc.check(query, response), self.safety_svc.check(response), ) # Compute weighted confidence confidence = self.confidence_engine.score( grounding=grounding.score, retrieval=context.retrieval_score, intent=intent.score, safety=safety.score, ) # Decision if confidence.decision == Decision.PASS: return EvalResult(approved=True, response=response) elif confidence.decision == Decision.RETRY: return await self.regenerate(query, context) else: return EvalResult( approved=True, response="I'm not completely sure. " "Let me check that for you." )

Latency Optimization

Voice systems and real-time apps run all checks in parallel to keep total evaluation under 200ms.

Check | Method | Latency | Accuracy
Grounding | Embedding similarity | ~10ms | Good
Grounding | Cross-encoder | ~50ms | Better
Grounding | LLM-as-judge | ~500ms | Best
Intent | Embedding similarity | ~10ms | Good
Intent | Classifier model | ~20ms | Better
Safety | Moderation API | ~50ms | Good
Safety | Rule engine | ~1ms | Exact
Confidence | Score aggregation | ~1ms | —
Key Insight: Use the fastest tier (embedding similarity) as default. Escalate to cross-encoder or LLM-judge only when the fast check is ambiguous (score 0.5–0.75). This cascading approach keeps P95 under 100ms while catching edge cases.
Tools: FAISS, pgvector, Sentence Transformers, BERT cross-encoders, Rasa, SetFit, Guardrails AI, NeMo Guardrails, OpenAI Moderation, LangChain validators.

Additional Production Checks (Often Missed)

The four core checks (grounding, intent, safety, confidence) cover ~80% of failure modes. These additional checks close the remaining gaps that surface at scale.

5. Citation Verification

Validates that [Source N] references in the response actually match the claims they support. Catches "citation hallucination" where the model invents or misattributes sources.

class CitationVerifier: def verify(self, response, sources): citations = self.extract_citations(response) for cite in citations: # Does [Source N] exist? if cite.index >= len(sources): cite.valid = False continue # Does the claim match the source? sim = cosine_sim( cite.claim, sources[cite.index] ) cite.valid = sim > 0.6 return citations

Tools: Regex extraction + embedding verification. Run in parallel with grounding check (~5ms overhead).

6. Completeness Check

Did the answer address ALL parts of a multi-part question? Users often ask compound questions and the LLM may only answer part of it.

# Example problem: Query: "What's the return policy AND do you offer exchanges?" Answer: "Returns within 30 days." # Missing: exchange info! class CompletenessChecker: def check(self, query, answer): # Decompose query into sub-questions sub_qs = self.decomposer.split(query) addressed = [] for sq in sub_qs: sim = cosine_sim(sq, answer) addressed.append(sim > 0.5) return CompletenessResult( complete=all(addressed), missing=[sq for sq, a in zip(sub_qs, addressed) if not a] )

Tools: LLM query decomposer or spaCy clause splitting + embedding comparison.

7. PII Leakage Detection

The retrieved context may contain sensitive data (emails, SSNs, account numbers) that the LLM inadvertently surfaces in its response. Scan output before delivery.

from presidio_analyzer import AnalyzerEngine from presidio_anonymizer import AnonymizerEngine class PIIGuard: def __init__(self): self.analyzer = AnalyzerEngine() self.anonymizer = AnonymizerEngine() def scan(self, response): results = self.analyzer.analyze( text=response, entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD", "US_SSN"], language="en" ) if results: return self.anonymizer.anonymize( text=response, analyzer_results=results ) return response # clean

Tools: Microsoft Presidio (open-source), AWS Comprehend PII, Google DLP API. Run on every response (~10ms).

8. Freshness / Staleness Check

Verify that the retrieved context is still current. An answer about "current pricing" from a 6-month-old document could be dangerously wrong.

class FreshnessChecker: def check(self, chunks, query): # Does query need fresh data? needs_fresh = self.classify_temporal(query) # "current", "latest", "now", "today" if not needs_fresh: return Fresh() # skip check for chunk in chunks: age = now() - chunk.indexed_at if age > timedelta(days=30): return Stale( chunk=chunk, age_days=age.days, action="warn_user" )

Store indexed_at and source_updated_at in chunk metadata. Define TTL per document type (pricing: 7d, policy: 30d, FAQ: 90d).

9. Retry & Regeneration Strategy

When evaluation fails, how do you regenerate differently? Simply re-running the same prompt gets the same bad answer. Production systems modify the generation strategy on retry.

class RetryStrategy: def regenerate(self, fail_reason, attempt): if fail_reason == "hallucination": # Add explicit "ONLY use context" return self.stricter_prompt( temp=0.0 # zero creativity ) elif fail_reason == "incomplete": # Retrieve MORE chunks return self.expand_context( top_k=10 # was 5 ) elif fail_reason == "stale": # Force fresh retrieval return self.re_retrieve( freshness="7d" ) elif attempt >= 2: return self.fallback_response()

Key: Max 2 retries. Each retry changes strategy (stricter prompt, more context, different model). After 2 fails → graceful fallback.

10. Human Feedback Loop

User feedback (thumbs up/down, corrections, follow-up queries) is the ultimate ground truth. Feed it back into evaluation thresholds and training data.

class FeedbackCollector: def record(self, query_id, signal): # Signals: # thumbs_up, thumbs_down, # correction(text), follow_up, # escalate_to_human self.store.save(query_id, signal) # If thumbs_down → add to eval set if signal == "thumbs_down": self.eval_builder.add_negative( query_id ) # Weekly: retune thresholds from # feedback distribution # Monthly: retrain intent/NLI models # Quarterly: full eval set refresh

Track feedback rate (aim for >5% of responses). Negative feedback → auto-add to adversarial eval set. Positive feedback → confidence calibration.

11. Evaluation Layer Monitoring — What to Dashboard

The evaluation layer itself needs monitoring. If your grounding check drifts, it will silently let hallucinations through.

Tier Exit Distribution

T1: 80% / T2: 15% / T3: 5% is baseline. Alert if T2 rises above 25% (embedding model drift) or T3 above 8% (retrieval quality degradation).

False Positive / Negative Rate

Sample 100 responses/week. Human-label as grounded or not. Compare against evaluator verdicts. Target: <3% false-positive (passes hallucination) and <8% false-negative (rejects good answer).

Retry & Fallback Rate

If retry rate exceeds 10% or fallback exceeds 3%, something upstream is broken — likely retrieval quality, prompt template, or LLM model regression. Investigate immediately.

Evaluator Latency P95

Track per-tier latency. If T1 P95 exceeds 30ms, the embedding model may need optimization or the batch size is too large. T2 P95 above 150ms → model serving issue.

PII Detection Rate

Track how often PII is found in responses. If rate spikes, investigate the retrieval pipeline — it may be pulling in documents with unredacted personal data.

User Feedback Correlation

Correlate confidence scores with user feedback. If high-confidence responses get thumbs-down, your evaluator is miscalibrated. Retune weights quarterly.

Complete Response Evaluation Checklist (Production):
  • Grounding Check (cascading T1/T2/T3)
  • Intent Alignment
  • Safety / Content Moderation
  • Confidence Score Engine
  • Citation Verification
  • Completeness Check
  • PII Leakage Detection
  • Freshness / Staleness Check
  • Retry / Regeneration Strategy
  • Human Feedback Loop
  • Evaluator Monitoring & Alerting
13 / Response Evaluation Layer
Adaptive

Self-Correction & Reflection Loops

Modern advanced RAG adds self-checking loops that detect and recover when retrieval quality is poor—rather than blindly stuffing top-k passages into prompts

Query Retrieval Needed? Yes Retrieve Quality Good? Yes Generate No: Correct & Retry No: Direct Generate Self-Check Output

Core Self-Correction Techniques

Self-RAG

Model decides whether retrieval is needed per token. Generates reflection tokens (IsRel, IsSup, IsUse) to critique its own outputs. Targets factuality and citation accuracy—10–15% improvement in studies.

CRAG

Corrective RAG evaluates retrieved documents. Triggers corrective actions (alternative retrieval, filtering, web search fallback) when retrieval quality is poor.
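A minimal CRAG-style sketch under stated assumptions: evaluate_retrieval, rewrite_query, and web_search are placeholder helpers, and the score bands are illustrative.

async def corrective_retrieve(retriever, query, top_k=5):
    docs = await retriever.search(query, top_k=top_k)
    quality = evaluate_retrieval(query, docs)       # e.g. cross-encoder top score

    if quality > 0.7:
        return docs                                 # correct: use as-is
    if quality > 0.4:
        # Ambiguous: rewrite the query and retry internal retrieval
        return await retriever.search(rewrite_query(query), top_k=top_k)
    # Incorrect: fall back to web search (if policy allows it for this tenant)
    return await web_search(query, top_k=top_k)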

Adaptive Retrieval

Retrieve fewer documents when confidence is high; more when needed. Avoids indiscriminate retrieval via confidence-gated document selection.

Adaptive Retrieval with Confidence Gating

class AdaptiveRetriever: def retrieve(self, query, min_confidence=0.7): # Query embedding + confidence score embedding, confidence = self.embed_and_score(query) # Adaptive k: higher confidence → fewer docs k = 3 if confidence > 0.8 else (5 if confidence > 0.6 else 10) docs = self.index.search(embedding, top_k=k) # Evaluate doc relevance and content quality scored_docs = [ {'doc': d, 'relevance': self.relevance_score(d, query)} for d in docs ] # Filter by min_confidence threshold filtered = [d for d in scored_docs if d['relevance'] > min_confidence] return filtered if filtered else docs # Fallback to top docs if none pass
Key Insight: Self-correction loops transform RAG from a one-shot pipeline into a closed-loop system. The model detects when retrieval is noisy or insufficient and can trigger alternative strategies (re-retrieve, summarize context differently, web search) without external intervention.
14 / Self-Correction & Reflection Loops
Quality

RAG Evaluation Framework

Three metric categories measure retrieval effectiveness, generation quality, and system performance

Answer Relevance Context Relevance Groundedness/ Faithfulness
Category | Metrics | Measurement Method | Tools
Retrieval Metrics | Precision, Recall, NDCG, MRR, MAP | Rank quality against gold standard passages | BEIR, trec_eval
Generation Metrics | BLEU, ROUGE, METEOR, BERTScore, Context Relevance | Automated scoring, LLM judges, embedding similarity | TruLens, RAGAS
System Metrics | Latency, throughput, cost per query, user satisfaction | Production logs, user feedback, A/B tests | OpenTelemetry, Datadog, custom instrumentation

Building Your Eval Dataset

Synthetic

LLM-generated QA from corpus. Fast, cheap. Risk of false positives. (See the synthetic-QA sketch after these four options.)

Human-Curated

Gold standard. Expensive, slow. High quality baseline.

Production Logs

Real queries & answers. Most realistic. Requires filtering.

Adversarial

Edge cases, tricky queries. Surfaced via user feedback.
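The synthetic-QA sketch referenced above: llm.generate is an assumed completion call, and generated pairs should be spot-checked by hand because synthetic questions can be trivially answerable or subtly wrong.

import json
import random

QA_PROMPT = """Read the passage and write one question a user might ask that
this passage answers, plus the answer. Respond as JSON:
{{"question": "...", "answer": "..."}}

Passage:
{chunk}"""

def build_synthetic_eval_set(llm, chunks, n=200):
    dataset = []
    for chunk in random.sample(chunks, min(n, len(chunks))):
        raw = llm.generate(QA_PROMPT.format(chunk=chunk), temperature=0.3)
        try:
            qa = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue                          # skip malformed generations
        dataset.append({"question": qa["question"],
                        "answer": qa["answer"],
                        "context": chunk})
    return dataset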

RAGAS: Automated RAG Evaluation

from ragas import evaluate from ragas.metrics import ( context_relevance, answer_relevance, faithfulness, answer_correctness ) # Eval dataset: [{"question": q, "answer": a, "context": c}] result = evaluate( eval_dataset, metrics=[ context_relevance, # Is context relevant to Q? answer_relevance, # Does answer address Q? faithfulness, # Is answer grounded in context? ] ) # Aggregate scores print(f"Context Relevance: {result['context_relevance'].mean():.2f}") print(f"Faithfulness: {result['faithfulness'].mean():.2f}")
BEIR Research Insight: BM25 remains a strong baseline for retrieval. Reranking can outperform but is costly. On evaluation: 30% of eval set should be hard edge cases to catch real-world failure modes. Synthetic eval is 3–5× cheaper than human labeling but systematically biased.
15 / RAG Evaluation Framework
Benchmarking

RAG Benchmarking & Performance Testing

Systematic benchmarking with shared metrics ensures consistent quality and enables confident deployment decisions

Benchmarking Framework Flow

BENCHMARKING FRAMEWORK — END-TO-END FLOW Data Sources Eval Datasets Golden QA Pairs Prod Query Logs Adversarial Set Hard Edge Cases min 500 QA pairs Benchmark Suite Retrieval Accuracy Generation Quality Latency / Throughput Context Relevance Faithfulness Hallucination Rate Citation Accuracy Cost per Query IR Metrics Recall@K NDCG@K MRR, Hit Rate BEIR Zero-Shot Latency P95 Generation Metrics Faithfulness Answer Relevance Hallucination Rate Citation Accuracy Quality Gates Score Dashboard Regression Detection Baseline Comparison CI/CD Gate Check ✓ PASS / ✗ FAIL Deploy ✓ Block + Fix Tools: RAGAS • BEIR • MTEB • TruLens • DeepEval • LangSmith • Custom Harness

Retrieval Benchmarks

Metric Target
Recall@5 > 85%
Recall@20 > 95%
NDCG@10 > 0.70
MRR > 0.75
Hit Rate > 95%
BEIR Zero-Shot Baseline
MTEB Rank Top 10%
Latency P95 < 100ms

Generation Benchmarks

Metric Target
Faithfulness > 0.90
Answer Relevance > 0.85
Context Relevance > 0.80
Hallucination Rate < 5%
Citation Accuracy > 90%
Correct Refusal Rate (unanswerable queries) > 80%
Completeness > 85%
TTFT P95 < 500ms

End-to-End System

Test full pipeline: retrieval → generation → post-processing. Measure user-facing latency, cost per query, success rate.

A/B Testing & Online

Shadow deploy changes. Compare metrics vs. baseline with statistical significance. Catch production surprises before full rollout.

Adversarial & Stress

Test with typos, out-of-domain queries, adversarial prompts. Load test at 10× peak. Measure robustness.

RAG Benchmark Suite Code

class RAGBenchmarkSuite:
    def __init__(self):
        self.thresholds = {
            "recall@5": 0.85,
            "faithfulness": 0.90,
            "latency_p95_ms": 100,
        }

    def run_benchmark(self, model, dataset):
        recalls, faithfulness_scores = [], []
        for q, ctx, golden_answer in dataset:
            retrieved = self.retrieve(q)
            generated = model.generate(q, ctx)

            # Accumulate per-example scores instead of overwriting them
            recalls.append(self.compute_recall(retrieved, ctx))
            faithfulness_scores.append(self.compute_faithfulness(generated, ctx))

        results = {
            "recall@5": sum(recalls) / len(recalls),
            "faithfulness": sum(faithfulness_scores) / len(faithfulness_scores),
        }
        return self.check_regressions(results, self.thresholds)

Benchmark Workflow Pipeline

1. Establish Baseline: Run suite on current production system
2. Change & Measure: Make code change, rerun benchmarks
3. Regression Detection: Compare vs. baseline, flag regressions
4. Gate Deployment: Pass gates → deploy; fail → iterate
5. Continuous Monitoring: Monitor metrics post-deploy, alert on drift

Benchmarking Tools Comparison

Tool Metrics LLM-Based Reference Best For
RAGAS Faithfulness, Answer Rel., Context Rel. Paper Gen. + Retrieval
BEIR Recall@K, NDCG@K, MRR, RMSE Yes Retrieval (IR)
MTEB Cross-lingual Retrieval, Ranking Yes Multilingual
TruLens LLM-based evals + feedback No Custom logic
DeepEval Hallucination, Answer Rel., RAGAS Optional LLM Evals
LangSmith Custom evals, tracing, logging Partial Optional Development + Monitoring
Arize Phoenix Evals + Production Observability Optional End-to-End
Custom Harness Org-specific metrics & logic Optional Org Control + Integration
Benchmarking Best Practice: Shared metrics across team prevent siloed evaluation. Deploy benchmarks as part of CI/CD to catch regressions early. Use production logs as continuous test set. Golden answers should be updated quarterly as product evolves.
16 / Benchmarking
Safety

Guardrails & Safety

Multi-stage guardrails prevent harmful input, retrieval, generation, and post-processing risks

Guardrail Pipeline: Four Stages

1. Input

  • Prompt injection detection
  • PII masking
  • Content profanity filter

2. Retrieval

  • ACL enforcement
  • Source validation
  • Freshness checks

3. Generation

  • Hallucination checker
  • Citation validator
  • Token budget limits

4. Post-Processing

  • PII scrubbing
  • Toxicity filtering
  • Output validation

GuardrailPipeline Implementation

from presidio_analyzer import AnalyzerEngine from presidio_anonymizer import AnonymizerEngine class GuardrailPipeline: def __init__(self): self.prompt_injector = PromptInjectionDetector() self.pii_analyzer = AnalyzerEngine() # Presidio self.topic_clf = TopicClassifier() self.rate_limiter = RateLimiter(max_qps=100) self.hallucination_checker = HallucinationChecker() self.toxicity_filter = ToxicityFilter() self.citation_validator = CitationValidator() async def process(self, user_input, context): # 1. Input guardrails if self.prompt_injector.is_injection(user_input): raise PromptInjectionError() pii_results = self.pii_analyzer.analyze(user_input) cleaned = anonymize(user_input, pii_results) # Redact/hash # 2. Rate limiting await self.rate_limiter.check(user_id=context['user_id']) # 3. Generation & post-processing answer = await self.llm.generate(cleaned, context) if not self.hallucination_checker.check(answer, context): answer = "I cannot answer based on available context." if not self.citation_validator.validate(answer): answer = "Citations missing or invalid." return answer
PII Detection with Microsoft Presidio: Identifies PII entities (email, phone, SSN, credit card). Actions: redact (remove), replace (substitute with category), hash (irreversible), encrypt (reversible with key). Use different strategies per context: remove sensitive PII from logs, hash for deduplication, redact for display.

Production Guardrails Architecture — Models, Tools & Design

A production guardrail system is NOT a single checkpoint. It's a layered defense architecture with specialized models at each stage — some rule-based (0ms), some ML-based (~10ms), some LLM-based (~200ms). The key is running them in parallel and using the cheapest effective check first.

PRODUCTION GUARDRAILS ARCHITECTURE — DEFENSE IN DEPTH User Input (query / message) LAYER 1: INPUT GUARDS Parallel — fastest first Prompt Injection Detector DeBERTa classifier | ~15ms PII Scanner Presidio NER | ~10ms Topic / Intent Classifier SetFit / fine-tuned | ~8ms Rate Limiter + Auth Redis token bucket | ~1ms Content Policy (Regex) blocklist + regex | ~0.5ms LAYER 2: RETRIEVAL ACL Filter (per-user) Source Trust Score Freshness Validator ~2ms (metadata check) LAYER 3: OUTPUT GUARDS Parallel — before delivery Grounding Verifier Cascading T1/T2/T3 | ~10-50ms Toxicity / Safety Filter OpenAI Moderation API | ~30ms PII Output Scrubber Presidio on response | ~10ms Citation Validator Embedding match check | ~5ms Intent Alignment Check Embedding cosine | ~10ms Policy Rule Engine Domain rules + regex | ~1ms DECISION Confidence Engine ✓ PASS ↻ RETRY ⚠ BLOCK OBSERVABILITY SPINE — Every check emits traces, metrics, and audit logs OpenTelemetry spans • Prometheus counters • Immutable audit log (query_id, check_name, verdict, latency, timestamp) MODELS & TOOLS USED AT EACH LAYER Prompt Injection deberta-v3-injection protectai/rebuff LakeraGuard API PII Detection MS Presidio AWS Comprehend Google DLP API Toxicity / Safety OpenAI Moderation Perspective API toxic-bert Grounding DeBERTa-v3-NLI Sentence Transformers Claude Haiku (T3) Frameworks Guardrails AI NeMo Guardrails LangChain Guards Observability OpenTelemetry Datadog / Grafana TruLens / Phoenix Red Team Promptfoo Garak InjecAgent

Guardrail Models — What to Use at Each Stage

Check | Model / Tool | Type | Latency | Accuracy | Cost | Best For
Prompt Injection | deberta-v3-prompt-injection | Fine-tuned classifier | ~15ms | 92% F1 | Free (self-hosted) | Primary injection defense
Prompt Injection | Lakera Guard API | Managed API | ~50ms | 95%+ F1 | $0.001/req | Higher accuracy, no infra
Prompt Injection | ProtectAI / Rebuff | Multi-layer (heuristic+LLM) | ~80ms | High | Free OSS | Defense-in-depth
PII Detection | Microsoft Presidio | NER + regex | ~10ms | High | Free (OSS) | Default PII choice
PII Detection | AWS Comprehend PII | Managed API | ~40ms | Very high | $0.01/unit | AWS-native stacks
Toxicity | OpenAI Moderation | Managed API | ~30ms | Very high | Free | Default safety check
Toxicity | Perspective API (Google) | Managed API | ~50ms | High | Free (quota) | Multi-language toxicity
Toxicity | unitary/toxic-bert | Self-hosted BERT | ~12ms | Good | Free (GPU) | Air-gapped / self-hosted
Topic / Intent | SetFit (fine-tuned) | Few-shot classifier | ~8ms | High | Free | Domain-specific blocking
Grounding | DeBERTa-v3-NLI | Cross-encoder | ~50ms | Very high | Free (GPU) | Tier 2 grounding
Grounding | Claude Haiku / GPT-4o-mini | LLM-as-judge | ~400ms | Best | ~$0.001/req | Tier 3 disputed claims
Framework | Guardrails AI | Orchestration | varies | — | Free OSS | Declarative guard chains
Framework | NeMo Guardrails (NVIDIA) | Dialog management | varies | — | Free OSS | Conversational safety flows
Red Team | Promptfoo | Testing framework | offline | — | Free OSS | CI/CD injection testing
Red Team | Garak (NVIDIA) | Vulnerability scanner | offline | — | Free OSS | Automated LLM probing

Production GuardrailOrchestrator

class GuardrailOrchestrator: """Run all guards in parallel per layer. Total latency = max(layer checks), not sum.""" def __init__(self, config: GuardConfig): # Layer 1: Input (parallel) self.input_guards = [ PromptInjectionGuard( model="deberta-v3-injection" ), PIIScanner(engine="presidio"), TopicBlocker(topics=config.blocked), RateLimiter(redis=config.redis), ContentPolicy(rules=config.rules), ] # Layer 3: Output (parallel) self.output_guards = [ GroundingVerifier(cascade=True), ToxicityFilter(api="openai"), PIIScrubber(engine="presidio"), CitationValidator(), IntentAligner(), PolicyRuleEngine(config.rules), ] async def check_input(self, query, ctx): # Run ALL input guards in parallel results = await asyncio.gather(*[ g.check(query, ctx) for g in self.input_guards ], return_exceptions=True) # Any hard block = reject immediately for r in results: if isinstance(r, BlockVerdict): return r # blocked return PassVerdict() async def check_output(self, response, ctx): results = await asyncio.gather(*[ g.check(response, ctx) for g in self.output_guards ], return_exceptions=True) # Aggregate into confidence score return self.confidence.calculate(results)

Design Principles

1. Parallel by default: Run all checks within a layer simultaneously. Latency = max(check), not sum(checks). Input layer: ~15ms. Output layer: ~50ms.

2. Cheapest first: Regex rules (0.5ms) → ML classifiers (10ms) → API calls (30ms) → LLM judges (400ms). Exit at the cheapest layer that gives a confident verdict.

3. Fail-open vs fail-closed: Safety and injection checks = fail-closed (block if check fails). PII and grounding = fail-open with degraded response (still answer, but warn).

4. Never block the user silently: Every block must include a reason. "I can't answer that because..." is better than a generic error.

5. Audit everything: Every guard verdict → immutable log with query_id, guard_name, verdict, score, latency, timestamp. Required for compliance and debugging.

Latency Budget: Total guardrail overhead must stay under 100ms P95 for real-time apps. Input layer: ~15ms (parallel). Retrieval: ~2ms (metadata). Output layer: ~50ms (parallel). Confidence scoring: ~1ms. Total: ~68ms P95 — well within budget.
Minimum Viable Guardrail Stack: If you can only deploy 4 checks, choose these: (1) Prompt injection classifier (deberta-v3, 15ms), (2) PII scanner (Presidio, 10ms), (3) OpenAI Moderation API (30ms, free), (4) Grounding verifier embedding check (10ms). Total: ~30ms parallel. Covers 90% of production failure modes. Add more checks as you scale.
17 / Guardrails & Safety
Security

Enterprise Threat Model & OWASP LLM Top 10

Map RAG attack surfaces to OWASP LLM Top 10 categories with mitigations

RAG attack surfaces by severity: Prompt Injection (Critical), Data Exfiltration (Critical), Permission Leakage (High), Data Poisoning (High), DoS / Cost Attack (Medium), Supply Chain (Medium).

1. Prompt Injection

Risk: Malicious prompts override system instructions or exfiltrate data.

Mitigations: Content sanitization, instruction stripping, system prompt dominance, tool-call allowlists.

2. Data Exfiltration

Risk: Model leaks sensitive data (PII, secrets) in responses.

Mitigations: Output filtering, PII scrubbing, redaction at generation time.

3. Permission Leakage

Risk: Weak retrieval filters expose unauthorized content.

Mitigations: ACL-aware retrieval, auth-sensitive cache keys, audit trails.

4. Data Poisoning

Risk: Malicious docs inserted into corpus, spread misinformation.

Mitigations: Ingestion validation, source trust scoring, content integrity checks.

5. DoS (Expensive Prompts)

Risk: Very long contexts, recursive tool calls exhaust resources.

Mitigations: Token budgets, hard timeouts, rate limits per user.

6. Supply Chain

Risk: Compromised embedding models or dependencies.

Mitigations: Model provenance, dependency scanning, vendor security audit.

CRITICAL SECURITY INVARIANT:

A user must never receive retrieved context (or generated content derived from it) that they are not authorised to access.

Permission-Aware Retrieval Requirements (a minimal enforcement sketch follows this list):
  • Ingest-time ACL assignment: Tag every chunk with owner/org/role ACLs
  • Query-time filter enforcement: Filter retrieved docs by user's ACL before context assembly
  • ACL-sensitive cache keys: Include user_id/org_id in cache key to prevent cross-user leakage
  • Audit trails: Log all access (who queried, what docs were retrieved, timestamps)
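A minimal sketch of the query-time filter, ACL-sensitive cache key, and audit trail, assuming chunks were tagged with org_id and allowed_groups at ingest time and that the vector store accepts a metadata filter (filter syntax varies by store); function and field names are illustrative.

import hashlib
import logging

audit_log = logging.getLogger("rag.audit")

def acl_cache_key(query: str, user_id: str, org_id: str, groups: list[str]) -> str:
    """ACL-sensitive cache key: identity is part of the key, so hits never cross users."""
    raw = f"{org_id}:{user_id}:{','.join(sorted(groups))}:{query.strip().lower()}"
    return "ragcache:" + hashlib.sha256(raw.encode()).hexdigest()

def retrieve_with_acl(vector_store, query: str, org_id: str, user_groups: list[str], k: int = 5):
    """Query-time enforcement: only return chunks the caller's groups are allowed to see."""
    docs = vector_store.similarity_search(
        query,
        k=k,
        filter={"org_id": org_id, "allowed_groups": {"$in": user_groups}},  # store-specific syntax
    )
    # Audit trail: who queried and which documents came back
    audit_log.info("retrieval", extra={"org": org_id, "doc_ids": [d.metadata.get("id") for d in docs]})
    return docs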
18 / Enterprise Threat Model & OWASP LLM Top 10
Quality & Safety

Grounding & Faithfulness

Ensure every generated claim is traceable to retrieved evidence — reduce hallucinations by 42-68%, enable inline citations, and build verifiable trust in production RAG systems.

What is Grounding?

Grounding is the process of anchoring every claim in the LLM's response to specific evidence from retrieved documents. An answer is grounded when each statement can be traced back to a source passage. An answer is faithful when it does not add information beyond what the context supports. Together, grounding + faithfulness are the primary defenses against hallucination in RAG systems.

Flow: retrieved context ([Doc1] [Doc2] [Doc3]) → LLM generation with a citation instruction ("Cite [DocN] for claims") → grounding check (NLI entailment verification, citation accuracy check, self-consistency vote) → grounded response with claims, [citations], and confidence scores (3-10% residual hallucination); ungrounded claims trigger a retry.

Grounding Techniques

1. Prompt-Based Grounding

Instruct the LLM to cite sources inline. Simplest approach — no extra models needed.

  • Inline citations: "Answer using [1], [2] notation"
  • Quote extraction: "Include exact quotes from context"
  • Abstain instruction: "Say 'I don't know' if context lacks answer"
  • Confidence tagging: "Rate confidence [HIGH/MED/LOW] per claim"

Effectiveness: Reduces hallucination by 30-45%. Easy to implement but relies on LLM compliance.

2. NLI Verification

Use Natural Language Inference models to verify each claim is entailed by retrieved context.

  • Claim decomposition: Split response into atomic claims
  • Entailment check: DeBERTa-MNLI / TRUE model per claim
  • Verdict: Entailed, Contradicted, or Neutral
  • Action: Remove/flag unentailed claims

Effectiveness: Reduces hallucination by 50-68%. Gold standard for post-hoc verification.

3. Self-Consistency Voting

Generate multiple answers and keep only claims that appear consistently across samples.

  • Sample N responses (temperature > 0)
  • Extract atomic claims from each response
  • Majority vote: Keep claims in ≥60% of samples
  • Consensus answer: Reconstruct from agreed claims

Effectiveness: 40-55% hallucination reduction. Costs N× more tokens. Best for high-stakes queries.
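A minimal sketch of the sampling-and-voting loop, assuming generate() is any LLM call run at temperature > 0; matching claims by exact sentence string is a simplification, and production systems usually cluster near-duplicate claims by embedding similarity instead.

import re
from collections import Counter

def atomic_claims(text: str) -> list[str]:
    """Rough claim split: one sentence = one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if len(s.strip()) > 10]

def self_consistent_claims(generate, prompt: str, n: int = 5, keep_ratio: float = 0.6) -> list[str]:
    """Sample N answers and keep only claims that appear in >= keep_ratio of them."""
    samples = [generate(prompt) for _ in range(n)]                 # temperature > 0 for diversity
    counts = Counter(c for s in samples for c in set(atomic_claims(s)))
    return [claim for claim, c in counts.items() if c / n >= keep_ratio]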

4. Citation-Aware Generation

Fine-tune or prompt models to generate answers with verifiable citation markers in a structured format.

  • ALCE framework: Train citation generation with NLI feedback
  • AGREE approach: Tune LLM to include citations, verify with NLI
  • Post-hoc attribution: Match generated sentences to source chunks
  • CARGO: Citation-aware routing + grounded optimization

Effectiveness: 55-70% reduction. Requires fine-tuning or structured prompting. Best quality.

Techniques Comparison

Technique Hallucination Reduction Latency Impact Cost Impact Implementation
Prompt-based citation 30-45% None None Trivial (prompt change)
Abstain instruction 20-35% None None Trivial (prompt change)
NLI post-verification 50-68% +50-100ms (DeBERTa) Low ($0.001/query) Medium (NLI model)
Self-consistency (N=5) 40-55% 5x generation time 5x token cost Easy (sampling)
RAGAS faithfulness Eval metric (not mitigation) +200ms 1 extra LLM call Medium (pipeline)
Citation-aware fine-tune 55-70% None at inference $2-5K training High (SFT + NLI)
Combined (prompt + NLI + retry) 65-80% +100-300ms Low-Medium Medium

Grounding Metrics

Faithfulness (RAGAS)

Fraction of claims in the answer that are supported by the retrieved context. Computed via LLM or NLI entailment.

Target: ≥ 0.85 for production

Citation Precision

Fraction of inline citations that actually support the claim they're attached to. Measured via NLI on (claim, cited_passage) pairs.

Target: ≥ 0.80

Citation Recall

Fraction of claims that have at least one valid citation. Missing citations = unverifiable claims, even if correct.

Target: ≥ 0.75
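A minimal sketch of computing citation precision and recall from per-claim NLI verdicts, assuming claims have already been split and each carries the citation indices it references; entails() is a placeholder wrapper around whichever NLI model you use.

def citation_metrics(claims: list[dict], passages: dict[int, str], entails) -> dict:
    """claims: [{"text": "...", "citations": [1, 3]}, ...]
    passages: {1: "chunk text", 2: "chunk text", ...} retrieved chunks by citation index
    entails(premise, hypothesis) -> bool  # NLI entailment check"""
    cited_pairs = [(c["text"], i) for c in claims for i in c["citations"]]
    supported_pairs = sum(1 for text, i in cited_pairs if entails(passages.get(i, ""), text))
    claims_with_valid_cite = sum(
        1 for c in claims if any(entails(passages.get(i, ""), c["text"]) for i in c["citations"])
    )
    return {
        # precision: the cited passage actually supports the claim it is attached to
        "citation_precision": supported_pairs / max(len(cited_pairs), 1),
        # recall: the claim has at least one valid citation
        "citation_recall": claims_with_valid_cite / max(len(claims), 1),
    }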

Production Implementation

# === Grounding Pipeline: Prompt + NLI Verification + Retry ===
from transformers import pipeline
from ragas.metrics import faithfulness  # optional: offline eval metric
import re

# 1. NLI model for claim verification
nli = pipeline(
    "text-classification",
    model="microsoft/deberta-v3-large-mnli",
    device="cuda",
)

# 2. Grounding prompt template
GROUNDED_PROMPT = """Answer the question based ONLY on the provided context.

Rules:
- Cite sources using [1], [2], etc. after each claim
- If the context doesn't contain the answer, say "I don't have enough information"
- Never add information not present in the context
- Rate your overall confidence: [HIGH], [MEDIUM], or [LOW]

Context:
{context}

Question: {question}

Answer (with citations):"""

# 3. Decompose response into atomic claims
def decompose_claims(response: str) -> list[str]:
    """Split response into individual factual claims."""
    sentences = re.split(r'(?<=[.!?])\s+', response)
    return [s.strip() for s in sentences if len(s.strip()) > 10]

# 4. Verify each claim against retrieved context
def verify_grounding(claims: list[str], context: str) -> dict:
    results = {"grounded": [], "ungrounded": [], "score": 0.0}
    for claim in claims:
        # NLI: does the context (premise) entail this claim (hypothesis)?
        verdict = nli([{"text": context, "text_pair": claim}])[0]
        if verdict["label"].upper() == "ENTAILMENT":
            results["grounded"].append(claim)
        else:
            results["ungrounded"].append(claim)
    total = len(claims)
    results["score"] = len(results["grounded"]) / total if total > 0 else 0
    return results

# 5. Full grounding pipeline with retry
def grounded_rag(query, retriever, llm, max_retries=2):
    docs = retriever.invoke(query)
    context = "\n".join([f"[{i+1}] {d.page_content}" for i, d in enumerate(docs)])

    for attempt in range(max_retries + 1):
        prompt = GROUNDED_PROMPT.format(context=context, question=query)
        response = llm.invoke(prompt)

        # Verify grounding
        claims = decompose_claims(response)
        verification = verify_grounding(claims, context)

        if verification["score"] >= 0.85:
            return {"answer": response,
                    "grounding_score": verification["score"],
                    "ungrounded": verification["ungrounded"],
                    "attempts": attempt + 1}

        # Retry with feedback on ungrounded claims
        query += f"\n\nNote: these claims were ungrounded, remove them: {verification['ungrounded']}"

    return {"answer": response,
            "grounding_score": verification["score"],
            "warning": "Below grounding threshold after retries"}

Production Recommendations

Recommended: Layered Grounding

  1. Prompt engineering — Always include citation instructions and abstain directive (free, 30-45% reduction)
  2. NLI post-check — Run DeBERTa-MNLI on claims after generation (+50ms, 50-68% reduction)
  3. Retry loop — If faithfulness < 0.85, regenerate with feedback on ungrounded claims (1-2 retries max)
  4. Fallback — If still below threshold, return partial answer with confidence warning

Combined effect: 65-80% hallucination reduction at <300ms extra latency

Monitoring & Alerts

  • Track faithfulness score per query (RAGAS or NLI-based)
  • Alert if daily avg drops below 0.80
  • Log ungrounded claims for analysis and prompt improvement
  • Sample 1% for human review — correlate with NLI scores
  • Dashboard metrics: faithfulness, citation precision, citation recall, abstain rate
  • Watch abstain rate: >30% means retrieval quality is poor, not grounding
Key Insight: A 2024 Stanford study found that combining RAG, RLHF, and guardrails led to a 96% reduction in hallucinations vs baseline. However, even with RAG, AI legal research tools still hallucinate 17-33% of citations (Stanford 2025). The lesson: always verify citations post-generation — never trust the LLM's self-reported sources without NLI or exact-match validation against the actual retrieved passages.
Common Mistake: Many teams confuse correctness with faithfulness. A response can be factually correct but unfaithful (adds info not in context). A response can be faithful but incorrect (context itself is wrong). Grounding solves faithfulness. To solve correctness, you also need high-quality retrieval and up-to-date knowledge bases.
19 / Grounding & Faithfulness
Operations

Observability & Monitoring

Three monitoring layers: system SLOs, retrieval quality, and answer groundedness

Telemetry flow: traces, metrics, and logs are exported to an OpenTelemetry Collector, which feeds dashboards and alerts.

Monitoring Layers

System SLOs

  • Latency (p50, p95, p99)
  • Throughput (QPS)
  • Error rate
  • Availability

Retrieval Quality

  • NDCG, MRR (rank quality)
  • Precision@k
  • Docs retrieved per query
  • Reranker acceptance rate

Answer Quality

  • Faithfulness (grounded?)
  • Answer relevance
  • Citation accuracy
  • User feedback signal

OpenTelemetry Tracing Decorator

import time
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-pipeline")

def trace_retrieval(func):
    async def wrapper(*args, **kwargs):
        with tracer.start_as_current_span("retrieval") as span:
            span.set_attribute("query", kwargs.get("query", ""))
            start = time.perf_counter()
            result = await func(*args, **kwargs)
            span.set_attribute("docs_retrieved", len(result))
            span.set_attribute("latency_ms", (time.perf_counter() - start) * 1000)
            return result
    return wrapper

@trace_retrieval
async def retrieve(query):
    return await index.search(query)

OpenTelemetry Collector Config (with PII Scrubbing)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  attributes:
    actions:
      - key: user.email
        action: delete          # Remove PII
      - key: http.request.header.authorization
        action: delete          # Remove tokens

exporters:
  otlp:
    endpoint: collector:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp]
LangSmith

LLM tracing, debugging

Arize Phoenix

ML observability

OpenTelemetry

Core instrumentation

Datadog

Metrics, dashboards

TruLens

RAG eval metrics

Essential Dashboards (metric definitions sketched after this list):
  • Latency Breakdown: Query transform, embedding, search, rerank, LLM, total
  • Retrieval Quality: NDCG, docs per query, reranker effectiveness
  • LLM Metrics: Token usage, cost, temperature, model routing decisions
  • User Metrics: Active users, unique queries, satisfaction (thumbs up/down)
  • System Health: Disk usage (vector DB), index freshness, cache hit rates, error budget
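One way to expose these dashboard signals, a minimal sketch using prometheus_client; the metric names, labels, and buckets are illustrative choices rather than a prescribed schema.

from prometheus_client import Counter, Histogram, Gauge

# Latency breakdown per pipeline stage
STAGE_LATENCY = Histogram(
    "rag_stage_latency_seconds", "Latency per pipeline stage",
    labelnames=["stage"],                       # query_transform, embed, search, rerank, llm
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5),
)
# Retrieval quality, LLM usage, user feedback, index freshness
DOCS_PER_QUERY = Histogram("rag_docs_retrieved", "Docs retrieved per query", buckets=(1, 3, 5, 10, 20))
TOKENS_USED = Counter("rag_llm_tokens_total", "LLM tokens consumed", labelnames=["type"])   # prompt / completion
FEEDBACK = Counter("rag_user_feedback_total", "Thumbs up/down", labelnames=["verdict"])
INDEX_FRESHNESS = Gauge("rag_index_age_seconds", "Seconds since last successful index build")

# Usage: wrap each stage call so the latency-breakdown dashboard populates itself
with STAGE_LATENCY.labels(stage="rerank").time():
    pass  # reranker call goes here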
20 / Observability & Monitoring
Infrastructure

Scaling & Performance

Scale RAG systems from thousands to billions of documents with architecture patterns

Scale Tier Document Count Query Load (QPS) Typical Latency (p95) Architecture Pattern
Small 10^5–10^6 chunks <10 QPS <2s Single in-memory FAISS index, Python app, SQLite metadata
Medium 10^7–10^8 chunks 10–300 QPS 1–4s Milvus/Weaviate cluster, Kubernetes, async queue processing, multi-region replication
Large 10^8–10^9+ chunks 300–5000+ QPS 500ms–1s Elasticsearch sharding, GPU-accelerated search (Triton), vLLM serving, distributed caching, traffic shaping
Cache lookup flow: L1 exact query cache (Redis hash match, ~0ms) → L2 semantic cache (vector similarity, ~5ms) → L3 embedding cache (content hash, avoids re-embedding) → on miss, L4 LLM response cache (prompt hash, temperature=0); any hit returns immediately.

Multi-Layer Caching Strategy

L1: Exact Query

Hash(query) → response. TTL: 24h. Hit rate: 15–25% for repeated queries.

L2: Semantic

Embedding similarity clustering. Cache similar queries together. Hit rate: 30–40%.

L3: Embedding

Cache embeddings for large docs to avoid re-embedding on every query.

L4: LLM Response

Cache LLM outputs by (query, context hash). Reduces expensive inference calls.
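A minimal sketch of the L1 and L4 cache keys with Redis, assuming deterministic generation (temperature=0) for the L4 layer; the key prefixes and TTLs are illustrative.

import hashlib
import json
from redis import Redis

redis = Redis()

def l1_get_or_set(query: str, compute_answer) -> str:
    """L1: exact match on the normalized query string, 24h TTL."""
    key = "l1:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    cached = redis.get(key)
    if cached:
        return cached.decode()
    answer = compute_answer(query)
    redis.setex(key, 86400, answer)
    return answer

def l4_key(prompt: str, context_chunks: list[str]) -> str:
    """L4: LLM response keyed by (prompt, context) hash — only safe at temperature=0."""
    payload = json.dumps({"prompt": prompt, "ctx": context_chunks}, sort_keys=True)
    return "l4:" + hashlib.sha256(payload.encode()).hexdigest()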

SemanticCache Implementation

import hashlib
from redis import Redis
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return float(dot(a, b) / (norm(a) * norm(b)))

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.95):
        self.embed = embedding_model
        self.threshold = similarity_threshold
        self.entries = []          # list of (query_embedding, answer) pairs
        self.redis = Redis()       # persistent L2 cache

    async def get(self, query):
        query_emb = await self.embed.encode(query)
        # Find semantically similar cached queries
        for cached_emb, answer in self.entries:
            if cosine_similarity(query_emb, cached_emb) > self.threshold:
                return answer      # cache hit!
        return None

    async def set(self, query, answer):
        query_emb = await self.embed.encode(query)
        self.entries.append((query_emb, answer))
        key = "cache:" + hashlib.sha256(query.encode()).hexdigest()
        self.redis.setex(key, 86400, answer)   # 24h TTL

Scaling Architecture Patterns

Retrieval Optimization

  • Horizontal scaling: Shard index by doc_id ranges
  • GPU acceleration: Triton Inference Server for embedding
  • Connection pooling: Reuse DB connections (PgBouncer)
  • Async processing: Batch embedding requests

Generation Optimization

  • vLLM: PagedAttention for high-throughput serving
  • Model parallelism: Shard LLM across GPUs
  • Quantization: INT8/FP8 for latency reduction
  • Speculative decoding: Predict + verify next tokens in parallel

KServe Kubernetes-Native Serving (vLLM)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 70              # Auto-scale at 70% utilization
    containers:
      - image: vllm/vllm-openai:latest
        env:
          - name: MODEL_NAME
            value: meta-llama/Llama-2-13b-hf
        ports:
          - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
Latency Budget Breakdown (P50):

  • Query Transform: 50ms (normalization, spell-check)
  • Embedding: 20ms (vectorize query)
  • Search: 30ms (FAISS/Milvus lookup)
  • Rerank: 50ms (cross-encoder)
  • LLM Generation: 800ms (token generation)
  • Total Expected: ~950ms P50

SLO Modes: Low-latency interactive (<2–4s p95) vs High-throughput batch (10–60s acceptable)
21 / Scaling & Performance
Advanced

Advanced Techniques

Agentic RAG

LLM agent decides what, when, and how to retrieve. Decomposes complex queries, chooses data sources (vector DB, SQL, API, web), iteratively refines results.

from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    claude,                                        # any chat model
    tools=[vector_search, sql_query, web_search],
)
result = agent.invoke(
    {"messages": [("user", "Find latest Q1 earnings and analyst sentiment")]},
    config={"recursion_limit": 5},                 # bound the agent loop
)

Graph RAG

Knowledge graphs + vector search. Entity-relationship graphs, multi-hop reasoning, community detection for summarization.

MATCH (a:Company)-[r*1..3]->(b:Company)
WHERE a.name = "Acme Inc"
WITH collect(b) as connected
CALL apoc.text.summarize(connected)
  YIELD summary
RETURN summary

RAPTOR (Tree-based)

Recursively summarize clusters into a hierarchy. Query at multiple abstraction levels for top-down reasoning.

  • Hierarchical abstraction layers
  • Efficient multi-scale retrieval
  • Reduced token cost vs flat indexing

Self-RAG (Adaptive)

LLM decides if retrieval needed, generates with self-critique tokens, iterates. Reduces unnecessary retrieval ~40%.

[Retrieval?] [Utility] [Relevance] [Correctness]
Outputs critique tokens, decides iterations

Multi-Modal RAG

Index images, tables, charts alongside text. Vision models for visual content, multi-modal embeddings.

  • Cohere embed-v4 multi-modal
  • CLIP for image-text alignment
  • Unified vector space across modalities

Multi-Tenant RAG

Namespace isolation per tenant. Shared infrastructure, isolated data. Query-time ACL enforcement.

  • Cost-efficient multi-tenancy
  • Metadata-driven access control
  • Compliance for regulated industries
22 / Advanced Techniques
DevOps

Deployment & CI/CD

Pipeline: Lint & Test → Integration Tests → RAG Eval Suite (gate: faithfulness > 0.85, relevance > 0.80) → Canary 5% → Progressive Rollout 5% → 25% → 50% → 100%, with rollback on regression.

Docker Compose Architecture

services:
  rag-api:
    image: rag-api:latest
    deploy:
      replicas: 3
    depends_on: [qdrant, redis]

  embedding-service:
    image: embedding-service:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  indexing-worker:
    image: indexing-worker:latest
    environment:
      - CELERY_BROKER=redis:6379

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"

  redis:
    image: redis:7-alpine

CI/CD Pipeline Stages

1
Lint & Unit Tests

Code quality, type checks, fast tests

2
Integration Tests

End-to-end retrieval, indexing flows

3
RAG Evaluation Suite

Gate metrics: Faithfulness >0.85, Relevance >0.80 (a minimal gate script is sketched after the stage list)

4
Canary 5%

Monitor cost, latency, error rate

5
Progressive Rollout

5% → 25% → 50% → 100% traffic
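A minimal sketch of the stage-3 evaluation gate, assuming a golden-set evaluation that returns aggregate RAGAS-style scores; run_eval() is a placeholder for your own harness.

import sys

THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80}

def ci_gate(run_eval) -> int:
    """Fail the pipeline (non-zero exit code) if any gated metric is below threshold."""
    scores = run_eval()   # e.g. {"faithfulness": 0.88, "answer_relevancy": 0.83}
    failures = {m: s for m, s in scores.items() if m in THRESHOLDS and s < THRESHOLDS[m]}
    for metric, score in failures.items():
        print(f"GATE FAILED: {metric}={score:.2f} < {THRESHOLDS[metric]}")
    return 1 if failures else 0

if __name__ == "__main__":
    # Replace the lambda with your evaluation harness
    sys.exit(ci_gate(run_eval=lambda: {"faithfulness": 0.88, "answer_relevancy": 0.83}))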

Migration Path (Prototype → Production):
  • Canonical doc schema + versioning
  • ACL-aware retrieval architecture
  • Evaluation harness in CI/CD
  • Comprehensive observability & tracing
23 / Deployment & CI/CD
Economics

Cost Optimization

Cost Drivers (Ranked)

#1 LLM Inference Tokens 45%
#2 Reranker Compute 20%
#3 Embedding Compute 15%
#4 Vector DB Costs 12%
#5 Observability 8%

Cost Breakdown per 1K Queries

  • LLM Generation: $4.50
  • Embedding Compute: $0.30
  • Reranker Compute: $0.20
  • Vector DB Storage: $1.00
  • Total: ≈ $6.00 per 1K queries (a simple cost model is sketched after the optimization levers below)
Optimization Levers:
  • Embedding Costs: Batch processing, caching, model selection, self-hosting (>1M queries/day)
  • LLM Token Costs: Context pruning, model tiering, semantic caching (50–60% reduction)
  • Infrastructure: Adaptive retrieval, hybrid tuning, dimension control, vLLM serving efficiency
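A rough cost model, a minimal sketch whose default token counts and unit prices are placeholders chosen to roughly reproduce the breakdown above under assumed GPT-4o-class pricing; substitute your provider's current rates and your measured token counts.

def cost_per_1k_queries(
    prompt_tokens: int = 1200,          # context + question per query (placeholder)
    completion_tokens: int = 150,       # answer length per query (placeholder)
    price_in_per_m: float = 2.50,       # $/1M input tokens (placeholder)
    price_out_per_m: float = 10.00,     # $/1M output tokens (placeholder)
    embed_cost: float = 0.30,           # embedding compute per 1K queries
    rerank_cost: float = 0.20,          # reranker compute per 1K queries
    vector_db_cost: float = 1.00,       # storage/serving share per 1K queries
    cache_hit_rate: float = 0.0,        # semantic cache hits skip the LLM call entirely
) -> float:
    llm = 1000 * (1 - cache_hit_rate) * (
        prompt_tokens * price_in_per_m / 1e6 + completion_tokens * price_out_per_m / 1e6
    )
    return round(llm + embed_cost + rerank_cost + vector_db_cost, 2)

print(cost_per_1k_queries())                      # ≈ $6.00, no caching
print(cost_per_1k_queries(cache_hit_rate=0.5))    # ≈ $3.75 with 50% semantic cache hits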
24 / Cost Optimization
Risk

Failure Modes & Mitigations

Irrelevant/Misleading Retrieval

Hybrid retrieval + reranking + CRAG-like evaluation gates

Over-retrieval / Context Stuffing

Cap top-k, MMR diversity, adaptive retrieval (Self-RAG)

Permission Leakage

ACL filters, ACL in cache keys, comprehensive audit trails

Prompt Injection via Retrieved Docs

Treat as untrusted, content sanitization, system prompt dominance

Embedding Drift (Model Upgrades)

Version embeddings/indexes, shadow-build, offline eval + canary

Silent Ingestion Regressions

Ingestion QA metrics, OCR confidence alerts, hit-rate monitoring

Cost Blow-ups

Token budgets, hard timeouts, reranker gating, rate limits

OWASP LLM Risk Taxonomy:

Use as a practical checklist for security & compliance governance across RAG components.

25 / Failure Modes & Mitigations
Planning

Phased Implementation Roadmap

5-Phase RAG Implementation Timeline:
  • Phase 1, Foundations (weeks 1-7): data contracts, ingestion MVP, baseline retrieval. Deliverables: standardized schemas, working index pipeline
  • Phase 2, Pilot (weeks 8-12): hybrid fusion, citations UI, eval harness. Deliverables: multi-stage retrieval, user-facing citations
  • Phase 3, Hardening (weeks 13-18): ACL-aware retrieval, PII redaction, observability. Deliverables: security & compliance controls active
  • Phase 4, Scale (weeks 19-25): CDC streaming, vLLM/KServe, cost controls. Deliverables: high-throughput, cost-efficient ops
  • Phase 5, Governance (weeks 26-32): model versioning, red-team, compliance. Deliverables: production-ready for regulated industries
26 / Implementation Roadmap
Summary

Production Readiness Checklist

Data & Indexing

  • Multi-format parsing (PDF, docx, HTML, images)
  • Incremental indexing with changelog tracking
  • Rich metadata (source, timestamp, ownership)
  • Semantic chunking strategies
  • Dead letter queue for malformed data
  • Data freshness tracking & SLAs
  • Canonical document contract

Retrieval & Generation

  • Hybrid search (dense + sparse + knowledge graph)
  • Multi-stage reranking pipeline
  • Query transformation & expansion
  • Streaming generation with token streaming
  • Citation validation & provenance
  • Confidence scoring on outputs
  • Self-correction loops (Self-RAG)

Safety & Compliance

  • Prompt injection detection & mitigation
  • PII detection (Presidio integration)
  • Hallucination detection frameworks
  • RBAC / ACL enforcement
  • Comprehensive audit logging
  • GDPR / HIPAA / EU AI Act compliance
  • NIST AI RMF & data retention policies

Operations & Scale

  • Multi-layer caching (query, response, embedding)
  • Distributed tracing (OpenTelemetry)
  • Automated evaluation in CI/CD gates
  • Canary & progressive rollout strategy
  • Cost tracking & budget enforcement
  • User feedback loop & telemetry
  • vLLM / KServe deployment optimization

"Production RAG is 20% retrieval and 80% engineering"

Data quality, evaluation frameworks, guardrails, caching strategies, comprehensive observability, and production operations are what separate a working demo from a reliable, compliant, cost-efficient product.

27 / Production Readiness Checklist

Context Compression for RAG

Reduce retrieved context length before generation — cut costs up to 80%, decrease latency, improve answer quality by eliminating noise, and fit more relevant information within the LLM's context window.

Flow: retriever returns top-k docs (10K+ tokens) → context compressor (extractive or abstractive, at token / sentence / document level, 2-20x compression) → compressed context (500-2K tokens) → LLM generator. Result: up to 80% cost reduction and less hallucination.

Compression Taxonomy

Extractive

Select the most relevant sentences or tokens from retrieved documents. No rewriting — preserves original text fidelity.

LLMLingua Selective Context Reranker-filter

Best for: Factual QA, legal/medical where exact wording matters

Abstractive

Generate condensed summaries of retrieved context. Rewrites and merges information from multiple documents into coherent compressed text.

RECOMP Summary chains Map-reduce

Best for: Multi-doc synthesis, when space is extremely limited

Hybrid / Learned

Neural models trained to compress context into summary vectors or learned soft tokens. Encode key information into fixed-size representations.

AutoCompressors Gisting ECoRAG

Best for: Very long context, embedding-level compression

Key Techniques Compared

Technique Type Compression Quality Retention Latency Overhead Best For
LLMLingua-2 Extractive (token-level) 3-20x 95-98% ~10ms (small classifier) General-purpose; best quality/speed ratio
LongLLMLingua Extractive (query-aware) 2-10x 97-100% (can improve +21%) ~15ms Multi-doc RAG; combats lost-in-middle
Selective Context Extractive (sentence-level) 2-5x 93-96% ~5ms Simple baseline; minimal dependencies
Reranker + Top-K Filter Extractive (document-level) 2-5x 95-99% ~20-50ms (cross-encoder) Already using reranker; simplest integration
RECOMP (Extractive) Extractive (trained selector) 5-10x 94-97% ~15ms NQ/TriviaQA-style single-answer tasks
RECOMP (Abstractive) Abstractive (trained summarizer) 10-20x 90-95% ~100-200ms (small LM gen) Multi-hop reasoning; extreme compression
AutoCompressors Learned (summary vectors) 20-50x 85-92% ~50ms Very long documents; fixed-budget context
Map-Reduce Summary Abstractive (LLM chain) 10-50x 80-90% ~500ms-2s (LLM calls) 100+ page documents; report generation
ECoRAG Hybrid (evidentiality-guided) 5-15x 96-99% ~20ms Long context RAG; evidence-focused answers

LLMLingua Family — Production Standard

LLMLingua (v1)

Uses a small language model (e.g., GPT-2, LLaMA-7B) to compute per-token perplexity. Tokens with low perplexity (highly predictable) are dropped. Budget-constrained iterative token pruning.

  • Up to 20x compression
  • Only 1.5% performance loss on reasoning
  • Works with any LLM (black-box compatible)

LLMLingua-2

Reframes compression as a token classification problem. A small BERT-like model predicts which tokens to keep/drop. Trained on GPT-4 distilled labels.

  • 3-6x faster than LLMLingua v1
  • 95-98% accuracy retention
  • Task-agnostic — no prompt-specific tuning
  • Published at ACL 2024

LongLLMLingua (RAG-Optimized)

Specifically designed for RAG pipelines. Three key innovations:

  • Question-aware coarse-to-fine: Compresses differently based on query relevance — keeps more tokens from highly relevant passages
  • Document reordering: Combats the "lost-in-middle" problem by placing most relevant docs at start/end
  • Dynamic compression ratios: Uses contrastive perplexity (question-conditioned vs unconditional) to decide per-document compression level

Result: Up to 21.4% RAG quality improvement using only 25% of tokens

RECOMP — Trained Compression

Two variants from Princeton/CMU research:

  • Extractive: Trained selector picks most useful sentences from each document. Fast, preserves original text.
  • Abstractive: Trained T5-based summarizer generates concise summaries conditioned on the query. Higher compression but rewrites text.

Both outperform no-compression baselines on NQ and TriviaQA while using 5-20x fewer input tokens.

Production Implementation

# === LLMLingua-2 with LlamaIndex ===
from llmlingua import PromptCompressor
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Initialize compressor (uses small model for token classification)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cuda",
)

# Retrieve documents (standard RAG pipeline)
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("What are the key findings?")

# Compress retrieved context before sending to LLM
context = "\n\n".join([n.get_content() for n in nodes])
compressed = compressor.compress_prompt(
    context,
    instruction="Answer the question based on the context.",
    question="What are the key findings?",
    target_token=500,          # target compressed length
    rate=0.5,                  # 50% compression ratio
    force_tokens=["?", "."],   # always keep these tokens
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Saving: {compressed['saving']}")

# Use compressed context for generation
compressed_prompt = compressed["compressed_prompt"]
# === LangChain Contextual Compression Retriever ===
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    LLMChainExtractor,
    EmbeddingsFilter,
    DocumentCompressorPipeline,
)
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Strategy 1: LLM-based extraction (highest quality, highest latency)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

# Strategy 2: Embeddings filter (fast, no LLM call)
embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76,  # drop docs below threshold
)

# Strategy 3: Pipeline — split → filter → extract (recommended)
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
pipeline = DocumentCompressorPipeline(
    transformers=[splitter, embeddings_filter]
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

# Use in chain
docs = compression_retriever.invoke("What are the quarterly results?")
print(f"Retrieved {len(docs)} compressed docs")

Production Recommendations

Recommended: Tiered Compression

Combine multiple techniques in a pipeline for best results:

  1. Document-level: Reranker filters top-k → top-3-5 docs
  2. Sentence-level: LLMLingua-2 or EmbeddingsFilter removes irrelevant sentences
  3. Token-level: LLMLingua-2 prunes redundant tokens (optional, for aggressive compression)

Typical result: 5-10x compression, 96%+ quality, <30ms overhead

When to Use Each Approach

  • Low latency budget (<10ms): Embeddings filter or top-k reranker cutoff
  • Moderate latency (<50ms): LLMLingua-2 token classification — best all-around
  • Maximum quality: LongLLMLingua with question-aware compression
  • Extreme compression (>10x): RECOMP abstractive or map-reduce
  • Cost-sensitive: LLMLingua-2 + small local model (no API calls)
  • Multi-hop QA: ECoRAG evidentiality-guided compression

Cost & Latency Impact

Scenario Input Tokens With Compression Token Savings Cost Savings (GPT-4o)
5 docs @ 500 tokens 2,500 625 (4x) 1,875 $4.69 / 1K queries
10 docs @ 800 tokens 8,000 1,600 (5x) 6,400 $16.00 / 1K queries
20 docs @ 1000 tokens 20,000 2,000 (10x) 18,000 $45.00 / 1K queries
1M queries/month (10 docs) 8B tokens 1.6B tokens 6.4B tokens $16,000/month saved
Quality Paradox: Context compression often improves answer quality (not just reduces cost). By removing irrelevant passages, the LLM focuses on truly relevant information — reducing hallucinations by 15-30% in studies. LongLLMLingua showed up to 21.4% quality improvement on multi-document QA tasks while using only 25% of original tokens.
Watch Out: Abstractive compression (RECOMP abstractive, map-reduce) can introduce factual errors in the compressed output. Always validate with extractive baselines first. For legal, medical, or compliance-critical RAG, prefer extractive methods that preserve exact source wording.

Framework Integration

LangChain

ContextualCompressionRetriever wraps any base retriever + compressor pipeline. Built-in: LLMChainExtractor, EmbeddingsFilter, DocumentCompressorPipeline.

pip install langchain

LlamaIndex

LongLLMLinguaPostprocessor integrates directly into query pipeline as a node postprocessor. Supports LLMLingua-2.

pip install llmlingua llama-index

Direct (Microsoft)

PromptCompressor from the llmlingua library. Framework-agnostic — works with any pipeline. Supports CUDA acceleration.

pip install llmlingua

Production Checklist: 1) Benchmark compression ratio vs quality on your domain data. 2) Start with LLMLingua-2 as default. 3) Add reranker pre-filter for >10 retrieved docs. 4) Monitor compressed token counts and generation quality daily. 5) Set compression ratio alerts if quality drops >3%. 6) Cache compressed results for repeated queries.
28 / Context Compression

RAG Taxonomy — The Complete Map

A hierarchical taxonomy of RAG architectures showing how different approaches relate, evolve, and specialize. From naive foundations to advanced agentic systems.

Taxonomy: Naive RAG → Advanced RAG → Modular RAG (composable modules) → Agentic RAG (LLM orchestration) and Self-RAG / CRAG (self-reflective), with specializations in Multimodal RAG, Adaptive RAG, Hybrid RAG, and CAG. Key Insight: all RAG approaches share retrieve → context window → generate; the differences lie in WHAT, HOW, and WHEN to retrieve.

Evolution Timeline

2020-2022: Foundation

  • Naive RAG emerges as standard approach
  • Embedding models (BERT, Sentence-BERT) become practical
  • Simple chunk → embed → retrieve → generate pattern

2023-2024: Maturation

  • Advanced RAG techniques (reranking, query rewriting)
  • Modular approaches (LangGraph, DSPy) gain adoption
  • Self-RAG papers published, agentic patterns emerge

2024-2025: Specialization

  • Multimodal and hybrid systems
  • Adaptive routing based on query complexity
  • CAG with extended context windows (200K+)

Future: Integration

  • Unified frameworks combining multiple techniques
  • Automatic approach selection via meta-reasoning
  • Stronger metrics for measuring RAG quality

Quick Taxonomy Comparison

Type Complexity Best For Latency
Naive RAG Low Prototyping, simple Q&A Fast (100-500ms)
Advanced RAG Medium Production systems, accuracy Moderate (500ms-2s)
Modular/Agentic High Complex reasoning, multi-step Slower (2-10s)
CAG Low (setup) Small corpus, low latency Fastest (<100ms)
Production Insight: Don't pick one. Build a router that selects the best RAG approach for each query. Simple queries (FAQ style) → Naive. Complex queries → Advanced + reranking. Multi-step reasoning → Agentic. When corpus is small (<50 pages) → CAG with long context. The taxonomy helps you understand what tool to reach for at each moment.
29 / RAG Taxonomy

Naive RAG — The Foundation Pattern

The simplest RAG architecture: chunk documents → embed → retrieve → generate. Powerful for basic Q&A but suffers from lost-in-the-middle, no query transformation, and no reranking. A great starting point, but not production-ready alone.

Indexing flow: raw docs → chunk → embed → vector DB. Query flow: user question → embed query → top-K retrieve → LLM generate → answer.

Core Limitations

Lost in the Middle

Models attend less to information in the middle of long contexts. With naive RAG returning k=5 documents, the first and last chunks get more weight. Reranking solves this.

No Query Transformation

Complex questions aren't rewritten. A query like "How does RAG work?" gets no transformation — you retrieve with the literal user text, missing semantic variation.

No Reranking

Retrieval rank is final. If the embedding metric ranks doc #3 high but it's actually irrelevant, there's no second-pass reranker to fix it.

No Fallback Strategy

If retrieval fails or returns low-confidence results, the model still generates based on whatever was retrieved. No threshold checks or secondary retrieval.

Code Example: Minimal Naive RAG

# === Minimal Naive RAG ===
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Load and chunk documents
documents = ["doc1 text...", "doc2 text..."]
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text("\n\n".join(documents))

# 2. Embed and index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_texts(chunks, embeddings)

# 3. Create naive RAG chain
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Just stuff context into prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

# 4. Query
result = qa_chain.invoke({"query": "What is RAG?"})
print(result["result"])

When to Use Naive RAG

✓ Good For: Prototypes, FAQ systems with simple queries, small corpora (<100 docs), internal tools where speed > perfection. Start here, then upgrade when accuracy suffers.
✗ Not For: Production customer-facing systems, complex multi-hop questions, ambiguous queries, domains where accuracy is critical (legal, medical). Use Advanced RAG techniques instead.
Production Tip: Naive RAG is your baseline. Measure its performance first (accuracy, latency, cost). Then add Advanced techniques one at a time: query rewriting, reranking, etc. Only add complexity when metrics justify it. Many systems optimize naive RAG with better embeddings and chunking rather than jumping to complex architectures.
30 / Naive RAG

Advanced RAG — Production Optimization

Layer 3 techniques onto naive RAG: query transformation, hybrid retrieval, reranking, context compression, and self-correction. Most production systems live here. Better accuracy with manageable complexity.

Pre-retrieval optimization (query rewriting: multi-query, HyDE, contextual expansion) → retrieval optimization (hybrid search: fuse BM25 sparse + vector dense with RRF) → post-retrieval optimization (advanced ranking: ColBERT, LLM-Rank, LLM-as-Judge; context compression; generate + self-check).

Typical Advanced RAG Flow: 1. Rewrite query (LLM or heuristic) → 2. Retrieve with both BM25 and vectors → 3. Fuse results (RRF) → 4. Rerank top-k with cross-encoder → 5. Compress/filter for LLM context → 6. Generate answer → 7. Self-correction (question-answering validation). Cost: more API calls (+rewrite, +rerank), but higher accuracy. Latency: 500ms-2s typical. Much better for production quality.

Three Optimization Layers

Pre-Retrieval

  • Query Rewriting: Rephrase for clarity
  • HyDE: Generate hypothetical doc
  • Multi-Query: Ask multiple ways
  • Contextual Expansion: Add domain context

Retrieval

  • Hybrid Search: BM25 + vectors
  • RRF Fusion: Merge rankings
  • Semantic Router: Route by topic
  • Metadata Filtering: Pre-filter

Post-Retrieval

  • Reranking: Cross-encoder scoring
  • Context Compression: Distill docs
  • Diversity: Remove redundancy
  • Self-Correction: Validate output

Code: Advanced RAG with Reranking

# === Advanced RAG: Query Rewriting + Hybrid + Reranking ===
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker

# 1. Query rewriting (LLM-based)
def rewrite_query(query, llm):
    prompt = f"Rewrite this for clarity: {query}"
    return llm.invoke(prompt).content

# 2. Hybrid retrieval: BM25 + Vector
bm25_retriever = BM25Retriever.from_documents(docs)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],  # RRF fusion
)

# 3. Reranking with cross-encoder
compressor = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"),
    top_n=5,
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble,
)

# 4. Execute advanced RAG
query = "What is prompt engineering?"
rewritten = rewrite_query(query, llm)
docs = compression_retriever.invoke(rewritten)
# Best docs are now ranked, rewritten query improved retrieval

Tools & Frameworks

Pre-Retrieval Tools

  • HyDE — Generate hypothetical answers (sketched after this list)
  • Query2doc — Multi-query expansion
  • Prompt chaining — Step-by-step rewriting
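A minimal HyDE sketch: generate a hypothetical answer with the LLM, then search with that passage instead of the raw question; assumes a LangChain-style chat model and vectorstore, and the sample question is a placeholder.

def hyde_retrieve(query: str, llm, vectorstore, k: int = 5):
    """HyDE: search with a hypothetical answer, not the question itself."""
    hypothetical = llm.invoke(
        f"Write a short passage that plausibly answers this question:\n{query}"
    ).content
    # The hypothetical passage lies closer to real answer passages in embedding space
    return vectorstore.similarity_search(hypothetical, k=k)

docs = hyde_retrieve("What chunk size works best for legal contracts?", llm, vectorstore)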

Reranking Models

  • ColBERT — Token-level scoring
  • LLM-Rank — Use LLM as judge
  • Jina Reranker — Free, open API
Production Insight: Most production RAG systems use Advanced RAG. It's the sweet spot: 15-30% better accuracy than naive, manageable complexity, acceptable latency. Start with query rewriting + hybrid search (lowest cost). Add reranking if accuracy plateaus. Use context compression to stay under token limits. Self-correction (validate answers) catches hallucinations before users see them.
31 / Advanced RAG

Modular RAG — Composable Components

Break RAG into pluggable, reusable modules: routing, retrieval, reranking, generation. Mix and match for different scenarios. Powers DSPy, LangGraph, and production orchestration systems. Enables rapid experimentation and A/B testing of components.

Pipeline: query router (select path) → retrievers (BM25, vector, graph DB) → fusion (RRF or learned) → reranker (cross-encoder) → generator (LLM) → validator (feedback loop). Modular Philosophy: each box is an independently swappable module. Test different retrievers without changing the reranker; swap the LLM without touching retrieval. Modules communicate via standard interfaces (documents, scores, embeddings), which enables A/B testing, gradual upgrades, and rapid iteration.

Core Modules

Router

  • Semantic Router: Route by topic
  • Rule-Based: Pattern matching
  • LLM Router: Let model decide
  • Multi-Query: All paths

Retrievers

  • Vector: Semantic search
  • BM25: Keyword search
  • Graph: Relationship-based
  • Fusion: Merge multiple

Processors

  • Reranker: Score/reorder
  • Compressor: Shrink context
  • Filter: Remove irrelevant
  • Validator: Check quality

Frameworks & Tools

LangGraph

State machine-based orchestration. Define nodes (modules) and edges (flow). Built on LangChain. Great for explicit control flow and multi-step pipelines.

DSPy

Stanford framework for modular composition. Signatures define input/output contracts. Optimizers auto-tune prompts. Excellent for experimentation.

LlamaIndex

Query engines compose retrievers and query fusion. Component-based architecture. Strong integrations with vector DBs and LLM APIs.

Custom Orchestration

Build from scratch with Python. Explicit control, minimal dependencies. Good when you need very specific workflows or want to avoid framework lock-in.

Code: Modular RAG with LangGraph

# === Modular RAG with LangGraph ===
from langgraph.graph import StateGraph
from typing import TypedDict

class RAGState(TypedDict):
    query: str
    route: str
    retrieved_docs: list
    answer: str

# Define modules as functions
def router_module(state):
    "Route query: simple, complex, or multi-hop"
    route = "simple" if len(state["query"].split()) < 5 else "complex"
    return {"route": route}

def retrieve_module(state):
    "Use appropriate retriever based on route"
    if state["route"] == "simple":
        docs = simple_retriever.invoke(state["query"])
    else:
        docs = hybrid_retriever.invoke(state["query"])
    return {"retrieved_docs": docs}

def generate_module(state):
    "Generate answer from retrieved docs"
    context = "\n".join([d.page_content for d in state["retrieved_docs"]])
    prompt = f"Context: {context}\n\nQ: {state['query']}\nA:"
    answer = llm.invoke(prompt)
    return {"answer": answer}

# Wire modules into graph
graph = StateGraph(RAGState)
graph.add_node("router", router_module)
graph.add_node("retriever", retrieve_module)
graph.add_node("generator", generate_module)
graph.add_edge("router", "retriever")
graph.add_edge("retriever", "generator")
graph.set_entry_point("router")
graph.set_finish_point("generator")

# Invoke
app = graph.compile()
result = app.invoke({"query": "What is RAG?"})
print(result["answer"])
Production Insight: Modular RAG shines for large teams and complex workflows. Each module can be owned, tested, and upgraded independently. Use LangGraph for orchestration-heavy systems. Use DSPy for research/experimentation where you're iterating on prompts and optimizers. Start modular from the beginning if you expect the system to grow.
32 / Modular RAG

Agentic RAG — LLM as Orchestrator

The LLM decides when, what, and how to retrieve. Uses ReAct (Reasoning + Action), tool calling, and iterative multi-step reasoning. Can perform complex workflows: plan → retrieve → refine → retrieve again → generate. Closest to human problem-solving.

Agentic loop (ReAct pattern): Thought (reasoning) → Action (tool call) → Observation (tool output) → Continue? (LLM decides), iterating until done. Available tools: retrieve_docs (search corpus), web_search (search web), query_database (SQL query), calculate (math/analysis), custom_tool (your logic).

Why Agentic RAG?

Multi-Step Reasoning

Complex questions often need multiple retrievals. "Who won the 2024 Oscars and what's their next film?" → Retrieve Oscars → Retrieve actor bio → Retrieve filmography. Agentic handles this naturally.

Tool Composition

The LLM decides which tool to use. Combine retrieval, web search, SQL, calculators, APIs. The model figures out the workflow instead of you hard-coding it.

Uncertainty Handling

If the model is uncertain, it can retrieve more docs, search the web, or ask for clarification. No fixed pipeline — it adapts to the problem.

Explainability

You see the chain of thought: "I need to find... then I'll retrieve... then I'll compute...". Why the model did something is transparent.

Code: Agentic RAG with LangGraph

# === Agentic RAG with Tool Use ===
from langgraph.prebuilt import create_react_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo")

# Define tools as functions
@tool
def retrieve_docs(query: str) -> str:
    "Search the knowledge base for documents matching the query"
    docs = vector_store.similarity_search(query, k=5)
    return "\n".join([d.page_content for d in docs])

@tool
def web_search(query: str) -> str:
    "Search the web for current information"
    results = tavily_search(query, max_results=3)
    return "\n".join(results)

@tool
def query_database(sql: str) -> str:
    "Execute a SQL query against the database"
    result = db.execute(sql)
    return str(result)

# Create agent with ReAct pattern
tools = [retrieve_docs, web_search, query_database]
agent = create_react_agent(llm, tools)

# Invoke with a complex question
result = agent.invoke({
    "messages": [("user", "Find documents about RAG, then search web for latest developments, "
                          "then tell me the top 3 trends")]
})
print(result["messages"][-1].content)
# LLM decides which tools to call, in what order

Frameworks

LangGraph Agents

  • create_react_agent — ReAct out-of-the-box
  • Custom graphs for specialized workflows
  • Built-in memory and persistence

Anthropic API

  • Claude tool use (native tool calling) for agentic flows
  • Native support for multi-turn conversations
  • Forced tool use via tool_choice in Claude 3.5+
Caution: Agentic RAG is more powerful but also slower (multiple LLM calls, tool invocations). Latency can be 5-30 seconds. Best for offline/async tasks or when reasoning complexity justifies the cost. For real-time, low-latency systems, stick with Advanced RAG. Monitor token usage carefully — agentic patterns use 3-5x more tokens than naive RAG.
33 / Agentic RAG

Self-RAG & Corrective RAG (CRAG) — Self-Reflective Retrieval

The model reflects on its own retrieval and generation. Self-RAG evaluates retrieved doc relevance and output correctness. CRAG adds automatic fallback to web search if retrieval confidence is low. Both reduce hallucination through self-correction.

Self-RAG decision flow: query → (1) Retrieve? (yes/no) → retrieve if needed → (2) Relevant? (yes/no) relevance check → generate answer → (3) Correct? (yes/no) verify correctness → output. CRAG adds a web-search fallback: retrieve from the knowledge base, run a confidence check against a threshold; high confidence proceeds, low confidence falls back to web search for current information. Outcome: higher accuracy through self-reflection, with the web fallback keeping answers grounded and current.

Self-RAG Decisions

Retrieve?

Can I answer from my weights? Or do I need external knowledge? Smart models skip retrieval for "What is 2+2?" but retrieve for "Latest AI trends."

Relevant Docs?

Are the retrieved docs actually answering the question? If not, re-retrieve or retrieve differently. This prevents using irrelevant context.

Correct Output?

Does my answer follow from the retrieved docs? Or did I hallucinate? Self-check before outputting. This is explicit hallucination detection.

Code: Self-RAG Pattern

# === Self-RAG: Query Routing with Verification ===
from langchain.prompts import PromptTemplate

# Step 1: Decide whether to retrieve
decide_to_retrieve = PromptTemplate.from_template(
    """Given this question, should we retrieve documents?
Question: {query}
Answer Yes or No. Be decisive. Questions about 'latest' or 'current' → Yes."""
)

# Step 2: Retrieve and verify relevance
verify_relevance = PromptTemplate.from_template(
    """Are these documents relevant to the question?
Question: {query}
Documents: {docs}
Rate as RELEVANT, PARTIALLY_RELEVANT, or NOT_RELEVANT"""
)

# Step 3: Generate and verify correctness
verify_generation = PromptTemplate.from_template(
    """Based on these documents, generate an answer. Then verify it.
Documents: {docs}
Question: {query}
Answer: [Your answer]
Supported_by_docs: Yes or No (is the answer grounded in the documents?)"""
)

# Full Self-RAG flow
def self_rag(query, llm):
    # 1. Decide to retrieve
    should_retrieve = llm.invoke(decide_to_retrieve.format(query=query))
    if "Yes" not in should_retrieve.content:
        return llm.invoke(f"Q: {query}\nA:")  # Answer without retrieval

    # 2. Retrieve
    docs = vectorstore.similarity_search(query, k=5)

    # 3. Verify relevance
    relevance = llm.invoke(verify_relevance.format(query=query, docs=str(docs)))
    if "NOT_RELEVANT" in relevance.content:
        # Fallback: web search for CRAG
        web_docs = tavily_search(query)
        docs = web_docs

    # 4. Generate with verification
    result = llm.invoke(verify_generation.format(query=query, docs=str(docs)))
    if "No" in result.content:  # Not supported by docs
        return "I cannot answer this based on available information."
    return result.content

Self-RAG vs CRAG

Aspect Self-RAG CRAG
Self-Reflection Decides retrieve, eval relevance, verify output Same + web fallback on low confidence
Data Source Only knowledge base + model weights Knowledge base + web search fallback
Currency Limited to indexed knowledge Can access real-time web data
Best For Internal knowledge, hallucination prevention Questions needing current info
Production Insight: Self-RAG dramatically reduces hallucinations. Add it to any RAG system: simple retrieve-generate baseline now becomes: decide → retrieve → validate → generate → verify. Adds ~500ms per query. CRAG is better for time-sensitive questions (news, stock prices) where web search matters. Use CRAG for customer-facing systems with high accuracy requirements.
34 / Self-RAG & CRAG

Adaptive RAG — Dynamic Strategy Selection

Classify query complexity and dynamically select retrieval strategy. Simple questions skip retrieval. Moderate questions use single-step retrieval. Complex questions trigger multi-step retrieval and reasoning. Optimizes latency and accuracy on a per-query basis.

User query → complexity classifier → Simple (direct LLM), Moderate (retrieve + generate), or Complex (multi-step RAG) → answer. Simple query: "What is RAG?" → no retrieval needed, the LLM has general knowledge. Moderate query: "How does hybrid search work?" → retrieve docs, generate answer in a single pass. Complex query: "Compare Self-RAG vs CRAG with specific trade-offs for my use case" → multi-hop retrieval and reasoning.

Three Routing Strategies

Simple

  • No retrieval
  • LLM answers from weights
  • Lowest latency
  • Examples: "What is 2+2?", "Who is Elon Musk?"

Moderate

  • Single retrieval step
  • Hybrid search (BM25+vector)
  • Rerank top-5
  • Examples: "Explain RAG", "Latest AI news"

Complex

  • Multi-step agentic flow
  • Multiple retrievals + reasoning
  • Web search fallback
  • Examples: Comparative analysis, multi-part questions

How to Classify Complexity

Rule-Based

  • Word count < 5 → Simple
  • Contains "compare", "vs" → Complex
  • Contains "how", "why" → Moderate+
  • Fast, deterministic

LLM-Based

  • Use LLM to classify query
  • More accurate but slower
  • Handles nuance and edge cases
  • Cache classification results

Code: Query Routing

# === Adaptive RAG: Route by Complexity ===
def classify_complexity(query: str) -> str:
    "Simple rule-based classifier"
    words = query.lower().split()
    if len(words) < 5:
        return "simple"
    complex_indicators = ["compare", "versus", "vs", "trade-off", "analyze"]
    if any(ind in query.lower() for ind in complex_indicators):
        return "complex"
    return "moderate"

def adaptive_rag(query, llm):
    # Step 1: Classify
    complexity = classify_complexity(query)

    # Step 2: Route
    if complexity == "simple":
        # Direct LLM answer
        return llm.invoke(f"Q: {query}\nA:")
    elif complexity == "moderate":
        # Single retrieval + generation
        docs = hybrid_retriever.invoke(query)
        context = "\n".join([d.page_content for d in docs])
        prompt = f"Context: {context}\n\nQ: {query}\nA:"
        return llm.invoke(prompt)
    else:  # complex
        # Multi-step agentic RAG
        agent = create_react_agent(llm, [retrieve_docs, web_search, analyze_tool])
        return agent.invoke({"messages": [("user", query)]})

# Usage
answer = adaptive_rag("What is RAG?", llm)                                   # → Simple route, fast
answer = adaptive_rag("Explain vector RAG with embeddings", llm)             # → Moderate
answer = adaptive_rag("Compare all RAG types with latency trade-offs", llm)  # → Complex
Production Insight: Adaptive RAG dramatically improves user experience. Average latency drops 30-40% because many queries skip expensive retrieval. Accuracy improves because complex queries get multi-step reasoning. Start with rule-based routing for speed. Upgrade to LLM-based classification if accuracy plateaus. Measure query complexity distribution — if most are simple, adaptive RAG saves significant cost.
35 / Adaptive RAG

Multimodal RAG — Text, Images, Audio, Video

Extend RAG beyond text to images, tables, audio, video. Use multimodal embeddings (CLIP, GPT-4V) for cross-modal retrieval. Unified indexing allows querying like "find images of dogs" or "transcript sections about AI." Emerging but powerful for rich media corpora.

Input modalities (text documents, images/photos/charts, tables in CSV or HTML, audio transcripts, video frames + audio) pass through a multimodal embedding layer (CLIP, GPT-4V) into a unified vector space for cross-modal retrieval. Query modalities: a text query ("Find images of...") or an image query ("similar images") is embedded into the same space, and retrieval returns mixed results: text docs, images, and video clips all ranked by similarity.

Multimodal Embedding Models

CLIP

  • Text ↔ Image alignment
  • Open-source (OpenAI)
  • Fast inference
  • Good for product images (see the embedding sketch after these model cards)

GPT-4V / Claude

  • Vision + language understanding
  • API-based (cost)
  • Excellent description
  • Complex visual reasoning

LLaVA / Falcon

  • Open-source vision LLMs
  • Self-hosted option
  • Decent accuracy
  • Lower cost than APIs
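A minimal sketch of embedding text and images into one space with the sentence-transformers CLIP checkpoint; the image file name and query strings are placeholders.

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")   # encodes both text and images into one space

# Index side: captions/snippets and an image share the same embedding space
text_emb = model.encode(["a golden retriever playing in snow", "quarterly revenue table"])
img_emb = model.encode(Image.open("product_photo.jpg"))   # placeholder path

# Cross-modal query: one text query scored against text and image embeddings alike
query_emb = model.encode("dog in winter")
print(util.cos_sim(query_emb, text_emb))   # text -> text similarity
print(util.cos_sim(query_emb, img_emb))    # text -> image similarity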

Use Cases

E-commerce

Upload product photo → find similar products. Retrieve docs describing materials. Both text and image results ranked together.

Scientific Research

Search for papers + retrieve figures/tables. "Find papers about protein folding with diagrams." Text + images indexed together.

Video Content

Retrieve video sections by transcript. "Find the part where they explain embeddings" → Return timestamp + transcript excerpt.

Documentation

Index docs + diagrams. "How do I deploy on AWS?" → Text guide + architecture diagram retrieved together.

Tools & Frameworks

Multimodal Indexing

  • LlamaIndex MultiModal — Multi-doc indexes
  • Vespa — Text + image vectors
  • Qdrant multimodal plugin — Native support
  • Weaviate — Multi-modal indexing

Embedding APIs

  • OpenAI CLIP — Multi-modal embeddings
  • Google Gemini Vision — Image understanding
  • Anthropic Claude Vision — Rich analysis
  • Hugging Face models — Open source options
Production Consideration: Multimodal RAG adds complexity and storage overhead. Images/videos require special handling. Embeddings are larger. Best for rich corpora where modality matters. For text-only systems, focus on Advanced RAG first. When adding multimodal, start with images (simplest) and expand to video/audio if use case justifies cost.
36 / Multimodal RAG

Hybrid RAG — Fusing Multiple Retrieval Methods

Combine sparse (BM25, keyword) + dense (vector embeddings) + structured (graph, SQL). Use fusion algorithms (RRF, learned fusion) to merge rankings. Eliminates single point of failure. Captures both exact matches and semantic similarity. Production RAG standard.

The query fans out to BM25, vector, and graph DB retrievers, each producing its own ranking; a fusion algorithm merges them into a single fused ranking that keeps the best of all.

Fusion Strategies:
  • Reciprocal Rank Fusion (RRF): score = Σ 1/(k + rank_i). Simple and parameter-free; best general choice. k=60 is typical (Elasticsearch default).
  • Weighted Fusion: score = w1*norm(bm25) + w2*norm(vector). Requires tuning weights; popular choices are 0.5/0.5 or BM25-biased, and weights can be optimized on a validation set.
  • Learned Fusion: train an ML model on relevance labels (LambdaMART, learning-to-rank). Best accuracy but needs training data; used by Google and Bing.

Retrieval Method Combinations

Combination Strengths Cost Use When
BM25 + Vector Keywords + semantic, high recall, no gaps Low Production standard. Always start here.
BM25 + Vector + Graph Keywords, semantic, entity relationships Medium Structured data: knowledge graphs, ontologies
Multiple Dense Different embedding models, perspectives Medium-High Unclear best embedding model. Ensemble approach.
Full Hybrid All modalities covered, highest recall High Complex domain, diverse corpus types

Code: RRF Fusion

# === Hybrid RAG with RRF Fusion ===
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# 1. Create two retrievers
bm25_retriever = BM25Retriever.from_documents(docs)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# 2. Ensemble with RRF (built-in)
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],  # weighted reciprocal rank fusion
)

# 3. Use in RAG chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=ensemble,
    return_source_documents=True,
)
result = qa.invoke({"query": "What is RAG?"})

# 4. Manual RRF scoring (if needed)
def rrf_fusion(bm25_docs, vector_docs, k=60):
    """Reciprocal Rank Fusion"""
    scores = {}
    for rank, doc in enumerate(bm25_docs, 1):
        scores[doc.metadata["id"]] = 1 / (k + rank)
    for rank, doc in enumerate(vector_docs, 1):
        doc_id = doc.metadata["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:5]  # Top-5
Production Standard: Hybrid RAG (BM25 + vectors) is the production baseline. RRF fusion is simple and parameter-free. Increases recall by 10-20% over vector search alone. Zero additional infrastructure if you use PostgreSQL FTS (BM25) + pgvector. Most mature production systems start here. Only add complexity (learning-to-rank, multi-graph, etc.) if metrics plateau.
37 / Hybrid RAG

Cache-Augmented Generation (CAG) — Pre-load Knowledge into KV Cache

Instead of retrieving at runtime, pre-load the entire corpus into the model's context and cache its key-value (KV) states. Eliminates retrieval latency. Only feasible for small corpora (<100 pages) that fit in extended context windows. The fastest possible RAG approach when the corpus fits.

Traditional RAG (every query): Query → Retrieve (~200ms) → Generate (~1s) ≈ 1.2s total per query.

Cache-Augmented Generation: one-time setup (corpus + docs → tokenize all → pre-compute KV cache of the knowledge), then per cached query: Query → cached KV → Generate answer in <100ms with no retrieval overhead.

Trade-off: Traditional RAG is slower per query but works for any corpus size; CAG gives fast queries but only for small corpora (context-window limited).

When CAG Makes Sense

Small Knowledge Base

Entire corpus <100 pages, <50K tokens. Product docs, internal policies, FAQs. Fits in 200K context windows easily.

Real-Time Latency Critical

Sub-100ms response time needed. Chatbots, real-time assistants. Retrieval overhead is unacceptable.

Static or Rarely Updated

Knowledge base changes <once/week. One-time setup, no daily cache invalidation. Stable reference docs.

Implementation Approaches

Context Stuffing

Simplest: Put all docs in system prompt or context. Claude 200K window easily fits 50-100 pages. Model uses in-context attention. No external retrieval.

KV Cache Caching

Pre-compute model's key-value cache for corpus. Anthropic API supports prompt caching. Only compute KV once, reuse for 100s of queries.

Prefix Caching

Cache common prefixes (docs, instructions) across requests. Saves API costs. Supported by the Anthropic (Claude) and OpenAI APIs.

Embedding Summary

Generate summaries of each doc, cache summaries. Query against summaries, then in-context search. Hybrid approach.

Code: CAG with Prompt Caching

# === Cache-Augmented Generation (Prompt Caching) ===
import time
from anthropic import Anthropic

client = Anthropic()

# 1. Load entire corpus
with open("knowledge_base.txt", "r") as f:
    corpus = f.read()
print(f"Corpus size: {len(corpus):,} chars (~{len(corpus) // 4:,} tokens)")

system_blocks = [
    {
        "type": "text",
        "text": "You are a helpful assistant with access to the following knowledge base:",
    },
    {
        "type": "text",
        "text": corpus,
        "cache_control": {"type": "ephemeral"},  # Enable prompt caching
    },
]

# 2. First request populates the cache
start = time.perf_counter()
response1 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system=system_blocks,
    messages=[{"role": "user", "content": "What is RAG?"}],
)
print(f"First query latency: {(time.perf_counter() - start) * 1000:.0f}ms")
print(f"Cache created size: {response1.usage.cache_creation_input_tokens}")

# 3. Subsequent requests reuse the cached corpus
for query in ["Explain embeddings", "What is retrieval?", "Tell me about vectors"]:
    start = time.perf_counter()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        system=system_blocks,
        messages=[{"role": "user", "content": query}],
    )
    print(f"{query}: {(time.perf_counter() - start) * 1000:.0f}ms, "
          f"cache_read: {response.usage.cache_read_input_tokens}")
# Subsequent queries are much faster; cached corpus tokens are reused, not re-processed

CAG vs Traditional RAG

Metric | Traditional RAG | CAG
Latency Per Query | 500ms-2s (retrieval + gen) | <100ms (gen only)
Setup Latency | None (on-demand) | 1-2s (first request, cache KV)
Corpus Size Limit | Unlimited (external retrieval) | <200K tokens (context window)
Cost Per Query | Retrieval (DB) + LLM tokens | LLM tokens (cached cheaper)
Knowledge Updates | Instant (next retrieval) | Requires cache invalidation
Scalability | Scales to GB+ of docs | Limited to context window

Use Cases

Customer Support

Cache product docs, FAQs, policies. Every agent query reuses cache. 10x faster than retrieval-based RAG. Lower API costs per conversation.

Internal QA Bots

Onboarding docs, internal policies, company handbook. Cache once, serve employees instantly. No external DB needed.

Real-Time Chat

Where latency is critical. Academic papers summary cache. Medical reference guides cache. Sub-100ms response times.

Mobile/Edge Apps

Local knowledge base cached in app. Offline-first architecture. Sync when online. No dependency on external retrieval service.

CAG Summary: Not all problems need traditional RAG. If your corpus is small (product docs, internal KB), use CAG with prompt/prefix caching. It's the simplest, fastest, and cheapest solution for corpora that fit in the context window. No external DB, no retrieval infrastructure, sub-100ms latency. For large, dynamic corpora (news, social media), traditional RAG is necessary. Choose based on corpus size and update frequency, not by default.
38 / Cache-Augmented Generation

Graph RAG — Knowledge Graph Enhanced Retrieval

Augment vector retrieval with structured knowledge graphs to enable multi-hop reasoning, entity-aware retrieval, traceable answers, and dramatically reduced hallucinations — especially in entity-rich domains like finance, healthcare, legal, and enterprise knowledge bases.

Pipeline: Documents (unstructured text) → LLM extraction → Knowledge Graph, alongside chunk embeddings → Vector Store. A user query (plus entity detection) drives hybrid retrieval (graph traversal + vector search, merged with Reciprocal Rank Fusion), which feeds the LLM generator and yields a grounded answer with traceable sources. Graph RAG combines structured reasoning (entities & relations) with semantic search (embeddings).

Why Graph RAG?

Baseline RAG Limitations

  • Can only reason within a single retrieved chunk
  • Fails on multi-hop questions ("Who is the CEO of the company that acquired X?")
  • No understanding of entity relationships
  • Hard to trace why a chunk was retrieved
  • Global summarization questions return fragmented answers

Graph RAG Advantages

  • Multi-hop reasoning: Traverse entity → relation → entity paths
  • Entity awareness: Disambiguate "Apple" (company vs fruit)
  • Traceable answers: Show the graph path that supports each claim
  • Reduced hallucination: Grounded in verified structured facts
  • Global queries: Community summaries answer "What are the main themes?"

Baseline RAG vs Graph RAG

Dimension | Baseline (Vector) RAG | Graph RAG
Retrieval | Semantic similarity (embedding cosine) | Semantic + structural (graph traversal + embeddings)
Reasoning | Single-hop (within chunk) | Multi-hop (across entity chains)
Explainability | Low — "matched chunk X" | High — "followed path A→B→C"
Global queries | Poor (fragmented across chunks) | Good (community summaries)
Entity resolution | None | Built-in (graph deduplication)
Hallucination rate | 10-25% | 3-10% (grounded in facts)
Setup cost | Low ($100s) | Medium-High ($1K-10K, 3-5x baseline)
Latency | 50-200ms | 100-500ms (graph + vector)
Maintenance | Re-embed on doc update | Re-extract entities + re-embed

Implementation Approaches

Microsoft GraphRAG

LLM-based entity/relation extraction → Leiden community detection → hierarchical summaries. Best for global queries and corpus-level understanding.

Open source Python

Cost: 3-5x baseline (LLM extraction)

Neo4j + LangChain

LLMGraphTransformer for entity extraction → Neo4j for storage/traversal → Cypher query generation → hybrid vector+graph retrieval.

Neo4j Cypher

Best for production enterprise deployments

LlamaIndex PropertyGraph

PropertyGraphIndex with auto-extraction. Supports Neo4j, Nebula, or in-memory graph store. Integrates with existing LlamaIndex pipelines.

LlamaIndex Flexible

Easiest integration if already using LlamaIndex

KG Construction Pipeline

1. Chunk Documents 2. Extract Entities 3. Extract Relations 4. Resolve & Dedupe 5. Community Detect 6. Summarize

Entity extraction: LLM-based (GPT-4o / Claude) or dependency-based (spaCy + custom rules — 10x cheaper, comparable quality). Relation extraction: Two-stage approach (KGGEN) — entities first, then relations — reduces error propagation. Community detection: Leiden algorithm creates hierarchical clusters for global summarization.
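
As a rough sketch of the cheaper dependency/NER-based path mentioned above: spaCy extracts typed entities and a naive co-occurrence relation, which downstream rules or a lighter LLM pass can refine. The entity labels kept and the RELATED_TO relation are illustrative placeholders, not a fixed schema.

# Minimal sketch: cheap entity/relation extraction with spaCy instead of an LLM.
# The co-occurrence "RELATED_TO" rule is a deliberately simple placeholder relation.
from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def extract_graph(chunks):
    nodes, edges = set(), set()
    for chunk in chunks:
        doc = nlp(chunk)
        # Keep entity types that map onto graph node labels
        ents = [(e.text, e.label_) for e in doc.ents
                if e.label_ in {"PERSON", "ORG", "PRODUCT", "GPE"}]
        nodes.update(ents)
        # Naive relation: entities mentioned in the same chunk are linked
        for (a, _), (b, _) in combinations(ents, 2):
            edges.add((a, "RELATED_TO", b))
    return nodes, edges

nodes, edges = extract_graph(["Meta acquired Instagram. Mark Zuckerberg founded Meta."])
print(nodes)  # e.g. ('Meta', 'ORG'), ('Mark Zuckerberg', 'PERSON'), ...
print(edges)  # co-occurrence edges, later typed or filtered by rules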

Implementation: Neo4j + LangChain

# === Graph RAG with Neo4j + LangChain ===
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Connect to Neo4j
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
)

# 2. Extract entities and relations from documents
llm = ChatOpenAI(model="gpt-4o", temperature=0)
transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Company", "Product", "Technology"],
    allowed_relationships=["WORKS_AT", "ACQUIRED", "USES", "FOUNDED"],
)

# 3. Chunk and transform
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
graph_docs = transformer.convert_to_graph_documents(chunks)

# 4. Store in Neo4j
graph.add_graph_documents(graph_docs, baseEntityLabel=True)
print(f"Nodes: {len(graph_docs[0].nodes)}, Rels: {len(graph_docs[0].relationships)}")
# === Hybrid Retrieval: Graph + Vector ===
from langchain_community.vectorstores import Neo4jVector
from langchain.chains import GraphCypherQAChain

# Vector index on chunk embeddings stored in Neo4j
vector_store = Neo4jVector.from_existing_graph(
    embedding=embeddings,
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding",
)

# Graph Cypher chain for structured queries
cypher_chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # needed for Cypher generation
)

# Hybrid retrieval function
def hybrid_graph_rag(query: str):
    # 1. Vector retrieval (semantic)
    vector_results = vector_store.similarity_search(query, k=5)
    # 2. Graph retrieval (structured)
    graph_result = cypher_chain.invoke({"query": query})
    # 3. Fuse both result sets into a single context block
    context = f"""Graph facts: {graph_result['result']}

Retrieved passages:
{chr(10).join([d.page_content for d in vector_results])}"""
    # 4. Generate with the fused context
    answer = llm.invoke(
        f"Based on the following context, answer: {query}\n\n{context}"
    )
    return answer

# Multi-hop query that baseline RAG fails on
result = hybrid_graph_rag("Who founded the company that acquired Instagram?")
# Graph path: Instagram -[ACQUIRED_BY]-> Meta -[FOUNDED_BY]-> Mark Zuckerberg

Production Recommendations

Use Graph RAG When

  • Entity-rich domains: Finance (companies, people, transactions), healthcare (drugs, conditions, treatments), legal (cases, entities, rulings)
  • Multi-hop questions are common: "What drugs interact with medications prescribed to patients with condition X?"
  • Explainability required: Regulated industries need traceable reasoning paths
  • Global/thematic queries: "What are the main themes across all documents?"
  • Entity disambiguation matters: Same name = different entities across documents

Stick with Baseline RAG When

  • Simple factual QA: Single-hop lookups within documents
  • Budget-constrained: KG extraction costs 3-5x more than baseline
  • Rapidly changing corpus: KG maintenance overhead is significant
  • Small document set: <100 docs — graph overhead not justified
  • Latency-critical: Graph traversal adds 50-300ms per query

Cost Comparison

Component | Baseline RAG | Graph RAG | Delta
Indexing (10K docs) | $5-15 (embeddings) | $50-200 (LLM extraction + embeddings) | 3-15x more
Storage | $10-30/mo (vector DB) | $50-150/mo (Neo4j + vector DB) | 3-5x more
Query latency | 50-200ms | 100-500ms | 2-3x slower
Per-query cost | $0.001-0.005 | $0.002-0.01 | 2x more
Answer quality (multi-hop) | 40-60% accuracy | 75-90% accuracy | +30-50% better
Hallucination rate | 10-25% | 3-10% | 50-70% less

Tools & Libraries

Graph Databases

  • Neo4j — Industry standard; Cypher query language
  • Amazon Neptune — Managed; good for AWS stacks
  • NebulaGraph — Open source; scales to billions of edges
  • FalkorDB — Redis-based; ultra-low latency

KG Construction

  • LLMGraphTransformer — LangChain; LLM-based
  • microsoft/graphrag — Full pipeline; community detection
  • spaCy + custom — Dependency-based; 10x cheaper
  • Diffbot NLU — API-based entity linking

Frameworks

  • LangChain — GraphCypherQAChain, Neo4jVector
  • LlamaIndex — PropertyGraphIndex, KnowledgeGraphIndex
  • RAGatouille — ColBERT + graph integration
  • Haystack — Knowledge graph retriever component
Production Tip: Start with baseline vector RAG. Add Graph RAG incrementally — extract entities for your top 20% highest-value documents first. Use dependency-based extraction (spaCy) instead of LLM-based to cut indexing costs 10x. Monitor multi-hop query accuracy: if it improves >15%, expand graph coverage. Use hybrid retrieval (RRF fusion) to combine graph and vector results rather than replacing vector search entirely.
39 / Graph RAG

Vectorless RAG — Retrieval Without Embeddings

Vectorless RAG approaches bypass traditional embedding-based retrieval entirely, using techniques like BM25, structured SQL queries, LLM-native context stuffing, or direct API calls to retrieve relevant information — eliminating the need for vector databases, embedding models, and index maintenance.

Pipeline: User query (natural language) → LLM router selects a retrieval strategy: BM25/full-text (keyword + TF-IDF scoring), SQL/structured query (text-to-SQL via LLM), context stuffing (long-context LLM window), or API/tool calls (direct data source access) → LLM generator produces a grounded answer, no vectors needed. Vectorless RAG retrieves context through keyword search, SQL, long-context windows, or direct API calls.

Vectorless Retrieval Approaches

BM25 / Full-Text Search

Classic keyword-based retrieval using term frequency and inverse document frequency (TF-IDF). Works through Elasticsearch, OpenSearch, PostgreSQL full-text, or SQLite FTS5. Excels at exact-match queries, domain-specific terminology, and code search where semantic similarity fails.

Elasticsearch PostgreSQL Zero ML cost

Text-to-SQL

LLM translates natural language questions into SQL queries against structured databases. Ideal for analytics, reporting, and questions with precise filters (dates, ranges, aggregations). Leverages existing relational data without any embedding pipeline.

SQL databases Exact answers Aggregations

Long-Context Stuffing

With models supporting 128K-1M+ token windows (GPT-4o, Claude, Gemini), feed entire document collections directly into the prompt. Eliminates retrieval entirely for small-to-medium corpora. The LLM itself acts as the retriever and reasoner simultaneously.

128K-1M tokens Zero infra Simple

Agentic Tool Use / API Calls

LLM agents call external APIs, search engines, or tools (web search, code interpreters, database connectors) to retrieve information on demand. Each query dynamically selects the right data source. No pre-built index required — retrieval is just-in-time.

Function calling Dynamic Multi-source

Vector RAG vs Vectorless Approaches

Dimension | Vector RAG | BM25 | Context Stuffing | Text-to-SQL
Setup complexity | Medium (embeddings + vector DB) | Low (search index) | None | Low (schema + prompt)
Semantic understanding | High | None (keyword match) | High (LLM-native) | Structured only
Exact match / filters | Poor | Excellent | Good | Excellent
Corpus size limit | Millions of docs | Millions of docs | ~500 pages (1M tokens) | Unlimited (DB)
Latency | 50-200ms | 5-50ms | Slow (large prompt) | 50-500ms
Cost per query | $0.001-0.005 | $0.0001 | $0.01-0.10 (token cost) | $0.001-0.01
Infra required | Vector DB + embedding API | Search engine | LLM API only | SQL database
Best for | Semantic similarity | Keyword, code, exact terms | Small corpora, prototyping | Structured data, analytics

When to Go Vectorless

Vectorless Works Well When

  • Small corpus (<500 pages): Context stuffing is simpler and often more accurate than chunking + retrieval
  • Structured data: SQL databases with well-defined schemas — Text-to-SQL beats embedding-based retrieval
  • Exact-match queries: Technical terms, product codes, error messages — BM25 outperforms semantic search
  • Rapid prototyping: Skip the vector pipeline entirely — just stuff context and iterate
  • Real-time data: API/tool calls fetch live data that can't be pre-indexed
  • Budget-constrained: No embedding model costs, no vector DB hosting

Vectors Still Better When

  • Large corpus (>10K docs): Context stuffing is infeasible; BM25 misses semantic matches
  • Semantic similarity matters: "How do I fix a slow API?" matching "performance optimization for endpoints"
  • Multilingual: Embedding models handle cross-language retrieval natively
  • Fuzzy/conceptual queries: Questions that don't contain the exact keywords present in documents
  • Cost at scale: Context stuffing becomes very expensive with large token windows

Implementation: BM25 with Rank-BM25

# === Vectorless RAG: BM25 Full-Text Retrieval ===
import nltk
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi

nltk.download("punkt")  # tokenizer data, needed once

# 1. Prepare corpus
documents = [doc.page_content for doc in loaded_docs]
tokenized_corpus = [word_tokenize(doc.lower()) for doc in documents]

# 2. Build BM25 index (no embeddings needed!)
bm25 = BM25Okapi(tokenized_corpus)

# 3. Retrieve
def bm25_retrieve(query: str, k: int = 5):
    tokenized_query = word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(documents[i], scores[i]) for i in top_k]

# 4. Generate answer
query = "How to configure rate limiting?"
results = bm25_retrieve(query)
context = "\n\n".join([doc for doc, score in results])
answer = llm.invoke(f"Answer based on context:\n{context}\n\nQuestion: {query}")
# === Vectorless RAG: Long-Context Stuffing ===
from pathlib import Path

# 1. Load all documents into a single context
all_docs = []
for f in Path("./docs").glob("*.md"):
    all_docs.append(f"\n--- {f.name} ---\n{f.read_text()}")
full_context = "\n".join(all_docs)
print(f"Total chars: {len(full_context):,}")  # Check it fits in the context window

# 2. Stuff everything into the prompt — no retrieval step!
query = "How do I configure rate limiting?"  # example question
response = llm.invoke(
    f"""You are a helpful assistant. Use the following documents to answer.

Documents:
{full_context}

Question: {query}

Answer concisely, citing the document name."""
)
# Works great for <500 pages with 128K+ context models
# Trade-off: higher token cost but zero retrieval infrastructure
# === Vectorless RAG: Text-to-SQL ===
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain

# 1. Connect to your database
db = SQLDatabase.from_uri("sqlite:///products.db")
print(db.get_usable_table_names())  # ['products', 'reviews', 'orders']

# 2. Create text-to-SQL chain
chain = create_sql_query_chain(llm, db)

# 3. Natural language → SQL → Answer
query = "What are the top 5 products by average rating with more than 100 reviews?"
sql_query = chain.invoke({"question": query})
print(f"Generated SQL: {sql_query}")

result = db.run(sql_query)
answer = llm.invoke(f"Given SQL result: {result}\nAnswer: {query}")
# Precise, aggregated answers impossible with vector retrieval

Hybrid: Best of Both Worlds

The most effective production systems combine vectorless and vector approaches:

1. Query Router 2. BM25 (keywords) + Vector (semantic) 3. RRF Fusion 4. Rerank & Generate

Reciprocal Rank Fusion (RRF) merges BM25 and vector results: score = Σ 1/(k + rank_i). This captures both exact keyword matches and semantic similarity. Many vector databases (Elasticsearch, Weaviate, Qdrant) support hybrid search natively. Adding BM25 to vector search typically improves recall by 10-20% with near-zero additional latency.

Tools & Libraries

BM25 / Full-Text

  • rank-bm25 — Pure Python; great for prototyping
  • Elasticsearch — Production-grade; built-in BM25
  • PostgreSQL FTS — Built into Postgres; zero new infra
  • SQLite FTS5 — Embedded; perfect for small apps

Text-to-SQL

  • LangChain SQL — create_sql_query_chain
  • LlamaIndex NLSQL — NLSQLTableQueryEngine
  • Vanna.ai — OSS text-to-SQL with training
  • DuckDB — In-process analytics + LLM pairing

Hybrid Search

  • Weaviate — Native hybrid (BM25 + vector)
  • Qdrant — Sparse + dense vector fusion
  • Elasticsearch 8+ — kNN + BM25 in one query
  • Vespa — Advanced ranking with hybrid retrieval
Production Tip: Start vectorless. For prototyping, stuff your documents into a long-context model and measure answer quality. If the corpus fits in the context window and accuracy is acceptable, you may never need vectors. For production, add BM25 as your first retrieval layer — it's fast, cheap, and handles exact matches that embeddings miss. Only add vector search when you need semantic matching that BM25 can't provide. The best systems use hybrid retrieval (BM25 + vectors) fused with RRF scoring, giving you the best of both worlds.
40 / Vectorless RAG

Distillation Overview — The Teacher-Student Paradigm

Knowledge distillation transfers the dark knowledge from large teacher models (GPT-4o, Claude Opus) into smaller, faster student models. The student learns not just to match labels, but to mimic the teacher's probability distributions, enabling 10-100x cost reduction with 85-95% quality retention. Essential for production RAG systems serving millions of requests.

The teacher (GPT-4o / Llama 405B: high quality, expensive) produces soft labels (softened with temperature T) from which the student (Phi-3 / Llama 8B: fast, small, cheap) learns via the distillation loss L = α·L_KL(T, S) + (1-α)·L_CE(Y, S), i.e. KL divergence on soft targets plus cross-entropy on hard labels, backpropagated through the student. Evolution of distillation: 2015 Hinton KD → 2019 DistilBERT → 2021 TinyBERT → 2023 Alpaca/Vicuna → 2024 DeepSeek-R1-Distill.

Core Distillation Concepts

What is Knowledge Distillation?

A training technique where a large teacher model teaches a smaller student model to approximate its behavior. The student learns from soft probability distributions (soft labels) rather than just hard ground-truth labels, capturing the teacher's confidence and uncertainty patterns—the "dark knowledge."

Why Distill for Production RAG?

Cost: 20-100x cheaper inference. Latency: 10-50x faster. Privacy: Run locally without API calls. Edge Deployment: Fits on mobile/edge devices. Reliability: No rate limits or service dependencies.

Key Terminology

  • Teacher: Large, high-quality model that teaches
  • Student: Smaller model that learns
  • Soft Labels: Teacher's probability distributions (softmax with temperature)
  • Hard Labels: Ground truth class labels
  • Temperature (T): Controls softness of the probability distribution (higher T = softer distribution, smoother gradients for the student)
  • Dark Knowledge: Teacher's learned correlations between outputs beyond ground truth

Quality Retention Mechanics

Typical results: Embedding models retain 90-96% quality at 15-50x compression. Rerankers retain 94-97% at 3-10x compression. Generation models retain 85-92% at 10-30x compression. Quality loss is primarily in nuanced reasoning and rare edge cases; core competencies remain strong.

Production Tip: Distillation is not a one-shot process. Start by measuring your teacher model's performance on your specific RAG tasks (embedding quality, reranking accuracy, generation BLEU/ROUGE). Use this as your quality baseline. Then distill incrementally—start with the smallest student and measure how much quality you lose. Only scale up the student architecture if you hit quality thresholds you cannot accept. This binary search approach saves weeks of training time.
41 / Distillation Overview

Distillation Techniques — Seven Methods Explained

Different distillation techniques target different components of the teacher's knowledge. Response-based distillation matches final outputs; feature-based captures intermediate representations; relation-based preserves data point relationships; synthetic data generation scales to new domains. Choosing the right technique depends on your architecture, available teacher access, and quality targets.

1. Logit/Response-Based Distillation

Student learns the teacher's final output probability distributions (logits) using soft label matching with temperature scaling. The KL divergence loss makes gradients smoother, allowing the student to learn from the teacher's confidence patterns.

Formula: L = α·KL(softmax(T/τ), softmax(S/τ)) + (1-α)·CE(Y, S)

Classic KD Temperature ~4 KL divergence

Best for: BERT, RoBERTa, embeddings. Speed: ~20% training overhead.
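
A minimal PyTorch sketch of the loss above, assuming generic logit and label tensors; the α and T defaults mirror the values quoted elsewhere in this deck.

# Minimal sketch of the logit-distillation loss in PyTorch.
# Tensor names (student_logits, teacher_logits, labels) are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T*T factor keeps soft-loss gradients on the same scale as the hard loss.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example shapes: batch of 8, 1000 classes
loss = distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000),
                         torch.randint(0, 1000, (8,)))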

2. Feature/Intermediate Distillation

Student mimics teacher's intermediate hidden states and attention maps, not just final outputs. Matches layer activations via mean-squared error loss. Essential for encoder models where intermediate representations matter.

Used by TinyBERT, MobileBERT. Loss: L = Σ ||H_student - H_teacher||²

Hidden states Attention maps Layer matching

Best for: BERT-family, rerankers. Quality: 90%+ retention even at 10x compression.

3. Relation-Based Distillation

Preserves relationships between data points rather than individual predictions. Contrastive distillation for embeddings: student embeddings maintain the same relative distances and similarities as teacher embeddings. Critical for semantic search.

Loss: match pairwise relationships, i.e. sim(e_student_i, e_student_j) ≈ sim(e_teacher_i, e_teacher_j) over in-batch pairs

Contrastive Embeddings Triplet loss

Best for: E5, BGE embeddings. Benefit: Preserves ranking structure.
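
A minimal PyTorch sketch of the relation-matching idea, assuming in-batch pairwise cosine similarities as the preserved relation; tensor names and dimensions are illustrative.

# Minimal sketch of relation-based distillation for embeddings: the student is trained
# so that its in-batch pairwise similarity matrix matches the teacher's.
import torch
import torch.nn.functional as F

def relation_distill_loss(student_emb, teacher_emb):
    # Normalize, then compare batch-wise cosine-similarity matrices
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_student = s @ s.T   # (batch, batch) relations among student embeddings
    sim_teacher = t @ t.T   # the same relations under the teacher
    return F.mse_loss(sim_student, sim_teacher)

# Example: teacher dim 1024, student dim 384 — dimensions can differ because only
# the relational structure (similarities), not the raw vectors, is matched.
loss = relation_distill_loss(torch.randn(16, 384), torch.randn(16, 1024))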

4. Synthetic Data Distillation

Teacher generates training data (Q&A pairs, reasoning chains, labeled examples) that student fine-tunes on. Does not require access to teacher weights—API-based. Most practical for LLM distillation. Examples: Alpaca (from Davinci), Orca, Vicuna.

Process: Generate 5K-10K examples → filter for quality → fine-tune student on synthetic data

Data generation API-based At scale

Best for: Generation models, RAG readers. Cost: ~$50-500 in teacher API calls for a 5K-10K example dataset.

5. Progressive/Multi-Stage Distillation

Distill through intermediate-size models in stages: GPT-4 → Llama 13B → Phi 3.8B → TinyBERT 14M. Each stage acts as both student and teacher. Enables extreme compression (1000x) with graceful quality degradation.

Why: Knowledge at each stage is closer to student's architecture, easier to learn.

Multi-stage Chain Extreme compression

Best for: Mobile/edge, extreme latency constraints. Trade-off: More training stages but better final quality.

6. Self-Distillation

Model distills from itself: larger layers teach smaller layers (Born-Again Networks), or early-exit heads teach final heads. Used for progressive inference and efficient early stopping. Requires no external teacher.

Variant: Ensemble of differently-sized versions of the same architecture.

Internal No teacher Early exit

Best for: Improving single models, progressive inference. Benefit: 2-5% quality boost at same size.

7. Domain Adaptation Distillation

Teacher fine-tuned on domain (biomedical, legal, code) teaches student. Combines in-domain expert knowledge with compact student architecture. Teacher learns domain patterns; student compresses domain knowledge into fewer parameters.

Process: Domain-FT teacher → generate domain synthetic data → student learns domain + generalization

Domain-aware Expert teacher Fine-tuned

Best for: Specialized domains (biotech, legal, code). Result: Small domain-expert models.

Distillation Techniques Comparison

Technique | Teacher Weights? | Architecture | Quality Retention | Training Time | Best For
Logit Distillation | Yes (inference) | Same/different | 90-97% | +20% | Classifiers, embeddings
Feature Distillation | Yes (full) | Encoder-only | 92-98% | +40% | BERT models, rerankers
Relation Distillation | Yes (inference) | Same/different | 94-97% | +30% | Embeddings, ranking
Synthetic Data | No (API only) | Any decoder | 85-92% | 1-10 days | LLM generation, RAG
Progressive | Yes (multi-stage) | Any | 88-95% | 2-4 weeks | Extreme compression
Self-Distillation | No (internal) | Same (variants) | 102-105% | +10% | Model improvement
Domain Adaptation | Yes (domain FT) | Domain-expert | 87-94% | 2-5 days | Specialized domains
Production Tip: Start with logit distillation for encoders (DistilBERT template) and synthetic data distillation for LLMs (Alpaca template). These two techniques handle 80% of production RAG use cases. Use feature distillation if you need 94%+ quality retention on classifiers. Progressive distillation only if you're targeting mobile/edge with extreme constraints. Domain adaptation distillation is worth it only if your domain has specialized terminology (biomedical, legal, code) that generic teachers struggle with.
42 / Distillation Techniques Deep Dive

Distillable Models for RAG — The Complete Catalog

The RAG pipeline has four critical components, each with a specialized set of distilled models. Embedding models for retrieval, rerankers for ranking, generation models for answering, and routers for intent classification. This section catalogs production-ready models for each stage with their distillation lineage, performance characteristics, and deployment costs.

Pipeline: Query (user question) → Embedder (E5, BGE, GTE; 33-335M params) → Retriever (vector DB / BM25; top-K docs) → Reranker (BGE, ms-marco; up to 568M params) → Generator (Phi-3, Llama 8B; 2-8B params), with a Router (DistilBERT, TinyBERT; 14-66M params) classifying and routing queries early in the pipeline.

RAG Component Models

Embedding Models (Bi-Encoder)

Dense vector representations for semantic retrieval. Distilled from larger encoder models to 33-335M parameters. Deployed at scale for every document query.

  • E5-small/base/large — 33M/110M/335M params; MTEB top-tier; distilled from Mistral-7B
  • BGE-small/base/large — 33M/110M/1.1B params; BAAI; multilingual; contrastive learning
  • GTE-Qwen2-1.5B-instruct — 1.5B params; strong instruction-following; instruction-tuned embeddings
  • Nomic Embed 1.5 384 — 137M params; 8192 context; Matryoshka dimensions
  • all-MiniLM-L6-v2 — 22M params; fastest; SBERT distillation
  • Alibaba Gte-base — 110M params; multilingual; strong on code/technical

Deployment: $0.05-0.20/M queries at scale
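
For orientation, a short sentence-transformers sketch showing how these bi-encoders are used (and swapped); the corpus and query strings are made up.

# Minimal sketch: swapping distilled bi-encoders is a one-line change with
# sentence-transformers. Corpus and query strings are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 22M params
# model = SentenceTransformer("intfloat/e5-base-v2")  # 110M params; note: E5 models
# expect "query: " / "passage: " prefixes for best results

corpus = ["Reset your password from the account settings page.",
          "Rate limits are configured per API key in the dashboard."]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(["how do I change my password"],
                         convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=1)
print(hits[0])  # [{'corpus_id': 0, 'score': ...}]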

Reranker Models (Cross-Encoder)

Score query-document pairs for relevance. Compact cross-encoders (568M-1B params). Applied to top-K from retriever for precision ranking.

  • BGE-reranker-v2-m3 — 568M params; multilingual; distilled from large cross-encoder
  • ms-marco-MiniLM-L-12 — 33M params; ultra-compact; MS MARCO trained
  • Jina Reranker v2 — 137M params; code + text; Jina-1.5-large distillation
  • ColBERTv2-hnswlib — Late interaction; sub-ms latency; token-level matching
  • Cohere Rerank v3 — API-based; production-grade; handles 20 languages
  • mxbai-rerank-xsmall-v1 — 66M params; ultra-light; Mistral base

Deployment: Applied to top-50 docs; $0.10-0.30/M queries
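
A brief sketch of cross-encoder reranking with one of the compact models above; the candidate passages are illustrative.

# Minimal sketch: rescoring retriever output with a compact cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")  # ~33M params

query = "how to rotate API keys"
candidates = [
    "API keys can be rotated from the security settings page.",
    "Our office is closed on public holidays.",
    "Key rotation invalidates the old credential immediately.",
]
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep the top-scoring candidates for the generation context
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")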

Generation (Reader) Models

Small LLMs for grounded answer generation. 2-8B parameters, trained on domain/RAG-specific data. Distilled from frontier models (GPT-4o, Claude, Llama 405B).

  • Phi-3-mini (3.8B) — Microsoft; curated textbook data; strong reasoning; 4K context
  • Llama 3.1 8B — Meta; instruction-tuned; 128K context; Apache 2.0 license
  • Mistral 7B / Nemo 7B — Sliding window; 32K context; 60% faster inference
  • Gemma 2 2B/9B — Google; distilled from Gemini; excellent on factual QA
  • Qwen2.5 7B — Alibaba; 128K context; multilingual; strong on code
  • DeepSeek-R1-Distill 7B — Reasoning capability; chain-of-thought; 16K context

Deployment: $0.20-0.50/M tokens at scale

Router / Classifier Models

Tiny models for query routing, intent classification, content moderation. 14-66M parameters. Applied early in pipeline to route or filter.

  • DistilBERT-base — 66M params; 60% faster than BERT; 97% performance retention
  • TinyBERT-6L-768H — 14.5M params; 7.5x faster; distilled 4-layer
  • MobileBERT — 25M params; mobile-optimized; real-time classification
  • DeBERTa-v3-small — 44M params; NLI + classification; superior to DistilBERT
  • ALBERT-base-v2 — 12M params; parameter sharing; cross-layer distillation
  • Sentence-BERT-tiny — 14M params; semantic classification; STS benchmark trained

Deployment: <1ms per request; $0.01/M queries

Speculative Decoding Draft Models

Tiny models that propose tokens quickly; larger model verifies. Enables 2-3x generation speedup. Draft model distilled from main generator.

  • Phi-3-mini as draft for Llama 70B — 3.8B proposes; 70B verifies; 2.5x speedup
  • Gemma 2 2B as draft for 9B — Same family; better latency savings
  • Draft-only models (research) — Models trained specifically to be draft models

Use case: High-throughput RAG backends; lower inference cost 30-40%
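
A hedged sketch of the draft-and-verify pattern using Hugging Face assisted generation (the `assistant_model` argument to `generate`), which implements the same idea; the same-family Gemma 2 pairing follows the bullet above, and the model IDs and prompt are assumptions.

# Minimal sketch: draft-and-verify generation via Hugging Face assisted generation.
# The small model proposes several tokens per step; the large model verifies them in
# one forward pass, so output quality matches the large model alone.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
target = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it",
                                              torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it",
                                             torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Summarize the retrieved context:", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))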

Mixture-of-Experts (MoE) Distillation

Distill sparse MoE models (Mixtral, GLaM) into dense models. Teacher has 46B params but uses only 12B per token; student is fully dense 7-8B.

  • Mixtral 8x7B → Mistral 7B — Route expert knowledge into dense model
  • Mixtral 8x22B → Llama 13B — Compress expert routing to dense layers
  • Approach: Teacher routes on examples → student learns all routes as single dense model

Benefit: No expert overhead; simpler deployment; better VRAM efficiency

Model Selection Flowchart

What component do you need?

  • Embedding/Retrieval: corpus <100K docs → E5-small; corpus 100K-10M → E5-base; budget critical → all-MiniLM
  • Reranking: quality critical → BGE-m3; ultra-light (<100ms) → ms-marco; multilingual → BGE-m3 / Jina
  • Generation/Reader: fast/cheap → Phi-3 (3.8B); best quality → Llama 8B; reasoning needed → DeepSeek-R1

Quick checks: Embedding latency <100ms? Yes: small model; No: larger. Budget per query <$0.01? Yes: TinyBERT; No: BGE. Domain-specific knowledge? Yes: fine-tune first; No: out-of-the-box.

Recommended Combinations

  • Startup: all-MiniLM + ms-marco + Phi-3 = $2/1M queries
  • Production: E5-base + BGE-m3 + Llama 8B = $8/1M queries
  • Quality-First: GTE-Qwen2 + BGE-m3 + Llama 8B = $12/1M queries

Distilled Models for RAG — Full Comparison

Model | Component | Params | Context | Quality | Cost/1M | Latency
Embedding Models
all-MiniLM-L6-v2 | Retrieval | 22M | 512 | ~85% | $0.02 | 2ms
E5-small | Retrieval | 33M | 512 | ~90% | $0.04 | 5ms
E5-base | Retrieval | 110M | 512 | ~95% | $0.08 | 12ms
BGE-base | Retrieval | 110M | 512 | ~93% | $0.07 | 11ms
Nomic Embed 1.5 | Retrieval | 137M | 8192 | ~94% | $0.10 | 18ms
Reranker Models
ms-marco-MiniLM-L-12 | Reranking | 33M | 512 | ~91% | $0.02 | 3ms/pair
BGE-reranker-v2-m3 | Reranking | 568M | 512 | ~96% | $0.08 | 8ms/pair
Jina Reranker v2 | Reranking | 137M | 8192 | ~94% | $0.05 | 6ms/pair
Generation Models
Phi-3-mini | Generation | 3.8B | 4096 | ~88% | $0.20 | 50ms/token
Gemma 2 2B | Generation | 2B | 8192 | ~85% | $0.15 | 35ms/token
Llama 3.1 8B | Generation | 8B | 128K | ~92% | $0.35 | 80ms/token
Mistral 7B | Generation | 7B | 32K | ~90% | $0.30 | 60ms/token
DeepSeek-R1-Distill 8B | Generation | 8B | 16K | ~88% (reasoning) | $0.40 | 120ms/token
Router/Classifier Models
DistilBERT | Classification | 66M | 512 | ~97% | $0.01 | 1ms
TinyBERT | Classification | 14.5M | 512 | ~92% | $0.005 | 0.5ms
Production Tip: Start with the SMALLEST model in each component and measure quality on your specific corpus. Embedding quality varies dramatically with domain (code embeddings need code-trained models; biomedical embeddings need biomedical training). Rerankers are the highest ROI—a good reranker can salvage retrieval mistakes from cheaper embedders. Use E5-small + ms-marco + Phi-3-mini as your baseline (total $2-3 per million queries). Only upgrade if you hit precision/recall targets that require it. Speculative decoding with Phi-3-mini as draft for Llama 8B can cut generation costs 30-40% without quality loss.
43 / Distillable Models for RAG

Quantization & Compression — Post-Distillation Optimization

Distillation reduces model size 10-50x. Quantization (4-bit, 2-bit), pruning, and low-rank factorization reduce it another 2-8x. Combined effects are multiplicative: a 405B model distilled to 8B (50x) then quantized to 2-bit (8x) becomes equivalent to a 1.6B full-precision model—nearly 250x reduction with 85-90% quality retention. This section covers every compression technique for production RAG.

Compression path: Full model (GPT-4o-class, 405B params, 1.6TB FP32) → Distill (50x smaller) → Llama 8B (8B params, 32GB FP32, $0.35/M tokens) → Quantize (4-bit GPTQ) → Llama 8B-Q4 (4-8GB VRAM, 85-90% quality) → Prune (30% sparsity) → Llama 8B-Q4-Pruned (5.6B effective params, 2-4GB VRAM, 80-85% quality). Net effect: 405B → 8B → ~4GB, roughly 250x smaller.

Quantization Methods

GPTQ (4-bit)

Post-training quantization: 32-bit weights → 4-bit integers. Quantizes one layer at a time, using Hessian information to minimize loss. No retraining needed. Fast inference with vLLM.

  • 8x model size reduction (32GB → 4GB)
  • Quality retention: 97-99%
  • Latency: 20-30% faster than FP32
  • Training time: 30 min - 2 hours per model

Best for: Production inference on consumer GPUs

AWQ (Activation-Aware)

Like GPTQ but considers activation patterns. Moves quantization errors to less important weights based on actual data distributions. Better quality at extreme compression.

  • 8x model size reduction (32GB → 4GB)
  • Quality retention: 98-99%
  • Latency: 15-25% faster than FP32
  • Training time: 1-4 hours per model

Best for: Max quality at 4-bit; preferred for generation models

GGUF (llama.cpp)

Quantization format for CPU inference. Multiple quantization levels (Q2, Q3, Q4, Q5, Q8). Minimal dependencies; runs on CPU without GPU. Popular for local/edge deployment.

  • 2-8x reduction depending on level
  • Quality: Q4 = 95-98%, Q2 = 85-90%
  • Latency: 50-300ms/token on CPU
  • No GPU required; runs anywhere

Best for: Local inference, privacy-critical apps, edge devices

BitsAndBytes / QLoRA

Load 4-bit model, add small LoRA adapters. Training-friendly. Model stored in 4-bit; adapters in float32 for gradient computation. Great for fine-tuning distilled models.

  • 8x reduction + memory-efficient training
  • Quality: 98%+ (no inference-time loss)
  • Fine-tune 70B on single 40GB GPU
  • Adapters portable; base model quantized

Best for: Fine-tuning distilled models at scale
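
A minimal sketch of loading a student model in 4-bit with bitsandbytes via transformers; the model ID is an assumption, and LoRA/PEFT adapters would be attached on top for QLoRA fine-tuning.

# Minimal sketch: load a distilled student in 4-bit with bitsandbytes so it can be
# served, or further fine-tuned with LoRA adapters, on a single consumer GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# The 4-bit base fits in a fraction of the FP32 footprint; attach PEFT/LoRA adapters
# on top of it to fine-tune without materializing full-precision weights.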

Structured Pruning

Remove entire attention heads or feed-forward neurons. Maintains model architecture; reduces FLOPs. Combines well with quantization for 2-4x additional speedup.

  • 2-4x latency reduction (removes FLOPs)
  • Quality retention: 92-96%
  • Works with standard inference frameworks
  • Usually done during fine-tuning or distillation

Best for: Latency-critical systems; combines with quantization

SparseGPT & Magnitude Pruning

Remove 20-50% of weights (unstructured). Requires sparse inference libraries for speedup. SparseGPT uses Hessian-aware pruning for minimal quality loss at high sparsity.

  • Up to 2-3x reduction (not all hardware supports)
  • Quality at 50% sparsity: 92-96%
  • Requires sparse-aware inference (e.g., Mochi)
  • Combined effect with quantization: 4-6x

Best for: Custom hardware; extreme compression research

Compression Methods Comparison

Method | Size Reduction | Speed Boost | Quality Loss | GPU Required? | Training Time | Best Use
GPTQ 4-bit | 8x | 1.2-1.3x | 1-3% | Yes (calibration) | 30min-2hr | Production inference
AWQ 4-bit | 8x | 1.15-1.25x | 1-2% | Yes (calibration) | 1-4hr | Quality-critical generation
GGUF Q4 | 8x | 0.2-0.5x (CPU) | 2-5% | No (inference) | 5-30min | Local/edge deployment
BitsAndBytes 4-bit | 8x | 1.1x | 0% (lossless) | Yes (inference + training) | 0min (inference) | Fine-tuning + inference
Structured Pruning | 2-4x | 2-4x | 4-8% | Yes (training) | 1-3 days | Latency-critical
Magnitude Pruning | 2-5x | 1-2x (sparse HW) | 4-10% | Maybe (sparse HW) | 1 hour-1 day | Custom hardware
Distill + Q4 + Prune | 50x × 8x × 3x = 1200x | 100x overall | 10-15% | Yes | 1-2 weeks | Ultimate compression

Code Example: Quantize a Distilled Model with AutoGPTQ

# Quantize a distilled Llama 8B to 4-bit GPTQ with AutoGPTQ
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Model to quantize (your distilled Llama 3.1 8B checkpoint)
model_name = "meta-llama/Meta-Llama-3.1-8B"

# Quantization config: 4-bit, group size 128, symmetric
quantize_config = BaseQuantizeConfig(
    bits=4,           # 4-bit quantization
    group_size=128,   # Weight grouping
    desc_act=False,   # Don't sort by activation
    sym=True,         # Symmetric quantization
)

# Load the model, then quantize with a small calibration set
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=quantize_config,
)
# calibration_examples: a few hundred tokenized samples from your corpus (placeholder);
# quantization takes roughly 30min-2hr on a single GPU
model.quantize(calibration_examples)

# Save quantized weights (~4GB instead of 32GB)
model.save_quantized("./llama-8b-gptq-4bit")

# Load and use in production with vLLM
# vLLM auto-detects GPTQ format and uses optimized kernels
from vllm import LLM
llm = LLM(model="./llama-8b-gptq-4bit", quantization="gptq")

# Result: 32GB → 4GB storage, 20-30% faster inference
# Cost reduction: $0.35/M tokens → $0.15/M tokens

Cumulative Compression: Pipeline Savings

Stage | Example Model | Size | Cumulative Reduction | Cost/M Tokens
1. Original | GPT-4o | 405B params, 1.6TB | 1x | $15.00
2. Distillation | Llama 8B | 8B params, 32GB | 50x | $0.35
3. + Quantization (4-bit) | Llama 8B-GPTQ | 8B params, 4GB | 50x × 8x = 400x | $0.15
4. + Pruning (30%) | Llama 5.6B-Q4-Pruned | 5.6B eff. params, 2.8GB | 400x × 3x = 1200x | $0.08
5. Quality retention: ~85-88% of original GPT-4o quality on RAG tasks, 98x cheaper.
Production Tip: Always quantize distilled generation models to GPTQ 4-bit or AWQ. For embeddings and rerankers, 8-bit quantization is sufficient (2x size reduction, no quality loss). For maximum compression on latency-critical routers, use GGUF Q2 or Q3 on CPU. The multiplicative effect of distillation (10-50x) + quantization (4-8x) + pruning (2-3x) enables running production RAG systems at <1ms latency on consumer hardware. Test quality extensively—typical loss is 2-3% in closed-domain RAG but can reach 5-10% on open-domain reasoning tasks.
44 / Quantization & Compression

Domain-Specific Distillation — Specialized Models for Specialized Domains

Generic distilled models work well for most tasks, but specialized domains (biomedical, legal, code, finance) have unique terminology, conventions, and reasoning patterns. Domain-specific distillation fine-tunes the teacher on domain data first, then distills into a compact student. The result: small, specialized models that understand nuanced domain knowledge without the cost of frontier APIs.

Healthcare & Biomedical

BioMistral & PubMed Models

Mistral 7B fine-tuned on 20M biomedical papers, then distilled. PubMedBERT pre-trained on 18M PubMed abstracts. Domain vocabulary includes medical terminology, drug names, pathways.

  • BioMistral 7B — Generation, QA over medical literature
  • PubMedBERT — Embeddings, retrieval from PubMed corpus
  • ClinicalBERT — Clinical notes, discharge summaries
  • SciBERT — General scientific papers, methodology extraction

Regulatory: HIPAA-compliant fine-tuning; FDA 21 CFR Part 11 for records

Use Cases & Quality

  • Patient record QA: "What medications is the patient allergic to?"
  • Drug interaction retrieval: Find papers on specific drug combinations
  • Clinical trial matching: Match patients to relevant trials
  • Literature synthesis: Summarize findings across papers

Quality on domain tasks: 92-96% (vs 85% for generic models). Latency: 50-100ms per query.

Legal & Compliance

LegalBERT & SaulLM

LegalBERT trained on 12M legal documents (contracts, case law). SaulLM-7B fine-tuned for legal reasoning. Understands statutes, precedent citations, contract clauses.

  • LegalBERT — Embeddings, contract clause retrieval
  • SaulLM 7B — Legal reasoning, opinion generation
  • Legal-BERT-small — Compact classification, ruling prediction
  • Case Law BERT — Precedent similarity, case law search

Compliance: Audit trails required; document all reasoning steps

Use Cases & Quality

  • Contract review: Identify risky clauses, flag deviations
  • Due diligence: Retrieve relevant contracts by clause type
  • Case law retrieval: Find precedent for legal arguments
  • Compliance checking: Verify contracts against templates

Quality: 94-98% on legal classification. Cost: $0.30/doc for GPT-4, $0.02/doc distilled.

Finance & Trading

FinBERT & BloombergGPT Distillations

FinBERT trained on 10K SEC filings, earnings calls, financial news. Understands ticker symbols, financial ratios, sentiment about markets. Distilled down to 66M-110M parameters.

  • FinBERT — Sentiment analysis, embeddings from SEC filings
  • BloombergGPT-distilled — Financial reasoning, earnings summarization
  • SEC Retriever BERT — Find relevant filings by section type
  • FraudBERT — Anomaly detection in financial documents

Regulatory: SEC requires documentation of AI systems for financial advice

Use Cases & Quality

  • Earnings analysis: Extract guidance, management commentary
  • SEC filing search: Find risk factors, related party transactions
  • Sentiment scoring: Score news and analyst reports
  • Fraud detection: Flag unusual disclosures or language patterns

Quality: 96%+ on classification; 90%+ on sentiment. Real-time processing: <100ms.

Code & Engineering

CodeLlama & StarCoder Distillations

CodeLlama 7B/13B trained on 500B tokens of code from GitHub. StarCoder2 3B/7B distilled from larger model. Understand syntax, APIs, dependencies, documentation patterns across 80+ languages.

  • CodeLlama 7B — Code generation, completion, infilling
  • StarCoder2 3B/7B — Fill-in-middle, multi-language, low latency
  • DeepSeek-Coder 6.7B — Code search, documentation generation
  • Granite-code 3B — IBM's distilled code model

Licensing: Verify open-source compatibility (CodeLlama uses Llama license)

Use Cases & Quality

  • Codebase RAG: "Find usage of this function across repos"
  • Code completion: Autocomplete functions, fix syntax
  • Documentation: Generate docs from docstrings, code comments
  • Bug detection: Identify common patterns, security issues

Quality: 85-90% on HumanEval. Latency: 30-60ms. Cost: $0.20/1M tokens.

Scientific & Research

SciBERT & Domain-Specific Models

SciBERT trained on 1.2M scientific papers. MatSciBERT for materials science papers. ChemBERT for chemistry. Each understands domain-specific terminology, experimental methodologies, result reporting conventions.

  • SciBERT — General scientific papers, citation context
  • MatSciBERT — Materials science, synthesis conditions
  • ChemBERT — Chemistry, molecular structures, reactions
  • AstroGLUE — Astronomy papers, telescope data analysis

Citation tracking: Models can retrieve papers cited by retrieved papers

Use Cases & Quality

  • Paper search: Find papers by methodology, findings
  • Citation analysis: Extract key citations, author networks
  • Result extraction: Parse numerical results, comparisons
  • Meta-analysis: Summarize findings across papers

Quality: 93-97% on citation prediction. Enables research synthesis at scale.

Multilingual & Cross-Lingual

mBERT & XLM-RoBERTa Distillations

Multilingual BERT trained on 104 languages. XLM-RoBERTa small (124M params) distilled from large model. Enable cross-lingual embeddings and retrieval—documents in one language can retrieve queries in another.

  • mBERT-base — 104 languages, unified embedding space
  • XLM-RoBERTa-small — Lightweight, 44M params, 100+ languages
  • LaBSE — Cross-lingual semantic search
  • mDPR — Multilingual dense passage retrieval

Zero-shot: Train on English, deploy on any language in the model's coverage

Use Cases & Quality

  • Cross-lingual search: Query in French, retrieve Chinese docs
  • Multilingual customer support: Route queries to knowledge base
  • International legal: Match contracts across jurisdictions
  • Academic search: Unified search across multiple languages

Quality: 85-92% on multilingual MTEB; zero-shot performance good for high-resource languages.

Production Tip: Domain-specific models are worth it when: (1) your domain has specialized terminology (biomedical: genes, proteins; legal: tort, precedent), (2) generic models perform <85% on your benchmarks, or (3) inference cost matters (distilled domain models are 10-20x cheaper than API calls). Start with pre-trained domain models if available (FinBERT, LegalBERT). If not, fine-tune a generic teacher on your domain (1-5 days), then distill to 7-8B student (2-3 days). For specialized embeddings (biomedical retrieval), fine-tune BGE-base on your domain corpus with contrastive learning—results in 94%+ domain-specific quality at 1/10 the teacher cost.
45 / Domain-Specific Distillation

Distillation Implementation Guide — From Teacher to Production

Distillation is a systematic process: select teacher, generate or curate training data, prepare dataset, configure student, train with distillation loss, evaluate, quantize, and deploy. This section walks through the full pipeline with code examples for each stage, covering practical production concerns like data quality, training stability, and evaluation metrics.

Pipeline: 1. Select teacher (GPT-4o / Claude / Llama 405B / Mixtral 8x22B) → 2. Generate data (5K-10K Q&A pairs from the teacher API) → 3. Prepare dataset (clean, dedupe, stratify by difficulty) → 4. Select student (Phi-3-mini, Llama 8B, DistilBERT) → 5. Train student (distillation loss, α≈0.7, T≈4, 1-3 days) → 6. Evaluate & deploy (A/B test, measure BLEU/F1, quantize for production).

Step-by-Step Implementation

1. Select Teacher Model

  • For generation: GPT-4o ($0.015/K tokens), Claude 3.5-Sonnet, Llama 405B
  • For embeddings: E5-Mistral-7B, BGE-large, sentence-transformers
  • Criteria: High accuracy on your domain, affordable API access, reproducible outputs
  • Cost estimate: 5K-10K examples ≈ $50-500 in API calls

2. Generate Training Data

  • Synthetic data: Teacher generates Q&A, reasoning chains from corpus
  • Data quality: Set temperature 0.3-0.5, filter low-confidence outputs
  • Diversity: Sample from different topics, difficulty levels
  • Deduplication: Remove near-duplicates (use embedding similarity)

3. Prepare Dataset

  • Format: JSON Lines, each line: {"instruction": "...", "output": "..."}
  • Train/val split: 90/10 or 85/15 for stratified sampling
  • Tokenization: Truncate to max_length (4096 for Llama, 512 for BERT)
  • Class balance: For classification, stratify by label

4. Student Architecture

  • Generation: Phi-3-mini (3.8B) or Llama 8B start point
  • Embeddings: all-MiniLM-L6-v2 (22M) → E5-base (110M)
  • Reranker: ms-marco-MiniLM-L-12 (33M) → BGE-m3 (568M)
  • Classifier: TinyBERT (14.5M) → DistilBERT (66M)

Code Example 1: Generate Synthetic Data at Scale

# Generate 10K Q&A pairs from your corpus using the teacher API
import json
import random
from openai import OpenAI

client = OpenAI()

# Load your domain corpus (documents, passages)
documents = load_corpus()  # your doc chunks: [{"content": ...}, ...]

training_data = []
for doc in random.sample(documents, 10000):
    # Teacher generates diverse questions
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Generate 3 diverse questions from this doc."},
            {"role": "user", "content": doc["content"]},
        ],
        temperature=0.3,  # Low temp for consistency
    )
    # Generate answers for each question
    questions = parse_questions(response.choices[0].message.content)  # your parsing helper
    for q in questions:
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": f"Doc: {doc['content']}\n\nQ: {q}"}
            ],
            temperature=0.3,
        )
        training_data.append({
            "instruction": f"Answer from doc:\n{doc['content']}\n\nQ: {q}",
            "output": answer.choices[0].message.content,
        })

# Deduplication & quality filtering
def is_high_quality(example):
    return len(example["output"]) > 20 and "\n" not in example["output"][:50]

training_data = [e for e in training_data if is_high_quality(e)]

# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for ex in training_data:
        f.write(json.dumps(ex) + "\n")

Code Example 2: Fine-tune with Unsloth + LoRA

# Fine-tune student on synthetic data with QLoRA
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load base student model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA for memory efficiency
    dtype=None,
)

# Add LoRA adapters (rank 16, ~0.5% additional params)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
)

# Load training data
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="instruction",  # or your output field
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        save_steps=500,
        output_dir="./distilled-model",
    ),
)
trainer.train()

# Save merged model (compress with GPTQ afterwards)
model.save_pretrained("./distilled-final")

Code Example 3: Evaluate Distillation Quality

# Compare teacher vs student quality on a held-out test set
import numpy as np
from rouge_score import rouge_scorer

# Load test data (not seen during training)
test_data = load_test_set()

# Get predictions from both models
teacher_outputs = [get_teacher_response(ex["input"]) for ex in test_data]
student_outputs = [get_student_response(ex["input"]) for ex in test_data]

# Evaluate with ROUGE (generation) or F1 (classification)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'])
teacher_scores = []
student_scores = []
references = [ex["reference"] for ex in test_data]
for ref, t_out, s_out in zip(references, teacher_outputs, student_outputs):
    teacher_scores.append(scorer.score(ref, t_out)['rougeL'].fmeasure)
    student_scores.append(scorer.score(ref, s_out)['rougeL'].fmeasure)

quality_retention = (np.mean(student_scores) / np.mean(teacher_scores)) * 100
print(f"Teacher ROUGE-L: {np.mean(teacher_scores):.3f}")
print(f"Student ROUGE-L: {np.mean(student_scores):.3f}")
print(f"Quality Retention: {quality_retention:.1f}%")

Production Tips & Best Practices

Training Stability

  • Batch size: 16-32 for 8B models, 4 for 13B+
  • Learning rate: 2e-4 to 5e-4 (start conservative)
  • Warmup: 5-10% of total steps to prevent instability
  • Loss curves: Should decrease smoothly; spikes indicate issues
  • Gradient clipping: max_grad_norm=1.0 to prevent explosion

Evaluation Metrics

  • Generation: ROUGE-L, BLEU, F1 vs reference outputs
  • Embeddings: MRR, NDCG, MAP on retrieval task
  • Classification: Accuracy, precision, recall per class
  • Human eval: Sample 100-200 outputs, rate quality 1-5
  • Latency: Track inference time vs quality tradeoff

Deployment & Monitoring

  • A/B testing: Compare 10% teacher, 90% student for 1-2 weeks
  • Shadow mode: Log student predictions, compare offline
  • Quantization: Post-train GPTQ or AWQ after distillation
  • Cost monitoring: Track cost-per-query before/after distillation
  • Quality drift: Monitor quality metrics in production weekly

Common Issues & Fixes

  • Collapse to mode: Student predicts same output for all inputs → lower learning rate
  • Quality gap >15%: More training data, better teacher, larger student
  • Overfitting: Large gap between train/val loss → add dropout, regularization
  • Slow convergence: Use cosine schedule with warmup, longer training
  • Divergence: Loss becomes NaN → reduce batch size, use gradient clipping
Production Tip: Distillation is not fire-and-forget. Budget 2-4 weeks for full pipeline: 3-5 days for data generation ($100-500), 1-3 days for training (single 40GB GPU), 3-5 days for evaluation and iteration. Start with smallest possible student (TinyBERT for classifiers, Phi-3-mini for generation). If quality gap is <5%, ship it. If >10%, try larger student or more data. Use speculative decoding (small student generates, larger student verifies) to get quality of large model with speed of small model—gives you the best of both worlds without quality sacrifice.
46 / Distillation Implementation Guide

Distillation Summary & Production Decision Framework

Distillation enables production RAG at 1/100th the cost of frontier APIs. The decision framework below guides model selection based on your constraints: budget, latency, privacy, quality floor. Combined with quantization and pruning, distillation achieves extreme compression—405B to <1B with 85-90% quality retention—enabling deployment on edge devices and consumer hardware.

What are your constraints?

  • Budget-critical (target <$2/1M queries): all-MiniLM + TinyBERT + Phi-3-mini (free tier)
  • Latency-critical (target <50ms end-to-end): E5-small + ms-marco + speculative decoding
  • Quality-first (target 95%+ vs teacher): GTE-Qwen2 + BGE-m3 + Llama 8B (AWQ)
  • Privacy required: local only, quantize to GGUF Q4 on CPU
  • Serving at scale: batch inference with vLLM, tensor-parallel / pipeline-parallel
  • Reasoning critical: DeepSeek-R1-Distill (chain-of-thought capability preserved)

Recommended Stacks by Scenario

  • Startup: all-MiniLM + Phi-3 + TinyBERT = $2/1M queries
  • Production: E5-base + BGE-m3 + Llama 8B (AWQ) = $8/1M queries
  • Enterprise: GTE-Qwen2 + BGE-m3 + Llama 8B (GPTQ) = $12/1M, 95%+ quality
  • Edge/Local: all-MiniLM (GGUF) + Phi-3 (GGUF Q4) = on-device, zero API

Key Takeaways

1. Cost Reduction — Distillation alone: 10-50x. With quantization: 50-400x. Combined savings compound multiplicatively.

2. Quality Retention — Well-distilled models keep 85-95% of teacher quality. 10-15% loss is rare; usually <5% on closed-domain RAG.

3. Technique Selection — 80% of use cases: logit distillation (encoders) + synthetic data (LLMs). Progressive distillation only for extreme compression.

4. Domain Matters — Generic models work well (80%+ quality). Domain-specific teachers matter only for specialized fields (biomedical, legal, code).

5. Deployment Path — GPU → GPTQ/AWQ → Pruning → GGUF. Each step trades quality for speed/size. Stop when you hit your requirements.

6. Evaluation Essential — Never ship without A/B testing teacher vs. student on 100-1000 held-out examples. 2-week shadow period recommended.

Quick Reference: Scenario → Recommendation

Scenario Constraints Recommended Approach Expected Outcome Timeline
API-to-Local Zero API deps, privacy Phi-3-mini (synthetic data) + GGUF Q4 85% quality, on-device 2 weeks
Cost Reduction Budget <$5/1M queries E5-small + ms-marco + Phi-3 (GPTQ) 90% quality, 50x cost cut 1 week
Latency Critical P95 <50ms end-to-end all-MiniLM + DistilBERT + Phi-3 (speculative) 88% quality, 5ms avg latency 1 week
Domain-Specific Biomedical/legal/code Fine-tune teacher → Distill to 7-8B + domain data 94%+ domain quality, 10x cost cut 3 weeks
Scale Inference 1B+ queries/day E5-base + BGE-m3 (batch) + Llama 8B (vLLM) 92% quality, $8/1M tokens 2 weeks
Extreme Compression Mobile/edge, <100MB Distil → Q4 → Prune → GGUF Q2 80-85% quality, 250x smaller 4 weeks

Top 5 Mistakes & How to Avoid Them

❌ Mistake 1: Skipping Evaluation

Shipping student without A/B testing against teacher. Can lose 15-30% quality silently.

✓ Fix: Evaluate on 200+ held-out examples. Human eval for 50 outputs. 2-week shadow mode (log but don't use student).

❌ Mistake 2: Low-Quality Training Data

Generating synthetic data at high temperature (0.8+) or without filtering. Student learns inconsistent/noisy examples.

✓ Fix: Use temperature 0.3-0.5. Filter outputs <50 chars. Dedupe with embedding similarity (thr=0.95).

❌ Mistake 3: Overshrinking Student

Going straight from 405B to 1B. Quality drops 20-30%. Better to go 405B → 13B → 7B progressively.

✓ Fix: Start with 50% reduction (405B → 7B). If quality ok, shrink more. Progressive distillation for extreme sizes.

❌ Mistake 4: Wrong Temperature

Temperature T too low (<2) → soft targets stay almost as sharp as hard labels, so the student just memorizes. T too high (>8) → targets become nearly uniform, which is easy to fit but yields weak, uninformative gradients.

✓ Fix: Start with T=4. If training unstable, increase to 6-8. If converging too fast, lower to 2-3.
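The temperature enters through the soft-target loss. Below is a minimal sketch of the standard Hinton-style distillation loss, assuming PyTorch and classifier-style logits; the alpha weighting and the T² scaling factor are the usual conventions, not values taken from this deck.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: higher T flattens the teacher distribution (easier to fit,
    # but near-uniform targets give weak gradients); lower T sharpens it toward
    # hard-label memorization. The T**2 factor keeps gradient magnitudes stable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard-label cross-entropy keeps the student anchored to ground truth.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss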

❌ Mistake 5: Insufficient Training

Training for 1 epoch over 5K examples. Student hasn't converged; performance is suboptimal.

✓ Fix: Train for 3-5 epochs. Monitor train/val loss. Stop when val loss plateaus (typically day 2-3 for 8B on single GPU).

⚠️ Challenge: Quality Degradation

Student loses 15-20% quality despite everything looking right. Common in open-domain reasoning, edge cases.

✓ Mitigation: Increase training data (10K → 50K). Use larger student. Add domain-specific hard examples. Accept loss on reasoning tasks.

Resources & Tools

Key Papers

  • Hinton et al. 2015 — Distilling Knowledge in Neural Networks (original KD)
  • Jiao et al. 2019 — TinyBERT (layer + feature distillation)
  • Anil et al. 2023 — Large Language Model Distillation (Gemini)

Tools & Frameworks

  • Unsloth — 2-5x faster distillation (QLoRA)
  • vLLM — Batch inference for evaluation
  • AutoGPTQ — Easy GPTQ quantization
  • HuggingFace SFT Trainer — Supervised fine-tuning

Benchmark & Datasets

  • MTEB — Embedding evaluation (56 tasks)
  • HumanEval — Code generation quality
  • MMLU — Knowledge/reasoning benchmark
  • SuperGLUE — NLU/classification tasks
Final Production Insight: Distillation + quantization + pruning is the path to production AI at massive scale. A 405B model distilled to 8B (~50x), quantized to 4-bit (~8x vs FP32), and pruned 30% (~1.4x) ends up with roughly the memory footprint of a 1.5B FP16 model while retaining ~85% quality. It runs on a single 24GB consumer GPU, costs <$0.20 per million tokens, and generates at a few milliseconds per token. Combine with speculative decoding for another 30-40% latency reduction. The era of cheap, fast, private AI inference is here; distillation is the key technology enabling it.
47 / Distillation Summary & Decision Framework

Fine-Tuning vs RAG — When to Use Which

Fine-tuning bakes knowledge into model weights; RAG retrieves it at runtime. The right choice depends on whether your knowledge is static or dynamic, whether you need behavioral changes or factual grounding, and your budget for maintenance.

What do you need?
  • Change behavior/style → Fine-Tune (tone, format, domain style, reasoning)
  • Add factual knowledge → RAG (dynamic facts, documents, real-time data)
  • Both → Fine-Tune + RAG (domain style + dynamic knowledge)
Signals toward fine-tuning: static knowledge is acceptable, custom output format. Signals toward RAG: frequent updates, source attribution.
Decision tree: choose based on whether you need behavioral adaptation, factual grounding, or both.

Head-to-Head Comparison

Dimension Fine-Tuning RAG Fine-Tune + RAG
Knowledge freshness Frozen at training time Always up-to-date Up-to-date
Hallucination control Hard to control Grounded in sources Best of both
Source citation Not possible Built-in Built-in
Output style control Excellent Limited (prompt-based) Excellent
Setup cost $100-10K (GPU training) $50-500 (indexing pipeline) $200-10K
Per-query cost Low (small model) Medium (retrieval + LLM) Medium
Maintenance Retrain on new data Re-index documents Both
Data volume needed 1K-100K examples Any number of documents 1K+ examples + documents
Latency Fastest (single forward pass) +50-200ms (retrieval) +50-200ms
Best for Tone, style, format, domain jargon Facts, docs, real-time data, QA Enterprise production systems

Decision Guide

Choose Fine-Tuning When

  • Custom output format: JSON schemas, specific templates, branded voice
  • Domain adaptation: Medical terminology, legal language, code style
  • Behavioral changes: Response length, reasoning approach, safety rules
  • Latency-critical: No retrieval overhead; single forward pass
  • Stable knowledge: Information that won't change often
  • Cost at scale: Fine-tuned small model cheaper than large model + RAG

Choose RAG When

  • Dynamic knowledge: Documents updated daily/weekly
  • Source attribution: Users need to verify where answers come from
  • Large corpus: Thousands of documents that can't fit in training data
  • Compliance: Audit trails, explainability, data governance
  • Multi-tenant: Different knowledge bases per user/org
  • Rapid prototyping: No training loop; index and query immediately
Production Tip: The best production systems combine both. Fine-tune a small model (Llama 8B, Phi-3) on your domain's style and output format, then use RAG to inject current knowledge at query time. This gives you domain-adapted behavior with always-fresh facts. Start with RAG alone — it's faster to set up. Add fine-tuning only when prompt engineering can't achieve the style/format you need. Monitor: if >30% of your prompt tokens are formatting instructions, fine-tuning will be more cost-effective.
48 / Fine-Tuning vs RAG

RAG Prompt Engineering — Optimizing Generation

The prompt template connecting retrieved context to the LLM is the most underappreciated component of RAG. Small prompt changes can swing answer quality by 20-40%. Master these patterns to eliminate hallucination, improve faithfulness, and control output format.

Essential Prompt Patterns

Grounding Instructions

Force the model to answer only from provided context, reducing hallucination.

"""Answer the question based ONLY on the provided context. If the context does not contain enough information to answer, say "I don't have enough information to answer this question." Do NOT use prior knowledge. Context: {retrieved_chunks} Question: {query} Answer:"""

Citation / Attribution

Require inline citations that map back to source documents.

"""Answer using ONLY the numbered sources below. Cite each claim with [Source N]. Sources: [1] {chunk_1} (from: {doc_name_1}) [2] {chunk_2} (from: {doc_name_2}) [3] {chunk_3} (from: {doc_name_3}) Question: {query} Answer (with citations):"""

Chain-of-Thought RAG

Ask the model to reason through the context step by step before answering.

"""Given the context, answer step by step: 1. Identify relevant information 2. Check for contradictions 3. Synthesize a coherent answer 4. Cite your sources Context: {chunks} Question: {query} Step-by-step reasoning:"""

Refusal / Uncertainty

Teach the model to express confidence levels and refuse gracefully when unsure.

"""Rate your confidence (HIGH/MEDIUM/LOW) based on how well the context supports your answer. - HIGH: Direct answer in context - MEDIUM: Inferred from context - LOW: Partially supported If LOW, say: "Based on limited context, ..." and suggest what additional info would help. Context: {chunks} Question: {query}"""

Common Anti-Patterns to Avoid

Anti-Pattern Problem Fix
No grounding instruction Model mixes retrieved facts with parametric knowledge, causing subtle hallucinations Always include "answer ONLY from context"
Context before system prompt Long context pushes instructions out of attention window ("lost in the middle") Place instructions first, then context, then question
Too many chunks Dilutes relevant info; model struggles to find the answer in noise Rerank and limit to top 3-5 most relevant chunks
No refusal path Model invents answers when context doesn't contain the answer Explicitly instruct "say I don't know if unsupported"
Missing metadata Model can't distinguish document sources or dates Include doc title, date, source URL with each chunk
Vague output format Inconsistent response structure across queries Specify exact output format (JSON, bullets, paragraphs)

Advanced Prompt Techniques

Multi-Document Synthesis

When chunks come from multiple documents, instruct the model to identify agreements, contradictions, and gaps between sources before synthesizing.


Structured Output

Use JSON mode or XML tags to get consistent, parseable output. Define the schema in the prompt: {"answer": "...", "sources": [...], "confidence": "HIGH"}.
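A minimal sketch of this pattern, assuming a generic `llm.generate` call: the prompt asks for the schema shown above, and the parser falls back to a refusal object when the output is not valid JSON. Helper names are illustrative.

import json

SCHEMA_HINT = '{"answer": "...", "sources": ["..."], "confidence": "HIGH|MEDIUM|LOW"}'
REFUSAL = {"answer": "I don't have enough information.", "sources": [], "confidence": "LOW"}

def build_structured_prompt(context, query):
    return (
        "Answer ONLY from the context below.\n"
        f"Respond with a single JSON object matching this schema: {SCHEMA_HINT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "JSON:"
    )

def parse_structured_answer(raw_text):
    """Validate the model output; fall back to a refusal object on bad JSON."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return dict(REFUSAL)
    if not isinstance(data, dict) or not {"answer", "sources", "confidence"} <= set(data):
        return dict(REFUSAL)
    return data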


Few-Shot RAG Examples

Include 2-3 example context→answer pairs in the prompt to demonstrate the expected citation style, reasoning depth, and refusal behavior.

Production Tip: A/B test your prompts. The optimal prompt structure is: System instruction (grounding rules + output format) → Retrieved context (with source metadata) → User question → Output constraints. Keep context chunks under 5 for most queries — more chunks rarely improve quality and often hurt it. Use structured output (JSON) if downstream systems consume the response. Always include a refusal path — it's cheaper to say "I don't know" than to hallucinate and lose user trust.
49 / RAG Prompt Engineering

Caching Strategies for Production RAG

Multi-layer caching is the single highest-ROI optimization for production RAG — reducing latency by 60-90%, cutting LLM costs by 40-70%, and improving user experience with near-instant responses for repeated or similar queries.

The raw user query passes through three cache layers before the full pipeline:
  • L1: Exact match (Redis / Memcached). Key = hash(query + filters); TTL 1-24 hours; hit rate 15-25%; latency <5ms.
  • L2: Semantic cache (GPTCache / custom). embed(query) → ANN lookup; serve on cosine similarity > 0.95; hit rate 25-40%; latency 15-50ms.
  • L3: Embedding cache. content_hash → vector; avoids re-embedding and doc-level duplicates; saves 60-80% of embedding API costs.
  • Miss on all layers → full retrieval + generation pipeline.

Cache Layer Deep Dive

Exact Match Cache

Hash the normalized query + metadata filters as cache key. Store the full response (answer + citations + confidence score). Best for FAQ-style queries and repeated searches.

import hashlib
import json

def cache_key(query, filters):
    normalized = query.lower().strip()
    key_str = f"{normalized}|{sorted(filters.items())}"
    return hashlib.sha256(key_str.encode()).hexdigest()

# Redis with TTL + invalidation hooks
cached = redis.get(cache_key(q, f))
if cached:
    return json.loads(cached)  # <5ms

Semantic Cache

Embed the query, search a cache-specific vector index for similar past queries. If cosine similarity exceeds threshold (0.95+), return the cached response. Handles paraphrases and near-duplicates.

class SemanticCache:
    def lookup(self, query_embedding):
        results = self.cache_index.search(
            query_embedding,
            top_k=1
        )
        if results and results[0].score > 0.95:
            return self.response_store[results[0].id]
        return None  # cache miss

Embedding Cache

Cache computed embeddings keyed by content hash. Avoids re-embedding unchanged documents during re-indexing. Critical for cost control at scale (embedding APIs charge per token).

import numpy as np

def get_embedding(text):
    content_hash = hash_content(text)
    cached = redis.get(f"emb:{content_hash}")
    if cached:
        return np.frombuffer(cached, dtype=np.float32)  # match the model's dtype
    vec = embedding_model.encode(text)
    redis.setex(
        f"emb:{content_hash}",
        86400 * 7,       # 7-day TTL
        vec.tobytes()
    )
    return vec

Cache Invalidation Strategies

Time-Based (TTL)

Set TTL based on data volatility. Static docs: 24h+. News/feeds: 1-4h. Real-time data: 5-15min. Always pair with event-based invalidation.

Event-Driven Invalidation

On document update/delete, invalidate all cache entries referencing that doc_id. Use CDC (Change Data Capture) or webhook triggers from source systems.

Versioned Keys

Include index version or embedding model version in cache keys. Model upgrade = automatic full invalidation without manual flush.

Confidence-Gated Caching

Only cache responses with confidence score above threshold (e.g., >0.85). Low-confidence answers should always be regenerated fresh.
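A minimal sketch tying several of these strategies together, assuming a redis-py client: versioned keys make index or policy upgrades self-invalidating, writes are confidence-gated, and a per-document reference set supports event-driven invalidation. The INDEX_VERSION / POLICY_VERSION constants and the doc_refs set are illustrative conventions, not a specific library API.

import hashlib
import json
import redis

r = redis.Redis()
INDEX_VERSION = "idx-2026-02"   # bump on re-index or embedding-model upgrade
POLICY_VERSION = "acl-v7"       # bump when access policies change

def versioned_cache_key(query, filters, tenant_id):
    payload = {
        "q": query.lower().strip(),
        "f": sorted(filters.items()),
        "tenant": tenant_id,
        "index": INDEX_VERSION,    # model/index upgrade = automatic invalidation
        "policy": POLICY_VERSION,
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def cache_response(key, response, confidence, ttl_seconds=3600):
    if confidence < 0.85:          # confidence gate: never cache low-confidence answers
        return False
    r.setex(key, ttl_seconds, json.dumps(response))
    for doc_id in response.get("source_doc_ids", []):
        r.sadd(f"doc_refs:{doc_id}", key)   # remember which cached answers cite this doc
    return True

def invalidate_document(doc_id):
    # Event-driven invalidation: drop every cached answer that cited doc_id
    for key in r.smembers(f"doc_refs:{doc_id}"):
        r.delete(key)
    r.delete(f"doc_refs:{doc_id}")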

Production Impact: A well-tuned 3-layer cache typically handles 50-70% of production queries without hitting the LLM, reducing p50 latency from 2-4s to <100ms and cutting monthly LLM costs by 40-60%. Start with exact match (week 1), add semantic cache (week 3), then tune thresholds with A/B testing.
50 / Caching Strategies

Metadata Filtering & Hybrid Search

Pure vector similarity search is rarely sufficient in production. Metadata filtering adds structured constraints (date ranges, access levels, document types, departments) to narrow the search space before or after vector retrieval — improving precision, enforcing security, and reducing noise.

Pre-Filter vs Post-Filter Architecture

Pre-Filtering (Recommended)

Apply metadata constraints before vector search. The vector DB only searches within the filtered subset. Faster at query time, but requires indexed metadata fields. Supported natively in Qdrant, Pinecone, Weaviate, Milvus.

# Qdrant pre-filter example (assumes qdrant-client; Filter types from qdrant_client.models)
results = client.search(
    collection_name="docs",
    query_vector=query_vec,
    query_filter=Filter(must=[
        FieldCondition(
            key="department",
            match=MatchValue(value="engineering")
        ),
        FieldCondition(
            key="created_at",
            range=Range(gte="2025-01-01")   # date fields may need DatetimeRange on newer clients
        ),
        FieldCondition(
            key="acl_groups",
            match=MatchAny(any=user.groups)
        )
    ]),
    limit=10
)

Post-Filtering

Retrieve top-k results first, then filter by metadata. Simpler to implement but risks returning fewer results than requested (top-10 search → filter → 3 results). Use when filter selectivity is low or metadata isn't indexed.

# Post-filter: over-fetch then trim
raw = vector_db.search(
    query_vec,
    top_k=50  # 5x over-fetch
)
filtered = [
    r for r in raw
    if r.meta["dept"] == user.dept
    and r.meta["access"] <= user.level
][:10]  # trim to final top-k

Production Metadata Schema

Field | Type | Purpose | Index? | Example
doc_id | string | Unique document identifier | Yes | doc_a3f8c1
source_type | keyword | Filter by document origin | Yes | confluence, gdrive, s3
department | keyword | Org-level filtering | Yes | engineering, legal, hr
acl_groups | keyword[] | Access control enforcement | Yes | ["eng-team", "all"]
created_at | datetime | Freshness filtering | Yes | 2025-11-15T10:30:00Z
updated_at | datetime | Staleness detection | Yes | 2026-02-01T08:00:00Z
language | keyword | Multilingual support | Yes | en, fr, de
doc_type | keyword | Content type filtering | Yes | policy, runbook, faq
chunk_index | integer | Ordering within parent doc | No | 3
parent_doc_id | string | Link chunks to parent | Yes | doc_a3f8c1
confidence | float | Ingestion quality score | No | 0.92
version | integer | Document version tracking | Yes | 3
Design Principle: Index every field you filter on frequently. Under-indexed metadata forces expensive post-filtering. Over-indexed metadata wastes storage but doesn't hurt query performance. When in doubt, index it — storage is cheap, latency is not.
51 / Metadata Filtering

Conversational & Multi-Turn RAG

Single-turn RAG treats each query independently. Production chat applications require multi-turn awareness — resolving pronouns, maintaining topic context, handling follow-up questions, and deciding when to re-retrieve vs reuse prior context.

User: "Tell me about HNSW"
Retrieve + Generate
Response (HNSW details)
User: "How does it compare to IVF?"
Context Resolver
Rewritten: "Compare HNSW vs IVF indexing"
Retrieve + Generate

Multi-Turn Resolution Strategies

1. Query Rewriting with History

Use the LLM to rewrite the latest query into a standalone query by resolving coreferences from chat history. This is the most reliable approach for production.

def rewrite_query(history, current_query):
    prompt = f"""Given this conversation:
{format_history(history)}

Rewrite this follow-up into a standalone search query:
"{current_query}"

Standalone query:"""
    return llm.generate(prompt)

2. Context Carryover Window

Append the last N retrieved chunks to the new generation context. Simple and effective for follow-ups that reference previous answers. Risk: context window bloat after many turns.

# Sliding window: keep last 3 turns
context_window = []
for turn in conversation[-3:]:
    context_window.extend(turn.retrieved_chunks)

# Deduplicate by chunk_id
context_window = dedupe(context_window)

# Add new retrieval results
context_window += new_retrieved_chunks

3. Retrieval Decision Gate

Not every follow-up needs new retrieval. Use an LLM classifier or heuristics to decide: re-retrieve, reuse context, or answer from conversation history alone. Saves 30-50% of retrieval calls.

def needs_retrieval(history, query):
    # Classify intent
    intent = classify(query, labels=[
        "new_topic",      # retrieve
        "follow_up",      # maybe
        "clarification",  # no
        "chitchat"        # no
    ])
    return intent == "new_topic"

4. Memory-Augmented RAG

Maintain a structured memory store alongside the vector DB. Track user preferences, established facts from the conversation, and topic threads. Enables personalized retrieval over sessions.

class ConversationMemory:
    entities: dict      # extracted entities
    preferences: dict   # user prefs
    topic_stack: list   # active topics
    facts: list         # established facts

# Enrich retrieval with memory context
filters = build_filters(memory.entities)
boost = build_boost(memory.preferences)
Common Pitfall: Passing the entire raw conversation history as retrieval context. This floods the vector search with irrelevant terms and degrades precision. Always rewrite to a standalone query or use a sliding window with deduplication. Limit conversation context to 3-5 turns maximum.
52 / Multi-Turn RAG

User Feedback & Continuous Improvement

A production RAG system without a feedback loop is flying blind. User signals (thumbs up/down, click-through, reformulations, explicit corrections) are the ground truth for measuring real-world quality and driving iterative improvement.

RAG Pipeline → User Feedback → Analytics & Insights (dashboards, alerts, reports) → Improvement Actions (tune, retrain, fix data) → back into the RAG Pipeline.

Feedback Signal Taxonomy

Explicit Signals

Thumbs up/down, star ratings, "this was helpful" clicks, written corrections, citation relevance ratings. Highest quality signal but lowest volume (2-5% of queries).

Implicit Signals

Query reformulations (user wasn't satisfied), click-through on citations (answer was useful), copy-paste actions, session duration, follow-up patterns. High volume, noisier signal.

System Signals

Low confidence scores, retrieval misses (no results above threshold), hallucination detection triggers, timeout/fallback activations. Automated quality indicators.

Closing the Loop: Improvement Actions

1. Build Eval Datasets from Feedback

Convert thumbs-down responses into test cases. The query + bad answer + user correction becomes a regression test. Target: 500+ labeled examples for statistical significance.
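A sketch of that conversion, assuming a feedback event dict carrying the query, retrieved chunk ids, the rejected answer, and an optional user correction; field names are illustrative.

from datetime import datetime, timezone

def feedback_to_test_case(event):
    """Turn a thumbs-down event into a regression-test record for the eval set."""
    return {
        "test_id": f"fb-{event['query_id']}",
        "query": event["query"],
        "retrieved_chunks": event["chunk_ids"],
        "bad_answer": event["answer"],
        "expected_behavior": event.get("user_correction", "needs_review"),
        "failure_category": None,   # filled in later during root-cause clustering
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source": "thumbs_down",
    }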

2. Identify Failure Patterns

Cluster negative feedback by root cause: retrieval misses (wrong docs), grounding failures (hallucination), formatting issues, stale data, permission errors. Fix the highest-impact category first.

3. Targeted Improvements

Retrieval misses → adjust chunking, add synonyms, tune hybrid weights. Hallucinations → strengthen grounding prompts, lower confidence thresholds. Stale data → fix ingestion pipeline, reduce TTLs.

4. A/B Test & Measure

Deploy improvements behind feature flags. Run A/B tests comparing new vs old pipeline. Measure: answer acceptance rate, reformulation rate, confidence scores, latency. Promote only if metrics improve across the board.

Key Metric: Track "Answer Acceptance Rate" (% of responses without reformulation or negative feedback) as your north star. World-class RAG systems achieve 85-92%. Below 70% signals fundamental retrieval or grounding issues. Instrument this from day one — it's your single best proxy for real-world quality.
53 / Feedback & Continuous Improvement

Structured Data RAG: Text2SQL & Table QA

Not all knowledge lives in documents. Production RAG systems often need to query structured data — relational databases, data warehouses, spreadsheets, and APIs. Text2SQL converts natural language into SQL queries, while Table QA reasons over tabular data directly.

NL Question
Schema Retrieval
SQL Generation
Validation
Execute & Synthesize

Text2SQL Pipeline

Convert natural language to SQL using schema-aware prompting. Key: provide table schemas, column descriptions, sample values, and example query pairs in the LLM prompt.

class Text2SQLPipeline:
    def query(self, question: str):
        # 1. Retrieve relevant schemas
        schemas = self.schema_retriever.search(
            question, top_k=5
        )
        # 2. Generate SQL
        sql = self.llm.generate(
            self.prompt_template.format(
                schemas=schemas,
                question=question,
                examples=self.few_shot_examples
            )
        )
        # 3. Validate & sanitize
        sql = self.sql_validator.check(sql)
        # 4. Execute (read-only!)
        results = self.db.execute(sql)
        # 5. Synthesize answer
        return self.synthesizer.answer(
            question, sql, results
        )

Table QA (Direct Reasoning)

For smaller tables or CSV data, pass the table directly into the LLM context. The model reasons over rows and columns without SQL. Best for aggregations, comparisons, and trend analysis on <100 rows.

# Serialize table as Markdown
table_md = df.to_markdown(index=False)

prompt = f"""Given this data table:
{table_md}

Answer: {question}

Rules:
- Only use data from the table
- Show your calculation steps
- If data is insufficient, say so"""

answer = llm.generate(prompt)

Text2SQL Safety & Guardrails

SQL Injection Prevention

Always use read-only DB connections. Parse and validate generated SQL against an allowlist of operations (SELECT only). Block DROP, DELETE, UPDATE, INSERT, GRANT.
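A minimal allowlist validator along these lines, assuming the sqlparse library; it is defense in depth only and does not replace a read-only database role with statement timeouts.

import re
import sqlparse

FORBIDDEN = {"DROP", "DELETE", "UPDATE", "INSERT", "ALTER", "GRANT", "TRUNCATE"}

def validate_generated_sql(sql: str) -> str:
    """Allow a single SELECT statement; reject everything else."""
    statements = sqlparse.parse(sql)
    if len(statements) != 1:
        raise ValueError("exactly one statement allowed")
    if statements[0].get_type() != "SELECT":
        raise ValueError(f"only SELECT is allowed, got {statements[0].get_type()}")
    upper = sql.upper()
    for word in FORBIDDEN:
        if re.search(rf"\b{word}\b", upper):
            raise ValueError(f"forbidden keyword: {word}")
    if "LIMIT" not in upper:
        sql = sql.rstrip().rstrip(";") + " LIMIT 1000"   # cap result set size
    return sql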

Query Cost Guards

Add EXPLAIN before execution to estimate row scans. Set query timeouts (5-30s). Block full table scans on large tables. Limit result set size (LIMIT 1000).

Column-Level Access Control

Enforce column-level permissions in the schema retriever. Don't expose salary, SSN, or PII columns to unauthorized users. Redact sensitive columns from schema context.

When to Use Which: Text2SQL for large databases (millions of rows), complex joins, and precise aggregations. Table QA for small datasets (<100 rows), quick analysis, and when SQL complexity isn't justified. Hybrid approach: use Text2SQL for retrieval, then Table QA for reasoning over the result set. Always combine with document RAG for complete answers — numbers from SQL + context from docs.
54 / Structured Data RAG

Data Lifecycle, Freshness & Deletion

Production RAG cannot stop at ingestion. You need deterministic handling for updates, deletes, retention, cache invalidation, tombstones, and legal erasure requests so the system never serves stale or non-compliant content.

Source System (create / update / delete) → CDC / Webhook (versioned change event) → Lifecycle Orchestrator (upsert index, write tombstone, invalidate caches, audit deletion SLA) → downstream targets: Vector Index (upsert / purge), Caches (response / embedding / query), Audit (delete, retention, erasure SLA).

Lifecycle Rules

  • Every chunk must carry `doc_id`, `version`, `source_updated_at`, `retention_class`, and `delete_by` metadata (see the example payload after this list).
  • Deletes should write tombstones first, then purge vector rows, cache entries, and derived artifacts asynchronously.
  • Freshness is an SLO, not a hope: define targets like "95% of updates searchable within 5 minutes".
  • Legal erasure must verify downstream deletion, not just remove the primary source record.
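An illustrative chunk payload carrying the fields listed above; `tenant_id` and `indexed_at` are extra fields shown for context (indexed_at minus source_updated_at is what feeds the freshness SLO).

chunk_metadata = {
    "doc_id": "doc_a3f8c1",
    "chunk_index": 3,
    "version": 7,
    "tenant_id": "acme",
    "source_updated_at": "2026-02-01T08:00:00Z",
    "indexed_at": "2026-02-01T08:03:12Z",
    "retention_class": "customer_data_3y",
    "delete_by": "2029-02-01T00:00:00Z",
}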

Failure Cases to Prevent

  • Updated source document but stale semantic cache still serving old answer.
  • Delete event lost, leaving orphaned chunks in the vector index.
  • Embedding model upgrade without full lineage causing mixed-version retrieval.
  • Retention policy applied to source DB but not to traces, audit logs, and feedback datasets.

Delete Propagation Pattern

class LifecycleManager:
    async def handle_delete(self, doc_id, tenant_id, version):
        tombstone = {
            "doc_id": doc_id,
            "tenant_id": tenant_id,
            "version": version,
            "deleted_at": now_utc(),
        }
        await self.audit_log.write(tombstone)
        await self.vector_index.delete(filter={
            "doc_id": doc_id,
            "tenant_id": tenant_id,
        })
        await self.cache.invalidate_prefix(f"{tenant_id}:{doc_id}:")
        await self.blob_store.purge(doc_id)
        await self.metrics.increment("rag.delete.completed")
55 / Data Lifecycle, Freshness & Deletion

Tenant Isolation & Authorization Propagation

Multi-tenant RAG fails dangerously when identity is lost between the API edge and retrieval. Authorization must propagate through query rewriting, retrieval filters, cache keys, reranking, citations, and structured data access.

JWT / Session
Policy Engine
Retrieval Filters
Cache Keys
Answer + Citations

Identity Context

Normalize identity into a signed request context: tenant, user, groups, region, classification clearance, data residency, and session purpose.

Policy Resolution

Compile ABAC/RBAC decisions once per request and pass concrete filters downstream. Do not let each service reinterpret permissions differently.

Output Enforcement

Filter citations, schema context, and tool outputs after retrieval as well. A safe retriever can still leak through a broad synthesizer prompt.
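A sketch of that post-retrieval enforcement step, assuming citation entries carry `tenant_id` and `classification` fields and the request context follows the contract shown below; names are illustrative.

def enforce_output_acl(answer, citations, request_context):
    """Drop citations the caller may not see; never ship claims whose only evidence was removed."""
    allowed = set(request_context["allow"]["classifications"])
    visible = [
        c for c in citations
        if c["tenant_id"] == request_context["tenant_id"]
        and c["classification"] in allowed
    ]
    if len(visible) < len(citations):
        # Evidence was filtered out after generation: regenerate from the
        # visible subset or refuse, rather than serving unsupported claims.
        return {"answer": None, "citations": visible, "action": "regenerate_or_refuse"}
    return {"answer": answer, "citations": visible, "action": "serve"}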

Authorization Contract

request_context = {
    "tenant_id": "acme",
    "user_id": "u-123",
    "groups": ["support", "tier2"],
    "region": "us",
    "purpose": "customer_support",
    "allow": {
        "doc_types": ["kb", "ticket"],
        "classifications": ["public", "internal"]
    }
}

filters = {
    "tenant_id": request_context["tenant_id"],
    "region": request_context["region"],
    "classification": {"$in": request_context["allow"]["classifications"]},
}

cache_key = sha256(json.dumps({
    "query": normalized_query,
    "tenant": request_context["tenant_id"],
    "policy_hash": hash_policy(request_context),
}, sort_keys=True).encode()).hexdigest()
Non-negotiable rule: never cache answers by query alone. Cache keys must be ACL-sensitive, tenant-scoped, and versioned by policy model to prevent cross-tenant leakage.
56 / Tenant Isolation & Authorization

Human Review Ops & Golden Datasets

Evaluation frameworks are not enough by themselves. Production teams need a disciplined review loop: sample traffic, adjudicate failures, curate regression sets, and assign ownership for fixing systematic defects.

Traffic Samples (prod + canary + failures) → Review Queue (stratified + risk-weighted) → Annotators (label + root cause + severity) → Adjudication (resolve conflicts) → Golden set used in CI.

Review Program Design

  • Sample at least three buckets: top traffic, low-confidence responses, and high-risk policy domains.
  • Require labels for retrieval quality, groundedness, citation quality, and user task completion.
  • Track reviewer agreement and escalate ambiguous cases to adjudication.
  • Promote only adjudicated examples into the golden regression set.

Dataset Operating Model

  • Keep separate sets for smoke, regression, hard edge cases, and release blocking policy cases.
  • Version datasets like code and record model, prompt, and index version used to generate them.
  • Retire stale eval samples when source policy or corpus semantics change materially.
  • Assign owners for every recurring failure cluster, not just every model.

Minimal Review Schema

review_record = {
    "query_id": "q-20260416-001",
    "query": user_query,
    "retrieved_chunks": chunk_ids,
    "answer": answer,
    "labels": {
        "grounded": True,
        "intent_match": True,
        "citation_quality": "partial",
        "task_success": "no",
        "root_cause": "stale_source_data",
    },
    "reviewer_id": "rev-17",
    "adjudicated": False,
}
57 / Human Review Ops & Golden Datasets

Reliability, Failover & Degraded Modes

A production RAG system must keep answering safely when dependencies fail. Define fallback order, circuit breakers, restore targets, and degraded modes before you need them during an incident.

Primary Path

Hybrid retrieval + reranker + response evaluation + citations. Highest quality, highest dependency count.

Degraded Path

BM25-only retrieval, smaller local model, cached answers, or template response if vector DB, reranker, or API model is down.

Fail-Safe Path

Refuse cleanly, escalate to human, or serve a narrow verified FAQ set. Never silently drop safety checks.

Dependency Failure Matrix

Dependency Failure Signal Fallback Hard Rule
Vector DB timeout / error budget burn BM25 index or cached answer set Disable claims needing fresh retrieval
Reranker high latency / no replicas lower `top_k`, rely on retrieval scores Mark answer confidence lower
LLM API provider outage / 429 storm secondary model or local distilled model Preserve same guardrails and filters
Policy Engine cannot resolve permissions fail closed Never answer with missing auth context

Reliability Controls

async def answer_query(query, ctx):
    if not policy_engine.is_available():
        raise FailClosed("authorization unavailable")
    try:
        docs = await vector_search.with_timeout(250).run(query, ctx.filters)
    except TimeoutError:
        docs = await bm25_fallback.search(query, ctx.filters)
        ctx.mode = "degraded_retrieval"
    answer = await generator.run(query, docs, ctx)
    verdict, safe_answer = await response_eval.run(query, answer, docs)
    if verdict == "fallback":
        return human_handoff_or_verified_faq(query)
    return safe_answer
Ops baseline: define `RPO`, `RTO`, restore drill cadence, and dependency-specific circuit-breaker thresholds. Backup plans that are never restore-tested do not count.
58 / Reliability, Failover & Degraded Modes

Citation UX & Source Attribution

Grounding is only useful if users can inspect it. Production RAG should define how claims map to sources, how conflicting evidence is shown, and how citations differ across chat, search, copilots, and agent workflows.

Claim-Level Citations

Attach citations to atomic claims, not just the whole answer. One answer can have mixed evidence quality across sentences.

Source Preview

Show document title, snippet, timestamp, source system, and anchor location. Users should not need to open the full document to trust the claim.

Conflict Handling

When sources disagree, say so explicitly and rank by freshness, authority, and tenant-approved source priority.

Answer Contract with Citations

{ "answer": "Refunds are allowed within 30 days for unopened items.", "claims": [ { "text": "Refunds are allowed within 30 days", "citations": [ {"doc_id": "policy-12", "anchor": "p3#refund-window", "confidence": 0.94} ] } ], "source_summary": [ {"title": "Returns Policy", "updated_at": "2026-04-01", "authority_rank": 1} ] }
Implementation rule: render inline citations in chat, expandable evidence cards in search, and full execution logs in agent workflows. One attribution format does not fit every product surface.
59 / Citation UX & Source Attribution

Multilingual & Locale-Aware RAG

Multilingual retrieval is more than using a multilingual embedding model. You need locale-aware routing, translation policy, source preference by market, and evaluation sliced by language and script.

User Locale
Language Detection
Native or Translate?
Locale Filters
Localized Answer
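A minimal sketch of the routing flow above, assuming the langdetect library for language identification; the locale lists and filter names are illustrative placeholders.

from langdetect import detect   # assumption: langdetect for lightweight language ID

NATIVE_CORPORA = {"en", "fr", "de", "ja"}   # locales with indexed native-language content

def route_query(query, user_locale="en-US"):
    try:
        lang = detect(query)
    except Exception:
        lang = user_locale.split("-")[0].lower()
    native = lang in NATIVE_CORPORA
    return {
        "search_language": lang if native else "en",   # prefer native-language retrieval
        "translate_query": not native,                  # translation only as a fallback
        "metadata_filters": {
            "language": lang if native else "en",
            "market": user_locale.split("-")[-1].lower(),  # locale-ranked policy/pricing docs
        },
        "answer_language": lang,                        # always respond in the user's language
    }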

Serving Policy

  • Prefer native-language retrieval when the corpus exists in that locale.
  • Use translation only as a fallback, and keep both original and translated evidence IDs.
  • Apply locale-specific ranking for policy, legal, pricing, and compliance content.

Evaluation Requirements

  • Track metrics by language, script, market, and translated-vs-native path.
  • Maintain hard test sets for code-switching, transliteration, and named-entity spelling variants.
  • Never hide poor minority-language performance behind global averages.
60 / Multilingual & Locale-Aware RAG

Personalization, Memory Boundaries & Deletion

Personalization improves usefulness, but it creates new correctness and compliance risks. The system must define what memory is allowed, how long it persists, who can see it, and how user corrections or deletions propagate.

Allowed Memory

Preferences, saved entities, work context, and prior explicit corrections. Keep this separate from shared knowledge retrieval.

Boundary Controls

Do not let user memory silently override system facts. Personal memory can bias ranking, not rewrite source-of-truth records.

Deletion Semantics

A user memory delete must remove embeddings, cache entries, summaries, and feedback traces tied to that memory object.

Policy rule: personalization should be opt-in, inspectable, and reversible. If a user cannot view and clear memory, the feature is not production-ready.
61 / Personalization, Memory Boundaries & Deletion

Secrets Management & Credential Rotation

Connectors, model providers, vector stores, and observability backends all introduce credentials. Production RAG needs explicit controls for secret storage, scoping, rotation, and auditability.

Required Controls

  • Use a secret manager or workload identity, never hardcoded env files committed to the repo.
  • Scope credentials per service and per connector rather than sharing one credential across an entire environment.
  • Rotate provider and connector tokens on a schedule and on incident.
  • Log secret access events and failed decrypt attempts.

Common Failures

  • Shared API key across ingestion, retrieval, and agent tools.
  • Long-lived connector tokens without revocation flow.
  • Secrets leaking into traces, prompts, or failed job payloads.
  • Rotation that breaks warm instances because caches never refresh credentials (see the refresh sketch below).
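A minimal sketch of rotation-aware secret access, assuming AWS Secrets Manager via boto3 (any secret manager client works similarly): warm instances re-fetch on a short TTL so rotations propagate without restarts.

import time
import boto3   # assumption: AWS Secrets Manager; swap in your secret manager client

_sm = boto3.client("secretsmanager")
_cache = {}                      # name -> (secret_value, fetched_at)
SECRET_TTL_SECONDS = 300         # re-fetch every 5 minutes so rotations propagate

def get_secret(name: str) -> str:
    cached = _cache.get(name)
    if cached and time.time() - cached[1] < SECRET_TTL_SECONDS:
        return cached[0]
    value = _sm.get_secret_value(SecretId=name)["SecretString"]
    _cache[name] = (value, time.time())
    return value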
62 / Secrets Management & Credential Rotation

RAG Framework Selection: What Each Is Best For

Framework choice should match the job. The wrong abstraction layer slows teams down just as much as the wrong model. Use this as a default selection guide, then override it only with clear constraints.

Framework / Approach Best For Why Use When
LlamaIndex Data indexing + retrieval Strong abstractions for ingestion, indexing, retrievers, node parsers, graph/property indexes, and retrieval composition. You need to stand up robust retrieval quickly without building every data primitive yourself.
LangChain Full LLM apps Broad ecosystem for prompts, tools, chains, agents, integrations, and app-level orchestration. You are building an end-to-end LLM product, not just a retriever.
Haystack Production pipelines Pipeline-oriented design, component composition, and strong production ergonomics for retrieval/generation systems. You want explicit, maintainable, production-ready pipeline graphs.
LangGraph / AutoGen Agents Stateful orchestration and multi-step agent workflows with tool use, branches, retries, and explicit control flow. You need agentic execution, not just one-pass RAG.
DSPy Auto-optimized pipelines Signature-driven modules and optimizers make it strong for prompt/program search and systematic quality tuning. You are iterating experimentally and want the pipeline to optimize itself against metrics.
Custom stack Performance + control Minimal overhead, exact ownership of latency, storage, auth, and reliability behavior. You have strict production constraints or framework abstraction is becoming the bottleneck.

Default Rule

Pick the highest-level framework that does not hide a production constraint you care about.

Migration Rule

Start with a framework, then peel off hot or risky components into custom services once the bottlenecks are proven.

Anti-Pattern

Do not use an agent framework to solve a retrieval problem, or a retrieval framework to solve orchestration complexity.

63 / RAG Framework Selection

Glossary of RAG Technical Terms

355 technical terms, tools, models, metrics, and concepts, organized alphabetically.


A

TermDefinition
A/B TestingComparing model variants in production by routing traffic splits and measuring metrics to determine which version performs better on grounding, latency, and user satisfaction.
Access ControlMechanism restricting who can query which documents; critical for multi-tenant RAG systems where different users have access to different knowledge bases.
AccuracyFraction of correct predictions out of total predictions; measures overall classification or retrieval quality.
ACL-sensitive cache keysCache keys incorporating access control preventing leakage.
Adaptive chunk countDynamically adjusts retrieved chunks by query complexity.
adversarial testingProbing systems with malicious inputs to find weaknesses.
Agentic RAGPattern where LLM agent autonomously decides when/how to retrieve, orchestrating multi-step loops rather than following a fixed pipeline.
AGREE approachAutomated grounding evaluation framework.
ALCE approachAttribution and loss-aware evaluation.
alert thresholdsBoundaries triggering notifications on metric violations.
ALiBi (Attention with Linear Biases)Positional encoding adding linear biases to attention scores for length extrapolation beyond training sequences.
all-mpnetSentence-transformer combining multiple pooling strategies for versatile embeddings.
amazon-neptuneAWS managed graph database for property graphs and RDF in Graph RAG.
ANN (Approximate Nearest Neighbor)Algorithms like HNSW and IVF that trade exactness for speed in vector search, enabling sub-linear retrieval.
anomaly detectionIdentifies unusual patterns suggesting failures.
answer correctnessEvaluates generated answer accuracy against ground truth.
Answer RelevancyRAGAS metric measuring how well the generated answer addresses the original question.
answer similarityCompares generated answers to references using embedding or semantic similarity.
AnswerCorrectnessLLM-based metric scoring generated answer accuracy and completeness.
Apache TikaJava library extracting text from 1000+ file formats with OCR support for multimodal RAG.
ArgoCDGitOps tool managing Kubernetes applications and RAG infrastructure changes.
Arize PhoenixML observability platform monitoring embeddings, LLM outputs, and performance drift.
asymmetric searchDifferent encodings for queries vs documents.
async processingNon-blocking operation handling.
Attention MechanismNeural component allowing tokens to selectively focus on other tokens via Q·K^T/√d → softmax → V.
audit trailsLogging retrieval/generation for compliance and transparency.
Autoregressive DecodingSequential generation conditioning each token on all previously generated ones.
Adaptive RAGRAG pattern that dynamically selects retrieval strategies based on query complexity — routing simple queries to direct retrieval and complex ones to multi-step.
Advanced RAGEnhanced RAG with query transformation, hybrid retrieval, reranking, context compression, and self-correction loops for production quality.
Agentic ChunkingUsing an LLM to decide chunk boundaries based on semantic content rather than fixed rules — highest quality but most expensive.
AnswerCorrectnessRAGAS metric combining factual correctness and semantic similarity of the generated answer against a ground-truth reference.
Asymmetric SearchRetrieval where queries and documents are encoded differently — short queries mapped to the same space as long documents.

B

TermDefinition
BatchingGrouping multiple queries for efficient parallel processing on GPU.
BEIRBenchmarking IR — zero-shot evaluation across 18 diverse retrieval datasets.
BentoMLFramework for productionizing and deploying ML models including embeddings.
BGE (BAAI General Embedding)Family of open-source embedding and reranker models.
bge-m3BAAI's multilingual embedding supporting dense, sparse, and colbert-style retrieval simultaneously.
Bi-EncoderModel that independently encodes queries and documents into separate vectors for fast retrieval.
binarizationConverts continuous to binary.
BLEUBilingual Evaluation Understudy — metric for evaluating generated text against references.
Bloom FilterProbabilistic data structure for fast membership testing with no false negatives.
blue-green deploymentParallel versions enabling instant rollback.
BM25Best Matching 25 — probabilistic sparse retrieval algorithm using TF-IDF-like scoring.
Binary QuantizationReducing embedding vectors to binary bits (0/1) for ultra-fast retrieval with ~32x memory reduction at moderate quality cost.

C

TermDefinition
CachingStoring computed results for reuse — semantic cache, exact cache, and embedding cache reduce latency and cost.
calibrationAdjusts confidence matching actual accuracy.
Canary DeploymentGradually routing traffic to a new model version while monitoring for regressions.
CARGO approachCascading grounding optimization.
Chain-of-ThoughtPrompting technique eliciting step-by-step reasoning before final answer.
ChromaLightweight open-source embedding database for AI applications.
ChunkingSplitting documents into smaller segments — strategies include fixed-size, recursive, semantic, sentence-window.
Circuit BreakerResilience pattern preventing cascading failures by short-circuiting calls to failing services.
CitationReference to a specific source passage supporting a generated claim.
Citation PrecisionFraction of inline citations that actually support their attached claim; target ≥0.80.
Citation RecallFraction of claims that have at least one valid supporting citation; target ≥0.75.
ClusteringGrouping similar items without labels — used for topic modeling and document organization.
CohereAI company providing embedding and reranking models via API.
ColBERTContextualized Late Interaction over BERT — 10-100x faster than cross-encoders.
Community DetectionAlgorithm like Leiden that identifies clusters of densely connected entities in knowledge graphs.
compliance and governancePolicies ensuring RAG meets regulatory requirements.
CompressionReducing context length before generation — extractive, abstractive, or hybrid.
confidence calibrationEnsures predicted confidence matches correctness.
Confidence taggingTags claims by credibility based on retrieval confidence.
confidence-based weightingWeights by model confidence scores.
connection poolingReuses connections reducing overhead.
Consensus answerCombines multiple answers via voting reducing individual hallucinations.
Consistency CheckingVerifying generated content agrees with source material.
content quality evaluationAssesses retrieved content quality.
Context InjectionAdding retrieved passages into the LLM prompt as grounding context.
context recallFraction of all relevant information successfully retrieved in top-K results.
Context StuffingAnti-pattern of including excessive context that confuses the model.
Context WindowMaximum tokens an LLM can process in one pass — determines how much retrieved context fits.
Contrastive LearningTraining embeddings by pulling similar pairs closer and pushing dissimilar pairs apart.
Cosine SimilaritySimilarity metric computing cos(θ) between two vectors; standard for embedding comparison.
CPU optimizationOptimizes for CPU and parallelism.
Cross-EncoderReranking model processing query-document pairs jointly via full cross-attention; more accurate but slower.
CypherNeo4j's graph query language used for structured graph retrieval in Graph RAG.
Code-Aware ChunkingChunking that respects code structure — splitting at function/class boundaries rather than mid-expression for technical documentation.
Context PrecisionRAGAS metric measuring the proportion of relevant retrieved chunks among all retrieved chunks — higher means less noise.
Context RecallRAGAS metric measuring the proportion of required information that was successfully retrieved from the knowledge base.
Contextual ChunkingAnthropic's approach prepending a short context summary to each chunk describing its position and role in the parent document.
ContextualCompressionRetrieverLangChain's wrapper combining a base retriever with a document compressor pipeline for automatic context reduction.
Corrective RAG (CRAG)RAG pattern that evaluates retrieval quality after each step and triggers alternative retrieval strategies when confidence is low.

D

TermDefinition
Data PoisoningAdversarial attack introducing corrupted data into the knowledge base to manipulate outputs.
data residencyData never leaves geographic regions or infrastructure.
DeBERTaDecoding-enhanced BERT — used as NLI model for grounding verification.
DecompositionBreaking complex queries into simpler sub-questions for independent retrieval.
DeepEvalEvaluation framework offering pre-built metrics for RAG without manual labels.
Dense EmbeddingHigh-dimensional continuous vector representing text semantics.
Dense RetrievalRetrieval using learned dense vectors where similarity = cosine/dot-product.
dependency scanningAutomated scanning for known vulnerabilities.
DiffbotWeb intelligence API providing entity extraction and knowledge graph construction from web content.
dimensionality reductionReduces features via PCA/SVD.
DisambiguationResolving ambiguity when the same term refers to different entities.
DiskANNMicrosoft's disk-based ANN algorithm enabling billion-scale vector search.
distance metricsSimilarity functions (cosine, L2, dot, Hamming).
distillation lossObjective comparing student to teacher.
distributed tracingRecords request paths across services for latency analysis.
diversity-based weightingBalances relevance and diversity.
DockerContainerization technology packaging RAG applications with dependencies.
Document LoaderComponent ingesting raw files into the pipeline — LangChain loaders, Unstructured.io, Apache Tika.
Document reorderingRearranges compressed documents putting most relevant content first.
Document ShardingPartitioning documents across nodes for horizontal scaling.
document-type routerRoutes queries to specialized pipelines by document type.
Dot ProductSum of element-wise multiplication — used as fast similarity metric for normalized vectors.

E

TermDefinition
ECoRAGEvidentiality-guided Compression for long-context RAG — 5-15x compression with 96-99% quality.
ElasticsearchDistributed search engine supporting both keyword and vector search.
element-aware parsingPreserves document structure (tables, code, lists) during parsing.
EmbeddingDense vector representation mapping text to continuous high-dimensional space.
Embedding ModelNeural network encoding text into fixed-size vectors for similarity comparison.
ensemble methodsCombines multiple models for robustness.
Entailment checkNLI-based verification confirming context entails generated claims.
Entity LinkingConnecting entity mentions to entries in a knowledge base or graph.
Entity RecognitionNER — identifying named entities and their types in text.
error budgetsAllowable errors before breaching SLAs.
euclidean distanceL2 distance between vectors.
Evaluation FrameworkSystematic approach for measuring RAG quality — RAGAS, ARES, custom suites.
Eventual ConsistencyDistributed system property where all nodes converge to consistent state over time.
Exact Match CacheCaching strategy storing results for identical query strings.
Exponential BackoffProgressively increasing wait time between retries to avoid overloading.
Extractive CompressionSelecting most relevant sentences/tokens from context without rewriting.
Embedding Drift DetectionMonitoring technique tracking how embedding model outputs change over time, triggering re-indexing or retraining when drift exceeds thresholds.

F

TermDefinition
FActScoreFact-level metric decomposing claims and scoring verifiable facts.
FAISSFacebook AI Similarity Search — library for efficient similarity search, supports CPU and GPU.
FaithfulnessCore grounding metric — fraction of generated claims supported by retrieved context; RAGAS target ≥0.85.
FalkorDBGraph database specialized for knowledge graphs and multi-hop reasoning in RAG.
Fallback strategiesAlternative approaches on low confidence.
Few-Shot LearningPerforming a task with minimal examples provided in the prompt.
FilteringSelecting subset of results based on metadata, relevance threshold, or safety criteria.
Fine-TuningAdapting a pretrained model to a specific task or domain with task-specific data.
FlashRankFast approximate reranker for initial filtering before expensive cross-encoders.
FP16 computationHalf-precision reducing memory.
FusionCombining results from multiple retrievers/rankers — typically via RRF or weighted scoring.
Fuzzy MatchingFinding approximately matching items allowing minor differences in spelling or phrasing.
FlagEmbeddingBAAI's training framework for state-of-the-art embedding and reranker models with support for retrieval-augmented fine-tuning.

G

TermDefinition
GPUGraphics Processing Unit — hardware for parallel computation powering embedding generation and LLM inference.
GrafanaVisualization platform creating dashboards from Prometheus and other metric sources.
Graph DatabaseDatabase storing data as nodes and relationships — Neo4j, Amazon Neptune, NebulaGraph.
Graph RAGRAG enhanced with knowledge graphs for multi-hop reasoning, entity disambiguation, and traceable answers — reduces hallucination 50-70%.
Graph TraversalNavigating connected nodes in a knowledge graph to find multi-hop answers.
GroundingAnchoring every LLM claim to specific evidence from retrieved documents — primary defense against hallucination.
gRPCGoogle's high-performance RPC framework for low-latency service communication.
GTE-Qwen (7B)Qwen-based general text embedding model supporting multiple languages and modalities.
GuardrailsInput/output validation rules enforcing safety, compliance, and quality — PII detection, topic filtering, toxicity checks.

H

TermDefinition
HallucinationLLM generating plausible but factually incorrect information; baseline RAG: 10-25%, with grounding: 3-10%.
hamming distanceDistance for binary strings.
hard negative miningSelects challenging negatives improving discrimination.
hard timeoutsMaximum operation duration limits.
hard veto rulesAbsolute blocking rules preventing certain responses.
harmfulnessEvaluates if generated content violates ethical, legal, or safety guidelines.
HaystackEnd-to-end RAG framework with retrieval, reranking, generation.
HelmKubernetes package manager enabling templated RAG infrastructure deployment.
HNSWHierarchical Navigable Small World — ANN algorithm building multi-layer graph for O(log N) search with high recall.
Hybrid RetrievalCombining dense/semantic and sparse/keyword retrieval via RRF fusion — production best practice.
HyDEHypothetical Document Embeddings — generates a hypothetical answer first, then embeds it as the query vector.
HNSW ef ParameterHNSW search parameter controlling beam width during query — higher ef means more accurate but slower search.
HNSW M ParameterHNSW build parameter controlling graph connectivity — higher M means better recall but more memory per node.

I

TermDefinition
IDFInverse Document Frequency — weighting factor reducing importance of common terms.
In-Context LearningModel learning from examples in the prompt without weight updates.
incident responseProcedures for detecting and resolving failures.
IndexData structure optimizing lookup — vector indexes like HNSW, IVF; keyword indexes like inverted index.
infrastructure as codeVersion-controlled infrastructure definitions.
Ingestion PipelineOffline workflow: load → parse → clean → chunk → embed → store in vector DB.
InstructorLarge embedding model pre-trained on diverse tasks with explicit instruction support for asymmetric search.
Intent RecognitionUnderstanding user's goal from their query to route to appropriate retrieval strategy.
Inverted IndexData structure mapping terms to documents containing them — backbone of keyword search.
IVFInverted File — ANN indexing that clusters vectors, searches only nearest clusters.
Instruction-Tuned EmbeddingsEmbedding models fine-tuned to follow task-specific instructions prepended to queries, improving retrieval for specific use cases.
IVF-PQCombined index using Inverted File clustering with Product Quantization — enables billion-scale vector search with reduced memory.

J

TermDefinition
JitterSmall random delay added to prevent thundering herd problems in distributed systems.

K

TermDefinition
KNNK-Nearest Neighbors — finding K closest vectors to a query in embedding space.
knowledge distillationTrains student models to mimic teachers.
Knowledge GraphStructured entity-relationship representation enabling multi-hop reasoning in Graph RAG.
knowledge transferLeverages pre-trained models for downstream tasks.
KServeKubernetes-native platform deploying embeddings and LLM models at scale.
KubernetesContainer orchestration deploying, scaling, and managing RAG services in production.
KV cacheStores key/value matrices from previous tokens.

L

TermDefinition
lambda lossLearning-to-rank loss optimizing metrics.
LangChainFramework for building LLM applications — provides document loaders, text splitters, retrievers, chains, agents.
LangFuseOpen-source LLM observability platform with tracing, metrics, and cost analysis.
LangSmithLangChain's tracing and monitoring platform for debugging LLM applications in production.
late interactionToken-level interactions at reranking stage.
LatencyTime from query submission to response delivery — measured as P50/P95/P99 percentiles.
Leiden AlgorithmCommunity detection algorithm used in Graph RAG for hierarchical clustering of entities.
listwise rankingRanks entire lists jointly.
LlamaIndexData framework for LLM apps — VectorStoreIndex, PropertyGraphIndex, LongLLMLinguaPostprocessor.
LLMGraphTransformerConstructs knowledge graphs from documents using LLM.
LLMLinguaMicrosoft's prompt compression: v1 perplexity-based 20x compression; v2 token classification 3-6x faster.
Load BalancingDistributing requests across servers — round-robin, least connections, weighted.
LongLLMLinguaRAG-optimized compression with question-aware coarse-to-fine, document reordering, dynamic ratios.
LoRALow-rank fine-tuning with minimal parameters.
Lost-in-the-MiddlePhenomenon where LLMs disproportionately attend to beginning/end of long contexts, ignoring middle.
low-rank approximationApproximates with lower rank.
Late ChunkingChunking strategy that first embeds the full document, then segments into chunks preserving cross-boundary context in the embeddings.
LongLLMLinguaPostprocessorLlamaIndex's node postprocessor integrating LLMLingua compression directly into the query pipeline.

M

manhattan distance: L1 distance summing absolute coordinate differences.
MAP: Mean Average Precision — for each query, average the precision at each relevant document's rank, then average over all queries.
matrix factorization: Decomposes a matrix into lower-dimensional factors.
matryoshka representation learning: Trains embeddings that remain usable when truncated to multiple smaller dimensionalities.
Maximal Marginal Relevance: MMR — balancing relevance and diversity in retrieved results to reduce redundancy (see the sketch after this section).
Metadata Filtering: Pre-filtering vector search by structured fields: date, source, category, access level.
metric collection: Systematic gathering of performance metrics across systems.
Milvus: Open-source vector database for scalable similarity search with HNSW, IVF, DiskANN indexes.
MiniLM: Compact Transformer family — all-MiniLM-L6-v2 is popular for fast production embedding.
MLflow: ML lifecycle platform for experiment tracking and model registry.
MMR: Maximal Marginal Relevance — see above.
model provenance: Tracking model origin, training data, and modifications.
model routing decisions: Selects models by query type or constraints.
Monitoring: Continuous observation of system health: latency, throughput, quality metrics, error rates.
MRR: Mean Reciprocal Rank — average of 1/rank of the first relevant result across queries.
MTEB: Massive Text Embedding Benchmark — standard leaderboard across 8 tasks and 50+ datasets.
Multi-Hop Reasoning: Answering questions requiring traversal across multiple connected facts or documents.
Multi-Query Retrieval: Generating multiple rephrasings of a query, retrieving for each, and deduplicating results.
Multi-Tenancy: Single vector database instance serving multiple isolated organizations/users with separate data partitions and access controls.
Multilingual E5: E5 family supporting 100+ languages for cross-lingual RAG and multilingual retrieval.
mxbai-rerank: Mixedbread AI reranker providing efficient ranking of retrieved documents.
Markdown Header Chunking: Splitting documents at markdown header boundaries (H1, H2, H3) to create topically coherent chunks matching document structure.
Modular RAG: Architecture decomposing RAG into interchangeable modules (retrieval, reranking, compression, generation) that can be independently upgraded or swapped.

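A minimal MMR sketch, assuming unit-normalized NumPy embeddings for the query and candidates; lambda_mult trades relevance against diversity and is an illustrative default:

import numpy as np

def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.7):
    # Greedily pick documents that are relevant to the query but dissimilar
    # to documents already selected (all vectors assumed unit-normalized).
    relevance = doc_vecs @ query_vec
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            chosen = doc_vecs[selected]
            best = max(
                remaining,
                key=lambda i: lambda_mult * relevance[i]
                - (1 - lambda_mult) * float(np.max(chosen @ doc_vecs[i])),
            )
        selected.append(best)
        remaining.remove(best)
    return selected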
N

Namespaces: Logical partitions within a vector database organizing data by tenant, project, or use case for isolated retrieval.
NDCG@K: Normalized Discounted Cumulative Gain — ranking metric that weights higher positions more heavily via a logarithmic discount (see the sketch after this section).
NebulaGraph: Distributed graph database optimized for large-scale knowledge graphs.
NER: Named Entity Recognition — identifying people, organizations, locations in text.
NLI: Natural Language Inference — entailment classification used for grounding verification via DeBERTa-MNLI.
Nomic (embed model): Open-source embedding model optimized for long-context sequences up to 8K tokens.
Normalization: Standardizing vectors to unit length for cosine similarity, or standardizing data formats.
Nucleus Sampling: Top-P sampling — selecting from the smallest token set exceeding cumulative probability P.
Naive RAG: The simplest RAG pattern: retrieve top-K chunks, concatenate into the prompt, generate an answer. No reranking, no query transformation, no self-correction.

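A minimal NDCG@K sketch over binary relevance labels, where relevances is an assumed 0/1 label per ranked result:

import math

def dcg(relevances, k):
    # Gain discounted by log2 of the (1-indexed) position + 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize the ranking's DCG by the DCG of the ideal (sorted) ordering.
    ideal_dcg = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# ndcg_at_k([1, 0, 1, 0, 0], k=5) ≈ 0.92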
O

Observability: Understanding system internal state from metrics, logs, and distributed traces.
OpenTelemetry: Observability framework collecting distributed traces and metrics from RAG systems.
ORTModel: ONNX Runtime model interface for hardware-optimized inference.
Overlap: Duplication between adjacent chunks in sliding-window chunking to preserve cross-boundary context (see the chunker sketch after this section).
OWASP: Open Web Application Security Project — LLM Top 10 threats for RAG security.

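A minimal sliding-window chunker sketch illustrating overlap; sizes are in characters for simplicity, whereas production chunkers usually count tokens:

def sliding_window_chunks(text, chunk_size=800, overlap=200):
    # Step through the text with a fixed-size window; each chunk repeats the
    # last `overlap` characters of the previous one to preserve boundary context.
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, max(len(text) - overlap, 1), step)]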
P

PagedAttention: vLLM's memory management technique that pages the KV cache like virtual memory for efficient batching.
pairwise ranking: Compares document pairs to learn relative relevance.
parameter-efficient: Fine-tuning that updates only a small fraction of parameters versus full tuning.
Parent Document Retrieval: Searching on small chunks but returning the full parent document for complete context.
Passage Ranking: Ordering text passages by relevance to a query.
pdfplumber: Python library for precise PDF text extraction and table parsing with layout awareness.
pgvector: PostgreSQL extension for vector similarity search — convenient when already using Postgres.
PII: Personally Identifiable Information — must be detected and redacted from documents and outputs.
Pinecone: Managed cloud vector database with serverless and pod-based deployment.
Pipeline: Sequence of processing stages — ingestion pipeline, query pipeline, evaluation pipeline.
Pointwise Ranking: Scoring each document independently, versus pairwise or listwise approaches.
Precision: Fraction of retrieved items that are relevant.
Preprocessing: Data cleaning steps before indexing: normalize unicode, remove boilerplate, extract text from formats.
Prometheus: Time-series metrics database collecting system and application performance data.
Prompt Engineering: Designing effective prompts with system instructions, few-shot examples, and constraints.
Prompt Injection: Adversarial attack embedding malicious instructions in documents or queries — top OWASP threat.
Prompt Tuning: Learning task-specific soft tokens prepended to the input.
PromptCompressor: LLMLingua's prompt compression class, applied to retrieved context (LangChain and LlamaIndex provide wrappers).
Pruning: Removing unnecessary model weights for compression and speed.
Parent-Child Retrieval: Indexing small child chunks for precise matching but returning the larger parent document for complete generation context.
Product Quantization (PQ): Vector compression technique factorizing high-dimensional space into independent low-dimensional subspaces, each quantized separately (see the sketch after this section).

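A rough Product Quantization sketch using scikit-learn's KMeans as the per-subspace quantizer; it assumes the embedding dimension divides evenly into the subspaces and that enough training vectors exist for 256 centroids (sizes are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, n_subspaces=8, n_centroids=256):
    # Split each vector into equal subvectors and learn a small codebook per subspace.
    subvecs = np.split(vectors, n_subspaces, axis=1)
    return [KMeans(n_clusters=n_centroids, n_init=10).fit(sv) for sv in subvecs]

def encode_pq(vectors, codebooks):
    # Each vector becomes one centroid id per subspace: e.g. a 768-dim float32
    # vector (3072 bytes) compresses to 8 uint8 codes (8 bytes).
    subvecs = np.split(vectors, len(codebooks), axis=1)
    codes = [cb.predict(sv) for cb, sv in zip(codebooks, subvecs)]
    return np.stack(codes, axis=1).astype(np.uint8)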
Q

Qdrant: Vector database with advanced filtering, payload indexing, and hybrid search.
QLoRA: Quantized LoRA combining compression and efficiency.
quality regression detection: Detects drops in accuracy or relevance in production.
Quantization: Reducing model precision to decrease memory and increase speed — GPTQ, AWQ, GGUF.
Query Decomposition: Breaking complex queries into simpler sub-questions for independent retrieval and synthesis.
Query Expansion: Enriching queries with synonyms, related terms, or LLM-generated reformulations.
Query Rewriting: Transforming queries for better retrieval — conversational-to-standalone, typo correction, clarification.
Query Routing: Classifying queries and directing them to specialized retrieval backends or strategies — e.g., keyword search for codes/IDs, semantic search for concepts (see the sketch after this section).

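A minimal rule-based query routing sketch; the regex heuristic and the keyword_search/semantic_search callables are illustrative placeholders, not a prescribed design:

import re

def route_query(query, keyword_search, semantic_search):
    # Queries dominated by quoted phrases, codes, or long numeric IDs go to
    # keyword/BM25 retrieval; open-ended questions go to semantic retrieval.
    looks_exact = bool(re.search(r'"[^"]+"|\b[A-Z]{2,}-?\d+\b|\b\d{4,}\b', query))
    return keyword_search(query) if looks_exact else semantic_search(query)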
R

RAG: Retrieval-Augmented Generation — architecture combining document retrieval with LLM generation for grounded answers.
RAGAS: RAG Assessment — evaluation framework scoring faithfulness, answer relevancy, context precision/recall.
ranking algorithms: Methods ordering results by relevance (BM25, neural rankers, learning-to-rank).
Rate Limiting: Controlling request frequency to prevent system overload.
Ray Serve: Distributed serving framework scaling RAG models across multiple nodes.
Recall@K: Fraction of all relevant documents that appear in the top-K results; a common target is ≥0.90.
Reciprocal Rank Fusion: RRF — combining ranked lists from multiple retrievers: score = Σ 1/(k + rank_i); standard for hybrid search (see the sketch after this section).
RECOMP: Trained compression: extractive variant selects sentences; abstractive variant generates summaries; 5-20x compression.
recursive character splitting: Splits text recursively on an ordered list of delimiters (paragraphs, sentences, words) so chunks respect semantic units.
Red-Teaming: Adversarial testing to discover vulnerabilities — prompt injection, jailbreaks, data extraction.
redundancy reduction: Deduplicates retrieved results to avoid repetition in the context.
regression detection: Automated alerting when metrics fall below baselines.
regulatory requirements: Legal constraints (GDPR, HIPAA, SOC2) affecting design.
Relevance: Degree to which a retrieved document addresses the user's information need.
Reranker: Model rescoring retrieved documents for better ranking — cross-encoders like BGE-reranker, mxbai-rerank.
Reranking: Re-ordering initially retrieved results using a more accurate but slower model.
retraining triggers: Metrics or thresholds that initiate model retraining.
Retry Logic: Automatically re-attempting failed operations with backoff, often varying the strategy between attempts.
ROUGE: Recall-Oriented Understudy for Gisting Evaluation — metric for evaluating summarization quality.
RRF: Reciprocal Rank Fusion — see above.
RAG-Fusion: Query transformation technique generating multiple query variants, retrieving for each, and fusing results via Reciprocal Rank Fusion for improved recall.

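A minimal Reciprocal Rank Fusion sketch implementing the score formula above; ranked_lists holds document IDs best-first from each retriever, and k=60 is the commonly cited smoothing constant:

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    # score(doc) = sum over retrievers of 1 / (k + rank), with rank starting at 1.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]])[:2] -> ["d2", "d1"]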
S

Sampling: Selecting tokens during generation — temperature, top-k, top-p/nucleus control diversity.
score gap analysis: Analyzing the score difference between the top-1 and lower-ranked results to decide whether reranking is needed.
secrets management: Secure storage and rotation of credentials and tokens.
Self-Consistency: Grounding technique: generate N responses, keep claims appearing in ≥60% — reduces hallucination 40-55%.
Self-Correction Loop: RAG pattern evaluating output quality and retrying retrieval/generation if below threshold.
Semantic Cache: Caching results for semantically similar queries using an embedding similarity threshold (see the sketch after this section).
Semantic Chunking: Splitting at natural topic boundaries using embedding similarity between adjacent sentences.
Semantic Search: Retrieval based on meaning rather than keyword matching, using dense embeddings.
Sentence Transformers: Python library for computing dense sentence embeddings with pre-trained transformer models; powers most embedding pipelines and semantic search.
Sentence Window Retrieval: Indexing individual sentences but returning the surrounding window of ±N sentences for context.
SetFit: Few-shot learning framework enabling supervised embedding fine-tuning with minimal labeled data.
Sharding: Partitioning data across multiple nodes for horizontal scaling.
similarity metrics: Functions measuring vector similarity (cosine, dot product, Euclidean).
SLA: Service Level Agreement — contractual performance guarantees for latency, uptime, accuracy.
Sliding Window: Chunking strategy using a fixed-size window with overlap stepping across the document.
soft targets: Probabilistic targets from a teacher model, versus hard labels.
Softmax: Function converting logits to a probability distribution summing to 1.
SpaCy: Industrial NLP library for entity recognition, dependency parsing, and document preprocessing.
Sparse Retrieval: Keyword-based retrieval using BM25/TF-IDF term matching — excels at exact terms, acronyms, proper nouns.
speculative decoding: Drafts tokens with a small model and verifies them in parallel with the large model, reducing latency.
SPLADE: Sparse Lexical and Expansion model — learned sparse retrieval combining term matching with expansion.
Step-Back Prompting: Generating a more abstract query version before retrieval for broader context.
student model: Smaller model trained to mimic a teacher.
Sub-Question Decomposition: Breaking multi-part queries into simpler questions for independent retrieval.
supply chain security: Evaluating the security of dependencies and models.
Symmetric Search: Retrieval where queries and documents are encoded identically — used for similar-document finding, deduplication, and clustering.
Scalar Quantization: Reducing embedding precision from FP32 to INT8 or lower, achieving 4x memory reduction with minimal quality loss.
Self-RAG: RAG pattern where the LLM decides when to retrieve, what to retrieve, and self-evaluates whether retrieved passages are relevant before generating.

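A minimal semantic cache sketch, assuming an embed function that returns unit-normalized NumPy vectors; the 0.95 similarity threshold is illustrative and should be tuned per workload:

import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed, self.threshold = embed, threshold
        self.entries = []                      # list of (query_vector, cached_answer)

    def get(self, query):
        # Return a cached answer if any stored query is similar enough to this one.
        q = self.embed(query)
        for vec, answer in self.entries:
            if float(np.dot(vec, q)) >= self.threshold:
                return answer
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))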
T

T2 escalation: Tier-2 support ticket escalations, used as an operational signal of embedding drift.
T3 rejection: System rejections of low-confidence responses.
teacher model: Large model used to train smaller student models.
Temperature: Parameter scaling logits: higher = more random/diverse, lower = more deterministic/focused (see the sampling sketch after this section).
TensorRT-LLM: NVIDIA's inference optimization engine with optimized GPU kernels for LLM serving.
Terraform: Infrastructure-as-code tool for provisioning cloud resources for RAG systems.
TF-IDF: Term Frequency-Inverse Document Frequency — classical term weighting scheme for keyword retrieval.
Threat Model: Systematic analysis of security risks — OWASP LLM Top 10 covers injection, data leakage, excessive agency.
Throughput: Requests processed per unit time — tokens/sec for LLMs, queries/sec for retrieval.
tier-based retrieval: Routes queries to different strategies by complexity and confidence.
Token: Fundamental text unit in LLMs — subword pieces produced by tokenizers; ~0.75 English words per token.
Tokenization: Converting text into tokens via BPE, SentencePiece, or WordPiece algorithms.
Top-K: Returning the K most similar results from vector search; also a sampling strategy limiting to the K highest-probability tokens.
toxicity detection: Identifies harmful or abusive content for filtering.
Triton Inference Server: NVIDIA's production model serving with dynamic batching, model ensembles, multi-GPU.
TruLens: Feedback framework for evaluating and improving RAG systems with LLM-based metrics.
TruthfulQA: Benchmark evaluating truthfulness on challenging factual questions and common misconceptions — important for RAG quality assessment.
TTL: Time To Live — cache expiration duration after which entries are refreshed.
text-embedding-3-large: OpenAI's highest-quality text embedding model (3072 dimensions) for dense retrieval in RAG systems.
text-embedding-3-small: OpenAI's compact embedding model (1536 dimensions) balancing quality and speed for cost-efficient production RAG.

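A minimal sketch of temperature scaling combined with nucleus (top-p) sampling over a NumPy array of logits; the parameter values are illustrative:

import numpy as np

def sample_token(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng()):
    # Temperature rescales logits (lower = sharper), softmax turns them into
    # probabilities, and top-p keeps the smallest token set whose cumulative
    # probability exceeds p before sampling from it.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))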
U

uncertainty estimation: Quantifies model confidence and ambiguity.
Unstructured.io: Platform processing diverse file types with element-aware parsing and metadata extraction.
uptime requirements: Availability targets (e.g., 99.99%) for services.

V

Vector: Ordered array of numbers representing a point in high-dimensional space.
Vector Database: Specialized database for storing, indexing, and searching embeddings — Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector.
Vector Search: Finding nearest neighbors in embedding space using ANN algorithms.
Vectorization: Converting text to numerical vectors via embedding models.
vendor security audit: Security evaluation before integrating external services.
version tracking: Maintains model versions and performance.
vLLM: High-throughput inference engine using PagedAttention and continuous batching — 10-50x faster than naive HuggingFace.
Voyage AI: Commercial embedding API providing Voyage-large and Voyage-code models optimized for enterprise retrieval tasks.

W

Warm-Up: Initial cache/index loading phase before the system reaches peak performance.
Weaviate: Open-source vector database with built-in vectorization, hybrid search, and GraphQL API.
weighted aggregation: Combines scores or results using importance weights.
Weights & Biases: Experiment tracking platform for RAG training and evaluation runs.
WhyLabs: Model monitoring platform for tracking embedding quality and anomaly detection.

Z

Zero-Shot Learning: Performing tasks without task-specific training examples — relying on the model's general knowledge.
Cross-Reference: For a unified glossary covering ALL LLM topics beyond RAG, see the unified LLM Glossary with 140+ terms across all documents.
64 / Glossary