Advanced Production-Grade RAG Pipeline Implementation
Building enterprise-ready retrieval-augmented generation systems with semantic search, adaptive policies, and self-correction loops
Embedding Models · Self-Correction Loops · MLOps
22 comprehensive sections covering architecture, implementation, and production deployment
What is RAG?
RAG is not a single model but a complete AI application architecture that combines retrieval systems with language models to ground responses in external knowledge.
Three Pillars
- ✓ Multi-stage Retrieval — Sparse, dense, and hybrid retrieval with reranking
- ✓ Adaptive Policies — Context-aware retrieval strategies
- ✓ Self-Correction Loops — Reflection and iterative refinement
Enterprise Risks
Beyond hallucinations, consider:
- ⚠ Permission leakage
- ⚠ Prompt injection attacks
- ⚠ Data poisoning
- ⚠ Unbounded cost/latency
- ⚠ Silent quality regressions
Use Cases
- 📄 Employee Knowledge Work — Internal docs, wikis
- 🤝 Customer Support — FAQs, tickets, logs
- 📊 Structured+Unstructured — Reports, databases, forms
- 🎨 Multimodal Knowledge — PDFs, images, videos
Indexing Plane
Offline: Data ingestion, parsing, chunking, embedding, and vector storage. Built once, queried many times.
Serving Plane
Online: Query processing, retrieval, reranking, LLM inference, and safety checks. Low-latency, high-throughput.
Full Architecture: Two Planes + Governance
Complete end-to-end RAG architecture with indexing, serving, and governance layers
Document Ingestion: Connectors + Contracts
Reliable data onboarding with standardized contracts, multi-format support, and three-speed indexing
Canonical Document Contract
Every document must conform to this schema for consistent retrieval and governance:
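The schema itself is not reproduced above; here is a minimal sketch of such a contract as a Pydantic model. All field names (doc_id, acl, content_hash, and so on) are illustrative assumptions, not a fixed standard:

```python
# Hedged sketch of a canonical document contract.
# Field names are illustrative assumptions, not a fixed standard.
from datetime import datetime, timezone
from pydantic import BaseModel, Field

class CanonicalDocument(BaseModel):
    doc_id: str                     # stable, globally unique ID
    source: str                     # originating connector/system
    content_type: str               # "pdf", "html", "markdown", ...
    title: str
    text: str                       # parsed, normalized body text
    language: str = "en"
    acl: list[str] = Field(default_factory=list)  # roles allowed to read
    metadata: dict = Field(default_factory=dict)  # section, page, region...
    content_hash: str               # drives change detection / upserts
    ingested_at: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```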
Connector Types & Data Sources
Enterprise Document Parsing
Apache Tika, Unstructured.io — Parse PDFs, DOCX, images with layout preservation and OCR support
Structured Data
CRM, ERP, Databases — Direct queries, or treat as "tool use" for on-demand retrieval; materialize knowledge views for indexing
Streaming & CDC
Debezium → Kafka — Real-time event streams from databases; capture inserts, updates, deletes
Web Content
Crawlers with Compliance — Respect robots.txt, rate limits, GDPR; extract HTML/JSON with link tracking
Multimodal Sources
OCR + Image Embeddings — Extract text from images, create vision embeddings; preserve layout
Custom Connectors
Plugin API — Implement standardized interface for proprietary systems, internal APIs, legacy apps
Three-Speed Indexing Model
Batch Rebuilds
Full reindexing of large datasets weekly/monthly; highest throughput, controlled resources. Use for bulk imports, historical data.
Incremental Upserts
Append new chunks, update modified docs via change detection; moderate latency (seconds). Triggered by scheduled jobs or webhooks.
Real-Time Streams
Event-driven CDC or message queue ingestion; sub-second latency for hot data. Use for live chat logs, sensor feeds, user events.
Element-Aware PDF Parsing
Extract with positional metadata (page, bbox, reading order). Preserve table structure and images as intact units. Enables citation anchoring.
Dead Letter Queue Pattern
Send unparseable docs to DLQ for manual inspection; enable retry with fallback parsers or human review. Never silently drop data.
ProductionIngester Example
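The example code is not reproduced above; a minimal sketch of what such an ingester might look like, combining a parser registry with the dead letter queue pattern described earlier (the raw_doc, parser, doc_store, and DLQ interfaces are assumptions):

```python
# Hedged sketch: parser registry + dead letter queue, per the DLQ
# pattern above. All interfaces here are assumptions.
import logging

logger = logging.getLogger("ingest")

class ProductionIngester:
    def __init__(self, parsers: dict, doc_store, dead_letter_queue):
        self.parsers = parsers          # e.g. {"pdf": TikaParser(), ...}
        self.doc_store = doc_store
        self.dlq = dead_letter_queue

    def ingest(self, raw_doc) -> bool:
        parser = self.parsers.get(raw_doc.content_type)
        if parser is None:
            self.dlq.send(raw_doc, reason="no_parser")
            return False
        try:
            doc = parser.parse(raw_doc)   # -> canonical document
            self.doc_store.upsert(doc)    # idempotent via content_hash
            return True
        except Exception as exc:
            # Never silently drop data: route failures to the DLQ
            # for retry with a fallback parser or human review.
            logger.warning("parse failed for %s: %s", raw_doc.id, exc)
            self.dlq.send(raw_doc, reason=str(exc))
            return False
```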
Chunking Strategies
Chunk for retrieval (findability) and store separate representations for generation (readability)
| Strategy | Description | Best For | Trade-offs |
|---|---|---|---|
| Fixed-Size | Split at token/word boundary | Predictable, simple baseline | May split sentences; low semantic coherence |
| Recursive | Split recursively by delimiters (newline, paragraph, sentence) | Structured documents, code | Still may cut semantically important boundaries |
| Semantic | Embed sentences, split at embedding distance threshold | Narrative text, research papers | Expensive; latency + cost for embedding all chunks |
| Document-Structure | Respect sections, headings, tables, code blocks | Mixed-format documents (PDFs, Markdown) | Requires parser awareness |
| Agentic/LLM | Use LLM to decide breaks and chunk metadata | Complex domain logic, multilingual | High cost and latency; not real-time |
| Sliding Window | Overlapping fixed-size chunks with stride | Preserve local context, boundary queries | Higher storage; redundant retrieval |
| Parent-Child (Sentence-Window) | Store fine-grained chunks; expand with surrounding context at retrieval | Precision + context balance | Requires two-stage retrieval; complex indexing |
SemanticChunker: Embedding Similarity Breakpoints
- • Chunk size: 256–512 tokens (optimal for retrieval + generation trade-off)
- • Overlap: 10–15% to preserve boundary context
- • Metadata inheritance: Propagate doc_id, section, source to every chunk
- • Context enrichment: Prepend section headers or document title to chunk
- • Element-aware parsing: Preserve tables, code blocks, images as intact units in PDFs
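A minimal sketch of the SemanticChunker named above, splitting wherever the cosine similarity between consecutive sentences drops below a threshold. The model and threshold mirror the production pipeline below; the regex sentence splitter is a simplification, and enforcement of the token budget is omitted:

```python
# Hedged sketch of embedding-similarity breakpoint chunking.
import re

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticChunker:
    def __init__(self, model="all-MiniLM-L6-v2", threshold=0.5):
        self.encoder = SentenceTransformer(model)
        self.threshold = threshold

    def split(self, text: str) -> list[str]:
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        if len(sentences) <= 1:
            return [text]
        embs = self.encoder.encode(sentences, normalize_embeddings=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            # Cosine similarity of consecutive sentences (unit vectors)
            sim = float(np.dot(embs[i - 1], embs[i]))
            if sim < self.threshold:      # semantic breakpoint
                chunks.append(" ".join(current))
                current = []
            current.append(sentences[i])
        chunks.append(" ".join(current))
        return chunks
```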
Recommended: Hybrid Multi-Layer Chunking for Production
No single chunking strategy works for all document types. Production systems use a document-type router that selects the best chunking strategy per document, combined with a parent-child indexing pattern that stores small chunks for precise retrieval but returns larger context windows for generation.
Production Chunking Pipeline
class ProductionChunkingPipeline:
"""Route-aware, parent-child, metadata-enriched."""
def __init__(self):
self.router = DocTypeRouter()
self.parsers = {
"pdf": UnstructuredParser(strategy="hi_res"),
"markdown": MarkdownHeaderSplitter(),
"html": HTMLSectionSplitter(),
"code": ASTChunker(), # tree-sitter
"plaintext": SemanticChunker(),
}
self.semantic = SemanticChunker(
model="all-MiniLM-L6-v2",
max_tokens=384, # child chunk size
threshold=0.5,
)
def process(self, doc: Document) -> list[Chunk]:
# Step 1: Route to parser
doc_type = self.router.classify(doc)
elements = self.parsers[doc_type].parse(doc)
# Step 2: Create parent sections
parents = self.group_into_sections(elements)
# Step 3: Split parents into child chunks
all_chunks = []
for parent in parents:
children = self.semantic.split(parent.text)
for i, child_text in enumerate(children):
chunk = Chunk(
text=child_text,
parent_id=parent.id,
parent_text=parent.text, # stored separately
position=i,
metadata=self.enrich(doc, parent, child_text),
)
all_chunks.append(chunk)
return all_chunks
def enrich(self, doc, parent, text):
"""Prepend context + propagate metadata."""
return {
"doc_id": doc.id,
"source": doc.source,
"section": parent.heading,
"page": parent.page_num,
"doc_type": doc.content_type,
"indexed_at": datetime.utcnow(),
# Prepended for better retrieval:
"enriched_text": (
f"{doc.title} > {parent.heading}\n"
f"{text}"
),
        }
Parent-Child Retrieval at Query Time
class ParentChildRetriever:
def search(self, query, top_k=5):
# 1. Search CHILD chunks (precise)
children = self.vector_db.search(
query, top_k=top_k * 3 # over-fetch
)
# 2. Expand to PARENT sections
parent_ids = set(c.parent_id for c in children)
parents = self.doc_store.get_parents(parent_ids)
# 3. Deduplicate + rank parents by
# best child match score
scored = {}
for child in children:
pid = child.parent_id
if pid not in scored or child.score > scored[pid]:
scored[pid] = child.score
ranked = sorted(
parents, key=lambda p: scored[p.id],
reverse=True
)[:top_k]
        return ranked  # full parent context
Chunk Size Guide by Document Type
| Doc Type | Child (Search) | Parent (Context) | Strategy |
|---|---|---|---|
| Product docs | 128–256 tok | 512–1024 tok | Heading-based + semantic |
| Legal / Policy | 256–384 tok | 1024–2048 tok | Section-based, keep clauses intact |
| Research papers | 256–512 tok | 1024–2048 tok | Semantic breakpoints |
| FAQ / KB | Whole Q&A pair | Same (no parent) | Question-Answer as unit |
| Code | Function/class | File or module | AST-aware (tree-sitter) |
| Chat logs | Single turn | Full conversation | Turn-based splitting |
| Tables / CSV | Row group | Full table + header | Keep header with every chunk |
Why Parent-Child Wins
Problem: Small chunks retrieve precisely but lose context. Large chunks give context but pollute retrieval with irrelevant text.
Solution: Index small (128–256 tok) for search precision. At retrieval time, expand to the parent section (512–1024 tok) for coherent LLM context. Best of both worlds.
Context Enrichment (Prepending)
Prepend the document title and section heading to each chunk before embedding. This dramatically improves retrieval for ambiguous queries.
# Without enrichment:
"Returns are accepted within 30 days."
# With enrichment:
"Product Policy > Returns & Refunds\n"
"Returns are accepted within 30 days."
# Now retrieves for "return policy" queries
Tools for Production
Parsing: unstructured.io (hi_res), LlamaParse, Docling
Splitting: LangChain RecursiveCharacterTextSplitter, LlamaIndex SentenceWindowNodeParser
Semantic: Sentence Transformers + custom breakpoint
Code: tree-sitter (AST), CodeSplitter
Parent-Child: LlamaIndex ParentDocumentRetriever, custom doc_store + vector_db combo
Embedding Models & Strategies
| Model Family | Type | Notable Capabilities | Operational Considerations | Cost/Latency |
|---|---|---|---|---|
| OpenAI text-embedding-3 | API | small/large variants; dimension shortening; multilingual | Quota limits; regional latency | $0.02/M tokens (small); higher for large |
| Cohere Embed v3/v4 | API | Multilingual + multimodal (text+image); fine-tuning available | Document and query encoding modes | $0.10/1M tokens |
| BGE-M3 | Open-source (HuggingFace) | Multi-lingual multi-function (dense+sparse+multi-vector) | 8192 token context; self-hosted overhead | Free; requires GPU infrastructure |
| Multilingual E5 | Open-source | Strong multilingual; published training/eval methodology | Community-maintained; good reproducibility | Free; 2–5ms per chunk on A100 |
| GTE-Qwen2 (7B) | Open-source | State-of-the-art; 131K context window | Larger model; requires more VRAM | Free; ~20ms/chunk on A100 |
| voyage-3-large | API | Long-context (128K) + code understanding | Premium pricing; excellent for code RAG | $0.15/1M tokens |
| nomic-embed-text-v1.5 | Open-source | Matryoshka embeddings; dimension flexibility | Efficient storage; truncation-stable | Free; 3–4ms latency on CPU |
EmbeddingService: Caching, Rate-Limiting, Batch Processing
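The EmbeddingService code is not reproduced above; a minimal sketch covering the three concerns in the heading, with an async OpenAI client and a Redis cache (the key scheme, batch size, and semaphore-based rate limiting are assumptions):

```python
# Hedged sketch: cache + batching + crude concurrency limit.
import asyncio
import hashlib
import json

class EmbeddingService:
    def __init__(self, client, redis, model="text-embedding-3-small",
                 batch_size=128, max_concurrent=4):
        self.client = client                   # e.g. openai.AsyncOpenAI()
        self.redis = redis                     # redis.asyncio client
        self.model = model
        self.batch_size = batch_size
        self.sem = asyncio.Semaphore(max_concurrent)  # rate limiting

    def _key(self, text: str) -> str:
        h = hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()
        return f"emb:{h}"

    async def embed(self, texts: list[str]) -> list[list[float]]:
        out: list = [None] * len(texts)
        misses = []
        for i, text in enumerate(texts):       # cache lookups first
            cached = await self.redis.get(self._key(text))
            if cached:
                out[i] = json.loads(cached)
            else:
                misses.append(i)
        for start in range(0, len(misses), self.batch_size):
            batch = misses[start:start + self.batch_size]
            async with self.sem:               # bound concurrent API calls
                resp = await self.client.embeddings.create(
                    model=self.model, input=[texts[i] for i in batch]
                )
            for i, item in zip(batch, resp.data):
                out[i] = item.embedding
                await self.redis.set(self._key(texts[i]),
                                     json.dumps(item.embedding))
        return out
```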
Matryoshka Embeddings
Truncate high-dimensional embeddings to lower dimensions without retraining. Trade-off: storage and compute savings proportional to the truncation (halving the dimensions halves vector storage) vs. slight accuracy loss.
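A minimal sketch of the truncation trick (only meaningful for models trained with a Matryoshka objective, such as nomic-embed-text-v1.5 from the table above):

```python
# Hedged sketch: truncate a Matryoshka embedding and re-normalize.
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    truncated = vec[:dims]
    # Restore unit norm so cosine/dot-product scoring stays valid
    return truncated / np.linalg.norm(truncated)
```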
Fine-Tuning with Contrastive Learning
Train embeddings on domain-specific relevance pairs using triplet loss. Improves domain-specific retrieval by 15–30% with 5K–50K labeled pairs.
Instruction-Tuned Embeddings
Prepend task instructions ("Retrieve document for query: ") to asymmetrically encode queries vs. documents. Boosts retrieval by leveraging prompt tuning.
How Semantic Search Works
Semantic search retrieves content by meaning rather than keyword overlap. Both the query and every document chunk are encoded into high-dimensional vectors using the same embedding model. Similar meanings map to nearby points in that vector space, so ranking by vector distance surfaces conceptually relevant chunks — even when they share no words with the query.
1. Encode
The embedding model transforms text into a fixed-length dense vector (typically 384–3072 dims). Each dimension captures a latent semantic feature learned during pre-training on billions of text pairs.
2. Index
Document vectors are stored in an ANN index (HNSW, IVF-PQ, ScaNN). The index trades a small amount of recall for sub-linear search across millions to billions of vectors.
3. Score & Rank
At query time, the query vector is compared against candidates using cosine similarity, dot product, or Euclidean distance. Top-K nearest neighbors are returned as the retrieval set.
Similarity Metrics at a Glance
| Metric | Formula (intuition) | When to Use | Notes |
|---|---|---|---|
| Cosine Similarity | angle between vectors; magnitude-invariant | Default for most text embeddings (OpenAI, BGE, E5) | Robust to varying text length; values in [-1, 1] |
| Dot Product | sum of element-wise products | Models trained with normalized vectors; fastest on GPU | Equivalent to cosine when vectors are L2-normalized |
| Euclidean (L2) | straight-line distance in vector space | Image embeddings; some classical IR models | Sensitive to magnitude; rarely optimal for text |
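For reference, the three metrics in standard notation, with q the query vector and d a document vector:

```latex
% Cosine similarity, dot product, and Euclidean (L2) distance
\cos(\mathbf{q},\mathbf{d}) = \frac{\mathbf{q}\cdot\mathbf{d}}
                                   {\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert},
\qquad
\mathbf{q}\cdot\mathbf{d} = \sum_i q_i\, d_i,
\qquad
d_{\mathrm{L2}}(\mathbf{q},\mathbf{d}) = \sqrt{\sum_i (q_i - d_i)^2}
```

With L2-normalized vectors, both norms in the cosine denominator equal 1, so cosine and dot product produce the same ranking — which is why the table calls them equivalent.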
Minimal Semantic Search Loop
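The loop itself is not reproduced above; a minimal, self-contained sketch of the three steps (encode, index in memory, score and rank) on a toy corpus. The model choice and documents are illustrative:

```python
# Hedged sketch: encode corpus, encode query, rank by cosine.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Returns are accepted within 30 days of purchase.",
    "Our API uses OAuth 2.0 bearer tokens for authentication.",
    "Shipping to the EU takes 3-5 business days.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)  # the "index"

query = "how do I send back a product?"
q_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ q_vec            # cosine: vectors are unit-norm
for i in np.argsort(-scores)[:2]:    # top-2 nearest neighbors
    print(f"{scores[i]:.3f}  {docs[i]}")
```

Note that the top hit (the returns policy) shares no keywords with the query; the ranking is purely semantic.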
Different Embedding Model Families
Not all embeddings are created equal. Model choice depends on modality, context window, language coverage, latency budget, and deployment constraints. The landscape breaks down into a handful of architectural families.
Dense Bi-Encoders
Encode query and document independently into a single dense vector. Fast retrieval via ANN. Examples: text-embedding-3, BGE-large, E5, GTE, nomic-embed.
Sparse / Learned Sparse
Produce high-dimensional sparse vectors over vocabulary terms with learned term weights. Combines keyword precision with neural context. Examples: SPLADE++, BGE-M3 sparse, uniCOIL.
Multi-Vector (ColBERT-style)
Emit one vector per token and score with MaxSim late-interaction. Higher recall on fine-grained queries at the cost of storage. Examples: ColBERTv2, Jina-ColBERT, BGE-M3 multi-vec.
Cross-Encoders (Rerankers)
Jointly encode (query, document) pairs and output a relevance score. Too slow for first-stage retrieval but ideal for reranking top-100 candidates. Examples: bge-reranker-v2, Cohere Rerank 3, Jina Reranker.
Multilingual Models
Trained on 100+ languages so queries in one language retrieve documents in another. Examples: multilingual-e5-large, BGE-M3, Cohere embed-multilingual-v3, LaBSE.
Multimodal & Code
Share a vector space across text, images, audio, or source code for cross-modal retrieval. Examples: CLIP, SigLIP, Cohere Embed v4, voyage-code-3, jina-embeddings-v3.
Choosing an Embedding Model — Decision Checklist
| Requirement | Recommended Family | Concrete Options |
|---|---|---|
| Fastest time-to-value, managed | API dense bi-encoder | OpenAI text-embedding-3-small, Cohere embed-v4, voyage-3 |
| On-prem / data residency | Open-source dense | BGE-large-en, E5-large-v2, GTE-Qwen2, nomic-embed-v1.5 |
| Multilingual corpus (50+ languages) | Multilingual dense / hybrid | BGE-M3, multilingual-e5, Cohere embed-multilingual-v3 |
| Keyword-heavy (legal, medical codes) | Sparse + dense hybrid | SPLADE++ + BGE, BGE-M3 (dense+sparse+multi-vec) |
| Highest accuracy, storage available | Multi-vector + reranker | ColBERTv2 / BGE-M3 + bge-reranker-v2 |
| Source code retrieval | Code-tuned dense | voyage-code-3, jina-embeddings-v3-code, CodeSage |
| Images + text together | Multimodal bi-encoder | CLIP, SigLIP, Cohere Embed v4, Nomic Embed Vision |
| Very long context (>32K tokens) | Long-context dense | voyage-3-large (128K), GTE-Qwen2 (131K), jina-v3 (8K+) |
Vector Database Selection
FAISS is a similarity search library, not a networked vector database. Production RAG requires distributed, replicable systems.
| Database | Architecture | Key Features | Scaling Model | Ops Burden |
|---|---|---|---|---|
| FAISS | In-memory library | Highest performance; no persistence | Single-node only | High (build/rebuild cycles) |
| Milvus | Distributed (K8s native) | Multi-replica, auto-sharding, metadata filtering | Horizontal (scale nodes) | High (K8s expertise required) |
| Pinecone | Managed SaaS | Serverless, metadata filtering, pod-type scaling | Serverless (auto) | Low (fully managed) |
| Weaviate | Hybrid (vector+BM25) | Combined dense/sparse search, replication controls | Cluster-based | Medium |
| Chroma | Lightweight SQLite/in-memory | Simple API; good for prototypes | Single-node | Low (dev only) |
| Elasticsearch | Existing infra (if already deployed) | Dense vectors + BM25 + analytics | Cluster-based | Medium |
| pgvector | PostgreSQL extension | SQL + vectors; ACID transactions | Postgres replication | Medium |
Qdrant/Milvus Production Config: HNSW + Quantization + Replication
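The config example is not reproduced above; a minimal Qdrant sketch touching the design decisions listed below (HNSW parameters, int8 quantization, RF=3). All numeric values are illustrative assumptions to tune against your own recall/latency measurements:

```python
# Hedged sketch: Qdrant collection with HNSW tuning, int8 scalar
# quantization, and replication. TTL/backup policies are configured
# separately and not shown.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, HnswConfigDiff, ScalarQuantization,
    ScalarQuantizationConfig, ScalarType, VectorParams,
)

client = QdrantClient(url="http://qdrant:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=200),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,    # ~4x memory reduction
            always_ram=True,         # keep quantized vectors in RAM
        )
    ),
    replication_factor=3,            # RF=3 for production SLA
    shard_number=6,                  # partitioning across nodes
)
```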
Key Design Decisions
- HNSW vs IVF: HNSW faster recall, IVF better for billion-scale; prefer HNSW for sub-100M datasets
- Quantization: 8-bit scalar quantization saves 4x memory with <2% recall loss; essential for cost control
- Namespaces/Partitions: Isolate indices by tenant, project, or time period for multi-tenancy and retention
- Replication: RF=3 minimum for production SLA; prevents single-point failures
- TTL & Garbage Collection: Auto-expire old chunks; configure cleanup policies for cost
- Backup & Point-in-Time Recovery: Daily snapshots; test restore procedures quarterly
Query Transformation — From One Query to Many
Users ask vague, ambiguous, or narrowly-worded questions. A single embedding of that raw query often misses relevant chunks. Query transformation rewrites, decomposes, and expands the user's query into multiple targeted search queries — dramatically improving chunk filtering and retrieval quality.
Six Query Transformation Strategies
1. Query Rewriting
LLM rewrites the query to be clearer and more search-friendly. Fixes typos, expands abbreviations, makes implicit context explicit.
# Input: "how 2 fix auth"
# Output: "How to troubleshoot and fix
# authentication errors"
prompt = f"""Rewrite this query to be
clearer for a search engine.
Fix typos, expand abbreviations.
Query: {query}"""
When: Always. First step in every pipeline. Cheap and fast (~50ms with Haiku).
2. Multi-Query Expansion
Generate 3–5 diverse reformulations targeting different vocabulary, specificity levels, and perspectives.
# Input: "fix auth errors"
# Output:
# - "authentication failure troubleshoot"
# - "401 403 OAuth token expired"
# - "login session invalid API key"
# - "how to debug access denied"When: Ambiguous or broad queries. Biggest recall improvement (15–30%). See deep-dive in Retrieval section.
3. Step-Back Prompting
Generate a higher-level abstract query to retrieve foundational context, then the specific query for details.
# Input: "why does JWT expire in 15min"
# Step-back: "JWT token lifecycle and
# security best practices"
# Then search BOTH queries:
# → foundational + specific chunks
When: "Why" questions, conceptual queries. Provides background context the LLM needs to reason.
4. HyDE (Hypothetical Document)
Ask the LLM to generate a hypothetical answer, embed THAT, and search for similar real documents. Bridges the query-document embedding gap.
# Input: "fix auth errors"
# LLM generates hypothetical doc:
hypo = "To fix authentication errors,
first check if your OAuth token
has expired. Refresh using the
/auth/refresh endpoint..."
# Embed hypo → search → find real docs
# that are SIMILAR to this answer
When: Technical queries where query language differs from document language. Adds ~300ms latency.
5. Query Decomposition
Break multi-part or complex questions into atomic sub-queries, retrieve for each independently, then merge.
# Input: "compare pricing of Plan A
# vs Plan B and which has
# better support"
# Decompose into:
# Q1: "Plan A pricing details"
# Q2: "Plan B pricing details"
# Q3: "Plan A support features"
# Q4: "Plan B support features"
When: Compound questions, comparisons, multi-entity queries. Critical for completeness.
6. Metadata Filter Extraction
Extract structured filters (date, category, product, region) from the query to narrow the search pool BEFORE vector search.
# Input: "2024 return policy for EU"
# Extract:
# - filter: year=2024
# - filter: region=EU
# - query: "return policy"
# → pre-filter chunks THEN embed search
When: Queries with temporal, geographic, or categorical constraints. Dramatically reduces search pool.
Recommended Production Strategy: Adaptive Query Transform
Don't apply all strategies to every query — that's wasteful and slow. Instead, classify the query complexity and apply the minimum transformation needed. Simple factual queries need only rewriting; complex multi-part queries need decomposition + expansion.
AdaptiveQueryTransformer — Production Implementation
class AdaptiveQueryTransformer:
"""Classify query → apply minimum transform.
Simple queries: just rewrite (50ms).
Complex queries: full pipeline (200-400ms)."""
def __init__(self, llm, fast_llm):
self.llm = llm # strong model
self.fast = fast_llm # Haiku / mini
self.classifier = QueryClassifier()
self.cache = TransformCache(ttl=3600)
async def transform(self, query: str) -> TransformResult:
# Check cache first
cached = self.cache.get(query)
if cached:
return cached
# Step 1: Classify query complexity
qtype = self.classifier.classify(query)
# Step 2: Route to appropriate strategy
if qtype == "simple_factual":
# "What's the return policy?" → just rewrite
queries = [await self.rewrite(query)]
filters = self.extract_filters(query)
elif qtype == "ambiguous":
# "fix auth" → rewrite + expand
rewritten = await self.rewrite(query)
expanded = await self.expand(query, n=3)
queries = [rewritten] + expanded
filters = self.extract_filters(query)
elif qtype == "compound":
# "compare A vs B pricing + support"
sub_queries = await self.decompose(query)
queries = sub_queries
filters = self.extract_filters(query)
elif qtype == "conceptual":
# "why does X happen?" → step-back + specific
abstract = await self.step_back(query)
queries = [query, abstract]
filters = {}
elif qtype == "technical":
# Technical jargon → HyDE + expand
hyde_doc = await self.generate_hyde(query)
expanded = await self.expand(query, n=2)
queries = [query] + expanded
hyde_queries = [hyde_doc] # separate embed
filters = self.extract_filters(query)
else: # fallback: rewrite + 2 expansions
queries = [query] + await self.expand(query, 2)
filters = {}
result = TransformResult(
original=query,
queries=queries,
filters=filters,
strategy=qtype,
)
self.cache.set(query, result)
        return result
Query Classification — Route to Strategy
| Query Type | Example | Strategy | Latency |
|---|---|---|---|
| Simple factual | "What's the return policy?" | Rewrite only | ~50ms |
| Ambiguous | "fix auth errors" | Rewrite + Expand(3) | ~200ms |
| Compound | "compare A vs B pricing + support" | Decompose into sub-Qs | ~250ms |
| Conceptual | "why does JWT expire?" | Step-back + specific | ~150ms |
| Technical | "CORS preflight 403 nginx" | HyDE + Expand(2) | ~400ms |
| Lookup | "order #12345 status" | Extract ID → direct DB | ~5ms |
Query Classifier Implementation
import re

class QueryClassifier:
"""Fast classifier: embedding + rules.
~5ms. No LLM call needed."""
def classify(self, query: str) -> str:
# Rule-based fast path
        # re.search, not re.match: the ID can appear anywhere
        if re.search(r"(order|tracking|#)\s*#?\d+", query):
return "lookup"
if "vs" in query or "compare" in query:
return "compound"
if query.startswith(("why", "how does", "explain")):
return "conceptual"
if len(query.split()) <= 6:
return "ambiguous"
# Embedding-based classifier for rest
emb = self.encoder.encode(query)
pred = self.classifier_model.predict(emb)
        return pred  # SetFit / fine-tuned
Metadata Filter Extraction — Pre-Filter Before Vector Search
Extract structured constraints from the query to narrow the chunk pool BEFORE embedding search. This dramatically improves precision for queries with temporal, categorical, or entity-specific constraints.
class FilterExtractor:
"""Extract structured filters from query.
Runs in parallel with query expansion."""
def extract(self, query: str) -> dict:
filters = {}
# Temporal: "2024", "last month", "recent"
date = self.parse_date(query)
if date:
filters["date_after"] = date
# Category: "pricing", "support", "API"
category = self.classify_topic(query)
if category:
filters["doc_type"] = category
# Entity: product names, plan names
entities = self.ner.extract(query)
if entities:
filters["entities"] = entities
# Region: "EU", "US", "APAC"
region = self.detect_region(query)
if region:
filters["region"] = region
return filters
# Applied to vector search:
# db.search(query_emb, filters=filters)
# → searches ONLY chunks matching filters
Why this matters:
Without filters, "2024 EU return policy" searches ALL chunks and relies on the embedding to distinguish 2024 EU docs from 2023 US docs. Embeddings are bad at temporal and geographic precision. Pre-filtering narrows the pool from 10M chunks to maybe 50K — making vector search both faster and more accurate.
| Filter Type | Example | Extraction Method |
|---|---|---|
| Temporal | "2024", "this week", "latest" | Regex + dateparser |
| Category | "pricing", "API docs", "FAQ" | Topic classifier |
| Entity | Product names, plan names | NER (spaCy / custom) |
| Region | "EU", "US", "Germany" | Regex + geo lookup |
| Language | Query language detection | langdetect / fasttext |
| Access level | User's role / permissions | Session context (ACL) |
Raw single query: Recall@5 = 62% | + Rewrite: 68% (+6%) | + Multi-Query Expand: 82% (+14%) | + Metadata Filters: 87% (+5%) | + Cross-Encoder Rerank: 94% (+7%) | Total lift: +32 percentage points
Latency Strategy — Generating 5 Queries in <50ms
The naive approach — call an LLM to generate 5 queries — takes 200–400ms. That's unacceptable for real-time voice agents or low-latency search. Here are four production strategies to get multi-query expansion down to <50ms.
Strategy 1: Template-Based Expansion (5ms)
No LLM call at all. Use rule-based templates that generate query variants from the original query using synonym dictionaries, regex patterns, and structural transformations.
class TemplateExpander:
"""Zero-LLM query expansion. ~5ms.
Generates 5 variants using rules."""
def __init__(self):
self.synonyms = SynonymDict.load("domain_synonyms.json")
        # Lowercase stopwords: tokens are lowered before matching
        self.stopwords = set(["the", "a", "is", "how", "do", "i"])
def expand(self, query: str) -> list[str]:
tokens = query.lower().split()
keywords = [t for t in tokens if t not in self.stopwords]
variants = [query] # always include original
# V1: Synonym swap (most impactful)
for kw in keywords:
if kw in self.synonyms:
syn = self.synonyms[kw][0]
variants.append(query.replace(kw, syn))
break # one swap per variant
# V2: Keyword-only (drop question words)
variants.append(" ".join(keywords))
# V3: Reversed keyword order
variants.append(" ".join(reversed(keywords)))
# V4: Add domain context prefix
variants.append(f"documentation: {query}")
return variants[:5]
# Example:
# Input: "how do I fix auth errors"
# Output: [
# "how do I fix auth errors", # original
# "how do I fix authentication errors", # synonym
# "fix auth errors", # keywords-only
# "errors auth fix", # reversed
# "documentation: how do I fix auth errors" # prefixed
# ]
Pros: Zero latency, zero cost, deterministic. Cons: Limited diversity, no semantic understanding. Best for: First-pass expansion while LLM results are pending.
Strategy 2: Fine-Tuned Small Model (10–30ms)
Distill a large LLM's query expansion capability into a small local model (T5-small, FLAN-T5-base, or a 60M-param custom model). Runs on CPU in 10–30ms.
from transformers import AutoTokenizer, T5ForConditionalGeneration
class LocalQueryExpander:
"""Fine-tuned T5-small for query expansion.
~15ms on CPU. No API call."""
def __init__(self):
self.model = T5ForConditionalGeneration.from_pretrained(
"./models/query-expander-t5-small"
)
self.tokenizer = AutoTokenizer.from_pretrained(
"./models/query-expander-t5-small"
)
def expand(self, query: str, n=5) -> list[str]:
prompt = f"expand query: {query}"
inputs = self.tokenizer(prompt, return_tensors="pt")
outputs = self.model.generate(
**inputs,
num_return_sequences=n,
num_beams=n,
max_new_tokens=64,
do_sample=False,
)
return [
self.tokenizer.decode(o, skip_special_tokens=True)
for o in outputs
]
# Training data: 50K (query, expansion) pairs
# generated by GPT-4/Claude from prod logs.
# Fine-tune T5-small for 3 epochs. ~2hrs on 1 GPU.
Pros: Fast, free at inference, semantic-aware. Cons: Requires training, model maintenance. Best for: High-QPS production systems.
Strategy 3: Pre-Computed Cache (0ms hit / 300ms miss)
Cache LLM-generated expansions by normalized query. First request is slow; all subsequent identical or near-identical queries are instant. Use semantic similarity for fuzzy cache matching.
import hashlib, json

class SemanticExpansionCache:
"""Cache LLM expansions. 0ms on hit.
Semantic fuzzy matching for near-dupes."""
def __init__(self, redis, encoder, llm):
self.redis = redis # exact cache
self.encoder = encoder # for fuzzy match
self.index = FAISSIndex() # query embedding index
self.llm = llm # fallback generator
async def get_expansions(self, query: str) -> list[str]:
# L1: Exact match (Redis, ~0.1ms)
key = hashlib.md5(query.lower().encode()).hexdigest()
cached = self.redis.get(key)
if cached:
return json.loads(cached)
# L2: Semantic fuzzy match (~2ms)
q_emb = self.encoder.encode(query)
hits = self.index.search(q_emb, top_k=1)
if hits and hits[0].score > 0.95:
# "fix auth errors" ≈ "fix authentication errors"
            return json.loads(self.redis.get(hits[0].id))
# L3: Cache miss → generate (async, don't block)
expansions = await self.llm.expand(query)
self.redis.setex(key, 3600, json.dumps(expansions))
self.index.add(q_emb, key)
        return expansions
Hit rate: 40–70% for most production systems (users ask similar questions). Semantic matching pushes this to 60–85%.
★ Strategy 4: Hybrid — The Recommended Approach
Combine all three: serve template-generated queries instantly (5ms), check cache for LLM-quality expansions (0ms if hit), and fire-and-forget an async LLM call to upgrade the cache for next time.
import asyncio

class HybridQueryExpander:
"""5ms P95 response. Best quality over time.
Template → Cache → Async LLM backfill."""
def __init__(self):
self.template = TemplateExpander() # 5ms
self.cache = SemanticExpansionCache() # 0ms hit
self.llm = LLMExpander() # 300ms
async def expand(self, query: str) -> list[str]:
# Phase 1: Instant (5ms) — always available
template_variants = self.template.expand(query)
# Phase 2: Cache check (0–2ms)
cached = await self.cache.get(query)
if cached:
# Merge template + cached LLM variants
return self.dedupe(cached + template_variants)[:5]
# Phase 3: Return templates NOW,
# fire async LLM to backfill cache
asyncio.create_task(
self._async_backfill(query)
)
return template_variants # 5ms total
async def _async_backfill(self, query):
"""Runs in background. Next identical
query will get LLM-quality expansions."""
try:
expansions = await self.llm.expand(query)
await self.cache.set(query, expansions)
except Exception:
            pass  # template fallback is fine
Result: First request gets template variants in 5ms. Second request gets LLM-quality variants from cache in 0ms. No user ever waits for the LLM.
Complete Latency Breakdown — Query Transform Pipeline
| Step | Operation | Latency | Runs | Can Parallelize? |
|---|---|---|---|---|
| Classify | Rule-based + embedding classifier | ~3ms | Always | — |
| Template expand | Synonym swap, keyword extract, prefix | ~2ms | Always | — |
| Cache lookup | Redis exact + FAISS semantic | ~2ms | Always | ✓ parallel with templates |
| Filter extract | Regex + NER for metadata | ~5ms | Always | ✓ parallel with above |
| LLM expand | Haiku/mini generate 5 variants | ~300ms | Cache miss only | Async (fire-and-forget) |
| HyDE generate | Hypothetical doc generation | ~400ms | Technical queries only | Async (fire-and-forget) |
| Total (hybrid) | End-to-end user-facing latency | 5–15ms P95 | Always | LLM runs async; result cached for next request |
Warm-Up Strategy
Pre-populate the expansion cache by running your top 10,000 queries from production logs through the LLM expander offline. This gives instant cache hits for the most common queries from day one.
# Offline warm-up script
for query in top_10k_queries:
expansions = await llm.expand(query)
cache.set(query, expansions)
# Run nightly. ~$3 for 10K queries.
Batch LLM Calls
If you must call an LLM synchronously, batch multiple queries into a single request. Generate all 5 variants in one prompt (not 5 separate calls). This cuts 5×300ms to 1×350ms.
# One call, 5 variants:
prompt = f"""Generate 5 diverse search
queries for: "{query}"
Return as JSON array."""
# → 1 API call ≈ 300ms
# NOT: 5 calls × 300ms = 1.5s ❌
Streaming + Speculative
Start retrieval with template queries immediately. If LLM expansions arrive (from cache or async), merge them into the result set before reranking. The LLM expansions enrich, never block.
# Speculative parallel execution
template_results = retrieve(template_qs)
# If LLM expansions arrive in time:
llm_results = retrieve(llm_qs) # bonus
merged = rrf_merge(template_results, llm_results)
# If not: template results alone are fine
① Classify query type (3ms, rule-based)
② Generate template variants (2ms, synonym + keyword)
③ Check semantic cache for LLM variants (2ms, Redis + FAISS) — in parallel with ②
④ Extract metadata filters (5ms, regex + NER) — in parallel with ②③
⑤ If cache miss: fire-and-forget async LLM call to backfill cache for next time
⑥ Return template+cached variants immediately → start retrieval
Total: 5–15ms P95. User never waits for LLM. Quality improves over time as cache fills.
Advanced Retrieval Strategies
Sparse, dense, and hybrid retrieval each encode different failure modes; hybrid retrieval fuses signals.
Retrieval Strategy Patterns
Hybrid Search (Dense + Sparse)
Run BM25 and vector search in parallel; fuse results via Reciprocal Rank Fusion (RRF) or weighted sum.
Multi-Query Expansion
LLM generates 3–5 diverse rephrased queries targeting different aspects of the user's question. Retrieve for each, then merge and deduplicate. Detailed deep-dive below.
HyDE (Hypothetical Document Embeddings)
LLM generates hypothetical document for query; embed it; search nearest neighbors. Bridges intent-execution gap.
Query Routing
Classify query intent; route to specialized indices (e.g., FAQ vs. technical docs). Faster and more precise.
Parent Document Retrieval
Retrieve fine-grained child chunks; expand with parent (full section). Balance precision + context.
Step-Back Prompting
Ask "What high-level concept does this question ask?"; retrieve abstract info first; then detailed.
Metadata Filtering
Pre-filter chunks by date, source, or category before vector search. Reduce retrieval pool; improve relevance.
Contextual Compression
Retrieve top-K; use LLM to extract relevant sentences. Reduce context window; increase token efficiency.
Learned Sparse (SPLADE)
SPLADE-family models: learned sparse vectors; interpretable term weights; combines dense + sparse strengths.
HybridRetriever Example
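The example code is not reproduced above; a minimal sketch using rank_bm25 for the sparse side and RRF for fusion. The embedder and vector_db interfaces are assumptions, and the sketch assumes the vector index stores corpus positions as ids:

```python
# Hedged sketch: BM25 + dense search in parallel, fused with RRF.
import asyncio
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, embedder, vector_db, corpus: list[str]):
        self.embedder = embedder
        self.db = vector_db                 # assumed async, ids = positions
        self.corpus = corpus
        self.bm25 = BM25Okapi([doc.split() for doc in corpus])

    async def search(self, query: str, top_k: int = 5) -> list[str]:
        dense, sparse = await asyncio.gather(
            self._dense(query, top_k * 4),
            asyncio.to_thread(self._sparse, query, top_k * 4),
        )
        fused_ids = self._rrf([dense, sparse])
        return [self.corpus[i] for i in fused_ids[:top_k]]

    async def _dense(self, query: str, k: int) -> list[int]:
        emb = self.embedder.encode(query)
        return [hit.id for hit in await self.db.search(vector=emb, top_k=k)]

    def _sparse(self, query: str, k: int) -> list[int]:
        scores = self.bm25.get_scores(query.split())
        return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

    def _rrf(self, result_lists: list[list[int]], k: int = 60) -> list[int]:
        # Reciprocal Rank Fusion: score = sum of 1/(k + rank)
        fused: dict[int, float] = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results):
                fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(fused, key=fused.get, reverse=True)
```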
Multi-Query Expansion — Deep Dive
A single user query often captures only one perspective of what they need. Multi-Query Expansion uses an LLM to generate diverse reformulations that target different angles, vocabulary, and levels of specificity — then retrieves for each and merges results. This typically improves Recall@K by 15–30%.
Production MultiQueryExpander
import json

class MultiQueryExpander:
PROMPT = """Generate {n} diverse search queries
for the user question below. Each query should
target a DIFFERENT aspect:
- One using technical terms / error codes
- One using simple plain language
- One asking the "why" behind the issue
- One focused on the solution / fix
User question: {query}
Return as JSON array of strings."""
def __init__(self, llm, n_queries=4):
self.llm = llm
self.n = n_queries
self.cache = QueryExpansionCache(ttl=3600)
async def expand(self, query: str) -> list[str]:
# Check cache first (same query = same expansions)
cached = self.cache.get(query)
if cached:
return cached
result = await self.llm.generate(
self.PROMPT.format(n=self.n, query=query),
model="claude-haiku-4-5-20251001", # fast + cheap
temperature=0.7, # some diversity
)
variants = json.loads(result)
# Always include original query
all_queries = [query] + variants[:self.n]
self.cache.set(query, all_queries)
        return all_queries
Multi-Query Retriever with RRF Fusion
import asyncio

class MultiQueryRetriever:
def __init__(self, expander, retriever, reranker):
self.expander = expander
self.retriever = retriever
self.reranker = reranker
async def search(self, query, top_k=5):
# Step 1: Expand query
queries = await self.expander.expand(query)
# Step 2: Parallel retrieval for all variants
all_results = await asyncio.gather(*[
self.retriever.search(q, top_k=top_k * 2)
for q in queries
])
# Step 3: Reciprocal Rank Fusion
fused = self.rrf_merge(all_results, k=60)
# Step 4: Rerank against ORIGINAL query
# (not the variants!)
reranked = self.reranker.rerank(
query=query, # original intent
candidates=fused[:top_k * 3],
top_k=top_k
)
return reranked
    def rrf_merge(self, result_lists, k=60):
        """Reciprocal Rank Fusion across all
        query variants. Returns fused docs,
        best first (not bare id/score pairs)."""
        scores, docs = {}, {}
        for results in result_lists:
            for rank, doc in enumerate(results):
                docs[doc.id] = doc
                scores[doc.id] = scores.get(doc.id, 0) + 1.0 / (k + rank)
        ranked = sorted(scores, key=scores.get, reverse=True)
        return [docs[doc_id] for doc_id in ranked]
Expansion Strategies
Synonym expansion: Replace key terms with alternatives ("auth" → "authentication", "login")
Specificity ladder: Abstract ("security issue") + specific ("OAuth 2.0 token expired 401")
Perspective shift: Problem ("auth fails") + solution ("fix authentication") + cause ("why does token expire")
Domain injection: Add domain context ("in Kubernetes" or "for REST API")
When NOT to Use
Exact-match queries: Order ID lookups, SKU searches, specific error codes — expansion adds noise
Low-latency paths: Adds ~200–400ms for LLM expansion. Use only when retrieval quality matters more than speed
Small corpus (<1K docs): Expansion just returns the same docs repeatedly. Not worth the cost
Production Optimizations
Cache expansions: Same query → same variants. 1-hour TTL covers repeated queries
Use cheapest LLM: Haiku/GPT-4o-mini for expansion (~$0.001 per query)
Parallel retrieval: Run all variant searches simultaneously with asyncio.gather
Rerank against original: Always rerank using the ORIGINAL query, not variants — variants help recall, reranking restores precision
Low-Latency Retrieval — Hitting <30ms P95
For real-time chat and voice agents, the entire retrieval pipeline (query → embed → search → filter → return chunks) must complete in under 30ms P95. Here's how production systems achieve this.
Embedding Latency — <5ms
The query embedding step is on the critical path. Every millisecond counts.
# Strategy: Pre-warm + GPU + small model
from sentence_transformers import SentenceTransformer
class FastEmbedder:
def __init__(self):
# Use small model: 384d, ~3ms on GPU
self.model = SentenceTransformer(
"all-MiniLM-L6-v2",
device="cuda"
)
# Pre-warm: run dummy inference
self.model.encode("warmup")
# ONNX quantized for CPU-only deploys:
# self.model = ORTModel("model.onnx")
# → ~5ms on CPU vs ~15ms PyTorch
def encode(self, text: str):
return self.model.encode(
text, normalize_embeddings=True
        )
Options: all-MiniLM-L6 (3ms GPU), ONNX quantized (5ms CPU), Matryoshka 256d (2ms, -1% quality), API (10-20ms + network).
Vector Search — <10ms at Scale
HNSW indexes deliver sub-10ms search even at 10M+ vectors. Key: tune ef_search, keep quantized index in RAM.
# Qdrant: tuned for low latency
collection_config = {
"vectors": {
"size": 384, # small = faster
"distance": "Cosine",
},
"hnsw_config": {
"m": 16, # graph density
"ef_construct": 200, # build quality
},
"quantization_config": {
"scalar": {
"type": "int8", # 4x smaller
"always_ram": True, # no disk IO
}
},
"on_disk_payload": True, # metadata on disk
}
# Search params:
search_params = {
"hnsw_ef": 64, # lower = faster (vs 128)
"exact": False, # ANN, not brute force
}
# Result: ~5ms for 10M vectors, int8 quantized
Parallel Hybrid Search
Run BM25 and vector search simultaneously. Both return in ~5ms. RRF merge takes ~1ms. Total hybrid: ~6ms vs 10ms serial.
# Parallel hybrid: 6ms total
dense, sparse = await asyncio.gather(
vector_db.search(q_emb, top_k=50),
bm25_index.search(q_text, top_k=50),
)
fused = rrf_merge(dense, sparse)
# NOT: dense = await ...; sparse = await ...
# That's serial: 5+5 = 10ms ❌
Retrieval Result Cache
Cache the final retrieved chunks by normalized query hash. 30–50% hit rate for production systems. 0ms on hit.
# Redis retrieval cache
key = md5(normalize(query) + user_acl)
cached = redis.get(key)
if cached:
return json.loads(cached) # 0ms
# ACL in key prevents cross-user leakage
# TTL: 15min (balance freshness vs speed)
Connection Pooling
Cold connections to vector DB add 20–50ms. Pool connections and keep them warm. Use gRPC over HTTP for lower overhead.
# Qdrant gRPC connection pool
client = QdrantClient(
url="qdrant:6334",
prefer_grpc=True, # not REST
grpc_options={
"grpc.keepalive_time_ms": 10000,
},
)
# Pre-warm: send dummy search on startup
✓ Small embedding model (384d, GPU or ONNX quantized) — saves 10ms vs large model
✓ Int8 quantized HNSW index, always in RAM — saves 5–20ms vs disk
✓ Parallel BM25 + vector search with asyncio.gather — saves 5ms vs serial
✓ gRPC connection pooling to vector DB — saves 20–50ms cold start
✓ Retrieval result cache with ACL-aware keys (15min TTL) — 0ms on 30–50% of queries
✓ FlashRank fast reranker instead of cross-encoder for first pass — 5ms vs 50ms
✓ Metadata pre-filtering to reduce search pool before vector search
✓ Lower ef_search (64 vs 128) for HNSW — ~2ms savings, <1% recall drop
HyDE (Hypothetical Document Embeddings) — Deep Dive
The core insight behind HyDE: user queries and answer documents occupy different regions of the embedding space. A question like "fix auth errors" embeds very differently from a document paragraph that explains how to fix auth errors. HyDE bridges this gap by generating a hypothetical answer first, then using THAT as the search query — because a hypothetical answer embeds much closer to the real answer documents.
Production HyDE Implementation
class HyDERetriever:
"""Hypothetical Document Embeddings.
Generates a fake answer, embeds it,
searches for real docs that match."""
PROMPT = """Write a short paragraph that
directly answers this question.
Write as if it's from a technical doc.
Do NOT say "I don't know."
Question: {query}
Answer paragraph:"""
def __init__(self, llm, embedder, vector_db):
self.llm = llm # cheap/fast model
self.embedder = embedder
self.db = vector_db
self.cache = HyDECache(ttl=3600)
async def search(self, query, top_k=10):
# Check cache (same query = same hypo doc)
cached = self.cache.get(query)
if cached:
hypo_emb = cached
else:
# Step 1: Generate hypothetical answer
hypo_doc = await self.llm.generate(
self.PROMPT.format(query=query),
model="claude-haiku-4-5-20251001",
max_tokens=150, # short paragraph
temperature=0.0, # deterministic
)
# Step 2: Embed the hypothetical doc
hypo_emb = self.embedder.encode(hypo_doc)
self.cache.set(query, hypo_emb)
# Step 3: Search using hypo embedding
results = self.db.search(
vector=hypo_emb, top_k=top_k
)
        return results
When HyDE Helps vs Hurts
| Scenario | HyDE Impact | Why |
|---|---|---|
| Technical jargon query | +15–25% recall | Query uses informal terms; docs use formal language. HyDE bridges the gap. |
| Short/vague query | +10–20% recall | "fix auth" → hypothetical doc expands to "authentication, OAuth, token, refresh" |
| Cross-lingual | +20–30% recall | Query in English, docs in mixed languages. HyDE generates in target language. |
| Simple factual query | ~0% change | "What's the return policy?" already matches doc language. No gap to bridge. |
| Exact-match lookup | -5–10% recall | Order IDs, error codes — HyDE adds noise. Skip it for lookups. |
| Multi-part query | Mixed | HyDE generates one doc; may miss second topic. Combine with decomposition. |
HyDE + Multi-Query: Best of Both
In production, don't choose between HyDE and Multi-Query — combine them. Use the original query + 3 expansions + 1 HyDE embedding. Five search queries total, fused with RRF.
async def hybrid_retrieve(query, top_k=5):
# Run ALL in parallel
orig, expanded, hyde = await asyncio.gather(
vector_search(embed(query), top_k=20),
multi_query_search(query, n=3, top_k=20),
hyde_search(query, top_k=20),
)
# Fuse all results via RRF
fused = rrf_merge([orig, *expanded, hyde])
# Rerank against ORIGINAL query
    return rerank(query, fused[:top_k*3])[:top_k]
LLM Choice for HyDE
Claude Haiku / GPT-4o-mini: Best cost/quality. ~$0.001/query. 100–200ms.
Llama 3.1 8B (local): Zero API cost. ~50ms on GPU. Slightly lower quality.
T5-small fine-tuned: ~10ms CPU. Train on (query → doc paragraph) pairs from your corpus. Best latency.
Prompt Design Matters
DO: "Write as if from a technical document." This makes the output style match your corpus.
DO: "Do NOT say I don't know." Force the LLM to generate content even if unsure.
DON'T: Ask for long answers. 1–2 paragraphs max. More text = more embedding noise.
Latency Optimization
Cache aggressively: Same query → same hypothetical doc. 1h TTL. 50–70% hit rate.
Async generation: Start HyDE in parallel with template-based retrieval. If HyDE finishes in time, merge results. If not, template results are fine alone.
Conditional: Only run HyDE for queries classified as "technical" or "ambiguous" (~20% of traffic). Skip for simple factual queries.
Reranking & Relevance Scoring
Quality Improvement Pipeline
| Reranker | Type | Latency | Accuracy | Cost / Deployment |
|---|---|---|---|---|
| Cohere Rerank v3.5 | API cross-encoder | 50–150ms | SOTA | $0.001 / 1000 queries |
| Jina Reranker v2 | API | 100–200ms | Excellent | $0.0005 / query |
| cross-encoder/ms-marco | Open-source HF | 5–20ms (A100) | Good (BERT-base) | Free; self-hosted |
| BGE Reranker v2.5 | Open-source HF | 10–30ms (A100) | Very Good | Free; self-hosted |
| RankGPT (LLM-based) | LLM proxy | 200ms–1s | SOTA (model-dependent) | API cost; slow |
| FlashRank (tiny) | Open-source distilled | 2–5ms (CPU) | Acceptable (70–80%) | Free; ultra-fast |
MultiStageReranker: Cascade Strategy
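The cascade code is not reproduced above; a minimal two-stage sketch using models from the table: a small MS MARCO cross-encoder prunes ~100 candidates to a short list, then bge-reranker-v2-m3 produces the final ordering (the 25-candidate cut is an illustrative assumption):

```python
# Hedged sketch of the rerank cascade: cheap prune, then accurate rank.
from sentence_transformers import CrossEncoder

class MultiStageReranker:
    def __init__(self):
        self.fast = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        self.strong = CrossEncoder("BAAI/bge-reranker-v2-m3")

    def rerank(self, query: str, candidates: list, top_k: int = 5):
        # Stage 1: fast prune 100 -> 25
        scores = self.fast.predict([[query, c.text] for c in candidates])
        survivors = [c for _, c in sorted(
            zip(scores, candidates), key=lambda p: p[0], reverse=True
        )][:25]
        # Stage 2: accurate rerank on the survivors only
        scores = self.strong.predict([[query, c.text] for c in survivors])
        ranked = sorted(
            zip(scores, survivors), key=lambda p: p[0], reverse=True
        )
        return [c for _, c in ranked[:top_k]]
```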
Cross-Encoder Confidence Scoring
Cross-encoders don't just rerank — they produce calibrated relevance scores that serve as the foundation for retrieval confidence. These scores drive critical downstream decisions: should the LLM answer or refuse? Should we retrieve more chunks? Is the context sufficient?
CrossEncoderScorer — Production Implementation
from sentence_transformers import CrossEncoder
import numpy as np
class CrossEncoderScorer:
"""Calibrated cross-encoder confidence scorer.
Extracts per-chunk relevance + aggregate
retrieval confidence for downstream decisions."""
def __init__(self, model_name, temperature=1.5):
self.model = CrossEncoder(model_name)
self.T = temperature # calibrated on eval set
def score_chunks(self, query: str, chunks: list) -> list:
# Score each (query, chunk) pair
pairs = [[query, c.text] for c in chunks]
raw_logits = self.model.predict(pairs)
# Calibrate: sigmoid with temperature
scores = 1 / (1 + np.exp(-raw_logits / self.T))
# Attach scores to chunks
for chunk, score in zip(chunks, scores):
chunk.relevance_score = float(score)
chunk.is_relevant = score > 0.5
# Sort by score descending
return sorted(chunks, key=lambda c: c.relevance_score, reverse=True)
def retrieval_confidence(self, scored_chunks: list) -> RetrievalConfidence:
"""Aggregate chunk scores into a single
retrieval confidence signal."""
scores = [c.relevance_score for c in scored_chunks]
return RetrievalConfidence(
# Best chunk score — primary signal
top_score=scores[0],
# Mean of top-3 — stability signal
top3_mean=np.mean(scores[:3]),
# Score gap: top vs 4th — diversity signal
score_gap=scores[0] - scores[3] if len(scores) > 3 else 0,
# Count above threshold — coverage signal
relevant_count=sum(1 for s in scores if s > 0.5),
# Overall retrieval quality tier
tier=self._classify_tier(scores),
)
def _classify_tier(self, scores):
top = scores[0]
if top > 0.85 and sum(1 for s in scores if s > 0.7) >= 2:
return "high" # confident answer
elif top > 0.5:
return "medium" # answer with caveat
else:
return "low" # refuse / re-retrieveConfidence Signals Explained
| Signal | What It Measures | How to Use |
|---|---|---|
| top_score | Best single chunk relevance | >0.85 = answer confidently. <0.5 = refuse. |
| top3_mean | Consistency of top results | If top1=0.9 but top3_mean=0.5 → only one good chunk. Context may be thin. |
| score_gap | Drop from best to 4th | Large gap (>0.3) = clear winner. Small gap = ambiguous topic, may need more context. |
| relevant_count | How many chunks are useful | 0 = can't answer. 1–2 = thin context. 3–5 = good coverage. |
| tier | Aggregate quality class | Drives LLM prompt strategy: high→concise, medium→cautious, low→refuse. |
Cross-Encoder Models for Scoring
| Model | Latency | Quality | Best For |
|---|---|---|---|
| cross-encoder/ms-marco-MiniLM-L-6 | ~8ms | Good | High-QPS, latency-critical |
| cross-encoder/ms-marco-MiniLM-L-12 | ~15ms | Better | Balanced speed/quality |
| BAAI/bge-reranker-v2-m3 | ~30ms | Very good | Multi-lingual |
| cross-encoder/nli-deberta-v3-large | ~50ms | Excellent | NLI + grounding check |
| Cohere Rerank v3.5 | ~60ms | Excellent | API-based, no GPU needed |
| Jina Reranker v2 | ~40ms | Very good | Long context support |
Confidence-Driven RAG — Adapting Behavior by Score
The most powerful use of cross-encoder scores is dynamically adapting the RAG pipeline behavior based on retrieval confidence — not just ranking chunks.
ConfidenceDrivenRAG — Adaptive Pipeline
class ConfidenceDrivenRAG:
"""Adapts RAG behavior based on cross-encoder
confidence. High confidence → fast answer.
Low confidence → expand search or refuse."""
async def answer(self, query: str) -> Response:
# Step 1: Retrieve + Rerank + Score
chunks = await self.retriever.search(query, top_k=20)
scored = self.cross_encoder.score_chunks(query, chunks)
confidence = self.cross_encoder.retrieval_confidence(scored)
# Step 2: Adapt strategy by confidence tier
if confidence.tier == "high":
# ✓ Strong context — answer directly
context = scored[:3] # top 3 only (less noise)
prompt = self.prompts.confident(query, context)
return await self.llm.generate(prompt)
elif confidence.tier == "medium":
# ~ Partial context — try harder first
# Strategy A: Expand retrieval
expanded = await self.multi_query.expand_and_retrieve(query)
re_scored = self.cross_encoder.score_chunks(query, expanded)
new_conf = self.cross_encoder.retrieval_confidence(re_scored)
if new_conf.tier == "high":
# Expanded search worked
context = re_scored[:5]
prompt = self.prompts.confident(query, context)
return await self.llm.generate(prompt)
else:
# Answer cautiously with hedge
context = re_scored[:5]
prompt = self.prompts.cautious(query, context)
# "Based on available information..."
return await self.llm.generate(prompt)
else: # tier == "low"
# ✗ No good context — refuse gracefully
if confidence.top_score < 0.2:
# Completely off-topic
return Response(
text="I don't have information on this topic.",
confidence=confidence.top_score,
action="refused"
)
else:
# Some relevance but not enough
return Response(
text="I found some related information but "
"can't give a confident answer. "
"Here's what I found: ...",
confidence=confidence.top_score,
action="hedged",
sources=scored[:2]
                )
Dynamic Chunk Filtering
Instead of always sending top-5 chunks, use scores to decide how many. If top-3 are all >0.8 but chunks 4–5 are <0.3, drop them. Including low-relevance chunks actually hurts faithfulness.
# Adaptive chunk count
relevant = [c for c in scored
if c.relevance_score > 0.5]
context = relevant[:5] # max 5, but only relevant
# If 0 relevant → refuse/expand
# If 1–2 → thin context warning
# If 3–5 → good coverage
Prompt Strategy Switching
Use confidence tier to select different prompt templates. High confidence → concise, direct answer. Medium → "Based on available docs..." Low → "I don't have enough info to..."
PROMPTS = {
"high": "Answer directly from context.",
"medium": "Based on available info, "
"answer carefully. Note gaps.",
"low": "Context is limited. State "
"what you found and what's missing.",
}
Feedback to Retrieval
If cross-encoder scores are consistently low for a topic, it signals a gap in your knowledge base — not just a bad query. Log and alert on repeated low-confidence topics.
# Track low-confidence topics
if confidence.tier == "low":
self.topic_tracker.record(
query=query,
top_score=confidence.top_score
)
# Weekly: report topics with >10
# low-confidence queries → content gap
Score Calibration — Making Thresholds Reliable
Raw cross-encoder logits are NOT probabilities. A score of 0.7 doesn't mean "70% chance this is relevant." You must calibrate scores so that your thresholds (0.5, 0.85) actually mean what you think they mean.
import numpy as np

class ScoreCalibrator:
"""Learn temperature T on held-out eval set
so that score=0.5 means 50% of chunks with
that score are actually relevant."""
def calibrate(self, eval_set):
# eval_set: [(query, chunk, is_relevant)]
logits = []
labels = []
for q, c, rel in eval_set:
logit = self.model.predict([(q, c)])[0]
logits.append(logit)
labels.append(rel)
# Optimize temperature T
from scipy.optimize import minimize_scalar
def nll(T):
probs = 1 / (1 + np.exp(-np.array(logits) / T))
return -np.mean(
np.array(labels) * np.log(probs + 1e-8)
+ (1 - np.array(labels)) * np.log(1 - probs + 1e-8)
)
        # method="bounded" is required for bounds to be honored
        result = minimize_scalar(nll, bounds=(0.1, 5.0), method="bounded")
self.T = result.x
print(f"Calibrated T={self.T:.2f}")
# Recalibrate monthly or when model changes
Why calibration matters:
Without calibration, the same threshold (0.5) behaves differently across models. MiniLM-L-6 might output 0.8 for a mediocre match, while DeBERTa-v3 outputs 0.6 for a great match. Temperature scaling normalizes this.
| Without Calibration | With Calibration (T=1.5) |
|---|---|
| Score 0.7 = maybe relevant? | Score 0.7 = 70% are truly relevant |
| Threshold 0.5 = different per model | Threshold 0.5 = consistent meaning |
| Can't compare models fairly | Apples-to-apples comparison |
| Must tune per deployment | One threshold works across models |
How often to recalibrate: Monthly, or whenever you change the cross-encoder model, update the embedding model, or significantly change the corpus. Use 500+ labeled (query, chunk, relevant?) pairs.
Prompt Engineering & Generation
"Prompting is a contract between retrieval and generation" — context discipline, citations, and answer modes matter.
Production RAG Prompt Template
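The template itself is not reproduced above; a minimal sketch of the contract it encodes — grounding, citations, and refusal — with illustrative wording and a hypothetical chunk metadata layout:

```python
# Hedged sketch of a grounded-answer prompt template. The wording,
# rules, and chunk metadata layout are illustrative assumptions.
RAG_PROMPT = """You are a support assistant. Answer ONLY from the
numbered sources below.
Rules:
1. Cite every factual claim as [Source N].
2. If the sources do not answer the question, say
   "I don't have information on this" instead of guessing.
3. Keep the answer under 150 words.

Sources:
{sources}

Question: {question}

Answer:"""

def build_prompt(question: str, chunks: list) -> str:
    # chunks are assumed to expose .text and .metadata["source"]
    sources = "\n\n".join(
        f"[Source {i + 1}] ({c.metadata['source']})\n{c.text}"
        for i, c in enumerate(chunks)
    )
    return RAG_PROMPT.format(sources=sources, question=question)
```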
RAGGenerator: Streaming, Fallback, Confidence Gating
Streaming (SSE/WebSocket)
Return tokens as they arrive, not end-to-end. Target <500ms TTFT (time-to-first-token). Improves perceived latency and UX.
Citation Extraction
Parse [Source N] references; validate against retrieved chunks. Enable user verification; prevent hallucinated citations.
Fallback Strategy
Route to cheaper/faster model if retrieval confidence is low. Use strong model only when context is rich. Optimize cost/quality.
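A minimal sketch of the RAGGenerator behaviors described above — confidence gating, model fallback, and token streaming. The llm.stream() interface is an assumption; the confidence object mirrors the tier/top_score fields of the CrossEncoderScorer output earlier, and build_prompt comes from the template sketch above:

```python
# Hedged sketch tying gating, fallback, and streaming together.
class RAGGenerator:
    def __init__(self, strong_llm, fast_llm, min_confidence=0.5):
        self.strong = strong_llm
        self.fast = fast_llm
        self.min_confidence = min_confidence

    async def generate(self, query, chunks, confidence):
        # Confidence gate: refuse rather than risk hallucinating
        if confidence.top_score < self.min_confidence:
            yield "I don't have enough information to answer this."
            return
        # Fallback routing: strong model only when context is rich
        llm = self.strong if confidence.tier == "high" else self.fast
        prompt = build_prompt(query, chunks)
        # Stream tokens as they arrive (SSE/WebSocket friendly)
        async for token in llm.stream(prompt):
            yield token
```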
Self-RAG decides adaptively, via learned reflection tokens, whether retrieval is needed at each generation step and critiques its own outputs. Studies report 10–15% factuality improvements from selective retrieval. Implement using token-level confidence scores from the LLM.
LLM Orchestration Policies
- Model Routing: Classify query complexity; route simple queries to fast model (GPT-3.5), complex to strong model (GPT-4). Save 50%+ on inference cost.
- Caching Layers: Cache responses by normalized query + context hash. ACL-sensitive keys (per-user); 24-72h TTL. Reduces latency and cost for repeated queries.
- Hallucination Mitigation Toolbox: Use retrieval-augmented verification (CTRL), confidence thresholds, structured output format (JSON schema), and post-generation fact-checking against context.
Response Evaluation Layer
In production, the LLM alone is NOT trusted. Every response passes through a parallel evaluation layer — grounding verification, intent alignment, safety moderation, and confidence scoring — all within 50–200ms.
1. Grounding Check — Deep Dive
The grounding check is the single most important validator in a production RAG system. It verifies that every claim in the LLM's response is actually supported by the retrieved context — catching hallucinations before they reach the user.
Tier 1: Embedding Similarity
Fastest check (~5–15ms). Runs on every response. Converts answer + context chunks to embeddings, measures cosine similarity.
from sentence_transformers import SentenceTransformer
import numpy as np
class EmbeddingGrounder:
def __init__(self):
self.model = SentenceTransformer(
"all-MiniLM-L6-v2" # 384d, fast
)
def check(self, answer, chunks):
a_emb = self.model.encode(answer)
c_embs = self.model.encode(
[c.text for c in chunks]
)
# Max similarity across chunks
scores = np.dot(c_embs, a_emb) / (
np.linalg.norm(c_embs, axis=1)
* np.linalg.norm(a_emb)
)
score = float(scores.max())
if score > 0.75:
return Grounded(score)
elif score > 0.5:
return Ambiguous(score) # → T2
else:
            return Hallucinated(score)
Tools: FAISS, pgvector, Sentence Transformers, HuggingFace Embeddings, OpenAI text-embedding-3-small
Tier 2: Cross-Encoder / NLI
More accurate (~30–80ms). Only runs on ambiguous T1 results. Uses Natural Language Inference to classify each claim as entailed, neutral, or contradicted by context.
from transformers import pipeline
class NLIGrounder:
def __init__(self):
self.nli = pipeline(
"text-classification",
model="cross-encoder/"
"nli-deberta-v3-large"
)
def check(self, answer, context):
# Split answer into claims
claims = self.extract_claims(answer)
results = []
        for claim in claims:
            # Encode as a proper (premise, hypothesis) pair
            pred = self.nli([{"text": context,
                              "text_pair": claim}])[0]
            label = pred["label"]
# entailment/neutral/contradiction
results.append((claim, label))
contradictions = [
c for c, l in results
if l == "contradiction"
]
return NLIResult(
grounded=len(contradictions)==0,
flagged_claims=contradictions
        )
Models: DeBERTa-v3-large-NLI, cross-encoder/nli-MiniLM, BART-large-MNLI, Cohere Rerank v3.5
Tier 3: LLM-as-Judge
Most flexible (~300–800ms). Only runs on disputed claims from T2. Performs claim-by-claim verification with explicit reasoning.
class LLMGrounder:
PROMPT = """Verify each claim against
the context. For each claim, respond:
SUPPORTED / NOT SUPPORTED / PARTIAL
Context: {context}
Claims to verify:
{claims}
Respond as JSON:
[{{"claim": "...", "verdict": "...",
"evidence": "...", "confidence": 0.0}}]
"""
async def check(self, claims, ctx):
result = await self.llm.generate(
self.PROMPT.format(
context=ctx,
claims="\n".join(claims)
),
model="claude-haiku-4-5-20251001",
# Use cheap fast model
temperature=0,
)
verdicts = json.loads(result)
unsupported = [
v for v in verdicts
if v["verdict"] != "SUPPORTED"
]
return LLMVerdict(
grounded=len(unsupported)==0,
unsupported_claims=unsupported
        )
Models: Claude Haiku (cheapest), GPT-4o-mini, Gemini Flash, Llama 3.1 8B (self-hosted)
Production Grounding Service (Cascading)
class ProductionGroundingService:
"""Cascading grounding: fast → accurate → LLM
P95 latency: ~20ms (80% exit at T1)"""
def __init__(self):
self.t1 = EmbeddingGrounder() # ~10ms
self.t2 = NLIGrounder() # ~50ms
self.t3 = LLMGrounder() # ~500ms
self.metrics = GroundingMetrics()
async def verify(self, answer, chunks, query):
# Tier 1: Embedding (always runs)
t1 = self.t1.check(answer, chunks)
self.metrics.record("t1", t1.score)
if t1.score > 0.75:
return GroundingResult(
grounded=True, tier=1,
score=t1.score
)
if t1.score < 0.4:
return GroundingResult(
grounded=False, tier=1,
score=t1.score,
action="regenerate"
)
        # Tier 2: NLI (ambiguous zone 0.4–0.75)
        ctx_text = "\n".join(c.text for c in chunks)
        t2 = self.t2.check(answer, ctx_text)
        self.metrics.record("t2", t2)
        # NLIGrounder flags only contradictions, so these are equivalent
        if t2.grounded or not t2.flagged_claims:
            return GroundingResult(
                grounded=True, tier=2
            )
        # Tier 3: LLM judge (disputed claims only)
        t3 = await self.t3.check(
            t2.flagged_claims, ctx_text
        )
self.metrics.record("t3", t3)
return GroundingResult(
grounded=t3.grounded, tier=3,
unsupported=t3.unsupported_claims,
action="regenerate" if not t3.grounded else None
        )
Tools & Libraries Comparison
| Tool | Type | Latency | Best For |
|---|---|---|---|
| Sentence Transformers | Embedding | ~5ms | T1 — fast similarity |
| FAISS | Vector index | ~1ms | Batch embedding lookup |
| pgvector | Postgres ext | ~5ms | SQL-native similarity |
| DeBERTa-v3 NLI | Cross-encoder | ~50ms | T2 — NLI classification |
| BART-large-MNLI | NLI model | ~40ms | T2 — zero-shot NLI |
| Cohere Rerank | API reranker | ~60ms | T2 — relevance scoring |
| Claude Haiku | LLM API | ~400ms | T3 — claim verification |
| GPT-4o-mini | LLM API | ~500ms | T3 — claim verification |
| Guardrails AI | Framework | varies | Orchestrate all tiers |
| RAGAS | Eval framework | offline | Measure faithfulness |
| TruLens | Eval+trace | offline | Groundedness monitoring |
| DeepEval | CI eval | offline | Hallucination CI gate |
How the 80 / 15 / 5 Cascading Exit Works
In production, you do NOT run all three tiers on every response. Instead, you cascade: the fast cheap check runs first, and only ambiguous results escalate to the next tier. This is why 80% of requests cost ~10ms and only 5% ever hit the expensive LLM judge.
Tier 1 Exit (80%) — Clear Match
Most RAG answers closely paraphrase the retrieved context. Embedding similarity catches these trivially.
# Example: clear grounding
Context: "Returns accepted within 30 days
of purchase with original receipt."
Answer: "You can return items within 30 days
if you have the original receipt."
cosine_similarity = 0.91 # > 0.75
# → PASS at Tier 1. No further checks.
# Latency: ~10ms. Cost: $0.00.
This covers: direct paraphrasing, factual restatement, simple summarization, exact quotes, and minor rewording. The embedding model captures semantic equivalence without needing deeper reasoning.
Tier 2 Escalation (15%) — Ambiguous Zone
When the answer uses different vocabulary or adds inference, embeddings give a middling score. NLI resolves the ambiguity.
# Example: inference from context
Context: "Premium members get free shipping
on orders over $50."
Answer: "As a premium member, your $75 order
qualifies for free shipping."
cosine_similarity = 0.62 # ambiguous zone
# → Escalate to Tier 2
NLI("Premium members get free shipping
on orders over $50",
"$75 order qualifies for free shipping")
# → entailment (0.94 confidence)
# → PASS at Tier 2. Latency: ~60ms.
This covers: logical inference, numerical reasoning ("$75 > $50"), conditional application, combining info from multiple chunks, and contextual deduction.
Tier 3 Escalation (5%) — Disputed Claims
When NLI returns "neutral" (neither entailed nor contradicted) or there are mixed verdicts across claims, the LLM judge arbitrates.
# Example: mixed/complex claim
Context: "The product is available in blue
and red. Ships within 3-5 days."
Answer: "The product comes in blue, red, and
green. Usually arrives in a week."
T1 cosine_similarity = 0.58 # ambiguous
T2 NLI:
"blue and red" → entailment ✓
"green" → neutral ⚠️ # not in ctx
"arrives in a week" → neutral ⚠️
# → Escalate disputed claims to Tier 3
LLM Judge:
"green": NOT SUPPORTED # hallucination!
"week": PARTIAL # 3-5 days ≈ week
# → REJECT "green", accept "week"
# → Strip hallucinated claim, regenerate
Why This Works — The Math
The cascade works because most RAG answers are well-grounded (the retrieval pipeline already found relevant context). Only edge cases need expensive verification.
| Metric | All T3 | Cascade | Savings |
|---|---|---|---|
| Avg latency | 500ms | 40ms | 12.5x faster |
| P50 latency | 500ms | 10ms | 50x faster |
| P95 latency | 800ms | 60ms | 13x faster |
| Cost / 1K queries | $0.50 | $0.03 | 16x cheaper |
| Hallucination catch | ~98% | ~96% | -2% (acceptable) |
Key insight: You trade ~2% hallucination detection rate for a 12x latency reduction and 16x cost reduction. For the remaining 2%, user feedback loops and offline evaluation catch regressions.
Tuning the Thresholds — Production Guidance
T1 Pass Threshold (default: 0.75)
Raise to 0.80–0.85 for high-stakes domains (medical, legal, financial). Lower to 0.65–0.70 for casual Q&A where speed matters more. Tune by measuring T2/T3 escalation rate — if <5% escalate, threshold is too low.
T1 Reject Threshold (default: 0.4)
Below this, the answer is clearly unrelated to context — skip T2/T3 and regenerate immediately. Raise to 0.5 for stricter domains. Monitor false-rejection rate via user feedback.
T2→T3 Escalation (default: any contradiction)
Only escalate if T2 finds "contradiction" (not just "neutral"). Neutral means the context doesn't address the claim — which might be acceptable for partial answers. Tune per use case.
# Threshold config per use case
GROUNDING_CONFIG = {
"default": {
"t1_pass": 0.75, "t1_reject": 0.40,
"t2_escalate_on": ["contradiction"],
},
"medical": {
"t1_pass": 0.85, "t1_reject": 0.50, # stricter
"t2_escalate_on": ["contradiction", "neutral"], # always verify
},
"casual_qa": {
"t1_pass": 0.65, "t1_reject": 0.35, # faster
"t2_escalate_on": ["contradiction"], # only clear issues
},
}
2. Intent Check — Response Matches User Intent
Verifies the response actually addresses what the user asked. Catches drift where the model answers a different question entirely.
User: "Track my order" → Answer: "Here are some shoes you may like" — intent mismatch!
# Intent alignment pipeline
class IntentAlignmentChecker:
def check(self, query, response):
# Classify both through intent model
query_intent = self.intent_model.predict(query)
response_intent = self.intent_model.predict(response)
# Or use embedding similarity
q_emb = self.encoder.encode(query)
r_emb = self.encoder.encode(response)
similarity = cosine_similarity(q_emb, r_emb)
        if similarity < 0.8:
            return IntentResult(
                aligned=False,
                query_intent=query_intent,
                response_intent=response_intent
            )
        return IntentResult(aligned=True)
Common intent models: Rasa, SetFit, fine-tuned classifiers. For production voice agents, embedding-based intent similarity with a threshold >0.8 is fastest.
3. Safety Check — Content Moderation
Prevents unsafe or policy-violating responses: illegal instructions, abusive content, financial advice risks, policy violations.
A. Moderation Models
# Dedicated safety classifiers
result = moderation_api.classify(response)
# Output: {"violence": false, "hate": false, "self_harm": false}
if any(result.values()):
    return block_response()
B. Rule Engine
Hard rules for regulated domains: refund policies, medical/financial advice, guaranteed outcomes. Example: if answer contains "guaranteed profit" → reject.
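A minimal rule-engine sketch; the two patterns are illustrative, not a complete policy set:
# Sketch: hard rules for regulated domains (patterns are illustrative)
import re

HARD_RULES = [
    ("financial_guarantee", re.compile(r"guaranteed (profit|returns?)", re.I)),
    ("medical_dosage", re.compile(r"\btake \d+\s?mg\b", re.I)),
]

def rule_check(response: str):
    violations = [name for name, pat in HARD_RULES if pat.search(response)]
    if violations:
        return {"blocked": True, "rules": violations}  # fail-closed
    return {"blocked": False, "rules": []}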
C. Guardrail Frameworks
Production libraries: Guardrails AI, NeMo Guardrails. Enforce content policies, structured outputs, and safe responses declaratively.
4. Confidence Score — Final Decision Engine
Aggregates all evaluator scores into a weighted confidence signal. Detailed deep-dive below.
Confidence Score — How It's Calculated
The confidence engine is the final gate before a response reaches the user. It takes raw scores from every evaluator, normalizes them, applies domain-specific weights, and produces a single decision: pass, retry, or fallback.
Production ConfidenceEngine Implementation
class ConfidenceEngine:
def __init__(self, config: DomainConfig):
self.weights = config.weights
self.thresholds = config.thresholds
self.veto_rules = config.veto_rules
def calculate(self, scores: EvalScores) -> Decision:
# Step 1: Check hard veto rules first
for rule in self.veto_rules:
if rule.triggered(scores):
return Decision(
action="REJECT",
reason=rule.name,
confidence=0.0,
vetoed=True
)
# Step 2: Weighted aggregation
raw_score = sum(
self.weights[k] * getattr(scores, k)
for k in self.weights
)
# Step 3: Apply penalty for low-scoring
# individual signals (even if weighted
# average is high)
penalty = 0.0
for k, threshold in self.thresholds.min_per_signal.items():
val = getattr(scores, k)
if val < threshold:
gap = threshold - val
penalty += gap * 0.5 # 50% of gap
final = max(0.0, raw_score - penalty)
# Step 4: Map to decision
if final >= self.thresholds.pass_threshold:
return Decision("PASS", final)
elif final >= self.thresholds.retry_threshold:
return Decision("RETRY", final)
else:
return Decision("FALLBACK", final)Why These Weights?
| Signal | Weight | Rationale |
|---|---|---|
| Grounding | 0.30 | Highest — a hallucinated answer is the #1 failure mode. If grounding fails, nothing else matters. |
| Retrieval | 0.25 | If retrieval quality is low, the LLM is working with bad context. Garbage in → garbage out. |
| Intent | 0.15 | Answering the wrong question is bad but less dangerous than hallucinating facts. |
| Safety | 0.10 | Low weight in formula BUT has a hard veto — any safety flag = instant reject regardless of score. |
| Citation | 0.10 | Verifies source attribution. Important for trust but not critical for correctness. |
| Freshness | 0.10 | Only matters for temporal queries. Many questions are time-independent. |
Veto Rules — Hard Overrides
Certain conditions bypass the weighted score entirely and force an immediate reject. No amount of high scores elsewhere can compensate.
VETO_RULES = [
VetoRule("unsafe_content",
lambda s: s.safety < 0.5),
VetoRule("severe_hallucination",
lambda s: s.grounding < 0.3),
VetoRule("pii_leakage",
lambda s: s.pii_detected),
VetoRule("citation_fraud",
lambda s: s.citation_valid_pct < 0.5),
VetoRule("blocked_topic",
lambda s: s.blocked_content),
]
# If ANY veto fires → instant REJECT
# regardless of weighted score
Worked Examples — Three Scenarios
Scenario A — PASS
"What's your return policy?"
Grounding: 0.91 × 0.30 = 0.273
Retrieval: 0.88 × 0.25 = 0.220
Intent: 0.95 × 0.15 = 0.143
Safety: 1.00 × 0.10 = 0.100
Citation: 0.90 × 0.10 = 0.090
Fresh: 1.00 × 0.10 = 0.100
─────────────────
Total: 0.926 → Penalty: 0
Decision: PASS ✓
Scenario B — RETRY
"Compare Plan A vs Plan B pricing"
Grounding: 0.62 × 0.30 = 0.186
Retrieval: 0.70 × 0.25 = 0.175
Intent: 0.90 × 0.15 = 0.135
Safety: 1.00 × 0.10 = 0.100
Citation: 0.40 × 0.10 = 0.040
Fresh: 1.00 × 0.10 = 0.100
─────────────────
Raw: 0.736 | Penalty: -0.05
Final: 0.686 → RETRY ↻
Retry with more context chunks
Scenario C — VETO REJECT
"Show me other users' orders"
Retrieval: 0.80 × 0.25 = 0.200
Intent: 0.92 × 0.15 = 0.138
Safety: 0.20 × 0.10 = 0.020
─── VETO TRIGGERED ───
safety < 0.5 → unsafe_content
Decision: REJECT ⚠
Even though the weighted score would
otherwise pass, the veto overrides. Response blocked.
Domain-Specific Weight Profiles
Different use cases need different weight distributions. A medical chatbot prioritizes grounding above all else; a casual Q&A bot prioritizes speed and intent alignment.
| Domain | Grounding | Retrieval | Intent | Safety | Citation | Fresh | Pass | Retry |
|---|---|---|---|---|---|---|---|---|
| General Q&A | 0.30 | 0.25 | 0.15 | 0.10 | 0.10 | 0.10 | >0.85 | >0.60 |
| Medical / Legal | 0.40 | 0.20 | 0.10 | 0.15 | 0.10 | 0.05 | >0.90 | >0.70 |
| E-commerce | 0.25 | 0.20 | 0.20 | 0.10 | 0.10 | 0.15 | >0.82 | >0.55 |
| Voice Agent | 0.30 | 0.25 | 0.20 | 0.10 | 0.05 | 0.10 | >0.80 | >0.55 |
| Internal Docs | 0.25 | 0.30 | 0.15 | 0.05 | 0.15 | 0.10 | >0.80 | >0.55 |
| Financial | 0.35 | 0.20 | 0.10 | 0.15 | 0.10 | 0.10 | >0.92 | >0.75 |
# Config per domain
DOMAIN_CONFIGS = {
"medical": DomainConfig(
weights={"grounding": 0.40, "retrieval": 0.20, "intent": 0.10,
"safety": 0.15, "citation": 0.10, "freshness": 0.05},
        thresholds=Thresholds(pass_threshold=0.90, retry_threshold=0.70,
                              min_per_signal={"grounding": 0.7, "safety": 0.8}),  # strict mins
veto_rules=VETO_RULES + [
VetoRule("medical_disclaimer_missing",
lambda s: s.has_medical_claim and not s.has_disclaimer),
]
),
"ecommerce": DomainConfig(
weights={"grounding": 0.25, "retrieval": 0.20, "intent": 0.20,
"safety": 0.10, "citation": 0.10, "freshness": 0.15},
        thresholds=Thresholds(pass_threshold=0.82, retry_threshold=0.55,
                              min_per_signal={"grounding": 0.5}),
veto_rules=VETO_RULES # standard vetos
),
}
Production Microservice Architecture
Many companies deploy the evaluation layer as separate microservices for scalability and independent deployment.
class ResponseEvaluationService:
"""Runs all checks in parallel. Target: 50-200ms."""
async def evaluate(self, query, response, context):
# Run all checks in parallel
grounding, intent, safety = await asyncio.gather(
self.grounding_svc.check(response, context),
self.intent_svc.check(query, response),
self.safety_svc.check(response),
)
# Compute weighted confidence
confidence = self.confidence_engine.score(
grounding=grounding.score,
retrieval=context.retrieval_score,
intent=intent.score,
safety=safety.score,
)
# Decision
if confidence.decision == Decision.PASS:
return EvalResult(approved=True, response=response)
elif confidence.decision == Decision.RETRY:
return await self.regenerate(query, context)
        else:
            return EvalResult(
                approved=True,  # deliver a safe canned fallback, not the raw answer
                response="I'm not completely sure. "
                         "Let me check that for you."
            )
Latency Optimization
Voice systems and real-time apps run all checks in parallel to keep total evaluation under 200ms.
| Check | Method | Latency | Accuracy |
|---|---|---|---|
| Grounding | Embedding similarity | ~10ms | Good |
| Grounding | Cross-encoder | ~50ms | Better |
| Grounding | LLM-as-judge | ~500ms | Best |
| Intent | Embedding similarity | ~10ms | Good |
| Intent | Classifier model | ~20ms | Better |
| Safety | Moderation API | ~50ms | Good |
| Safety | Rule engine | ~1ms | Exact |
| Confidence | Score aggregation | ~1ms | — |
Additional Production Checks (Often Missed)
The four core checks (grounding, intent, safety, confidence) cover ~80% of failure modes. These additional checks close the remaining gaps that surface at scale.
5. Citation Verification
Validates that [Source N] references in the response actually match the claims they support. Catches "citation hallucination" where the model invents or misattributes sources.
class CitationVerifier:
def verify(self, response, sources):
citations = self.extract_citations(response)
for cite in citations:
# Does [Source N] exist?
if cite.index >= len(sources):
cite.valid = False
continue
# Does the claim match the source?
sim = cosine_sim(
cite.claim, sources[cite.index]
)
cite.valid = sim > 0.6
        return citations
Tools: Regex extraction + embedding verification. Run in parallel with the grounding check (~5ms overhead).
6. Completeness Check
Did the answer address ALL parts of a multi-part question? Users often ask compound questions and the LLM may only answer part of it.
# Example problem:
Query: "What's the return policy
AND do you offer exchanges?"
Answer: "Returns within 30 days."
# Missing: exchange info!
class CompletenessChecker:
def check(self, query, answer):
# Decompose query into sub-questions
sub_qs = self.decomposer.split(query)
addressed = []
for sq in sub_qs:
sim = cosine_sim(sq, answer)
addressed.append(sim > 0.5)
return CompletenessResult(
complete=all(addressed),
missing=[sq for sq, a
in zip(sub_qs, addressed)
if not a]
        )
Tools: LLM query decomposer or spaCy clause splitting + embedding comparison.
7. PII Leakage Detection
The retrieved context may contain sensitive data (emails, SSNs, account numbers) that the LLM inadvertently surfaces in its response. Scan output before delivery.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
class PIIGuard:
def __init__(self):
self.analyzer = AnalyzerEngine()
self.anonymizer = AnonymizerEngine()
def scan(self, response):
results = self.analyzer.analyze(
text=response,
entities=["EMAIL_ADDRESS",
"PHONE_NUMBER",
"CREDIT_CARD",
"US_SSN"],
language="en"
)
if results:
return self.anonymizer.anonymize(
text=response, analyzer_results=results
)
        return response  # clean
Tools: Microsoft Presidio (open-source), AWS Comprehend PII, Google DLP API. Run on every response (~10ms).
8. Freshness / Staleness Check
Verify that the retrieved context is still current. An answer about "current pricing" from a 6-month-old document could be dangerously wrong.
class FreshnessChecker:
def check(self, chunks, query):
# Does query need fresh data?
needs_fresh = self.classify_temporal(query)
# "current", "latest", "now", "today"
if not needs_fresh:
return Fresh() # skip check
for chunk in chunks:
age = now() - chunk.indexed_at
if age > timedelta(days=30):
return Stale(
chunk=chunk,
age_days=age.days,
action="warn_user"
            )
Store indexed_at and source_updated_at in chunk metadata. Define TTL per document type (pricing: 7d, policy: 30d, FAQ: 90d).
9. Retry & Regeneration Strategy
When evaluation fails, how do you regenerate differently? Simply re-running the same prompt gets the same bad answer. Production systems modify the generation strategy on retry.
class RetryStrategy:
def regenerate(self, fail_reason, attempt):
if fail_reason == "hallucination":
# Add explicit "ONLY use context"
return self.stricter_prompt(
temp=0.0 # zero creativity
)
elif fail_reason == "incomplete":
# Retrieve MORE chunks
return self.expand_context(
top_k=10 # was 5
)
elif fail_reason == "stale":
# Force fresh retrieval
return self.re_retrieve(
freshness="7d"
)
elif attempt >= 2:
            return self.fallback_response()
Key: Max 2 retries. Each retry changes strategy (stricter prompt, more context, different model). After 2 fails → graceful fallback.
10. Human Feedback Loop
User feedback (thumbs up/down, corrections, follow-up queries) is the ultimate ground truth. Feed it back into evaluation thresholds and training data.
class FeedbackCollector:
def record(self, query_id, signal):
# Signals:
# thumbs_up, thumbs_down,
# correction(text), follow_up,
# escalate_to_human
self.store.save(query_id, signal)
# If thumbs_down → add to eval set
if signal == "thumbs_down":
self.eval_builder.add_negative(
query_id
)
# Weekly: retune thresholds from
# feedback distribution
# Monthly: retrain intent/NLI models
        # Quarterly: full eval set refresh
Track feedback rate (aim for >5% of responses). Negative feedback → auto-add to adversarial eval set. Positive feedback → confidence calibration.
11. Evaluation Layer Monitoring — What to Dashboard
The evaluation layer itself needs monitoring. If your grounding check drifts, it will silently let hallucinations through.
Tier Exit Distribution
T1: 80% / T2: 15% / T3: 5% is baseline. Alert if T2 rises above 25% (embedding model drift) or T3 above 8% (retrieval quality degradation).
False Positive / Negative Rate
Sample 100 responses/week. Human-label as grounded or not. Compare against evaluator verdicts. Target: <3% false-positive (passes hallucination) and <8% false-negative (rejects good answer).
Retry & Fallback Rate
If retry rate exceeds 10% or fallback exceeds 3%, something upstream is broken — likely retrieval quality, prompt template, or LLM model regression. Investigate immediately.
Evaluator Latency P95
Track per-tier latency. If T1 P95 exceeds 30ms, the embedding model may need optimization or the batch size is too large. T2 P95 above 150ms → model serving issue.
PII Detection Rate
Track how often PII is found in responses. If rate spikes, investigate the retrieval pipeline — it may be pulling in documents with unredacted personal data.
User Feedback Correlation
Correlate confidence scores with user feedback. If high-confidence responses get thumbs-down, your evaluator is miscalibrated. Retune weights quarterly.
✓ Grounding Check (cascading T1/T2/T3) ✓ Intent Alignment ✓ Safety / Content Moderation ✓ Confidence Score Engine ✓ Citation Verification ✓ Completeness Check ✓ PII Leakage Detection ✓ Freshness / Staleness Check ✓ Retry / Regeneration Strategy ✓ Human Feedback Loop ✓ Evaluator Monitoring & Alerting
Self-Correction & Reflection Loops
Modern advanced RAG adds self-checking loops that detect and recover when retrieval quality is poor, rather than blindly stuffing top-k passages into prompts.
Core Self-Correction Techniques
Self-RAG
Model decides whether retrieval is needed per token. Generates reflection tokens (IsRel, IsSup, IsUse) to critique its own outputs. Targets factuality and citation accuracy—10–15% improvement in studies.
CRAG
Corrective RAG evaluates retrieved documents. Triggers corrective actions (alternative retrieval, filtering, web search fallback) when retrieval quality is poor.
Adaptive Retrieval
Retrieve fewer documents when confidence is high; more when needed. Avoids indiscriminate retrieval via confidence-gated document selection.
Adaptive Retrieval with Confidence Gating
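A minimal sketch of confidence-gated retrieval, assuming a retriever that returns scored hits and an optional CRAG-style web fallback:
# Sketch: retrieve a small k first, widen only when confidence is low
def adaptive_retrieve(query, retriever, k_small=3, k_large=10, min_score=0.6):
    hits = retriever.search(query, k=k_small)      # assumed scored hits
    top_score = max((h.score for h in hits), default=0.0)
    if top_score >= min_score:
        return hits                                 # confident: keep context lean
    # Low confidence: widen the net
    hits = retriever.search(query, k=k_large)
    if max((h.score for h in hits), default=0.0) < min_score:
        hits += retriever.web_search(query, k=3)    # corrective fallback (assumed)
    return hits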
RAG Evaluation Framework
Three metric categories measure retrieval effectiveness, generation quality, and system performance
| Category | Metrics | Measurement Method | Tools |
|---|---|---|---|
| Retrieval Metrics | Precision, Recall, NDCG, MRR, MAP | Rank quality against gold standard passages | BEIR, trec_eval |
| Generation Metrics | BLEU, ROUGE, METEOR, BERTScore, Context Relevance | Automated scoring, LLM judges, embedding similarity | TruLens, RAGAS |
| System Metrics | Latency, throughput, cost per query, user satisfaction | Production logs, user feedback, A/B tests | OpenTelemetry, Datadog, custom instrumentation |
Building Your Eval Dataset
Synthetic
LLM-generated QA from corpus. Fast, cheap. Risk of false positives.
Human-Curated
Gold standard. Expensive, slow. High quality baseline.
Production Logs
Real queries & answers. Most realistic. Requires filtering.
Adversarial
Edge cases, tricky queries. Surfaced via user feedback.
RAGAS: Automated RAG Evaluation
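A minimal RAGAS sketch, assuming the pre-1.0 evaluate API over a HuggingFace Dataset; exact imports and column names vary by RAGAS version:
# Sketch: RAGAS evaluation over a small eval set
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_set = Dataset.from_dict({
    "question": ["What is the return window?"],
    "answer": ["Returns are accepted within 30 days with a receipt."],
    "contexts": [["Returns accepted within 30 days of purchase with original receipt."]],
    "ground_truth": ["30 days with original receipt."],
})

scores = evaluate(eval_set, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)  # e.g. faithfulness / answer_relevancy / context_precision scores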
RAG Benchmarking & Performance Testing
Systematic benchmarking with shared metrics ensures consistent quality and enables confident deployment decisions
Benchmarking Framework Flow
Retrieval Benchmarks
| Metric | Target |
|---|---|
| Recall@5 | > 85% |
| Recall@20 | > 95% |
| NDCG@10 | > 0.70 |
| MRR | > 0.75 |
| Hit Rate | > 95% |
| BEIR Zero-Shot | Baseline |
| MTEB Rank | Top 10% |
| Latency P95 | < 100ms |
Generation Benchmarks
| Metric | Target |
|---|---|
| Faithfulness | > 0.90 |
| Answer Relevance | > 0.85 |
| Context Relevance | > 0.80 |
| Hallucination Rate | < 5% |
| Citation Accuracy | > 90% |
| Refusal Rate (unanswerable queries) | > 80% |
| Completeness | > 85% |
| TTFT P95 | < 500ms |
End-to-End System
Test full pipeline: retrieval → generation → post-processing. Measure user-facing latency, cost per query, success rate.
A/B Testing & Online
Shadow deploy changes. Compare metrics vs. baseline with statistical significance. Catch production surprises before full rollout.
Adversarial & Stress
Test with typos, out-of-domain queries, adversarial prompts. Load test at 10× peak. Measure robustness.
RAG Benchmark Suite Code
class RAGBenchmarkSuite:
    def __init__(self):
        self.thresholds = {
            "recall@5": 0.85,
            "faithfulness": 0.90,
            "latency_p95_ms": 100,
        }
    def run_benchmark(self, model, dataset):
        recalls, faiths = [], []
        for q, ctx, golden_answer in dataset:
            retrieved = self.retrieve(q)
            generated = model.generate(q, ctx)
            recalls.append(self.compute_recall(retrieved, ctx))
            faiths.append(self.compute_faithfulness(generated, ctx))
        # Aggregate over the whole dataset, then gate against thresholds
        results = {
            "recall@5": sum(recalls) / len(recalls),
            "faithfulness": sum(faiths) / len(faiths),
        }
        return self.check_regressions(results, self.thresholds)
Benchmark Workflow Pipeline
Run suite on current production system
Make code change, rerun benchmarks
Compare vs. baseline, flag regressions
Pass gates → deploy; fail → iterate
Monitor metrics post-deploy, alert on drift
Benchmarking Tools Comparison
| Tool | Metrics | LLM-Based | Reference | Best For |
|---|---|---|---|---|
| RAGAS | Faithfulness, Answer Rel., Context Rel. | ✓ | Paper | Gen. + Retrieval |
| BEIR | Recall@K, NDCG@K, MRR, MAP | ✗ | Yes | Retrieval (IR) |
| MTEB | Cross-lingual Retrieval, Ranking | ✗ | Yes | Multilingual |
| TruLens | LLM-based evals + feedback | ✓ | No | Custom logic |
| DeepEval | Hallucination, Answer Rel., RAGAS | ✓ | Optional | LLM Evals |
| LangSmith | Custom evals, tracing, logging | Partial | Optional | Development + Monitoring |
| Arize Phoenix | Evals + Production Observability | ✓ | Optional | End-to-End |
| Custom Harness | Org-specific metrics & logic | Optional | Org | Control + Integration |
Guardrails & Safety
Multi-stage guardrails prevent harmful input, retrieval, generation, and post-processing risks
Guardrail Pipeline: Four Stages
1. Input
- • Prompt injection detection
- • PII masking
- • Content profanity filter
2. Retrieval
- • ACL enforcement
- • Source validation
- • Freshness checks
3. Generation
- • Hallucination checker
- • Citation validator
- • Token budget limits
4. Post-Processing
- • PII scrubbing
- • Toxicity filtering
- • Output validation
GuardrailPipeline Implementation
Production Guardrails Architecture — Models, Tools & Design
A production guardrail system is NOT a single checkpoint. It's a layered defense architecture with specialized models at each stage — some rule-based (0ms), some ML-based (~10ms), some LLM-based (~200ms). The key is running them in parallel and using the cheapest effective check first.
Guardrail Models — What to Use at Each Stage
| Check | Model / Tool | Type | Latency | Accuracy | Cost | Best For |
|---|---|---|---|---|---|---|
| Prompt Injection | deberta-v3-prompt-injection | Fine-tuned classifier | ~15ms | 92% F1 | Free (self-hosted) | Primary injection defense |
| Prompt Injection | Lakera Guard API | Managed API | ~50ms | 95%+ F1 | $0.001/req | Higher accuracy, no infra |
| Prompt Injection | ProtectAI / Rebuff | Multi-layer (heuristic+LLM) | ~80ms | High | Free OSS | Defense-in-depth |
| PII Detection | Microsoft Presidio | NER + regex | ~10ms | High | Free (OSS) | Default PII choice |
| PII Detection | AWS Comprehend PII | Managed API | ~40ms | Very high | $0.01/unit | AWS-native stacks |
| Toxicity | OpenAI Moderation | Managed API | ~30ms | Very high | Free | Default safety check |
| Toxicity | Perspective API (Google) | Managed API | ~50ms | High | Free (quota) | Multi-language toxicity |
| Toxicity | unitary/toxic-bert | Self-hosted BERT | ~12ms | Good | Free (GPU) | Air-gapped / self-hosted |
| Topic / Intent | SetFit (fine-tuned) | Few-shot classifier | ~8ms | High | Free | Domain-specific blocking |
| Grounding | DeBERTa-v3-NLI | Cross-encoder | ~50ms | Very high | Free (GPU) | Tier 2 grounding |
| Grounding | Claude Haiku / GPT-4o-mini | LLM-as-judge | ~400ms | Best | ~$0.001/req | Tier 3 disputed claims |
| Framework | Guardrails AI | Orchestration | varies | — | Free OSS | Declarative guard chains |
| Framework | NeMo Guardrails (NVIDIA) | Dialog management | varies | — | Free OSS | Conversational safety flows |
| Red Team | Promptfoo | Testing framework | offline | — | Free OSS | CI/CD injection testing |
| Red Team | Garak (NVIDIA) | Vulnerability scanner | offline | — | Free OSS | Automated LLM probing |
Production GuardrailOrchestrator
class GuardrailOrchestrator:
"""Run all guards in parallel per layer.
Total latency = max(layer checks), not sum."""
def __init__(self, config: GuardConfig):
# Layer 1: Input (parallel)
self.input_guards = [
PromptInjectionGuard(
model="deberta-v3-injection"
),
PIIScanner(engine="presidio"),
TopicBlocker(topics=config.blocked),
RateLimiter(redis=config.redis),
ContentPolicy(rules=config.rules),
]
# Layer 3: Output (parallel)
self.output_guards = [
GroundingVerifier(cascade=True),
ToxicityFilter(api="openai"),
PIIScrubber(engine="presidio"),
CitationValidator(),
IntentAligner(),
PolicyRuleEngine(config.rules),
]
async def check_input(self, query, ctx):
# Run ALL input guards in parallel
results = await asyncio.gather(*[
g.check(query, ctx)
for g in self.input_guards
], return_exceptions=True)
# Any hard block = reject immediately
for r in results:
if isinstance(r, BlockVerdict):
return r # blocked
return PassVerdict()
async def check_output(self, response, ctx):
results = await asyncio.gather(*[
g.check(response, ctx)
for g in self.output_guards
], return_exceptions=True)
# Aggregate into confidence score
        # Aggregate guard verdicts into a confidence score
        return self.confidence.calculate(results)
Design Principles
1. Parallel by default: Run all checks within a layer simultaneously. Latency = max(check), not sum(checks). Input layer: ~15ms. Output layer: ~50ms.
2. Cheapest first: Regex rules (0.5ms) → ML classifiers (10ms) → API calls (30ms) → LLM judges (400ms). Exit at the cheapest layer that gives a confident verdict.
3. Fail-open vs fail-closed: Safety and injection checks = fail-closed (block if check fails). PII and grounding = fail-open with degraded response (still answer, but warn).
4. Never block the user silently: Every block must include a reason. "I can't answer that because..." is better than a generic error.
5. Audit everything: Every guard verdict → immutable log with query_id, guard_name, verdict, score, latency, timestamp. Required for compliance and debugging.
Enterprise Threat Model & OWASP LLM Top 10
Map RAG attack surfaces to OWASP LLM Top 10 categories with mitigations
1. Prompt Injection
Risk: Malicious prompts override system instructions or exfiltrate data.
Mitigations: Content sanitization, instruction stripping, system prompt dominance, tool-call allowlists.
2. Data Exfiltration
Risk: Model leaks sensitive data (PII, secrets) in responses.
Mitigations: Output filtering, PII scrubbing, redaction at generation time.
3. Permission Leakage
Risk: Weak retrieval filters expose unauthorized content.
Mitigations: ACL-aware retrieval, auth-sensitive cache keys, audit trails.
4. Data Poisoning
Risk: Malicious docs inserted into corpus, spread misinformation.
Mitigations: Ingestion validation, source trust scoring, content integrity checks.
5. DoS (Expensive Prompts)
Risk: Very long contexts, recursive tool calls exhaust resources.
Mitigations: Token budgets, hard timeouts, rate limits per user.
6. Supply Chain
Risk: Compromised embedding models or dependencies.
Mitigations: Model provenance, dependency scanning, vendor security audit.
A user must never receive retrieved context (or generated content derived from it) that they are not authorised to access.
Permission-Aware Retrieval Requirements:
- • Ingest-time ACL assignment: Tag every chunk with owner/org/role ACLs
- • Query-time filter enforcement: Filter retrieved docs by user's ACL before context assembly (see the sketch after this list)
- • ACL-sensitive cache keys: Include user_id/org_id in cache key to prevent cross-user leakage
- • Audit trails: Log all access (who queried, what docs were retrieved, timestamps)
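A sketch of the query-time filter from the list above; the metadata schema and the audit_log helper are assumptions:
# Sketch: query-time ACL filtering before context assembly
def acl_filtered_search(retriever, query, user, k=20):
    hits = retriever.search(query, k=k)
    allowed = [
        h for h in hits
        if h.metadata["org_id"] == user.org_id
        and set(h.metadata["allowed_roles"]) & set(user.roles)
    ]
    # Audit trail: who queried, which docs survived the filter
    audit_log.record(user_id=user.id, query=query,
                     doc_ids=[h.metadata["doc_id"] for h in allowed])
    return allowed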
Grounding & Faithfulness
Ensure every generated claim is traceable to retrieved evidence — reduce hallucinations by 42-68%, enable inline citations, and build verifiable trust in production RAG systems.
What is Grounding?
Grounding is the process of anchoring every claim in the LLM's response to specific evidence from retrieved documents. An answer is grounded when each statement can be traced back to a source passage. An answer is faithful when it does not add information beyond what the context supports. Together, grounding + faithfulness are the primary defenses against hallucination in RAG systems.
Grounding Techniques
1. Prompt-Based Grounding
Instruct the LLM to cite sources inline. Simplest approach — no extra models needed.
- Inline citations: "Answer using [1], [2] notation"
- Quote extraction: "Include exact quotes from context"
- Abstain instruction: "Say 'I don't know' if context lacks answer"
- Confidence tagging: "Rate confidence [HIGH/MED/LOW] per claim"
Effectiveness: Reduces hallucination by 30-45%. Easy to implement but relies on LLM compliance.
2. NLI Verification
Use Natural Language Inference models to verify each claim is entailed by retrieved context.
- Claim decomposition: Split response into atomic claims
- Entailment check: DeBERTa-MNLI / TRUE model per claim
- Verdict: Entailed, Contradicted, or Neutral
- Action: Remove/flag unentailed claims
Effectiveness: Reduces hallucination by 50-68%. Gold standard for post-hoc verification.
3. Self-Consistency Voting
Generate multiple answers and keep only claims that appear consistently across samples.
- Sample N responses (temperature > 0)
- Extract atomic claims from each response
- Majority vote: Keep claims in ≥60% of samples
- Consensus answer: Reconstruct from agreed claims
Effectiveness: 40-55% hallucination reduction. Costs N× more tokens. Best for high-stakes queries.
4. Citation-Aware Generation
Fine-tune or prompt models to generate answers with verifiable citation markers in a structured format.
- ALCE framework: Train citation generation with NLI feedback
- AGREE approach: Tune LLM to include citations, verify with NLI
- Post-hoc attribution: Match generated sentences to source chunks
- CARGO: Citation-aware routing + grounded optimization
Effectiveness: 55-70% reduction. Requires fine-tuning or structured prompting. Best quality.
Techniques Comparison
| Technique | Hallucination Reduction | Latency Impact | Cost Impact | Implementation |
|---|---|---|---|---|
| Prompt-based citation | 30-45% | None | None | Trivial (prompt change) |
| Abstain instruction | 20-35% | None | None | Trivial (prompt change) |
| NLI post-verification | 50-68% | +50-100ms (DeBERTa) | Low ($0.001/query) | Medium (NLI model) |
| Self-consistency (N=5) | 40-55% | 5x generation time | 5x token cost | Easy (sampling) |
| RAGAS faithfulness | Eval metric (not mitigation) | +200ms | 1 extra LLM call | Medium (pipeline) |
| Citation-aware fine-tune | 55-70% | None at inference | $2-5K training | High (SFT + NLI) |
| Combined (prompt + NLI + retry) | 65-80% | +100-300ms | Low-Medium | Medium |
Grounding Metrics
Faithfulness (RAGAS)
Fraction of claims in the answer that are supported by the retrieved context. Computed via LLM or NLI entailment.
Target: ≥ 0.85 for production
Citation Precision
Fraction of inline citations that actually support the claim they're attached to. Measured via NLI on (claim, cited_passage) pairs.
Target: ≥ 0.80
Citation Recall
Fraction of claims that have at least one valid citation. Missing citations = unverifiable claims, even if correct.
Target: ≥ 0.75
Production Implementation
# === Grounding Pipeline: Prompt + NLI Verification + Retry ===
from transformers import pipeline
from ragas.metrics import faithfulness
import re
# 1. NLI model for claim verification
nli = pipeline("text-classification",
model="microsoft/deberta-v3-large-mnli",
device="cuda")
# 2. Grounding prompt template
GROUNDED_PROMPT = """Answer the question based ONLY on the provided context.
Rules:
- Cite sources using [1], [2], etc. after each claim
- If the context doesn't contain the answer, say "I don't have enough information"
- Never add information not present in the context
- Rate your overall confidence: [HIGH], [MEDIUM], or [LOW]
Context:
{context}
Question: {question}
Answer (with citations):"""
# 3. Decompose response into atomic claims
def decompose_claims(response: str) -> list[str]:
"""Split response into individual factual claims."""
sentences = re.split(r'(?<=[.!?])\s+', response)
return [s.strip() for s in sentences if len(s.strip()) > 10]
# 4. Verify each claim against retrieved context
def verify_grounding(claims: list[str], context: str) -> dict:
results = {"grounded": [], "ungrounded": [], "score": 0.0}
for claim in claims:
# NLI: does context entail this claim?
        # NLI: encode (premise=context, hypothesis=claim) as a pair
        pred = nli([{"text": context, "text_pair": claim}])[0]
        label = pred["label"]
if label == "ENTAILMENT":
results["grounded"].append(claim)
else:
results["ungrounded"].append(claim)
total = len(claims)
results["score"] = len(results["grounded"]) / total if total > 0 else 0
return results
# 5. Full grounding pipeline with retry
def grounded_rag(query, retriever, llm, max_retries=2):
docs = retriever.invoke(query)
context = "\n".join([f"[{i+1}] {d.page_content}" for i, d in enumerate(docs)])
for attempt in range(max_retries + 1):
prompt = GROUNDED_PROMPT.format(context=context, question=query)
response = llm.invoke(prompt)
# Verify grounding
claims = decompose_claims(response)
verification = verify_grounding(claims, context)
if verification["score"] >= 0.85:
return {"answer": response, "grounding_score": verification["score"],
"ungrounded": verification["ungrounded"], "attempts": attempt + 1}
# Retry with feedback on ungrounded claims
query += f"\n\nNote: these claims were ungrounded, remove them: {verification['ungrounded']}"
return {"answer": response, "grounding_score": verification["score"],
"warning": "Below grounding threshold after retries"}
Production Recommendations
Recommended: Layered Grounding
- Prompt engineering — Always include citation instructions and abstain directive (free, 30-45% reduction)
- NLI post-check — Run DeBERTa-MNLI on claims after generation (+50ms, 50-68% reduction)
- Retry loop — If faithfulness < 0.85, regenerate with feedback on ungrounded claims (1-2 retries max)
- Fallback — If still below threshold, return partial answer with confidence warning
Combined effect: 65-80% hallucination reduction at <300ms extra latency
Monitoring & Alerts
- Track faithfulness score per query (RAGAS or NLI-based)
- Alert if daily avg drops below 0.80
- Log ungrounded claims for analysis and prompt improvement
- Sample 1% for human review — correlate with NLI scores
- Dashboard metrics: faithfulness, citation precision, citation recall, abstain rate
- Watch abstain rate: >30% means retrieval quality is poor, not grounding
Observability & Monitoring
Three monitoring layers: system SLOs, retrieval quality, and answer groundedness
Monitoring Layers
System SLOs
- • Latency (p50, p95, p99)
- • Throughput (QPS)
- • Error rate
- • Availability
Retrieval Quality
- • NDCG, MRR (rank quality)
- • Precision@k
- • Docs retrieved per query
- • Reranker acceptance rate
Answer Quality
- • Faithfulness (grounded?)
- • Answer relevance
- • Citation accuracy
- • User feedback signal
OpenTelemetry Tracing Decorator
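A minimal sketch of such a decorator using the OpenTelemetry Python API; the span and attribute names are assumptions:
# Sketch: trace each RAG stage as a span
from functools import wraps
from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

def traced(stage: str):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(f"rag.{stage}") as span:
                result = fn(*args, **kwargs)
                if isinstance(result, list):
                    span.set_attribute("rag.result_count", len(result))
                return result
        return wrapper
    return decorator

@traced("retrieve")
def retrieve(query):  # spans nest under the active request trace
    ...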
OpenTelemetry Collector Config (with PII Scrubbing)
LangSmith
LLM tracing, debugging
Arize Phoenix
ML observability
OpenTelemetry
Core instrumentation
Datadog
Metrics, dashboards
TruLens
RAG eval metrics
- • Latency Breakdown: Query transform, embedding, search, rerank, LLM, total
- • Retrieval Quality: NDCG, docs per query, reranker effectiveness
- • LLM Metrics: Token usage, cost, temperature, model routing decisions
- • User Metrics: Active users, unique queries, satisfaction (thumbs up/down)
- • System Health: Disk usage (vector DB), index freshness, cache hit rates, error budget
Scaling & Performance
Scale RAG systems from thousands to billions of documents with architecture patterns
| Scale Tier | Document Count | Query Load (QPS) | Typical Latency (p95) | Architecture Pattern |
|---|---|---|---|---|
| Small | 10^5–10^6 chunks | <10 QPS | <2s | Single in-memory FAISS index, Python app, SQLite metadata |
| Medium | 10^7–10^8 chunks | 10–300 QPS | 1–4s | Milvus/Weaviate cluster, Kubernetes, async queue processing, multi-region replication |
| Large | 10^8–10^9+ chunks | 300–5000+ QPS | 500ms–1s | Elasticsearch sharding, GPU-accelerated search (Triton), vLLM serving, distributed caching, traffic shaping |
Multi-Layer Caching Strategy
L1: Exact Query
Hash(query) → response. TTL: 24h. Hit rate: 15–25% for repeated queries.
L2: Semantic
Embedding similarity clustering. Cache similar queries together. Hit rate: 30–40%.
L3: Embedding
Cache embeddings for large docs to avoid re-embedding on every query.
L4: LLM Response
Cache LLM outputs by (query, context hash). Reduces expensive inference calls.
SemanticCache Implementation
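A minimal SemanticCache sketch for the L2 layer; the encoder, threshold, and in-memory store are assumptions (use Redis/FAISS at scale):
# Sketch: semantic cache — serve a cached answer for near-duplicate queries
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.entries = []  # [(embedding, response)]; swap for FAISS/Redis at scale

    def get(self, query):
        q = self.model.encode(query)
        for emb, response in self.entries:
            sim = float(np.dot(emb, q) /
                        (np.linalg.norm(emb) * np.linalg.norm(q)))
            if sim >= self.threshold:
                return response  # near-duplicate query: cache hit
        return None

    def put(self, query, response):
        self.entries.append((self.model.encode(query), response))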
Scaling Architecture Patterns
Retrieval Optimization
- • Horizontal scaling: Shard index by doc_id ranges
- • GPU acceleration: Triton Inference Server for embedding
- • Connection pooling: Reuse DB connections (PgBouncer)
- • Async processing: Batch embedding requests
Generation Optimization
- • vLLM: PagedAttention for high-throughput serving
- • Model parallelism: Shard LLM across GPUs
- • Quantization: INT8/FP8 for latency reduction
- • Speculative decoding: Predict + verify next tokens in parallel
KServe Kubernetes-Native Serving (vLLM)
- Query Transform: 50ms (normalization, spell-check)
- Embedding: 20ms (vectorize query)
- Search: 30ms (FAISS/Milvus lookup)
- Rerank: 50ms (cross-encoder)
- LLM Generation: 800ms (token generation)
- Total Expected: ~950ms P50
SLO Modes: Low-latency interactive (<2–4s p95) vs High-throughput batch (10–60s acceptable)
Advanced Techniques
Agentic RAG
LLM agent decides what, when, and how to retrieve. Decomposes complex queries, chooses data sources (vector DB, SQL, API, web), iteratively refines results.
# Illustrative sketch — agent APIs vary by framework (LangChain, LlamaIndex)
agent = ReActAgent(
llm=claude,
tools=[vector_search, sql_query, web_search],
max_iterations=5
)
result = agent.run("Find latest Q1 earnings and analyst sentiment")
Graph RAG
Knowledge graphs + vector search. Entity-relationship graphs, multi-hop reasoning, community detection for summarization.
MATCH (a:Company)-[r*1..3]->(b:Company)
WHERE a.name = "Acme Inc"
WITH collect(b) as connected
CALL apoc.text.summarize(connected)
YIELD summary
RETURN summary
RAPTOR (Tree-based)
Recursively summarize clusters into a hierarchy. Query at multiple abstraction levels for top-down reasoning.
- Hierarchical abstraction layers
- Efficient multi-scale retrieval
- Reduced token cost vs flat indexing
Self-RAG (Adaptive)
LLM decides if retrieval needed, generates with self-critique tokens, iterates. Reduces unnecessary retrieval ~40%.
Outputs critique tokens, decides iterations
Multi-Modal RAG
Index images, tables, charts alongside text. Vision models for visual content, multi-modal embeddings.
- Cohere embed-v4 multi-modal
- CLIP for image-text alignment
- Unified vector space across modalities
Multi-Tenant RAG
Namespace isolation per tenant. Shared infrastructure, isolated data. Query-time ACL enforcement.
- Cost-efficient multi-tenancy
- Metadata-driven access control
- Compliance for regulated industries
Deployment & CI/CD
Docker Compose Architecture
services:
rag-api:
image: rag-api:latest
deploy:
replicas: 3
depends_on: [qdrant, redis]
embedding-service:
image: embedding-service:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
indexing-worker:
image: indexing-worker:latest
environment:
- CELERY_BROKER=redis:6379
qdrant:
image: qdrant/qdrant:latest
ports:
- "6333:6333"
redis:
image: redis:7-alpine
CI/CD Pipeline Stages
Code quality, type checks, fast tests
End-to-end retrieval, indexing flows
Gate metrics: Faithfulness >0.85, Relevance >0.80 (a minimal gate check is sketched after this list)
Monitor cost, latency, error rate
5% → 25% → 50% → 100% traffic
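A minimal sketch of the evaluation gate stage, with metric names and floors mirroring this document's targets:
# Sketch: CI gate — fail the build if eval metrics regress below thresholds
import sys

GATES = {"faithfulness": 0.85, "answer_relevance": 0.80}

def check_gates(metrics):
    failures = {name: metrics.get(name, 0.0)
                for name, floor in GATES.items()
                if metrics.get(name, 0.0) < floor}
    for name, value in failures.items():
        print(f"GATE FAILED: {name}={value:.2f} < {GATES[name]}")
    return not failures

if __name__ == "__main__":
    metrics = {"faithfulness": 0.88, "answer_relevance": 0.83}  # from eval run
    sys.exit(0 if check_gates(metrics) else 1)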
- Canonical doc schema + versioning
- ACL-aware retrieval architecture
- Evaluation harness in CI/CD
- Comprehensive observability & tracing
Cost Optimization
Cost Drivers (Ranked)
Cost Breakdown per 1K Queries
- Embedding Costs: Batch processing, caching, model selection, self-hosting (>1M queries/day)
- LLM Token Costs: Context pruning, model tiering, semantic caching (50–60% reduction)
- Infrastructure: Adaptive retrieval, hybrid tuning, dimension control, vLLM serving efficiency
Failure Modes & Mitigations
Irrelevant/Misleading Retrieval
Hybrid retrieval + reranking + CRAG-like evaluation gates
Over-retrieval / Context Stuffing
Cap top-k, MMR diversity, adaptive retrieval (Self-RAG)
Permission Leakage
ACL filters, ACL in cache keys, comprehensive audit trails
Prompt Injection via Retrieved Docs
Treat as untrusted, content sanitization, system prompt dominance
Embedding Drift (Model Upgrades)
Version embeddings/indexes, shadow-build, offline eval + canary
Silent Ingestion Regressions
Ingestion QA metrics, OCR confidence alerts, hit-rate monitoring
Cost Blow-ups
Token budgets, hard timeouts, reranker gating, rate limits
Use as a practical checklist for security & compliance governance across RAG components.
Phased Implementation Roadmap
Production Readiness Checklist
Data & Indexing
- Multi-format parsing (PDF, docx, HTML, images)
- Incremental indexing with changelog tracking
- Rich metadata (source, timestamp, ownership)
- Semantic chunking strategies
- Dead letter queue for malformed data
- Data freshness tracking & SLAs
- Canonical document contract
Retrieval & Generation
- Hybrid search (dense + sparse + knowledge graph)
- Multi-stage reranking pipeline
- Query transformation & expansion
- Streaming generation with token streaming
- Citation validation & provenance
- Confidence scoring on outputs
- Self-correction loops (Self-RAG)
Safety & Compliance
- Prompt injection detection & mitigation
- PII detection (Presidio integration)
- Hallucination detection frameworks
- RBAC / ACL enforcement
- Comprehensive audit logging
- GDPR / HIPAA / EU AI Act compliance
- NIST AI RMF & data retention policies
Operations & Scale
- Multi-layer caching (query, response, embedding)
- Distributed tracing (OpenTelemetry)
- Automated evaluation in CI/CD gates
- Canary & progressive rollout strategy
- Cost tracking & budget enforcement
- User feedback loop & telemetry
- vLLM / KServe deployment optimization
"Production RAG is 20% retrieval and 80% engineering"
Data quality, evaluation frameworks, guardrails, caching strategies, comprehensive observability, and production operations are what separate a working demo from a reliable, compliant, cost-efficient product.
Context Compression for RAG
Reduce retrieved context length before generation — cut costs up to 80%, decrease latency, improve answer quality by eliminating noise, and fit more relevant information within the LLM's context window.
Compression Taxonomy
Extractive
Select the most relevant sentences or tokens from retrieved documents. No rewriting — preserves original text fidelity.
Best for: Factual QA, legal/medical where exact wording matters
Abstractive
Generate condensed summaries of retrieved context. Rewrites and merges information from multiple documents into coherent compressed text.
Best for: Multi-doc synthesis, when space is extremely limited
Hybrid / Learned
Neural models trained to compress context into summary vectors or learned soft tokens. Encode key information into fixed-size representations.
Best for: Very long context, embedding-level compression
Key Techniques Compared
| Technique | Type | Compression | Quality Retention | Latency Overhead | Best For |
|---|---|---|---|---|---|
| LLMLingua-2 | Extractive (token-level) | 3-20x | 95-98% | ~10ms (small classifier) | General-purpose; best quality/speed ratio |
| LongLLMLingua | Extractive (query-aware) | 2-10x | 97-100% (can improve +21%) | ~15ms | Multi-doc RAG; combats lost-in-middle |
| Selective Context | Extractive (sentence-level) | 2-5x | 93-96% | ~5ms | Simple baseline; minimal dependencies |
| Reranker + Top-K Filter | Extractive (document-level) | 2-5x | 95-99% | ~20-50ms (cross-encoder) | Already using reranker; simplest integration |
| RECOMP (Extractive) | Extractive (trained selector) | 5-10x | 94-97% | ~15ms | NQ/TriviaQA-style single-answer tasks |
| RECOMP (Abstractive) | Abstractive (trained summarizer) | 10-20x | 90-95% | ~100-200ms (small LM gen) | Multi-hop reasoning; extreme compression |
| AutoCompressors | Learned (summary vectors) | 20-50x | 85-92% | ~50ms | Very long documents; fixed-budget context |
| Map-Reduce Summary | Abstractive (LLM chain) | 10-50x | 80-90% | ~500ms-2s (LLM calls) | 100+ page documents; report generation |
| ECoRAG | Hybrid (evidentiality-guided) | 5-15x | 96-99% | ~20ms | Long context RAG; evidence-focused answers |
LLMLingua Family — Production Standard
LLMLingua (v1)
Uses a small language model (e.g., GPT-2, LLaMA-7B) to compute per-token perplexity. Tokens with low perplexity (highly predictable) are dropped. Budget-constrained iterative token pruning.
- Up to 20x compression
- Only 1.5% performance loss on reasoning
- Works with any LLM (black-box compatible)
LLMLingua-2
Reframes compression as a token classification problem. A small BERT-like model predicts which tokens to keep/drop. Trained on GPT-4 distilled labels.
- 3-6x faster than LLMLingua v1
- 95-98% accuracy retention
- Task-agnostic — no prompt-specific tuning
- Published at ACL 2024
LongLLMLingua (RAG-Optimized)
Specifically designed for RAG pipelines. Three key innovations:
- Question-aware coarse-to-fine: Compresses differently based on query relevance — keeps more tokens from highly relevant passages
- Document reordering: Combats the "lost-in-middle" problem by placing most relevant docs at start/end
- Dynamic compression ratios: Uses contrastive perplexity (question-conditioned vs unconditional) to decide per-document compression level
Result: Up to 21.4% RAG quality improvement using only 25% of tokens
RECOMP — Trained Compression
Two variants from Princeton/CMU research:
- Extractive: Trained selector picks most useful sentences from each document. Fast, preserves original text.
- Abstractive: Trained T5-based summarizer generates concise summaries conditioned on the query. Higher compression but rewrites text.
Both outperform no-compression baselines on NQ and TriviaQA while using 5-20x fewer input tokens.
Production Implementation
# === LLMLingua-2 with LlamaIndex ===
from llmlingua import PromptCompressor
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Initialize compressor (uses small model for token classification)
compressor = PromptCompressor(
model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
use_llmlingua2=True,
device_map="cuda"
)
# Retrieve documents (standard RAG pipeline)
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("What are the key findings?")
# Compress retrieved context before sending to LLM
context = "\n\n".join([n.get_content() for n in nodes])
compressed = compressor.compress_prompt(
context,
instruction="Answer the question based on the context.",
question="What are the key findings?",
target_token=500, # target compressed length
rate=0.5, # 50% compression ratio
force_tokens=["?", "."], # always keep these tokens
)
print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']:.1f}x")
print(f"Saving: {compressed['saving']}")
# Use compressed context for generation
compressed_prompt = compressed["compressed_prompt"]
# === LangChain Contextual Compression Retriever ===
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
LLMChainExtractor,
EmbeddingsFilter,
DocumentCompressorPipeline,
)
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
# Strategy 1: LLM-based extraction (highest quality, highest latency)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
# Strategy 2: Embeddings filter (fast, no LLM call)
embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(
embeddings=embeddings,
similarity_threshold=0.76 # drop docs below threshold
)
# Strategy 3: Pipeline — split → filter → extract (recommended)
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
pipeline = DocumentCompressorPipeline(
transformers=[splitter, embeddings_filter]
)
compression_retriever = ContextualCompressionRetriever(
base_compressor=pipeline,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
# Use in chain
docs = compression_retriever.invoke("What are the quarterly results?")
print(f"Retrieved {len(docs)} compressed docs")
Production Recommendations
Recommended: Tiered Compression
Combine multiple techniques in a pipeline for best results:
- Document-level: Reranker filters top-k → top-3-5 docs
- Sentence-level: LLMLingua-2 or EmbeddingsFilter removes irrelevant sentences
- Token-level: LLMLingua-2 prunes redundant tokens (optional, for aggressive compression)
Typical result: 5-10x compression, 96%+ quality, <30ms overhead
When to Use Each Approach
- Low latency budget (<10ms): Embeddings filter or top-k reranker cutoff
- Moderate latency (<50ms): LLMLingua-2 token classification — best all-around
- Maximum quality: LongLLMLingua with question-aware compression
- Extreme compression (>10x): RECOMP abstractive or map-reduce
- Cost-sensitive: LLMLingua-2 + small local model (no API calls)
- Multi-hop QA: ECoRAG evidentiality-guided compression
Cost & Latency Impact
| Scenario | Input Tokens | With Compression | Token Savings | Cost Savings (GPT-4o) |
|---|---|---|---|---|
| 5 docs @ 500 tokens | 2,500 | 625 (4x) | 1,875 | $4.69 / 1K queries |
| 10 docs @ 800 tokens | 8,000 | 1,600 (5x) | 6,400 | $16.00 / 1K queries |
| 20 docs @ 1000 tokens | 20,000 | 2,000 (10x) | 18,000 | $45.00 / 1K queries |
| 1M queries/month (10 docs) | 8B tokens | 1.6B tokens | 6.4B tokens | $16,000/month saved |
Framework Integration
LangChain
ContextualCompressionRetriever wraps any base retriever + compressor pipeline. Built-in: LLMChainExtractor, EmbeddingsFilter, DocumentCompressorPipeline.
pip install langchain
LlamaIndex
LongLLMLinguaPostprocessor integrates directly into query pipeline as a node postprocessor. Supports LLMLingua-2.
pip install llmlingua llama-index
Direct (Microsoft)
PromptCompressor from the llmlingua library. Framework-agnostic — works with any pipeline. Supports CUDA acceleration.
pip install llmlingua
RAG Taxonomy — The Complete Map
A hierarchical taxonomy of RAG architectures showing how different approaches relate, evolve, and specialize. From naive foundations to advanced agentic systems.
Evolution Timeline
2020-2022: Foundation
- Naive RAG emerges as standard approach
- Embedding models (BERT, Sentence-BERT) become practical
- Simple chunk → embed → retrieve → generate pattern
2023-2024: Maturation
- Advanced RAG techniques (reranking, query rewriting)
- Modular approaches (LangGraph, DSPy) gain adoption
- Self-RAG papers published, agentic patterns emerge
2024-2025: Specialization
- Multimodal and hybrid systems
- Adaptive routing based on query complexity
- CAG with extended context windows (200K+)
Future: Integration
- Unified frameworks combining multiple techniques
- Automatic approach selection via meta-reasoning
- Stronger metrics for measuring RAG quality
Quick Taxonomy Comparison
| Type | Complexity | Best For | Latency |
|---|---|---|---|
| Naive RAG | Low | Prototyping, simple Q&A | Fast (100-500ms) |
| Advanced RAG | Medium | Production systems, accuracy | Moderate (500ms-2s) |
| Modular/Agentic | High | Complex reasoning, multi-step | Slower (2-10s) |
| CAG | Low (setup) | Small corpus, low latency | Fastest (<100ms) |
Naive RAG — The Foundation Pattern
The simplest RAG architecture: chunk documents → embed → retrieve → generate. Powerful for basic Q&A but suffers from lost-in-the-middle, no query transformation, and no reranking. A great starting point, but not production-ready alone.
Core Limitations
Lost in the Middle
Models attend less to information in the middle of long contexts. With naive RAG returning k=5 documents, the first and last chunks receive the most attention. Reranking and context reordering mitigate this (see the reordering sketch after the code example below).
No Query Transformation
Complex questions aren't rewritten. A query like "How does RAG work?" gets no transformation — you retrieve with the literal user text, missing semantic variation.
No Reranking
Retrieval rank is final. If the embedding metric ranks doc #3 high but it's actually irrelevant, there's no second-pass reranker to fix it.
No Fallback Strategy
If retrieval fails or returns low-confidence results, the model still generates based on whatever was retrieved. No threshold checks or secondary retrieval.
Code Example: Minimal Naive RAG
# === Minimal Naive RAG ===
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# 1. Load and chunk documents
documents = ["doc1 text...", "doc2 text..."]
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text("\n\n".join(documents))
# 2. Embed and index
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_texts(chunks, embeddings)
# 3. Create naive RAG chain
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff", # Just stuff context into prompt
retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)
# 4. Query
result = qa_chain.invoke({"query": "What is RAG?"})
print(result["result"])
When to Use Naive RAG
Advanced RAG — Production Optimization
Add three optimization layers to naive RAG: pre-retrieval query transformation, hybrid retrieval, and post-retrieval reranking, context compression, and self-correction. Most production systems live here. Better accuracy with manageable complexity.
Three Optimization Layers
Pre-Retrieval
- Query Rewriting: Rephrase for clarity
- HyDE: Generate hypothetical doc
- Multi-Query: Ask multiple ways
- Contextual Expansion: Add domain context
Retrieval
- Hybrid Search: BM25 + vectors
- RRF Fusion: Merge rankings
- Semantic Router: Route by topic
- Metadata Filtering: Pre-filter
Post-Retrieval
- Reranking: Cross-encoder scoring
- Context Compression: Distill docs
- Diversity: Remove redundancy
- Self-Correction: Validate output
Code: Advanced RAG with Reranking
# === Advanced RAG: Query Rewriting + Hybrid + Reranking ===
from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.retrievers import BM25Retriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# 1. Query rewriting (LLM-based)
def rewrite_query(query, llm):
    prompt = f"Rewrite this search query for clarity: {query}"
    return llm.invoke(prompt).content  # return plain text, not an AIMessage
# 2. Hybrid retrieval: BM25 + Vector
bm25_retriever = BM25Retriever.from_documents(docs)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
ensemble = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.5, 0.5] # RRF fusion
)
# 3. Reranking with cross-encoder
compressor = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"),
    top_n=5
)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=ensemble
)
# 4. Execute advanced RAG
query = "What is prompt engineering?"
rewritten = rewrite_query(query, llm)
docs = compression_retriever.invoke(rewritten)
# Best docs are now ranked, rewritten query improved retrieval
Tools & Frameworks
Pre-Retrieval Tools
- HyDE — Generate hypothetical answers
- Query2doc — Expand the query with an LLM-generated pseudo-document
- Prompt chaining — Step-by-step rewriting
Reranking Models
- ColBERT — Token-level scoring
- LLM-Rank — Use LLM as judge
- Jina Reranker — Free, open API
Modular RAG — Composable Components
Break RAG into pluggable, reusable modules: routing, retrieval, reranking, generation. Mix and match for different scenarios. Powers DSPy, LangGraph, and production orchestration systems. Enables rapid experimentation and A/B testing of components.
Core Modules
Router
- Semantic Router: Route by topic
- Rule-Based: Pattern matching
- LLM Router: Let model decide
- Multi-Query: All paths
Retrievers
- Vector: Semantic search
- BM25: Keyword search
- Graph: Relationship-based
- Fusion: Merge multiple
Processors
- Reranker: Score/reorder
- Compressor: Shrink context
- Filter: Remove irrelevant
- Validator: Check quality
Frameworks & Tools
LangGraph
State machine-based orchestration. Define nodes (modules) and edges (flow). Built on LangChain. Great for explicit control flow and multi-step pipelines.
DSPy
Stanford framework for modular composition. Signatures define input/output contracts. Optimizers auto-tune prompts. Excellent for experimentation.
LlamaIndex
Query engines compose retrievers and query fusion. Component-based architecture. Strong integrations with vector DBs and LLM APIs.
Custom Orchestration
Build from scratch with Python. Explicit control, minimal dependencies. Good when you need very specific workflows or want to avoid framework lock-in.
Code: Modular RAG with LangGraph
# === Modular RAG with LangGraph ===
from langgraph.graph import StateGraph
from typing import TypedDict
class RAGState(TypedDict):
query: str
route: str
retrieved_docs: list
answer: str
# Define modules as functions
def router_module(state):
"Route query: simple, complex, or multi-hop"
route = "simple" if len(state["query"].split()) < 5 else "complex"
return {"route": route}
def retrieve_module(state):
"Use appropriate retriever based on route"
if state["route"] == "simple":
docs = simple_retriever.invoke(state["query"])
else:
docs = hybrid_retriever.invoke(state["query"])
return {"retrieved_docs": docs}
def generate_module(state):
"Generate answer from retrieved docs"
context = "\n".join([d.page_content for d in state["retrieved_docs"]])
prompt = f"Context: {context}\n\nQ: {state['query']}\nA:"
    answer = llm.invoke(prompt).content
return {"answer": answer}
# Wire modules into graph
graph = StateGraph(RAGState)
graph.add_node("router", router_module)
graph.add_node("retriever", retrieve_module)
graph.add_node("generator", generate_module)
graph.add_edge("router", "retriever")
graph.add_edge("retriever", "generator")
graph.set_entry_point("router")
graph.set_finish_point("generator")
# Invoke
app = graph.compile()
result = app.invoke({"query": "What is RAG?"})
print(result["answer"])
Agentic RAG — LLM as Orchestrator
The LLM decides when, what, and how to retrieve. Uses ReAct (Reasoning + Action), tool calling, and iterative multi-step reasoning. Can perform complex workflows: plan → retrieve → refine → retrieve again → generate. Closest to human problem-solving.
Why Agentic RAG?
Multi-Step Reasoning
Complex questions often need multiple retrievals. "Who won the 2024 Oscars and what's their next film?" → Retrieve Oscars → Retrieve actor bio → Retrieve filmography. Agentic handles this naturally.
Tool Composition
The LLM decides which tool to use. Combine retrieval, web search, SQL, calculators, APIs. The model figures out the workflow instead of you hard-coding it.
Uncertainty Handling
If the model is uncertain, it can retrieve more docs, search the web, or ask for clarification. No fixed pipeline — it adapts to the problem.
Explainability
You see the chain of thought: "I need to find... then I'll retrieve... then I'll compute...". The model's reasoning behind each action is transparent.
Code: Agentic RAG with LangGraph
# === Agentic RAG with Tool Use ===
from langgraph.prebuilt import create_react_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo")
# Define tools as functions
@tool
def retrieve_docs(query: str) -> str:
    "Search the knowledge base for documents matching the query"
    docs = vectorstore.similarity_search(query, k=5)
    return "\n".join([d.page_content for d in docs])
@tool
def web_search(query: str) -> str:
"Search the web for current information"
results = tavily_search(query, max_results=3)
return "\n".join(results)
@tool
def query_database(sql: str) -> str:
"Execute a SQL query against the database"
result = db.execute(sql)
return str(result)
# Create agent with ReAct pattern
tools = [retrieve_docs, web_search, query_database]
agent = create_react_agent(llm, tools)
# Invoke with a complex question (LangGraph agents take a messages list)
result = agent.invoke({
    "messages": [("user", "Find documents about RAG, then search web for latest developments, then tell me the top 3 trends")]
})
print(result["messages"][-1].content)
# LLM decides which tools to call, in what order
Frameworks
LangGraph Agents
- create_react_agent — ReAct out-of-the-box
- Custom graphs for specialized workflows
- Built-in memory and persistence
Anthropic API
- Claude tool use blocks for agentic flows
- Native support for multi-turn conversations
- Forced tool selection via tool_choice in Claude 3.5+
Self-RAG & Corrective RAG (CRAG) — Self-Reflective Retrieval
The model reflects on its own retrieval and generation. Self-RAG evaluates retrieved doc relevance and output correctness. CRAG adds automatic fallback to web search if retrieval confidence is low. Both reduce hallucination through self-correction.
Self-RAG Decisions
Retrieve?
Can I answer from my weights? Or do I need external knowledge? Smart models skip retrieval for "What is 2+2?" but retrieve for "Latest AI trends."
Relevant Docs?
Are the retrieved docs actually answering the question? If not, re-retrieve or retrieve differently. This prevents using irrelevant context.
Correct Output?
Does my answer follow from the retrieved docs? Or did I hallucinate? Self-check before outputting. This is explicit hallucination detection.
Code: Self-RAG Pattern
# === Self-RAG: Query Routing with Verification ===
from langchain.prompts import PromptTemplate
# Step 1: Decide whether to retrieve
decide_to_retrieve = PromptTemplate.from_template(
"""Given this question, should we retrieve documents?
Question: {query}
Answer Yes or No. Be decisive. Questions about 'latest' or 'current' → Yes."""
)
# Step 2: Retrieve and verify relevance
verify_relevance = PromptTemplate.from_template(
"""Are these documents relevant to the question?
Question: {query}
Documents: {docs}
Rate as RELEVANT, PARTIALLY_RELEVANT, or NOT_RELEVANT"""
)
# Step 3: Generate and verify correctness
verify_generation = PromptTemplate.from_template(
"""Based on these documents, generate an answer. Then verify it.
Documents: {docs}
Question: {query}
Answer:
[Your answer]
Supported_by_docs: Yes or No (is the answer grounded in the documents?)"""
)
# Full Self-RAG flow
def self_rag(query, llm):
# 1. Decide to retrieve
should_retrieve = llm.invoke(decide_to_retrieve.format(query=query))
if "Yes" not in should_retrieve.content:
return llm.invoke(f"Q: {query}\nA:") # Answer without retrieval
# 2. Retrieve
docs = vectorstore.similarity_search(query, k=5)
# 3. Verify relevance
relevance = llm.invoke(verify_relevance.format(query=query, docs=str(docs)))
if "NOT_RELEVANT" in relevance.content:
# Fallback: web search for CRAG
web_docs = tavily_search(query)
docs = web_docs
# 4. Generate with verification
result = llm.invoke(verify_generation.format(query=query, docs=str(docs)))
if "No" in result.content: # Not supported by docs
return "I cannot answer this based on available information."
return result.content
Self-RAG vs CRAG
| Aspect | Self-RAG | CRAG |
|---|---|---|
| Self-Reflection | Decides whether to retrieve, evaluates doc relevance, verifies output | Same + web fallback on low confidence |
| Data Source | Only knowledge base + model weights | Knowledge base + web search fallback |
| Currency | Limited to indexed knowledge | Can access real-time web data |
| Best For | Internal knowledge, hallucination prevention | Questions needing current info |
Adaptive RAG — Dynamic Strategy Selection
Classify query complexity and dynamically select retrieval strategy. Simple questions skip retrieval. Moderate questions use single-step retrieval. Complex questions trigger multi-step retrieval and reasoning. Optimizes latency and accuracy on a per-query basis.
Three Routing Strategies
Simple
- No retrieval
- LLM answers from weights
- Lowest latency
- Examples: "What is 2+2?", "Who is Elon Musk?"
Moderate
- Single retrieval step
- Hybrid search (BM25+vector)
- Rerank top-5
- Examples: "Explain RAG", "Latest AI news"
Complex
- Multi-step agentic flow
- Multiple retrievals + reasoning
- Web search fallback
- Examples: Comparative analysis, multi-part questions
How to Classify Complexity
Rule-Based
- Word count < 5 → Simple
- Contains "compare", "vs" → Complex
- Contains "how", "why" → Moderate+
- Fast, deterministic
LLM-Based
- Use LLM to classify query
- More accurate but slower
- Handles nuance and edge cases
- Cache classification results
Code: Query Routing
# === Adaptive RAG: Route by Complexity ===
def classify_complexity(query: str) -> str:
"Simple rule-based classifier"
words = query.lower().split()
if len(words) < 5:
return "simple"
complex_indicators = ["compare", "versus", "vs", "trade-off", "analyze"]
if any(ind in query.lower() for ind in complex_indicators):
return "complex"
return "moderate"
def adaptive_rag(query, llm):
# Step 1: Classify
complexity = classify_complexity(query)
# Step 2: Route
if complexity == "simple":
# Direct LLM answer
return llm.invoke(f"Q: {query}\nA:")
elif complexity == "moderate":
# Single retrieval + generation
docs = hybrid_retriever.invoke(query)
context = "\n".join([d.page_content for d in docs])
prompt = f"Context: {context}\n\nQ: {query}\nA:"
        return llm.invoke(prompt).content
else: # complex
# Multi-step agentic RAG
        agent = create_react_agent(llm, [retrieve_docs, web_search, analyze_tool])
        result = agent.invoke({"messages": [("user", query)]})
        return result["messages"][-1].content
# Usage
answer = adaptive_rag("What is RAG?", llm) # → Simple route, fast
answer = adaptive_rag("Explain vector RAG with embeddings", llm) # → Moderate
answer = adaptive_rag("Compare all RAG types with latency trade-offs", llm) # → Complex
Multimodal RAG — Text, Images, Audio, Video
Extend RAG beyond text to images, tables, audio, video. Use multimodal embeddings (CLIP, GPT-4V) for cross-modal retrieval. Unified indexing allows querying like "find images of dogs" or "transcript sections about AI." Emerging but powerful for rich media corpora.
Multimodal Embedding Models
CLIP
- Text ↔ Image alignment
- Open-source (OpenAI)
- Fast inference
- Good for product images
GPT-4V / Claude
- Vision + language understanding
- API-based (cost)
- Excellent description
- Complex visual reasoning
LLaVA / Falcon
- Open-source vision LLMs
- Self-hosted option
- Decent accuracy
- Lower cost than APIs
Use Cases
E-commerce
Upload product photo → find similar products. Retrieve docs describing materials. Both text and image results ranked together.
Scientific Research
Search for papers + retrieve figures/tables. "Find papers about protein folding with diagrams." Text + images indexed together.
Video Content
Retrieve video sections by transcript. "Find the part where they explain embeddings" → Return timestamp + transcript excerpt.
Documentation
Index docs + diagrams. "How do I deploy on AWS?" → Text guide + architecture diagram retrieved together.
Tools & Frameworks
Multimodal Indexing
- LlamaIndex MultiModal — Multi-doc indexes
- Vespa — Text + image vectors
- Qdrant — Named (multi-)vectors per point enable multimodal search
- Weaviate — Multi-modal indexing
Embedding APIs
- OpenAI CLIP — Multi-modal embeddings
- Google Gemini Vision — Image understanding
- Anthropic Claude Vision — Rich analysis
- Hugging Face models — Open source options
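A minimal cross-modal retrieval sketch using the CLIP port in sentence-transformers (image file names are placeholders):
# === Cross-Modal Retrieval with CLIP ===
from sentence_transformers import SentenceTransformer, util
from PIL import Image
model = SentenceTransformer("clip-ViT-B-32")
# Embed images and text into the same vector space
image_embs = model.encode([Image.open(p) for p in ["dog.jpg", "chart.png"]])
query_emb = model.encode("a photo of a dog")
# Cosine similarity ranks images against the text query
scores = util.cos_sim(query_emb, image_embs)
print(scores)  # highest score = best cross-modal match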
Hybrid RAG — Fusing Multiple Retrieval Methods
Combine sparse (BM25, keyword) + dense (vector embeddings) + structured (graph, SQL). Use fusion algorithms (RRF, learned fusion) to merge rankings. Eliminates single point of failure. Captures both exact matches and semantic similarity. Production RAG standard.
Retrieval Method Combinations
| Combination | Strengths | Cost | Use When |
|---|---|---|---|
| BM25 + Vector | Keywords + semantic, high recall, no gaps | Low | Production standard. Always start here. |
| BM25 + Vector + Graph | Keywords, semantic, entity relationships | Medium | Structured data: knowledge graphs, ontologies |
| Multiple Dense | Different embedding models, perspectives | Medium-High | Unclear best embedding model. Ensemble approach. |
| Full Hybrid | All modalities covered, highest recall | High | Complex domain, diverse corpus types |
Code: RRF Fusion
# === Hybrid RAG with RRF Fusion ===
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# 1. Create two retrievers
bm25_retriever = BM25Retriever.from_documents(docs)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# 2. Ensemble with RRF (built-in)
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]  # equal weighting in the reciprocal rank fusion
)
# 3. Use in RAG chain
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=ensemble,
return_source_documents=True
)
result = qa.invoke({"query": "What is RAG?"})
# 4. Manual RRF scoring (if needed)
def rrf_fusion(bm25_docs, vector_docs, k=60):
"""Reciprocal Rank Fusion"""
scores = {}
for rank, doc in enumerate(bm25_docs, 1):
scores[doc.metadata["id"]] = 1 / (k + rank)
for rank, doc in enumerate(vector_docs, 1):
doc_id = doc.metadata["id"]
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
# Sort by score
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
return ranked[:5] # Top-5
Cache-Augmented Generation (CAG) — Pre-load Knowledge into KV Cache
Instead of retrieving at runtime, pre-load your entire corpus into the model's context and cache the resulting KV states. Eliminates retrieval latency. Only feasible for small corpora (<100 pages) that fit in extended context windows. The fastest approach for problems that fit.
When CAG Makes Sense
Small Knowledge Base
Entire corpus <100 pages, <50K tokens. Product docs, internal policies, FAQs. Fits in 200K context windows easily.
Real-Time Latency Critical
Sub-100ms response time needed. Chatbots, real-time assistants. Retrieval overhead is unacceptable.
Static or Rarely Updated
Knowledge base changes less than once a week. One-time setup, no daily cache invalidation. Stable reference docs.
Implementation Approaches
Context Stuffing
Simplest: Put all docs in system prompt or context. Claude 200K window easily fits 50-100 pages. Model uses in-context attention. No external retrieval.
KV Cache Caching
Pre-compute model's key-value cache for corpus. Anthropic API supports prompt caching. Only compute KV once, reuse for 100s of queries.
Prefix Caching
Cache common prefixes (docs, instructions) across requests. Saves API costs. Supported by the Anthropic and OpenAI APIs, among others.
Embedding Summary
Generate summaries of each doc, cache summaries. Query against summaries, then in-context search. Hybrid approach.
Code: CAG with Prompt Caching
# === Cache-Augmented Generation (Prompt Caching) ===
from anthropic import Anthropic
client = Anthropic()
# 1. Load entire corpus
with open("knowledge_base.txt", "r") as f:
corpus = f.read()
print(f"Corpus size: {len(corpus):,} tokens (~{len(corpus)//4})")
# 2. Create message with cached corpus
# First request: cache is populated
response1 = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
system=[
{
"type": "text",
"text": "You are a helpful assistant with access to the following knowledge base:"
},
{
"type": "text",
"text": corpus,
"cache_control": {"type": "ephemeral"} # Enable caching
}
],
messages=[
{"role": "user", "content": "What is RAG?"}
]
)
print(f"First query latency: {response1.usage.elapsed}ms")
print(f"Cache created size: {response1.usage.cache_creation_input_tokens}")
# 3. Subsequent requests reuse cache
for query in ["Explain embeddings", "What is retrieval?", "Tell me about vectors"]:
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1000,
system=[
{
"type": "text",
"text": "You are a helpful assistant with access to the following knowledge base:"
},
{
"type": "text",
"text": corpus,
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{"role": "user", "content": query}
]
)
print(f"{query}: {response.usage.elapsed}ms, cache_read: {response.usage.cache_read_input_tokens}")
# Subsequent queries much faster, cache tokens reused
CAG vs Traditional RAG
| Metric | Traditional RAG | CAG |
|---|---|---|
| Latency Per Query | 500ms-2s (retrieval + gen) | <100ms (gen only) |
| Setup Latency | None (on-demand) | 1-2s (first request, cache KV) |
| Corpus Size Limit | Unlimited (external retrieval) | <200K tokens (context window) |
| Cost Per Query | Retrieval (DB) + LLM tokens | LLM tokens (cached cheaper) |
| Knowledge Updates | Instant (next retrieval) | Requires cache invalidation |
| Scalability | Scales to GB+ of docs | Limited to context window |
Use Cases
Customer Support
Cache product docs, FAQs, policies. Every agent query reuses cache. 10x faster than retrieval-based RAG. Lower API costs per conversation.
Internal QA Bots
Onboarding docs, internal policies, company handbook. Cache once, serve employees instantly. No external DB needed.
Real-Time Chat
Where latency is critical: cached academic paper summaries, cached medical reference guides. Sub-100ms response times.
Mobile/Edge Apps
Local knowledge base cached in app. Offline-first architecture. Sync when online. No dependency on external retrieval service.
Graph RAG — Knowledge Graph Enhanced Retrieval
Augment vector retrieval with structured knowledge graphs to enable multi-hop reasoning, entity-aware retrieval, traceable answers, and dramatically reduced hallucinations — especially in entity-rich domains like finance, healthcare, legal, and enterprise knowledge bases.
Why Graph RAG?
Baseline RAG Limitations
- Can only reason within a single retrieved chunk
- Fails on multi-hop questions ("Who is the CEO of the company that acquired X?")
- No understanding of entity relationships
- Hard to trace why a chunk was retrieved
- Global summarization questions return fragmented answers
Graph RAG Advantages
- Multi-hop reasoning: Traverse entity → relation → entity paths
- Entity awareness: Disambiguate "Apple" (company vs fruit)
- Traceable answers: Show the graph path that supports each claim
- Reduced hallucination: Grounded in verified structured facts
- Global queries: Community summaries answer "What are the main themes?"
Baseline RAG vs Graph RAG
| Dimension | Baseline (Vector) RAG | Graph RAG |
|---|---|---|
| Retrieval | Semantic similarity (embedding cosine) | Semantic + structural (graph traversal + embeddings) |
| Reasoning | Single-hop (within chunk) | Multi-hop (across entity chains) |
| Explainability | Low — "matched chunk X" | High — "followed path A→B→C" |
| Global queries | Poor (fragmented across chunks) | Good (community summaries) |
| Entity resolution | None | Built-in (graph deduplication) |
| Hallucination rate | 10-25% | 3-10% (grounded in facts) |
| Setup cost | Low ($100s) | Medium-High ($1K-10K, 3-5x baseline) |
| Latency | 50-200ms | 100-500ms (graph + vector) |
| Maintenance | Re-embed on doc update | Re-extract entities + re-embed |
Implementation Approaches
Microsoft GraphRAG
LLM-based entity/relation extraction → Leiden community detection → hierarchical summaries. Best for global queries and corpus-level understanding.
Cost: 3-5x baseline (LLM extraction)
Neo4j + LangChain
LLMGraphTransformer for entity extraction → Neo4j for storage/traversal → Cypher query generation → hybrid vector+graph retrieval.
Best for production enterprise deployments
LlamaIndex PropertyGraph
PropertyGraphIndex with auto-extraction. Supports Neo4j, Nebula, or in-memory graph store. Integrates with existing LlamaIndex pipelines.
Easiest integration if already using LlamaIndex
KG Construction Pipeline
- Entity extraction: LLM-based (GPT-4o / Claude) or dependency-based (spaCy + custom rules — 10x cheaper, comparable quality)
- Relation extraction: Two-stage approach (KGGEN) — entities first, then relations — reduces error propagation
- Community detection: Leiden algorithm creates hierarchical clusters for global summarization
Implementation: Neo4j + LangChain
# === Graph RAG with Neo4j + LangChain ===
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
# 1. Connect to Neo4j
graph = Neo4jGraph(
url="bolt://localhost:7687",
username="neo4j",
password="password"
)
# 2. Extract entities and relations from documents
llm = ChatOpenAI(model="gpt-4o", temperature=0)
transformer = LLMGraphTransformer(
llm=llm,
allowed_nodes=["Person", "Company", "Product", "Technology"],
allowed_relationships=["WORKS_AT", "ACQUIRED", "USES", "FOUNDED"],
)
# 3. Chunk and transform
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
graph_docs = transformer.convert_to_graph_documents(chunks)
# 4. Store in Neo4j
graph.add_graph_documents(graph_docs, baseEntityLabel=True)
print(f"Nodes: {len(graph_docs[0].nodes)}, Rels: {len(graph_docs[0].relationships)}")
# === Hybrid Retrieval: Graph + Vector ===
from langchain_community.vectorstores import Neo4jVector
from langchain.chains import GraphCypherQAChain
# Vector index on chunk embeddings in Neo4j
vector_store = Neo4jVector.from_existing_graph(
embedding=embeddings,
node_label="Document",
text_node_properties=["text"],
embedding_node_property="embedding",
)
# Graph Cypher chain for structured queries
cypher_chain = GraphCypherQAChain.from_llm(
llm=llm,
graph=graph,
verbose=True,
allow_dangerous_requests=True, # needed for Cypher generation
)
# Hybrid retrieval function
def hybrid_graph_rag(query: str):
# 1. Vector retrieval (semantic)
vector_results = vector_store.similarity_search(query, k=5)
# 2. Graph retrieval (structured)
graph_result = cypher_chain.invoke({"query": query})
    # 3. Fuse contexts (concatenate graph facts with retrieved passages)
context = f"""Graph facts: {graph_result['result']}
Retrieved passages:
{chr(10).join([d.page_content for d in vector_results])}"""
# 4. Generate with fused context
answer = llm.invoke(
f"Based on the following context, answer: {query}\n\n{context}"
)
return answer
# Multi-hop query that baseline RAG fails on
result = hybrid_graph_rag("Who founded the company that acquired Instagram?")
# Graph path: Instagram -[ACQUIRED_BY]-> Meta -[FOUNDED_BY]-> Mark Zuckerberg
Production Recommendations
Use Graph RAG When
- Entity-rich domains: Finance (companies, people, transactions), healthcare (drugs, conditions, treatments), legal (cases, entities, rulings)
- Multi-hop questions are common: "What drugs interact with medications prescribed to patients with condition X?"
- Explainability required: Regulated industries need traceable reasoning paths
- Global/thematic queries: "What are the main themes across all documents?"
- Entity disambiguation matters: Same name = different entities across documents
Stick with Baseline RAG When
- Simple factual QA: Single-hop lookups within documents
- Budget-constrained: KG extraction costs 3-5x more than baseline
- Rapidly changing corpus: KG maintenance overhead is significant
- Small document set: <100 docs — graph overhead not justified
- Latency-critical: Graph traversal adds 50-300ms per query
Cost Comparison
| Component | Baseline RAG | Graph RAG | Delta |
|---|---|---|---|
| Indexing (10K docs) | $5-15 (embeddings) | $50-200 (LLM extraction + embeddings) | 3-15x more |
| Storage | $10-30/mo (vector DB) | $50-150/mo (Neo4j + vector DB) | 3-5x more |
| Query latency | 50-200ms | 100-500ms | 2-3x slower |
| Per-query cost | $0.001-0.005 | $0.002-0.01 | 2x more |
| Answer quality (multi-hop) | 40-60% accuracy | 75-90% accuracy | +30-50% better |
| Hallucination rate | 10-25% | 3-10% | 50-70% less |
Tools & Libraries
Graph Databases
- Neo4j — Industry standard; Cypher query language
- Amazon Neptune — Managed; good for AWS stacks
- NebulaGraph — Open source; scales to billions of edges
- FalkorDB — Redis-based; ultra-low latency
KG Construction
- LLMGraphTransformer — LangChain; LLM-based
- microsoft/graphrag — Full pipeline; community detection
- spaCy + custom — Dependency-based; 10x cheaper
- Diffbot NLU — API-based entity linking
Frameworks
- LangChain — GraphCypherQAChain, Neo4jVector
- LlamaIndex — PropertyGraphIndex, KnowledgeGraphIndex
- RAGatouille — ColBERT-based retrieval toolkit
- Haystack — Knowledge graph retriever component
Vectorless RAG — Retrieval Without Embeddings
Vectorless RAG approaches bypass traditional embedding-based retrieval entirely, using techniques like BM25, structured SQL queries, LLM-native context stuffing, or direct API calls to retrieve relevant information — eliminating the need for vector databases, embedding models, and index maintenance.
Vectorless Retrieval Approaches
BM25 / Full-Text Search
Classic keyword-based retrieval using term frequency and inverse document frequency (TF-IDF). Works through Elasticsearch, OpenSearch, PostgreSQL full-text, or SQLite FTS5. Excels at exact-match queries, domain-specific terminology, and code search where semantic similarity fails.
Text-to-SQL
LLM translates natural language questions into SQL queries against structured databases. Ideal for analytics, reporting, and questions with precise filters (dates, ranges, aggregations). Leverages existing relational data without any embedding pipeline.
Long-Context Stuffing
With models supporting 128K-1M+ token windows (GPT-4o, Claude, Gemini), feed entire document collections directly into the prompt. Eliminates retrieval entirely for small-to-medium corpora. The LLM itself acts as the retriever and reasoner simultaneously.
Agentic Tool Use / API Calls
LLM agents call external APIs, search engines, or tools (web search, code interpreters, database connectors) to retrieve information on demand. Each query dynamically selects the right data source. No pre-built index required — retrieval is just-in-time.
Vector RAG vs Vectorless Approaches
| Dimension | Vector RAG | BM25 | Context Stuffing | Text-to-SQL |
|---|---|---|---|---|
| Setup complexity | Medium (embeddings + vector DB) | Low (search index) | None | Low (schema + prompt) |
| Semantic understanding | High | None (keyword match) | High (LLM-native) | Structured only |
| Exact match / filters | Poor | Excellent | Good | Excellent |
| Corpus size limit | Millions of docs | Millions of docs | ~500 pages (1M tokens) | Unlimited (DB) |
| Latency | 50-200ms | 5-50ms | Slow (large prompt) | 50-500ms |
| Cost per query | $0.001-0.005 | $0.0001 | $0.01-0.10 (token cost) | $0.001-0.01 |
| Infra required | Vector DB + embedding API | Search engine | LLM API only | SQL database |
| Best for | Semantic similarity | Keyword, code, exact terms | Small corpora, prototyping | Structured data, analytics |
When to Go Vectorless
Vectorless Works Well When
- Small corpus (<500 pages): Context stuffing is simpler and often more accurate than chunking + retrieval
- Structured data: SQL databases with well-defined schemas — Text-to-SQL beats embedding-based retrieval
- Exact-match queries: Technical terms, product codes, error messages — BM25 outperforms semantic search
- Rapid prototyping: Skip the vector pipeline entirely — just stuff context and iterate
- Real-time data: API/tool calls fetch live data that can't be pre-indexed
- Budget-constrained: No embedding model costs, no vector DB hosting
Vectors Still Better When
- Large corpus (>10K docs): Context stuffing is infeasible; BM25 misses semantic matches
- Semantic similarity matters: "How do I fix a slow API?" matching "performance optimization for endpoints"
- Multilingual: Embedding models handle cross-language retrieval natively
- Fuzzy/conceptual queries: Questions that don't contain the exact keywords present in documents
- Cost at scale: Context stuffing becomes very expensive with large token windows
Implementation: BM25 with Rank-BM25
# === Vectorless RAG: BM25 Full-Text Retrieval ===
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt", quiet=True)  # one-time download of tokenizer models
# 1. Prepare corpus
documents = [doc.page_content for doc in loaded_docs]
tokenized_corpus = [word_tokenize(doc.lower()) for doc in documents]
# 2. Build BM25 index (no embeddings needed!)
bm25 = BM25Okapi(tokenized_corpus)
# 3. Retrieve
def bm25_retrieve(query: str, k: int = 5):
tokenized_query = word_tokenize(query.lower())
scores = bm25.get_scores(tokenized_query)
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
return [(documents[i], scores[i]) for i in top_k]
# 4. Generate answer
query = "How to configure rate limiting?"
results = bm25_retrieve(query)
context = "\n\n".join([doc for doc, score in results])
answer = llm.invoke(f"Answer based on context:\n{context}\n\nQuestion: {query}")
# === Vectorless RAG: Long-Context Stuffing ===
from pathlib import Path
# 1. Load all documents into a single context
all_docs = []
for f in Path("./docs").glob("*.md"):
all_docs.append(f"\n--- {f.name} ---\n{f.read_text()}")
full_context = "\n".join(all_docs)
print(f"Total chars: {len(full_context):,}") # Check fits in context window
# 2. Stuff everything into the prompt — no retrieval step!
query = "How do I configure rate limiting?"
response = llm.invoke(
f"""You are a helpful assistant. Use the following documents to answer.
Documents:
{full_context}
Question: {query}
Answer concisely, citing the document name."""
)
# Works great for <500 pages with 128K+ context models
# Trade-off: higher token cost but zero retrieval infrastructure
# === Vectorless RAG: Text-to-SQL ===
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
# 1. Connect to your database
db = SQLDatabase.from_uri("sqlite:///products.db")
print(db.get_usable_table_names()) # ['products', 'reviews', 'orders']
# 2. Create text-to-SQL chain
chain = create_sql_query_chain(llm, db)
# 3. Natural language → SQL → Answer
query = "What are the top 5 products by average rating with more than 100 reviews?"
sql_query = chain.invoke({"question": query})
print(f"Generated SQL: {sql_query}")
result = db.run(sql_query)
answer = llm.invoke(
f"Given SQL result: {result}\nAnswer: {query}"
)
# Precise, aggregated answers impossible with vector retrieval
Hybrid: Best of Both Worlds
The most effective production systems combine vectorless and vector approaches:
Reciprocal Rank Fusion (RRF) merges BM25 and vector results: score = Σ 1/(k + rank_i). This captures both exact keyword matches and semantic similarity. Many vector databases (Elasticsearch, Weaviate, Qdrant) support hybrid search natively. Adding BM25 to vector search typically improves recall by 10-20% with near-zero additional latency.
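As a sketch of native hybrid search, a single Elasticsearch 8 request can run both legs at once. The index and field names below are hypothetical, query_emb is assumed to be a precomputed query embedding, and recent versions also offer built-in RRF ranking:
# === Native Hybrid Search in Elasticsearch 8 ===
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="docs",
    query={"match": {"text": "configure rate limiting"}},  # BM25 leg
    knn={
        "field": "embedding",
        "query_vector": query_emb,  # dense leg
        "k": 10,
        "num_candidates": 100
    },
    size=10
)
# By default the BM25 and kNN scores are combined per document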
Tools & Libraries
BM25 / Full-Text
- rank-bm25 — Pure Python; great for prototyping
- Elasticsearch — Production-grade; built-in BM25
- PostgreSQL FTS — Built into Postgres; zero new infra
- SQLite FTS5 — Embedded; perfect for small apps
Text-to-SQL
- LangChain SQL — create_sql_query_chain
- LlamaIndex NLSQL — NLSQLTableQueryEngine
- Vanna.ai — OSS text-to-SQL with training
- DuckDB — In-process analytics + LLM pairing
Hybrid Search
- Weaviate — Native hybrid (BM25 + vector)
- Qdrant — Sparse + dense vector fusion
- Elasticsearch 8+ — kNN + BM25 in one query
- Vespa — Advanced ranking with hybrid retrieval
Distillation Overview — The Teacher-Student Paradigm
Knowledge distillation transfers the dark knowledge from large teacher models (GPT-4o, Claude Opus) into smaller, faster student models. The student learns not just to match labels, but to mimic the teacher's probability distributions, enabling 10-100x cost reduction with 85-95% quality retention. Essential for production RAG systems serving millions of requests.
Core Distillation Concepts
What is Knowledge Distillation?
A training technique where a large teacher model teaches a smaller student model to approximate its behavior. The student learns from soft probability distributions (soft labels) rather than just hard ground-truth labels, capturing the teacher's confidence and uncertainty patterns—the "dark knowledge."
Why Distill for Production RAG?
Cost: 20-100x cheaper inference. Latency: 10-50x faster. Privacy: Run locally without API calls. Edge Deployment: Fits on mobile/edge devices. Reliability: No rate limits or service dependencies.
Key Terminology
- Teacher: Large, high-quality model that teaches
- Student: Smaller model that learns
- Soft Labels: Teacher's probability distributions (softmax with temperature)
- Hard Labels: Ground truth class labels
- Temperature (T): Controls softness of probability distribution (higher T = softer, more gradual gradients)
- Dark Knowledge: Teacher's learned correlations between outputs beyond ground truth
Quality Retention Mechanics
Typical results: Embedding models retain 90-96% quality at 15-50x compression. Rerankers retain 94-97% at 3-10x compression. Generation models retain 85-92% at 10-30x compression. Quality loss is primarily in nuanced reasoning and rare edge cases; core competencies remain strong.
Distillation Techniques — Seven Methods Explained
Different distillation techniques target different components of the teacher's knowledge. Response-based distillation matches final outputs; feature-based captures intermediate representations; relation-based preserves data point relationships; synthetic data generation scales to new domains. Choosing the right technique depends on your architecture, available teacher access, and quality targets.
1. Logit/Response-Based Distillation
Student learns the teacher's final output probability distributions (logits) using soft label matching with temperature scaling. The KL divergence loss makes gradients smoother, allowing the student to learn from the teacher's confidence patterns.
Formula: L = α·KL(softmax(z_T/τ) ‖ softmax(z_S/τ)) + (1−α)·CE(y, softmax(z_S)), where z_T and z_S are teacher and student logits and τ is the temperature
Best for: BERT, RoBERTa, embeddings. Speed: ~20% training overhead.
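A minimal PyTorch sketch of this loss, where alpha and tau are the tunable soft/hard weighting and temperature:
# === Logit Distillation Loss (Hinton-style) ===
import torch.nn.functional as F
def distillation_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean"
    ) * (tau ** 2)  # rescale gradients to balance against the hard-label term
    # Hard targets: standard cross-entropy against ground truth
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard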
2. Feature/Intermediate Distillation
Student mimics teacher's intermediate hidden states and attention maps, not just final outputs. Matches layer activations via mean-squared error loss. Essential for encoder models where intermediate representations matter.
Used by TinyBERT, MobileBERT. Loss: L = Σ ||H_student - H_teacher||²
Best for: BERT-family, rerankers. Quality: 90%+ retention even at 10x compression.
3. Relation-Based Distillation
Preserves relationships between data points rather than individual predictions. Contrastive distillation for embeddings: student embeddings maintain the same relative distances and similarities as teacher embeddings. Critical for semantic search.
Loss: L = Σ_{i,j} (sim(e_i^S, e_j^S) − sim(e_i^T, e_j^T))² — match the teacher's pairwise similarity structure
Best for: E5, BGE embeddings. Benefit: Preserves ranking structure.
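A sketch of the pairwise-similarity matching idea in PyTorch, at the batch level with cosine similarity:
# === Relation-Based Distillation Loss ===
import torch.nn.functional as F
def relation_loss(student_emb, teacher_emb):
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    sim_s = s @ s.T  # student pairwise similarity matrix (batch x batch)
    sim_t = t @ t.T  # teacher pairwise similarity matrix
    return F.mse_loss(sim_s, sim_t)  # preserve relative distances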
4. Synthetic Data Distillation
Teacher generates training data (Q&A pairs, reasoning chains, labeled examples) that student fine-tunes on. Does not require access to teacher weights—API-based. Most practical for LLM distillation. Examples: Alpaca (from Davinci), Orca, Vicuna.
Process: Generate 5K-10K examples → filter for quality → fine-tune student on synthetic data
Best for: Generation models, RAG readers. Cost: ~$50-500 API calls per million tokens.
5. Progressive/Multi-Stage Distillation
Distill through intermediate-size models in stages: GPT-4-class teacher → Llama 13B → Phi 3.8B → TinyBERT 14M. Each stage acts as both student and teacher. Enables extreme compression (1000x) with graceful quality degradation.
Why: Knowledge at each stage is closer to student's architecture, easier to learn.
Best for: Mobile/edge, extreme latency constraints. Trade-off: More training stages but better final quality.
6. Self-Distillation
Model distills from itself: larger layers teach smaller layers (Born-Again Networks), or early-exit heads teach final heads. Used for progressive inference and efficient early stopping. Requires no external teacher.
Variant: Ensemble of differently-sized versions of the same architecture.
Best for: Improving single models, progressive inference. Benefit: 2-5% quality boost at same size.
7. Domain Adaptation Distillation
Teacher fine-tuned on domain (biomedical, legal, code) teaches student. Combines in-domain expert knowledge with compact student architecture. Teacher learns domain patterns; student compresses domain knowledge into fewer parameters.
Process: Domain-FT teacher → generate domain synthetic data → student learns domain + generalization
Best for: Specialized domains (biotech, legal, code). Result: Small domain-expert models.
Distillation Techniques Comparison
| Technique | Teacher Weights? | Architecture | Quality Retention | Training Time | Best For |
|---|---|---|---|---|---|
| Logit Distillation | Yes (inference) | Same/different | 90-97% | +20% | Classifiers, embeddings |
| Feature Distillation | Yes (full) | Encoder-only | 92-98% | +40% | BERT models, rerankers |
| Relation Distillation | Yes (inference) | Same/different | 94-97% | +30% | Embeddings, ranking |
| Synthetic Data | No (API only) | Any decoder | 85-92% | 1-10 days | LLM generation, RAG |
| Progressive | Yes (multi-stage) | Any | 88-95% | 2-4 weeks | Extreme compression |
| Self-Distillation | No (internal) | Same (variants) | 102-105% | +10% | Model improvement |
| Domain Adaptation | Yes (domain FT) | Domain-expert | 87-94% | 2-5 days | Specialized domains |
Distillable Models for RAG — The Complete Catalog
The RAG pipeline has four critical components, each with a specialized set of distilled models. Embedding models for retrieval, rerankers for ranking, generation models for answering, and routers for intent classification. This section catalogs production-ready models for each stage with their distillation lineage, performance characteristics, and deployment costs.
RAG Component Models
Embedding Models (Bi-Encoder)
Dense vector representations for semantic retrieval. Distilled from larger encoder models to 33-335M parameters. Deployed at scale for every document query.
- E5-small/base/large — 33M/110M/335M params; MTEB top-tier; E5-Mistral-7B is the large teacher variant
- BGE-small/base/large — 33M/110M/335M params; BAAI; multilingual; contrastive learning
- GTE-Qwen2-1.5B-instruct — 1.5B params; strong instruction-following; instruction-tuned embeddings
- Nomic Embed v1.5 — 137M params; 8192 context; Matryoshka dimensions (truncatable to 384)
- all-MiniLM-L6-v2 — 22M params; fastest; SBERT distillation
- GTE-base (Alibaba) — 110M params; multilingual; strong on code/technical
Deployment: $0.05-0.20/M queries at scale
Reranker Models (Cross-Encoder)
Score query-document pairs for relevance. Compact cross-encoders (568M-1B params). Applied to top-K from retriever for precision ranking.
- BGE-reranker-v2-m3 — 568M params; multilingual; distilled from large cross-encoder
- ms-marco-MiniLM-L-12 — 33M params; ultra-compact; MS MARCO trained
- Jina Reranker v2 — 137M params; code + text; Jina-1.5-large distillation
- ColBERTv2 — Late interaction; token-level matching; very low latency with ANN indexes
- Cohere Rerank v3 — API-based; production-grade; handles 20 languages
- mxbai-rerank-xsmall-v1 — 66M params; ultra-light; Mistral base
Deployment: Applied to top-50 docs; $0.10-0.30/M queries
Generation (Reader) Models
Small LLMs for grounded answer generation. 2-8B parameters, trained on domain/RAG-specific data. Distilled from frontier models (GPT-4o, Claude, Llama 405B).
- Phi-3-mini (3.8B) — Microsoft; curated textbook data; strong reasoning; 4K context
- Llama 3.1 8B — Meta; instruction-tuned; 128K context; Apache 2.0 license
- Mistral 7B / Mistral NeMo 12B — Sliding window attention; 32K/128K context; fast inference
- Gemma 2 2B/9B — Google; distilled from Gemini; excellent on factual QA
- Qwen2.5 7B — Alibaba; 128K context; multilingual; strong on code
- DeepSeek-R1-Distill 7B — Reasoning capability; chain-of-thought; 16K context
Deployment: $0.20-0.50/M tokens at scale
Router / Classifier Models
Tiny models for query routing, intent classification, content moderation. 14-66M parameters. Applied early in pipeline to route or filter.
- DistilBERT-base — 66M params; 60% faster than BERT; 97% performance retention
- TinyBERT-6L-768H — 14.5M params; 7.5x faster; distilled 4-layer
- MobileBERT — 25M params; mobile-optimized; real-time classification
- DeBERTa-v3-small — 44M params; NLI + classification; superior to DistilBERT
- ALBERT-base-v2 — 12M params; parameter sharing; cross-layer distillation
- Sentence-BERT-tiny — 14M params; semantic classification; STS benchmark trained
Deployment: <1ms per request; $0.01/M queries
Speculative Decoding Draft Models
Tiny models that propose tokens quickly; larger model verifies. Enables 2-3x generation speedup. Draft model distilled from main generator.
- Phi-3-mini as draft for Llama 70B — 3.8B proposes; 70B verifies; 2.5x speedup
- Gemma 2 2B as draft for 9B — Same family; better latency savings
- Draft-only models (research) — Models trained specifically to be draft models
Use case: High-throughput RAG backends; lower inference cost 30-40%
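A sketch of speculative decoding in vLLM; the draft/target pair below is an example choice from the same tokenizer family, and the parameter names follow earlier vLLM releases (check your version's docs):
# === Speculative Decoding with a Draft Model (vLLM) ===
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # target (verifier) model
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # small draft model
    num_speculative_tokens=5  # draft proposes 5 tokens per verification step
)
out = llm.generate(["Summarize RAG in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)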
Mixture-of-Experts (MoE) Distillation
Distill sparse MoE models (Mixtral, GLaM) into dense models. Teacher has 46B params but uses only 12B per token; student is fully dense 7-8B.
- Mixtral 8x7B → Mistral 7B — Route expert knowledge into dense model
- Mixtral 8x22B → Llama 13B — Compress expert routing to dense layers
- Approach: Teacher routes on examples → student learns all routes as single dense model
Benefit: No expert overhead; simpler deployment; better VRAM efficiency
Model Selection Flowchart
Distilled Models for RAG — Full Comparison
| Model | Component | Params | Context | Quality | Cost/1M | Latency |
|---|---|---|---|---|---|---|
| Embedding Models | ||||||
| all-MiniLM-L6-v2 | Retrieval | 22M | 512 | ~85% | $0.02 | 2ms |
| E5-small | Retrieval | 33M | 512 | ~90% | $0.04 | 5ms |
| E5-base | Retrieval | 110M | 512 | ~95% | $0.08 | 12ms |
| BGE-base | Retrieval | 110M | 512 | ~93% | $0.07 | 11ms |
| Nomic Embed 1.5 | Retrieval | 137M | 8192 | ~94% | $0.10 | 18ms |
| Reranker Models | ||||||
| ms-marco-MiniLM-L-12 | Reranking | 33M | 512 | ~91% | $0.02 | 3ms/pair |
| BGE-reranker-v2-m3 | Reranking | 568M | 512 | ~96% | $0.08 | 8ms/pair |
| Jina Reranker v2 | Reranking | 137M | 8192 | ~94% | $0.05 | 6ms/pair |
| Generation Models | ||||||
| Phi-3-mini | Generation | 3.8B | 4096 | ~88% | $0.20 | 50ms/token |
| Gemma 2 2B | Generation | 2B | 8192 | ~85% | $0.15 | 35ms/token |
| Llama 3.1 8B | Generation | 8B | 128K | ~92% | $0.35 | 80ms/token |
| Mistral 7B | Generation | 7B | 32K | ~90% | $0.30 | 60ms/token |
| DeepSeek-R1-Distill 8B | Generation | 8B | 16K | ~88% (reasoning) | $0.40 | 120ms/token |
| Router/Classifier Models | ||||||
| DistilBERT | Classification | 66M | 512 | ~97% | $0.01 | 1ms |
| TinyBERT | Classification | 14.5M | 512 | ~92% | $0.005 | 0.5ms |
Quantization & Compression — Post-Distillation Optimization
Distillation reduces model size 10-50x. Quantization (4-bit, 2-bit), pruning, and low-rank factorization reduce it another 2-8x. Combined effects are multiplicative: a 405B model distilled to 8B (50x) then quantized to 2-bit (8x smaller than 16-bit) has roughly the memory footprint of a 1B full-precision model, a ~400x reduction with 85-90% quality retention. This section covers every compression technique for production RAG.
Quantization Methods
GPTQ (4-bit)
Post-training quantization: 32-bit weights → 4-bit integers. Quantizes one layer at a time, using Hessian information to minimize loss. No retraining needed. Fast inference with vLLM.
- 8x model size reduction (32GB → 4GB)
- Quality retention: 97-99%
- Latency: 20-30% faster than FP32
- Training time: 30 min - 2 hours per model
Best for: Production inference on consumer GPUs
AWQ (Activation-Aware)
Like GPTQ but considers activation patterns. Moves quantization errors to less important weights based on actual data distributions. Better quality at extreme compression.
- 8x model size reduction (32GB → 4GB)
- Quality retention: 98-99%
- Latency: 15-25% faster than FP32
- Training time: 1-4 hours per model
Best for: Max quality at 4-bit; preferred for generation models
GGUF (llama.cpp)
Quantization format for CPU inference. Multiple quantization levels (Q2, Q3, Q4, Q5, Q8). Minimal dependencies; runs on CPU without GPU. Popular for local/edge deployment.
- 2-8x reduction depending on level
- Quality: Q4 = 95-98%, Q2 = 85-90%
- Latency: 50-300ms/token on CPU
- No GPU required; runs anywhere
Best for: Local inference, privacy-critical apps, edge devices
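A minimal local-inference sketch with llama-cpp-python; the GGUF path is a placeholder for any Q4 file:
# === CPU Inference on a GGUF Model (llama-cpp-python) ===
from llama_cpp import Llama
llm = Llama(model_path="./llama-8b.Q4_K_M.gguf", n_ctx=4096, n_threads=8)
out = llm("Q: What is RAG?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])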
BitsAndBytes / QLoRA
Load 4-bit model, add small LoRA adapters. Training-friendly. Model stored in 4-bit; adapters in float32 for gradient computation. Great for fine-tuning distilled models.
- 8x reduction + memory-efficient training
- Quality: 98%+ (no inference-time loss)
- Fine-tune 70B on single 40GB GPU
- Adapters portable; base model quantized
Best for: Fine-tuning distilled models at scale
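A sketch of the standard transformers 4-bit loading path for QLoRA-style fine-tuning (model name is an example):
# === 4-bit Loading with BitsAndBytes (QLoRA setup) ===
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# Attach small LoRA adapters (peft) and fine-tune; base weights stay in 4-bit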
Structured Pruning
Remove entire attention heads or feed-forward neurons. Maintains model architecture; reduces FLOPs. Combines well with quantization for 2-4x additional speedup.
- 2-4x latency reduction (removes FLOPs)
- Quality retention: 92-96%
- Works with standard inference frameworks
- Usually done during fine-tuning or distillation
Best for: Latency-critical systems; combines with quantization
SparseGPT & Magnitude Pruning
Remove 20-50% of weights (unstructured). Requires sparse inference libraries for speedup. SparseGPT uses Hessian-aware pruning for minimal quality loss at high sparsity.
- Up to 2-3x reduction (not all hardware supports)
- Quality at 50% sparsity: 92-96%
- Requires sparse-aware inference (e.g., Neural Magic DeepSparse)
- Combined effect with quantization: 4-6x
Best for: Custom hardware; extreme compression research
Compression Methods Comparison
| Method | Size Reduction | Speed Boost | Quality Loss | GPU Required? | Training Time | Best Use |
|---|---|---|---|---|---|---|
| GPTQ 4-bit | 8x | 1.2-1.3x | 1-3% | Yes (calibration) | 30min - 2hr | Production inference |
| AWQ 4-bit | 8x | 1.15-1.25x | 1-2% | Yes (calibration) | 1-4hr | Quality-critical generation |
| GGUF Q4 | 8x | 0.2-0.5x (CPU) | 2-5% | No (inference) | 5-30min | Local/edge deployment |
| BitsAndBytes 4-bit | 8x | 1.1x | 0% (lossless) | Yes (inference + training) | 0min (inference) | Fine-tuning + inference |
| Structured Pruning | 2-4x | 2-4x | 4-8% | Yes (training) | 1-3 days | Latency-critical |
| Magnitude Pruning | 2-5x | 1-2x (sparse HW) | 4-10% | Maybe (sparse HW) | 1 hour - 1 day | Custom hardware |
| Distil + Q4 + Prune | 50x × 8x × 3x = 1200x | 100x overall | 10-15% | Yes | 1-2 weeks | Ultimate compression |
Code Example: Quantize a Distilled Model with AutoGPTQ
# Quantize a distilled Llama 8B to 4-bit GPTQ with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
# Model to quantize (your distilled Llama)
model_name = "meta-llama/Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Quantization config: 4-bit, group size 128, symmetric
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # weight grouping
    desc_act=False,  # don't reorder columns by activation magnitude
    sym=True,        # symmetric quantization
)
# Load full-precision weights, then quantize against calibration samples
# (GPTQ needs a few representative inputs; takes 30min-2hr on one GPU)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
calibration = [tokenizer("Representative text from your domain corpus.", return_tensors="pt")]
model.quantize(calibration)
# Save quantized weights (4GB instead of 32GB FP32)
model.save_quantized("./llama-8b-gptq-4bit")
# Load and use in production with vLLM
# vLLM auto-detects GPTQ format and uses optimized kernels
from vllm import LLM
llm = LLM(model="./llama-8b-gptq-4bit", quantization="gptq")
# Result: 32GB → 4GB storage, 20-30% faster inference
# Cost reduction: $0.35/M tokens → $0.15/M tokens
Cumulative Compression: Pipeline Savings
| Stage | Example | Model Size | Cumulative Reduction | Cost/M Tokens |
|---|---|---|---|---|
| 1. Original | Frontier teacher (405B class) | 405B params, 1.6TB | 1x | $15.00 |
| 2. Distillation | Llama 8B | 8B params, 32GB | 50x | $0.35 |
| 3. + Quantization (4-bit) | Llama 8B-GPTQ | 8B params, 4GB | 50x × 8x = 400x | $0.15 |
| 4. + Pruning (30%) | Llama 5.6B-Q4-Pruned | 5.6B eff. params, 2.8GB | 400x × 3x = 1200x | $0.08 |
| 5. Net result | ~85-88% of teacher quality on RAG tasks | — | — | ~190x cheaper |
Domain-Specific Distillation — Specialized Models for Specialized Domains
Generic distilled models work well for most tasks, but specialized domains (biomedical, legal, code, finance) have unique terminology, conventions, and reasoning patterns. Domain-specific distillation fine-tunes the teacher on domain data first, then distills into a compact student. The result: small, specialized models that understand nuanced domain knowledge without the cost of frontier APIs.
Healthcare & Biomedical
BioMistral & PubMed Models
Mistral 7B fine-tuned on 20M biomedical papers, then distilled. PubMedBERT pre-trained on 18M PubMed abstracts. Domain vocabulary includes medical terminology, drug names, pathways.
- BioMistral 7B — Generation, QA over medical literature
- PubMedBERT — Embeddings, retrieval from PubMed corpus
- ClinicalBERT — Clinical notes, discharge summaries
- SciBERT — General scientific papers, methodology extraction
Regulatory: HIPAA-compliant fine-tuning; FDA 21 CFR Part 11 for records
Use Cases & Quality
- Patient record QA: "What medications is the patient allergic to?"
- Drug interaction retrieval: Find papers on specific drug combinations
- Clinical trial matching: Match patients to relevant trials
- Literature synthesis: Summarize findings across papers
Quality on domain tasks: 92-96% (vs 85% for generic models). Latency: 50-100ms per query.
Legal & Compliance
LegalBERT & SaulLM
LegalBERT trained on 12M legal documents (contracts, case law). SaulLM-7B fine-tuned for legal reasoning. Understands statutes, precedent citations, contract clauses.
- LegalBERT — Embeddings, contract clause retrieval
- SaulLM 7B — Legal reasoning, opinion generation
- Legal-BERT-small — Compact classification, ruling prediction
- Case Law BERT — Precedent similarity, case law search
Compliance: Audit trails required; document all reasoning steps
Use Cases & Quality
- Contract review: Identify risky clauses, flag deviations
- Due diligence: Retrieve relevant contracts by clause type
- Case law retrieval: Find precedent for legal arguments
- Compliance checking: Verify contracts against templates
Quality: 94-98% on legal classification. Cost: $0.30/doc for GPT-4, $0.02/doc distilled.
Finance & Trading
FinBERT & BloombergGPT Distillations
FinBERT trained on 10K SEC filings, earnings calls, financial news. Understands ticker symbols, financial ratios, sentiment about markets. Distilled down to 66M-110M parameters.
- FinBERT — Sentiment analysis, embeddings from SEC filings
- BloombergGPT-distilled — Financial reasoning, earnings summarization
- SEC Retriever BERT — Find relevant filings by section type
- FraudBERT — Anomaly detection in financial documents
Regulatory: SEC requires documentation of AI systems for financial advice
Use Cases & Quality
- Earnings analysis: Extract guidance, management commentary
- SEC filing search: Find risk factors, related party transactions
- Sentiment scoring: Score news and analyst reports
- Fraud detection: Flag unusual disclosures or language patterns
Quality: 96%+ on classification; 90%+ on sentiment. Real-time processing: <100ms.
Code & Engineering
CodeLlama & StarCoder Distillations
CodeLlama 7B/13B trained on 500B tokens of code from GitHub. StarCoder2 3B/7B distilled from larger model. Understand syntax, APIs, dependencies, documentation patterns across 80+ languages.
- CodeLlama 7B — Code generation, completion, infilling
- StarCoder2 3B/7B — Fill-in-middle, multi-language, low latency
- DeepSeek-Coder 6.7B — Code search, documentation generation
- Granite-code 3B — IBM's distilled code model
Licensing: Verify open-source compatibility (CodeLlama uses Llama license)
Use Cases & Quality
- Codebase RAG: "Find usage of this function across repos"
- Code completion: Autocomplete functions, fix syntax
- Documentation: Generate docs from docstrings, code comments
- Bug detection: Identify common patterns, security issues
Quality: 85-90% on HumanEval. Latency: 30-60ms. Cost: $0.20/1M tokens.
Scientific & Research
SciBERT & Domain-Specific Models
SciBERT trained on 1.2M scientific papers. MatSciBERT for materials science papers. ChemBERT for chemistry. Each understands domain-specific terminology, experimental methodologies, result reporting conventions.
- SciBERT — General scientific papers, citation context
- MatSciBERT — Materials science, synthesis conditions
- ChemBERT — Chemistry, molecular structures, reactions
- AstroGLUE — Astronomy papers, telescope data analysis
Citation tracking: Models can retrieve papers cited by retrieved papers
Use Cases & Quality
- Paper search: Find papers by methodology, findings
- Citation analysis: Extract key citations, author networks
- Result extraction: Parse numerical results, comparisons
- Meta-analysis: Summarize findings across papers
Quality: 93-97% on citation prediction. Enables research synthesis at scale.
Multilingual & Cross-Lingual
mBERT & XLM-RoBERTa Distillations
Multilingual BERT trained on 104 languages. Distilled XLM-RoBERTa variants compress the large model while keeping broad multilingual coverage. Both enable cross-lingual embeddings and retrieval: queries in one language can match documents in another.
- mBERT-base — 104 languages, unified embedding space
- XLM-RoBERTa-small — Lightweight, 44M params, 100+ languages
- LaBSE — Cross-lingual semantic search
- mDPR — Multilingual dense passage retrieval
Zero-shot: Train on English, deploy on any language in the model's coverage
Use Cases & Quality
- Cross-lingual search: Query in French, retrieve Chinese docs
- Multilingual customer support: Route queries to knowledge base
- International legal: Match contracts across jurisdictions
- Academic search: Unified search across multiple languages
Quality: 85-92% on multilingual MTEB; zero-shot performance good for high-resource languages.
Distillation Implementation Guide — From Teacher to Production
Distillation is a systematic process: select teacher, generate or curate training data, prepare dataset, configure student, train with distillation loss, evaluate, quantize, and deploy. This section walks through the full pipeline with code examples for each stage, covering practical production concerns like data quality, training stability, and evaluation metrics.
Step-by-Step Implementation
1. Select Teacher Model
- For generation: GPT-4o ($0.015/K tokens), Claude 3.5-Sonnet, Llama 405B
- For embeddings: E5-Mistral-7B, BGE-large, sentence-transformers
- Criteria: High accuracy on your domain, affordable API access, reproducible outputs
- Cost estimate: 5K-10K examples ≈ $50-500 in API calls
2. Generate Training Data
- Synthetic data: Teacher generates Q&A, reasoning chains from corpus
- Data quality: Set temperature 0.3-0.5, filter low-confidence outputs
- Diversity: Sample from different topics, difficulty levels
- Deduplication: Remove near-duplicates (use embedding similarity)
3. Prepare Dataset
- Format: JSON Lines, each line: {"instruction": "...", "output": "..."}
- Train/val split: 90/10 or 85/15 held-out validation
- Tokenization: Truncate to max_length (4096 for Llama, 512 for BERT)
- Class balance: For classification, stratify by label
4. Student Architecture
- Generation: Phi-3-mini (3.8B) or Llama 8B start point
- Embeddings: all-MiniLM-L6-v2 (22M) → E5-base (110M)
- Reranker: ms-marco-MiniLM-L-12 (33M) → BGE-m3 (568M)
- Classifier: TinyBERT (14.5M) → DistilBERT (66M)
Code Example 1: Generate Synthetic Data at Scale
# Generate ~10K Q&A pairs from your corpus using the teacher API
import json
import random

from openai import OpenAI

client = OpenAI()

# Load your domain corpus (a list of doc chunks); load_corpus() is your own loader
documents = load_corpus()
training_data = []

# Sample size assumes a corpus of at least 10K chunks
for doc in random.sample(documents, 10000):
    # Teacher generates diverse questions for each document
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Generate 3 diverse questions from this doc."},
            {"role": "user", "content": doc["content"]},
        ],
        temperature=0.3,  # low temperature for consistent outputs
    )
    # parse_questions() is your own helper splitting the reply into questions
    questions = parse_questions(response.choices[0].message.content)
    for q in questions:
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Doc: {doc['content']}\n\nQ: {q}",
            }],
            temperature=0.3,
        )
        training_data.append({
            "instruction": f"Answer from doc:\n{doc['content']}\n\nQ: {q}",
            "output": answer.choices[0].message.content,
        })

# Quality filtering (near-duplicate removal is shown in the sketch below)
def is_high_quality(example):
    return len(example["output"]) > 20 and "\n" not in example["output"][:50]

training_data = [e for e in training_data if is_high_quality(e)]

# Save to JSONL
with open("training_data.jsonl", "w") as f:
    for ex in training_data:
        f.write(json.dumps(ex) + "\n")
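Step 2 calls for near-duplicate removal via embedding similarity, which the listing above leaves as a comment. Here is a minimal sketch using sentence-transformers; the all-MiniLM-L6-v2 model and the 0.95 threshold are illustrative defaults, not requirements.
import numpy as np
from sentence_transformers import SentenceTransformer

def dedupe_by_similarity(examples, threshold=0.95):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [e["output"] for e in examples]
    # Normalized embeddings make the dot product equal to cosine similarity
    embs = model.encode(texts, normalize_embeddings=True)
    kept, kept_embs = [], []
    for ex, emb in zip(examples, embs):
        # Keep an example only if nothing already kept is too similar
        if not kept_embs or np.max(np.stack(kept_embs) @ emb) < threshold:
            kept.append(ex)
            kept_embs.append(emb)
    return kept

training_data = dedupe_by_similarity(training_data)
The greedy pass is O(n²) in the worst case; for corpora beyond ~100K examples, an ANN index over the kept embeddings is the usual optimization.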
Code Example 2: Fine-tune with Unsloth + LoRA
# Fine-tune the student on synthetic data with QLoRA
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the base student model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,  # QLoRA for memory efficiency
    dtype=None,
)

# Add LoRA adapters (rank 16, roughly 0.5% additional params)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)

# Load training data and merge instruction + output into one training text
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
dataset = dataset.map(
    lambda ex: {"text": f"{ex['instruction']}\n{ex['output']}"}
)

# Supervised fine-tuning trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        save_steps=500,
        output_dir="./distilled-model",
    ),
)
trainer.train()

# Save LoRA adapters (merge and quantize with GPTQ/AWQ afterwards)
model.save_pretrained("./distilled-final")
Code Example 3: Evaluate Distillation Quality
# Compare teacher vs student quality on a held-out test set
import numpy as np
from rouge_score import rouge_scorer

# Load test data (never seen during training); load_test_set() is your own loader
test_data = load_test_set()

# get_teacher_response / get_student_response wrap the respective model APIs
teacher_outputs = [get_teacher_response(ex["input"]) for ex in test_data]
student_outputs = [get_student_response(ex["input"]) for ex in test_data]

# Evaluate with ROUGE (generation) or F1 (classification)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"])
teacher_scores, student_scores = [], []
references = [ex["reference"] for ex in test_data]
for ref, t_out, s_out in zip(references, teacher_outputs, student_outputs):
    teacher_scores.append(scorer.score(ref, t_out)["rougeL"].fmeasure)
    student_scores.append(scorer.score(ref, s_out)["rougeL"].fmeasure)

quality_retention = (np.mean(student_scores) / np.mean(teacher_scores)) * 100
print(f"Teacher ROUGE-L: {np.mean(teacher_scores):.3f}")
print(f"Student ROUGE-L: {np.mean(student_scores):.3f}")
print(f"Quality Retention: {quality_retention:.1f}%")
Production Tips & Best Practices
Training Stability
- Batch size: 16-32 for 8B models, 4 for 13B+
- Learning rate: 2e-4 to 5e-4 (start conservative)
- Warmup: 5-10% of total steps to prevent instability
- Loss curves: Should decrease smoothly; spikes indicate issues
- Gradient clipping: max_grad_norm=1.0 to prevent explosion
Evaluation Metrics
- Generation: ROUGE-L, BLEU, F1 vs reference outputs
- Embeddings: MRR, NDCG, MAP on retrieval task
- Classification: Accuracy, precision, recall per class
- Human eval: Sample 100-200 outputs, rate quality 1-5
- Latency: Track inference time vs quality tradeoff
Deployment & Monitoring
- A/B testing: Compare 10% teacher, 90% student for 1-2 weeks
- Shadow mode: Log student predictions, compare offline
- Quantization: Post-train GPTQ or AWQ after distillation
- Cost monitoring: Track cost-per-query before/after distillation
- Quality drift: Monitor quality metrics in production weekly
Common Issues & Fixes
- Collapse to mode: Student predicts same output for all inputs → lower learning rate
- Quality gap >15%: More training data, better teacher, larger student
- Overfitting: Large gap between train/val loss → add dropout, regularization
- Slow convergence: Use cosine schedule with warmup, longer training
- Divergence: Loss becomes NaN → reduce batch size, use gradient clipping
Distillation Summary & Production Decision Framework
Distillation enables production RAG at 1/100th the cost of frontier APIs. The decision framework below guides model selection based on your constraints: budget, latency, privacy, quality floor. Combined with quantization and pruning, distillation achieves extreme compression—405B to <1B with 85-90% quality retention—enabling deployment on edge devices and consumer hardware.
Key Takeaways
1. Cost Reduction — Distillation alone: 10-50x. With quantization: 50-400x. Combined savings compound multiplicatively.
2. Quality Retention — Well-distilled models keep 85-95% of teacher quality. 10-15% loss is rare; usually <5% on closed-domain RAG.
3. Technique Selection — 80% of use cases: logit distillation (encoders) + synthetic data (LLMs). Progressive distillation only for extreme compression.
4. Domain Matters — Generic models work well (80%+ quality). Domain-specific teachers matter only for specialized fields (biomedical, legal, code).
5. Deployment Path — GPU → GPTQ/AWQ → Pruning → GGUF. Each step trades quality for speed/size. Stop when you hit your requirements.
6. Evaluation Essential — Never ship without A/B testing teacher vs. student on 100-1000 held-out examples. 2-week shadow period recommended.
Quick Reference: Scenario → Recommendation
| Scenario | Constraints | Recommended Approach | Expected Outcome | Timeline |
|---|---|---|---|---|
| API-to-Local | Zero API deps, privacy | Phi-3-mini (synthetic data) + GGUF Q4 | 85% quality, on-device | 2 weeks |
| Cost Reduction | Budget <$5/1M queries | E5-small + ms-marco + Phi-3 (GPTQ) | 90% quality, 50x cost cut | 1 week |
| Latency Critical | P95 <50ms end-to-end | all-MiniLM + DistilBERT + Phi-3 (speculative) | 88% quality, 5ms avg latency | 1 week |
| Domain-Specific | Biomedical/legal/code | Fine-tune teacher → Distill to 7-8B + domain data | 94%+ domain quality, 10x cost cut | 3 weeks |
| Scale Inference | 1B+ queries/day | E5-base + BGE-m3 (batch) + Llama 8B (vLLM) | 92% quality, $8/1M tokens | 2 weeks |
| Extreme Compression | Mobile/edge, <100MB | Distil → Q4 → Prune → GGUF Q2 | 80-85% quality, 250x smaller | 4 weeks |
Top 5 Mistakes & How to Avoid Them
❌ Mistake 1: Skipping Evaluation
Shipping student without A/B testing against teacher. Can lose 15-30% quality silently.
✓ Fix: Evaluate on 200+ held-out examples. Human eval for 50 outputs. 2-week shadow mode (log but don't use student).
❌ Mistake 2: Low-Quality Training Data
Generating synthetic data at high temperature (0.8+) or without filtering. Student learns inconsistent/noisy examples.
✓ Fix: Use temperature 0.3-0.5. Filter outputs under 50 chars. Dedupe with embedding similarity (threshold 0.95).
❌ Mistake 3: Overshrinking Student
Going straight from 405B to 1B. Quality drops 20-30%. Better to go 405B → 13B → 7B progressively.
✓ Fix: Start with 50% reduction (405B → 7B). If quality ok, shrink more. Progressive distillation for extreme sizes.
❌ Mistake 4: Wrong Temperature
Distillation temperature T too low (<2) → softened logits stay sharp and the student just memorizes hard labels. T too high (>8) → the softened distribution becomes nearly uniform, washing out the teacher's signal.
✓ Fix: Start with T=4. If training unstable, increase to 6-8. If converging too fast, lower to 2-3.
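Why T matters is easiest to see in code. A minimal sketch of the classic temperature-scaled distillation loss (Hinton et al. 2015) in PyTorch; the alpha=0.5 mixing weight is an illustrative choice:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: KL between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across T
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard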
❌ Mistake 5: Insufficient Training
Training for 1 epoch over 5K examples. Student hasn't converged; performance is suboptimal.
✓ Fix: Train for 3-5 epochs. Monitor train/val loss. Stop when val loss plateaus (typically day 2-3 for 8B on single GPU).
⚠️ Challenge: Quality Degradation
Student loses 15-20% quality despite everything looking right. Common in open-domain reasoning, edge cases.
✓ Mitigation: Increase training data (10K → 50K). Use larger student. Add domain-specific hard examples. Accept loss on reasoning tasks.
Resources & Tools
Key Papers
- Hinton et al. 2015 — Distilling the Knowledge in a Neural Network (original KD)
- Jiao et al. 2019 — TinyBERT (layer + feature distillation)
- Anil et al. 2023 — Large Language Model Distillation (Gemini)
Tools & Frameworks
- Unsloth — 2-5x faster distillation (QLoRA)
- vLLM — Batch inference for evaluation
- AutoGPTQ — Easy GPTQ quantization
- HuggingFace SFT Trainer — Supervised fine-tuning
Benchmark & Datasets
- MTEB — Embedding evaluation (56 tasks)
- HumanEval — Code generation quality
- MMLU — Knowledge/reasoning benchmark
- SuperGLUE — NLU/classification tasks
Fine-Tuning vs RAG — When to Use Which
Fine-tuning bakes knowledge into model weights; RAG retrieves it at runtime. The right choice depends on whether your knowledge is static or dynamic, whether you need behavioral changes or factual grounding, and your budget for maintenance.
Head-to-Head Comparison
| Dimension | Fine-Tuning | RAG | Fine-Tune + RAG |
|---|---|---|---|
| Knowledge freshness | Frozen at training time | Always up-to-date | Up-to-date |
| Hallucination control | Hard to control | Grounded in sources | Best of both |
| Source citation | Not possible | Built-in | Built-in |
| Output style control | Excellent | Limited (prompt-based) | Excellent |
| Setup cost | $100-10K (GPU training) | $50-500 (indexing pipeline) | $200-10K |
| Per-query cost | Low (small model) | Medium (retrieval + LLM) | Medium |
| Maintenance | Retrain on new data | Re-index documents | Both |
| Data volume needed | 1K-100K examples | Any number of documents | 1K+ examples + documents |
| Latency | Fastest (single forward pass) | +50-200ms (retrieval) | +50-200ms |
| Best for | Tone, style, format, domain jargon | Facts, docs, real-time data, QA | Enterprise production systems |
Decision Guide
Choose Fine-Tuning When
- Custom output format: JSON schemas, specific templates, branded voice
- Domain adaptation: Medical terminology, legal language, code style
- Behavioral changes: Response length, reasoning approach, safety rules
- Latency-critical: No retrieval overhead; single forward pass
- Stable knowledge: Information that won't change often
- Cost at scale: Fine-tuned small model cheaper than large model + RAG
Choose RAG When
- Dynamic knowledge: Documents updated daily/weekly
- Source attribution: Users need to verify where answers come from
- Large corpus: Thousands of documents that can't fit in training data
- Compliance: Audit trails, explainability, data governance
- Multi-tenant: Different knowledge bases per user/org
- Rapid prototyping: No training loop; index and query immediately
RAG Prompt Engineering — Optimizing Generation
The prompt template connecting retrieved context to the LLM is the most underappreciated component of RAG. Small prompt changes can swing answer quality by 20-40%. Master these patterns to eliminate hallucination, improve faithfulness, and control output format.
Essential Prompt Patterns
Grounding Instructions
Force the model to answer only from provided context, reducing hallucination.
"""Answer the question based ONLY on the
provided context. If the context does not
contain enough information to answer,
say "I don't have enough information to
answer this question."
Do NOT use prior knowledge.
Context:
{retrieved_chunks}
Question: {query}
Answer:"""
Citation / Attribution
Require inline citations that map back to source documents.
"""Answer using ONLY the numbered sources
below. Cite each claim with [Source N].
Sources:
[1] {chunk_1} (from: {doc_name_1})
[2] {chunk_2} (from: {doc_name_2})
[3] {chunk_3} (from: {doc_name_3})
Question: {query}
Answer (with citations):"""
Chain-of-Thought RAG
Ask the model to reason through the context step by step before answering.
"""Given the context, answer step by step:
1. Identify relevant information
2. Check for contradictions
3. Synthesize a coherent answer
4. Cite your sources
Context: {chunks}
Question: {query}
Step-by-step reasoning:"""
Refusal / Uncertainty
Teach the model to express confidence levels and refuse gracefully when unsure.
"""Rate your confidence (HIGH/MEDIUM/LOW)
based on how well the context supports
your answer.
- HIGH: Direct answer in context
- MEDIUM: Inferred from context
- LOW: Partially supported
If LOW, say: "Based on limited context,
..." and suggest what additional info
would help.
Context: {chunks}
Question: {query}"""
Common Anti-Patterns to Avoid
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No grounding instruction | Model mixes retrieved facts with parametric knowledge, causing subtle hallucinations | Always include "answer ONLY from context" |
| Context before system prompt | Long context pushes instructions out of attention window ("lost in the middle") | Place instructions first, then context, then question |
| Too many chunks | Dilutes relevant info; model struggles to find the answer in noise | Rerank and limit to top 3-5 most relevant chunks |
| No refusal path | Model invents answers when context doesn't contain the answer | Explicitly instruct "say I don't know if unsupported" |
| Missing metadata | Model can't distinguish document sources or dates | Include doc title, date, source URL with each chunk |
| Vague output format | Inconsistent response structure across queries | Specify exact output format (JSON, bullets, paragraphs) |
Advanced Prompt Techniques
Multi-Document Synthesis
When chunks come from multiple documents, instruct the model to identify agreements, contradictions, and gaps between sources before synthesizing.
Structured Output
Use JSON mode or XML tags to get consistent, parseable output. Define the schema in the prompt: {"answer": "...", "sources": [...], "confidence": "HIGH"}.
Few-Shot RAG Examples
Include 2-3 example context→answer pairs in the prompt to demonstrate the expected citation style, reasoning depth, and refusal behavior.
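The grounding, citation, structured-output, and few-shot patterns above compose naturally into one template. A minimal sketch of prompt assembly; the schema, field names, and chunk keys (text, doc_name) are illustrative:
SCHEMA = '{"answer": "...", "sources": ["Source N"], "confidence": "HIGH|MEDIUM|LOW"}'

def build_rag_prompt(query, chunks, few_shot_examples):
    # Number the sources so citations can map back to documents
    sources = "\n".join(
        f"[{i + 1}] {c['text']} (from: {c['doc_name']})"
        for i, c in enumerate(chunks)
    )
    examples = "\n\n".join(few_shot_examples)  # pre-written context→answer demos
    return f"""Answer ONLY from the numbered sources below.
Cite each claim with [Source N]. If the sources are insufficient,
set "answer" to "I don't have enough information".
Respond with JSON matching: {SCHEMA}

{examples}

Sources:
{sources}

Question: {query}
Answer (JSON):"""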
Caching Strategies for Production RAG
Multi-layer caching is the single highest-ROI optimization for production RAG — reducing latency by 60-90%, cutting LLM costs by 40-70%, and improving user experience with near-instant responses for repeated or similar queries.
Cache Layer Deep Dive
Exact Match Cache
Hash the normalized query + metadata filters as cache key. Store the full response (answer + citations + confidence score). Best for FAQ-style queries and repeated searches.
import hashlib, json

def cache_key(query, filters):
    normalized = query.lower().strip()
    key_str = f"{normalized}|{sorted(filters.items())}"
    return hashlib.sha256(key_str.encode()).hexdigest()

# Redis with TTL + invalidation hooks
cached = redis.get(cache_key(q, f))
if cached:
    answer = json.loads(cached)  # <5ms hit path
Semantic Cache
Embed the query, search a cache-specific vector index for similar past queries. If cosine similarity exceeds threshold (0.95+), return the cached response. Handles paraphrases and near-duplicates.
class SemanticCache:
    def lookup(self, query_embedding):
        results = self.cache_index.search(
            query_embedding, top_k=1
        )
        if results and results[0].score > 0.95:
            return self.response_store[results[0].id]
        return None  # cache miss
Embedding Cache
Cache computed embeddings keyed by content hash. Avoids re-embedding unchanged documents during re-indexing. Critical for cost control at scale (embedding APIs charge per token).
def get_embedding(text):
    content_hash = hash_content(text)
    cached = redis.get(f"emb:{content_hash}")
    if cached:
        return np.frombuffer(cached, dtype=np.float32)
    vec = embedding_model.encode(text)
    redis.setex(
        f"emb:{content_hash}",
        86400 * 7,  # 7-day TTL
        vec.tobytes(),
    )
    return vec
Cache Invalidation Strategies
Time-Based (TTL)
Set TTL based on data volatility. Static docs: 24h+. News/feeds: 1-4h. Real-time data: 5-15min. Always pair with event-based invalidation.
Event-Driven Invalidation
On document update/delete, invalidate all cache entries referencing that doc_id. Use CDC (Change Data Capture) or webhook triggers from source systems.
Versioned Keys
Include index version or embedding model version in cache keys. Model upgrade = automatic full invalidation without manual flush.
Confidence-Gated Caching
Only cache responses with confidence score above threshold (e.g., >0.85). Low-confidence answers should always be regenerated fresh.
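Versioned keys and confidence gating combine into a small write-path helper. A minimal sketch; the version strings, TTL, and 0.85 threshold are illustrative, and a numeric confidence field is assumed on the response:
import hashlib
import json

INDEX_VERSION = "idx-v12"         # bump on re-index
EMBED_VERSION = "bge-large-v1.5"  # bump on embedding model upgrade

def versioned_cache_key(query, filters):
    # Model/index versions in the key make upgrades self-invalidating
    payload = json.dumps({
        "q": query.lower().strip(),
        "f": sorted(filters.items()),
        "index": INDEX_VERSION,
        "embed": EMBED_VERSION,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def maybe_cache(redis_client, key, response, ttl=3600):
    # Confidence gate: only cache answers the evaluator trusts
    if response["confidence"] >= 0.85:
        redis_client.setex(key, ttl, json.dumps(response))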
Metadata Filtering & Hybrid Search
Pure vector similarity search is rarely sufficient in production. Metadata filtering adds structured constraints (date ranges, access levels, document types, departments) to narrow the search space before or after vector retrieval — improving precision, enforcing security, and reducing noise.
Pre-Filter vs Post-Filter Architecture
Pre-Filtering (Recommended)
Apply metadata constraints before vector search. The vector DB only searches within the filtered subset. Faster at query time, but requires indexed metadata fields. Supported natively in Qdrant, Pinecone, Weaviate, Milvus.
# Qdrant pre-filter example
from qdrant_client.models import (
    DatetimeRange, FieldCondition, Filter, MatchAny, MatchValue,
)

results = client.search(
    collection_name="docs",
    query_vector=query_vec,
    query_filter=Filter(must=[
        FieldCondition(
            key="department",
            match=MatchValue(value="engineering"),
        ),
        FieldCondition(
            # DatetimeRange needs a datetime-indexed payload field
            key="created_at",
            range=DatetimeRange(gte="2025-01-01T00:00:00Z"),
        ),
        FieldCondition(
            key="acl_groups",
            match=MatchAny(any=user.groups),
        ),
    ]),
    limit=10,
)
Post-Filtering
Retrieve top-k results first, then filter by metadata. Simpler to implement but risks returning fewer results than requested (top-10 search → filter → 3 results). Use when filter selectivity is low or metadata isn't indexed.
# Post-filter: over-fetch then trim
raw = vector_db.search(
    query_vec, top_k=50  # 5x over-fetch
)
filtered = [
    r for r in raw
    if r.meta["dept"] == user.dept
    and r.meta["access"] <= user.level
][:10]  # trim to final top-k
Production Metadata Schema
| Field | Type | Purpose | Index? | Example |
|---|---|---|---|---|
| doc_id | string | Unique document identifier | Yes | doc_a3f8c1 |
| source_type | keyword | Filter by document origin | Yes | confluence, gdrive, s3 |
| department | keyword | Org-level filtering | Yes | engineering, legal, hr |
| acl_groups | keyword[] | Access control enforcement | Yes | ["eng-team", "all"] |
| created_at | datetime | Freshness filtering | Yes | 2025-11-15T10:30:00Z |
| updated_at | datetime | Staleness detection | Yes | 2026-02-01T08:00:00Z |
| language | keyword | Multilingual support | Yes | en, fr, de |
| doc_type | keyword | Content type filtering | Yes | policy, runbook, faq |
| chunk_index | integer | Ordering within parent doc | No | 3 |
| parent_doc_id | string | Link chunks to parent | Yes | doc_a3f8c1 |
| confidence | float | Ingestion quality score | No | 0.92 |
| version | integer | Document version tracking | Yes | 3 |
Conversational & Multi-Turn RAG
Single-turn RAG treats each query independently. Production chat applications require multi-turn awareness — resolving pronouns, maintaining topic context, handling follow-up questions, and deciding when to re-retrieve vs reuse prior context.
Multi-Turn Resolution Strategies
1. Query Rewriting with History
Use the LLM to rewrite the latest query into a standalone query by resolving coreferences from chat history. This is the most reliable approach for production.
def rewrite_query(history, current_query):
    prompt = f"""Given this conversation:
{format_history(history)}

Rewrite this follow-up into a standalone
search query: "{current_query}"

Standalone query:"""
    return llm.generate(prompt)
2. Context Carryover Window
Append the last N retrieved chunks to the new generation context. Simple and effective for follow-ups that reference previous answers. Risk: context window bloat after many turns.
# Sliding window: keep last 3 turns
context_window = []
for turn in conversation[-3:]:
    context_window.extend(turn.retrieved_chunks)
# Deduplicate by chunk_id
context_window = dedupe(context_window)
# Add new retrieval results
context_window += new_retrieved_chunks
3. Retrieval Decision Gate
Not every follow-up needs new retrieval. Use an LLM classifier or heuristics to decide: re-retrieve, reuse context, or answer from conversation history alone. Saves 30-50% of retrieval calls.
def needs_retrieval(history, query):
    # Classify intent
    intent = classify(query, labels=[
        "new_topic",      # retrieve
        "follow_up",      # maybe
        "clarification",  # no
        "chitchat",       # no
    ])
    return intent == "new_topic"
4. Memory-Augmented RAG
Maintain a structured memory store alongside the vector DB. Track user preferences, established facts from the conversation, and topic threads. Enables personalized retrieval over sessions.
class ConversationMemory:
    entities: dict     # extracted entities
    preferences: dict  # user prefs
    topic_stack: list  # active topics
    facts: list        # established facts

# Enrich retrieval with memory context
filters = build_filters(memory.entities)
boost = build_boost(memory.preferences)
User Feedback & Continuous Improvement
A production RAG system without a feedback loop is flying blind. User signals (thumbs up/down, click-through, reformulations, explicit corrections) are the ground truth for measuring real-world quality and driving iterative improvement.
Feedback Signal Taxonomy
Explicit Signals
Thumbs up/down, star ratings, "this was helpful" clicks, written corrections, citation relevance ratings. Highest quality signal but lowest volume (2-5% of queries).
Implicit Signals
Query reformulations (user wasn't satisfied), click-through on citations (answer was useful), copy-paste actions, session duration, follow-up patterns. High volume, noisier signal.
System Signals
Low confidence scores, retrieval misses (no results above threshold), hallucination detection triggers, timeout/fallback activations. Automated quality indicators.
Closing the Loop: Improvement Actions
Build Eval Datasets from Feedback
Convert thumbs-down responses into test cases. The query + bad answer + user correction becomes a regression test. Target: 500+ labeled examples for statistical significance.
Identify Failure Patterns
Cluster negative feedback by root cause: retrieval misses (wrong docs), grounding failures (hallucination), formatting issues, stale data, permission errors. Fix the highest-impact category first.
Targeted Improvements
Retrieval misses → adjust chunking, add synonyms, tune hybrid weights. Hallucinations → strengthen grounding prompts, lower confidence thresholds. Stale data → fix ingestion pipeline, reduce TTLs.
A/B Test & Measure
Deploy improvements behind feature flags. Run A/B tests comparing new vs old pipeline. Measure: answer acceptance rate, reformulation rate, confidence scores, latency. Promote only if metrics improve across the board.
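A minimal sketch of turning a thumbs-down event into a regression-test record for the eval dataset described above; the field names are illustrative, not a fixed schema:
from datetime import datetime, timezone

def feedback_to_eval_case(query, answer, chunk_ids, user_correction=None):
    # A thumbs-down becomes a labeled regression case for the eval suite
    return {
        "query": query,
        "bad_answer": answer,
        "retrieved_chunks": chunk_ids,
        "expected": user_correction,  # None when the user gave no correction
        "label": "negative",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
Appending one record per negative event builds steadily toward the 500+ labeled examples targeted above.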
Structured Data RAG: Text2SQL & Table QA
Not all knowledge lives in documents. Production RAG systems often need to query structured data — relational databases, data warehouses, spreadsheets, and APIs. Text2SQL converts natural language into SQL queries, while Table QA reasons over tabular data directly.
Text2SQL Pipeline
Convert natural language to SQL using schema-aware prompting. Key: provide table schemas, column descriptions, sample values, and example query pairs in the LLM prompt.
class Text2SQLPipeline:
    def query(self, question: str):
        # 1. Retrieve relevant schemas
        schemas = self.schema_retriever.search(
            question, top_k=5
        )
        # 2. Generate SQL
        sql = self.llm.generate(
            self.prompt_template.format(
                schemas=schemas,
                question=question,
                examples=self.few_shot_examples,
            )
        )
        # 3. Validate & sanitize
        sql = self.sql_validator.check(sql)
        # 4. Execute (read-only!)
        results = self.db.execute(sql)
        # 5. Synthesize answer
        return self.synthesizer.answer(
            question, sql, results
        )
Table QA (Direct Reasoning)
For smaller tables or CSV data, pass the table directly into the LLM context. The model reasons over rows and columns without SQL. Best for aggregations, comparisons, and trend analysis on <100 rows.
# Serialize table as Markdown
table_md = df.to_markdown(index=False)
prompt = f"""Given this data table:
{table_md}
Answer: {question}
Rules:
- Only use data from the table
- Show your calculation steps
- If data is insufficient, say so"""
answer = llm.generate(prompt)
Text2SQL Safety & Guardrails
SQL Injection Prevention
Always use read-only DB connections. Parse and validate generated SQL against an allowlist of operations (SELECT only). Block DROP, DELETE, UPDATE, INSERT, GRANT.
Query Cost Guards
Add EXPLAIN before execution to estimate row scans. Set query timeouts (5-30s). Block full table scans on large tables. Limit result set size (LIMIT 1000).
Column-Level Access Control
Enforce column-level permissions in the schema retriever. Don't expose salary, SSN, or PII columns to unauthorized users. Redact sensitive columns from schema context.
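These guardrails compose into a single gate before execution. A minimal sketch of a SELECT-only validator using the sqlparse library; the keyword blocklist and LIMIT injection are illustrative, not exhaustive:
import sqlparse

BLOCKED = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "GRANT", "TRUNCATE"}

def validate_sql(sql: str, max_rows: int = 1000) -> str:
    statements = sqlparse.parse(sql)
    if len(statements) != 1:
        raise ValueError("Exactly one statement allowed")
    stmt = statements[0]
    if stmt.get_type() != "SELECT":
        raise ValueError(f"Only SELECT allowed, got {stmt.get_type()}")
    tokens = {t.value.upper() for t in stmt.flatten()}
    if tokens & BLOCKED:
        raise ValueError("Blocked keyword in generated query")
    # Naive guard: append a LIMIT if the model forgot one
    if "LIMIT" not in tokens:
        sql = f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"
    return sql
Pair this with a read-only database role and query timeouts; parsing alone should never be the only defense.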
Data Lifecycle, Freshness & Deletion
Production RAG cannot stop at ingestion. You need deterministic handling for updates, deletes, retention, cache invalidation, tombstones, and legal erasure requests so the system never serves stale or non-compliant content.
Lifecycle Rules
- Every chunk must carry `doc_id`, `version`, `source_updated_at`, `retention_class`, and `delete_by` metadata.
- Deletes should write tombstones first, then purge vector rows, cache entries, and derived artifacts asynchronously.
- Freshness is an SLO, not a hope: define targets like "95% of updates searchable within 5 minutes".
- Legal erasure must verify downstream deletion, not just remove the primary source record.
Failure Cases to Prevent
- Updated source document but stale semantic cache still serving old answer.
- Delete event lost, leaving orphaned chunks in the vector index.
- Embedding model upgrade without full lineage causing mixed-version retrieval.
- Retention policy applied to source DB but not to traces, audit logs, and feedback datasets.
Delete Propagation Pattern
class LifecycleManager:
    async def handle_delete(self, doc_id, tenant_id, version):
        tombstone = {
            "doc_id": doc_id,
            "tenant_id": tenant_id,
            "version": version,
            "deleted_at": now_utc(),
        }
        # Tombstone first, then purge derived artifacts asynchronously
        await self.audit_log.write(tombstone)
        await self.vector_index.delete(filter={
            "doc_id": doc_id,
            "tenant_id": tenant_id,
        })
        await self.cache.invalidate_prefix(f"{tenant_id}:{doc_id}:")
        await self.blob_store.purge(doc_id)
        await self.metrics.increment("rag.delete.completed")
Tenant Isolation & Authorization Propagation
Multi-tenant RAG fails dangerously when identity is lost between the API edge and retrieval. Authorization must propagate through query rewriting, retrieval filters, cache keys, reranking, citations, and structured data access.
Identity Context
Normalize identity into a signed request context: tenant, user, groups, region, classification clearance, data residency, and session purpose.
Policy Resolution
Compile ABAC/RBAC decisions once per request and pass concrete filters downstream. Do not let each service reinterpret permissions differently.
Output Enforcement
Filter citations, schema context, and tool outputs after retrieval as well. A safe retriever can still leak through a broad synthesizer prompt.
Authorization Contract
import hashlib
import json

request_context = {
    "tenant_id": "acme",
    "user_id": "u-123",
    "groups": ["support", "tier2"],
    "region": "us",
    "purpose": "customer_support",
    "allow": {
        "doc_types": ["kb", "ticket"],
        "classifications": ["public", "internal"],
    },
}

filters = {
    "tenant_id": request_context["tenant_id"],
    "region": request_context["region"],
    "classification": {"$in": request_context["allow"]["classifications"]},
}

# Cache keys must embed tenant and policy so entries never cross boundaries
cache_key = hashlib.sha256(json.dumps({
    "query": normalized_query,
    "tenant": request_context["tenant_id"],
    "policy_hash": hash_policy(request_context),
}, sort_keys=True).encode()).hexdigest()
Human Review Ops & Golden Datasets
Evaluation frameworks are not enough by themselves. Production teams need a disciplined review loop: sample traffic, adjudicate failures, curate regression sets, and assign ownership for fixing systematic defects.
Review Program Design
- Sample at least three buckets: top traffic, low-confidence responses, and high-risk policy domains.
- Require labels for retrieval quality, groundedness, citation quality, and user task completion.
- Track reviewer agreement and escalate ambiguous cases to adjudication.
- Promote only adjudicated examples into the golden regression set.
Dataset Operating Model
- Keep separate sets for smoke, regression, hard edge cases, and release blocking policy cases.
- Version datasets like code and record model, prompt, and index version used to generate them.
- Retire stale eval samples when source policy or corpus semantics change materially.
- Assign owners for every recurring failure cluster, not just every model.
Minimal Review Schema
review_record = {
    "query_id": "q-20260416-001",
    "query": user_query,
    "retrieved_chunks": chunk_ids,
    "answer": answer,
    "labels": {
        "grounded": True,
        "intent_match": True,
        "citation_quality": "partial",
        "task_success": "no",
        "root_cause": "stale_source_data",
    },
    "reviewer_id": "rev-17",
    "adjudicated": False,
}
Reliability, Failover & Degraded Modes
A production RAG system must keep answering safely when dependencies fail. Define fallback order, circuit breakers, restore targets, and degraded modes before you need them during an incident.
Primary Path
Hybrid retrieval + reranker + response evaluation + citations. Highest quality, highest dependency count.
Degraded Path
BM25-only retrieval, smaller local model, cached answers, or template response if vector DB, reranker, or API model is down.
Fail-Safe Path
Refuse cleanly, escalate to human, or serve a narrow verified FAQ set. Never silently drop safety checks.
Dependency Failure Matrix
| Dependency | Failure Signal | Fallback | Hard Rule |
|---|---|---|---|
| Vector DB | timeout / error budget burn | BM25 index or cached answer set | Disable claims needing fresh retrieval |
| Reranker | high latency / no replicas | lower `top_k`, rely on retrieval scores | Mark answer confidence lower |
| LLM API | provider outage / 429 storm | secondary model or local distilled model | Preserve same guardrails and filters |
| Policy Engine | cannot resolve permissions | fail closed | Never answer with missing auth context |
Reliability Controls
async def answer_query(query, ctx):
    if not policy_engine.is_available():
        raise FailClosed("authorization unavailable")  # fail closed, never open
    try:
        # 250 ms budget on the primary retrieval path
        docs = await vector_search.with_timeout(250).run(query, ctx.filters)
    except TimeoutError:
        docs = await bm25_fallback.search(query, ctx.filters)
        ctx.mode = "degraded_retrieval"
    answer = await generator.run(query, docs, ctx)
    verdict, safe_answer = await response_eval.run(query, answer, docs)
    if verdict == "fallback":
        return human_handoff_or_verified_faq(query)
    return safe_answer
Citation UX & Source Attribution
Grounding is only useful if users can inspect it. Production RAG should define how claims map to sources, how conflicting evidence is shown, and how citations differ across chat, search, copilots, and agent workflows.
Claim-Level Citations
Attach citations to atomic claims, not just the whole answer. One answer can have mixed evidence quality across sentences.
Source Preview
Show document title, snippet, timestamp, source system, and anchor location. Users should not need to open the full document to trust the claim.
Conflict Handling
When sources disagree, say so explicitly and rank by freshness, authority, and tenant-approved source priority.
Answer Contract with Citations
{
  "answer": "Refunds are allowed within 30 days for unopened items.",
  "claims": [
    {
      "text": "Refunds are allowed within 30 days",
      "citations": [
        {"doc_id": "policy-12", "anchor": "p3#refund-window", "confidence": 0.94}
      ]
    }
  ],
  "source_summary": [
    {"title": "Returns Policy", "updated_at": "2026-04-01", "authority_rank": 1}
  ]
}
Multilingual & Locale-Aware RAG
Multilingual retrieval is more than using a multilingual embedding model. You need locale-aware routing, translation policy, source preference by market, and evaluation sliced by language and script.
Serving Policy
- Prefer native-language retrieval when the corpus exists in that locale.
- Use translation only as a fallback, and keep both original and translated evidence IDs.
- Apply locale-specific ranking for policy, legal, pricing, and compliance content.
Evaluation Requirements
- Track metrics by language, script, market, and translated-vs-native path.
- Maintain hard test sets for code-switching, transliteration, and named-entity spelling variants.
- Never hide poor minority-language performance behind global averages.
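A minimal sketch of the native-first serving policy above, assuming per-locale indexes exposing a search() method plus caller-supplied detect_language and translate helpers:
def retrieve_locale_aware(query, indexes, detect_language, translate):
    """Native-first retrieval with a translation fallback."""
    lang = detect_language(query)  # e.g. a fastText or CLD3 wrapper
    # Prefer native-language retrieval when that locale's corpus exists
    if lang in indexes:
        hits = indexes[lang].search(query, top_k=10)
        if hits:
            return {"hits": hits, "path": "native", "lang": lang}
    # Fallback: translate the query, search the pivot-language index,
    # and keep the pivot query so evidence IDs stay auditable
    pivot_query = translate(query, target="en")
    hits = indexes["en"].search(pivot_query, top_k=10)
    return {"hits": hits, "path": "translated", "pivot_query": pivot_query}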
Personalization, Memory Boundaries & Deletion
Personalization improves usefulness, but it creates new correctness and compliance risks. The system must define what memory is allowed, how long it persists, who can see it, and how user corrections or deletions propagate.
Allowed Memory
Preferences, saved entities, work context, and prior explicit corrections. Keep this separate from shared knowledge retrieval.
Boundary Controls
Do not let user memory silently override system facts. Personal memory can bias ranking, not rewrite source-of-truth records.
Deletion Semantics
A user memory delete must remove embeddings, cache entries, summaries, and feedback traces tied to that memory object.
Secrets Management & Credential Rotation
Connectors, model providers, vector stores, and observability backends all introduce credentials. Production RAG needs explicit controls for secret storage, scoping, rotation, and auditability.
Required Controls
- Use a secret manager or workload identity, never hardcoded env files committed to the repo.
- Scope credentials per service and connector, not per environment.
- Rotate provider and connector tokens on a schedule and on incident.
- Log secret access events and failed decrypt attempts.
Common Failures
- Shared API key across ingestion, retrieval, and agent tools.
- Long-lived connector tokens without revocation flow.
- Secrets leaking into traces, prompts, or failed job payloads.
- Rotation that breaks warm instances because caches never refresh credentials.
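A minimal sketch addressing the stale-warm-instance failure above: a TTL-bounded secret cache that re-fetches on expiry so rotations propagate. The provider object is a stand-in for your secret manager client, and its fetch() method plus the 5-minute TTL are assumptions:
import time

class RotatingSecretCache:
    """Re-fetch secrets on a TTL so warm instances pick up rotations."""

    def __init__(self, provider, ttl_seconds=300):
        self.provider = provider  # e.g. a Vault / Secrets Manager client
        self.ttl = ttl_seconds
        self._cache = {}          # name -> (value, fetched_at)

    def get(self, name):
        entry = self._cache.get(name)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        value = self.provider.fetch(name)  # assumed provider API
        self._cache[name] = (value, time.time())
        return value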
RAG Framework Selection: What Each Is Best For
Framework choice should match the job. The wrong abstraction layer slows teams down just as much as the wrong model. Use this as a default selection guide, then override it only with clear constraints.
| Framework / Approach | Best For | Why | Use When |
|---|---|---|---|
| LlamaIndex | Data indexing + retrieval | Strong abstractions for ingestion, indexing, retrievers, node parsers, graph/property indexes, and retrieval composition. | You need to stand up robust retrieval quickly without building every data primitive yourself. |
| LangChain | Full LLM apps | Broad ecosystem for prompts, tools, chains, agents, integrations, and app-level orchestration. | You are building an end-to-end LLM product, not just a retriever. |
| Haystack | Production pipelines | Pipeline-oriented design, component composition, and strong production ergonomics for retrieval/generation systems. | You want explicit, maintainable, production-ready pipeline graphs. |
| LangGraph / AutoGen | Agents | Stateful orchestration and multi-step agent workflows with tool use, branches, retries, and explicit control flow. | You need agentic execution, not just one-pass RAG. |
| DSPy | Auto-optimized pipelines | Signature-driven modules and optimizers make it strong for prompt/program search and systematic quality tuning. | You are iterating experimentally and want the pipeline to optimize itself against metrics. |
| Custom stack | Performance + control | Minimal overhead, exact ownership of latency, storage, auth, and reliability behavior. | You have strict production constraints or framework abstraction is becoming the bottleneck. |
Default Rule
Pick the highest-level framework that does not hide a production constraint you care about.
Migration Rule
Start with a framework, then peel off hot or risky components into custom services once the bottlenecks are proven.
Anti-Pattern
Do not use an agent framework to solve a retrieval problem, or a retrieval framework to solve orchestration complexity.
Glossary of RAG Technical Terms
Technical terms, tools, models, metrics, and concepts across the RAG stack, organized alphabetically A-Z.
A
| Term | Definition |
|---|---|
| A/B Testing | Comparing model variants in production by routing traffic splits and measuring metrics to determine which version performs better on grounding, latency, and user satisfaction. |
| Access Control | Mechanism restricting who can query which documents; critical for multi-tenant RAG systems where different users have access to different knowledge bases. |
| Accuracy | Fraction of correct predictions out of total predictions; measures overall classification or retrieval quality. |
| ACL-sensitive cache keys | Cache keys incorporating access control preventing leakage. |
| Adaptive chunk count | Dynamically adjusts retrieved chunks by query complexity. |
| Adaptive RAG | RAG pattern that dynamically selects retrieval strategies based on query complexity — routing simple queries to direct retrieval and complex ones to multi-step. |
| Advanced RAG | Enhanced RAG with query transformation, hybrid retrieval, reranking, context compression, and self-correction loops for production quality. |
| adversarial testing | Probing systems with malicious inputs to find weaknesses. |
| Agentic Chunking | Using an LLM to decide chunk boundaries based on semantic content rather than fixed rules — highest quality but most expensive. |
| Agentic RAG | Pattern where LLM agent autonomously decides when/how to retrieve, orchestrating multi-step loops rather than following a fixed pipeline. |
| AGREE approach | Automated grounding evaluation framework. |
| ALCE approach | Benchmark for automatic evaluation of citation quality in LLM-generated answers. |
| alert thresholds | Boundaries triggering notifications on metric violations. |
| ALiBi (Attention with Linear Biases) | Positional encoding adding linear biases to attention scores for length extrapolation beyond training sequences. |
| all-mpnet | Sentence-transformer combining multiple pooling strategies for versatile embeddings. |
| amazon-neptune | AWS managed graph database for property graphs and RDF in Graph RAG. |
| ANN (Approximate Nearest Neighbor) | Algorithms like HNSW and IVF that trade exactness for speed in vector search, enabling sub-linear retrieval. |
| anomaly detection | Identifies unusual patterns suggesting failures. |
| answer correctness | Evaluates generated answer accuracy against ground truth. |
| Answer Relevancy | RAGAS metric measuring how well the generated answer addresses the original question. |
| answer similarity | Compares generated answers to references using embedding or semantic similarity. |
| AnswerCorrectness | RAGAS metric combining factual correctness and semantic similarity of the generated answer against a ground-truth reference. |
| Apache Tika | Java library extracting text from 1000+ file formats with OCR support for multimodal RAG. |
| ArgoCD | GitOps tool managing Kubernetes applications and RAG infrastructure changes. |
| Arize Phoenix | ML observability platform monitoring embeddings, LLM outputs, and performance drift. |
| Asymmetric Search | Retrieval where queries and documents are encoded differently — short queries mapped to the same space as long documents. |
| async processing | Non-blocking operation handling. |
| Attention Mechanism | Neural component allowing tokens to selectively focus on other tokens via Q·K^T/√d → softmax → V. |
| audit trails | Logging retrieval/generation for compliance and transparency. |
| Autoregressive Decoding | Sequential generation conditioning each token on all previously generated ones. |
B
| Term | Definition |
|---|---|
| Batching | Grouping multiple queries for efficient parallel processing on GPU. |
| BEIR | Benchmarking IR — zero-shot evaluation across 18 diverse retrieval datasets. |
| BentoML | Framework for productionizing and deploying ML models including embeddings. |
| BGE (BAAI General Embedding) | Family of open-source embedding and reranker models. |
| bge-m3 | BAAI's multilingual embedding supporting dense, sparse, and colbert-style retrieval simultaneously. |
| Bi-Encoder | Model that independently encodes queries and documents into separate vectors for fast retrieval. |
| binarization | Converts continuous to binary. |
| BLEU | Bilingual Evaluation Understudy — metric for evaluating generated text against references. |
| Bloom Filter | Probabilistic data structure for fast membership testing with no false negatives. |
| blue-green deployment | Parallel versions enabling instant rollback. |
| BM25 | Best Matching 25 — probabilistic sparse retrieval algorithm using TF-IDF-like scoring. |
| Binary Quantization | Reducing embedding vectors to binary bits (0/1) for ultra-fast retrieval with ~32x memory reduction at moderate quality cost. |
C
| Term | Definition |
|---|---|
| Caching | Storing computed results for reuse — semantic cache, exact cache, and embedding cache reduce latency and cost. |
| calibration | Adjusts confidence matching actual accuracy. |
| Canary Deployment | Gradually routing traffic to a new model version while monitoring for regressions. |
| CARGO approach | Cascading grounding optimization. |
| Chain-of-Thought | Prompting technique eliciting step-by-step reasoning before final answer. |
| Chroma | Lightweight open-source embedding database for AI applications. |
| Chunking | Splitting documents into smaller segments — strategies include fixed-size, recursive, semantic, sentence-window. |
| Circuit Breaker | Resilience pattern preventing cascading failures by short-circuiting calls to failing services. |
| Citation | Reference to a specific source passage supporting a generated claim. |
| Citation Precision | Fraction of inline citations that actually support their attached claim; target ≥0.80. |
| Citation Recall | Fraction of claims that have at least one valid supporting citation; target ≥0.75. |
| Clustering | Grouping similar items without labels — used for topic modeling and document organization. |
| Code-Aware Chunking | Chunking that respects code structure — splitting at function/class boundaries rather than mid-expression for technical documentation. |
| Cohere | AI company providing embedding and reranking models via API. |
| ColBERT | Contextualized Late Interaction over BERT — 10-100x faster than cross-encoders. |
| Community Detection | Algorithm like Leiden that identifies clusters of densely connected entities in knowledge graphs. |
| compliance and governance | Policies ensuring RAG meets regulatory requirements. |
| Compression | Reducing context length before generation — extractive, abstractive, or hybrid. |
| confidence calibration | Ensures predicted confidence matches correctness. |
| Confidence tagging | Tags claims by credibility based on retrieval confidence. |
| confidence-based weighting | Weights by model confidence scores. |
| connection pooling | Reuses connections reducing overhead. |
| Consensus answer | Combines multiple answers via voting reducing individual hallucinations. |
| Consistency Checking | Verifying generated content agrees with source material. |
| content quality evaluation | Assesses retrieved content quality. |
| Context Injection | Adding retrieved passages into the LLM prompt as grounding context. |
| Context Precision | RAGAS metric measuring the proportion of relevant retrieved chunks among all retrieved chunks — higher means less noise. |
| Context Recall | RAGAS metric measuring the fraction of required information successfully retrieved in the top-K results. |
| Context Stuffing | Anti-pattern of including excessive context that confuses the model. |
| Context Window | Maximum tokens an LLM can process in one pass — determines how much retrieved context fits. |
| Contextual Chunking | Anthropic's approach prepending a short context summary to each chunk describing its position and role in the parent document. |
| ContextualCompressionRetriever | LangChain's wrapper combining a base retriever with a document compressor pipeline for automatic context reduction. |
| Contrastive Learning | Training embeddings by pulling similar pairs closer and pushing dissimilar pairs apart. |
| Corrective RAG (CRAG) | RAG pattern that evaluates retrieval quality after each step and triggers alternative retrieval strategies when confidence is low. |
| Cosine Similarity | Similarity metric computing cos(θ) between two vectors; standard for embedding comparison. |
| CPU optimization | Optimizes for CPU and parallelism. |
| Cross-Encoder | Reranking model processing query-document pairs jointly via full cross-attention; more accurate but slower. |
| Cypher | Neo4j's graph query language used for structured graph retrieval in Graph RAG. |
D
| Term | Definition |
|---|---|
| Data Poisoning | Adversarial attack introducing corrupted data into the knowledge base to manipulate outputs. |
| data residency | Data never leaves geographic regions or infrastructure. |
| DeBERTa | Decoding-enhanced BERT — used as NLI model for grounding verification. |
| Decomposition | Breaking complex queries into simpler sub-questions for independent retrieval. |
| DeepEval | Evaluation framework offering pre-built metrics for RAG without manual labels. |
| Dense Embedding | High-dimensional continuous vector representing text semantics. |
| Dense Retrieval | Retrieval using learned dense vectors where similarity = cosine/dot-product. |
| dependency scanning | Automated scanning for known vulnerabilities. |
| Diffbot | Web intelligence API providing entity extraction and knowledge graph construction from web content. |
| dimensionality reduction | Reduces features via PCA/SVD. |
| Disambiguation | Resolving ambiguity when the same term refers to different entities. |
| DiskANN | Microsoft's disk-based ANN algorithm enabling billion-scale vector search. |
| distance metrics | Similarity functions (cosine, L2, dot, Hamming). |
| distillation loss | Objective comparing student to teacher. |
| distributed tracing | Records request paths across services for latency analysis. |
| diversity-based weighting | Balances relevance and diversity. |
| Docker | Containerization technology packaging RAG applications with dependencies. |
| Document Loader | Component ingesting raw files into the pipeline — LangChain loaders, Unstructured.io, Apache Tika. |
| Document reordering | Rearranges compressed documents putting most relevant content first. |
| Document Sharding | Partitioning documents across nodes for horizontal scaling. |
| document-type router | Routes queries to specialized pipelines by document type. |
| Dot Product | Sum of element-wise multiplication — used as fast similarity metric for normalized vectors. |
E
| Term | Definition |
|---|---|
| ECoRAG | Evidentiality-guided Compression for long-context RAG — 5-15x compression with 96-99% quality. |
| Elasticsearch | Distributed search engine supporting both keyword and vector search. |
| element-aware parsing | Preserves document structure (tables, code, lists) during parsing. |
| Embedding | Dense vector representation mapping text to continuous high-dimensional space. |
| Embedding Drift Detection | Monitoring technique tracking how embedding model outputs change over time, triggering re-indexing or retraining when drift exceeds thresholds. |
| Embedding Model | Neural network encoding text into fixed-size vectors for similarity comparison. |
| ensemble methods | Combines multiple models for robustness. |
| Entailment check | NLI-based verification confirming context entails generated claims. |
| Entity Linking | Connecting entity mentions to entries in a knowledge base or graph. |
| Entity Recognition | NER — identifying named entities and their types in text. |
| error budgets | Allowable errors before breaching SLAs. |
| euclidean distance | L2 distance between vectors. |
| Evaluation Framework | Systematic approach for measuring RAG quality — RAGAS, ARES, custom suites. |
| Eventual Consistency | Distributed system property where all nodes converge to consistent state over time. |
| Exact Match Cache | Caching strategy storing results for identical query strings. |
| Exponential Backoff | Progressively increasing wait time between retries to avoid overloading. |
| Extractive Compression | Selecting most relevant sentences/tokens from context without rewriting. |
F
| Term | Definition |
|---|---|
| FActScore | Fact-level metric decomposing claims and scoring verifiable facts. |
| FAISS | Facebook AI Similarity Search — library for efficient similarity search, supports CPU and GPU. |
| Faithfulness | Core grounding metric — fraction of generated claims supported by retrieved context; RAGAS target ≥0.85. |
| FalkorDB | Graph database specialized for knowledge graphs and multi-hop reasoning in RAG. |
| Fallback strategies | Alternative approaches on low confidence. |
| Few-Shot Learning | Performing a task with minimal examples provided in the prompt. |
| Filtering | Selecting subset of results based on metadata, relevance threshold, or safety criteria. |
| Fine-Tuning | Adapting a pretrained model to a specific task or domain with task-specific data. |
| FlagEmbedding | BAAI's training framework for state-of-the-art embedding and reranker models with support for retrieval-augmented fine-tuning. |
| FlashRank | Fast approximate reranker for initial filtering before expensive cross-encoders. |
| FP16 computation | Half-precision reducing memory. |
| Fusion | Combining results from multiple retrievers/rankers — typically via RRF or weighted scoring. |
| Fuzzy Matching | Finding approximately matching items allowing minor differences in spelling or phrasing. |
G
| Term | Definition |
|---|---|
| GPU | Graphics Processing Unit — hardware for parallel computation powering embedding generation and LLM inference. |
| Grafana | Visualization platform creating dashboards from Prometheus and other metric sources. |
| Graph Database | Database storing data as nodes and relationships — Neo4j, Amazon Neptune, NebulaGraph. |
| Graph RAG | RAG enhanced with knowledge graphs for multi-hop reasoning, entity disambiguation, and traceable answers — reduces hallucination 50-70%. |
| Graph Traversal | Navigating connected nodes in a knowledge graph to find multi-hop answers. |
| Grounding | Anchoring every LLM claim to specific evidence from retrieved documents — primary defense against hallucination. |
| gRPC | Google's high-performance RPC framework for low-latency service communication. |
| GTE-Qwen (7B) | Qwen-based general text embedding model supporting multiple languages and modalities. |
| Guardrails | Input/output validation rules enforcing safety, compliance, and quality — PII detection, topic filtering, toxicity checks. |
H
| Term | Definition |
|---|---|
| Hallucination | LLM generating plausible but factually incorrect information; baseline RAG: 10-25%, with grounding: 3-10%. |
| hamming distance | Distance for binary strings. |
| hard negative mining | Selects challenging negatives improving discrimination. |
| hard timeouts | Maximum operation duration limits. |
| hard veto rules | Absolute blocking rules preventing certain responses. |
| harmfulness | Evaluates if generated content violates ethical, legal, or safety guidelines. |
| Haystack | End-to-end RAG framework with retrieval, reranking, generation. |
| Helm | Kubernetes package manager enabling templated RAG infrastructure deployment. |
| HNSW | Hierarchical Navigable Small World — ANN algorithm building multi-layer graph for O(log N) search with high recall. |
| HNSW ef Parameter | HNSW search parameter controlling beam width during query — higher ef means more accurate but slower search. |
| HNSW M Parameter | HNSW build parameter controlling graph connectivity — higher M means better recall but more memory per node. |
| Hybrid Retrieval | Combining dense/semantic and sparse/keyword retrieval via RRF fusion — production best practice. |
| HyDE | Hypothetical Document Embeddings — generates a hypothetical answer first, then embeds it as the query vector. |
I
| Term | Definition |
|---|---|
| IDF | Inverse Document Frequency — weighting factor reducing importance of common terms. |
| In-Context Learning | Model learning from examples in the prompt without weight updates. |
| incident response | Procedures for detecting and resolving failures. |
| Index | Data structure optimizing lookup — vector indexes like HNSW, IVF; keyword indexes like inverted index. |
| infrastructure as code | Version-controlled infrastructure definitions. |
| Ingestion Pipeline | Offline workflow: load → parse → clean → chunk → embed → store in vector DB. |
| Instructor | Large embedding model pre-trained on diverse tasks with explicit instruction support for asymmetric search. |
| Instruction-Tuned Embeddings | Embedding models fine-tuned to follow task-specific instructions prepended to queries, improving retrieval for specific use cases. |
| Intent Recognition | Understanding user's goal from their query to route to appropriate retrieval strategy. |
| Inverted Index | Data structure mapping terms to documents containing them — backbone of keyword search. |
| IVF | Inverted File — ANN indexing that clusters vectors, searches only nearest clusters. |
| IVF-PQ | Combined index using Inverted File clustering with Product Quantization — enables billion-scale vector search with reduced memory. |
J
| Term | Definition |
|---|---|
| Jitter | Small random delay added to retries to prevent thundering-herd problems in distributed systems (see the sketch below this table). |
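
A minimal retry sketch with exponential backoff and full jitter, as referenced above; the delay constants are illustrative:

```python
import random
import time

def retry_with_jitter(operation, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts — surface the failure
            # Full jitter: sleep a random fraction of the backoff window so many
            # clients retrying at once do not synchronize (thundering herd).
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```
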
K
| Term | Definition |
|---|---|
| KNN | K-Nearest Neighbors — finding the K closest vectors to a query in embedding space (see the sketch below this table). |
| Knowledge Distillation | Training a smaller student model to mimic a larger teacher model's outputs. |
| Knowledge Graph | Structured entity-relationship representation enabling multi-hop reasoning in Graph RAG. |
| Knowledge Transfer | Leveraging pre-trained models as the starting point for downstream tasks. |
| KServe | Kubernetes-native platform deploying embeddings and LLM models at scale. |
| Kubernetes | Container orchestration deploying, scaling, and managing RAG services in production. |
| KV Cache | Cache storing the key/value attention matrices of previous tokens so they are not recomputed during decoding. |
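
Exact KNN, as referenced above, is just a similarity scan — the brute-force baseline that ANN indexes like HNSW and IVF approximate. A minimal numpy sketch:

```python
import numpy as np

def knn(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Exact k-nearest-neighbor search by cosine similarity (brute force)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                     # cosine similarity against every vector
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return top, scores[top]

vectors = np.random.rand(1000, 384).astype(np.float32)
ids, scores = knn(vectors[0], vectors, k=3)
```
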
L
| Term | Definition |
|---|---|
| Lambda Loss | Learning-to-rank loss that directly optimizes ranking metrics such as NDCG. |
| LangChain | Framework for building LLM applications — provides document loaders, text splitters, retrievers, chains, agents. |
| LangFuse | Open-source LLM observability platform with tracing, metrics, and cost analysis. |
| LangSmith | LangChain's tracing and monitoring platform for debugging LLM applications in production. |
| Late Chunking | Chunking strategy that first embeds the full document, then segments into chunks, preserving cross-boundary context in the embeddings. |
| Late Interaction | Retrieval architecture deferring token-level query-document interactions to scoring time, as in ColBERT. |
| Latency | Time from query submission to response delivery — measured as P50/P95/P99 percentiles. |
| Leiden Algorithm | Community detection algorithm used in Graph RAG for hierarchical clustering of entities. |
| Listwise Ranking | Learning-to-rank approach scoring entire candidate lists jointly rather than single documents or pairs. |
| LlamaIndex | Data framework for LLM apps — VectorStoreIndex, PropertyGraphIndex, LongLLMLinguaPostprocessor. |
| LLMGraphTransformer | Constructs knowledge graphs from documents using an LLM. |
| LLMLingua | Microsoft's prompt compression: v1 perplexity-based 20x compression; v2 token classification 3-6x faster. |
| Load Balancing | Distributing requests across servers — round-robin, least connections, weighted. |
| LongLLMLingua | RAG-optimized compression with question-aware coarse-to-fine pruning, document reordering, dynamic ratios. |
| LongLLMLinguaPostprocessor | LlamaIndex's node postprocessor integrating LLMLingua compression directly into the query pipeline. |
| LoRA | Low-Rank Adaptation — fine-tuning via small low-rank update matrices while the base weights stay frozen. |
| Lost-in-the-Middle | Phenomenon where LLMs disproportionately attend to the beginning and end of long contexts, ignoring the middle (see the reordering sketch below this table). |
| Low-Rank Approximation | Approximating a matrix with one of lower rank to reduce parameters and computation. |
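
One common lost-in-the-middle mitigation, as referenced above, reorders retrieved documents so the strongest evidence sits at the edges of the context. A minimal sketch, assuming the input list is sorted best-first:

```python
def edge_reorder(docs_by_score: list[str]) -> list[str]:
    """Place the highest-scoring documents at the start and end of the context,
    pushing weaker ones toward the middle, to counter lost-in-the-middle bias."""
    front, back = [], []
    for i, doc in enumerate(docs_by_score):  # docs sorted best-first
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(edge_reorder(["d1", "d2", "d3", "d4", "d5"]))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```
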
M
| Term | Definition |
|---|---|
| Manhattan Distance | L1 distance — sum of absolute coordinate differences. |
| MAP | Mean Average Precision — average of precision values at each relevant document position. |
| Markdown Header Chunking | Splitting documents at markdown header boundaries (H1, H2, H3) to create topically coherent chunks matching document structure. |
| Matrix Factorization | Decomposing a matrix into lower-dimensional factors. |
| Matryoshka Representation Learning | Training embeddings so that truncated prefixes remain useful at multiple dimensionalities. |
| Maximal Marginal Relevance | MMR — balancing relevance and diversity in retrieved results to reduce redundancy (see the sketch below this table). |
| Metadata Filtering | Pre-filtering vector search by structured fields: date, source, category, access level. |
| Metric Collection | Systematic gathering of performance metrics across system components. |
| Milvus | Open-source vector database for scalable similarity search with HNSW, IVF, DiskANN indexes. |
| MiniLM | Compact Transformer family — all-MiniLM-L6-v2 is popular for fast production embedding. |
| MLflow | ML lifecycle platform for experiment tracking and model registry. |
| MMR | Maximal Marginal Relevance — see above. |
| Model Provenance | Tracking a model's origin, training data, and modifications. |
| Model Routing Decisions | Selecting models by query type, cost, or latency constraints. |
| Modular RAG | Architecture decomposing RAG into interchangeable modules (retrieval, reranking, compression, generation) that can be independently upgraded or swapped. |
| Monitoring | Continuous observation of system health: latency, throughput, quality metrics, error rates. |
| MRR | Mean Reciprocal Rank — average of 1/rank of the first relevant result across queries. |
| MTEB | Massive Text Embedding Benchmark — standard leaderboard across 8 tasks and 50+ datasets. |
| Multi-Hop Reasoning | Answering questions requiring traversal across multiple connected facts or documents. |
| Multi-Query Retrieval | Generating multiple rephrasings of a query, retrieving for each, and deduplicating results. |
| Multi-Tenancy | Single vector database instance serving multiple isolated organizations/users with separate data partitions and access controls. |
| Multilingual E5 | E5 family supporting 100+ languages for cross-lingual RAG and multilingual retrieval. |
| mxbai-rerank | Mixedbread AI reranker providing efficient ranking of retrieved documents. |
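
A minimal MMR sketch, as referenced above; `lam` trades relevance against diversity, and the value 0.7 is illustrative:

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5, lam: float = 0.7) -> list[int]:
    """Maximal Marginal Relevance: iteratively pick documents that are relevant
    to the query but dissimilar to documents already selected."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    candidates = list(range(len(doc_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda i: lam * cos(query_vec, doc_vecs[i])
            - (1 - lam) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
docs = rng.normal(size=(20, 64))
print(mmr(docs[0], docs, k=5))  # indices of 5 relevant-but-diverse documents
```
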
N
| Term | Definition |
|---|---|
| Naive RAG | The simplest RAG pattern: retrieve top-K chunks, concatenate into the prompt, generate the answer — no reranking, query transformation, or self-correction. |
| Namespaces | Logical partitions within a vector database organizing data by tenant, project, or use case for isolated retrieval. |
| NDCG@K | Normalized Discounted Cumulative Gain — ranking metric weighting higher positions more heavily (see the sketch below this table). |
| NebulaGraph | Distributed graph database optimized for large-scale knowledge graphs. |
| NER | Named Entity Recognition — identifying people, organizations, locations in text. |
| NLI | Natural Language Inference — entailment classification used for grounding verification via DeBERTa-MNLI. |
| Nomic (embed model) | Open-source embedding model optimized for long-context sequences up to 8K tokens. |
| Normalization | Standardizing vectors to unit length for cosine similarity, or standardizing data formats. |
| Nucleus Sampling | Top-P sampling — selecting from the smallest token set exceeding cumulative probability P. |
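
A minimal NDCG@K sketch, as referenced above, assuming graded relevance labels listed in ranked order:

```python
import numpy as np

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@K: discounted cumulative gain of the ranking, normalized by the
    DCG of the ideal ordering. `relevances` are graded labels in ranked order."""
    def dcg(rels):
        return sum(r / np.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=4))  # ≈0.985 relative to the ideal [3, 2, 1, 0]
```
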
O
| Term | Definition |
|---|---|
| Observability | Understanding system internal state from metrics, logs, and distributed traces. |
| OpenTelemetry | Observability framework collecting distributed traces and metrics from RAG systems. |
| ORTModel | Hugging Face Optimum's ONNX Runtime model classes for hardware-optimized inference. |
| Overlap | Duplication between adjacent chunks in sliding-window chunking to preserve cross-boundary context (see the sketch below this table). |
| OWASP | Open Web Application Security Project — LLM Top 10 threats for RAG security. |
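
A minimal sliding-window chunker illustrating the Overlap entry above; the sizes are illustrative token counts:

```python
def sliding_window_chunks(tokens: list[str], size: int = 200, overlap: int = 40) -> list[list[str]]:
    """Fixed-size sliding window with overlap so content spanning a chunk
    boundary appears intact in at least one chunk."""
    step = size - overlap
    return [tokens[i : i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [str(i) for i in range(500)]
print(len(sliding_window_chunks(tokens)))  # 3 chunks, each sharing 40 tokens with its neighbor
```
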
P
| Term | Definition |
|---|---|
| PagedAttention | vLLM's memory management technique that pages KV cache like virtual memory for efficient batching. |
| Pairwise Ranking | Learning-to-rank approach comparing document pairs for relative relevance. |
| Parameter-Efficient Fine-Tuning | Fine-tuning a small number of parameters (adapters, LoRA) instead of the full model. |
| Parent Document Retrieval | Searching on small child chunks but returning the full parent document for complete generation context (see the sketch below this table). |
| Parent-Child Retrieval | See Parent Document Retrieval. |
| Passage Ranking | Ordering text passages by relevance to a query. |
| pdfplumber | Python library for precise PDF text extraction and table parsing with layout awareness. |
| pgvector | PostgreSQL extension for vector similarity search — convenient when already using Postgres. |
| PII | Personally Identifiable Information — must be detected and redacted from documents and outputs. |
| Pinecone | Managed cloud vector database with serverless and pod-based deployment. |
| Pipeline | Sequence of processing stages — ingestion pipeline, query pipeline, evaluation pipeline. |
| Pointwise Ranking | Scoring each document independently vs pairwise or listwise approaches. |
| Precision | Fraction of retrieved items that are relevant. |
| Preprocessing | Data cleaning steps before indexing: normalize unicode, remove boilerplate, extract text from formats. |
| Product Quantization (PQ) | Vector compression technique factorizing high-dimensional space into independent low-dimensional subspaces, each quantized separately. |
| Prometheus | Time-series metrics database collecting system and application performance data. |
| Prompt Engineering | Designing effective prompts with system instructions, few-shot examples, and constraints. |
| Prompt Injection | Adversarial attack embedding malicious instructions in documents or queries — top OWASP threat. |
| Prompt Tuning | Learning task-specific soft tokens prepended to the input. |
| PromptCompressor | LLMLingua's compressor class for applying prompt compression to retrieved context. |
| Pruning | Removing unnecessary model weights for compression and speed. |
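
A minimal parent document retrieval sketch, as referenced above, using hypothetical in-memory dicts in place of a real vector store; the embedding/upsert step is elided as a comment:

```python
child_to_parent: dict[str, str] = {}  # child chunk id -> parent doc id
parents: dict[str, str] = {}          # parent doc id -> full text

def index_document(doc_id: str, text: str, chunk_size: int = 300) -> None:
    """Split into small child chunks for precise matching; remember the parent."""
    parents[doc_id] = text
    for n, start in enumerate(range(0, len(text), chunk_size)):
        child_to_parent[f"{doc_id}#{n}"] = doc_id
        # embed text[start:start + chunk_size] and upsert under id f"{doc_id}#{n}" ...

def retrieve_parents(child_hits: list[str]) -> list[str]:
    """Map matched child-chunk ids back to deduplicated parent documents."""
    seen: list[str] = []
    for cid in child_hits:
        pid = child_to_parent[cid]
        if pid not in seen:
            seen.append(pid)
    return [parents[p] for p in seen]

index_document("doc-1", "A" * 650)
print(len(retrieve_parents(["doc-1#2", "doc-1#0"])))  # 1 — both chunks map to one parent
```
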
Q
| Term | Definition |
|---|---|
| Qdrant | Vector database with advanced filtering, payload indexing, and hybrid search. |
| QLoRA | Quantized LoRA combining compression and efficiency. |
| Quality Regression Detection | Detecting drops in accuracy or relevance in production against established baselines. |
| Quantization | Reducing model precision to decrease memory and increase speed — GPTQ, AWQ, GGUF. |
| Query Decomposition | Breaking complex queries into simpler sub-questions for independent retrieval and synthesis. |
| Query Expansion | Enriching queries with synonyms, related terms, or LLM-generated reformulations. |
| Query Rewriting | Transforming queries for better retrieval — conversational-to-standalone, typo correction, clarification. |
| Query Routing | Classifying queries and directing them to the appropriate retrieval backend — e.g., keyword search for codes/IDs, semantic search for concepts (see the sketch below this table). |
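
A minimal routing heuristic, as referenced above; the regex for "looks like an identifier" is an illustrative assumption, not a production rule:

```python
import re

def route_query(query: str) -> str:
    """Route exact identifiers to keyword search, everything else to semantic."""
    looks_like_id = re.search(r"\b[A-Z]{2,}-?\d+\b|\b[0-9a-f]{8,}\b", query)
    return "keyword" if looks_like_id else "semantic"

print(route_query("error INV-4092 in billing"))  # keyword
print(route_query("how do refunds work?"))       # semantic
```
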
R
| Term | Definition |
|---|---|
| RAG | Retrieval-Augmented Generation — architecture combining document retrieval with LLM generation for grounded answers. |
| RAG-Fusion | Query transformation technique generating multiple query variants, retrieving for each, and fusing results via Reciprocal Rank Fusion for improved recall. |
| RAGAS | RAG Assessment — evaluation framework scoring faithfulness, answer relevancy, context precision/recall. |
| Ranking Algorithms | Methods ordering results by relevance — BM25, neural rankers, learning-to-rank. |
| Rate Limiting | Controlling request frequency to prevent system overload. |
| Ray Serve | Distributed serving framework scaling RAG models across multiple nodes. |
| Recall@K | Fraction of relevant documents appearing in the top-K results; target ≥0.90. |
| Reciprocal Rank Fusion | RRF — combining ranked lists from multiple retrievers: score = Σ 1/(k + rank_i); standard for hybrid search (see the sketch below this table). |
| RECOMP | Trained compression: extractive variant selects sentences; abstractive variant generates summaries; 5-20x compression. |
| Recursive Character Splitting | Splitting text by a prioritized list of delimiters (paragraph, sentence, word), recursing until chunks fit the size limit while preserving semantic units. |
| Red Teaming | Adversarial testing to discover vulnerabilities — prompt injection, jailbreaks, data extraction. |
| Redundancy Reduction | Deduplicating retrieved results to avoid repetition in the context. |
| Regression Detection | Automated alerting when metrics fall below established baselines. |
| Regulatory Requirements | Legal constraints (GDPR, HIPAA, SOC2) affecting system design. |
| Relevance | Degree to which a retrieved document addresses the user's information need. |
| Reranker | Model rescoring retrieved documents for better ranking — cross-encoders like BGE-reranker, mxbai-rerank. |
| Reranking | Re-ordering initially retrieved results using a more accurate but slower model. |
| Retraining Triggers | Metrics or thresholds that initiate model retraining. |
| Retry Logic | Automatically re-attempting failed operations with backoff, jitter, or strategy variations. |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation — metric for evaluating summarization quality. |
| RRF | Reciprocal Rank Fusion — see above. |
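
A minimal RRF sketch implementing the formula above; k = 60 is the conventional default smoothing constant:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # ranking from the semantic retriever
sparse = ["d1", "d4", "d3"]  # ranking from the keyword retriever
print(rrf_fuse([dense, sparse]))  # ['d1', 'd3', 'd4', 'd2'] — consensus hits rise
```
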
S
| Term | Definition |
|---|---|
| Sampling | Selecting tokens during generation — temperature, top-k, top-p/nucleus control diversity. |
| Scalar Quantization | Reducing embedding precision from FP32 to INT8 or lower, achieving 4x memory reduction with minimal quality loss. |
| Score Gap Analysis | Examining the gap between top-1 and top-K retrieval scores to decide whether reranking is necessary. |
| Secrets Management | Secure storage and rotation of credentials and tokens. |
| Self-Consistency | Grounding technique: generate N responses, keep claims appearing in ≥60% — reduces hallucination 40-55%. |
| Self-Correction Loop | RAG pattern evaluating output quality and retrying retrieval/generation if below threshold. |
| Self-RAG | RAG pattern where the LLM decides when to retrieve, what to retrieve, and self-evaluates whether retrieved passages are relevant before generating. |
| Semantic Cache | Caching results for semantically similar queries using an embedding similarity threshold (see the sketch below this table). |
| Semantic Chunking | Splitting at natural topic boundaries using embedding similarity between adjacent sentences. |
| Semantic Search | Retrieval based on meaning rather than keyword matching, using dense embeddings. |
| Sentence Transformers | Python library computing dense vector representations of sentences with pre-trained transformer models — powers most embedding pipelines and semantic search. |
| Sentence Window Retrieval | Indexing individual sentences but returning the surrounding window of ±N sentences for context. |
| SetFit | Few-shot learning framework enabling supervised embedding fine-tuning with minimal labeled data. |
| Sharding | Partitioning data across multiple nodes for horizontal scaling. |
| Similarity Metrics | Functions measuring vector closeness — cosine, dot product, Euclidean, Manhattan, Hamming. |
| SLA | Service Level Agreement — contractual performance guarantees for latency, uptime, accuracy. |
| Sliding Window | Chunking strategy stepping a fixed-size window with overlap across the document. |
| Soft Targets | Probabilistic targets from a teacher model, as opposed to hard labels. |
| Softmax | Function converting logits to a probability distribution summing to 1. |
| SpaCy | Industrial NLP library for entity recognition, dependency parsing, and document preprocessing. |
| Sparse Retrieval | Keyword-based retrieval using BM25/TF-IDF term matching — excels at exact terms, acronyms, proper nouns. |
| Speculative Decoding | Drafting tokens with a small model and verifying them in parallel with the target model to reduce latency. |
| SPLADE | Sparse Lexical and Expansion model — learned sparse retrieval combining term matching with expansion. |
| Step-Back Prompting | Generating a more abstract version of the query before retrieval to gather broader context. |
| Student Model | Smaller model trained to mimic a teacher model. |
| Sub-Question Decomposition | Breaking multi-part queries into simpler questions for independent retrieval. |
| Supply Chain Security | Evaluating the security of dependencies and third-party models. |
| Symmetric Search | Retrieval where queries and documents are encoded identically — used for similar-document finding, deduplication, and clustering. |
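
A minimal semantic cache sketch, as referenced above; `embed` is an assumed callable returning unit-normalized vectors, and the 0.92 threshold is illustrative:

```python
import numpy as np

class SemanticCache:
    """Return a cached answer when a new query's embedding is within a
    cosine-similarity threshold of a previously answered query."""

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # assumed callable: str -> unit-norm np.ndarray
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str):
        q = self.embed(query)
        for vec, answer in self.entries:
            if float(q @ vec) >= self.threshold:  # cosine since vectors are unit-norm
                return answer
        return None  # cache miss — run the full RAG pipeline

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

A production version would store entries in the vector database itself and attach a TTL so stale answers expire.
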
T
| Term | Definition |
|---|---|
| T2 Escalation | Tickets escalated to tier-2 support — a signal of quality issues such as embedding drift. |
| T3 Rejection | System rejections of low-confidence responses. |
| Teacher Model | Large model whose outputs train smaller student models. |
| Temperature | Parameter scaling logits: higher = more random/diverse, lower = more deterministic/focused (see the sampling sketch below this table). |
| TensorRT-LLM | NVIDIA's inference optimization engine with optimized GPU kernels for LLM serving. |
| Terraform | Infrastructure-as-code tool for provisioning cloud resources and RAG systems. |
| text-embedding-3-large | OpenAI's highest-quality text embedding model (3072 dimensions) for dense retrieval in RAG systems. |
| text-embedding-3-small | OpenAI's compact embedding model (1536 dimensions) balancing quality and speed for cost-efficient production RAG. |
| TF-IDF | Term Frequency-Inverse Document Frequency — classical term weighting scheme for keyword retrieval. |
| Threat Model | Systematic analysis of security risks — OWASP LLM Top 10 covers injection, data leakage, excessive agency. |
| Throughput | Requests processed per unit time — tokens/sec for LLMs, queries/sec for retrieval. |
| Tier-Based Retrieval | Routing queries to different retrieval strategies by complexity and confidence. |
| Token | Fundamental text unit in LLMs — subword pieces produced by tokenizers; ~0.75 English words per token. |
| Tokenization | Converting text into tokens via BPE, SentencePiece, or WordPiece algorithms. |
| Top-K | Returning the K most similar results from vector search; also a sampling strategy limiting generation to the K highest-probability tokens. |
| Toxicity Detection | Identifying harmful or abusive content for filtering. |
| Triton Inference Server | NVIDIA's production model serving with dynamic batching, model ensembles, multi-GPU. |
| TruLens | Feedback framework for evaluating and improving RAG systems with LLM-based metrics. |
| TruthfulQA | Benchmark evaluating truthfulness on challenging factual questions vs common misconceptions — important for RAG quality assessment. |
| TTL | Time To Live — cache expiration duration after which entries are refreshed. |
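
A minimal sketch combining temperature scaling with top-k truncation, as referenced above; the parameter values are illustrative:

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.8, top_k: int = 50) -> int:
    """Temperature scaling + top-k sampling over a logit vector."""
    scaled = logits / max(temperature, 1e-6)   # temperature sharpens or flattens
    top = np.argsort(-scaled)[:top_k]          # keep only the k highest logits
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()                       # softmax over the truncated set
    return int(np.random.choice(top, p=probs))

logits = np.random.randn(32_000)  # toy vocabulary-sized logit vector
print(sample_token(logits))
```
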
U
| Term | Definition |
|---|---|
| Uncertainty Estimation | Quantifies model confidence and ambiguity. |
| Unstructured.io | Platform processing diverse file types with element-aware parsing and metadata extraction. |
| Uptime Requirements | Availability targets (e.g., 99.99%) for services. |
V
| Term | Definition |
|---|---|
| Vector | Ordered array of numbers representing a point in high-dimensional space. |
| Vector Database | Specialized database for storing, indexing, and searching embeddings — Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector. |
| Vector Search | Finding nearest neighbors in embedding space using ANN algorithms. |
| Vectorization | Converting text to numerical vectors via embedding models. |
| Vendor Security Audit | Security evaluation before integrating external services. |
| Version Tracking | Maintaining model versions and performance history. |
| vLLM | High-throughput inference engine using PagedAttention and continuous batching — 10-50x faster than naive HuggingFace. |
| Voyage AI | Commercial embedding API providing Voyage-large and Voyage-code models optimized for enterprise retrieval tasks. |
W
| Term | Definition |
|---|---|
| Warm-Up | Initial cache/index loading phase before system reaches peak performance. |
| Weaviate | Open-source vector database with built-in vectorization, hybrid search, and GraphQL API. |
| Weighted Aggregation | Combining multiple scores or result lists using importance weights. |
| Weights & Biases | Experiment tracking platform for RAG training and evaluation runs. |
| WhyLabs | Model monitoring platform for tracking embedding quality and anomaly detection. |
Z
| Term | Definition |
|---|---|
| Zero-Shot Learning | Performing tasks without task-specific training examples — relying on model's general knowledge. |
Production-Grade RAG Pipeline Implementation Guide
Research-informed • 64 Sections • Architecture Diagrams • Code Examples