Data Pipelines for RAG & AI Agents
A practical guide to designing, building, and operating the data pipelines that feed retrieval-augmented generation systems and tool-using agents — from raw documents and APIs all the way to production retrieval, memory, and evaluation.
01 Overview
A RAG or agent system is only as good as the data it sees. The pipeline is the part of the system that decides what data exists, in what shape, when it updates, and how the model finds it.
Most production failures of RAG systems and AI agents trace back to the pipeline, not the model. "Bad answers" usually turn out to be one of:
- The right document existed but wasn't retrieved (bad chunking, wrong embedding, missing filter).
- The retrieved chunk was stale or duplicated (bad freshness or deduplication).
- The chunk was retrieved but the metadata was wrong (bad enrichment).
- The agent had the right tool but wrong context (bad context construction).
02 Why pipelines matter
Three axes drive the design: scale (millions of documents), freshness (seconds to days), and quality (clean, deduplicated, attributed).
Scale
Modern corpora are 10K–100M+ documents. You cannot embed and re-embed everything on every change — incremental, idempotent pipelines are mandatory.
Freshness
Stale data poisons retrieval. Customer-support agents need new tickets in minutes; legal-research agents tolerate hours; archival corpora tolerate days.
Quality
Deduplication, PII redaction, content typing, language detection, structural parsing — every quality control you skip shows up as a hallucination later.
03 RAG pipeline architecture
The canonical RAG data pipeline is a 7-stage flow from raw source to answerable retrieval. Each stage has its own failure modes and tools.
Sections 04–10 below walk through each stage in detail.
04 Stage 1 — Ingestion
Ingestion is where data enters the pipeline. The dominant question: push or pull, batch or stream.
Common source types
| Source | Pattern | Notes |
|---|---|---|
| Object storage (S3, GCS, Azure Blob) | Pull / event-driven | S3 events → SQS/Lambda is the most common ingest trigger. |
| SaaS APIs (Notion, Confluence, Zendesk, Salesforce) | Pull (poll) + webhook | Use webhooks for freshness, periodic pulls as a backstop. |
| Databases (Postgres, MySQL, Mongo) | CDC (Debezium, Fivetran) | Change Data Capture beats periodic snapshot for large tables. |
| Message queues (Kafka, Kinesis, Pub/Sub) | Stream consumer | Natural fit for high-volume event streams (logs, transactions). |
| Web (sites, sitemaps, feeds) | Crawler (Scrapy, Firecrawl) | Respect robots.txt; budget concurrency; cache aggressively. |
| Email / chat (Slack, Teams, IMAP) | Push via webhook / app | Watch for PII; usually requires per-thread ACL propagation. |
| Code repos (GitHub, GitLab) | Git clone + webhook | Sparse-checkout, file-type filters, language detection. |
Design rules
- Idempotent. Replaying the same source must not produce duplicates downstream — use deterministic IDs (e.g. sha256(source_url + version)).
- Preserve raw. Land raw bytes in object storage before any transformation. Re-derivation is the cheapest way to fix a bad parser.
- Atomic units. Each ingest unit (a document, a row, an event) carries a stable ID and a source-version stamp.
- Backpressure. A bursty source (a CDC catch-up, a re-crawl) must not knock the embedding service over — queue between stages.
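A minimal sketch of these rules applied to the S3 events → SQS pattern from the table above, assuming boto3. The queue URL is a placeholder and persist_raw / enqueue_for_parsing are hypothetical helpers; the point is the deterministic ID, the raw-first landing, and the queue between stages.

```python
# Sketch: S3-event-driven ingest worker. Queue URL is a placeholder;
# persist_raw / enqueue_for_parsing are hypothetical helpers.
import hashlib
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events"  # placeholder

def deterministic_id(bucket: str, key: str, version_id: str) -> str:
    # Same source + version always yields the same ID, so replays upsert instead of duplicating.
    return hashlib.sha256(f"s3://{bucket}/{key}@{version_id}".encode()).hexdigest()[:16]

def poll_once() -> None:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            raw = obj["Body"].read()
            source_id = deterministic_id(bucket, key, obj.get("VersionId", "null"))
            persist_raw(raw, f"s3://{bucket}/{key}")  # land raw bytes before any transform
            enqueue_for_parsing(source_id)            # queue between stages for backpressure
        # Delete the message only after the raw copy is durably persisted; a replay is harmless.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```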
05 Stage 2 — Parsing & cleaning
Raw bytes become structured text. The quality of this stage caps the quality of every downstream stage — a garbage parser produces garbage chunks.
By format
| Format | Recommended tools | What to extract |
|---|---|---|
| PDF | Unstructured.io, PyMuPDF, pdfplumber, Marker, Reducto, AWS Textract | Text + tables + figures + reading order. Scanned PDFs need OCR (Tesseract, Textract). |
| HTML | Trafilatura, Readability, BeautifulSoup, jusText | Main content, headings, links, code blocks; strip nav/footer/ads. |
| DOCX / PPTX / XLSX | python-docx, python-pptx, openpyxl, Unstructured | Headings, lists, comments, speaker notes, sheet structure. |
| Markdown | markdown-it, mistune, frontmatter | Heading tree → use as natural chunk boundaries; preserve code blocks intact. |
| Code | Tree-sitter, AST parsers per language | Functions, classes, docstrings — chunk by symbol, not by line count. |
| Audio / video | Whisper, AssemblyAI, Deepgram | Transcript with timestamps; speaker diarization for meetings. |
| Images | OCR (Tesseract, GPT-4V, Claude vision), captioning models | OCR text + caption + alt-text + extracted entities. |
| Email | mailparser, BeautifulSoup, custom thread reconstruction | Strip quoted replies, signatures; preserve thread structure. |
Cleaning steps
- Normalize whitespace, line endings, unicode (NFC).
- Strip boilerplate (headers, footers, navigation, cookie banners).
- Detect language; route per-language pipelines if needed.
- Detect and tag content type (prose, code, table, list).
- Redact or tokenize PII at the earliest stage where it's safe to do so.
- Hash content for deduplication keys (exact + near-dup with MinHash/SimHash).
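A sketch of the hashing step, assuming the datasketch package for MinHash-based near-duplicate detection (the 0.9 threshold is illustrative); exact dedup is just a content hash over normalized text.

```python
# Sketch: exact + near-duplicate keys for cleaned text.
# Assumes the `datasketch` package; the LSH threshold is illustrative.
import hashlib
import unicodedata

from datasketch import MinHash, MinHashLSH

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())  # collapse whitespace and line endings

def exact_key(text: str) -> str:
    return hashlib.sha256(normalize(text).encode()).hexdigest()

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in normalize(text).lower().split():
        m.update(token.encode())
    return m

# Index chunks once, then query each new chunk against the LSH index.
lsh = MinHashLSH(threshold=0.9, num_perm=128)
lsh.insert("chunk-1", minhash("How do I reset my password? Open settings and choose Reset."))
near_dupes = lsh.query(minhash("How do I reset my password? Open settings, then choose Reset."))
# `near_dupes` holds keys of indexed chunks above the similarity threshold (may be empty).
```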
06 Stage 3 — Chunking
Chunking splits documents into retrieval units. Too large and the LLM gets diluted context; too small and you lose meaning. Boundaries should respect semantic structure.
Strategies
| Strategy | How | When to use |
|---|---|---|
| Fixed-size | N tokens with M overlap (e.g. 512 / 64) | Baseline; works on uniform prose. |
| Recursive | Split on paragraph → sentence → word until under target size | Default for mixed prose. LangChain's RecursiveCharacterTextSplitter is the workhorse. |
| Structural | Split on headings, sections, list items | Markdown, HTML, structured docs. Preserves hierarchy. |
| Semantic | Split where embedding similarity drops (greedy clustering) | When prose has implicit topic shifts (interviews, transcripts). |
| Sentence-window | Embed single sentence, retrieve neighbors | Precise retrieval with broader context. Good for FAQ / Q&A. |
| Hierarchical / parent-child | Embed small, retrieve large parent | Classic "small-to-big" retrieval. Best of both granularities. |
| Code-aware | Split by function / class via AST | Code corpora — never split mid-function. |
| Late chunking | Embed long context first, slice the embedding | Newer technique — preserves cross-chunk context. |
Sizing rules of thumb
- Embedding model context. Stay well under the embedder's max tokens — quality often degrades long before the limit.
- Target 200–800 tokens per chunk for general prose. Smaller for QA, larger for narrative.
- Overlap 10–20% to preserve boundary context — but not so much that the index doubles in size.
- Always include a headline / breadcrumb at the top of each chunk: "<Doc Title> > <Section>\n\n<chunk text>". Hugely improves retrieval and makes the LLM cite better.
07 Stage 4 — Enrichment & metadata
Each chunk carries metadata used for filtering, ranking, attribution, and access control. Underinvested metadata is one of the most common pipeline failures.
Required metadata
Provenance
- source_id — stable per-document ID
- source_url / path
- version / revision
- ingested_at, updated_at
- parser_version, chunker_version
Access & lifecycle
- tenant_id, org_id, acl
- visibility (public / internal / restricted)
- retention_until / delete_at
- language, region
Content
- doc_type (faq, runbook, code, transcript)
- section_path (breadcrumb)
- tags / topics
- title, summary
- token_count
Optional / derived
- Auto-generated questions the chunk answers
- Named entities (people, products, dates)
- Sentiment / quality score
- Difficulty / audience level
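A representative metadata payload for a single chunk; the values are made up, the keys mirror the categories above.

```python
# Illustrative metadata for one chunk; values are invented, keys follow the
# provenance / access / content / derived categories listed above.
chunk_meta = {
    # Provenance
    "source_id": "kb-4182",
    "source_url": "https://example.com/runbooks/db-failover",
    "version": "2026-01-14T09:30:00Z",
    "ingested_at": "2026-01-14T09:31:07Z",
    "parser_version": "v3",
    "chunker_version": "v2",
    # Access & lifecycle
    "tenant_id": "acme",
    "acl": ["group:sre", "group:oncall"],
    "visibility": "internal",
    "language": "en",
    # Content
    "doc_type": "runbook",
    "section_path": "DB failover > Manual promotion",
    "tags": ["postgres", "incident"],
    "token_count": 412,
    # Derived (optional)
    "questions": ["How do I promote a replica manually?"],
}
```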
08 Stage 5 — Embedding
Convert each chunk to a vector. The embedding model is the largest single quality lever in retrieval — pick deliberately and version explicitly.
Model selection
| Provider / model | Type | Notes |
|---|---|---|
| OpenAI text-embedding-3-large | API, dense | Strong general-purpose; 3072d (truncatable). |
| Voyage voyage-3, voyage-code-3 | API, dense | Often top-of-leaderboard; specialized variants for code, finance, legal. |
| Cohere embed-v3 | API, dense + reranker | Strong multilingual; pairs naturally with Cohere reranker. |
| Google text-embedding-004 | API, dense | Strong on Google Cloud workloads. |
| BGE (BAAI), E5, Nomic Embed | Open weights, dense | Self-host on GPU/CPU; cheap, good quality, no data leaves your VPC. |
| SPLADE, BM25 | Sparse | Use alongside dense for hybrid search — handles rare terms / IDs / acronyms. |
| ColBERT / ColPali | Multi-vector / late interaction | Higher quality at higher index cost. ColPali for visual documents. |
Operational concerns
- Version everything. An embedding from text-embedding-3-large is not comparable to one from 3-small. Tag every vector with embedder_id + version.
- Re-embedding is expensive. Plan for it. Keep the original chunk text addressable so you can re-embed without re-parsing.
- Batch. Most providers charge per request and per token — batch 100s of chunks per call.
- Normalize. Most cosine-similarity stores expect L2-normalized vectors. Check your store's expectations.
- Symmetric vs asymmetric. Some models distinguish between passage and query embeddings — use the right prompt template.
```python
# Batched embedding with versioning
from openai import OpenAI

client = OpenAI()

def embed_batch(chunks):
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=[c.text for c in chunks],
    )
    for chunk, item in zip(chunks, resp.data):
        chunk.vector = item.embedding
        chunk.meta["embedder"] = "openai/text-embedding-3-large"
        chunk.meta["embedder_dim"] = 3072
    return chunks
```
09 Stage 6 — Indexing
Vectors land in a vector store; sparse representations land in a keyword index; metadata lands in a filterable store. In practice you usually want all three, often in the same product.
Vector stores compared
| Store | Type | Best for |
|---|---|---|
| pgvector (Postgres) | Extension | Already-Postgres shops; small to medium scale; transactional + vector in one place. |
| Qdrant | Self-host / cloud | Strong filtering, hybrid, rich payloads. Open source. |
| Weaviate | Self-host / cloud | Hybrid + modules (rerankers, generative); class-based schema. |
| Milvus / Zilliz | Self-host / cloud | Scale-out to billions of vectors; multiple ANN backends. |
| Pinecone | Managed only | Operationally simplest; serverless tier. |
| Vespa | Self-host / cloud | Best-in-class for hybrid + ranking + low-latency at scale. |
| Elasticsearch / OpenSearch | Self-host / cloud | Already-Elastic shops; native hybrid; good filtering. |
| Turbopuffer / LanceDB | Object-storage native | Cheap large-scale; serverless with cold-start tradeoffs. |
Index design
- Hybrid by default. Combine dense (semantic) + sparse (keyword) — neither alone handles both rare terms and paraphrases well.
- Filterable metadata. Tenancy, ACL, doc type, date — first-class filters on the vector store, not post-filtering.
- Tiered storage. Hot tier in memory, cold tier on disk/S3 — reflect access patterns.
- Sharding strategy. Per-tenant shards prevent noisy neighbors but raise operational cost; pick based on traffic shape.
- Reindex playbook. Document the steps to rebuild the index from raw — you will need it.
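A sketch of pushing filters into the store query rather than post-filtering, assuming the qdrant-client Python SDK; the collection name, payload fields, and the embed_query helper are illustrative.

```python
# Sketch: filters applied inside the vector store query, not after retrieval.
# Assumes the qdrant-client SDK; collection, fields, and embed_query are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

query_vector = embed_query("how do I rotate a database password?")  # hypothetical embed helper

hits = client.search(
    collection_name="support_chunks",
    query_vector=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="tenant_id", match=MatchValue(value="acme")),
        FieldCondition(key="visibility", match=MatchValue(value="internal")),
    ]),
    limit=30,  # candidates to hand to the reranker
)
```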
10 Stage 7 — Retrieval & rerank
At query time: rewrite the query, retrieve candidates with hybrid search, rerank, then assemble the LLM context.
The retrieval flow
- Query understanding. Rewrite, expand, decompose. Multi-query generates 3–5 paraphrases. HyDE generates a hypothetical answer and embeds that.
- Filter. Apply ACLs, tenant, freshness window — never skip this.
- Hybrid retrieve. Top-K dense + top-K sparse, then merge (Reciprocal Rank Fusion is the standard).
- Rerank. Cross-encoder rerank the top 30–100 candidates down to top 5–10. Cohere Rerank, Voyage Rerank, BGE Reranker are the default options.
- Assemble context. Add breadcrumbs and citations; deduplicate near-identical chunks; budget tokens.
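A minimal Reciprocal Rank Fusion implementation for the merge in step 3; k = 60 is the constant most implementations use.

```python
# Sketch: Reciprocal Rank Fusion over any number of ranked result lists.
# Each input list is ordered best-first and contains chunk IDs.
def rrf_merge(result_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: merged = rrf_merge([dense_top_k, sparse_top_k]), then rerank `merged`.
```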
Retrieval techniques worth knowing
HyDE
Have the LLM hallucinate a hypothetical answer; embed that instead of the user query. Closes the gap between question-language and answer-language.
Multi-query
Generate N paraphrases of the query, retrieve for each, merge. Trades cost for recall.
Step-back prompting
Generate a more-general query first, retrieve background context, then answer the specific query.
Small-to-big retrieval
Embed small chunks for precision, return their parent doc/section for context.
Self-RAG / CRAG
The model evaluates retrieved docs and decides whether to use, refine, or skip them — guards against bad retrievals.
Graph RAG
Build a knowledge graph from the corpus, retrieve subgraphs. Strong for queries that hop across entities.
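Of these, HyDE is the simplest to sketch. The example below assumes the OpenAI client used earlier; the model name and prompt are illustrative.

```python
# Sketch: HyDE. Embed a hypothetical answer instead of the raw question.
# Model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a short, plausible answer. It may be wrong; it is only used for retrieval."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content
    emb = client.embeddings.create(model="text-embedding-3-large", input=[draft])
    return emb.data[0].embedding  # search the vector store with this instead of the question
```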
11 Agent data pipelines
Agents need more than a vector store. They consume four distinct data substrates — knowledge, tool data, memory, and traces — and each needs its own pipeline.
1. Knowledge (RAG)
Same as above — corpus → chunks → embeddings → retrieval. The agent calls retrieval as a tool.
2. Tool data
Live data from APIs, databases, services. Pipelines here are about schema discovery and response shaping — making tool outputs LLM-readable.
3. Memory
Per-user / per-session state: facts, preferences, past actions. Pipeline writes from conversations, reads back into context.
4. Traces & evals
Every prompt, tool call, retrieval, and outcome flows into an observability pipeline that feeds dashboards, evals, and fine-tuning datasets.
RAG
- One pipeline: docs → embeddings → retrieve
- Stateless per query
- Quality measured by retrieval@k + answer faithfulness

Agents
- Four pipelines: knowledge + tools + memory + traces
- Stateful across turns and sessions
- Quality measured by task success, action correctness, cost
12 Agent context engineering
"Context engineering" is the practice of deciding what enters the model's context window on each step. It is the agent equivalent of feature engineering.
Context inputs (per agent step)
| Source | Lifetime | Selection strategy |
|---|---|---|
| System prompt | Static | Versioned; A/B tested. |
| Tool catalog | Per-session | Filter to tools relevant to the task; over-stuffing the catalog wastes tokens and hurts selection accuracy. |
| RAG retrieval | Per-step | Hybrid + rerank; filter by tenancy and freshness. |
| Long-term memory | Per-user | Retrieve top-N relevant memories by embedding similarity + recency. |
| Working memory / scratchpad | Per-task | Carry intermediate results, plans, sub-task state. |
| Tool outputs | Per-call | Truncate / summarize large outputs before re-feeding to the model. |
| Conversation history | Per-session | Sliding window + summary of older turns. |
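A sketch of assembling one step's context under a hard token budget; count_tokens is a stand-in for a real tokenizer and the priority ordering is an assumption.

```python
# Sketch: per-step context assembly under a token budget.
# count_tokens is a crude proxy; swap in a real tokenizer for production.
def count_tokens(text: str) -> int:
    return len(text.split())

def build_context(system_prompt: str, candidates: list[tuple[str, str]], budget: int = 8000) -> str:
    """candidates: (label, text) pairs ordered by assumed priority, e.g.
    tool catalog, recalled memories, reranked chunks, summarized history."""
    parts = [system_prompt]
    remaining = budget - count_tokens(system_prompt)
    for label, text in candidates:
        cost = count_tokens(text)
        if cost > remaining:
            continue  # drop (or summarize) what doesn't fit instead of truncating mid-chunk
        parts.append(f"## {label}\n{text}")
        remaining -= cost
    return "\n\n".join(parts)
```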
13 Agent memory pipelines
Memory is data that persists across sessions. Treat it as a write-heavy pipeline with its own ingestion, scoring, and eviction.
Memory types
| Type | What | Storage |
|---|---|---|
| Episodic | Events: "user did X on date Y" | Append-only event log + index |
| Semantic | Facts: "user prefers metric units" | Key-value or document store + vector index |
| Procedural | How-tos: learned tool sequences, playbooks | Versioned prompt / skill registry |
| Working | Current task scratchpad | In-memory or short-TTL store |
The memory write pipeline
- Extract. After each turn, an LLM call extracts candidate memories from the conversation.
- Score. Confidence + utility + novelty. Drop low-value candidates.
- Deduplicate. Embed and check against existing memories — update if similar exists, insert otherwise.
- Tag. Attach user_id, session_id, source, expiry.
- Persist. Write to memory store + vector index.
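A sketch of the deduplicate-then-persist steps, assuming numpy for cosine similarity; memory_store and its search/update/insert methods are hypothetical, and the 0.85 threshold is illustrative.

```python
# Sketch: memory write with embedding-based dedup. `memory_store` and its
# methods are hypothetical; thresholds are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def write_memory(user_id: str, text: str, vector: np.ndarray, confidence: float) -> None:
    if confidence < 0.5:
        return  # drop low-value candidates at the scoring step
    for existing in memory_store.search(user_id=user_id, vector=vector, top_k=3):
        if cosine(vector, existing.vector) > 0.85:
            # similar memory already exists: update rather than insert a near-duplicate
            memory_store.update(existing.id, text=text,
                                confidence=max(confidence, existing.confidence))
            return
    memory_store.insert(user_id=user_id, text=text, vector=vector, confidence=confidence)
```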
The memory read pipeline
- Embed the current context.
- Hybrid retrieve from memory store, filter by user_id and validity.
- Rerank by relevance + recency + confidence.
- Inject top-N into the system or user prompt with clear provenance ("Recalled from your prior session: …").
14 Batch vs streaming
The biggest architectural decision after "what's in the corpus" is "how fresh does it need to be." Batch is cheaper and simpler; streaming is fresher and operationally heavier.
Batch
- Run hourly / daily / weekly
- Simple to reason about, easy to backfill
- Cheap — no always-on infra
- Freshness lag = batch interval
- Good for: docs, knowledge bases, slowly-changing corpora

Streaming
- Process events as they arrive
- Sub-minute freshness
- Always-on infra, harder to test
- Backfills require replay infrastructure
- Good for: tickets, logs, transactional data, live chat
Hybrid: micro-batch
The pragmatic middle ground: a streaming consumer batches events into 10–60 second windows, then runs the batch pipeline on each window. You get near-real-time freshness with batch's testability.
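A sketch of the micro-batch loop; event_queue and run_batch_pipeline are placeholders for whatever consumer and batch pipeline you already have.

```python
# Sketch: micro-batching. Drain events into fixed windows, then run the
# batch pipeline per window. `run_batch_pipeline` is a placeholder.
import queue
import time

def micro_batch_loop(event_queue: "queue.Queue", window_seconds: int = 30, max_batch: int = 500) -> None:
    while True:
        batch, deadline = [], time.monotonic() + window_seconds
        while time.monotonic() < deadline and len(batch) < max_batch:
            try:
                batch.append(event_queue.get(timeout=1))
            except queue.Empty:
                continue
        if batch:
            run_batch_pipeline(batch)  # same testable code path as the scheduled batch job
```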
15 Freshness & incremental updates
Production pipelines must be incremental — full reindexes are an anti-pattern past a few thousand docs. The hard problem is doing inserts, updates, and deletes safely.
Update patterns
| Operation | Trigger | Pattern |
|---|---|---|
| Insert | New document arrives | Run full pipeline on the new doc; upsert chunks by deterministic ID. |
| Update | Source doc changes | Re-parse, re-chunk, diff against existing chunks; upsert changed, delete removed. |
| Delete | Source doc removed or expires | Hard-delete or tombstone all chunks with that source_id. |
| Re-embed | Embedder version change | Background backfill; route queries to old version until new version reaches parity. |
| Re-chunk | Chunker version change | Reprocess from raw bytes (kept in object storage) — chunker should be deterministic given config. |
Soft delete vs hard delete
Tombstone (soft delete) is usually safer in production — set deleted_at, exclude in retrieval, GC asynchronously. Hard delete is required for GDPR / right-to-be-forgotten requests; build the GC path to actually remove vectors and audit-log the action.
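A sketch of the tombstone-then-GC flow with a separate hard-delete path for right-to-be-forgotten requests; the vector_store and audit_log interfaces are hypothetical.

```python
# Sketch: soft delete by default, hard delete for GDPR. The vector_store and
# audit_log interfaces are hypothetical.
import time

def soft_delete(source_id: str) -> None:
    for chunk_id in vector_store.ids_for(source_id=source_id):
        vector_store.set_payload(chunk_id, {"deleted_at": time.time()})  # excluded by retrieval filter

def gc_tombstones(older_than_days: int = 30) -> None:
    cutoff = time.time() - older_than_days * 86400
    for chunk_id in vector_store.ids_where(deleted_at_lt=cutoff):
        vector_store.delete(chunk_id)

def hard_delete(source_id: str, reason: str = "gdpr_request") -> None:
    ids = list(vector_store.ids_for(source_id=source_id))
    vector_store.delete_many(ids)  # remove vectors immediately, no tombstone
    audit_log.write(action="hard_delete", source_id=source_id, count=len(ids), reason=reason)
```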
16 Data quality & lineage
Quality controls and lineage tracking turn a research-grade pipeline into a production-grade one.
Quality controls to add
- Deduplication. Exact (hash) + near-dup (MinHash, SimHash) at chunk level. Index size shrinks 20–50% on real corpora.
- PII scanning. Detect and redact (or block) sensitive data before embedding. Tools: Presidio, AWS Macie, Nightfall.
- Toxic/harmful content filters at ingest — don't let bad data into retrieval.
- Language detection. Route per language or filter to supported languages.
- Freshness SLA. Track p50 / p95 ingestion-to-queryable latency; alert on regression.
- Chunk health metrics. Distribution of chunk sizes, % empty, % code, duplicate rate.
Lineage
For every chunk in the index, you should be able to answer:
- What raw document did this come from?
- What parser / chunker / embedder version produced it?
- When was it ingested? When last updated?
- Who has permission to retrieve it?
Lineage metadata enables incident response, compliance, and intelligent reprocessing. Tools: OpenLineage, DataHub, Marquez.
17 Observability & eval
A pipeline you can't measure is a pipeline you can't improve. Instrument every stage and every retrieval.
Pipeline metrics
Ingestion
- Documents processed / sec
- Parse failure rate per format
- Chunks emitted per doc (distribution)
- End-to-end latency (source → indexed)
Retrieval
- Latency p50/p95/p99
- Recall@k against a labeled set
- nDCG@k after rerank
- % queries with zero results
- Cache hit rate
Generation
- Faithfulness / groundedness score
- Citation accuracy
- Refusal / hallucination rate
- Token cost per answer
Agent
- Task success rate
- Tool-call accuracy
- Steps to completion (distribution)
- Cost per task
- Memory hit rate / utility
Eval frameworks
Ragas, Braintrust, Phoenix, and Langfuse (see the reference stack in section 19) cover most needs; whichever you pick, wire it into CI.
What to evaluate
- Retrieval quality — labeled query/passage set; recall@k, MRR, nDCG.
- Answer faithfulness — does the answer follow from the retrieved context?
- Answer relevance — does it actually answer the question?
- Context precision — were the retrieved chunks actually used?
- Agent task success — did the multi-step workflow reach the right outcome?
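A minimal sketch of recall@k and MRR over a labeled eval set; each item pairs a query's relevant chunk IDs with the ranked retrieval results.

```python
# Sketch: recall@k and MRR over a labeled eval set.
# Each item: {"relevant": set of chunk IDs, "retrieved": ranked list of chunk IDs}.
def recall_at_k(items: list[dict], k: int = 10) -> float:
    hits = sum(bool(item["relevant"] & set(item["retrieved"][:k])) for item in items)
    return hits / len(items)

def mrr(items: list[dict]) -> float:
    total = 0.0
    for item in items:
        for rank, chunk_id in enumerate(item["retrieved"], start=1):
            if chunk_id in item["relevant"]:
                total += 1.0 / rank
                break
    return total / len(items)
```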
18 Orchestration tools
You can hand-roll pipelines, but past a certain scale you'll want orchestration. The choice shapes how easy retries, backfills, and observability are.
| Tool | Style | Best for |
|---|---|---|
| Airflow | DAG, batch | Scheduled batch ETL; mature ecosystem; verbose for ML. |
| Dagster | Software-defined assets | Modern alternative to Airflow; first-class data lineage. |
| Prefect | Pythonic flows | Lightweight; great for prototypes that grow into production. |
| Temporal | Durable workflows | Long-running, retry-heavy pipelines (e.g. crawl → parse → embed at scale). |
| Kafka + Flink / Spark Streaming | Streaming | High-throughput event-driven ingestion. |
| Ray Data | Distributed Python | GPU-heavy stages (embedding, parsing with vision models). |
| Modal / Replicate / RunPod | Serverless GPU | Burst embedding / batch inference without managing GPUs. |
| LlamaIndex / LangChain | Framework, in-process | Glue layer for parsing/chunking/embedding/retrieval; not a substitute for an orchestrator at scale. |
| Unstructured.io | Managed pipeline | Specialized in the parsing + chunking stages with broad format support. |
19 Reference tech stack
A representative production stack as of 2026. Substitute components freely — every layer has multiple credible options.
| Layer | Choice | Why |
|---|---|---|
| Source of truth | S3 + Postgres | Raw bytes addressable forever; metadata transactional. |
| Ingestion | S3 events → SQS → workers | Decoupled, retry-safe, scales linearly. |
| Orchestration | Dagster (batch) + Temporal (long workflows) | Asset lineage + durable execution. |
| Parsing | Unstructured + Marker for PDFs | Layout-aware; handles tables and figures. |
| Chunking | Recursive + structural by doc type | One size doesn't fit all formats. |
| Embedding | Voyage-3 (general) + Voyage-Code (code) | Specialized embeddings beat one-size-fits-all. |
| Vector store | Qdrant or pgvector (small) / Vespa (large) | Hybrid + filtering + scale. |
| Reranker | Cohere Rerank or BGE Reranker | +10–20 points on nDCG, worth the latency. |
| Memory store | Postgres + pgvector + Redis | Transactional facts, vector search, fast working memory. |
| Observability | Langfuse / Phoenix + OpenTelemetry | Trace every prompt / tool / retrieval. |
| Eval | Ragas + Braintrust | Offline metrics + production CI. |
| Agent framework | LangGraph / Anthropic Agent SDK / OpenAI Agents | Pick one and own the abstractions. |
20 End-to-end example
A minimal but production-shaped pipeline for a customer-support RAG system. Adapt the details — the structure generalizes.
```python
# pipeline.py — simplified end-to-end
from dataclasses import dataclass
import hashlib

@dataclass
class Chunk:
    id: str
    text: str
    source_id: str
    section_path: str
    meta: dict
    vector: list | None = None

def stable_id(source_id: str, section: str, idx: int) -> str:
    return hashlib.sha256(f"{source_id}:{section}:{idx}".encode()).hexdigest()[:16]

def run(source_uri: str):
    # 1. Ingest — fetch raw bytes, persist immutable copy
    raw = ingest(source_uri)                # bytes
    raw_uri = persist_raw(raw, source_uri)  # s3://raw/...

    # 2. Parse — extract structured text
    doc = parse(raw, source_uri)            # Document(sections=[...])

    # 3. Chunk — structural + recursive
    chunks = []
    for section in doc.sections:
        for i, piece in enumerate(recursive_split(section.text, target=600, overlap=80)):
            chunks.append(Chunk(
                id=stable_id(doc.source_id, section.path, i),
                text=f"{doc.title} > {section.path}\n\n{piece}",
                source_id=doc.source_id,
                section_path=section.path,
                meta={
                    "source_uri": source_uri,
                    "updated_at": doc.updated_at,
                    "acl": doc.acl,
                    "doc_type": doc.doc_type,
                    "parser_version": "v3",
                    "chunker_version": "v2",
                },
            ))

    # 4. Embed — batch
    for batch in batched(chunks, 100):
        embed_batch(batch)  # sets .vector + meta["embedder"]

    # 5. Index — upsert keyed by stable id (idempotent)
    vector_store.upsert(chunks)
    keyword_index.upsert(chunks)

    # 6. Tombstone removed chunks
    existing_ids = vector_store.ids_for(source_id=doc.source_id)
    new_ids = {c.id for c in chunks}
    for dead in existing_ids - new_ids:
        vector_store.tombstone(dead)

    # 7. Emit lineage event
    emit_lineage(doc.source_id, len(chunks), embedder="voyage-3")
```
Note the design: deterministic IDs make every step idempotent; raw bytes are preserved so you can re-derive without re-fetching; tombstones handle deletes; lineage events feed observability.
21 Anti-patterns
Mistakes that look reasonable but cause grief in production. Recognize and avoid.
| Anti-pattern | Why it hurts | Do this instead |
|---|---|---|
| Random UUIDs as chunk IDs | Re-running ingest creates duplicates instead of upserts. | Deterministic IDs from (source_id, section, position). |
| Embedding without versioning | Cannot tell which embedder produced a vector; cannot mix or migrate cleanly. | Tag every vector with embedder_id + version. |
| Throwing away raw bytes | Cannot reprocess after a parser bug fix without re-fetching everything. | Always persist raw to object storage; everything downstream is derivable. |
| Token-only chunking | Splits mid-sentence, mid-table, mid-function — every chunk is degraded. | Recursive + structural; respect natural boundaries. |
| Skipping the breadcrumb | Retrieved chunk has no idea what document or section it came from. | Prepend "Doc > Section\n\n…" to every chunk's stored text. |
| Dense-only retrieval | Misses queries containing rare terms, IDs, codes, or acronyms. | Hybrid (dense + sparse) with RRF. |
| Filtering after retrieval | You retrieve chunks the user can't see, then drop them — recall collapses. | Push filters (ACL, tenancy, freshness) into the vector store query. |
| Full reindex on every change | Doesn't scale past a few thousand docs; downtime; cost. | Idempotent incremental upserts. |
| No eval set | Cannot tell whether a change improved or regressed retrieval. | Maintain a labeled query set; CI it. |
| Dumping everything into one corpus | Multi-tenant data leaks; retrieval relevance drops. | Separate indexes or strong per-tenant filters. |
| Memory as a black box | Cannot debug why the agent "remembered" something wrong. | Audit-log every memory write; surface provenance in the UI. |
22 Production checklist
Before calling a pipeline production-ready, you should be able to check every item below.
Correctness
- Ingest is idempotent — replay produces no duplicates.
- Updates and deletes propagate to all indexes.
- Every chunk has provenance metadata (source, version, timestamps).
- ACLs and tenant filters are applied at query time, not after.
- Embedder and chunker versions are tracked per chunk.
Quality
- Eval set with labeled queries exists; nDCG / recall reported on every change.
- Faithfulness / groundedness measured on a regression set.
- Dedup pass runs at ingest (exact + near-dup).
- PII redaction or block list active on all sources.
Operability
- Raw bytes preserved → reprocessable from scratch.
- Tracing on every prompt, tool call, retrieval (OpenTelemetry).
- Dashboards for ingestion lag, retrieval latency, eval scores.
- Reindex playbook documented and rehearsed.
- Cost dashboards per stage (embedding $, vector store $, LLM $).
Safety
- Right-to-be-forgotten (hard delete) path implemented and tested.
- Memory writes are audit-logged.
- Per-tenant isolation verified by automated test.
- Rate limits and cost caps on agent loops.