AI Engineering — Study Topics
A comprehensive reference covering LLM fundamentals through advanced autonomous systems, enterprise integration, and beyond.
LLM Fundamentals
Encoder-only models (BERT, RoBERTa, DeBERTa) use bidirectional self-attention — every token can attend to every other token in both directions simultaneously. This gives them deep contextual understanding of input text. They excel at understanding tasks: classification, named entity recognition, sentiment analysis, and generating embeddings. However, they cannot generate text autoregressively because every token sees the full context, making them unsuitable for open-ended generation.
Decoder-only models (GPT-4, Claude, Llama, Mistral, Gemini) use causal (masked) self-attention — each token can only attend to previous tokens, not future ones. This constraint enables autoregressive generation: the model predicts the next token given all previous tokens, then that token becomes part of the context for predicting the next one. The same architecture handles both "understanding" (by processing the prompt) and "generation" (by producing new tokens). This unified approach, combined with scale, is why decoder-only models dominate modern LLMs.
Encoder-decoder models (T5, BART, original Transformer, Flan-T5) have two separate stacks: an encoder that processes the input with bidirectional attention (like BERT) and a decoder that generates output autoregressively (like GPT) while also attending to the encoder's output via cross-attention. This architecture is ideal for sequence-to-sequence tasks: translation, summarization, and question answering where input and output are clearly separated. The encoder fully "understands" the input before the decoder generates anything.
How they actually work differently: In a decoder-only model, input and output live in the same sequence:
[prompt tokens | generated tokens], with causal masking ensuring generation only sees what came before. For translation, you'd prompt "Translate to French: Hello" and the model continues with "Bonjour." In an encoder-decoder model, the encoder processes "Hello" completely (bidirectionally), producing rich contextual representations, then the decoder generates "Bonjour" while cross-attending to those encoder outputs at every step. The cross-attention layer is the key innovation — it lets the decoder "look back" at the fully-processed input at any generation step.
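The causal masking described above can be sketched in a few lines. This is a toy, pure-Python illustration (real implementations operate on GPU tensors and fuse the mask into the softmax); the function name is ours.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square attention-score matrix with a causal
    mask: position i may only attend to positions j <= i (the "past")."""
    n = len(scores)
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # drop future positions
        m = max(visible)                           # for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero attention weight.
        out.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return out

weights = causal_attention_weights([[0.0] * 4 for _ in range(4)])
# Row 0 attends only to token 0; row 3 spreads weight evenly over tokens 0-3.
```

An encoder would skip the mask entirely, letting every row attend to all four positions; that single difference is what separates bidirectional from autoregressive attention.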
Why decoder-only won: Despite encoder-decoder being theoretically better suited for seq2seq tasks, decoder-only architectures dominate because: (1) Simplicity and scale — one architecture for all tasks, easier to scale to hundreds of billions of parameters. (2) In-context learning — decoder-only models learned to handle translation, summarization, and Q&A through prompting alone, without needing separate encoder-decoder training. (3) Training efficiency — autoregressive next-token prediction is a simpler, more scalable objective than the encoder-decoder's masked language modeling + denoising objectives. (4) Unified inference — no separate encoding pass needed; the KV-cache optimization makes generation fast.
When to still use each: Use encoder-only (BERT) for embeddings, classification, and retrieval where you need rich bidirectional representations without generation. Use encoder-decoder (T5, BART) for specialized translation and summarization tasks where input-output separation matters and training data is structured as seq2seq pairs. Use decoder-only (GPT, Claude) for everything else — general-purpose chat, reasoning, coding, agents, and any task that benefits from few-shot prompting and instruction following. In practice, the vast majority of modern LLM applications use decoder-only models.
The core building block is multi-head attention, which projects the input into queries, keys, and values across multiple "heads," letting the model learn different types of relationships simultaneously. Since attention has no inherent notion of order, positional encodings (sinusoidal or learned) are added to the input embeddings so the model understands token positions.
The original Transformer uses an encoder-decoder layout (useful for translation), but most LLMs today are decoder-only (GPT-style) or encoder-only (BERT-style). Decoder-only models generate text autoregressively, predicting the next token given all previous tokens using causal (masked) attention.
Embeddings are learned dense vector representations of tokens. Each token ID maps to a high-dimensional vector (e.g., 768 or 4096 dimensions) in a continuous space where semantically similar tokens end up close together. The embedding layer is trained jointly with the rest of the model.
The key distinction: tokenization is a deterministic preprocessing step (text → integer IDs), while embeddings are learned continuous representations (integer IDs → dense vectors). Vocabulary size directly impacts model size and influences how the model handles rare or multilingual text.
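The distinction can be made concrete with a toy sketch. The four-word vocabulary and 3-dimensional vectors below are invented for illustration; real models use subword tokenizers and 768-4096 dimensions.

```python
# Tokenization: a deterministic text -> integer-ID mapping.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

def tokenize(text: str) -> list[int]:
    # Same text always yields the same IDs; no learning involved.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# Embeddings: a learned ID -> dense-vector lookup table. In a real model
# these values are trained jointly with the rest of the network.
embedding_table = [
    [0.1, -0.2, 0.3],   # "the"
    [0.5, 0.1, -0.4],   # "cat"
    [-0.3, 0.2, 0.2],   # "sat"
    [0.0, 0.0, 0.0],    # "<unk>"
]

ids = tokenize("The cat sat")                 # [0, 1, 2]
vectors = [embedding_table[i] for i in ids]   # IDs -> dense vectors
```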
Chain-of-thought (CoT) prompting encourages the model to "think step by step," dramatically improving performance on reasoning, math, and logic tasks. System prompts set the persona, constraints, and behavioral guidelines.
Advanced techniques include self-consistency (sampling multiple reasoning paths and taking the majority answer), tree-of-thought (exploring branching reasoning), and ReAct-style prompting (interleaving reasoning and actions).
Top-k sampling restricts selection to the k most probable tokens, preventing the model from sampling extremely unlikely tokens. Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9), adapting the candidate pool based on the distribution's shape. In practice, top-p is generally preferred because it adapts to different probability distributions.
Interaction effects matter: combining temperature with top-p gives finer control. For deterministic tasks (classification, data extraction), use temperature=0 with fixed top-p. For creative tasks, use temperature=0.7–1.0. In production, monitor output quality: if too repetitive, increase temperature; if too incoherent, decrease it.
Common pitfall: temperature affects all tasks, not just generation. Low temperature improves factual accuracy and consistency, while high temperature introduces creative variation but also hallucination. Your application should tune these based on task requirements, not user preference alone.
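How temperature and top-p interact at sampling time can be sketched in pure Python. Treating temperature=0 as greedy argmax is our simplification, mirroring common API behavior rather than any provider's exact implementation.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) filtering.

    temperature=0 -> greedy argmax. Otherwise logits are divided by
    temperature, the smallest set of tokens whose cumulative probability
    exceeds top_p is kept, and one token is sampled from that nucleus.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort by probability; keep the smallest set whose mass exceeds top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    rng = rng or random.Random()
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

logits = [2.0, 1.0, 0.1, -3.0]
greedy = sample_token(logits, temperature=0)   # always the argmax, token 0
sampled = sample_token(logits, temperature=0.8, top_p=0.9,
                       rng=random.Random(42))  # reproducible given a seed
```

Passing a seeded generator, as in the last call, is the reproducibility pattern described above: log the seed and parameters, and a wrong response can be replayed exactly.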
In production, use deterministic outputs for data extraction, classification, and structured tasks where consistency matters. Use stochastic outputs for content generation, brainstorming, and creative work where variation is desired. Caching layers can enforce consistency by returning stored responses for repeated identical prompts, bypassing the LLM entirely for common queries.
Trade-offs: deterministic = predictable but less creative. Stochastic = varied but unpredictable. Sometimes you want hybrid: deterministic extraction (facts), stochastic explanation (phrasing). Cache identical requests to amortize cost and improve latency while maintaining freshness for new queries.
Testing: reproducibility is critical for debugging. Log the random seed and sampling parameters; if a response is wrong, you should be able to reproduce it. For production systems serving many users, consider allowing per-user randomness settings (some users prefer consistent assistants, others want variation).
LoRA (Low-Rank Adaptation) freezes the original model weights and injects small trainable rank-decomposition matrices into each transformer layer. This reduces trainable parameters by 90-99%, making fine-tuning feasible on a single GPU. QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of 65B+ models on consumer hardware. Adapter weights are small relative to the base model, typically a few megabytes to a few hundred megabytes depending on rank and which layers are adapted.
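The LoRA update rule can be sketched with toy matrices: the frozen weight W stays fixed and only a small pair A (r x d_in) and B (d_out x r) is trained, with effective weight W + (alpha/r) * B @ A. This illustrates the math only; real implementations use GPU tensor libraries and apply the update per attention projection.

```python
def matmul(X, Y):
    # Naive matrix multiply for illustration.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha, r):
    """W + (alpha / r) * B @ A, the low-rank (rank <= r) LoRA update."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 2x2 toy example with rank r=1. With d=1024 and r=8, A and B together
# hold ~16K values versus ~1M for W -- the source of the parameter savings.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]    # d_out x r
A = [[0.5, 0.5]]      # r x d_in
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
# delta = B @ A = [[0.5, 0.5], [1.0, 1.0]], scaled by alpha/r = 2
```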
When to fine-tune vs prompt engineer: fine-tuning when you need consistent style, domain-specific language, structured output formats, or when few-shot prompting isn't enough. Prompt engineering first — it's cheaper, faster, and doesn't require training infrastructure. Fine-tuning is a last resort for specific, well-defined tasks where prompting plateaus.
Production considerations: version your fine-tuned models, track training data provenance, evaluate on held-out test sets before deployment. Use platforms like Hugging Face, Together AI, or OpenAI's fine-tuning API. Monitor for distribution drift — fine-tuned models can degrade as input patterns change. Always maintain a baseline comparison with the un-tuned model.
However, longer contexts don't mean better attention. The "lost in the middle" phenomenon shows models attend best to information at the beginning and end of the context, often missing details in the middle. Needle-in-a-haystack tests evaluate retrieval accuracy at different context positions. Models vary significantly in their ability to use long contexts effectively.
Technical implications: cost scales linearly with context length (more tokens = more money), latency increases (especially time-to-first-token), and memory requirements grow quadratically with naive attention. Techniques like Flash Attention, Ring Attention, and sliding window attention mitigate computational costs. KV-cache size grows linearly with context, requiring significant GPU memory.
Practical guidance: don't use maximum context just because it's available. Place critical information at the start or end of the prompt. For very long documents, consider chunked RAG even with long-context models — retrieval can outperform stuffing. Use context length as a fallback, not a primary strategy. Monitor token usage and costs carefully.
Implementation approaches: JSON mode (OpenAI, Anthropic) constrains the model to output valid JSON. Schema-constrained generation (like OpenAI's Structured Outputs) uses a JSON Schema to guarantee exact field names, types, and required properties. Grammar-based sampling (llama.cpp, Outlines) constrains token generation at the sampling level, ensuring structural validity token-by-token.
Best practices: provide the schema in the system prompt with clear field descriptions. Use enums for categorical fields. Include examples of expected output. Validate responses against the schema programmatically (even with JSON mode, edge cases exist). For complex outputs, break into smaller structured calls rather than one massive schema.
Common pitfalls: models may produce valid JSON that's semantically wrong (right format, wrong content). Schema-constrained generation adds latency. Very complex nested schemas reduce output quality. Always have fallback parsing logic. In production, log schema validation failures as signals for prompt improvement.
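Programmatic validation with a fallback signal can be sketched as follows. The field names and categories are hypothetical, and a real system might use a JSON Schema validator library instead of hand-rolled checks.

```python
import json

# Hypothetical expected shape for an extraction call.
REQUIRED_FIELDS = {"name": str, "category": str, "confidence": float}
ALLOWED_CATEGORIES = {"bug", "feature", "question"}   # enum-style constraint

def parse_llm_output(raw: str):
    """Validate a model response even when 'JSON mode' was enabled.

    Returns (data, None) on success or (None, error_message) so the caller
    can log the failure and fall back to a retry or a simpler parser.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            return None, f"missing field: {field}"
        if not isinstance(data[field], ftype):
            return None, f"wrong type for {field}"
    if data["category"] not in ALLOWED_CATEGORIES:
        return None, f"unknown category: {data['category']}"
    return data, None

ok, err = parse_llm_output(
    '{"name": "login fails", "category": "bug", "confidence": 0.9}')
bad, err2 = parse_llm_output(
    '{"name": "x", "category": "spam", "confidence": 0.5}')
```

Logging `err2`-style rejections gives exactly the schema-validation-failure signal for prompt improvement mentioned above.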
Open-source models (Llama 3, Mistral, Phi-3, Qwen) offer full control, no API dependency, and can be self-hosted. They shine when data privacy is paramount, latency must be minimized, or cost at scale makes API pricing prohibitive. Trade-off: you manage infrastructure, fine-tuning, and updates.
Selection criteria: Task complexity (complex reasoning needs frontier; classification/extraction can use smaller), latency requirements (smaller models are faster), data sensitivity (open-source for on-premise), cost at scale (calculate monthly spend at expected volume), context length needs, and multimodal requirements (vision, audio).
Practical approach: benchmark on YOUR data, not public leaderboards. Create an eval dataset of 50-100 representative examples, test 3-4 models, compare quality and cost. Use model routing — send simple queries to cheap models, complex queries to expensive ones. Re-evaluate quarterly as models improve rapidly.
RAG (Retrieval-Augmented Generation)
This pattern solves the knowledge-cutoff problem and reduces hallucinations by grounding generation in retrieved source documents. The retrieval phase uses dense vector similarity or hybrid search (combining BM25 and semantic matching) to find relevant chunks. Advanced retrieval includes query rewriting, hypothetical document embeddings (HyDE), and iterative refinement for complex information needs.
Common failure modes include retrieving irrelevant chunks, exceeding the context window with too many results, and the model ignoring or misusing retrieved context. Advanced RAG addresses these with re-ranking using cross-encoders, query decomposition for multi-hop questions, and explicit prompting that emphasizes source usage and citation.
Production RAG systems require monitoring retrieval quality (precision, recall, MRR), tracking latency costs of embedding and vector search, and regular updates as source documents change. Frameworks like LlamaIndex abstract much of this complexity but require understanding trade-offs between speed and quality.
Sliding window creates overlapping chunks to ensure no information falls between boundaries, critical for maintaining context. The optimal chunk size depends on the embedding model's context length and typical query patterns — typically 256–1024 tokens. Larger chunks preserve more context but dilute relevance; smaller chunks improve precision but may fragment important information.
Advanced strategies include parent-child chunking (store small chunks for retrieval, larger parents for context) and metadata-aware chunking that preserves document structure (headings, tables). Domain matters significantly: technical docs benefit from semantic chunking, while structured data (tables, lists) may need specialized handling.
Chunk size is not static—experimentally validate against your eval dataset. Common failure: choosing chunk sizes that optimize for retrieval metrics but lose context when injected into prompts, or that don't align with embedding model training corpus characteristics.
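A sliding-window chunker is only a few lines; the sizes below are illustrative defaults, not recommendations, and should be validated against your eval set as above.

```python
def sliding_window_chunks(tokens, chunk_size=256, overlap=32):
    """Overlapping fixed-size chunks, so information spanning a boundary
    appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # last chunk already covers the tail
    return chunks

chunks = sliding_window_chunks(list(range(1000)), chunk_size=256, overlap=32)
# Consecutive chunks share 32 tokens, so nothing falls between boundaries.
```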
Key technical considerations: embedding dimension (typically 384–3072), max input length (varies by model), and inference latency. The choice affects both retrieval quality and storage costs. Changing the embedding model requires re-indexing all documents, a potentially expensive operation. Evaluate on your specific domain using benchmarks like MTEB or internal eval datasets with relevance judgments.
Different models have different strengths: general-purpose models work well for diverse queries, while domain-specific embeddings (legal, medical, scientific) often outperform general models. Multilingual support matters if handling non-English content. Some models are optimized for short queries and documents, others for longer context.
Production considerations: monitor embedding latency (especially if embedding user queries at query time), cache embeddings when possible, and periodically re-evaluate model choice as new models emerge. Budget for storage: a 1M document corpus with 1536-dim embeddings requires ~6 GB of storage.
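The storage figure above comes from simple arithmetic, sketched here for raw float32 vectors; index structures and metadata add overhead on top.

```python
def embedding_storage_gb(num_docs: int, dims: int,
                         bytes_per_value: int = 4) -> float:
    """Raw vector storage: documents x dimensions x bytes per value."""
    return num_docs * dims * bytes_per_value / 1e9

size = embedding_storage_gb(1_000_000, 1536)   # ~6.1 GB for float32
# Halving precision (float16, bytes_per_value=2) halves storage, one reason
# quantized or reduced-dimension embeddings are popular at scale.
```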
Technical architecture: vectors are indexed using algorithms like HNSW (Hierarchical Navigable Small World, popular for recall) or IVF (Inverted File, better for very large scale). These trade off exact-search accuracy for speed. Metadata filtering allows associating vectors with structured fields (doc ID, source, date) for exact filtering during retrieval.
Key operational features: real-time upserts (insert/update), scalability to billions of vectors, multi-tenancy (namespaces), backup/disaster recovery, and query performance under load. Most provide Python SDKs and REST APIs. Hybrid search combines vector similarity with keyword search (BM25), critical for many RAG applications where exact term matching matters.
Production trade-offs: managed services (Pinecone) are operationally simple but lock you in and cost more at scale; self-hosted (Weaviate, Qdrant) require infrastructure management but offer flexibility and cost savings. Evaluate on your access patterns (query QPS, update frequency) and scale requirements (documents, vectors, namespace count).
Results are combined using Reciprocal Rank Fusion (RRF), which merges rankings from both signals using reciprocal ranks, sidestepping score normalization and parameter tuning. Alternatively, you can use weighted averaging: hybrid_score = alpha * vector_score + (1 - alpha) * bm25_score. The key challenge is tuning the balance parameter (alpha) for your domain and query distribution.
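RRF itself is only a few lines. This sketch uses the commonly cited k=60 constant and hypothetical document IDs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists: each document scores sum(1 / (k + rank)),
    with rank starting at 1. No per-domain weight tuning is required,
    unlike the alpha-weighted average."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: -scores[d])

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from the embedding index
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # from the keyword index
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_b ranks first: 1/62 + 1/61 edges out doc_a's 1/61 + 1/63.
```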
Most modern vector databases (Weaviate, Qdrant, Pinecone) support hybrid search natively. Implementation requires having both a vector index (embedding-based) and an inverted index (keyword-based). Some systems (Elasticsearch) have vector support built-in; others use dedicated vector DBs with external keyword indexing.
When to use hybrid: financial documents (exact terms crucial), customer support (both query intent and specific issue names matter), code search (exact function names + semantic logic). Purely semantic works better for conversational FAQs and creative writing. Evaluate empirically on your eval set.
Popular re-ranker models: Cohere Rerank (commercial API), cross-encoder/ms-marco-MiniLM-L-12-v2 (open-source), and mxbai-rerank-large-v1. The two-stage approach balances speed and accuracy: fast dense retrieval gets candidates, then expensive cross-encoder reranking refines them. Typical workflow: retrieve 50 candidates in ~10ms, rerank the top 50 in 100–500ms depending on model.
Re-ranking often produces the single biggest quality improvement for minimal effort, commonly improving nDCG by 10-30% on retrieval benchmarks. Integration is straightforward: call your vector DB for top-50, then call the re-ranker API on those chunks, sort by re-ranker score, and return the top-5.
Considerations: cost (cross-encoders require API calls per query), latency (slower than retrieval alone), and whether to rerank every call or only for ambiguous cases. For low-latency requirements, use a smaller cross-encoder. Some RAG systems cache re-ranking results for repeated queries.
Other proven techniques: sub-question decomposition breaks multi-hop questions into simpler sub-queries retrieved independently, then synthesized. Step-back prompting asks "what general principles apply?" before retrieving, generating better retrieval queries. Multi-query generation creates 3-5 query variants, retrieves for all, and deduplicates results.
These query enhancement techniques trade latency (N queries instead of 1) for retrieval quality. HyDE adds one LLM call at query time, typically 0.5-1 second for document generation. Multi-query adds proportional cost. Sub-question decomposition can help on questions like "Compare pricing between X and Y," where you retrieve for each separately.
When to apply: use simple query rewriting for obvious abbreviations (always cheap). Use HyDE/decomposition when retrieval is your bottleneck and you have latency budget. Measure impact on eval metrics: does the overhead justify the retrieval improvement? Monitor query patterns to apply selectively (e.g., only for longer, complex queries).
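The multi-query pattern can be sketched with stand-in functions for the LLM variant generator and the vector store (both hypothetical here).

```python
def multi_query_retrieve(query, generate_variants, search, top_k=5):
    """Retrieve for several query variants, deduplicate by document ID,
    and keep each document's best (lowest) rank across variants."""
    best_rank = {}
    for variant in [query] + generate_variants(query):
        for rank, doc_id in enumerate(search(variant)):
            if doc_id not in best_rank or rank < best_rank[doc_id]:
                best_rank[doc_id] = rank
    return sorted(best_rank, key=best_rank.get)[:top_k]

# Toy stand-ins: in practice generate_variants is an LLM call and
# search hits your vector index.
fake_variants = lambda q: [q + " meaning", q + " definition"]
fake_index = {
    "k8s": ["d1", "d2"],
    "k8s meaning": ["d3", "d1"],
    "k8s definition": ["d2", "d4"],
}
results = multi_query_retrieve("k8s", fake_variants,
                               lambda q: fake_index.get(q, []))
```

The N searches can run in parallel, which keeps the latency cost closer to one retrieval round-trip plus the variant-generation LLM call.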
Knowledge graphs represent information as triples (subject, predicate, object): "Alice → reports_to → Bob." Graph databases like Neo4j or Amazon Neptune store and query these structures efficiently. GraphRAG typically combines vector retrieval (for relevant subgraphs) with graph traversal (for relationship-following).
Implementation: extract entities and relationships from documents using NER and relation extraction (LLM-based or rule-based), build the graph, then at query time retrieve relevant nodes and their neighborhoods. Microsoft's GraphRAG approach generates community summaries at different hierarchy levels, enabling both local and global questions.
When to use: multi-hop reasoning ("What projects is the team lead of the person who wrote document X working on?"), structured organizational knowledge, regulatory compliance (tracing data lineage), and scientific literature (connecting papers, authors, findings). When not to use: simple FAQ-style questions, when data doesn't have clear entity relationships. GraphRAG adds significant complexity to your pipeline.
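The triple model is easy to prototype in memory. This sketch shows a two-hop traversal of the kind GraphRAG enables, using invented entities; real systems would use Neo4j or Neptune with indexed lookups.

```python
class TripleStore:
    """Minimal in-memory (subject, predicate, object) store."""

    def __init__(self):
        self.triples = []

    def add(self, subject, predicate, obj):
        self.triples.append((subject, predicate, obj))

    def objects(self, subject, predicate):
        # One-hop traversal: all objects reachable via this edge.
        return [o for s, p, o in self.triples
                if s == subject and p == predicate]

g = TripleStore()
g.add("Alice", "reports_to", "Bob")
g.add("Bob", "leads", "Project Apollo")
g.add("Alice", "wrote", "Doc X")

# Multi-hop question: "What project does the manager of Doc X's author lead?"
author_manager = g.objects("Alice", "reports_to")[0]   # hop 1 -> "Bob"
projects = g.objects(author_manager, "leads")          # hop 2
```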
Approaches: OCR + text extraction converts visual content to text (losing layout information). Vision-language models (GPT-4V, Claude vision) can directly interpret images and charts. ColPali/ColQwen embed document pages as images directly, bypassing text extraction entirely. CLIP-based retrieval matches text queries to images using shared embedding spaces.
For tables and structured data: extract tables using tools like Unstructured.io, Camelot, or LLM-based extraction. Store table data separately with metadata linking back to source documents. Consider converting tables to markdown or structured JSON for better LLM comprehension.
Production challenges: multimodal embeddings are larger and more expensive, retrieval accuracy varies by content type, and vision model APIs are costly. Start with text-first RAG, add multimodal only for document types where text extraction loses critical information (charts, diagrams, handwritten notes). Always maintain both text and visual representations for fallback.
AI Orchestration / Agents
Reflexion adds self-evaluation: after completing a task, the agent evaluates its own output, identifies mistakes, and tries again. LATS (Language Agent Tree Search) explores multiple reasoning branches like tree search, keeping the best path. Tool-augmented agents dynamically select from available APIs, deciding what external tools to invoke based on the task.
Choice depends on task complexity and constraints. ReAct works for most tasks, exploring interactively. Plan-and-Execute is faster for linear tasks. Reflexion improves quality but adds cost. Hierarchical agents (supervisor + workers) scale to complex problems but need careful orchestration. Multi-hop reasoning tasks benefit from decomposition; single-step tasks don't need agents.
Production considerations: agents are latency-heavy (multiple LLM calls, tool invocations), token-expensive, and can hallucinate about available tools. Implement guardrails (max steps, timeouts, tool validation), fallback strategies for tool failures, and monitoring for agent loops or failures. Cache intermediate results. Start simple (ReAct), add complexity only if needed.
Implementation: define function schemas in JSON Schema format, configure the LLM with function definitions, receive structured tool calls in the response, validate and execute, and return results to the LLM for further reasoning. Supports parallel tool calls (multiple tools in one response), error propagation (returning errors to the model), and chaining (tool output feeds next tool).
Best practices: write clear function descriptions and parameter documentation so the model understands what each tool does. Use enums for constrained parameters. Validate all parameters before execution. Handle errors gracefully; if a tool call fails, return the error to the LLM so it can retry or choose another approach. Monitor tool call accuracy and failure rates.
Common issues: models hallucinating parameters that don't exist, calling tools with invalid arguments, or misunderstanding function intent. Mitigate with explicit prompting ("Only call these tools: ..."), few-shot examples of correct function calls, and schema validation. Most major LLM providers (OpenAI, Claude, others) support native function calling, making integration straightforward.
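Validating parameters before execution can be sketched as follows. The tool registry, simplified type map, and error strings are illustrative, not any provider's schema format; errors are returned as text so they can be fed back to the LLM for a retry.

```python
import json

# Hypothetical tool registry with a simplified param-type map.
TOOLS = {
    "get_weather": {
        "params": {"city": str},
        "fn": lambda city: f"Sunny in {city}",
    }
}

def execute_tool_call(call_json: str):
    """Validate a model-emitted tool call, then execute it."""
    call = json.loads(call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        return f"error: unknown tool {call.get('name')!r}"
    args = call.get("arguments", {})
    for param, ptype in tool["params"].items():
        if param not in args or not isinstance(args[param], ptype):
            return f"error: bad or missing parameter {param!r}"
    if set(args) - set(tool["params"]):
        return "error: unexpected parameters"   # hallucinated argument names
    return tool["fn"](**args)

result = execute_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}')
bad = execute_tool_call(
    '{"name": "get_weather", "arguments": {"zip": "90210"}}')
```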
Technical challenges: coordination overhead (agents need to communicate, increasing latency), error propagation (if one agent fails, others may fail downstream), state management (tracking shared context and agreements), and consistency (agents must not contradict each other). Frameworks like LangGraph, CrewAI, and AutoGen provide primitives for defining agent roles, tools, and communication patterns.
Implementation considerations: define clear interfaces between agents (what each agent is responsible for), implement message passing or event-driven communication, handle agent failure gracefully with timeouts and fallbacks. Use shared context/memory for efficiency. Design agents with clear specialization (one for research, one for analysis, one for writing) rather than generic agents.
When multi-agent helps: research tasks (parallel exploration), complex planning (different agents specialize in different planning aspects), content creation (researcher + writer + editor), analysis (multiple perspectives). When it doesn't: simple tasks, latency-sensitive applications. Start with supervisor-worker if new to multi-agent; debate/consensus is more complex but can improve quality.
Implementing memory requires deciding what to remember (which interactions/facts are important?), when to retrieve (how many memories per query?), and managing staleness and contradictions (old memories may be outdated). Common approaches: summarization (reduce long conversation histories to key points), chunking and embedding (store interactions as vectors for retrieval), and scoring (keep only high-confidence memories).
Practical patterns: use conversation history (short-term) for immediate context, retrieve 2-5 relevant past interactions (long-term) for continuity, and periodically summarize very long conversations. For agents, store execution traces (tool calls, outcomes) as episodic memory. Monitor memory size (oversized memories exceed context limits and increase latency) and retrieval accuracy (wrong memories degrade performance).
Challenges: determining what's memorable, avoiding stale or contradictory memories, balancing memory retrieval latency against richness of context. In production, use memory sparingly; often a few well-chosen recent interactions matter more than extensive history. Implement memory cleanup (remove old, irrelevant memories) and periodic retraining of retrieval embeddings.
CrewAI focuses on multi-agent collaboration with role-based agents that share context. LlamaIndex specializes in RAG pipelines, abstracting embedding, chunking, and retrieval. Haystack is lightweight, pipeline-focused. Each has different strengths: LangChain for flexibility and ecosystem, LangGraph for complex workflows, LlamaIndex for RAG-first applications.
Practical trade-offs: frameworks enable rapid prototyping but add abstraction layers. LangChain chains can hide complexity; LangGraph's explicit graph definition is more verbose but clearer. Using a framework speeds initial development but can make optimization harder. Consider abstraction overhead when latency is critical.
Recommendation: LangChain good for simple chains and prototyping, LangGraph for complex agent workflows, LlamaIndex for RAG-heavy systems. Avoid over-engineering with frameworks; sometimes a simple manual orchestration is clearer and faster. Evaluate your needs: if you need rapid development and don't mind abstraction, use a framework; if you need fine-grained control and latency optimization, consider lighter tools.
MCP follows a client-server architecture: the MCP host (your AI application) connects to MCP servers that expose tools, resources, and prompts through a standardized interface. This decouples tool implementation from LLM orchestration. A single MCP server for "database access" works with any MCP-compatible client, eliminating redundant integration work.
Key capabilities: Tools (executable functions the LLM can call), Resources (data the LLM can read), and Prompts (reusable prompt templates). MCP servers can be local (file system, databases) or remote (APIs, cloud services). The protocol handles authentication, capability discovery, and structured communication.
Why it matters: the AI ecosystem is fragmenting into incompatible tool integrations. MCP standardizes this, similar to how HTTP standardized web communication. For AI engineers, adopting MCP means your tool integrations work across Claude, ChatGPT, and any other MCP-compatible system. Build once, use everywhere.
Input guardrails: validate and sanitize user inputs before they reach the LLM. Detect prompt injection attempts (instructions embedded in user data), block prohibited content, and enforce input length limits. Tools like NeMo Guardrails and Guardrails AI provide programmable safety layers.
Output guardrails: validate LLM outputs before executing actions. Check tool call parameters against allowlists, require human approval for high-risk actions (deleting data, sending emails, financial transactions), implement rate limits on tool calls, and verify outputs against business rules.
Prompt injection defense is critical: attackers can embed instructions in documents, emails, or web pages that the agent processes. Defense-in-depth approaches include input/output scanning, privilege separation (agents have minimum necessary permissions), sandboxing tool execution, and maintaining immutable system instructions that can't be overridden by user content. Monitor and log all agent actions for audit trails.
Evaluation & Monitoring
Effective evaluation requires: diverse eval datasets (100-500 examples covering different query types), clear rubrics (what makes a good response?), and integration into CI/CD (evaluate on every major change). LLM-as-judge is faster than human eval but occasionally biased; pair with random human spot-checks. Design eval data to catch regressions: edge cases, adversarial examples, previously-buggy queries.
For RAG systems, measure retrieval quality (precision, recall, MRR) separately from generation quality. Decompose evaluation by domain/query type (performance may vary). A/B testing in production (even with small traffic) often reveals issues eval data misses. Track metrics over time to catch performance drift.
Common pitfalls: eval dataset too small or unrepresentative (metrics look good but production fails), evaluating at task level only (hard to diagnose failures), not automating eval (manual testing is slow and gets skipped in a hurry). Best practice: start with LLM-as-judge for rapid iteration, then validate against human judges. Make evaluation continuous, automated, and part of your deployment pipeline.
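The retrieval metrics mentioned above are straightforward to compute; a sketch with toy document IDs:

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """MRR over an eval set: for each query, 1/rank of the first relevant
    document in the ranked results (contributes 0 if none appear)."""
    total = 0.0
    for results, relevant in zip(results_per_query, relevant_per_query):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results_per_query)

def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in results[:k] if d in relevant) / k

mrr = mean_reciprocal_rank(
    [["d1", "d2"], ["d5", "d3"]],   # ranked results for two eval queries
    [{"d2"}, {"d3"}],               # relevance judgments
)
# Both queries find their first relevant doc at rank 2 -> MRR = 0.5
```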
Mitigation strategies work at multiple levels: prompt-level (instruct the model not to hallucinate, use chain-of-thought), retrieval-level (ensure relevant context is retrieved and injected), temperature (lower temperature = less hallucination, but less creativity), and post-processing (verify claims against sources, require citations).
Technical approaches: use retrieval (RAG) to ground responses in facts, implement citation requirements (every claim must cite a source), prefer larger, more capable models (which tend to hallucinate less), and run guardrails that block responses with unverifiable claims. NLI-based detection works well but requires another model call, adding latency and cost.
In production: accept that zero hallucination is unrealistic for LLMs. Focus on detection and mitigation: display confidence levels, include sources, require user confirmation for important claims. Monitor hallucination rate in production logs (user feedback, corrections). Different use cases have different hallucination tolerance: customer service can be forgiving, legal documents cannot.
Implementation: define source documents or passages, require the model to cite sources explicitly (e.g., "[Source: Document A, Section 2]"), or post-process to match claims to sources using semantic similarity or NLI. Production systems often use both prompt-based grounding (instruct model to cite) and automated verification.
Grounding is essential for regulated industries (legal, healthcare, finance) and high-stakes applications. It builds user trust and provides accountability. Costs: generation may be slower if explicit grounding is required; post-hoc checking adds latency; some claims may be hard to ground (synthesized insights from multiple sources).
Best practices: make grounding part of your data generation process (training on grounded examples helps models learn), implement verification tests (check that cited sources actually contain the information), handle edge cases (synthesized claims, common knowledge), and monitor grounding completeness (what % of claims are grounded?). For RAG, grounding is often automatic if you control the retrieval.
Frameworks: Guardrails AI defines validation logic using a DSL, NeMo Guardrails (from NVIDIA) uses a configuration language for topic rails and output schema, LLM Guard (open-source) provides modular guards for input/output scanning. Most implement guardrails as middleware (check before/after LLM calls).
Implementation includes classifiers for toxic content, regex patterns for PII (credit cards, SSNs), prompt injection detection (unusual tokens, suspicious patterns), and structured validation (JSON schema, enum values). These checks add latency (50-200ms typically) but are essential for production, especially customer-facing apps.
Common issues: false positives (blocking legitimate requests), false negatives (missing actual threats), and maintaining guardrails as attacks evolve. Use multiple layers: semantic classifiers for intent, keyword patterns for known attacks, and validation for structure. Log guardrail rejections to monitor effectiveness and false positive rates. Update guardrails as new threat patterns emerge.
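A minimal sketch of the input/output checks described above, using illustrative regex patterns for PII and a key-presence check for structured output. Production systems use vetted PII detectors and full JSON Schema validation rather than hand-rolled rules like these.

```python
import json
import re

# Illustrative patterns only; production systems use vetted PII libraries.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def input_guard(text: str) -> list[str]:
    """Return a list of guardrail violations found in user input."""
    violations = []
    if SSN.search(text):
        violations.append("pii:ssn")
    if CARD.search(text):
        violations.append("pii:card")
    return violations

def output_guard(raw: str, required_keys: set[str]) -> bool:
    """Structured validation: the model's output must be JSON containing the
    expected keys. Schema libraries (jsonschema) do this more rigorously."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()
```

Logging every rejection from these guards (not just blocking) is what lets you measure the false positive rate mentioned above.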
Implementation: instrument your application to emit traces (spans for each component: retrieval, LLM call, post-processing), log important events (which documents retrieved, which tool called, latency per step). Correlate traces with inputs/outputs and metadata. Set up dashboards for latency, cost, error rate. Create alerts for regressions (latency spike, quality drop).
Analysis: use traces to identify bottlenecks (where's latency spent?), cost drivers (expensive models vs cheap?), and failure patterns (which queries fail?). Trace errors to root cause (retrieval missed, model misunderstood, tool failed?). Compare performance before/after changes. Monitor user interactions: satisfaction, complaints, corrections.
Production best practices: log sufficient detail to debug issues without excessive cost (sample verbose logging). Protect PII in logs. Set up on-call alerts for critical metrics. Regularly review logs to understand failure modes and improve prompts/RAG/tool definitions. Use observability to guide prioritization: focus on highest-impact issues first.
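The per-component tracing described above can be sketched with a simple context manager. OpenTelemetry is the production equivalent; the module-level `TRACE` list here is just an in-memory stand-in for a trace exporter.

```python
import time
import uuid
from contextlib import contextmanager

# Minimal illustrative tracer; real systems use OpenTelemetry.
TRACE: list[dict] = []

@contextmanager
def span(name: str, request_id: str):
    """Record wall-clock duration and status for one pipeline step."""
    start = time.perf_counter()
    record = {"request_id": request_id, "span": name, "status": "ok"}
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        TRACE.append(record)

request_id = str(uuid.uuid4())
with span("retrieval", request_id):
    time.sleep(0.01)          # stand-in for a vector-store query
with span("llm_call", request_id):
    time.sleep(0.02)          # stand-in for the model call
```

Because every span carries the same `request_id`, a dashboard can group spans per request and answer "where is the latency spent?" directly.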
Implementation: define a scoring rubric (1-5 scale with clear criteria for each level), craft a judge prompt that includes the original query, the response to evaluate, and optionally a reference answer. The judge outputs a score and reasoning. For higher reliability, use pairwise comparison (which of two responses is better?) rather than absolute scoring.
Challenges: position bias (judges favor the first response in comparisons), verbosity bias (longer responses scored higher regardless of quality), self-enhancement bias (models rate their own outputs higher). Mitigate with randomized ordering, calibration examples, and multiple judge passes. Agreement between LLM judges and human evaluators is typically 70-85%.
When to use: rapid evaluation during development, CI/CD quality gates, monitoring production quality trends. When NOT to use as sole evaluator: safety-critical applications, novel domains where the judge has limited knowledge, or when absolute accuracy is required. Always maintain a human evaluation benchmark and periodically calibrate your LLM judge against it.
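A sketch of pairwise judging with position-bias mitigation: each pair is judged in both orders, and a winner is declared only when the two verdicts agree. `stub_judge` stands in for a real LLM call; by preferring the longer response it also happens to illustrate the verbosity bias mentioned above.

```python
def judge_prompt(query: str, resp_a: str, resp_b: str) -> str:
    """Pairwise comparison prompt; the judge answers with 'A' or 'B'."""
    return (
        "You are an impartial judge. Given the user query and two responses, "
        "answer with the single letter of the better response.\n\n"
        f"Query: {query}\nResponse A: {resp_a}\nResponse B: {resp_b}\nBetter:"
    )

def pairwise_judge(query, resp1, resp2, call_judge):
    """Mitigate position bias by judging both orderings; declare a winner
    only when the verdicts agree, otherwise report a tie."""
    v1 = call_judge(judge_prompt(query, resp1, resp2))  # resp1 in slot A
    v2 = call_judge(judge_prompt(query, resp2, resp1))  # resp1 in slot B
    if v1 == "A" and v2 == "B":
        return "resp1"
    if v1 == "B" and v2 == "A":
        return "resp2"
    return "tie"

# Stub judge for illustration: always prefers the longer response,
# i.e., a caricature of verbosity bias.
def stub_judge(prompt: str) -> str:
    a = prompt.split("Response A: ")[1].split("\nResponse B:")[0]
    b = prompt.split("Response B: ")[1].split("\nBetter:")[0]
    return "A" if len(a) > len(b) else "B"

winner = pairwise_judge("What is RAG?",
                        "Retrieval-augmented generation.", "RAG.", stub_judge)
```

Swapping `stub_judge` for a real model call gives you the double-pass judging loop; the "tie" outcome is exactly where calibration against human labels matters most.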
System Design (AI Systems)
Design patterns: separate read-heavy (retrieval, inference) from write-heavy (index updates) workloads. Use async processing for non-real-time tasks (batch embeddings, periodic reindexing). Implement circuit breakers and fallbacks for external APIs (LLM providers, retrieval systems). Cache aggressively: identical queries should hit cache, not re-run retrieval/inference.
Key trade-offs: latency vs cost (faster inference costs more), freshness vs performance (caching improves speed but stales results), centralized vs distributed (simple architecture is easier but less scalable). Design based on your bottleneck: if retrieval is slow, optimize vector DB; if inference is bottleneck, use faster models or cached responses.
Monitoring: track per-component latency, cache hit rates, error rates, and cost. Use load testing to understand where breakpoints are. Plan for growth: architecture supporting 1K QPS may not support 10K without redesign. Implement feature flags to A/B test new approaches (e.g., different retrieval models, caching strategies) safely at scale.
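The circuit-breaker-with-fallback idea above can be sketched as follows: after a threshold of consecutive failures, calls short-circuit to the fallback (a cached or degraded response) until a cooldown elapses. `flaky_llm_call` simulates a provider outage; a production implementation would also handle the half-open state more carefully.

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: after `threshold` consecutive failures,
    short-circuit calls to the fallback for `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()          # circuit open: skip the flaky API
            self.opened_at = None          # cooldown elapsed: retry upstream
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

def flaky_llm_call():
    raise TimeoutError("provider timeout")   # simulated outage

breaker = CircuitBreaker(threshold=2, cooldown=60)
answers = [breaker.call(flaky_llm_call, lambda: "cached answer")
           for _ in range(3)]
```

While the circuit is open, no request waits on the failing provider, which is the latency win the pattern exists for.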
Considerations: authentication and authorization (who can access what?), resource quotas (prevent one tenant from consuming all resources), cost tracking per tenant (bill accordingly), and security (prevent cross-tenant data leakage). Performance can be complex: one tenant's heavy load shouldn't impact others. Use circuit breakers and rate limiting per tenant.
Architectural patterns: shared infrastructure with logical isolation (simpler, cheaper), or dedicated infrastructure per tenant (more expensive, better isolation). Most SaaS AI apps use shared infrastructure with strong logical isolation. Implement audit logs: what did each tenant access, when, from where?
Testing: ensure isolation is airtight with adversarial tests (can tenant A see tenant B's data?). Test performance under mixed load (many tenants, varying patterns). Implement observability per tenant (monitor costs, latency, error rates separately). Plan for tenant growth and cleanup (inactive tenants, data retention policies).
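A minimal illustration of logical isolation in a retrieval path: every query is scoped by `tenant_id`, and the adversarial test is simply asserting that one tenant's query can never return another tenant's documents. The in-memory `DOCS` list stands in for a vector or document store with a metadata filter.

```python
# Illustrative in-memory store; a real system would apply the tenant filter
# inside the vector DB or SQL query, not in application code.
DOCS = [
    {"tenant_id": "acme", "text": "Acme pricing sheet"},
    {"tenant_id": "globex", "text": "Globex roadmap"},
]

def tenant_search(tenant_id: str, query: str) -> list[str]:
    """Every retrieval path must scope by tenant_id; forgetting this filter
    is the classic cross-tenant leakage bug."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    return [d["text"] for d in DOCS
            if d["tenant_id"] == tenant_id
            and query.lower() in d["text"].lower()]
```

The second assertion in a test suite for this ("tenant A searching for tenant B's content gets nothing") is exactly the adversarial isolation test described above.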
Strategy: cache full responses for identical queries (deterministic hashing on inputs), cache retrieval results (if chunks don't change, why re-search?), cache embeddings (pre-compute for common documents). TTL (time-to-live) balances freshness and performance: static content can cache longer, dynamic content needs shorter TTLs.
Challenges: knowing when to invalidate the cache (source documents changed), handling semantic similarity (should similar queries hit the cache too?), and managing cache size. Implement cache warming (preload common queries). Monitor cache hit rate: if it's too low (<30%), caching isn't helping much; if it's very high (>90%), you're either serving stale data or the query stream is highly repetitive.
Implementation: use Redis for distributed caching, implement cache-aside pattern (check cache, miss goes to LLM, write result back). For RAG, caching retrieval results is often higher-impact than caching full responses because retrieval is deterministic and stable. Measure latency and cost improvements from caching to justify complexity.
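The cache-aside pattern above, sketched with an in-memory TTL cache standing in for Redis; the injectable `clock` exists only for testability, and `expensive_retrieval` is a stand-in for a real vector-store query or LLM call.

```python
import time

class TTLCache:
    """In-memory stand-in for Redis illustrating the cache-aside pattern."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        entry = self.store.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            self.hits += 1
            return entry[0]                      # cache hit
        self.misses += 1
        value = compute()                        # miss: run retrieval/LLM
        self.store[key] = (value, self.clock())  # write result back
        return value

calls = []
def expensive_retrieval():
    calls.append(1)                              # count upstream invocations
    return ["chunk-1", "chunk-2"]

cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("query:refund policy", expensive_retrieval)
second = cache.get_or_compute("query:refund policy", expensive_retrieval)
```

Tracking `hits` and `misses` gives you the hit-rate metric needed to judge whether the caching layer is paying for its complexity.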
Architecture: separate job queue (Celery with Redis, AWS SQS, Temporal), worker processes, and job storage. Define SLAs for different job types. High-priority jobs (user-initiated) might have 1-minute SLA; low-priority background jobs can be hours. Implement retries with exponential backoff for failures.
Use cases in AI: batch re-ranking of search results (user queries return the top 50, an async job re-ranks to the top 5), document ingestion (user uploads 100 PDFs, async workers chunk and embed them), periodic retraining (weekly model fine-tuning), and data exports. Async is essential for operations that would time out in synchronous requests.
Challenges: state management (tracking job progress), failure handling (what happens if a worker crashes mid-job?), and user experience (how long do users wait?). Mitigations: implement progress tracking, use distributed queues with acknowledgments, and provide UI feedback (estimated time remaining). Monitor queue depth and worker utilization to avoid bottlenecks.
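The retry-with-exponential-backoff behavior mentioned above, as a small sketch; `sleep` and `rng` are injectable so the policy can be exercised without real waiting, and `flaky_job` simulates transient network failures.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep, rng=random.random):
    """Retry a flaky job with exponential backoff plus jitter. The delay
    doubles per attempt; jitter spreads out retries from many workers."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # exhausted: surface the failure
            delay = base_delay * (2 ** attempt) * (0.5 + rng())  # jittered
            sleep(delay)

attempts = []
def flaky_job():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient network failure")
    return "done"

result = retry_with_backoff(flaky_job, sleep=lambda s: None)
```

Jobs that still fail after the final attempt should land in a dead-letter queue rather than vanish, so the progress tracking described above stays honest.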
Patterns: RESTful endpoints for simple retrieval (GET /search?q=...), function-calling style (POST with tool specifications) for complex tasks, streaming (Server-Sent Events, WebSockets) for long-running operations like LLM generation. Implement pagination for list endpoints, filtering for subset selection.
Versioning: maintain multiple API versions (v1, v2) to avoid breaking changes. Deprecate gradually: announce 6+ months before removal. Document all parameters, return types, and error codes. Implement rate limiting per user/API key. Return HTTP status codes consistently: 200 for success, 400 for client errors, 500 for server errors.
Best practices: include request IDs for debugging, support async/webhook callbacks for long operations, implement caching headers (ETag, cache-control) for retrieval APIs, and monitor API usage (QPS, error rates, latency by endpoint). Plan for backwards compatibility; users depend on your API contract. Use OpenAPI/Swagger for documentation.
Benefits: scalability (add workers without changing core system), resilience (if one service is down, others buffer events), and flexibility (add new services without modifying existing ones). Technologies: message brokers (RabbitMQ, Kafka), serverless functions (AWS Lambda triggered by events), or pub/sub systems (Google Pub/Sub, AWS SNS/SQS).
Challenges: exactly-once delivery is hard (events may be processed multiple times), debugging becomes harder (tracing through events is complex), and ordering matters (process chunks before embeddings). Mitigations: design idempotent handlers, implement request ID tracking across events, use ordered partitions in Kafka for sequencing.
When to use: great for pipelines, batch processing, and loosely-coupled microservices. Overkill for simple request-response APIs. Start simple (direct calls), move to events when scaling demands or complexity grows. Common mistakes: events that are too fine-grained (per-message overhead adds up) or too coarse (you lose granularity). Design the event schema carefully; changing it is harder than changing a function signature.
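A sketch of the idempotent-handler mitigation: with at-least-once delivery the broker may redeliver the same event, so the consumer dedupes on a stable event id before applying side effects. The module-level sets here stand in for durable storage (a database table or Redis set in practice).

```python
# Illustrative idempotent consumer for an at-least-once delivery pipeline.
processed_ids = set()
embedded_chunks = []

def handle_chunk_embedded_event(event: dict) -> bool:
    """Return True if the event was applied, False if it was a duplicate."""
    if event["event_id"] in processed_ids:
        return False                       # duplicate delivery: no-op
    embedded_chunks.append(event["chunk_id"])   # the side effect
    processed_ids.add(event["event_id"])   # mark done after the side effect
    return True

event = {"event_id": "evt-42", "chunk_id": "doc1#c3"}
applied_first = handle_chunk_embedded_event(event)
applied_again = handle_chunk_embedded_event(event)  # redelivered by broker
```

Because the handler is a no-op on duplicates, the broker is free to redeliver on timeouts or crashes without corrupting downstream state.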
Implementation patterns: Token bucket (smooth burst handling), sliding window (precise rate tracking), and adaptive rate limiting (adjusts based on provider response headers like x-ratelimit-remaining). Implement at multiple levels: per-user (prevent abuse), per-tenant (fair sharing in multi-tenant systems), and global (stay within provider limits).
Backpressure strategies for when limits are hit: queue requests with priority ordering (paid users first), return cached responses for common queries, fall back to smaller/cheaper models, or return graceful degradation responses. Implement exponential backoff with jitter for retries against provider rate limits.
Production setup: use Redis-based distributed rate limiters for multi-instance deployments. Track token usage per request (not just request count) since a single large prompt can consume your budget. Implement cost alerts and circuit breakers that switch to cheaper models when spending exceeds thresholds. Log all rate limit events for capacity planning.
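The token-bucket pattern above, as a minimal sketch: `capacity` bounds bursts, `refill_rate` sets the sustained rate, and the `cost` parameter lets you charge by token count rather than request count, as suggested above. A distributed version would keep this state in Redis.

```python
import time

class TokenBucket:
    """Illustrative token-bucket limiter: capacity bounds bursts,
    refill_rate sets the sustained requests (or tokens) per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)  # burst of 3, 1/s sustained
burst = [bucket.allow() for _ in range(5)]         # rapid-fire requests
```

Requests that return `False` are where the backpressure strategies above kick in: queue, serve from cache, or degrade to a cheaper model.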
Cloud & Infrastructure
AI-specific services: AWS has SageMaker (fine-tuning, inference), Azure has OpenAI integration (direct endpoint), GCP has Vertex AI with many pre-trained models. Most organizations standardize on one (often AWS) for operational simplicity. All three offer APIs for popular models (GPT, Claude, etc.).
Practical considerations: regional availability (where should your app run?), data residency (GDPR requires EU data to stay in EU regions), and cost (prices differ by region). Use managed serverless services (Lambda, Cloud Functions) for bursty workloads, containers (ECS, GKE) for stateful workloads, and VMs for full control.
Getting started: create an account, use free tier for testing, set up billing alerts (prevent surprise charges), and learn IAM (access control). Most AI apps start with managed services (simpler, less ops), then move to containers as they scale. Provider lock-in is real; design with portability in mind if possible.
For AI apps: use K8s to scale LLM inference horizontally (run multiple inference pods), manage updates without downtime (rolling deployments), and handle failure recovery automatically. Challenges: K8s is complex; operational overhead is significant (need devops expertise). Alternatives: serverless (AWS Lambda), managed container platforms (ECS, Cloud Run).
Setup: use managed K8s (EKS on AWS, AKS on Azure, GKE on GCP) to avoid cluster management overhead. Define resources: requests (guaranteed minimum), limits (maximum). Use horizontal pod autoscaling to scale based on CPU/memory. Implement liveness/readiness probes so K8s knows when pods are healthy.
When to use K8s: managing multiple services at scale, complex networking, or CI/CD-driven deployment workflows. Overkill for simple APIs (use serverless). Cost: K8s itself is free, but the cluster isn't (typically at least a few hundred dollars per month). Recommendation: don't adopt K8s without the ops expertise; start simpler and add it only when you need it.
Tools: GitHub Actions (free with GitHub), GitLab CI, Jenkins, or cloud-native tools (AWS CodePipeline, Azure DevOps). Pipeline stages: lint code, run tests, build Docker image, push to registry, deploy to dev/staging/prod, run smoke tests. For AI: include eval pipeline (did model performance change?), cost tracking, and A/B testing.
Best practices: fast feedback (tests should complete in <10 minutes), parallelization (run independent tests together), and gating (production deployment requires approval or passing stricter checks). Implement feature flags to deploy safely (feature hidden behind flag, enable for % of users). Automated rollbacks on alert thresholds (error rate too high, latency spiked).
For AI systems: include model evaluation in CI (check that model quality didn't regress), cost estimation (will this change increase costs?), and versioning (track which model version is in production). Implement blue-green deployments (two production environments, switch traffic) for zero-downtime updates. Monitor deployment frequency and failure rate.
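The model-evaluation gate in CI can be sketched as a comparison against baseline scores. The metric names and numbers below are purely illustrative; in a real pipeline the gate would load both score sets from your eval harness's output and fail the build (nonzero exit code) on any regression.

```python
# Hypothetical baseline numbers for illustration only.
BASELINE = {"answer_accuracy": 0.82, "citation_rate": 0.91}
TOLERANCE = 0.02   # allow small eval noise before failing the build

def eval_gate(candidate: dict, baseline: dict,
              tolerance: float) -> list[str]:
    """Return regression messages; an empty list means the gate passes."""
    failures = []
    for metric, base in baseline.items():
        score = candidate.get(metric, 0.0)
        if score < base - tolerance:
            failures.append(f"{metric}: {score:.2f} < baseline {base:.2f}")
    return failures

failures = eval_gate({"answer_accuracy": 0.84, "citation_rate": 0.86},
                     BASELINE, TOLERANCE)
if failures:   # in CI, exit nonzero here to block the deployment
    print("\n".join(failures))
```

The tolerance is a judgment call: too tight and flaky evals block every deploy, too loose and slow quality drift slips through.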
Serverless strengths: cheap for bursty workloads, scales from zero, no ops overhead. Weaknesses: cold-start latency (the first invocation can take seconds), short execution timeouts (usually 15 minutes max), and less control. Container strengths: can run anything, predictable latency, supports long-running processes. Weaknesses: you manage the infrastructure, and it's more expensive at low scale.
For AI: serverless works for RAG retrieval endpoints (REST endpoints, quick responses), edge-triggered jobs (document uploaded → process in Lambda). Doesn't work well for: LLM inference (latency-sensitive, cold starts unacceptable), background workers (long-running), or stateful services. Containers are better for LLM serving, multi-component pipelines.
Common pattern: hybrid. Retrieval APIs in serverless (FastAPI Lambda), LLM inference in containers (ECS with auto-scaling), batch jobs in serverless (Step Functions orchestrating Lambdas). Estimate your load: if average QPS is <10 and bursty, serverless saves money; if consistent >100 QPS, dedicated containers are cheaper.
Security & Compliance
Requirements: written security policies, access controls (who can access what?), incident response procedures, encryption (in transit and at rest), audit logging (track all access), regular security testing. SOC 2 Type II audit runs for 6+ months (proving sustained compliance), Type I is a point-in-time snapshot.
For AI companies: essential for enterprise customers (most require SOC 2 attestation in contracts). Start with SOC 2 Type II if targeting enterprise. Implementation: document policies, implement them (access controls, encryption), conduct security audit, work with auditor, and maintain compliance.
Cost: audits range $15K-100K+ depending on company size and complexity. Timeline: 3-6 months. Start early if enterprise sales are a priority. Compliance isn't one-time; you maintain it by staying secure and undergoing annual audits. SOC 2 is about operations/security processes, not specific tech. You can achieve SOC 2 with simple well-run infrastructure.
Requirements for vendors: use certified encryption algorithms, implement multi-factor authentication, conduct regular security audits, encrypt backups, and report breaches within 60 days. Common violation: improper de-identification (removing names isn't enough; must remove 18 specific identifiers or use statistical de-identification).
For AI companies: if you process patient data (e.g., AI for diagnosis, patient note analysis), you're handling PHI. Either become HIPAA-compliant or ensure data is de-identified. Compliance includes policies (data retention, deletion), training (staff knows privacy rules), and technical controls (encryption, access logs).
Cost: implementation is significant (encryption infrastructure, audit procedures, training). Penalties for non-compliance are severe (roughly $100 to $50K per violation, potentially millions for large breaches). Most healthcare AI startups either become HIPAA-compliant or work only with de-identified data. HIPAA applies to covered entities and their business associates; B2B tools for non-healthcare companies don't need it.
Requirements: data processing agreements with customers and vendors, privacy policies (explain what data you collect and why), breach notifications within 72 hours, and data protection impact assessments (risky processing requires analysis). Consent must be explicit (not pre-checked boxes). Data minimization: only collect data you need.
For AI: if you train or use models on EU user data, you're subject to GDPR. Implications: can't just retain data indefinitely (must delete old data), can't train proprietary models on user data without consent, and must explain AI decisions (especially if they materially affect users). Some EU regulators scrutinize AI (fairness, bias concerns).
Practical: if users are in EU, implement data deletion, audit data retention, get explicit consent for data use, and implement privacy-by-design. Cost varies (small startups can self-assess, large companies hire compliance consultants). Penalties are high (up to 4% of global revenue). Many startups use compliance as a feature (privacy-first marketing). If your users aren't in EU, GDPR doesn't apply.
Certification requires: documented information security management system (ISMS), implementation of controls addressing risks, regular audits by accredited certifiers, and ongoing compliance maintenance. Cost: certification audits $20K-100K+, significant effort to implement and document controls.
Who needs it: required by some enterprises (especially government contracts, regulated industries), useful for general credibility. Not as common as SOC 2 in SaaS, but more comprehensive. If you're building infrastructure/security products, ISO 27001 adds credibility.
For AI: ensures your infrastructure and processes are secure. Particularly relevant if you're processing sensitive data or offering AI services to regulated industries. Takes 6-12 months to achieve certification. Plan this in parallel with product development if compliance is a requirement.
Technical: use AES-256 encryption for at-rest data, TLS 1.3 for in-transit, and strong key management (rotate keys, store securely). Implement role-based access control (different staff roles have different access). Log all access to PII (audit trail). Implement data retention policies (delete data after N days/months).
For AI: be careful with training data (don't train models on raw PII). Use tokenization/hashing to de-identify before training. If your model outputs might contain PII (e.g., chatbot), scan outputs for PII before returning. Implement PII detection in user inputs (block users from uploading SSNs). Handle data deletion requests (right to be forgotten).
Tools: data encryption libraries (cryptography package in Python), secrets management (AWS Secrets Manager, HashiCorp Vault), and PII detection (Google DLP API, regular expressions). Monitor for PII leaks: audit logs, security scans, user reports. Breaches happen; respond quickly (notify users, fix the issue, prevent recurrence).
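A minimal sketch of PII redaction before logging or adding text to a training set, using illustrative regex patterns; production systems use vetted detectors (e.g., DLP APIs) rather than hand-rolled patterns like these, which miss many PII forms.

```python
import re

# Illustrative patterns only; real detectors cover far more PII types.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders so logs and training
    data keep their structure without keeping the sensitive values."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

clean = redact("Contact jane@example.com or 555-867-5309 about SSN 123-45-6789.")
```

Typed placeholders (rather than blank deletion) preserve enough context that redacted text remains usable for debugging and model training.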
Attack vectors: jailbreaking (bypassing safety filters), data exfiltration (tricking the model into revealing system prompts or user data), privilege escalation (making agents perform unauthorized actions), and supply chain attacks (poisoned training data or compromised tool outputs).
Defense strategies: input sanitization (detect and filter injection patterns), privilege separation (LLMs have minimum necessary permissions), output validation (verify actions before execution), instruction hierarchy (system instructions take precedence over user content), and canary tokens (detecting if system prompts are leaked). Use defense-in-depth; no single technique is sufficient.
For production systems: treat all LLM-processed content as untrusted input (same as SQL injection prevention mindset). Implement logging and monitoring for suspicious patterns. Regular red-team testing. Consider using specialized security models as a pre-filter layer. The OWASP Top 10 for LLM Applications provides a comprehensive threat model.
Data Pipelines
Tools: Apache Airflow (workflow orchestration), dbt (transform in warehouse), Talend, Informatica. For AI: pipelines ingest raw data (documents, user logs), transform it (clean, tokenize, chunk), and load embeddings into a vector store for RAG or formatted examples for fine-tuning. Typical flow: data source → normalize → deduplicate → chunk → embed → vector store.
Patterns: scheduled (run daily, hourly), event-driven (file uploaded → process), or streaming (continuous). Implement backpressure (slow down if downstream system is overloaded), retries (network failures are common), and monitoring (data quality, latency, error rates).
Challenges: data quality (missing values, inconsistencies), schema changes (new fields require code updates), and scale (handling terabytes efficiently). Testing: validate data at each stage, compare results to expectations (row counts, aggregate values). Production pipelines should be robust (handle failures gracefully) and observable (log/alert on issues).
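The normalize → deduplicate → chunk stages of such a pipeline can be sketched as plain functions. The chunk sizes and the sample data below are illustrative; real pipelines add near-duplicate detection (MinHash, embeddings) and token-aware rather than character-based chunking.

```python
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.split())      # collapse whitespace artifacts

def dedupe(docs: list[str]) -> list[str]:
    """Exact dedupe by content hash; near-duplicate detection is the
    usual next step."""
    seen, out = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out

def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks with overlap so context isn't cut
    mid-idea at chunk boundaries."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

raw = ["Refund  policy:\n returns within 30 days.",
       "Refund policy: returns within 30 days."]
docs = dedupe([normalize(d) for d in raw])
chunks = [c for d in docs for c in chunk(d, size=25, overlap=5)]
```

Note how normalization happens before deduplication: the two raw documents differ only in whitespace, so normalizing first lets the hash-based dedupe catch them.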
Patterns: source → stream broker (Kafka) → processing (filter, aggregate, enrich) → sink (database, dashboard). Kafka stores messages; multiple consumers can read independently. Flink supports complex operations (windowing, joins, machine learning). Popular pattern: log events to Kafka, consume with multiple services (analytics, ML training, user notifications).
For AI: stream user queries to track patterns, monitor LLM performance in real time, or continuously re-rank and improve recommendations. Challenges: exactly-once delivery (avoiding duplicate processing), ordering (some events must be processed in order), and late arrivals (how do you handle a delayed event?).
Trade-offs: streaming adds complexity; batch is simpler. Start with batch, move to streaming when you need real-time insights. Cost: Kafka clusters are expensive (need redundancy, high throughput). Cloud streaming (Kinesis) is simpler but pricier. Invest in streaming when business requires real-time, not just because it's cool.
For RAG: chunk documents, remove duplicates (similar documents waste retrieval capacity), and clean formatting (PDFs with weird spacing). For fine-tuning: balance datasets (equal examples of each class), remove outliers, and format as conversation/instruction-response pairs. Handle long documents (split or summarize before training).
Data quality matters hugely: garbage input = garbage output. Implement validation (check for required fields, data type validation), and filtering (remove examples below quality threshold). Evaluate on sample: does preprocessed data look reasonable? Manual spot-checks catch issues automation misses.
Cost: preprocessing is time-consuming (bulk of data pipeline effort). Automate what you can (scripts), sample and validate manually. For large datasets, use sampling to test pipelines before processing everything. Monitor data distribution: if training distribution shifts, model may degrade. Retrain periodically with fresh data.
Examples: Feast, Tecton, Databricks Feature Store. Typical architecture: batch feature computation (daily) stores features in a database (Redis for fast access), inference queries fetch features. Alternative: compute features on-the-fly (slower, more flexible).
For AI: if you're doing ML beyond LLMs (recommendation models, scoring), features matter. For LLM-only applications, less critical (LLMs handle feature engineering). If using feature stores, they reduce model deployment friction (features available at serving time) and improve reproducibility (same features for training and inference).
When to use: you have many features (>100), multiple teams using same features, or strict latency requirements. Overkill for simple use cases. Start simple (compute features in application code), graduate to feature store if complexity grows. Cost: feature stores have overhead (infrastructure, operational complexity).
Data contracts formalize expectations between data producers and consumers using schema definitions, SLAs, and quality checks. Tools like Great Expectations and Pandera let you define validation rules programmatically: column types, value ranges, null percentages, regex patterns, and cross-column relationships. Run validations at ingestion time to catch issues before they propagate.
For LLM-specific data quality: validate training data for label accuracy, check for PII contamination, detect near-duplicates that skew training, and verify instruction-response alignment. For RAG systems, validate document parsing quality, check chunk coherence, and monitor embedding distribution shifts.
Production patterns: implement data quality dashboards with trend monitoring, set up alerts for quality degradation, and establish data quarantine zones for failed validation records. Track data lineage so you can trace issues from model outputs back to source data. Budget 20-30% of pipeline development time for quality infrastructure.
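A hand-rolled illustration of record-level data contracts with a quarantine zone; Great Expectations and Pandera provide the production versions of these checks, and the field names below are hypothetical instruction-tuning fields.

```python
def validate_record(rec: dict) -> list[str]:
    """Return contract violations for one training record."""
    errors = []
    if not isinstance(rec.get("instruction"), str) or not rec["instruction"].strip():
        errors.append("instruction: missing or empty")
    if not isinstance(rec.get("response"), str) or len(rec.get("response", "")) < 5:
        errors.append("response: missing or too short")
    score = rec.get("quality_score")
    if score is not None and not (0.0 <= score <= 1.0):
        errors.append("quality_score: out of range [0, 1]")
    return errors

def partition(records):
    """Route failing records to quarantine instead of silently dropping."""
    clean, quarantine = [], []
    for rec in records:
        (quarantine if validate_record(rec) else clean).append(rec)
    return clean, quarantine

clean, quarantine = partition([
    {"instruction": "Summarize the doc", "response": "A short summary.",
     "quality_score": 0.9},
    {"instruction": "", "response": "ok", "quality_score": 1.4},
])
```

Keeping the quarantined records (with their violation messages) is what makes lineage tracing possible: a bad model output can be walked back to the specific records and the specific checks they failed.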
Approaches by complexity: Simple text extraction (PyPDF, python-docx) for digital-native documents. OCR-based (Tesseract, AWS Textract, Google Document AI) for scanned documents and images. Layout-aware parsing (Unstructured.io, LlamaParse) preserves document structure including headers, tables, lists, and reading order. Vision-based parsing uses multimodal LLMs to interpret complex layouts directly.
Table extraction deserves special attention: tables are notoriously hard to parse. Camelot and Tabula work well for simple PDF tables. For complex or nested tables, LLM-based extraction or specialized tools like AWS Textract Tables provide better results. Always validate extracted tables against the source.
Production considerations: build a parsing pipeline that routes documents to appropriate parsers based on file type and complexity. Cache parsed results (parsing is expensive). Implement quality checks: compare extracted text length against expected ranges, verify structural elements were preserved, and spot-check random documents regularly. Handle failures gracefully — some documents will always parse poorly.
Backend Engineering
For AI applications: FastAPI is excellent for wrapping LLM services, RAG systems, or inference endpoints. Use async to handle many concurrent requests without threads. Integrate with LangChain, streaming responses, dependency injection for database connections.
Production setup: use Uvicorn (ASGI server), deploy in Docker containers, implement load balancing (multiple FastAPI instances behind nginx), and monitoring. Add middleware for logging, error handling, rate limiting. Structure: main app file, routers for endpoint groups, models for request/response schemas, and dependencies for shared logic.
Common patterns: authentication (verify API keys), error handling (custom exception handlers), caching (Redis), and background tasks (Celery for async work). Testing: pytest with a test client. Don't over-engineer; FastAPI enables rapid development. Scaling: if one server handles 100 requests/sec, add more servers behind a load balancer.
Downsides: JavaScript runs on a single thread (the event loop interleaves requests without blocking on I/O, but CPU-heavy work stalls everyone), and there is less type safety than in typed languages (use TypeScript to add types). Popular stack: Express (framework), TypeScript, PostgreSQL (database), Redis (caching).
For AI: Node.js works well for orchestrating microservices (calling Python LLM services, vector DBs). Express middleware handles auth, logging, rate limiting. WebSocket support is native (good for streaming LLM responses). However, CPU-intensive work (embeddings, inference) is better in Python.
Production: use Node cluster (multiple processes), containerize in Docker, monitor with tools like New Relic. Scaling: horizontal scaling (multiple Node instances) is simple. Testing: Jest for unit tests. Common pattern: Node handles HTTP routing, delegates AI work to Python services via REST or gRPC.
Why it matters: a single server can handle thousands of concurrent connections (not creating a thread per request, which is expensive). Ideal for I/O-bound operations (network requests, database queries, waiting). Not helpful for CPU-bound work (inference needs threads or separate processes).
For AI apps: async is critical for concurrency. Example: receive 100 requests, spawn 100 async tasks that call the LLM API, wait for all results. With async, the same thread handles all 100. Without it, you'd need 100 threads (memory overhead, complexity).
Best practice: async for I/O (retrieval, LLM calls, database), dedicated workers for CPU-bound work (inference on GPU). Monitor concurrent connections, response times, and resource usage. Common mistakes: mixing sync and async code (risking deadlocks), making blocking calls inside async functions (which stalls the event loop), and failing to await tasks (work is silently dropped).
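The 100-concurrent-requests example above can be sketched with asyncio; `call_llm` is a stand-in for a real async API call, and the `await` points are exactly where the event loop switches between tasks.

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Stand-in for an async LLM API call; while this task awaits, the
    event loop runs other tasks instead of blocking a thread."""
    await asyncio.sleep(0.01)          # simulated network latency
    return f"answer to: {prompt}"

async def handle_batch(prompts: list[str]) -> list[str]:
    # All calls run concurrently on one thread; total wall time is roughly
    # one call's latency, not the sum of all of them.
    return await asyncio.gather(*(call_llm(p) for p in prompts))

results = asyncio.run(handle_batch([f"q{i}" for i in range(100)]))
```

Replacing `asyncio.sleep` with a blocking `time.sleep` here would serialize all 100 calls, which is the "blocking call inside an async function" mistake in miniature.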
For AI: Postgres stores application data (users, documents, metadata), handles transactional consistency (important for financial, legal data). Use for: document management (chunks, embeddings metadata), audit logs, and business data. Redis caches: LLM responses, search results, session state, rate limit counters.
Typical pattern: application queries Postgres (slower but persistent), caches results in Redis (fast hits). On cache miss, query Postgres, update Redis. Set expiration (TTL) on cache keys. For vector data, use dedicated vector DBs (Pinecone, Weaviate), not Postgres (unless using pgvector extension).
Operations: backup Postgres regularly (data loss is catastrophic), monitor slow queries, tune indexes. Scale Postgres with read replicas (for read-heavy workloads). Redis is simpler (less state), but data loss is acceptable (cache can be regenerated). Use managed services (AWS RDS for Postgres, ElastiCache for Redis) for reduced ops burden.
Advantages: independent scaling (scale retrieval separately from LLM), technology diversity (use Python for ML, Node for API), fault isolation (if embedding service is down, retrieval still works). Disadvantages: complexity (distributed systems are hard), latency (inter-service calls are slow), and operational overhead.
When microservices help: multiple teams, different scaling needs, technology choices, or independent deployment. Overkill for startups or simple applications. Start monolithic, split into services as complexity grows. Common mistake: premature microservices (adds complexity before it's needed).
Tools: Docker for containerization, Kubernetes for orchestration, gRPC for efficient communication, and service mesh (Istio) for reliability (retries, circuit breakers). Monitoring is critical: trace requests across services, identify where latency happens, catch failures. Start simple, graduate to microservices.
Server-Sent Events (SSE) is a simpler alternative for one-way streaming (server to client). Most LLM APIs use SSE for token streaming. SSE works over standard HTTP, is automatically reconnectable, and is simpler to implement than WebSockets. Use SSE for LLM response streaming; use WebSockets for bidirectional communication (real-time chat with typing indicators, multi-user collaboration).
Implementation with FastAPI: use StreamingResponse for SSE, or WebSocket endpoints for bidirectional communication. Handle the connection lifecycle (connect, message, disconnect, error). Implement heartbeats to detect stale connections. For scale, use a pub/sub layer (Redis Pub/Sub) to broadcast messages across multiple server instances.
Production considerations: WebSocket connections are stateful, making horizontal scaling more complex. Use sticky sessions or a connection registry. Implement reconnection logic on the client side with exponential backoff. Monitor connection counts and memory usage. Set connection timeouts to prevent resource leaks. Consider connection limits per user to prevent abuse.
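The SSE wire format mentioned above is simple enough to frame by hand: each event is an optional `event:` line plus a `data:` line, terminated by a blank line. In FastAPI, a generator like the one below would back a StreamingResponse with `media_type="text/event-stream"`; the framework import is omitted here to keep the sketch self-contained.

```python
import json

def sse_event(data: dict, event: str = "") -> str:
    """Frame one Server-Sent Event: optional `event:` line, a `data:`
    payload line, and a blank-line terminator, per the SSE format."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"

def stream_tokens(tokens):
    """Generator yielding one SSE frame per token, then a done marker."""
    for tok in tokens:
        yield sse_event({"token": tok})
    yield sse_event({}, event="done")

frames = list(stream_tokens(["Hel", "lo"]))
```

An explicit `done` event lets the client distinguish a finished stream from a dropped connection, which matters for the client-side reconnection logic described above.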
Role-Based Access Control (RBAC) defines what resources and actions each role can access. For AI applications, this extends to: which models users can access, token quotas per role, which tools agents can use, and which data sources are available for RAG retrieval. Implement at both the API gateway and application level.
AI-specific considerations: tool-level permissions (restrict which tools an agent can call based on user role), data access scoping (RAG retrieval filtered by user's document permissions), cost quotas (limit expensive model usage per user/team), and audit logging (track all LLM interactions for compliance).
Production patterns: use middleware for auth (FastAPI dependencies, Express middleware). Implement API key rotation. Use short-lived JWTs with refresh tokens. Never expose API keys to frontend code. For multi-tenant AI systems, ensure tenant isolation at every layer: separate API keys, namespaced vector stores, and scoped tool access. Rate limit by authenticated identity, not just IP.
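The tool-level permissions, model access, and cost quotas above can be sketched as an application-level check; all role names, model names, and quota values here are illustrative assumptions:

```python
# Toy RBAC gate for an AI app: which models and tools a role may use,
# plus a per-role daily token quota. Policy values are made up.
from dataclasses import dataclass
from typing import Optional

ROLE_POLICY = {
    "viewer":  {"models": {"small-model"}, "tools": set(), "daily_tokens": 50_000},
    "analyst": {"models": {"small-model", "large-model"}, "tools": {"search_docs"}, "daily_tokens": 500_000},
    "admin":   {"models": {"small-model", "large-model"}, "tools": {"search_docs", "run_sql"}, "daily_tokens": 2_000_000},
}

@dataclass
class User:
    name: str
    role: str
    tokens_used_today: int = 0

def authorize(user: User, model: str, tool: Optional[str], tokens_requested: int) -> None:
    """Raise PermissionError unless the user's role allows this request."""
    policy = ROLE_POLICY[user.role]
    if model not in policy["models"]:
        raise PermissionError(f"role {user.role} may not use {model}")
    if tool is not None and tool not in policy["tools"]:
        raise PermissionError(f"role {user.role} may not call {tool}")
    if user.tokens_used_today + tokens_requested > policy["daily_tokens"]:
        raise PermissionError("daily token quota exceeded")
```

In production the same check would run in middleware (e.g. a FastAPI dependency) with the policy loaded from a database, not hard-coded.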
Frontend (Minimum Required)
For AI applications: build chat interfaces (message list, input box), search UIs (query input, results, facets), and streaming responses (show LLM output as it arrives). Use hooks (useState, useEffect) to manage state and side effects. Libraries: react-query for data fetching, zustand for global state.
Patterns: component hierarchy (parent passes data to children), event handlers (onClick, onChange), and conditional rendering (show/hide based on state). Avoid common mistakes: prop drilling (passing props through many levels, use context instead), re-rendering on every state change (use useMemo, useCallback).
Tools: Create React App for setup, Next.js for full-stack (frontend + backend in one codebase), Tailwind for styling. Testing: Jest for unit tests, React Testing Library for component tests. Deployment: Vercel for Next.js, AWS S3+CloudFront for static sites. Performance: code splitting, lazy loading, and monitoring.
For AI apps: TypeScript improves code quality and refactoring safety. Type interfaces for API contracts ensure frontend and backend align. Example: define ChatMessage interface, use it everywhere. IntelliSense in IDEs provides autocompletion.
Setup: transpile TypeScript to JavaScript before running. Build tools (webpack, esbuild) handle this. Cost: slightly longer development (writing types) but saves debugging time (type errors caught early). Recommended for teams; less valuable for solo projects where you're the only reader.
Common patterns: interfaces for data structures, and generics for reusable components (e.g., Component&lt;Props&gt;).
Patterns: loading state (show spinner while fetching), error handling (show error message on failure), retry logic (retry failed requests), and caching (don't refetch same query). Use react-query or SWR to simplify: they handle loading, error, caching, and deduplication automatically.
For AI: streaming responses require special handling. Fetch response as stream, read chunks, update UI as data arrives. WebSocket for bidirectional communication (e.g., real-time collaboration). TypeScript types for API responses prevent runtime errors.
Best practices: validate server responses (don't trust API contracts blindly), use request IDs for debugging, implement timeouts (don't wait forever), and log errors (which APIs fail?). Monitor: track API latency (users wait), error rates, and usage patterns. CORS (cross-origin) can complicate frontend-backend communication; handle properly.
Technical: use Fetch API with streaming response, read chunks, parse JSON (if applicable), update state (React setState or state management), and re-render. Example: LLM generates tokens, each sent as JSON line, frontend appends to chat message. Handle network errors (stream interruption) gracefully.
Libraries: OpenAI SDK abstracts streaming (js client), tRPC for type-safe streaming, or raw fetch if simple. Performance: avoid re-rendering entire message on each token (expensive), append to existing message. Some frameworks (Next.js with React Server Components) streamline this.
Challenges: handling interruptions (user clicks cancel), ensuring tokens don't drop, and managing backpressure (don't accumulate too many updates). Test with slow networks (throttle in browser devtools) to ensure experience is good. Analytics: track how often users cancel (if high, LLM might be too slow).
Essential patterns: Streaming responses (show tokens as they arrive, reducing perceived latency), loading skeletons (indicate processing is happening), confidence indicators (show how certain the AI is), source citations (link claims to source documents), and edit/regenerate controls (let users refine outputs). Progressive disclosure: show the answer first, details on demand.
Feedback mechanisms are critical: thumbs up/down on responses, inline correction, and "report incorrect" flows. This data feeds back into evaluation and fine-tuning. Design for graceful degradation: when the AI fails or is uncertain, provide helpful fallbacks (suggest related topics, offer to connect with a human, or show raw search results).
Anti-patterns to avoid: hiding that content is AI-generated, showing raw JSON/errors to users, blocking the UI during long LLM calls, not providing any way to give feedback, and over-trusting AI outputs without human verification options. Accessibility matters: screen readers need to handle streaming text, and auto-scrolling should be controllable.
Pre-Sales / Solutions Engineering
Key questions: What problem are they solving? What's the current process? What's the business impact? What constraints (budget, timeline, tech stack)? Red flags: vague requirements, unclear success metrics, constantly changing asks. Push back: clarify and document agreements.
Output: requirements document (features needed, success criteria, timeline, budget), user stories (as a X, I want Y so that Z), and acceptance criteria (how to verify it works). Involve technical team (can we build this?) and customers (is this what you want?).
Common mistakes: accepting vague requirements (leads to scope creep), over-committing (yes to everything), not documenting (memory is unreliable), and assuming understanding (clarify assumptions). Best practice: iterative refinement (revisit requirements monthly as understanding deepens), test assumptions with prototypes.
Process: gather requirements, sketch architecture (whiteboard), analyze trade-offs (speed vs cost vs simplicity), document design (diagrams, specifications), and validate with customer. Common patterns: monolith (simple, for startups), microservices (scalable, complex), or hybrid.
For AI solutions: consider retrieval (vector DB, freshness), inference (model, latency budget), data pipeline (ETL, quality), and monitoring (eval metrics, costs). Example: RAG chatbot needs document ingestion pipeline, vector DB, LLM API, frontend, and observability stack.
Documentation: architecture decision records (ADRs) capture why you chose X over Y. Diagrams: system architecture (boxes and arrows), data flow (where does data move?), and deployment (how does it run?). Review with team: fresh eyes catch issues. Revisit as requirements change.
Key practices: be specific (vague SOWs lead to disputes). Example: "Build AI chatbot" is vague; "Build RAG chatbot for 100 FAQ documents, supporting 50 concurrent users, deployed on customer's AWS account" is specific. Define success: how will you measure completion? Include acceptance criteria.
Timeline: break into phases with milestones. Example: Phase 1 (weeks 1-4): proof-of-concept with 10 documents, Phase 2 (weeks 5-8): scale to 100 documents and production deployment. Include buffers; software always takes longer than expected.
Common pitfalls: scope creep (customer keeps asking for more), not pricing for uncertainty (if you're unsure, add buffer). Mitigations: document scope carefully, have change order process (additional features = additional cost), and communicate regularly (avoid surprises). First-time scoping is hard; factor in learning.
For engineers: estimate in story points (relative complexity, not time) or hours. Break large features into smaller tasks (days worth of effort), estimate each, sum. Build confidence: estimate, measure actual, compare, improve estimates over time. Categories: new feature (high uncertainty), bug fix (lower uncertainty), infrastructure work (varies).
Buffers: pad estimates by 20-50% for unknowns. Communicate uncertainty: "1-2 weeks" is better than "5 days" if truly uncertain. Common mistakes: over-optimism (ignoring risks, testing, debugging), and under-estimation (saying "yes" without thinking through details).
For AI projects: add buffers for model selection, data quality issues, and eval cycles. Proof-of-concept (POC) work is expensive (lots of exploration). For customers, present best/worst/most-likely estimates (planning fallacy is real). Track actual vs estimated to improve future estimates. Over-communicating about uncertainty is better than broken commitments.
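One common way to combine the best/worst/most-likely figures mentioned above is the three-point (PERT) estimate; the weight of 4 on the most-likely value is the standard convention:

```python
def pert_estimate(best: float, likely: float, worst: float) -> float:
    """Three-point (PERT) estimate: weighted mean biased toward the likely case."""
    return (best + 4 * likely + worst) / 6

# A task estimated at best 3, likely 5, worst 13 days comes out at 6 days,
# pulling the headline number above the optimistic "likely" figure.
```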
Example: "AI chatbot reduces support tickets by 30%, saving $500K/year in support staff. Implementation costs $200K, ongoing costs $50K/year. ROI: ($500K - $50K) / $250K = 180% annually." Customers need this to justify purchase.
How to calculate: identify financial impact (reduce time per task? eliminate headcount? increase revenue?), measure baseline (current cost, time), estimate improvement (how much faster with AI?), and multiply by volume (X tasks/year * Y savings/task). Be conservative (if uncertain, use lower estimate).
Communicate: present ROI prominently in sales deck. Include break-even timeline (when does customer recover investment?). Address risks: what if model accuracy is lower? What if adoption is slow? Sensitivity analysis: how does ROI change if key assumptions shift? Customers are skeptical of unrealistic projections; credible, conservative estimates build trust.
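The worked chatbot example above reduces to a short calculation (figures taken from the text; first-year cost = implementation plus one year of running costs):

```python
def first_year_roi(annual_savings: float, implementation_cost: float,
                   annual_running_cost: float) -> float:
    """First-year ROI: net annual benefit over total first-year cost."""
    net_benefit = annual_savings - annual_running_cost
    first_year_cost = implementation_cost + annual_running_cost
    return net_benefit / first_year_cost

# Chatbot example: $500K saved, $200K to build, $50K/year to run.
roi = first_year_roi(500_000, 200_000, 50_000)  # 1.8, i.e. 180% annually
```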
Enterprise Integration
For AI integration: AI can enrich data (find contact email, company size from public sources), score leads (predict which leads are most likely to convert), automate workflows (send email when deal reaches certain stage), or summarize (summarize customer history before support call).
Technical: Salesforce has APIs (REST, SOAP, GraphQL) for programmatic access. Use them to build integrations: external AI service analyzes customer data, writes results back to Salesforce. Authentication via OAuth 2.0. Consider Salesforce Einstein (built-in AI features).
Common: mid-market and enterprise companies use Salesforce. If you sell to enterprises, understanding Salesforce basics helps close deals faster (some customers require Salesforce integration). The APIs are well-documented, but the platform is complex (steep learning curve). Hire consultants or use an integration platform (Zapier) if building a full custom integration is overkill.
For AI: automate order-to-cash (AI predicts which invoices will be late, flags for collections), demand forecasting (AI predicts demand, feeds into supply chain), and maintenance (AI predicts equipment failure, schedules maintenance). Integrations are complex: ERPs have legacy APIs (XML, SOAP), complex data models, and strict change control.
Challenges: ERPs are mission-critical (can't risk downtime), have complex data schemas (months to fully understand), and change slowly. Integrations typically use middleware (MuleSoft, Boomi, iPaaS) for data mapping and transformation. Large enterprises have dedicated integration teams.
Selling AI to enterprises usually involves ERP integration. Understanding ERP basics helps conversations. Don't underestimate complexity: ERP integrations are typically 20-30% of project effort. Plan accordingly in SOW.
For AI integrations: Salesforce pushes deal data via webhook → AI service scores lead → scores pushed back to Salesforce via API. Webhooks reduce latency (no polling delay) and load (constant polling is inefficient). Secure webhooks: verify requests (HMAC signature), authenticate, rate limit.
Challenges: webhook delivery isn't guaranteed (network fails, receiver is down). Mitigate: implement retries with exponential backoff, track delivery status, and idempotency (processing same event twice is safe). Scale: webhooks fire often; need to handle load (queue incoming webhooks, process asynchronously).
Tools: for building webhooks, use signing libraries (verify request authenticity), queuing systems (RabbitMQ, SQS), and monitoring (track delivery rates). Most SaaS apps support webhooks; Zapier and IFTTT use webhooks to integrate disparate services. For enterprises, webhooks are preferred over polling (more real-time, less load).
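The HMAC signature verification mentioned above fits in a few lines of standard library code; the header name and secret-distribution mechanism vary by provider:

```python
import hashlib
import hmac

def sign_payload(payload: bytes, secret: bytes) -> str:
    """HMAC-SHA256 signature the sender attaches (e.g. in an X-Signature header)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_webhook(payload: bytes, signature: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time to resist timing attacks."""
    expected = sign_payload(payload, secret)
    return hmac.compare_digest(expected, signature)
```

Always sign the raw request body bytes, before any JSON parsing, since re-serialization can change the bytes and break verification.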
For AI apps: authenticate users (verify they're who they claim), authorize actions (can they access this document?), and audit (who accessed what?). OAuth 2.0 is popular for B2C (users login with Google/GitHub). SAML is standard in enterprises (single sign-on to many apps).
Challenges: password breaches (users choose weak passwords, reuse across sites), MFA fatigue (users resent repeated prompts), and session management (how long do login sessions last?). Mitigations: never store plaintext passwords (hash with bcrypt, argon2), enforce MFA for privileged access, and implement session timeouts.
For customer-facing apps: support OAuth (easier for users, better security). For internal tools: SAML/SSO (users have one password, works for all apps). Use identity services (Auth0, Okta) to avoid building from scratch; these are hard to get right. Compliance: GDPR requires data portability (users can export identity data).
Implementation patterns: Namespace isolation in vector databases (separate namespaces per tenant), row-level security in relational databases, filtered retrieval (always include tenant_id in RAG queries), and separate model deployments for highest-security tenants. Never mix tenant data in the same LLM prompt or fine-tuning dataset.
For RAG systems specifically: tag all chunks with tenant metadata at ingestion. Apply mandatory tenant filters on every retrieval query. Validate that retrieved chunks belong to the requesting tenant before injecting into prompts. Audit retrieval logs for cross-tenant access. Consider separate vector collections per tenant for the strictest isolation.
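The mandatory-filter-plus-validation pattern can be sketched with a toy chunk store standing in for a real vector database; the chunk format and scoring function are illustrative assumptions:

```python
# Toy multi-tenant retrieval: filter by tenant_id BEFORE scoring,
# then re-validate results before they reach the prompt.
def retrieve_for_tenant(chunks, tenant_id, score_fn, top_k=3):
    # Mandatory tenant filter: never even score another tenant's chunks.
    candidates = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(candidates, key=lambda c: score_fn(c["text"]), reverse=True)[:top_k]
    # Defense in depth: validate again before injecting into the prompt.
    for chunk in ranked:
        if chunk["tenant_id"] != tenant_id:
            raise RuntimeError("cross-tenant leak detected")
    return ranked
```

In a real system the filter would be a metadata predicate pushed down into the vector DB query, but the double-check before prompt assembly is still worth keeping.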
Compliance considerations: some regulations (HIPAA, GDPR) require data residency (data stays in specific regions), encryption at rest and in transit, and right-to-deletion. Implement tenant data export and deletion capabilities. When using third-party LLM APIs, understand their data retention policies and whether prompts are used for training. Many enterprises require zero-retention agreements.
Performance & Cost Optimization
Example: RAG with an expensive model costs $0.01 per query (100 tokens in, 50 out; $0.002 input, $0.008 output). Optimizations: switch to a cheaper model (-50% cost), tighter retrieval so less context is injected (-20%), and response caching for repeated queries (a 50% cache hit rate halves average cost). These factors compound multiplicatively, cutting cost to roughly $0.002/query, about 80% savings.
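Note that the savings multiply rather than add; running the example's numbers:

```python
base = 0.002 + 0.008                  # $/query: input cost + output cost
after_model   = base * 0.50           # cheaper model: -50%
after_context = after_model * 0.80    # leaner retrieved context: -20%
after_cache   = after_context * 0.50  # 50% cache hit rate halves average cost
savings = 1 - after_cache / base      # ~0.80, i.e. roughly 80% total
```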
Tracking: monitor token usage by feature (which features cost most?), by model (gpt-4 vs gpt-4-turbo), and by user (which users are cost-heavy?). Alert on anomalies (sudden spike). Implement quotas (prevent runaway costs). Cost ≠ quality; cheaper models often work just as well, especially with good prompting.
Strategy: test multiple models on your eval dataset. Results often surprise: cheaper models (GPT-4-turbo) beat expensive ones (GPT-4) on specific tasks. Use cost metrics in your eval (latency, quality, cost). Prioritize: if cost is main concern, optimize via model selection and caching first (faster, easier than architecture changes).
Considerations: latency (user waiting?), cost (budget constraints?), quality (accuracy requirements?), compliance (can data leave company?), and customization (fine-tuning support?). Open-source (Llama, Mistral) vs commercial (OpenAI, Claude, Cohere) trade-off control for convenience.
Strategy: start with strong baseline (GPT-4, Claude 3 Opus). If too slow/expensive, test smaller alternatives. Create eval harness: same 100 test cases, run all models, compare quality, latency, cost. Decision matrix: rank models by importance (quality weighted highest? cost?). Re-evaluate quarterly (new models appear frequently).
Common mistake: choosing based on popularity, not data. "Everyone uses GPT-4" doesn't mean it's best for you. Another: choosing once and never re-evaluating. Models improve; reassess periodically. Multi-model strategy: use expensive model for complex tasks, cheap for simple (reduces average cost).
Strategy: cache full LLM responses for deterministic tasks (data extraction, classification). For generative tasks (creative writing), caching is less useful (output changes with temperature). Retrieval results: if source documents don't change frequently, caching is high-impact. Embeddings: pre-compute for static documents.
Implementation: use Redis for distributed caching, implement cache-aside pattern (check cache, miss goes to LLM, write result back). For RAG, caching retrieval results is often higher-impact than caching full responses because retrieval is deterministic and stable. Monitor: cache hit rate, hit latency, miss latency. Aim for 40-60% hit rate on most workloads.
Pitfalls: cache misses are slower than no cache at all (you pay the cache lookup, then still call the LLM). Staleness (old cached data misleads users). Wrong TTL (too short means misses; too long means stale data). Measure: compare with and without caching. If the hit rate is below 20%, caching isn't helping; remove it.
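The cache-aside pattern described above can be sketched with an in-process TTL cache standing in for Redis:

```python
import hashlib
import time

class TTLCache:
    """In-process stand-in for Redis: entries expire after ttl seconds."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]   # lazily evict expired entries
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)

def cached_llm_call(cache, prompt, llm_fn, ttl=300.0):
    """Cache-aside: check the cache first, fall through to the LLM on a miss."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit                  # hit: no model call, no cost
    result = llm_fn(prompt)         # miss: call the model
    cache.set(key, result, ttl)     # write back for future requests
    return result
```

This is exact-match caching; a semantic cache would key on embedding similarity instead of a hash of the prompt.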
Quantization methods: Post-training quantization (PTQ) converts after training (fast, slight quality loss). Quantization-aware training (QAT) trains with quantization in mind (better quality, more expensive). GPTQ and AWQ are popular PTQ methods optimized for LLMs. GGUF format (used by llama.cpp) enables efficient CPU inference of quantized models.
Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs. The student learns the teacher's soft probability distributions, not just hard labels, capturing nuanced knowledge in a fraction of the parameters. This can produce models that are 5-10x smaller with 90%+ quality retention for specific tasks.
Pruning removes unnecessary weights (structured or unstructured) based on magnitude or importance scores. Combined with distillation and quantization, you can achieve 50-100x compression for deployment. Production considerations: always benchmark quantized models against full-precision on YOUR tasks. Use tools like vLLM, TensorRT-LLM, or llama.cpp for optimized inference.
Advanced Topics (High Impact)
Techniques: classification model predicts task difficulty, router selects model. Or: cascade (try cheap model, if confidence low, use expensive). Or: ensemble (run multiple models, aggregate). Each adds complexity but can significantly reduce costs.
Implementation: create decision tree (task type → model), classify incoming requests (which model fits?), and route. Log routing decisions to debug (is classifier making right calls?). A/B test: compare single-model vs multi-model on cost and quality.
Common: finance uses multi-model (simple balance inquiry → GPT-3.5, complex analysis → GPT-4). Customer support uses multi-model (FAQ → retrieval only, complex issue → full agent with tools). Requires benchmark on your workload to justify complexity. Start simple, add multi-model if cost is main issue.
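The cascade variant above can be sketched as follows; the models are stand-in callables returning (answer, confidence), and the 0.7 threshold is an illustrative assumption you would tune on your own eval data:

```python
def cascade_route(prompt, cheap_model, expensive_model, threshold=0.7):
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = expensive_model(prompt)
    return answer, "expensive"
```

Logging which branch each request took (the second return value) gives you the data to check whether the router earns its complexity.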
Challenges: hallucination (agents confidently execute wrong actions), error propagation (one mistake cascades), and runaway costs (agents get stuck in loops, making expensive API calls). Mitigations: timeout controls (max N steps), tool validation (don't execute invalid actions), and human approval loops (agent proposes, human approves).
Implementation: frameworks like AutoGen, CrewAI, or LangGraph provide agent primitives. Define tools available to agent, success criteria, and failure modes. Test extensively (what happens if tool fails? if context overflows?). Monitoring: track agent behavior (which actions are taken?), success rate, and cost.
Current limitations: agents work for constrained, well-defined problems. Open-ended goals (write a book) still struggle (agents go off-topic, lack focus). Regulation: autonomous systems face scrutiny (explainability, liability). Most production autonomous systems are highly constrained (specific domain, validated workflows).
Techniques: health checks (monitor system state), anomaly detection (quality metrics drop?), automatic recovery (restart service, fall back, retry), and escalation (if automatic recovery fails, alert human). Observability is critical: can't heal what you don't see.
For AI pipelines: monitor eval metrics (quality), latency, and cost. Alerts on regressions. Automatic mitigation: switch model, increase temperature, adjust retrieval parameters. Manual intervention: incident postmortems (why did it fail?), parameter tuning.
Challenges: false positives (alerts on temporary blips), complicated recovery logic (hard to implement correctly), and liability (automated actions must be safe). Common approach: automate detection, but require human approval for major actions. Over-automation can hide problems; balance automation against visibility.
Use cases: fine-tuning with limited real data (generate examples of desired behavior), testing (edge cases, adversarial inputs), and privacy (synthetic data for demos, not real data). Quality: synthetic data should be realistic and cover distribution.
For AI: generate instruction-response pairs for fine-tuning. Example: prompt LLM to generate customer support conversations, use as training data. Or: generate test queries for eval (edge cases that real users might ask). Monitor: evaluate model on synthetic data; if quality is different than real, investigate.
Challenges: distribution mismatch (synthetic data doesn't match real), bias (generation process has biases), and evaluation (how to assess quality?). Start simple (template-based), graduate to LLM-based if needed. Synthetic data is valuable but not a replacement for real data (model trained on synthetic only may not generalize).
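The "start simple (template-based)" option above can be sketched as slot-filling; the templates and slot values here are made-up examples:

```python
import random

def synth_queries(templates, slots, n, seed=0):
    """Fill templates with random slot values to produce n synthetic examples."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    return [
        rng.choice(templates).format(**{k: rng.choice(v) for k, v in slots.items()})
        for _ in range(n)
    ]

templates = ["How do I {action} my {thing}?", "Error when I {action} the {thing}"]
slots = {"action": ["reset", "cancel", "upgrade"], "thing": ["account", "subscription"]}
```

Graduating to LLM-based generation means replacing the template fill with a prompted model call, but the surrounding loop and seeding logic stay useful.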
Implementation: collect feedback (explicit: users rate; implicit: user behavior), label data, retrain periodically (nightly, weekly). Challenges: ensuring feedback is high-quality (if users rate incorrectly, model learns incorrectly), cold start (no feedback initially), and labeling cost.
For RAG: user clicks result → relevant feedback; skips result → irrelevant. Use feedback to retrain retrieval model. For LLM: user edits generations → valuable examples for fine-tuning (especially if domain-specific). For agents: execution traces + outcome (success/failure) used to improve action selection.
Pitfalls: feedback bias (users only rate edge cases), cascading errors (poor model generates biased data, feedback trains on bias), and expensive retraining. Mitigations: sample feedback (not all users), validate feedback quality (spot-check), and regularly re-baseline (recompute metrics on held-out test set). Feedback loops enable continuous improvement but require careful design.
The training process: create training examples that include the question, a mix of relevant ("oracle") and irrelevant ("distractor") documents, and the answer with chain-of-thought reasoning citing the relevant documents. The model learns to: identify which retrieved documents are useful, ignore distractors, and generate well-grounded answers with citations.
RAFT outperforms both pure RAG and pure fine-tuning on domain-specific benchmarks because it combines domain knowledge (from fine-tuning) with the ability to leverage fresh retrieved information (from RAG training). It's particularly effective for specialized domains like medical, legal, and technical documentation.
Implementation: requires a high-quality training dataset with question-document-answer triples. Use the target domain's documents as oracle sources. Generate distractors from the same corpus. Fine-tune using LoRA/QLoRA for efficiency. Evaluate on held-out questions with both seen and unseen documents to verify generalization. This approach is emerging as a best practice for enterprise AI applications.
Bias detection: test model outputs across demographic groups for disparate impact. Use metrics like demographic parity (equal positive prediction rates), equalized odds (equal true/false positive rates), and calibration (predicted probabilities match actual outcomes). Tools: AI Fairness 360, Fairlearn, and custom red-teaming. Bias can enter through training data, model architecture, or deployment context.
Transparency requires explainability (why did the model produce this output?), documentation (model cards describing capabilities and limitations), and user disclosure (making it clear when AI is being used). Accountability means having human oversight for high-stakes decisions, maintaining audit trails, and establishing clear escalation paths when AI systems fail.
For AI engineers in practice: implement bias testing in your evaluation pipeline, create model cards for all deployed models, establish human review processes for high-risk outputs, maintain comprehensive logging for audit trails, and stay current with evolving regulations (EU AI Act, state-level AI laws). Responsible AI isn't just ethics — it's increasingly a legal and business requirement.
MLOps & Model Lifecycle
MLflow is the most widely adopted open-source platform, providing experiment tracking, model registry, and deployment tools. Weights & Biases (W&B) offers superior visualization, team collaboration, and sweep (hyperparameter search) capabilities. Both integrate with major ML frameworks (PyTorch, TensorFlow, Hugging Face).
The model registry is a central repository for versioned models with metadata, stage transitions (staging → production → archived), and approval workflows. Every deployed model should be traceable back to its training run, data version, and code commit. This traceability is essential for debugging production issues and regulatory compliance.
Best practices: log everything automatically (use framework integrations), tag experiments with meaningful metadata, establish naming conventions, and set up automated comparisons. For LLM applications, track prompt versions alongside model versions — prompt changes are equivalent to model changes in their impact on output quality.
Canary deployment gradually routes a small percentage of traffic (1-5%) to the new model version, monitoring for regressions before increasing. This catches issues with minimal user impact. For LLM systems, monitor latency, error rates, user feedback scores, and key quality metrics during the canary phase.
Shadow deployment runs the new model alongside production without serving its outputs to users. Both models process the same requests, but only the production model's responses are returned. Outputs are compared offline. This is ideal for high-risk changes where you need extensive evaluation before any user exposure.
A/B testing deliberately splits traffic to compare model versions on business metrics (conversion rate, user satisfaction, task completion). Unlike canary (which is about safety), A/B testing is about measuring which version is better. Use statistical significance testing before declaring a winner. For LLM systems, A/B tests should run for at least 1-2 weeks due to output variability.
For LLM applications, drift manifests as: declining user satisfaction scores, increasing hallucination rates, lower task completion rates, or shifting topic distributions in user queries. Monitor both statistical metrics (embedding distribution distances, token probability distributions) and business metrics (user feedback, escalation rates).
Detection techniques: statistical tests (KS test, PSI) comparing current vs baseline distributions, window-based monitoring (compare rolling 7-day metrics against historical baselines), and automated evaluation (run periodic eval suites against production-like inputs). Set alerts for significant deviations.
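PSI over binned proportions is simple enough to implement directly; the 0.1/0.25 alert thresholds are common rules of thumb rather than hard standards:

```python
import math

def psi(baseline_props, current_props, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    return sum(
        (cur - base) * math.log((cur + eps) / (base + eps))
        for base, cur in zip(baseline_props, current_props)
    )
```

Bin the metric (e.g. query embedding cluster assignments, or response lengths), compute the proportion per bin for the baseline window and the current window, then compare.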
Response to drift: investigate root cause (new data patterns? provider model update? prompt degradation?), update evaluation datasets to reflect current reality, retrain or re-prompt as needed, and document the incident. For LLM API users, provider model updates (GPT-4 version changes) are a major drift source — always pin model versions and test before upgrading.
A typical LLM CI/CD pipeline: code linting → unit tests → integration tests → eval suite (run LLM against test cases, score with LLM-as-judge) → cost estimation → staging deployment → smoke tests → canary release → production. The eval suite is the critical gate — if quality metrics drop below thresholds, the pipeline fails.
Prompt versioning: treat prompts as code. Store in version control, review changes in PRs, and test automatically. A prompt change can be as impactful as a code change. Use tools like promptfoo or Braintrust for automated prompt evaluation in CI. Track prompt-model compatibility (a prompt optimized for GPT-4 may not work with Claude).
Challenges: LLM evaluations are slow (minutes, not seconds) and non-deterministic. Use parallelization, caching, and statistical significance tests. Set up nightly comprehensive evals (full test suite) and fast CI evals (subset of critical cases). Budget for eval costs (running LLM-as-judge in CI isn't free).
Key platforms: LangSmith (built by LangChain team, deep integration with LangChain/LangGraph), Langfuse (open-source, model-agnostic), Arize Phoenix (focus on embeddings and retrieval quality), and Helicone (lightweight proxy-based logging). All provide trace visualization, cost tracking, and evaluation capabilities.
Distributed tracing for LLM apps: each user request generates a trace containing spans for each operation (embedding, retrieval, LLM call, tool execution). This enables debugging complex chains: "Why was this response wrong?" → trace reveals the retrieval returned irrelevant chunks, or the prompt was malformed, or the model hallucinated despite good context.
What to log: input/output for every LLM call, token counts and costs, latency per operation, retrieval scores and chunks, tool call parameters and results, user feedback, and error details. Set up dashboards for: daily cost trends, latency percentiles (p50, p95, p99), error rates by type, and quality scores over time. Alert on anomalies.
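The latency percentiles for those dashboards can be computed with a nearest-rank helper (monitoring backends do this for you, but the definition is worth knowing):

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile, p in (0, 100]. Input need not be pre-sorted."""
    ordered = sorted(latencies)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```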
Interview Questions & Answers
20+ real-world questions with structured answers following the Problem → Approach → Architecture → Trade-offs → Production pattern
LLM / AI Fundamentals
Embeddings are learned dense vectors (e.g., 4096 dimensions) that represent tokens in a continuous semantic space. The embedding layer is a lookup table mapping each token ID to its vector. These vectors are trained with the model — semantically similar tokens end up near each other (king − man + woman ≈ queen). Embeddings capture meaning; tokenization does not.
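"Near each other" is usually measured with cosine similarity, which depends only on vector direction, not magnitude:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```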
tiktoken (OpenAI) or model-specific tokenizers for accurate counts. Be aware that code, non-English text, and structured data (JSON) often tokenize poorly (more tokens than expected). When building RAG, your chunk sizes should be measured in tokens, not characters.
RAG (Very Important)
Query Pipeline: User query → query rewriting (expand abbreviations, clarify ambiguity) → hybrid search (BM25 + vector, alpha=0.7) → cross-encoder re-ranking (Cohere Rerank, top 20 → top 5) → context assembly (system prompt + top chunks + user query) → LLM generation with inline citations → grounding check (verify each claim maps to a source).
Supporting Infrastructure: Semantic cache (Redis + vector similarity), auth middleware (SSO + RBAC — users only see docs they have access to), observability (Langfuse traces), feedback collection (thumbs up/down + comments).
Semantic chunking computes embedding similarity between consecutive sentences and splits where similarity drops significantly — producing chunks that are coherent units of meaning. More expensive but produces better retrieval. Parent-child chunking creates small chunks for retrieval precision but returns the larger parent chunk for generation context.
Hybrid search runs both in parallel and combines results using Reciprocal Rank Fusion (RRF). This is almost always the right answer for production systems because it handles both specific lookups and vague queries. The alpha parameter (0 = all BM25, 1 = all vector) lets you tune the balance — typically 0.5–0.7 (slight vector bias) works best.
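Reciprocal Rank Fusion itself is only a few lines; a sketch, with k=60 as the conventional constant from the original RRF paper and hypothetical doc IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) across every ranked list
    # that contains it, so items ranked well by either retriever rise.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # keyword (BM25) ranking
vector_hits = ["doc1", "doc4", "doc3"]  # dense-vector ranking
print(rrf_fuse([bm25_hits, vector_hits]))
```

Because RRF works on ranks rather than raw scores, it avoids the score-normalization problem of mixing BM25 and cosine-similarity scales directly.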
2. Better chunking — Switch from fixed-size to semantic or document-structure-aware chunking. Test multiple chunk sizes. Implement parent-child retrieval.
3. Hybrid search — Combine BM25 + vector search. Catches cases where either alone fails.
4. Query transformation — Query rewriting (expand/clarify user queries with an LLM), HyDE (generate hypothetical answer, embed that), and multi-query (generate 3–5 query variations, merge results).
5. Better embeddings — Upgrade embedding model (check MTEB leaderboard), or fine-tune embeddings on your domain data.
6. Metadata filtering — Filter by date, department, document type before vector search. Reduces noise dramatically.
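Metadata filtering (step 6) is a pre-filter applied before similarity ranking. A sketch where the document schema and the precomputed `score` field (standing in for real vector similarity) are illustrative assumptions:

```python
from datetime import date

docs = [
    {"id": "a", "dept": "legal", "updated": date(2024, 5, 1), "score": 0.82},
    {"id": "b", "dept": "hr",    "updated": date(2023, 1, 9), "score": 0.91},
    {"id": "c", "dept": "legal", "updated": date(2024, 9, 3), "score": 0.77},
]

def filtered_search(docs, dept=None, updated_after=None, top_k=5):
    # Apply structured filters BEFORE ranking by similarity, so stale or
    # out-of-scope documents never compete for the LLM's context window.
    hits = [d for d in docs
            if (dept is None or d["dept"] == dept)
            and (updated_after is None or d["updated"] >= updated_after)]
    return sorted(hits, key=lambda d: d["score"], reverse=True)[:top_k]

print([d["id"] for d in filtered_search(docs, dept="legal")])  # ['a', 'c']
```

Most vector databases expose this as a filter clause on the query itself, which is more efficient than post-filtering retrieved results.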
Agents / Orchestration
Implement with LangGraph: define agents as nodes, edges as message passing, with conditional routing and cycles (reviewer can send back to writer). Shared state holds the accumulated work product. Each agent has its own system prompt, tools, and optionally a different model (researcher uses a model with web access, writer uses a creative model).
Example: tools = [search_database(query, filters), send_email(to, subject, body), get_weather(city)]. User says "What's the weather in Tokyo and email it to my boss." The model outputs two parallel function calls, your code executes both, results are returned, and the model synthesizes a response.
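A provider-agnostic sketch of the dispatch loop for that example. The tool calls are plain dicts standing in for the model's structured output, and the tool bodies are stubs (real function-calling APIs differ in exact wire format):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stand-in tool implementations; a real get_weather would call an API.
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"

def send_email(to: str, subject: str, body: str) -> str:
    return f"sent to {to}"

TOOLS = {"get_weather": get_weather, "send_email": send_email}

def execute_tool_calls(tool_calls: list[dict]) -> list[dict]:
    # Run independent tool calls in parallel, returning results keyed by
    # call id so they can be fed back to the model for synthesis.
    def run(call):
        fn = TOOLS[call["name"]]
        return {"id": call["id"], "result": fn(**json.loads(call["arguments"]))}
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run, tool_calls))

calls = [  # roughly what the model would emit for the Tokyo example
    {"id": "1", "name": "get_weather", "arguments": '{"city": "Tokyo"}'},
    {"id": "2", "name": "send_email",
     "arguments": '{"to": "boss@co.com", "subject": "Weather", "body": "..."}'},
]
print(execute_tool_calls(calls))
```

In production this loop also needs argument validation and per-tool error handling, since models occasionally emit malformed arguments.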
Long-term memory: Store important facts, user preferences, and past interactions in a vector database. On each new message, retrieve relevant memories and inject them into the system prompt. Example: "User prefers responses in bullet points. User works in healthcare. Last week, user asked about HIPAA compliance."
Episodic memory: Record task outcomes ("Last time I called this API, the rate limit was 100/min") so the agent can learn from experience without fine-tuning.
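A toy sketch of long-term memory retrieval and injection: word overlap stands in for embedding similarity against a vector database, and the stored memory strings are the hypothetical examples above.

```python
def overlap(a: str, b: str) -> int:
    # Stand-in relevance score; production systems use embedding
    # similarity against a vector store instead of word overlap.
    return len(set(a.lower().split()) & set(b.lower().split()))

MEMORIES = [
    "User prefers responses in bullet points.",
    "User works in healthcare.",
    "Last week, user asked about HIPAA compliance.",
]

def build_system_prompt(message: str, top_k: int = 2) -> str:
    # Retrieve the most relevant memories and inject them up front.
    relevant = sorted(MEMORIES, key=lambda m: overlap(m, message), reverse=True)
    return ("You are a helpful assistant.\nKnown about this user:\n"
            + "\n".join(f"- {m}" for m in relevant[:top_k]))

print(build_system_prompt("Any updates on HIPAA compliance rules?"))
```

Writing to memory is the harder half: deciding which facts are durable enough to store, and deduplicating against what is already there.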
Evaluation / Production
Offline: Build a diverse eval dataset (200+ examples covering common queries, edge cases, adversarial inputs). Use RAGAS metrics for RAG (faithfulness, answer relevancy, context precision/recall). Use LLM-as-judge for general quality scoring. Run as part of CI/CD — block deploys if scores regress.
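The CI/CD gate can be as simple as comparing the current eval run against the last released baseline. A sketch, with metric names borrowed from RAGAS and the baseline values and tolerance purely illustrative:

```python
BASELINE = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_precision": 0.81}

def gate(current: dict, baseline: dict, tolerance: float = 0.02) -> list[str]:
    # Return every metric that regressed beyond the tolerance;
    # a non-empty list means the deploy should be blocked.
    return [m for m, base in baseline.items()
            if current.get(m, 0.0) < base - tolerance]

run = {"faithfulness": 0.95, "answer_relevancy": 0.84, "context_precision": 0.82}
failures = gate(run, BASELINE)
if failures:
    print(f"DEPLOY BLOCKED - regressed metrics: {failures}")
```

The tolerance matters: LLM-as-judge scores are noisy run to run, so a zero-tolerance gate will block deploys on noise alone.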
Online: Sample 5–10% of production traffic, run automated quality checks (hallucination detection, relevance scoring, format validation). Track latency (p50, p95, p99), cost per query, and error rates. Set up alerts for regressions.
User feedback: Thumbs up/down, regeneration clicks, copy events, task completion rates. Aggregate into dashboards. Use negative feedback to build failure case datasets for eval improvement.
Self-consistency: Generate 3–5 responses at temperature 0.5, compare for agreement. Disagreement on factual claims indicates low confidence/potential hallucination. Source attribution: Require the model to cite specific passages; verify citations actually support the claims.
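The self-consistency check reduces to a majority vote over sampled answers. A sketch in which the responses are canned stand-ins for actual generations at temperature 0.5, and the 0.6 agreement threshold is an assumption to tune:

```python
from collections import Counter

def self_consistency(answers: list[str], min_agreement: float = 0.6):
    # Normalize lightly, then check whether the modal answer clears the
    # agreement threshold; failure to agree signals possible hallucination.
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return answer, agreement, agreement >= min_agreement

samples = ["Paris", "paris", "Paris", "Lyon", "Paris"]  # 5 sampled generations
print(self_consistency(samples))  # ('paris', 0.8, True)
```

Exact-match voting only works for short factual answers; for longer responses, agreement is usually judged by an LLM or by embedding similarity between samples.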
Confidence scoring: Use token-level log probabilities (where available) to flag low-confidence segments.
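Where the API returns token log probabilities, a segment's confidence is just the exponential of the mean logprob (the geometric mean of per-token probabilities). The segments and the 0.5 threshold below are illustrative:

```python
import math

def segment_confidence(logprobs: list[float]) -> float:
    # exp(mean logprob) = geometric mean of the per-token probabilities.
    return math.exp(sum(logprobs) / len(logprobs))

def flag_low_confidence(segments: dict[str, list[float]], threshold: float = 0.5):
    return [text for text, lps in segments.items()
            if segment_confidence(lps) < threshold]

segments = {  # hypothetical per-token logprobs for two output sentences
    "The capital of France is Paris.": [-0.01, -0.02, -0.01],
    "Its population is 2.96 million.": [-1.2, -0.9, -1.5],
}
print(flag_low_confidence(segments))
```

Flagged segments can then be routed to a verification step, shown with a caveat, or dropped from the answer.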
System Design (Very High Weight)
Add conversation memory (maintain context within the session), sentiment detection (auto-escalate if the user is frustrated), and a feedback loop (resolved tickets feed back into the knowledge base).
Load balancing + queuing: ALB distributes across API servers. Request queue (SQS/Kafka) absorbs spikes and enables back-pressure. Priority queues for premium users.
Model routing: Route 70% of queries to cheap small models (GPT-4o-mini, Haiku), 30% to expensive large models. Automatic fallback between providers (OpenAI down → Anthropic).
Horizontal scaling: Stateless API servers scale with HPA. For self-hosted models: vLLM with continuous batching on GPU clusters, scaling based on queue depth. Multi-region deployment for global latency.
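A minimal sketch of the model-routing tier decision. The keyword signals, length cutoff, and model names are illustrative assumptions; production routers often use a small trained classifier plus provider fallback:

```python
COMPLEX_SIGNALS = ("analyze", "compare", "multi-step", "write code", "explain why")

def route(query: str) -> str:
    # Cheap heuristic tiering: long or reasoning-heavy queries go to the
    # expensive tier, everything else to the cheap tier.
    q = query.lower()
    if len(q.split()) > 40 or any(s in q for s in COMPLEX_SIGNALS):
        return "gpt-4o"       # expensive tier, roughly 30% of traffic
    return "gpt-4o-mini"      # cheap tier, roughly 70% of traffic

print(route("What are your opening hours?"))
print(route("Compare these two contracts and explain why clause 4 differs."))
```

Whatever the classifier, keep an escape hatch: if the cheap model's answer fails a quality check, re-run on the expensive tier.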
2. Caching — Semantic cache hits return in <50ms. Even a 30% hit rate dramatically reduces average latency.
3. Smaller models — GPT-4o-mini is 5–10x faster than GPT-4. Route simple queries to fast models.
4. Parallel execution — Run retrieval, guardrails, and other preprocessing in parallel (asyncio.gather). Run multiple tool calls simultaneously.
5. Prompt optimization — Shorter prompts = fewer input tokens = faster processing. Remove unnecessary examples, compress system prompts.
6. Pre-computation — Pre-embed documents (don't embed at query time). Pre-warm model connections. Pre-fetch user context.
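Parallel execution (step 4) in practice: run retrieval, guardrails, and context fetching concurrently with `asyncio.gather`. The stub coroutines and sleep timings below are illustrative stand-ins for real service calls:

```python
import asyncio

async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.05)   # stands in for a vector-DB call
    return ["chunk1", "chunk2"]

async def check_guardrails(query: str) -> bool:
    await asyncio.sleep(0.05)   # stands in for a moderation call
    return True

async def fetch_user_context(user_id: str) -> dict:
    await asyncio.sleep(0.05)   # stands in for a profile lookup
    return {"tier": "premium"}

async def preprocess(query: str, user_id: str):
    # All three awaitables run concurrently: total wall-clock wait is
    # roughly 0.05s here instead of 0.15s sequentially.
    return await asyncio.gather(
        retrieve(query), check_guardrails(query), fetch_user_context(user_id)
    )

chunks, safe, ctx = asyncio.run(preprocess("reset my password", "u1"))
print(chunks, safe, ctx)
```

Note that `gather` only helps for I/O-bound work; CPU-bound preprocessing needs a process pool instead.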
Deployment / Infra
Auth: SSO integration (SAML/OIDC with Okta/Entra), SCIM for user provisioning, RBAC for feature-level access control, MFA enforcement.
Data: Encryption at rest (AES-256) and in transit (TLS 1.3). Customer-managed encryption keys (BYOK). PII detection and redaction before sending to LLM APIs. Data residency controls (keep EU data in EU region).
Compliance: SOC 2 Type II certification, HIPAA BAAs with all sub-processors, GDPR DPAs, audit logging of all data access, regular penetration testing.
Pre-Sales / FDE-Specific (Critical)
Business context: "What problem are you trying to solve? What happens today without AI? What would success look like in 6 months? How do you measure it?"
Data landscape: "What data do you have? Where does it live? What format? How much? How often does it change? Who owns it?"
Users: "Who will use this? How tech-savvy are they? What's their current workflow? How many users?"
Constraints: "What compliance requirements exist (HIPAA, GDPR, SOC 2)? Can data leave your infrastructure? What's your budget range? Timeline?"
Technical environment: "What's your current tech stack? Cloud provider? Identity system? Existing integrations?"
Step 2 — Feasibility Assessment: Quick 1–2 day spike. Get sample data, test with a basic RAG pipeline. Can we actually answer their questions with their data? If yes, proceed. If no, be honest.
Step 3 — Solution Architecture: Draw the architecture diagram, list technology choices with justifications, identify integration points, and map the data flow. Get technical buy-in from their engineering team.
Step 4 — SOW: Define scope (included AND excluded), deliverables (specific artifacts: "deployed chatbot with admin dashboard," not "AI solution"), milestones with acceptance criteria ("chatbot achieves >85% answer accuracy on test dataset of 200 questions"), assumptions ("client provides access to Zendesk API by Week 2"), timeline, and pricing.
Bonus (Often Asked)
Use GPT-4o-mini / Claude Haiku / Gemini Flash when: classification, simple extraction, routing decisions, high-volume tasks where marginal quality improvement doesn't justify 50–100x cost increase. Cost: ~$0.15–0.50 per 1M tokens. Latency: 200ms–2s.
Use fine-tuned small models when: highly specific task with consistent format (entity extraction, sentiment analysis), need lowest latency (<100ms), or must run on-device/on-premise.
Model routing is the production answer: classify each query's complexity and route to the appropriate model tier. This typically cuts costs 60–70% while maintaining quality where it matters.
Track cost per query by endpoint, by user tier, and by feature. Set budget alerts. Monitor for cost anomalies (a single user generating 10x normal traffic). In my experience, combining caching + model routing can deliver up to an 80% cost reduction for most applications.
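Per-query cost tracking is a small amount of code. A sketch in which the per-million-token rates are illustrative examples, not current list prices (always check the provider's pricing page):

```python
# Illustrative $/1M-token rates: (input_rate, output_rate).
PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

totals: dict[str, float] = {}

def record(endpoint: str, model: str, in_tok: int, out_tok: int) -> float:
    # Accumulate per-endpoint spend; the same pattern extends to
    # per-user-tier and per-feature rollups for budget alerts.
    cost = query_cost(model, in_tok, out_tok)
    totals[endpoint] = totals.get(endpoint, 0.0) + cost
    return cost

record("/chat", "gpt-4o-mini", 1200, 300)
record("/chat", "gpt-4o", 2000, 500)
print(totals)
```

Emitting these records to the same observability pipeline as latency traces makes the cost anomalies mentioned above easy to alert on.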
Example structure: "We deployed a RAG chatbot for a financial services client. After 2 weeks in production, users reported the bot was confidently citing outdated compliance policies. Root cause: our ingestion pipeline had a bug that silently failed on document updates — the vector DB contained stale embeddings from the initial load while source documents had been updated 3 times. Fix: immediate re-index of all documents. Prevention: added change detection with hash comparison, freshness metadata on all chunks, automated integration tests that verify end-to-end from document update to correct retrieval, and alerting on ingestion pipeline failures."
Show: you can debug (observability), you take ownership, you think systemically (prevention, not just fix), and you communicate clearly to stakeholders.
Focus on outcomes, not technology. "This system will reduce your support ticket resolution time from 4 hours to 30 minutes" is better than "We'll implement a RAG pipeline with cross-encoder re-ranking and hybrid search."
Set honest expectations. "AI is like a very smart new hire — it'll get 85–90% of answers right from day one, and we'll improve it over time with feedback. For the 10–15% it's unsure about, it escalates to your team."
Use demos, not decks. A 5-minute live demo with their actual data is worth more than 50 slides. Build a quick prototype during the discovery phase and show it in the next meeting.
Fine-tuning & Model Selection
Security & Safety
MLOps & Lifecycle
Behavioral & Soft Skills
What They're Actually Testing
Answer Structure (Use This for Every Question)
The 5-Part Framework
For every technical question, structure your answer using this pattern. It demonstrates systematic thinking and production experience:
Problem → What's the challenge and why does it matter? (30 seconds)
Approach → How would you solve it? What are the key techniques? (1–2 minutes)
Architecture → What does the system look like? Components, data flow, tech choices. (2–3 minutes)
Trade-offs → What alternatives did you consider? Why this approach over others? (1 minute)
Production Concerns → Monitoring, security, cost, failure modes, scalability. (1 minute)
This framework works because it mirrors how senior engineers actually think. Junior engineers jump to the solution. Senior engineers start with the problem, consider alternatives, and think about what happens at 3am when things break. Interviewers notice the difference immediately.