LLMOps — Comprehensive Reference

End-to-end guide to building, deploying, evaluating, and operating LLM-powered systems in production

MLOps vs LLMOps (Beginner)

Traditional MLOps focuses on reproducible, deterministic model training pipelines. LLMOps extends this for the unique challenges of large language models: non-deterministic outputs, prompt-driven behavior, massive inference costs, and rapid model evolution.

| Aspect | Traditional MLOps | LLMOps |
|---|---|---|
| Data Preparation | Labeled datasets, feature engineering, class balance | Unlabeled text corpora, synthetic data, prompt templates, instruction tuning datasets |
| Training | Deterministic, reproducible, gradient-based optimization | Pre-training complete, fine-tuning selective, prompt engineering primary |
| Evaluation | Metrics (accuracy, F1, AUC) on held-out test set | Human evals, semantic metrics (BLEU, ROUGE, BERTScore), framework testing (LM Eval Harness) |
| Deployment | Model file + inference code, millisecond latency | Model API (proprietary) or self-hosted, seconds latency, token-based billing |
| Monitoring | Prediction distribution, input drift, output metrics | Token usage, cost per query, semantic drift, human feedback loops |
| Key Cost Driver | Training compute (one-time), storage | Inference tokens (per query), context window size |

Why LLMs Need Different Ops

Why this matters: Understanding the difference between MLOps and LLMOps shapes how you architect systems. You won't be retraining models constantly; you'll focus on prompts, retrieval quality, and API orchestration instead.

LLM Selection & Sizing (Beginner)

Choosing the right model is a critical decision balancing cost, latency, capability, and control. No single model is best for all use cases.

Decision Factors

| Model | Deployment | Cost/1M Tokens | Context | Strengths |
|---|---|---|---|---|
| GPT-4o | API | $15 in / $60 out | 128K | Best reasoning, multimodal, vision |
| Claude 3.5 Sonnet | API | $3 in / $15 out | 200K | Longest context, nuance, code |
| Llama 3 (70B) | Self-hosted / API | $0.90 in / $1.35 out | 8K | Open-source, fine-tunable, efficient |
| Mistral 7B | Self-hosted / API | $0.14 in / $0.42 out | 32K | Fast, small, cheap, decent quality |
| Gemini 2.0 Flash | API | $0.075 in / $0.30 out | 1M | Massive context, multimodal, fast |

Sizing Decision Tree

// Pseudo-code decision framework
if (complexity == "simple classification") {
  model = "Mistral 7B"        // fast, cheap
} else if (latency_budget_ms < 500) {
  model = "GPT-4o mini"       // low-latency API
} else if (context_needed > 100000) {
  model = "Claude 3.5 Sonnet" // longest context
} else if (reasoning_critical) {
  model = "GPT-4o"            // best reasoning chains
} else if (budget_constrained && can_self_host) {
  model = "Llama 3 70B"       // best open-source
}
Tip: Start with a frontier model (GPT-4o, Claude) to establish quality baseline. Once you understand requirements, evaluate cheaper models. Often a 7B fine-tuned model beats a larger base model.
Why this matters: Model choice cascades through your entire system: infrastructure, cost budget, latency guarantees, fine-tuning feasibility. Choose wisely and revisit quarterly as new models emerge.

Prompt Engineering (Beginner)

Prompts are code for LLMs. Systematic prompt engineering often provides better results than fine-tuning, with zero training cost or latency.

Zero-Shot Prompting

Ask the model directly without examples. Works for well-known tasks but quality varies with phrasing.

"Classify the sentiment of this tweet: 'Just launched our new product!' "

Few-Shot Prompting

Provide 2-5 examples before asking the task. Dramatically improves consistency and accuracy.

"Classify sentiment. Examples:
'Best product ever' → Positive
'Terrible experience' → Negative
'It was okay' → Neutral
Classify: 'Just launched our new product!' →"
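Assembling few-shot prompts programmatically keeps examples and the query formatted consistently. A minimal Python sketch (the `build_few_shot_prompt` helper is illustrative, not from any specific library; it uses ASCII `->` in place of the arrow above):

```python
def build_few_shot_prompt(task, examples, query):
    """Format labeled examples plus a query into a few-shot prompt string."""
    lines = [task, "Examples:"]
    for text, label in examples:
        lines.append(f"'{text}' -> {label}")
    lines.append(f"Classify: '{query}' ->")
    return "\n".join(lines)

examples = [
    ("Best product ever", "Positive"),
    ("Terrible experience", "Negative"),
    ("It was okay", "Neutral"),
]
prompt = build_few_shot_prompt("Classify sentiment.", examples,
                               "Just launched our new product!")
print(prompt)
```

Keeping the formatting in one helper means every call site sends identically structured examples, which matters because few-shot consistency is sensitive to formatting drift.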

Chain-of-Thought (CoT)

Ask model to explain reasoning step-by-step before answering. Improves reasoning quality for complex tasks.

"Solve step by step: A store has 12 apples. They sell 5 and receive 8 more. How many do they have?
Let me work through this:
1. Starting with 12
2. After selling 5: 12 - 5 = 7
3. After receiving 8: 7 + 8 = 15"

Self-Consistency

Generate multiple completions with temperature=1, take majority vote. Improves reasoning accuracy at cost of extra API calls.

// Generate 5 independent responses with temperature=1
for (i = 0; i < 5; i++) {
  response = llm("Q: What is 5+3?", temperature=1)
  responses.append(response)
}
// Take most common answer
answer = majority_vote(responses)
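The `majority_vote` step above reduces to counting normalized answers. A minimal Python sketch (normalization by case and whitespace is an assumption; real systems often extract the final answer first):

```python
from collections import Counter

def majority_vote(responses):
    """Return the most common answer after normalizing case and whitespace."""
    normalized = [r.strip().lower() for r in responses]
    answer, _count = Counter(normalized).most_common(1)[0]
    return answer

# Five sampled completions (illustrative strings)
responses = ["8", "8", "The answer is 8", "8", "9"]
print(majority_vote(responses))  # -> 8
```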

Tree-of-Thought (ToT)

Explore multiple reasoning paths, score each, and expand the most promising. Useful for complex decision problems.

// Pseudo-code for tree search
function tree_of_thought(problem) {
  initial_state = "Initial analysis of problem"
  states = [initial_state]
  for (depth = 1; depth < 3; depth++) {
    new_states = []
    for (state in states) {
      // Generate 3 possible next thoughts
      next_thoughts = llm("Next step from: " + state, n=3)
      // Score and keep top 2
      scored = score_thoughts(next_thoughts)
      new_states.extend(top_k(scored, 2))
    }
    states = new_states
  }
  return best_path(states)
}

System Prompts & Role Prompting

Set context and personality with system prompt. Role prompting ("You are a physicist") often improves relevant knowledge activation.

System: "You are an expert software architect with 15 years of experience. Provide concise, practical advice focused on production systems."
User: "How should I design a caching layer for an LLM API?"

Structured Output (JSON Mode)

Modern models support requesting JSON output. Enables reliable parsing and downstream tool integration.

"Extract entity relationships as JSON."
{
  "entities": [
    {"name": "Alice", "type": "person"},
    {"name": "Company X", "type": "org"}
  ],
  "relationships": [
    {"from": "Alice", "to": "Company X", "relation": "works_at"}
  ]
}
Prompt Versioning: Treat prompts as code. Version them in git, test against eval benchmarks before rolling out changes, track performance deltas.
Why this matters: Prompt engineering is your fastest lever for quality improvement: instant deployment, no infrastructure cost, repeatable experimentation. A well-engineered prompt often beats a fine-tuned model.

RAG Architecture (Intermediate)

Retrieval-Augmented Generation augments LLM knowledge with external documents, enabling current information and domain specificity without model retraining.

Core Pipeline: Ingest → Chunk → Embed → Index → Retrieve → Rerank → Generate

// Simplified RAG pipeline pseudocode
class RAG_Pipeline {
  def ingest(documents) {
    // Load PDFs, web pages, databases
    return raw_text
  }
  def chunk(text) {
    // Split on semantics, not just size
    return [chunk1, chunk2, ...]
  }
  def embed(chunks) {
    // Convert text to vectors (768-1536 dims)
    vectors = [embedding_model.encode(c) for c in chunks]
    return vectors
  }
  def index(vectors) {
    // Store in vector DB with HNSW/IVF indexing
    db.upsert(vectors)
  }
  def retrieve(query, k=5) {
    query_vec = embedding_model.encode(query)
    results = db.search(query_vec, top_k=k)
    return results
  }
  def rerank(query, candidates) {
    // Cross-encoder: re-score with query context
    scores = cross_encoder.score(query, candidates)
    return sorted(candidates, key=scores)
  }
  def generate(query, context) {
    prompt = "Given context: " + context + "\nQuery: " + query
    return llm.generate(prompt)
  }
}

Naive RAG Limitations

Advanced RAG

Routing Pattern: Route queries: simple questions to LLM directly, complex/factual to RAG. Avoid RAG latency when unnecessary.
Why this matters: RAG is the gateway to current, specialized knowledge without constant model updates. Most production LLM systems are RAG-augmented, not pure LLM.

Vector Databases (Intermediate)

Vector databases (VectorDBs) efficiently store, index, and retrieve high-dimensional embeddings. Choice affects retrieval speed, scale, and cost.

| Database | Hosting | Features | Scale | Pricing Model |
|---|---|---|---|---|
| Pinecone | Managed cloud | Full-text hybrid search, metadata filtering, namespaces | Billions of vectors | $0.10 per 100K vectors/month + queries |
| Weaviate | Self-hosted / cloud | GraphQL API, multimodal, reranking built-in | Billions | Open-source; cloud: per-vector + queries |
| Qdrant | Self-hosted / cloud | Payload filtering, sparse embeddings, performance focused | Billions | Open-source; cloud: compute + storage |
| Chroma | Embedded / cloud | Python-first, simple API, great for prototyping | Millions | Open-source (embedded); Chroma Cloud pricing TBA |
| Milvus | Self-hosted / cloud | High performance, sparse + dense, partition support | Billions | Open-source; Zilliz Cloud: per-month compute |
| pgvector (PostgreSQL) | Self-hosted | SQL + vectors, perfect for hybrid data, ACID transactions | Millions (practical) | Open-source |

Embedding Models Comparison

| Model | Dimension | MTEB Score | Cost / 1M tokens | Strengths |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 64.5 | $0.13 | Highest quality, proprietary |
| Cohere Embed-3-large | 1024 | 64.2 | $0.10 | Multilingual, retrieval optimized |
| BGE-m3 | 1024 | 64.5 | $0 (open-source) | Multilingual, sparse + dense, free |
| E5-large-v2 | 1024 | 63.5 | $0 (open-source) | Strong, symmetric, free |
| jina-embeddings-v2-base-en | 768 | 62.3 | $0 (open-source) | Fast, compact, efficient |

Indexing Strategies

Distance Metrics

// Common distance metrics for embeddings
euclidean   = sqrt(sum((a_i - b_i)^2))  // Raw distance
cosine_sim  = (a · b) / (|a| |b|)       // Normalized, most common
dot_product = a · b                     // Fast, if vectors already normalized
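The three metrics can be written in a few lines of plain Python (a stdlib-only sketch; production systems use vectorized numpy or in-database implementations):

```python
import math

def dot(a, b):
    """Dot product; equals similarity directly if vectors are pre-normalized."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Raw L2 distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Length-normalized similarity in [-1, 1]; most common for embeddings."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [1.0, 0.0], [0.0, 1.0]
print(euclidean(a, b))   # sqrt(2)
print(cosine_sim(a, b))  # 0.0 (orthogonal vectors)
```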
Why this matters: VectorDB choice impacts retrieval latency, cost, and quality. For <10M vectors use Chroma or pgvector; for billions use Pinecone or Milvus. Embedding model choice affects quality more than VectorDB.

Chunking & Embedding Strategies (Intermediate)

How you split documents and represent them dramatically affects RAG quality. Poor chunking creates lost context and redundancy.

Chunking Methods

// Semantic chunking example
function semantic_chunk(text, threshold=0.5) {
  sentences = split_sentences(text)
  chunks = []
  current_chunk = [sentences[0]]
  for (i = 1; i < sentences.length; i++) {
    prev_embed = embed(sentences[i-1])
    curr_embed = embed(sentences[i])
    similarity = cosine_similarity(prev_embed, curr_embed)
    if (similarity > threshold) {
      current_chunk.append(sentences[i])
    } else {
      chunks.append(join(current_chunk))
      current_chunk = [sentences[i]]
    }
  }
  chunks.append(join(current_chunk))
  return chunks
}

Chunk Size & Overlap Tradeoffs

| Chunk Size | Pros | Cons | Use Case |
|---|---|---|---|
| 256 tokens | Precise retrieval, low cost | May miss context, many chunks to search | Dense documents, keyword search |
| 512 tokens | Good balance | | Most RAG systems (default) |
| 1024 tokens | Rich context in chunk | Retrieval recall may suffer, expensive | Complex reasoning needs full context |

Overlap: Use 50-100 tokens overlap to prevent critical information from being split across chunks.
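One way to apply the overlap rule is a sliding token window. A sketch (tokens are approximated here by whitespace-split words; a real pipeline would count tokens with the model's tokenizer):

```python
def chunk_with_overlap(text, chunk_size=512, overlap=64):
    """Split text into windows of `chunk_size` word-tokens;
    consecutive chunks share `overlap` tokens so no fact is split in half."""
    tokens = text.split()
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_with_overlap(doc, chunk_size=512, overlap=64)
```

With these numbers a 1200-token document yields three chunks, and the last 64 tokens of one chunk reappear at the start of the next.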

Embedding Models In-Depth

Bi-encoder (Asymmetric): Embed query and documents separately and compare vectors; documents are embedded offline, only the query embedding is computed at query time. Fast retrieval, good for asymmetric search (short query → long document).

Cross-encoder (Reranker): Encode query + document together. Slower (run on retrieved set), much higher accuracy. Use for top-k reranking.

// Embed documents once, reuse
for (doc in documents) {
  embedding = bi_encoder.encode(doc)
  vectordb.upsert(embedding, metadata=doc)
}
// At query time: embed query, retrieve, rerank
query_embedding = bi_encoder.encode(query)
candidates = vectordb.search(query_embedding, k=50)
reranked = []
for (doc in candidates) {
  score = cross_encoder.score(query, doc)
  reranked.append((doc, score))
}
return sorted(reranked, key=score)[:5]  // keep top 5
Metadata Extraction: Extract title, author, date, source URL during chunking. Use in reranking and context windows.
Why this matters: Chunking quality directly impacts RAG recall. Bad chunking (fixed-size, no overlap) loses information. Semantic chunking + reranking = best quality. Embedding model choice (OpenAI vs BGE) affects 2-5% quality delta; chunking affects 10-20%.

Fine-Tuning Methods (Intermediate)

Fine-tuning adapts pre-trained models to specific tasks with domain data. Different methods trade off cost, speed, and quality.

| Method | Memory | Speed | Quality | Cost | Use Case |
|---|---|---|---|---|---|
| Full Fine-Tuning | 80GB+ (70B model) | 1-7 days | Highest | $1000s | Domain-critical, large budgets |
| LoRA | 24GB (70B model) | 8-24 hours | 95% of full | $100-500 | Fast iteration, multiple adapters |
| QLoRA | 4GB (70B model) | 24-48 hours | 90% of full | $10-50 | Limited compute, rapid prototyping |
| SFT (Supervised) | 24GB+ | 1-3 days | High | $500+ | Instruction following, behavior shaping |

LoRA Configuration Example

// LoRA config for efficient fine-tuning
const lora_config = {
  r: 16,               // LoRA rank (lower = fewer params)
  lora_alpha: 32,      // Scaling factor
  lora_dropout: 0.05,  // Dropout in LoRA layers
  target_modules: [    // Which layers to adapt
    "q_proj", "v_proj", "k_proj", "out_proj"
  ],
  bias: "none",
  task_type: "CAUSAL_LM"
}
// Training params
const training_config = {
  learning_rate: 5e-4,
  num_epochs: 3,
  batch_size: 16,
  warmup_steps: 100,
  gradient_accumulation_steps: 4
}

When to Fine-Tune vs Prompt vs RAG

// Decision tree
if (need_current_facts || specific_documents) {
  "Use RAG"
} else if (can_improve_with_examples || format_critical) {
  "Try few-shot prompting first"
} else if (num_examples >= 1000 && domain_specific) {
  "Fine-tune with LoRA"
} else if (production_critical && budget_available) {
  "Full fine-tuning"
}
Why this matters: Fine-tuning is expensive and slow. Start with prompting and RAG. Only fine-tune when you have domain data and can't achieve quality otherwise.

RLHF & Alignment (Advanced)

Reinforcement Learning from Human Feedback aligns LLM outputs with human preferences. Critical for safety and quality.

Alignment Methods Comparison

| Method | Human Feedback | Complexity | Quality | Recent Use |
|---|---|---|---|---|
| RLHF (PPO) | Preferences (A>B) | High | High | ChatGPT training |
| DPO (Direct Preference Opt) | Preferences (A>B) | Low (SFT-like) | High | Llama 2, newer models |
| ORPO | Preferences | Low | High | Recent, simplified DPO |
| KTO (Kahneman-Tversky Opt) | Binary ratings | Low | High | Alternative to DPO |
// DPO training simplified pseudocode
function dpo_loss(logprobs_chosen, logprobs_rejected) {
  // Preference: chosen should be more likely than rejected
  preference_ratio = logprobs_chosen - logprobs_rejected
  // Loss encourages preference_ratio to grow
  return -log(sigmoid(BETA * preference_ratio))
}
// Training loop: (prompt, chosen, rejected) triples with DPO loss
for (batch in training_data) {
  logprobs_c = model.logprobs(batch.prompt, batch.chosen)
  logprobs_r = model.logprobs(batch.prompt, batch.rejected)
  loss = dpo_loss(logprobs_c, logprobs_r)
  loss.backward()
  optimizer.step()
}

Data Collection for Alignment

Why this matters: Alignment determines whether your model is trustworthy and safe. DPO is now often preferred over PPO-based RLHF: simpler to implement, faster to train, with comparable or better results. Consider alignment essential for production systems.

Data Preparation & Curation (Intermediate)

High-quality training data is the foundation of good models. Data pipelines include collection, cleaning, deduplication, and formatting.

Data Collection & Cleaning Pipeline

// Data preparation pipeline
function prepare_data(raw_sources) {
  // 1. Collect from diverse sources
  data = []
  data.extend(scrape_web())
  data.extend(load_books())
  data.extend(load_academic())
  // 2. Clean: remove gibberish, non-text, corrupted
  data = [d for d in data if is_valid_text(d)]
  // 3. Deduplicate: fuzzy + exact
  data = deduplicate(data, threshold=0.95)
  // 4. Filter quality: length, language, toxicity
  data = [d for d in data if
    len(d) > 100 &&
    detect_language(d) == "en" &&
    quality_score(d) > 0.5
  ]
  return data
}

Training Data Formats

// Alpaca format example
{
  "instruction": "Summarize this article in 2 sentences",
  "input": "[article text here]",
  "output": "[2-sentence summary]"
}
// ChatML format example
{
  "messages": [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"}
  ]
}

Synthetic Data Generation

Generate diverse training examples cheaply using an existing LLM (seed data → diverse variations → filter → fine-tune).

// Synthetic data generation pipeline
function generate_synthetic(seed_examples, multiplier=10) {
  synthetic = []
  for (seed in seed_examples) {
    for (i = 0; i < multiplier; i++) {
      // Prompt: rephrase, simplify, make harder, etc.
      variant = llm.generate("Rephrase: " + seed.instruction)
      synthetic.append(variant)
    }
  }
  // Filter: only keep quality examples
  return [s for s in synthetic if quality_check(s)]
}

Data Flywheel

Production system → user interactions → log quality signals → curate best examples → retrain → better model → better quality. Virtuous cycle.

Deduplication: Use embedding-based or MinHash deduplication. Exact matches miss paraphrases. Critical for preventing train-test leakage.
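As a toy illustration of fuzzy (non-exact) deduplication, the sketch below uses stdlib `difflib` string similarity in place of MinHash or embeddings; real pipelines need MinHash/LSH or approximate-nearest-neighbor search to scale past this O(n^2) approach:

```python
from difflib import SequenceMatcher

def deduplicate(texts, threshold=0.95):
    """Keep a text only if every already-kept text is below `threshold` similar.
    Catches near-duplicates (punctuation/whitespace variants) that exact
    matching misses."""
    kept = []
    for t in texts:
        if all(SequenceMatcher(None, t, k).ratio() < threshold for k in kept):
            kept.append(t)
    return kept

data = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",  # near-duplicate
    "LLMOps extends MLOps for language models.",
]
print(deduplicate(data))  # near-duplicate dropped, 2 texts remain
```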
Why this matters: Data quality compounds: 10k quality examples beat 1M noisy examples. Invest in curation, deduplication, and synthetic generation upfront. The data flywheel creates compounding value over time.

LLM Evaluation Frameworks (Intermediate)

Evaluating LLM outputs requires metrics beyond traditional classification accuracy. Semantic similarity, human preference, and task-specific metrics all matter.

Evaluation Metrics

| Metric | Type | Interpretation | Use Case |
|---|---|---|---|
| Perplexity | NLP | How surprised the model is at the next token. Lower = better. | Language modeling baseline |
| BLEU | NLG | N-gram overlap with reference. 0-100 scale. | Machine translation |
| ROUGE | NLG | Recall of n-grams. Multiple variants (ROUGE-1, ROUGE-L). | Summarization |
| BERTScore | Semantic | Embedding similarity between output and reference. Context-aware. | Paraphrase, generation tasks |
| Human Eval | Direct | Annotators rate quality. Most reliable but expensive. | Production decision-making |

Evaluation Frameworks & Tools

// Example: using LM Eval Harness
const eval_config = {
  model: "meta-llama/Llama-2-7b-hf",
  tasks: [
    "mmlu",          // 57-subject multiple choice
    "arc_challenge", // Science questions (multiple choice)
    "humaneval",     // Code generation
    "gsm8k"          // Math reasoning
  ],
  num_fewshot: 5
}
// Output: accuracy scores across tasks
{ "mmlu": 0.42, "arc_challenge": 0.51, "humaneval": 0.23 }
Benchmark Leaderboards: Track: Open LLM Leaderboard, MTEB (Embeddings), Papers With Code. Always compare against recent baselines.
Why this matters: No single metric captures quality. Combine automated metrics (ROUGE, BERTScore) for rapid iteration with occasional human eval for final validation. Use domain-specific metrics (medical accuracy, code compilation) when available.

Red Teaming & Safety Testing (Advanced)

Adversarial testing identifies vulnerabilities before production. Red teams find jailbreaks, prompt injection attacks, and biased outputs.

Common Attack Vectors

Defense Layers

// Multi-layer defense pipeline
function process_request(user_input) {
  // 1. Input validation & sanitization
  if (is_injection(user_input)) {
    return "Rejected: injection detected"
  }
  // 2. Input filtering (PII, secrets)
  user_input = mask_pii(user_input)
  // 3. Model inference with safety
  response = model.generate(user_input, temperature=0.7)
  // 4. Output filtering & guardrails
  if (is_toxic(response) || is_hallucination(response)) {
    return "I can't help with that"
  }
  // 5. Audit logging
  log_interaction(user_input, response, timestamp)
  return response
}

Tools for Safety Testing

OWASP LLM Top 10 (Critical Vulnerabilities)

| # | Vulnerability | Mitigation |
|---|---|---|
| 1 | Prompt Injection | Input filtering, semantic isolation, user context separation |
| 2 | Insecure Output Handling | Output validation, sanitization, schema enforcement |
| 3 | Training Data Poisoning | Data validation, source vetting, fingerprinting |
| 4 | Model Denial of Service | Rate limiting, resource quotas, input length limits |
| 5 | Supply Chain Vulnerabilities | Dependency scanning, model provenance, audit trails |
Why this matters: Safety isn't optional in production. Build multi-layer defenses. Red team early and continuously. Invest in monitoring for emerging attack patterns.

A/B Testing & Experimentation (Intermediate)

LLM outputs are stochastic. A/B testing requires statistical methods and human preference validation, not just accuracy metrics.

Online vs Offline Evaluation

| Approach | Speed | Validity | Cost | Use Case |
|---|---|---|---|---|
| Offline (Bench) | Minutes | Proxy quality | Low ($1-10) | Rapid iteration, model selection |
| Online (A/B) | Hours-weeks | Real user feedback | High ($1000s) | Production decisions, user experience |
| Human Eval | Days-weeks | Direct quality | Medium ($100-1000) | Model-to-model comparison, releases |

Statistical Significance for LLM Outputs

LLM outputs are variable, so you need larger sample sizes than traditional A/B tests.

// Simplified sample size calculation
// Cohen's h for two proportions
baseline_rate = 0.65  // Variant A success rate
variant_rate  = 0.72  // Variant B success rate (target)
significance  = 0.05  // Alpha (95% confidence)
power         = 0.80  // 1 - beta (80% power to detect the lift)
// Result: roughly 350-500 samples per variant, depending on the test used
// With LLM stochasticity: may need 1000-2000 for stable signals
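The calculation can be made concrete with Cohen's h and the usual normal approximation (a sketch; z-values are hard-coded for alpha=0.05 two-sided and 80% power, and this is one of several valid formulas):

```python
import math

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Approximate per-variant n needed to detect p1 vs p2.
    Uses Cohen's h effect size with a normal approximation."""
    h = abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))
    return math.ceil(((z_alpha + z_beta) / h) ** 2)

n = sample_size_two_proportions(0.65, 0.72)
print(n)  # a few hundred per variant for this effect size
```

Smaller lifts blow the required n up quadratically, which is why subtle prompt changes often need thousands of samples to confirm.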

User Preference Testing

Why this matters: Online A/B tests reveal real-world performance. Offline benchmarks miss user satisfaction and task fit. Run both: benchmarks for iteration, A/B for validation.

Model Serving Infrastructure (Advanced)

Serving LLMs in production requires specialized infrastructure for throughput, latency, and cost efficiency.

| Framework | Throughput | Latency (TTFT) | Features | Best For |
|---|---|---|---|---|
| vLLM | 15-30 req/s | 50-200ms | PagedAttention, quantization, paged KV cache | High throughput, open-source models |
| TGI | 10-20 req/s | 100-300ms | Streaming, LoRA, quantization, batching | Production-ready, feature rich |
| TensorRT-LLM | 20-50 req/s | 30-100ms | NVIDIA-optimized, kernel fusion, quantization | Latency-critical, NVIDIA hardware |
| Triton Inference | 10-30 req/s | 100-500ms | Multi-backend, model management, versioning | Multi-model orchestration |
| Ollama | 1-5 req/s | 500ms-2s | Easy local deployment, no dependencies | Development, local inference |

Batching Strategies

KV Cache Optimization

KV cache stores key-value pairs for attention. Scales with context length and batch size. Major memory bottleneck.

// KV cache memory calculation (fp16 cache values)
// For a 70B-class model with grouped-query attention (GQA)
num_layers      = 80
kv_dim          = 1024  // num_kv_heads (8) * head_dim (128) under GQA
batch_size      = 32
seq_length      = 4096
bytes_per_value = 2     // fp16
memory_gb = (num_layers * 2 * batch_size * seq_length
             * kv_dim * bytes_per_value) / 1e9
// Result: ~43 GB for the cache alone (limits batch size on 80GB GPUs)
// Optimizations:
// - Quantize KV cache to int8 or fp8
// - Paged attention: allocate blocks dynamically (vLLM approach)
// - Shared KV cache for same prompt prefix (prompt reuse)
TTFT (Time To First Token): Critical for user experience. Continuous batching plus PagedAttention (vLLM) keeps TTFT in the 50-200ms range. For interactive systems, TTFT matters more than total throughput.
Why this matters: Serving infrastructure sets your throughput, latency, and cost ceiling. vLLM is the industry standard for open-source models. For production, profile your workload: if TTFT is critical, prioritize latency; if throughput matters most, batch aggressively.

API Design & Gateway Patterns (Intermediate)

APIs expose LLM capabilities safely and efficiently. Patterns include rate limiting, load balancing, model routing, and fallbacks.

Gateway Pattern with Fallback Chain

// API gateway pseudocode with fallback chain
function generate(prompt, options={}) {
  // 1. Rate limit & quota check
  if (!check_rate_limit(user_id)) {
    return error("Rate limit exceeded", 429)
  }
  // 2. Route based on complexity & cost
  model = "gpt-4"
  if (options.budget == "low") {
    model = "gpt-4-turbo"
  }
  if (options.speed == "fast") {
    model = "gpt-3.5-turbo"
  }
  // 3. Try primary model, fall back on error
  try {
    return llm[model].generate(prompt)
  } catch (error) {
    // Fallback to cheaper/faster model
    try {
      return llm["gpt-3.5-turbo"].generate(prompt)
    } catch {
      // Last resort: cached result or error
      cached = cache.get(hash(prompt))
      return cached || error("Service unavailable", 503)
    }
  }
}

Rate Limiting Strategy
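The gateway's rate-limit check is commonly implemented as a per-user token bucket. A minimal in-process sketch (the rate and capacity values are illustrative; distributed systems usually keep the bucket state in Redis):

```python
import time

class TokenBucket:
    """Allow `rate` requests/second on average, with bursts up to `capacity`."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity        # start full: allow an initial burst
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Refill proportionally to elapsed time, then try to spend one token."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)  # 5 req/s sustained, burst of 10
```

The injectable `clock` makes the limiter deterministic to test; production code uses the default monotonic clock.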

Streaming vs Non-Streaming

// Non-streaming: wait for full response
response = await llm.generate(prompt)
return {"text": response}
// Streaming: SSE tokens as generated
stream = llm.generate_stream(prompt)
for (chunk in stream) {
  send_sse("data: " + json(chunk))
}

API Versioning for Prompt Changes

Version prompts like code: v1, v2, etc. Support old versions for backward compatibility. Track performance per version.

Why this matters: Robust API design isolates clients from model failures and changes. Fallback chains provide graceful degradation. Cost routing and rate limiting protect both user and server.

Caching Strategies (Intermediate)

Caching LLM outputs saves cost and latency. Multiple strategies available depending on query patterns.

Caching Methods

// Semantic caching with embeddings
function semantic_cache_get(query, threshold=0.95) {
  query_embed = embed(query)
  // Search cache for similar queries
  candidates = cache_db.search(query_embed, top_k=5)
  for (cached in candidates) {
    similarity = cosine_similarity(query_embed, cached.embed)
    if (similarity > threshold) {
      return cached.response  // Cache hit!
    }
  }
  return null  // No cache hit, generate
}
// After generation, store in cache
response = llm.generate(query)
cache_db.insert({
  query: query,
  embed: embed(query),
  response: response,
  timestamp: now()
})
return response

Tools for Caching

Cost Savings & Invalidation

ROI: If hit rate is 30% and cached request is 70% cheaper, save 21% on token cost. At $1M/month spend = $210k savings.

Invalidation: Set TTL on cached entries (default 24h). For semantic caching, invalidate when embedding model updates.
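Exact-match caching with a TTL can be sketched as a dict keyed by a prompt hash (illustrative; Redis with `EXPIRE` is the usual production choice, and the `clock` parameter exists only to make expiry testable):

```python
import hashlib
import time

class ExactCache:
    """Exact-match LLM response cache with a per-entry TTL."""
    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (response, stored_at)

    def _key(self, prompt):
        # Hash so arbitrarily long prompts make fixed-size keys
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get(self, prompt):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        response, stored_at = entry
        if self.clock() - stored_at > self.ttl:  # expired: drop and miss
            del self._store[self._key(prompt)]
            return None
        return response

    def put(self, prompt, response):
        self._store[self._key(prompt)] = (response, self.clock())
```

Wrap `llm.generate` with a `get`-then-`put` pair; upgrading to semantic caching only changes the lookup, not this interface.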

Why this matters: Caching is free money: reduce cost 20-30% with minimal complexity. Start with exact-match (easy), upgrade to semantic for better hit rates. API-level prompt caching is the easiest path.

LLM Observability (Intermediate)

Observability for LLMs differs from traditional systems. Monitor tokens, cost, latency (TTFT, E2E), and semantic drift.

Key Metrics to Monitor

| Metric | What It Measures | Alert Threshold | Action |
|---|---|---|---|
| TTFT (Time to First Token) | Time from request to first token. User-perceived latency. | +50% vs baseline | Scale inference, optimize batching |
| TPS (Tokens Per Second) | Throughput. Tokens generated per second across all requests. | -30% vs baseline | Check GPU utilization, increase batch size |
| E2E Latency | Full request-response time. Includes generation + post-processing. | >10s for chat | Profile chain, check downstream services |
| Token Usage | Input + output tokens per request. Direct cost driver. | +20% vs expected | Review prompts, check for loops |
| Cost per Query | Total of tokens × rate. Bottom-line metric. | +15% vs budget | Optimize prompts, route to cheaper model |
| Error Rate | % of requests failing (timeout, API error, crash). | >1% | Investigate, check API quota |

Observability Tools

Tracing Chains & Agents

// Example: trace LLM chain
const trace = start_trace("qa_chain")
// Step 1: Retrieve documents
retrieval_span = start_span("retrieve")
docs = vector_db.search(query)
retrieval_span.record({docs_count: docs.length, latency: 120})
// Step 2: Generate answer
gen_span = start_span("generate")
answer = llm.generate(query, context=docs)
gen_span.record({tokens: 150, cost: 0.003})
// Complete trace
trace.end({success: true, latency: 450})
Why this matters: Observability uncovers issues fast: high TTFT reveals batching problems; high token usage reveals prompt bloat; high costs reveal model selection misfit. Invest in tracing to debug chains in production.

Guardrails & Content Filtering (Intermediate)

Guardrails enforce safety, quality, and correctness. Covers input validation, output filtering, and policy enforcement.

Guardrail Framework Layers

// Multi-layer guardrail pipeline
function apply_guardrails(prompt, context="") {
  // 1. Input validation
  if (is_injection(prompt)) {
    return {error: "prompt_injection_detected"}
  }
  // 2. PII/Secrets masking
  prompt = mask_sensitive(prompt)
  // 3. Policy check (topic allowlist)
  if (!allowed_topic(prompt)) {
    return {error: "topic_restricted"}
  }
  // 4. Language filter
  if (detect_language(prompt) != "en") {
    return {error: "unsupported_language"}
  }
  // 5. Generate LLM response
  response = llm.generate(prompt)
  // 6. Output validation
  if (is_toxic(response) || contains_pii(response)) {
    response = "I can't provide that response"
  }
  return {text: response}
}

Tools for Guardrails

PII Detection & Masking

// Detect and mask PII
text = "My SSN is 123-45-6789 and email is john@example.com"
// Detect PII entities
entities = detect_pii(text)
// Result: [ssn: "123-45-6789", email: "john@example.com"]
// Mask PII
masked = mask(text, entities)
// Result: "My SSN is [SSN] and email is [EMAIL]"
Why this matters: Guardrails are your safety net. Start with input/output filters (simple regex). Upgrade to learned classifiers (Llama Guard) for nuance. Deploy confidently knowing bad inputs and outputs are caught.

Drift Detection & Continuous Evaluation (Advanced)

Model and data drift degrade performance silently. Continuous eval pipelines detect regressions early.

Types of Drift

Automated Eval Pipeline

// Daily regression testing
function daily_eval() {
  // Load baseline results (last successful run)
  baseline = load_baseline()
  // Run current model on eval set
  current_results = []
  for (example in EVAL_SET) {
    output = model.generate(example.input)
    score = evaluate(output, example.expected)
    current_results.push(score)
  }
  // Compare: statistical significance test
  avg_current = mean(current_results)
  avg_baseline = mean(baseline.results)
  p_value = ttest(current_results, baseline.results)
  if (p_value < 0.05 && avg_current < avg_baseline) {
    alert("REGRESSION: performance dropped significantly")
    rollback_model()
  } else {
    save_baseline(current_results)
  }
}

Monitoring for Drift

| Signal | Detection Method | Alert Threshold |
|---|---|---|
| TTFT degradation | Percentile tracking (p95 TTFT) | +30% over 7-day baseline |
| Token inflation | Mean tokens per query trend | +20% increase |
| Semantic drift (hallucination) | Human-annotated eval set weekly | -5% accuracy |
| Input distribution shift | Embedding centroid distance | Cosine distance > 0.2 |
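The embedding-centroid check for input distribution shift can be sketched in pure Python (assumes fixed-dimension embedding lists; the 0.2 threshold mirrors the table and would be tuned per deployment):

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def input_drifted(baseline_embeds, recent_embeds, threshold=0.2):
    """Alert when the recent query centroid moves away from the baseline."""
    return cosine_distance(centroid(baseline_embeds),
                           centroid(recent_embeds)) > threshold

# Illustrative 2-D embeddings: recent traffic points a different direction
baseline = [[1.0, 0.0], [0.9, 0.1]]
recent   = [[0.0, 1.0], [0.1, 0.9]]
print(input_drifted(baseline, recent))  # True: distribution rotated
```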
Why this matters: Drift is a silent killer. Weekly evals catch problems before users notice. Automated alerts enable fast rollback. Invest in eval infrastructure; it pays dividends.

Cost Optimization (Intermediate)

LLM inference cost is the dominant operating expense. Optimization strategies compound to 30-60% savings.

Cost Reduction Techniques

Cost Comparison Table (per 1M tokens, April 2026)

| Provider | Model | Input Cost | Output Cost | Context Window |
|---|---|---|---|---|
| OpenAI | gpt-4o-mini | $0.15 | $0.60 | 128K |
| OpenAI | gpt-4o | $15 | $60 | 128K |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Anthropic | Claude 3.5 Sonnet | $3 | $15 | 200K |
| Google | Gemini 2.0 Flash | $0.075 | $0.30 | 1M |
| Together | Llama 3 70B | $0.90 | $1.35 | 8K |

ROI Calculation: Caching Example

// Cost analysis for semantic caching
monthly_queries  = 1000000
cost_per_query   = 0.01  // $0.01 per query (500 tokens avg)
cache_hit_rate   = 0.25  // 25% hit rate
cache_cost_ratio = 0.03  // Cache query costs 3% of LLM query
current_cost = monthly_queries * cost_per_query
// = 1M * $0.01 = $10,000/month
with_cache_cost = (monthly_queries * cache_hit_rate * cost_per_query * cache_cost_ratio)
                + (monthly_queries * (1 - cache_hit_rate) * cost_per_query)
// = (1M * 0.25 * $0.01 * 0.03) + (1M * 0.75 * $0.01)
// = $75 + $7,500 = $7,575
savings = ((current_cost - with_cache_cost) / current_cost) * 100
// = (($10k - $7.575k) / $10k) * 100 = 24.25% savings
Batch API Timing: Use batch API for non-urgent workloads (data labeling, nightly jobs). 50% discount offsets latency. OpenAI batch API: 24h turnaround.
Why this matters: Small optimizations compound. 20% from compression + 30% from cascading + 25% from caching = 58% total savings. For $100k/month spend, that is $58k saved per month.

CI/CD for LLM Applications (Advanced)

LLM CI/CD differs from traditional software. Focus on prompt versioning, eval gates, and gradual rollout.

CI Pipeline for Prompts

// CI: test prompts before merge
on push_to_branch {
  // 1. Lint prompts (check structure)
  lint_prompts()
  // 2. Run on eval set (fast check)
  results = run_eval(prompts, SMALL_EVAL_SET)
  // 3. Eval gate: must be >= baseline
  baseline_score = 0.82
  if (results.accuracy < baseline_score) {
    fail("Failed eval gate")
  }
  // 4. Cost check: tokens should not increase
  prev_tokens_avg = 450
  if (results.avg_tokens > prev_tokens_avg * 1.1) {
    warn("Token usage increased 10%+")
  }
}

CD: Gradual Rollout

// Feature flag for prompt versioning
function get_prompt(user_id) {
    if (is_in_experiment(user_id, "new_prompt_v2")) {
        return PROMPT_V2   // New version
    } else {
        return PROMPT_V1   // Stable baseline
    }
}

// Monitor: is v2 better than v1?
v1_accuracy = 0.82
v2_accuracy = 0.85
if (v2_accuracy > v1_accuracy && p_value < 0.05) {
    promote("PROMPT_V2")   // Make v2 default
}
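A common way to implement the experiment check is deterministic hash bucketing, so the same user always sees the same prompt version across requests. This is a sketch under that assumption, not a specific feature-flag product's API:

```python
import hashlib

def is_in_experiment(user_id, experiment, rollout_pct=10):
    """Deterministically bucket users 0-99; same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_pct

def get_prompt(user_id):
    # Hypothetical prompt identifiers, matching the pseudocode above
    if is_in_experiment(user_id, "new_prompt_v2"):
        return "PROMPT_V2"
    return "PROMPT_V1"
```

Hashing `experiment:user_id` (rather than `user_id` alone) keeps bucket assignments independent across experiments.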

Model Registry

Version models and prompts in a registry. Track metadata: eval score, date, author, deployed version, cost.
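A minimal in-memory registry along these lines might look like the following. Field names are illustrative, not any specific registry product's schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RegistryEntry:
    name: str
    version: str
    eval_score: float
    author: str
    cost_per_1k_tokens: float
    created: date = field(default_factory=date.today)
    deployed: bool = False

registry = {}

def register(entry):
    registry[(entry.name, entry.version)] = entry

def promote(name, version):
    """Mark exactly one version as deployed; demote all others of that model."""
    for (n, v), entry in registry.items():
        if n == name:
            entry.deployed = (v == version)
```

In production this would live in a database or an MLOps registry service, but the invariant is the same: at most one deployed version per artifact, with eval score and cost recorded at registration time.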

Why this matters Treat prompts and models as first-class artifacts in your deployment pipeline. Eval gates and feature flags enable safe iteration at production speed.

Governance & Compliance Advanced

LLM systems raise unique governance, safety, and compliance concerns. Frameworks: model cards, audit trails, responsible AI.

Model Cards

Document model capability, limitations, bias, and appropriate use. Critical for transparency and governance.

// Model card example (simplified)
{
  "model_name": "claude-3-sonnet-v1.2-finetuned",
  "intended_use": "Customer support chat assistant",
  "performance": {
    "accuracy": 0.91,
    "latency_p95_ms": 245
  },
  "limitations": [
    "May hallucinate facts not in training data",
    "Trained on English only"
  ],
  "bias_analysis": {
    "gender_bias_test": "WEAT score 0.3 (low bias)",
    "notes": "Tested on balanced gender distribution"
  },
  "safety_measures": [
    "Llama Guard filtering enabled",
    "Rate limiting: 100 req/min per user"
  ]
}

Audit Trails & Data Lineage

Regulatory Landscape

Regulation Key Requirement LLM Impact
EU AI Act High-risk AI needs impact assessment, transparency LLMs in hiring/legal flagged as high-risk. Requires explainability.
GDPR Right to deletion, data minimization, transparency Can't delete from trained models. Document data sources, get consent.
SOC 2 Type II Security, availability, confidentiality controls Requires audit logging, access controls, incident response plans.
HIPAA Protected health information must be encrypted Can't send PHI to public APIs. Self-host or use BAA-signed providers.

Responsible AI Framework

Why this matters Governance isn't just compliance theater—it builds trust. Regulations are tightening. Start with model cards and audit logging. Plan for GDPR/AI Act now before they become urgent.

Agent Architectures Advanced

Agents use LLMs to reason, plan, and take actions. Patterns: ReAct, Plan-and-Execute, multi-agent systems.

ReAct (Reasoning + Acting)

Agent thinks (Reason) and acts (Act) in a loop until task complete. At each step: generate thought, choose action, observe result.

// ReAct loop pseudocode
function react_agent(task) {
    thought_history = []
    while (!task.is_done()) {
        // Step 1: Think (reason about next action)
        thought = llm.generate(
            "Thought: what should I do next?\n" + format_history(thought_history)
        )
        thought_history.push({type: "thought", content: thought})

        // Step 2: Act (choose and execute action)
        action = llm.generate("Action: use one of [" + list_tools() + "]")
        result = execute_tool(action)
        thought_history.push({type: "action", content: action})

        // Step 3: Observe (result becomes context)
        thought_history.push({type: "observation", content: result})
    }
    return task.result
}

Tool Calling Patterns
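A common tool-calling pattern is a registry of named tools plus a dispatcher that parses the model's emitted action. The JSON action format and the stub tools below are assumptions for illustration, not any provider's official schema:

```python
import json

TOOLS = {
    "search": lambda q: f"results for {q}",          # stub tool
    "calculator": lambda expr: str(eval(expr)),       # illustration only:
                                                      # never eval untrusted input
}

def execute_tool(action_json):
    """Dispatch a model action like {"tool": "calculator", "input": "2+2"}."""
    try:
        action = json.loads(action_json)
    except json.JSONDecodeError:
        return "error: action was not valid JSON"     # feed back as observation
    tool = TOOLS.get(action.get("tool"))
    if tool is None:
        return f"error: unknown tool {action.get('tool')}"
    return tool(action["input"])

print(execute_tool('{"tool": "calculator", "input": "2+2"}'))  # → 4
```

Note that parse and dispatch errors are returned as strings rather than raised: in a ReAct loop, the error text becomes the next observation so the model can self-correct.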

Memory Management
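A simple memory strategy is a sliding window: keep the system message plus as many recent turns as fit a token budget. The 4-characters-per-token heuristic below is a rough approximation, not a real tokenizer:

```python
def trim_history(messages, max_tokens=4000,
                 count_tokens=lambda m: len(m["content"]) // 4):
    """Keep system messages plus the most recent non-system turns under budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m) for m in system)
    kept = []
    for msg in reversed(rest):            # newest first
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                         # older turns are dropped
        kept.insert(0, msg)               # restore chronological order
        used += cost
    return system + kept
```

More elaborate schemes (summarizing dropped turns, or retrieving old turns from a vector store) build on the same budget-accounting skeleton.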

Orchestration Frameworks

Why this matters Agents extend LLMs beyond text generation to task automation. ReAct + tool calling is production-ready. Multi-agent systems add complexity; start with single agent loops.

Multimodal LLMOps Advanced

Multimodal models handle images, audio, video alongside text. Different eval metrics, serving considerations.

Multimodal Model Examples

Model Input Modalities Strength Cost
GPT-4o Text, Image, Audio Best vision understanding, reasoning $15/1M input tokens
Claude 3.5 Sonnet Text, Image Strong vision, nuanced understanding $3/1M input tokens
Llama 3.2 Vision Text, Image Open-source, fine-tunable Self-hosted cost
Whisper + GPT-4 Audio → Text + Text Speech transcription then chat $0.02/min audio + text tokens

Image Input Handling

// Image encoding strategies

// 1. Direct base64 (simple, works for small images)
image_b64 = image_to_base64("/path/to/image.jpg")
response = llm.generate(
    prompt="Describe image:",
    images=[{"data": image_b64}]
)

// 2. URL reference (models fetch from URL)
response = llm.generate(
    prompt="Describe image:",
    images=[{"url": "https://example.com/image.jpg"}]
)

// 3. Token cost: images add tokens (e.g., ~170 tokens per image)
cost = (num_images * 170 + prompt_tokens) * cost_per_token

Eval Metrics for Vision

Serving Multimodal Models

Why this matters Multimodal opens new use cases: document understanding, visual QA, image analysis. Same principles apply: evaluate, optimize, monitor. Vision models cost more tokens; cache aggressively.

Production Patterns & Anti-Patterns Intermediate

Best practices and common pitfalls for production LLM systems.

Structured Output & Reliability

// Pattern: enforce structured output
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["sentiment"]
}
response = llm.generate(prompt, schema=schema)
// Guaranteed valid JSON matching schema
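When the provider does not support schema-constrained decoding, a common fallback is validate-and-retry. This sketch hand-rolls validation for the sentiment schema above rather than pulling in a JSON Schema library; `llm_call` is a placeholder for your actual client:

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def parse_sentiment(raw):
    """Validate model output against the schema; raise on any violation."""
    data = json.loads(raw)
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError("sentiment missing or outside enum")
    conf = data.get("confidence")
    if conf is not None and not 0 <= conf <= 1:
        raise ValueError("confidence out of [0, 1]")
    return data

def generate_validated(llm_call, prompt, retries=2):
    """Re-prompt when the model returns malformed or off-schema JSON."""
    for _ in range(retries + 1):
        try:
            return parse_sentiment(llm_call(prompt))
        except (ValueError, json.JSONDecodeError):
            prompt += "\nReturn ONLY valid JSON matching the schema."
    return None  # caller decides the fallback
```

Returning `None` after exhausting retries lets the caller fall back to a cached answer or a default, in line with the graceful-degradation pattern below.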

Retry Strategies

// Exponential backoff + fallback
function generate_with_retry(prompt, max_retries=3) {
    for (attempt = 0; attempt < max_retries; attempt++) {
        try {
            return llm.generate(prompt)
        } catch (error) {
            if (error.is_transient()) {
                wait_ms = 100 * (2 ** attempt)   // 100ms, 200ms, 400ms
                sleep(wait_ms)
            } else {
                break   // Non-transient, don't retry
            }
        }
    }
    // Fallback: return cached or empty
    return cache.get(prompt) || ""
}

Graceful Degradation

Feature Flags for AI

Control AI behavior via flags: disable RAG if retrieval broken, switch to cheaper model if quota low, disable generation if too slow.
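A sketch of such flag-driven degradation follows; the flag names, model names, and the 90%-of-budget threshold are all illustrative:

```python
DEFAULT_FLAGS = {
    "rag_enabled": True,        # flip off if retrieval is broken
    "cheap_model_only": False,  # force the budget model
    "generation_enabled": True, # kill switch: serve cached/static responses
}

def choose_model(flags, monthly_spend, budget):
    """Pick a model per request; degrade to cheaper options before failing."""
    if not flags["generation_enabled"]:
        return None  # caller serves a cached or canned response
    if flags["cheap_model_only"] or monthly_spend > 0.9 * budget:
        return "gpt-4o-mini"
    return "gpt-4o"
```

Because the flags are evaluated per request, an operator can degrade the system (or cut it off entirely) without a deploy, which is exactly the point.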

Anti-Patterns

Anti-Pattern Problem Solution
Over-reliance on single model API downtime = full outage Fallback chains, local model backup
No eval pipeline Quality degrades silently Weekly evals, regression alerts
Ignoring latency budgets Slow feature = poor UX Profile, cache, use faster models for critical paths
Raw model output to users Hallucination, toxicity Guardrails, output filtering, human review
No prompt versioning Regression, no rollback Git version prompts, CI gates, feature flags
Why this matters Production systems fail gracefully. Build defense in depth: retries, fallbacks, feature flags. Version control prompts like code. Invest in observability early.