
AI Agents — Advanced Tool Calling

Production patterns for building reliable, observable, and safe tool-using AI agents — from schema design through multi-agent orchestration to deployment.

Tool Use · Function Calling · ReAct · Multi-Agent Orchestration · Guardrails · MLOps
20 Sections · Code Examples · 8 Architecture Diagrams · 100% Production Ready

Why Tool Calling Changes Everything

Tool calling is when an LLM generates structured function calls instead of free text. The model outputs a tool name and arguments, your code executes the call and returns the result, and the model continues. This unlocks deterministic, repeatable, and composable agent workflows.

User Query → LLM Reasoning → Tool Selection → Execution → Result Synthesis → Response

Structured I/O

From text-in/text-out to deterministic function calls. The model outputs JSON matching your schema, not ambiguous natural language.

Multi-Step Reasoning

From single-turn to iterative loops. The agent thinks, acts, observes, and repeats until the goal is reached.

Integrated Systems

From isolated model to end-to-end system. Your code controls execution, validation, and error recovery.

Why Agents Need Tools

Real-Time Data

Access APIs, databases, and live information the model has no knowledge of.

Take Actions

Send emails, create records, modify systems, and execute business logic.

Precise Computation

Offload math, date calculations, and deterministic logic to code.

Private Systems

Connect to internal services, databases, and proprietary systems safely.

Tool Calling is Deterministic Structured Output
It is not "asking the AI to write code". The model outputs JSON matching your schema; your code validates and executes it. You're always in control.

Five Production Design Commitments

1. Capability Boundary

Curated tools with least-privilege, rate limits, and auditing. Never expose untrusted capabilities.

2. Reliable Substrate

Idempotency, retries, durable state checkpoints, and saga compensation patterns.

3. Grounded Loop

Schema validation + explicit grounding rules. Tool inputs must be validated before execution.

4. Observability

Tool-call spans, auth decisions, retries, and state checkpoints. Monitor actions, not just tokens.

5. Continuous Eval

Adversarial testing, InjecAgent benchmarks, and red-team suites in CI/CD.


Tool Calling Fundamentals

The Tool Calling Lifecycle

Define schemas → Present to model → Select tool → Execute → Return result → Continue

Complete Tool Calling Flow with Anthropic Claude API

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "default": "celsius"
            }
        },
        "required": ["city"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Handle tool use response
for block in response.content:
    if block.type == "tool_use":
        result = execute_tool(block.name, block.input)
        # Send result back to continue conversation
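Continuing the conversation means sending the result back as a `tool_result` block inside a `user` message that references the original `tool_use` id. A minimal sketch of that message construction (the `execute_tool` dispatcher remains your own code):

```python
def make_tool_result_message(tool_use_id: str, content: str) -> dict:
    # Tool results return to the model as a *user* message containing a
    # tool_result block that references the id of the tool_use request.
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": content,
        }],
    }
```

Append this message to the conversation and call `messages.create` again with the same `tools` list; the model then synthesizes its final answer from the result.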

Tool Definition

Name, description, and JSON schema that defines the interface the model must follow.

Tool Selection

The model picks which tool to call based on the user query and available tools.

Tool Execution

Your code runs the selected tool with the model-provided arguments.

Result Injection

Feed results back to the model to continue reasoning or take next steps.

The Model NEVER Executes Tools
It outputs structured JSON saying "I want to call X with args Y". Your code is always in control. This is the core principle that makes tool calling safe and deterministic.

Tool Schema Design & Best Practices

Well-designed schemas dramatically improve model accuracy and reduce errors. The schema is the interface between your LLM and your system.

Well-Designed vs Poorly-Designed Schemas

✓ Good Schema
"input_schema": {
  "type": "object",
  "properties": {
    "date_range": {
      "type": "object",
      "properties": {
        "start": {
          "type": "string",
          "format": "date",
          "description": "YYYY-MM-DD"
        }
      }
    },
    "limit": {
      "type": "integer",
      "minimum": 1,
      "maximum": 100
    }
  },
  "required": ["date_range"]
}
✗ Bad Schema
"input_schema": {
  "type": "object",
  "properties": {
    "query": {"type": "string"},
    "options": {"type": "string"}
  }
}
# No descriptions, no constraints,
# vague parameter names, string for
# everything, no validation

Schema Validation Pipeline

Raw LLM Output → JSON Parse → Schema Validation → Type Coercion → Business Validation → Execute
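The stages of this pipeline can be sketched with stdlib-only code. This is a deliberately tiny subset of JSON Schema (required keys, enums, numeric bounds, integer coercion), not a replacement for a real validator:

```python
import json

def validate_tool_input(raw: str, schema: dict) -> dict:
    """Parse and validate model-emitted arguments against a tiny
    subset of JSON Schema (required keys, enums, numeric bounds)."""
    args = json.loads(raw)                      # 1. JSON parse
    for key in schema.get("required", []):      # 2. required-field check
        if key not in args:
            raise ValueError(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in args:
            continue
        value = args[key]
        if spec.get("type") == "integer":       # 3. type coercion
            value = args[key] = int(value)
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{key}: {value!r} not in {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            raise ValueError(f"{key}: {value} below minimum")
        if "maximum" in spec and value > spec["maximum"]:
            raise ValueError(f"{key}: {value} above maximum")
    return args                                 # 4. ready for business checks

schema = {
    "required": ["city"],
    "properties": {
        "city": {"type": "string"},
        "limit": {"type": "integer", "minimum": 1, "maximum": 100},
    },
}
```

In production you would use a full validator (e.g. the `jsonschema` package) before your business-rule checks; the point here is the ordering of the stages.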

Best Practices for Tool Schemas

1. Write Descriptions for the MODEL

Be explicit about format, constraints, and examples. The model reads the description to understand what you want.

2. Use Enums to Constrain Choices

Don't let the model invent values. Define allowed options explicitly in the schema.

3. Keep Tool Count Under 20

More tools = worse selection accuracy. Group related tools or use sub-actions.

4. Version Your Schemas

Breaking changes need migration. Track schema versions and deprecate gradually.

5. Include Examples in Descriptions

Show the model what good input looks like. "Example: 2024-03-15" is better than "ISO date format".

6. Validate Server-Side

NEVER trust model output. Validate all inputs, check ranges, verify enums, and handle edge cases.
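Practice 3 (keeping tool count low) is often implemented by folding related operations into one tool with an `action` enum. A sketch with hypothetical CRM sub-actions:

```python
# Hypothetical "crm" tool that folds several related operations into one
# schema via an "action" enum, keeping the total tool count low. The
# schema would declare: "action": {"enum": ["lookup", "update", "archive"]}.
CRM_ACTIONS = {
    "lookup":  lambda args: f"found contact {args['name']}",
    "update":  lambda args: f"updated contact {args['name']}",
    "archive": lambda args: f"archived contact {args['name']}",
}

def crm_tool(args: dict) -> str:
    action = args.get("action")
    if action not in CRM_ACTIONS:
        # Fail loudly: the model sees this message and can self-correct.
        raise ValueError(f"unknown action: {action}")
    return CRM_ACTIONS[action](args)
```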

Parameter Type Patterns

Type | Pattern | Use Case
String with Regex | "pattern": "^[A-Z]{2}\\d{3}$" | Codes, SKUs, phone numbers
Number with Bounds | "type": "number", "minimum": 0, "maximum": 100 | Prices, percentages, counts
Enum | "enum": ["pending", "active", "archived"] | Status, categories, choices
Array with Items | "type": "array", "items": {"type": "string"}, "maxItems": 10 | Tags, email lists, IDs
Nested Object | "type": "object", "properties": {...}, "required": [...] | Complex data, structured inputs

Tool Interface Patterns

Interface Pattern | Contract Format | Security Implications | Best-Fit Use Cases
REST+OpenAPI | OpenAPI 3.0 spec, HTTP verbs | Network isolation, TLS required, easy MITM if not HTTPS | External APIs, standard microservices
gRPC+Protobuf | Protocol Buffers, binary format | Strongly typed, harder to modify, TLS required | High-throughput internal services, low latency
Provider Function Calling | Native code, type signatures | Direct code execution, trust model is critical | Same-process agents, embedded systems
MCP (Model Context Protocol) | JSON-RPC, tools as resources | Standardized, audit trails, capability discovery | Multi-agent systems, plugin architectures
Tool-Proxy/Firewall | Transparent proxy, schema validation | Input sanitization, rate limiting, logging layer | Enterprise, compliance-heavy, zero-trust
Security First Lesson: Treat tool calling like untrusted code generation. Regardless of interface pattern, validate all inputs, check permissions, and log every execution.

Agent Orchestration Patterns

Different patterns for different needs. Most production systems need Sequential Chain. Start simple.

Single Tool: Query → LLM → Tool
Sequential Chain: Query → LLM → Tool A → LLM → Tool B
Router: Query → Router → Tool A / Tool B → LLM
Autonomous Agent: Query → Loop [Think → Act → Observe] → Response

Orchestration Pattern Comparison

Pattern | When to Use | Complexity | Latency | Reliability
Single Tool | Simple tasks, one action per query | Low | Fast | High
Sequential Chain | Multi-step workflows with known order | Medium | Moderate | High
Router | Classification or dispatch to different tools | Medium | Fast | High
Autonomous Agent | Complex reasoning, unknown steps, exploration | High | Slow | Medium

AgentOrchestrator Implementation (Autonomous Loop)

class AgentOrchestrator:
    def __init__(self, max_iterations=10):
        self.max_iterations = max_iterations
        self.tool_registry = {}
        self.conversation = []

    def register_tool(self, name, tool_func, schema):
        self.tool_registry[name] = {"func": tool_func, "schema": schema}

    def run(self, query):
        self.conversation = [{"role": "user", "content": query}]
        for iteration in range(self.max_iterations):
            # Get LLM response with tool definitions
            response = call_llm(self.conversation, self.tool_registry)
            if response.stop_reason == "end_turn":
                return response.text
            for block in response.content:
                if block.type == "tool_use":
                    result = self.tool_registry[block.name]["func"](block.input)
                    self.conversation.append({"role": "user", "content": result})
        return "Max iterations reached"
Start with Sequential Chain
Most production use cases need Sequential Chain, not full autonomous agents. Autonomous loops are harder to debug, slower, and more expensive. Use them only when the task requires true exploration.
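A sequential chain can be as small as a loop that threads each step's output into the next. Here the steps are plain callables standing in for LLM-mediated tool calls:

```python
def run_chain(query: str, steps) -> str:
    """Run tools in a fixed, known order, threading each result into
    the next step. `steps` is a list of callables."""
    result = query
    for step in steps:
        result = step(result)
    return result

# Illustrative stand-ins for "Tool A" and "Tool B"
steps = [
    lambda q: q.upper(),        # e.g. normalize / extract
    lambda q: f"report: {q}",   # e.g. synthesize
]
```

Because the order is fixed, every run is easy to trace and test, which is exactly why this pattern should be your default before reaching for an autonomous loop.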
Skill Map

Agent Skill Map — Capabilities & Tool Taxonomy

A production agent system is only as powerful as its tools. This skill map charts the full landscape of agent capabilities, the tool categories that enable them, and how they compose into real-world workflows.

AI Agent Orchestrator
  • Information Retrieval: Web Search (Google, Bing, Brave), RAG / Vector DB Query, Database / SQL Query, API Data Fetch (REST / GraphQL), Document / PDF Reader
  • Code & Computation: Code Execution (Python, JS), Shell / CLI Commands, Math / Calculator, Data Transform / ETL, Sandboxed Notebook (Jupyter)
  • Actions & Integrations: Email (Send / Read / Draft), Calendar (Create / Query), Slack / Teams Messaging, Jira / Linear / Asana Tickets, CRM (Salesforce / HubSpot), Git (Commit / PR / Review)
  • Content Generation: Document Writer (DOCX/PDF), Image Generation (DALL·E, SD), Chart / Visualization Builder, Slide Deck / Presentation, Spreadsheet / CSV Generator
  • Memory & State: Conversation History, Long-Term Memory (Vector), User Preferences / Context
  • Safety & Control: Human-in-the-Loop Gate, Permission Checker, Budget / Rate Limiter

Information Retrieval

The foundation of any useful agent — fetching context from the world.

Tool | Use Case | Complexity
Web Search | Real-time facts, news, current events | Low
RAG / Vector Search | Private knowledge base Q&A | Medium
SQL Query | Structured data: metrics, reports, analytics | Medium
API Fetch | Live system data (weather, stock, status) | Low
Doc Reader | PDF/DOCX parsing, contract analysis | Medium

Code & Computation

Extends the agent beyond language into precise computation and system control.

Tool | Use Case | Complexity
Code Exec | Data analysis, plotting, scripting | High
Shell/CLI | File ops, git, system admin | High
Calculator | Precise math, financial calcs | Low
Data Transform | CSV cleaning, JSON reshape, ETL | Medium
Notebook | Interactive data exploration | High

Actions & Integrations

Where agents become truly useful — taking actions in real systems on behalf of users.

Tool | Use Case | Risk Level
Email | Draft, send, search, summarize | Medium
Calendar | Schedule, reschedule, check availability | Medium
Messaging | Post updates, respond, search channels | Medium
Ticketing | Create, update, assign, close | Low
CRM | Update contacts, log activities | High
Git | Commit, create PRs, review code | High

Skill Composition — Mapping Use Cases to Tool Sets

Real agent tasks rarely need a single tool. Here's how skills compose for common production use cases:

Use Case | Required Skills | Tool Chain | Orchestration
Research Assistant | Retrieval, Compute | Web Search → Document Reader → Summarize → Write Report | Sequential Chain
Data Analyst | Retrieval, Compute, Generation | SQL Query → Code Exec (pandas) → Chart Builder → Slide Deck | Sequential + Parallel
Customer Support | Retrieval, Actions, Memory | RAG Search → CRM Lookup → Ticket Create → Email Draft | Router + Sequential
DevOps Copilot | Compute, Retrieval, Actions | Log Search → Shell Exec → Runbook Lookup → Slack Alert | ReAct (Autonomous)
Meeting Prep Agent | Retrieval, Actions, Generation | Calendar Check → CRM Lookup → Web Search → Doc Writer | Sequential Chain
Code Review Agent | Compute, Retrieval, Actions | Git Diff → Code Exec (tests) → Style Check → PR Comment | Parallel + Sequential

Tool Design Principles

# The UNIX philosophy for agent tools:
# 1. Do one thing well
# 2. Composable inputs/outputs
# 3. Fail loudly with clear errors
# 4. Idempotent where possible
# 5. Observable (logs, metrics, traces)

class ToolDesignChecklist:
    single_responsibility: bool  # One action per tool
    clear_description: bool      # LLM can understand when to use
    typed_schema: bool           # JSON Schema with constraints
    error_messages: bool         # Actionable, not cryptic
    idempotent: bool             # Safe to retry
    timeout: bool                # Never hangs
    rate_limited: bool           # Protects downstream
    permission_scoped: bool      # Least privilege
    observable: bool             # Emits traces/metrics

Tool Risk Tiers & Approval Gates

Not all tools carry equal risk. Tier them and enforce appropriate approval flows:

T1

Read-Only (Auto-approve)

Search, lookup, calculate, read. No side effects. Safe for autonomous execution.

T2

Reversible Write (Soft-approve)

Draft email, create ticket, update status. Can be undone. Log + notify, execute with confirmation for sensitive data.

T3

Irreversible Action (Human-approve)

Send email, publish post, execute payment, delete record. Require explicit human confirmation before execution.

T4

Privileged / Admin (Never auto-approve)

Access controls, billing changes, system config, code deploy. Always require authenticated human approval with audit trail.

Principle: Start with read-only tools (Tier 1) and expand to write actions only after your orchestration, error handling, and observability are proven in production. Most agent value comes from retrieval + synthesis — not from taking actions.
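A tier lookup is a few lines of code. The tool-to-tier mapping below is illustrative; in practice it lives in your tool registry:

```python
# Illustrative tier assignments; real tiers come from your tool registry.
TOOL_TIERS = {
    "web_search": 1,     # T1: read-only
    "create_ticket": 2,  # T2: reversible write
    "send_email": 3,     # T3: irreversible action
    "deploy": 4,         # T4: privileged / admin
}

def approval_required(tool_name: str) -> str:
    tier = TOOL_TIERS.get(tool_name, 4)  # unknown tools default to strictest
    if tier == 1:
        return "auto"    # safe for autonomous execution
    if tier == 2:
        return "soft"    # log + notify, confirm for sensitive data
    if tier == 3:
        return "human"   # explicit confirmation before execution
    return "admin"       # authenticated approval + audit trail
```

Note the fail-closed default: a tool missing from the registry is treated as Tier 4, never auto-approved.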
Reasoning

ReAct & Reasoning Loops

ReAct (Reason + Act) is the pattern for agentic behavior: Think → Act → Observe → Repeat. The agent reasons about what to do, takes action, observes results, and continues.

Thought → Action → Observation → Done? → Final Answer

ReActAgent Implementation

class ReActAgent:
    def __init__(self, llm, tools, max_steps=10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
        self.scratchpad = []

    async def run(self, query):
        for step in range(self.max_steps):
            # Think: reason about what to do next
            thought = await self.llm.think(query, self.scratchpad)
            self.scratchpad.append(("thought", thought))
            if thought.action == "finish":
                return thought.answer
            # Act: execute the chosen tool
            tool = self.tools[thought.tool_name]
            result = await tool.execute(thought.tool_args)
            # Observe: record the result
            self.scratchpad.append(("observation", result))
        return "Max steps reached — could not complete"

ReAct vs Other Approaches

ReAct (Reason + Act)

Explicit think-act-observe loop. Best for complex tasks requiring exploration and recovery.

Chain-of-Thought

Reasoning without action. Better for analysis but can't execute tasks or access new data.

Plan-and-Execute

Create a plan first, then execute. Good for structured workflows, poor for adaptive tasks.

LATS (Language Agent Tree Search)

Tree search over actions. Expensive but explores multiple paths for complex problems.

Planning Algorithms Comparison

Paradigm | Core Idea | Strengths | Weaknesses
ReAct | Reason → Act → Observe → Repeat | Flexible, adaptive, handles new situations, easy to debug | Can be verbose, slower on simple tasks
Plan-and-Solve | Create plan first, then execute steps | Structured, good for complex tasks, reduces errors | Brittle when plan becomes stale, poor for exploration
Tree of Thoughts | Explore multiple reasoning branches | Finds better solutions, good for reasoning, backtracking | High token cost, slow, overkill for simple tasks
Reflexion | Agent self-reflects on failures | Learns from mistakes, improves over iterations | Requires many iterations, slow convergence
Toolformer | Model decides when to call tools | Minimal tokens, fast, uses tools sparingly | Requires fine-tuning, less transparent reasoning
Graph of Thoughts | Reasoning as directed acyclic graph | Expresses complex dependencies, good for workflows | Complex to implement, expensive to evaluate
Key Principle: Use the smallest-horizon agent that meets your use case. Start with ReAct for simplicity. Upgrade to Tree of Thoughts only if your task requires exploration over multiple branches.
Key Insight: ReAct gives the model a structured way to interleave reasoning and action. Without it, agents tend to either over-plan (missing opportunities) or act impulsively (making mistakes).
Performance

Parallel & Multi-Tool Execution

Modern LLMs can request multiple tools at once. Instead of waiting for each result, execute them in parallel for 3-5x latency improvement.

LLM emits [Tool A, Tool B, Tool C] → Tool A / Tool B / Tool C run concurrently → Results Collected → Next LLM Turn

ParallelToolExecutor with asyncio.gather

class ParallelToolExecutor:
    async def execute_batch(self, tool_calls, timeout=30):
        tasks = [
            asyncio.wait_for(self._execute_one(call), timeout=timeout)
            for call in tool_calls
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # Handle partial failures gracefully
        return [
            ToolResult(call.id, result)
            if not isinstance(result, Exception)
            else ToolResult(call.id, error=str(result))
            for call, result in zip(tool_calls, results)
        ]

Parallel Execution Benefits & Challenges

Benefits

Latency: 3-5x faster for multi-step tasks. Throughput: Execute 10 tools simultaneously instead of sequentially.

Challenges

Ordering: Dependency handling (Tool B needs Tool A result). Cost: Fan-out multiplies API calls.

Error Handling

Partial failures (1 of 3 fails). Timeouts on slow tools. Always use return_exceptions=True.
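The effect of `return_exceptions=True` is easy to demonstrate: failures come back as values alongside successful results, so one bad tool cannot cancel its siblings mid-flight. A self-contained sketch:

```python
import asyncio

async def flaky(name: str, fail: bool) -> str:
    # Stand-in for a tool call that may raise.
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} ok"

async def run_batch():
    # With return_exceptions=True, exceptions are returned in-place
    # instead of propagating and cancelling the other tasks.
    results = await asyncio.gather(
        flaky("a", False), flaky("b", True), flaky("c", False),
        return_exceptions=True,
    )
    return [
        r if not isinstance(r, Exception) else f"error: {r}"
        for r in results
    ]

results = asyncio.run(run_batch())
```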

Execution Strategies

Strategy | Use Case | Latency | Complexity
Serial | Tool B depends on Tool A result | Slowest | Simple
Parallel | Independent tools (most common) | 3-5x faster | Moderate
DAG-Based | Complex dependency graphs | Optimal | High
Speculative | Execute multiple paths, pick best | Variable | Very High
Architecture

Multi-Agent Delegation

Delegate complex tasks to specialized sub-agents. Supervisor routes work, workers execute independently, results compose back up.

Supervisor Agent
  • Research Agent: web_search, doc_reader
  • Code Agent: code_exec, file_write
  • Data Agent: sql_query, chart_gen
→ Synthesize Results

SupervisorAgent Router Implementation

class SupervisorAgent:
    def __init__(self):
        self.workers = {
            "research": ResearchAgent(tools=[web_search, doc_reader]),
            "code": CodeAgent(tools=[code_exec, file_write]),
            "data": DataAgent(tools=[sql_query, chart_gen]),
        }

    async def handle(self, task):
        plan = await self.planner.decompose(task)
        results = {}
        for step in plan.steps:
            worker = self.workers[step.agent]
            results[step.id] = await worker.execute(step.instruction, context=results)
        return await self.synthesizer.combine(results)

Multi-Agent Orchestration Patterns

Supervisor / Worker

One coordinator routes tasks to specialized workers. Best for diverse skill domains. Easy to scale.

Peer-to-Peer

Agents communicate directly. More flexible but harder to debug. Good for negotiation.

Hierarchical

Tree of agents. Scales better for large teams. Messaging overhead increases.

Swarm

Decentralized with local rules. Complex behaviors from simple agents. Hard to predict.

Warning: Multi-agent systems multiply complexity, cost, and failure modes. Use a single agent with multiple tools until proven insufficient. Each agent adds latency, context window overhead, and debugging difficulty.
Reliability

Error Handling & Recovery

Real agents fail. Your job is to define what's retryable, what's not, and how the model can recover.

Error → Retryable?
  • Yes → Retry w/ Backoff → Max retries reached? → Fail Gracefully
  • No → Fallback available?
      • Yes → Use Fallback
      • No → Return error to LLM → LLM Retries

ResilientToolExecutor with Decorators

class ResilientToolExecutor:
    @retry(max_attempts=3, backoff=exponential(base=2))
    @circuit_breaker(failure_threshold=5, recovery_timeout=60)
    @timeout(seconds=30)
    async def execute(self, tool_name, args):
        try:
            result = await self.tools[tool_name].run(args)
            self.metrics.record_success(tool_name)
            return result
        except ValidationError as e:
            return ToolError(f"Invalid args: {e}", retryable=False)
        except RateLimitError:
            raise  # Let retry decorator handle
        except TimeoutError:
            self.metrics.record_timeout(tool_name)
            return ToolError("Tool timed out", retryable=True)

Error Taxonomy

Tool Execution Failure

API unreachable, permission denied, network timeout. Often retryable with backoff.

Schema Validation

LLM outputs invalid args. Not retryable at executor level—return to LLM to correct.

Timeout / Rate Limit

Tool slow or API throttled. Retryable with exponential backoff and jitter.
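Exponential backoff with full jitter is one standard formula: sleep a random duration between zero and min(cap, base · 2^attempt). A minimal sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

The jitter spreads retries out so that many clients failing at once don't hammer the recovering service in lockstep; the cap keeps late attempts from waiting minutes.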

Critical: Always return errors TO the LLM as tool results rather than crashing. The model can often recover by trying a different approach or refining its request.

Durable Execution & Saga Patterns

Production systems require durability. If an agent crashes mid-execution, you must be able to resume. Use idempotency keys, durable state checkpoints, and compensation patterns.

Temporal Idempotent Activities

Use Temporal workflows to define durable, idempotent tool executions. If a tool call succeeds but the workflow crashes, Temporal retries from the checkpoint—not from scratch.

Stripe-Style Idempotency Keys

For external APIs, include idempotency keys in request headers. If the same request is sent twice, the server returns cached result—no duplicate charge.
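The server side of an idempotency key is essentially a response cache keyed by the client-supplied key. A toy sketch (a real implementation would persist keys with a TTL and scope them per endpoint):

```python
class IdempotentClient:
    """Sketch of server-side idempotency: replay the cached response
    when the same key is seen twice, instead of re-executing."""
    def __init__(self):
        self._seen = {}
        self.executions = 0  # counts real side effects, for inspection

    def charge(self, key: str, amount: int) -> dict:
        if key in self._seen:
            return self._seen[key]   # duplicate request: cached result
        self.executions += 1         # the real side effect happens once
        result = {"charged": amount, "key": key}
        self._seen[key] = result
        return result
```

The agent's retry loop can now re-send a request after a timeout without risking a double charge, as long as it reuses the same key for the same logical operation.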

Saga Compensation Pattern

For multi-step workflows, define compensating actions (rollbacks). If step 3 fails, execute reverse of step 2, then step 1. AWS Prescriptive Guidance reference.

Workflow Ledger

Persist every tool call and result to a ledger. Enables auditing, replay, and recovery. Each entry is immutable and timestamped.

SagaExecutor with Compensation

class SagaExecutor:
    def __init__(self, ledger):
        self.ledger = ledger
        self.compensations = []

    async def execute_saga(self, steps):
        for step in steps:
            try:
                # Add idempotency key to request
                idempotency_key = uuid4()
                result = await self.execute_with_key(
                    step.tool, step.args, idempotency_key
                )
                # Log to ledger
                await self.ledger.append({
                    "type": "TOOL_CALL",
                    "step_id": step.id,
                    "tool": step.tool,
                    "result": result,
                    "timestamp": now()
                })
                # Register compensation for rollback
                if step.compensate:
                    self.compensations.insert(0, (step.compensate, result))
            except Exception:
                # Execute all compensations in reverse order
                for comp_fn, context in self.compensations:
                    await comp_fn(context)
                    await self.ledger.append({
                        "type": "COMPENSATION",
                        "fn": comp_fn.__name__
                    })
                raise
        return "All steps completed successfully"
Durability First: Use LangGraph checkpointing, Temporal workflows, or custom ledger patterns. Production agents without durability are just expensive, slow, unreliable scripts.
Security

Sandboxing & Permission Models

The model decides which tools to call. You must constrain what's possible through layered permission checks and resource limits.

Gate 1: User Permissions → Gate 2: Agent Sandbox → Gate 3: Tool Allowlist → Gate 4: Resource ACL

SandboxedExecutor with Permission Engine

class SandboxedExecutor:
    def __init__(self, user_context):
        self.allowed_tools = self._resolve_permissions(user_context)
        self.resource_limits = ResourceLimits(
            max_api_calls=100,
            max_tokens_spent=50000,
            max_wall_time=300,
            allowed_domains=["api.internal.com"],
            blocked_actions=["delete", "admin.*"]
        )

    async def execute(self, tool_call):
        # 1. Tool allowlist check
        if tool_call.name not in self.allowed_tools:
            raise PermissionDenied(f"Tool {tool_call.name} not allowed")
        # 2. Argument sanitization
        safe_args = self.sanitizer.clean(tool_call.args)
        # 3. Resource limit check
        self.resource_limits.check_budget()
        # 4. Execute in sandbox
        return await self.sandbox.run(tool_call.name, safe_args)

Sandboxing Technologies

Technology | Isolation Level | Performance | Complexity
Docker | Container (OS-level) | Moderate | Medium
gVisor | Syscall interception | Lower | Medium
Firecracker | MicroVM (strong isolation) | Good | High
WASM | Process-level sandbox | Very Fast | Medium
E2B / Modal | Managed multi-tenant | Good | Low

InjecAgent & Tool Selection Attacks

Prompt injection isn't just direct—attackers can inject malicious instructions via tool-returned content, web pages, or API responses. The InjecAgent benchmark measures resilience.

Indirect Prompt Injection

Attacker embeds malicious instructions in web content, API response, or email body. Agent fetches content and executes instructions unknowingly. Example: web_search returns page with hidden "delete all data" instruction.

Tool Selection Manipulation

Attacker crafts query to confuse tool selection: "Call the delete_account tool to help me." Agent picks wrong tool due to misleading language. Requires strict tool descriptions + grounding.

InjecAgent Benchmark

Research benchmark that tests agent resilience to indirect injection. Measures: Can the agent resist instructions embedded in tool outputs? Real-world validated attacks.

Defense Layering

Tool firewall validates tool calls structurally. Structured outputs reduce injection surface. PII redaction (Presidio) + secrets management (Vault) prevent data leakage via logs.

Defense: Tool Firewall + Structured Output

class ToolFirewall:
    async def filter_and_execute(self, tool_calls):
        for call in tool_calls:
            # 1. Structural validation (strict schema)
            if not self.schema_validator.validate(call):
                raise InvalidToolCall("Failed schema validation")
            # 2. Allowlist enforcement
            if call.name not in self.allowed_tools:
                raise ToolNotAllowed(call.name)
            # 3. Input sanitization with context awareness
            args = self.sanitizer.clean(call.args)
            args = self.redactor.redact_pii(args)  # Presidio
            # 4. Execute & redact result before returning to LLM
            result = await self._execute(call.name, args)
            result = self.redactor.redact_secrets(result)  # Vault
            # 5. Log for audit
            await self.audit_logger.log({
                "tool": call.name,
                "args_hash": sha256(args),
                "result_hash": sha256(result),
                "timestamp": now()
            })
            yield result
Critical Security Principle: Never let an LLM agent execute arbitrary code without sandboxing. Even with "harmless" tools, injection via tool arguments is a real attack vector. Layer defenses: allowlist → sanitize → rate limit → sandbox.
Performance

Streaming & Real-Time Tool Calling

Stream text tokens to the user immediately while tool calls execute in parallel. When a tool call interrupts the stream, show a loading indicator. This cuts perceived latency from seconds to milliseconds.

Text tokens (streaming) → Tool Call (paused) → Execution (running) → Result (injected) → Text resumes

StreamingToolHandler Implementation

class StreamingToolHandler:
    async def stream_with_tools(self, messages, tools):
        async with self.client.messages.stream(
            model="claude-sonnet-4-20250514",
            messages=messages,
            tools=tools,
            max_tokens=4096
        ) as stream:
            async for event in stream:
                if event.type == "content_block_start":
                    if event.content_block.type == "tool_use":
                        # Tool call detected mid-stream
                        tool_input = await self._collect_tool_input(stream)
                        result = await self.executor.execute(
                            event.content_block.name, tool_input
                        )
                        yield ToolResultEvent(result)
                    else:
                        yield TextStartEvent()
                elif event.type == "content_block_delta":
                    if hasattr(event.delta, 'text'):
                        yield TextDeltaEvent(event.delta.text)

Real-Time Communication Patterns

Server-Sent Events

Browser native streaming. Perfect for web clients and long-lived connections.

WebSocket

Bidirectional real-time communication. Best for interactive, multi-turn streams.

Token Rendering

Display each token as it arrives. Users see response forming in real-time.

Latency Hiding

Mask tool execution behind visible text. Perceived latency drops dramatically.

Operations

Observability & Tracing

Trace every LLM call, every tool selection, every tool execution, every result, every error. Collect token counts, latencies, and costs. Without observability, you can't debug or optimize.

User Request (0ms) → LLM Call 1 (800ms) → search (+120ms) → LLM Call 2 (600ms) → weather (+200ms) → LLM Call 3 (400ms) → Response (~2.1s)

OpenTelemetry Integration

from opentelemetry import trace

tracer = trace.get_tracer("agent")

class TracedAgent:
    @tracer.start_as_current_span("agent.run")
    async def run(self, query):
        span = trace.get_current_span()
        span.set_attribute("query", query[:200])
        for step in range(self.max_steps):
            with tracer.start_span("llm.call") as llm_span:
                response = await self.llm.generate(...)
                llm_span.set_attribute("model", self.model)
                llm_span.set_attribute("tokens.input", response.usage.input)
                llm_span.set_attribute("tokens.output", response.usage.output)
            if response.has_tool_use:
                with tracer.start_span(f"tool.{tool_name}") as tool_span:
                    result = await self.execute_tool(...)
                    tool_span.set_attribute("tool.success", not result.is_error)

Key Metrics Dashboard

LLM Calls

Calls per query, latency distribution (P50/P95/P99), token usage trends.

Tool Selection

Accuracy rate, which tools selected most, tool selection errors.

Costs & Budget

Token usage & cost per query, cost per tool, budget tracking.

Tool Latency

Execution time per tool, P50/P95/P99 distribution, bottleneck tools.

Error Tracking

Error rate by tool, error types, error recovery success.

Tools: LangSmith, Arize Phoenix, OpenTelemetry

Datadog, Braintrust, custom observability backends.

Quality

Testing & Evaluation

Build a testing pyramid: Unit tests for tool functions → Integration tests for tool calling flow → Agent eval suites for end-to-end task completion → Red team for adversarial testing.

Unit Tests (tool functions, schemas) → Integration Tests (tool calling flow) → Agent Eval (E2E task completion) → Red Team (adversarial)

AgentEvaluator Implementation

class AgentEvaluator:
    def __init__(self, agent, eval_set):
        self.agent = agent
        self.eval_set = eval_set  # [(query, expected_tools, expected_answer)]

    async def run_eval(self):
        results = []
        for query, expected_tools, expected_answer in self.eval_set:
            trace = await self.agent.run_traced(query)
            results.append(EvalResult(
                tool_selection_accuracy=self._check_tools(trace, expected_tools),
                answer_correctness=self._check_answer(trace.answer, expected_answer),
                steps_taken=len(trace.steps),
                total_tokens=trace.total_tokens,
                latency_ms=trace.duration_ms,
            ))
        return EvalReport(results)

Evaluation Metrics & Patterns

Deterministic Tests

Tool functions must be deterministic. Unit test each tool independently.

LLM-as-Judge

Use another LLM to grade answer quality. Good for subjective tasks.

Regression Testing

Run full eval suite on model upgrades. Lock in baseline before changes.

Adversarial Testing

Prompt injection, jailbreak attempts, malformed inputs, edge cases.
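One way to score tool selection deterministically is to check that the expected tools appear, in order, in the actual trace. This is one simple metric among many; it ignores extra calls and argument correctness:

```python
def tool_selection_accuracy(expected: list, actual: list) -> float:
    """Fraction of expected tool calls that appear, in order, in the
    actual trace. Extra calls in `actual` are not penalized."""
    if not expected:
        return 1.0
    hits, i = 0, 0
    for tool in actual:
        if i < len(expected) and tool == expected[i]:
            hits += 1
            i += 1
    return hits / len(expected)
```

Deterministic metrics like this anchor your regression suite; layer LLM-as-judge scoring on top for answer quality, not instead.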

Economics

Cost Control & Optimization

LLM tokens drive costs: input context grows with each step. Tool execution, retries, and multi-agent fan-out add up. Set budgets, track spend, optimize aggressively.

LLM Tokens: 65% · Tool Execution: 15% · Retries: 10% · Overhead: 10%

CostAwareAgent with Budgets

class CostAwareAgent:
    def __init__(self, budget_tokens=50000, budget_usd=0.10):
        self.token_budget = budget_tokens
        self.usd_budget = budget_usd
        self.tokens_used = 0
        self.cost_usd = 0.0

    async def run(self, query):
        for step in range(self.max_steps):
            if self.tokens_used > self.token_budget * 0.9:
                return self._force_answer("Approaching token budget")
            response = await self.llm.generate(...)
            self.tokens_used += response.usage.total_tokens
            self.cost_usd += self._calculate_cost(response.usage)
            if self.cost_usd > self.usd_budget:
                return self._force_answer("Cost budget exceeded")

Optimization Strategies

Prompt Caching

Reuse system prompts and tool definitions across queries. Dramatic savings on repeated context.

Context Pruning

Summarize old steps and tool results. Keep only recent, relevant context.

Model Tiering

Cheap model for routing. Strong model only for final answer synthesis.

Tool Result Caching

Cache tool outputs (weather, stock prices, search results) aggressively.
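A tool result cache only needs a deterministic key (tool name plus canonicalized arguments) and a TTL. A minimal in-process sketch — the class name and structure are illustrative; production systems would typically back this with Redis:

```python
import hashlib
import json
import time


class ToolResultCache:
    """TTL cache keyed on tool name + canonicalized args."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (inserted_at, result)

    def _key(self, tool_name, args):
        # sort_keys makes {"a":1,"b":2} and {"b":2,"a":1} hash identically
        blob = json.dumps(args, sort_keys=True)
        return hashlib.sha256(f"{tool_name}:{blob}".encode()).hexdigest()

    def get(self, tool_name, args):
        entry = self._store.get(self._key(tool_name, args))
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]
        return None  # miss or expired

    def put(self, tool_name, args, result):
        self._store[self._key(tool_name, args)] = (time.monotonic(), result)
```

Check the cache before dispatching the tool call; only cache tools that are read-only — caching a `send_email` call would silently drop side effects.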

14 / Cost Control & Optimization
DevOps

Production Deployment

Stateless agent services behind a load balancer. Store conversation state in Redis or a database. Trace every call. Monitor tool health. Feature flag new tools. Enable zero-downtime deployments.

[Architecture diagram: Client → Load Balancer → Agent Pods 1–3 → LLM API, Tool Registry, Redis, Traces. External Services: Search, DB, Email. Deployment Pipeline: Lint → Test → Eval]

Production Best Practices

Stateless Design

Agent services are ephemeral. Store conversation state in Redis or a database for horizontal scaling.
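Externalizing state can be as small as a session store wrapping a redis-py-style client (only `get` and `set` with an `ex` expiry are assumed here; the class and key format are illustrative):

```python
import json


class SessionStore:
    """Conversation state keyed by session_id, held in any redis-py-style
    client. Keeping state here rather than in the agent process is what
    lets pods stay ephemeral and scale horizontally."""

    def __init__(self, client, ttl_seconds=3600):
        self.client = client
        self.ttl = ttl_seconds

    def load(self, session_id):
        raw = self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else {"messages": []}

    def save(self, session_id, state):
        # ex= sets a TTL so abandoned sessions expire on their own
        self.client.set(f"session:{session_id}", json.dumps(state), ex=self.ttl)
```

Load at the start of each request, save after each agent step: any pod can then pick up any session, and a pod crash loses at most the step in flight.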

Tool Health Checks

Monitor tool dependencies. Circuit breakers for degraded tools. Graceful degradation.

Feature Flags

Gradually roll out new tools. Feature flag new behaviors. Easy rollback.

Canary Deployments

5% traffic to new version. Progressive 25% → 50% → 100%. Zero downtime.

Agent services should be stateless. Store conversation state in Redis or a database. This enables horizontal scaling and zero-downtime deployments.
15 / Production Deployment
Summary

Production Readiness Checklist

Before you ship, verify all four categories. The best agent architectures are boring: simple orchestration, reliable tools, comprehensive monitoring, and strict safety boundaries.

Tool Design

  • ✓ JSON Schema validation
  • ✓ Descriptions optimized for models
  • ✓ Versioned schemas
  • ✓ <20 tools per request
  • ✓ Input sanitization
  • ✓ Idempotent tools where possible

Reliability

  • ✓ Retry with exponential backoff
  • ✓ Circuit breakers
  • ✓ Timeout on every tool
  • ✓ Graceful degradation
  • ✓ Dead letter queue
  • ✓ Error reporting to LLM
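The first three reliability items compose into one wrapper around every tool dispatch. A sketch, assuming async tool functions; the helper name and the retryable exception set are illustrative:

```python
import asyncio
import random


async def call_with_retry(tool_fn, args, max_attempts=3, timeout_s=10.0,
                          base_delay=0.5):
    """Retry a tool call with exponential backoff and full jitter; every
    attempt is bounded by a timeout so one hung tool can't stall the loop."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(tool_fn(**args), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # surface the final failure to the agent loop
            # Full jitter: sleep a random fraction of the exponential backoff
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Only retry errors you believe are transient (timeouts, connection resets); retrying a validation error just burns budget, and retrying a non-idempotent tool can duplicate side effects.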

Safety

  • ✓ Tool allowlisting per user role
  • ✓ Code execution sandboxing
  • ✓ Network isolation
  • ✓ Argument sanitization
  • ✓ Cost budgets
  • ✓ Prompt injection defense

Operations

  • ✓ Distributed tracing on every call
  • ✓ Token/cost dashboards
  • ✓ Tool latency monitoring
  • ✓ Eval suite in CI/CD
  • ✓ Canary deployments
  • ✓ Model version pinning
Complexity is the enemy of production reliability. The best agent architectures are boring: simple orchestration, reliable tools, comprehensive monitoring, and strict safety boundaries. If you can't easily explain your agent flow to another engineer, it's too complicated.
16 / Production Readiness Checklist
Ecosystem

Agent Frameworks & Ecosystem

The landscape of production frameworks ranges from lightweight libraries to full platforms. Start simple, and choose based on your system's maturity level.

[Library ↔ Platform spectrum: DSPy · LlamaIndex · LangChain · LangGraph · AutoGen · CrewAI · MS Agent Framework · Ray Serve]

Framework Comparison Matrix

Framework | Strengths | Gaps / Risks | Security Posture | Best-Fit Role
LangChain + LangGraph | Wide tool ecosystem, strong community, good documentation | Can be verbose, cost tracking not native, performance variable | Community-maintained, audit tools available | General-purpose agents, prototyping
LlamaIndex | Excellent RAG, semantic caching, fast indexing | Less multi-agent support, narrower tool set | Document-level security, integrations | Document-driven agents, knowledge systems
AutoGen | Multi-agent conversations, flexible role definitions | Unpredictable cost, hard to debug, no built-in guardrails | Limited access controls, relies on models | Research, complex collaborative tasks
MS Agent Framework | Enterprise-grade, strong security, durable execution | Steeper learning curve, Azure-dependent | Built-in RBAC, audit trails, compliance ready | Enterprise production, regulated industries
Semantic Kernel | Plugin model, cross-language support, .NET first | Smaller ecosystem than LangChain, less documentation | Microsoft ecosystem integration | .NET applications, Windows-first orgs
Ray Serve | Distributed scaling, low-latency serving, cost-aware | Operational overhead, requires Kubernetes knowledge | Network isolation, resource limits | High-volume production, multi-tenant SaaS
CrewAI | Simple role-based design, good for structured workflows | Early stage, smaller community, limited integrations | Depends on underlying models, basic tooling | Workflow-focused teams, structured tasks
Haystack | Modular pipelines, clear abstractions, good docs | Smaller community, less multi-agent tooling | Pipeline-level access control | Search & QA systems, modular pipelines
DSPy | Minimal, Pythonic, great for optimization | Limited built-in tools, requires more custom code | Simple surface = easy to audit | Research, custom agents, fine-tuning workflows
Starting Strategy: Begin with a lightweight library (LangChain or DSPy). Migrate to a platform (MS Agent Framework or Ray Serve) only when you need durability, multi-tenancy, or compliance features. Premature platformification adds complexity without value.
17 / Agent Frameworks & Ecosystem
Planning

Phased Implementation Roadmap

Production agents aren't built overnight. This phased approach helps you balance velocity with reliability.

Implementation Phases
  • Phase 1 — Foundation: tool inventory, risk tiering, schema design, auth model
  • Phase 2 — Bounded MVP: read-only agent, tool gateway, basic telemetry, error handling
  • Phase 3 — Workflow Hardening: durable state (Temporal/LangGraph), saga compensation patterns, idempotency keys for APIs, workflow ledger logging
  • Phase 4 — Security & Eval: red-team suite, evals in CI/CD, PII redaction (Presidio), secrets management (Vault)
  • Phase 5 — Scale-out: tool discovery via MCP, multi-agent patterns, cost attribution, distributed tracing at scale

Migration Path: Demo Agent → Bounded Agent → Workflow-Grade
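The idempotency keys called for in Phase 3 are easy to derive deterministically: hash the run, step, tool, and canonicalized arguments, so a retried step reuses the same key and the downstream API can deduplicate the side effect. A sketch (the function name and key layout are illustrative):

```python
import hashlib
import json


def idempotency_key(run_id: str, step_num: int, tool_name: str, args: dict) -> str:
    """Deterministic key: retrying the same step of the same run produces
    the same key, so a payment or email API that honors idempotency keys
    (e.g. via an Idempotency-Key header) executes the effect only once."""
    blob = json.dumps(args, sort_keys=True)  # canonicalize argument order
    return hashlib.sha256(f"{run_id}:{step_num}:{tool_name}:{blob}".encode()).hexdigest()
```

Deriving the key from workflow identity, rather than generating a random UUID per attempt, is the point: a random key would defeat deduplication on retry.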

Success Metrics & Deliverables

P1: Tools Inventory

Tool catalog doc, risk matrix, schema specs, RBAC roles defined

P2: MVP Deployed

Read-only agent live, tool gateway running, basic observability online

P3: Durable Workflows

Temporal/LangGraph checkpoints, ledger logging, compensation tests

P4: Eval Suite Live

Red-team results, eval CI/CD checks, security audit report

P5: Multi-Agent Ready

MCP integration, distributed tracing proven, cost model validated

Iterative Delivery: Each phase produces working software. You can ship P1 + P2 in 6-8 weeks. P3-P5 happen as you grow. Don't wait for "platform perfection"—move fast on the bounded MVP, harden in production based on observed failures.
18 / Phased Implementation Roadmap
Governance

Audit & Compliance Data Model

Enterprise agents require complete audit trails. This ER model captures every decision, auth check, and compensation action for compliance, debugging, and forensics.

[ER diagram — entities: USER (user_id PK, email, org_id, permissions[]); SESSION (session_id PK, user_id FK, created_at, ip_address); AGENT_RUN (run_id PK, session_id FK, started_at, status); STEP (step_id PK, run_id FK, step_num, thought); TOOL_CALL (call_id PK, step_id FK, tool_name, args hashed); TOOL_RESULT (result_id PK, call_id FK, result hashed, latency_ms); AUTHZ_DECISION (decision_id PK, call_id FK, allowed: bool, reason); COMPENSATION (comp_id PK, call_id FK, action, executed); TRACE_SPAN (span_id PK, call_id FK, trace_id, duration_ms). Relationships: USER creates SESSION; SESSION triggers AGENT_RUN; AGENT_RUN contains STEP; STEP makes TOOL_CALL; TOOL_CALL returns TOOL_RESULT, is gated by AUTHZ_DECISION, may trigger COMPENSATION, and emits TRACE_SPAN.]

AuditLogger Implementation

class AuditLogger:
    def __init__(self, db):
        self.db = db

    async def log_tool_call(self, call: ToolCall):
        # Insert TOOL_CALL record
        call_id = uuid4()
        await self.db.execute("""
            INSERT INTO TOOL_CALL (call_id, step_id, tool_name, args_hash, created_at)
            VALUES (?, ?, ?, ?, now())
        """, call_id, call.step_id, call.name, sha256(call.args))
        # Log auth decision
        await self.db.execute("""
            INSERT INTO AUTHZ_DECISION (decision_id, call_id, allowed, reason)
            VALUES (?, ?, ?, ?)
        """, uuid4(), call_id, call.allowed, call.authz_reason)
        return call_id

    async def log_tool_result(self, call_id, result, latency_ms):
        # Insert TOOL_RESULT record (redacted)
        await self.db.execute("""
            INSERT INTO TOOL_RESULT (result_id, call_id, result_hash, latency_ms)
            VALUES (?, ?, ?, ?)
        """, uuid4(), call_id, sha256(result), latency_ms)

    async def log_compensation(self, call_id, action):
        await self.db.execute("""
            INSERT INTO COMPENSATION_ACTION (comp_id, call_id, action, executed)
            VALUES (?, ?, ?, true)
        """, uuid4(), call_id, action)

    async def audit_trail(self, run_id):
        # Full audit trail for a run: all steps, calls, auth, results
        return await self.db.query("""
            SELECT s.step_num, t.tool_name, a.allowed, a.reason,
                   r.latency_ms, c.action
            FROM STEP s
            JOIN TOOL_CALL t ON s.step_id = t.step_id
            LEFT JOIN AUTHZ_DECISION a ON t.call_id = a.call_id
            LEFT JOIN TOOL_RESULT r ON t.call_id = r.call_id
            LEFT JOIN COMPENSATION_ACTION c ON t.call_id = c.call_id
            WHERE s.run_id = ?
            ORDER BY s.step_num
        """, run_id)
Compliance Ready: This model provides: (1) Full trace for forensics, (2) Auth decision audit trail for SOC2, (3) Compensation logs for workflow integrity, (4) Redacted results to prevent PII leakage in logs. Hash sensitive fields, never store plaintext args/results in audit tables.
19 / Audit & Compliance Data Model

AI Agents — Advanced Tool Calling

Production Patterns • 20 Sections • Architecture Diagrams • Code Examples