AI Agents — Advanced Tool Calling

Production patterns for building reliable, observable, and safe tool-using AI agents — from schema design through multi-agent orchestration to deployment.

Tool Use Function Calling ReAct Multi-Agent Orchestration Guardrails MLOps

23

Sections

∞

Code Examples

8

Architecture Diagrams

100%

Production Ready

slide-1

Why Tool Calling Changes Everything

Tool calling is when LLMs generate structured function calls instead of free text. The model outputs a tool name + arguments, your code executes it, returns results, and the model continues. This unlocks deterministic, repeatable, and composable agent workflows.

Structured I/O

From text-in/text-out to deterministic function calls. The model outputs JSON matching your schema, not ambiguous natural language.

Multi-Step Reasoning

From single-turn to iterative loops. The agent thinks, acts, observes, and repeats until the goal is reached.

Integrated Systems

From isolated model to end-to-end system. Your code controls execution, validation, and error recovery.

Why Agents Need Tools

Real-Time Data

Access APIs, databases, and live information the model has no knowledge of.

Take Actions

Send emails, create records, modify systems, and execute business logic.

Precise Computation

Offload math, date calculations, and deterministic logic to code.

Private Systems

Connect to internal services, databases, and proprietary systems safely.

Tool Calling is Deterministic Structured Output
Not 'asking the AI to write code'. The model outputs JSON matching your schema, your code validates and executes. You're always in control.

Five Production Design Commitments

1. Capability Boundary

Curated tools with least-privilege, rate limits, and auditing. Never expose untrusted capabilities.

2. Reliable Substrate

Idempotency, retries, durable state checkpoints, and saga compensation patterns.

3. Grounded Loop

Schema validation + explicit grounding rules. Tool inputs must be validated before execution.

4. Observability

Tool-call spans, auth decisions, retries, and state checkpoints. Monitor actions, not just tokens.

5. Continuous Eval

Adversarial testing, InjecAgent benchmarks, and red-team suites in CI/CD.

slide-overview

Tool Calling Fundamentals

The Tool Calling Lifecycle

Define schemas → Present to model → Select tool → Execute → Return result → Continue

Complete Tool Calling Flow with Anthropic Claude API

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"}
        },
        "required": ["city"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Handle tool use response
for block in response.content:
    if block.type == "tool_use":
        result = execute_tool(block.name, block.input)
        # Send result back to continue conversation

Tool Definition

Name, description, and JSON schema that defines the interface the model must follow.

Tool Selection

The model picks which tool to call based on the user query and available tools.

Tool Execution

Your code runs the selected tool with the model-provided arguments.

Result Injection

Feed results back to the model to continue reasoning or take next steps.

The Model NEVER Executes Tools
It outputs structured JSON saying "I want to call X with args Y". Your code is always in control. This is the core principle that makes tool calling safe and deterministic.

slide-fundamentals

Tool Schema Design & Best Practices

Well-designed schemas dramatically improve model accuracy and reduce errors. The schema is the interface between your LLM and your system.

Well-Designed vs Poorly-Designed Schemas

✓ Good Schema

"input_schema": {
  "type": "object",
  "properties": {
    "date_range": {
      "type": "object",
      "properties": {
        "start": {
          "type": "string",
          "format": "date",
          "desc": "YYYY-MM-DD"
        }
      }
    },
    "limit": {
      "type": "integer",
      "minimum": 1,
      "maximum": 100
    }
  },
  "required": ["date_range"]
}

✗ Bad Schema

"input_schema": {
  "type": "object",
  "properties": {
    "query": {
      "type": "string"
    },
    "options": {
      "type": "string"
    }
  }
}

# No descriptions, no constraints,
# vague parameter names, string for
# everything, no validation

Schema Validation Pipeline

Raw LLM

→

JSON Parse

→

Schema Valid

→

Type Coerce

→

Business Valid

→

Execute

Best Practices for Tool Schemas

1. Write Descriptions for the MODEL

Be explicit about format, constraints, and examples. The model reads the description to understand what you want.

2. Use Enums to Constrain Choices

Don't let the model invent values. Define allowed options explicitly in the schema.

3. Keep Tool Count Under 20

More tools = worse selection accuracy. Group related tools or use sub-actions.

4. Version Your Schemas

Breaking changes need migration. Track schema versions and deprecate gradually.

5. Include Examples in Descriptions

Show the model what good input looks like. "Example: 2024-03-15" is better than "ISO date format".

6. Validate Server-Side

NEVER trust model output. Validate all inputs, check ranges, verify enums, and handle edge cases.

Parameter Type Patterns

Type	Pattern	Use Case
String with Regex	`"pattern": "^[A-Z]{2}\\d{3}$"`	Codes, SKUs, phone numbers
Number with Bounds	`"type": "number", "minimum": 0, "maximum": 100`	Prices, percentages, counts
Enum	`"enum": ["pending", "active", "archived"]`	Status, categories, choices
Array with Items	`"type": "array", "items": {"type": "string"}, "maxItems": 10`	Tags, email lists, IDs
Nested Object	`"type": "object", "properties": {...}, "required": [...]`	Complex data, structured inputs

Tool Interface Patterns

Interface Pattern	Contract Format	Security Implications	Best-Fit Use Cases
REST+OpenAPI	OpenAPI 3.0 spec, HTTP verbs	Network isolation, TLS required, easy MITM if not HTTPS	External APIs, standard microservices
gRPC+Protobuf	Protocol Buffers, binary format	Strongly typed, harder to modify, TLS required	High-throughput internal services, low-latency
Provider Function Calling	Native code, type signatures	Direct code execution, trust model is critical	Same-process agents, embedded systems
MCP (Model Context Protocol)	JSON-RPC, tools as resources	Standardized, audit trails, capability discovery	Multi-agent systems, plugin architectures
Tool-Proxy/Firewall	Transparent proxy, schema validation	Input sanitization, rate-limiting, logging layer	Enterprise, compliance-heavy, zero-trust

Security First Lesson: Treat tool calling like untrusted code generation. Regardless of interface pattern, validate all inputs, check permissions, and log every execution.

slide-schemas

Agent Orchestration Patterns

Different patterns for different needs. Most production systems need Sequential Chain. Start simple.

Orchestration Pattern Comparison

Pattern	When to Use	Complexity	Latency	Reliability
Single Tool	Simple tasks, one action per query	Low	Fast	High
Sequential Chain	Multi-step workflows with known order	Medium	Moderate	High
Router	Classification or dispatch to different tools	Medium	Fast	High
Autonomous Agent	Complex reasoning, unknown steps, exploration	High	Slow	Medium

AgentOrchestrator Implementation (Autonomous Loop)

class AgentOrchestrator:
    def __init__(self, max_iterations=10):
        self.max_iterations = max_iterations
        self.tool_registry = {}
        self.conversation = []

    def register_tool(self, name, tool_func, schema):
        self.tool_registry[name] = {"func": tool_func, "schema": schema}

    def run(self, query):
        self.conversation = [{"role": "user", "content": query}]

        for iteration in range(self.max_iterations):
            # Get LLM response with tool definitions
            response = call_llm(self.conversation, self.tool_registry)

            if response.stop_reason == "end_turn":
                return response.text

            for block in response.content:
                if block.type == "tool_use":
                    result = self.tool_registry[block.name]["func"](block.input)
                    self.conversation.append({"role": "user", "content": result})

        return "Max iterations reached"

Start with Sequential Chain
Most production use cases need Sequential Chain, not full autonomous agents. Autonomous loops are harder to debug, slower, and more expensive. Use them only when the task requires true exploration.

slide-orchestration

Skill Map

Agent Skill Map — Capabilities & Tool Taxonomy

A production agent system is only as powerful as its tools. This skill map charts the full landscape of agent capabilities, the tool categories that enable them, and how they compose into real-world workflows.

Information Retrieval

The foundation of any useful agent — fetching context from the world.

Tool	Use Case	Complexity
Web Search	Real-time facts, news, current events	Low
RAG / Vector Search	Private knowledge base Q&A	Medium
SQL Query	Structured data: metrics, reports, analytics	Medium
API Fetch	Live system data (weather, stock, status)	Low
Doc Reader	PDF/DOCX parsing, contract analysis	Medium

Code & Computation

Extends the agent beyond language into precise computation and system control.

Tool	Use Case	Complexity
Code Exec	Data analysis, plotting, scripting	High
Shell/CLI	File ops, git, system admin	High
Calculator	Precise math, financial calcs	Low
Data Transform	CSV cleaning, JSON reshape, ETL	Medium
Notebook	Interactive data exploration	High

Actions & Integrations

Where agents become truly useful — taking actions in real systems on behalf of users.

Tool	Use Case	Risk Level
Email	Draft, send, search, summarize	Medium
Calendar	Schedule, reschedule, check availability	Medium
Messaging	Post updates, respond, search channels	Medium
Ticketing	Create, update, assign, close	Low
CRM	Update contacts, log activities	High
Git	Commit, create PRs, review code	High

Skill Composition — Mapping Use Cases to Tool Sets

Real agent tasks rarely need a single tool. Here's how skills compose for common production use cases:

Use Case	Required Skills	Tool Chain	Orchestration
Research Assistant	Retrieval Compute	Web Search → Document Reader → Summarize → Write Report	Sequential Chain
Data Analyst	Retrieval Compute Generate	SQL Query → Code Exec (pandas) → Chart Builder → Slide Deck	Sequential + Parallel
Customer Support	Retrieval Actions Memory	RAG Search → CRM Lookup → Ticket Create → Email Draft	Router + Sequential
DevOps Copilot	Compute Retrieval Actions	Log Search → Shell Exec → Runbook Lookup → Slack Alert	ReAct (Autonomous)
Meeting Prep Agent	Retrieval Actions Generate	Calendar Check → CRM Lookup → Web Search → Doc Writer	Sequential Chain
Code Review Agent	Compute Retrieval Actions	Git Diff → Code Exec (tests) → Style Check → PR Comment	Parallel + Sequential

Tool Design Principles

# The UNIX philosophy for agent tools:
# 1. Do one thing well
# 2. Composable inputs/outputs
# 3. Fail loudly with clear errors
# 4. Idempotent where possible
# 5. Observable (logs, metrics, traces)

class ToolDesignChecklist:
    single_responsibility: bool  # One action per tool
    clear_description: bool      # LLM can understand when to use
    typed_schema: bool          # JSON Schema with constraints
    error_messages: bool        # Actionable, not cryptic
    idempotent: bool            # Safe to retry
    timeout: bool               # Never hangs
    rate_limited: bool          # Protects downstream
    permission_scoped: bool     # Least privilege
    observable: bool            # Emits traces/metrics

Tool Risk Tiers & Approval Gates

Not all tools carry equal risk. Tier them and enforce appropriate approval flows:

T1

Read-Only (Auto-approve)

Search, lookup, calculate, read. No side effects. Safe for autonomous execution.

T2

Reversible Write (Soft-approve)

Draft email, create ticket, update status. Can be undone. Log + notify, execute with confirmation for sensitive data.

T3

Irreversible Action (Human-approve)

Send email, publish post, execute payment, delete record. Require explicit human confirmation before execution.

T4

Privileged / Admin (Never auto-approve)

Access controls, billing changes, system config, code deploy. Always require authenticated human approval with audit trail.

Principle: Start with read-only tools (Tier 1) and expand to write actions only after your orchestration, error handling, and observability are proven in production. Most agent value comes from retrieval + synthesis — not from taking actions.

slide-skillmap

Reasoning

ReAct & Reasoning Loops

ReAct (Reason + Act) is the pattern for agentic behavior: Think → Act → Observe → Repeat. The agent reasons about what to do, takes action, observes results, and continues.

ReActAgent Implementation

class ReActAgent:
    def __init__(self, llm, tools, max_steps=10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
        self.scratchpad = []

    async def run(self, query):
        for step in range(self.max_steps):
            # Think: reason about what to do next
            thought = await self.llm.think(query, self.scratchpad)
            self.scratchpad.append(("thought", thought))

            if thought.action == "finish":
                return thought.answer

            # Act: execute the chosen tool
            tool = self.tools[thought.tool_name]
            result = await tool.execute(thought.tool_args)

            # Observe: record the result
            self.scratchpad.append(("observation", result))

        return "Max steps reached — could not complete"

ReAct vs Other Approaches

ReAct (Reason + Act)

Explicit think-act-observe loop. Best for complex tasks requiring exploration and recovery.

Chain-of-Thought

Reasoning without action. Better for analysis but can't execute tasks or access new data.

Plan-and-Execute

Create a plan first, then execute. Good for structured workflows, poor for adaptive tasks.

LATS (Language Agent Tree Search)

Tree search over actions. Expensive but explores multiple paths for complex problems.

Planning Algorithms Comparison

Paradigm	Core Idea	Strengths	Weaknesses
ReAct	Reason → Act → Observe → Repeat	Flexible, adaptive, handles new situations, easy to debug	Can be verbose, slower on simple tasks
Plan-and-Solve	Create plan first, then execute steps	Structured, good for complex tasks, reduces errors	Brittle when plan becomes stale, poor for exploration
Tree of Thoughts	Explore multiple reasoning branches	Finds better solutions, good for reasoning, backtracking	High token cost, slow, overkill for simple tasks
Reflexion	Agent self-reflects on failures	Learns from mistakes, improves over iterations	Requires many iterations, slow convergence
Toolformer	Model decides when to call tools	Minimal tokens, fast, uses tools sparingly	Requires fine-tuning, less transparent reasoning
Graph of Thoughts	Reasoning as directed acyclic graph	Expresses complex dependencies, good for workflows	Complex to implement, expensive to evaluate

Key Principle: Use the smallest-horizon agent that meets your use case. Start with ReAct for simplicity. Upgrade to Tree of Thoughts only if your task requires exploration over multiple branches.

Key Insight: ReAct gives the model a structured way to interleave reasoning and action. Without it, agents tend to either over-plan (missing opportunities) or act impulsively (making mistakes).

slide-react

Performance

Parallel & Multi-Tool Execution

Modern LLMs can request multiple tools at once. Instead of waiting for each result, execute them in parallel for 3-5x latency improvement.

ParallelToolExecutor with asyncio.gather

class ParallelToolExecutor:
    async def execute_batch(self, tool_calls, timeout=30):
        tasks = [
            asyncio.wait_for(
                self._execute_one(call),
                timeout=timeout
            )
            for call in tool_calls
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Handle partial failures gracefully
        return [
            ToolResult(call.id, result) if not isinstance(result, Exception)
            else ToolResult(call.id, error=str(result))
            for call, result in zip(tool_calls, results)
        ]

Parallel Execution Benefits & Challenges

Benefits

Latency: 3-5x faster for multi-step tasks. Throughput: Execute 10 tools simultaneously instead of sequentially.

Challenges

Ordering: Dependency handling (Tool B needs Tool A result). Cost: Fan-out multiplies API calls.

Error Handling

Partial failures (1 of 3 fails). Timeouts on slow tools. Always use return_exceptions=True.

Execution Strategies

Strategy	Use Case	Latency	Complexity
Serial	Tool B depends on Tool A result	Slowest	Simple
Parallel	Independent tools (most common)	3-5x faster	Moderate
DAG-Based	Complex dependency graphs	Optimal	High
Speculative	Execute multiple paths, pick best	Variable	Very High

slide-parallel

Architecture

Multi-Agent Delegation

Delegate complex tasks to specialized sub-agents. Supervisor routes work, workers execute independently, results compose back up.

SupervisorAgent Router Implementation

class SupervisorAgent:
    def __init__(self):
        self.workers = {
            "research": ResearchAgent(tools=[web_search, doc_reader]),
            "code": CodeAgent(tools=[code_exec, file_write]),
            "data": DataAgent(tools=[sql_query, chart_gen]),
        }

    async def handle(self, task):
        plan = await self.planner.decompose(task)
        results = {}
        for step in plan.steps:
            worker = self.workers[step.agent]
            results[step.id] = await worker.execute(step.instruction, context=results)
        return await self.synthesizer.combine(results)

Multi-Agent Orchestration Patterns

Supervisor / Worker

One coordinator routes tasks to specialized workers. Best for diverse skill domains. Easy to scale.

Peer-to-Peer

Agents communicate directly. More flexible but harder to debug. Good for negotiation.

Hierarchical

Tree of agents. Scales better for large teams. Messaging overhead increases.

Swarm

Decentralized with local rules. Complex behaviors from simple agents. Hard to predict.

Warning: Multi-agent systems multiply complexity, cost, and failure modes. Use a single agent with multiple tools until proven insufficient. Each agent adds latency, context window overhead, and debugging difficulty.

slide-multiagent

Reliability

Error Handling & Recovery

Real agents fail. Your job is to define what's retryable, what's not, and how the model can recover.

ResilientToolExecutor with Decorators

class ResilientToolExecutor:
    @retry(max_attempts=3, backoff=exponential(base=2))
    @circuit_breaker(failure_threshold=5, recovery_timeout=60)
    @timeout(seconds=30)
    async def execute(self, tool_name, args):
        try:
            result = await self.tools[tool_name].run(args)
            self.metrics.record_success(tool_name)
            return result
        except ValidationError as e:
            return ToolError(f"Invalid args: {e}", retryable=False)
        except RateLimitError:
            raise  # Let retry decorator handle
        except TimeoutError:
            self.metrics.record_timeout(tool_name)
            return ToolError("Tool timed out", retryable=True)

Error Taxonomy

Tool Execution Failure

API unreachable, permission denied, network timeout. Often retryable with backoff.

Schema Validation

LLM outputs invalid args. Not retryable at executor level—return to LLM to correct.

Timeout / Rate Limit

Tool slow or API throttled. Retryable with exponential backoff and jitter.

Critical: Always return errors TO the LLM as tool results rather than crashing. The model can often recover by trying a different approach or refining its request.

Durable Execution & Saga Patterns

Production systems require durability. If an agent crashes mid-execution, you must be able to resume. Use idempotency keys, durable state checkpoints, and compensation patterns.

Temporal Idempotent Activities

Use Temporal workflows to define durable, idempotent tool executions. If a tool call succeeds but the workflow crashes, Temporal retries from the checkpoint—not from scratch.

Stripe-Style Idempotency Keys

For external APIs, include idempotency keys in request headers. If the same request is sent twice, the server returns cached result—no duplicate charge.

Saga Compensation Pattern

For multi-step workflows, define compensating actions (rollbacks). If step 3 fails, execute reverse of step 2, then step 1. AWS Prescriptive Guidance reference.

Workflow Ledger

Persist every tool call and result to a ledger. Enables auditing, replay, and recovery. Each entry is immutable and timestamped.

SagaExecutor with Compensation

class SagaExecutor:
    def __init__(self, ledger):
        self.ledger = ledger
        self.compensations = []

    async def execute_saga(self, steps):
        for step in steps:
            try:
                # Add idempotency key to request
                idempotency_key = uuid4()
                result = await self.execute_with_key(
                    step.tool, step.args, idempotency_key
                )

                # Log to ledger
                await self.ledger.append({
                    "type": "TOOL_CALL",
                    "step_id": step.id,
                    "tool": step.tool,
                    "result": result,
                    "timestamp": now()
                })

                # Register compensation for rollback
                if step.compensate:
                    self.compensations.insert(0, (
                        step.compensate, result
                    ))

            except Exception as e:
                # Execute all compensations in reverse order
                for comp_fn, context in self.compensations:
                    await comp_fn(context)
                    await self.ledger.append({
                        "type": "COMPENSATION",
                        "fn": comp_fn.__name__
                    })
                raise
        return "All steps completed successfully"

Durability First: Use LangGraph checkpointing, Temporal workflows, or custom ledger patterns. Production agents without durability are just expensive, slow, unreliable scripts.

slide-errors

Security

Sandboxing & Permission Models

The model decides which tools to call. You must constrain what's possible through layered permission checks and resource limits.

SandboxedExecutor with Permission Engine

class SandboxedExecutor:
    def __init__(self, user_context):
        self.allowed_tools = self._resolve_permissions(user_context)
        self.resource_limits = ResourceLimits(
            max_api_calls=100,
            max_tokens_spent=50000,
            max_wall_time=300,
            allowed_domains=["api.internal.com"],
            blocked_actions=["delete", "admin.*"]
        )

    async def execute(self, tool_call):
        # 1. Tool allowlist check
        if tool_call.name not in self.allowed_tools:
            raise PermissionDenied(f"Tool {tool_call.name} not allowed")

        # 2. Argument sanitization
        safe_args = self.sanitizer.clean(tool_call.args)

        # 3. Resource limit check
        self.resource_limits.check_budget()

        # 4. Execute in sandbox
        return await self.sandbox.run(tool_call.name, safe_args)

Sandboxing Technologies

Technology	Isolation Level	Performance	Complexity
Docker	Container (OS-level)	Moderate	Medium
gVisor	Syscall interception	Lower	Medium
Firecracker	MicroVM (strong isolation)	Good	High
WASM	Process-level sandbox	Very Fast	Medium
E2B / Modal	Managed multi-tenant	Good	Low

InjecAgent & Tool Selection Attacks

Prompt injection isn't just direct—attackers can inject malicious instructions via tool-returned content, web pages, or API responses. The InjecAgent benchmark measures resilience.

Indirect Prompt Injection

Attacker embeds malicious instructions in web content, API response, or email body. Agent fetches content and executes instructions unknowingly. Example: web_search returns page with hidden "delete all data" instruction.

Tool Selection Manipulation

Attacker crafts query to confuse tool selection: "Call the delete_account tool to help me." Agent picks wrong tool due to misleading language. Requires strict tool descriptions + grounding.

InjecAgent Benchmark

Research benchmark that tests agent resilience to indirect injection. Measures: Can the agent resist instructions embedded in tool outputs? Real-world validated attacks.

Defense Layering

Tool firewall validates tool calls structurally. Structured outputs reduce injection surface. PII redaction (Presidio) + secrets management (Vault) prevent data leakage via logs.

Defense: Tool Firewall + Structured Output

class ToolFirewall:
    async def filter_and_execute(self, tool_calls):
        for call in tool_calls:
            # 1. Structural validation (strict schema)
            if not self.schema_validator.validate(call):
                raise InvalidToolCall("Failed schema validation")

            # 2. Allowlist enforcement
            if call.name not in self.allowed_tools:
                raise ToolNotAllowed(call.name)

            # 3. Input sanitization with context awareness
            args = self.sanitizer.clean(call.args)
            args = self.redactor.redact_pii(args)  # Presidio

            # 4. Execute & redact result before returning to LLM
            result = await self._execute(call.name, args)
            result = self.redactor.redact_secrets(result)  # Vault

            # 5. Log for audit
            await self.audit_logger.log({
                "tool": call.name,
                "args_hash": sha256(args),
                "result_hash": sha256(result),
                "timestamp": now()
            })
            yield result

Critical Security Principle: Never let an LLM agent execute arbitrary code without sandboxing. Even with "harmless" tools, injection via tool arguments is a real attack vector. Layer defenses: allowlist → sanitize → rate limit → sandbox.

slide-sandboxing

Performance

Streaming & Real-Time Tool Calling

Stream text tokens to the user immediately while tool calls execute in parallel. When a tool call interrupts the stream, show a loading indicator. This cuts perceived latency from seconds to milliseconds.

StreamingToolHandler Implementation

class StreamingToolHandler:
    async def stream_with_tools(self, messages, tools):
        async with self.client.messages.stream(
            model="claude-sonnet-4-20250514",
            messages=messages, tools=tools, max_tokens=4096
        ) as stream:
            collected_content = []
            async for event in stream:
                if event.type == "content_block_start":
                    if event.content_block.type == "tool_use":
                        # Tool call detected mid-stream
                        tool_input = await self._collect_tool_input(stream)
                        result = await self.executor.execute(
                            event.content_block.name, tool_input
                        )
                        yield ToolResultEvent(result)
                    else:
                        yield TextStartEvent()
                elif event.type == "content_block_delta":
                    if hasattr(event.delta, 'text'):
                        yield TextDeltaEvent(event.delta.text)

Real-Time Communication Patterns

Server-Sent Events

Browser native streaming. Perfect for web clients and long-lived connections.

WebSocket

Bidirectional real-time communication. Best for interactive, multi-turn streams.

Token Rendering

Display each token as it arrives. Users see response forming in real-time.

Latency Hiding

Mask tool execution behind visible text. Perceived latency drops dramatically.

Stream text to the user immediately. When a tool call interrupts, show a loading indicator. This cuts perceived latency from seconds to milliseconds.

11 / Streaming & Real-Time

Memory

Agent Memory & Context Management

Multi-turn agents fill context windows. After 5-10 tool calls, you've used 50K+ tokens. What do you keep? What do you drop? Strategic memory management prevents catastrophic context exhaustion.

Four Memory Strategies

Sliding Window

Keep last N turns, drop oldest. Simple but loses early context about original request.

Summarization

LLM summarizes old turns into compact summary. Best quality/cost tradeoff for most agents.

Selective Pruning

Keep results for active topics, drop resolved ones. Smart but complex state tracking needed.

Hierarchical

Short-term (last 5 turns) + Long-term (vector store). Retrieve older context on demand.

AgentMemoryManager (Summarization Strategy)

class AgentMemoryManager:
    def __init__(self, max_tokens=8000, summary_threshold=6000):
        self.max_tokens = max_tokens
        self.threshold = summary_threshold
        self.messages = []
        self.summary = ""

    async def add(self, role, content, tool_result=None):
        self.messages.append({"role": role, "content": content})
        if self.count_tokens() > self.threshold:
            await self.compress()

    async def compress(self):
        old = self.messages[:-3]  # keep last 3 turns
        summary = await self.llm.summarize(old, self.summary)
        self.summary = summary
        self.messages = [
            {"role": "system", "content": f"Previous context: {self.summary}"}
        ] + self.messages[-3:]

    def get_context(self):
        return [{"role": "system", "content": self.summary}] + self.messages

Memory Strategy Comparison

Strategy	Token Efficiency	Quality	Complexity	Latency
Sliding Window	Medium	Low	Low	Instant
Summarization	High	High	Medium	Extra LLM call
Selective Pruning	High	Medium	High	Instant
Hierarchical	Very High	High	Very High	Vector lookup

Tool Result Truncation

Large API responses must be truncated before injecting back into context. A single 10K-token API result can exhaust your entire remaining budget.

def truncate_tool_result(result, max_chars=2000):
    if len(result) <= max_chars:
        return result

    # Extract key sections: first 30% + last 30%
    first_part = result[:max_chars // 3]
    last_part = result[-max_chars // 3:]
    return first_part + "\n[... TRUNCATED ...]\n" + last_part

Most agent failures at turn 8+ are memory failures, not reasoning failures. The model loses track of what it already tried, leading to loops and hallucinated tool calls. Implement aggressive memory management early.

12 / Agent Memory & Context

Async

Async & Long-Running Tool Patterns

Some tools take minutes or hours: report generation, human approval, data pipelines, deployments. Don't block the agent. Use callbacks, polling, or durable execution to handle async work.

Three Async Patterns

Polling

Agent periodically checks tool status. Simple but wasteful, adds latency, burns tokens on every check.

Webhook Callback

Tool calls you back when done. Efficient but requires infrastructure: queue, webhook endpoint, state storage.

Durable Execution

Use Temporal/LangGraph to checkpoint state. Resume from exact point when result arrives. Best for production.

AsyncToolExecutor (Webhook Pattern)

class AsyncToolExecutor:
    async def execute_async(self, tool_name, params):
        job_id = str(uuid.uuid4())
        webhook_url = self.config.webhook_base + f"/complete/{job_id}"

        # Send to tool with callback
        await self.tool_service.enqueue(
            tool_name, params, webhook_url
        )

        # Store pending job
        await self.store.set(
            f"job:{job_id}",
            {"status": "pending", "tool": tool_name}
        )

        # Return job ID to agent (don't block)
        return {"job_id": job_id, "status": "pending"}

    async def on_webhook_complete(self, job_id, result):
        # Inject result and resume agent
        await self.agent_queue.put(
            {"job_id": job_id, "result": result}
        )

Pattern Comparison

Pattern	Latency	Cost	Infrastructure
Polling	High (interval-based)	High (repeated checks)	Minimal
Webhook	Low (event-driven)	Low (one call)	Queue + endpoint
Durable Execution	Very Low	Low	Temporal/LangGraph

For production agents with long-running tools, use durable execution (Temporal or LangGraph). It handles retries, timeouts, and state recovery automatically. Polling burns tokens; webhooks require infrastructure. Durable frameworks solve both.

13 / Async & Long-Running Tools

Multi-Modal

Multi-Modal Tool Calling

Modern agents work with vision, audio, and structured files. Vision tools analyze screenshots and charts. Audio tools transcribe and synthesize. File tools parse PDFs and spreadsheets.

Vision Tools Example

async def analyze_screenshot(image_base64: str) -> str:
    """Extract text, UI elements, and structure from screenshot"""
    response = await client.messages.create(
        model="claude-opus-4-1-20250805",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_base64}},
                {"type": "text", "text": "Extract text, buttons, forms. Return JSON."}
            ]
        }]
    )
    return response.content[0].text

Multi-Modal Capabilities by Provider

Provider	Vision	Audio	Video
Anthropic Claude	Yes (vision)	No (use external)	No
OpenAI GPT-4o	Yes	Yes (in API)	Yes
Google Gemini	Yes	Yes	Yes
Open-source (LLaVA)	Yes (limited)	No	No

When to Use Each Modality

Vision vs Text Extraction

Use Vision: UI screenshots, charts, diagrams, handwriting. Use Text: PDFs with structured text, OCR-ed documents.

Audio Processing

Transcription (speech-to-text) and synthesis (TTS) require external APIs. Integrate with agent tools for voice interfaces.

Vision models are expensive (~2x text tokens). Use for high-value tasks: chart analysis, UI automation. For routine document parsing, text extraction is cheaper and faster.

14 / Multi-Modal Tool Calling

Operations

Observability & Tracing

Trace every LLM call, every tool selection, every tool execution, every result, every error. Collect token counts, latencies, and costs. Without observability, you can't debug or optimize.

OpenTelemetry Integration

from opentelemetry import trace
tracer = trace.get_tracer("agent")

class TracedAgent:
    @tracer.start_as_current_span("agent.run")
    async def run(self, query):
        span = trace.get_current_span()
        span.set_attribute("query", query[:200])

        for step in range(self.max_steps):
            with tracer.start_span("llm.call") as llm_span:
                response = await self.llm.generate(...)
                llm_span.set_attribute("model", self.model)
                llm_span.set_attribute("tokens.input", response.usage.input)
                llm_span.set_attribute("tokens.output", response.usage.output)

            if response.has_tool_use:
                with tracer.start_span(f"tool.{tool_name}") as tool_span:
                    result = await self.execute_tool(...)
                    tool_span.set_attribute("tool.success", not result.is_error)

Key Metrics Dashboard

LLM Calls

Calls per query, latency distribution (P50/P95/P99), token usage trends.

Tool Selection

Accuracy rate, which tools selected most, tool selection errors.

Costs & Budget

Token usage & cost per query, cost per tool, budget tracking.

Tool Latency

Execution time per tool, P50/P95/P99 distribution, bottleneck tools.

Error Tracking

Error rate by tool, error types, error recovery success.

Tools: LangSmith, Arize Phoenix, OpenTelemetry

Datadog, Braintrust, custom observability backends.

15 / Observability & Tracing

Quality

Testing & Evaluation

Build a testing pyramid: Unit tests for tool functions → Integration tests for tool calling flow → Agent eval suites for end-to-end task completion → Red team for adversarial testing.

AgentEvaluator Implementation

class AgentEvaluator:
    def __init__(self, agent, eval_set):
        self.agent = agent
        self.eval_set = eval_set  # [(query, expected_tools, expected_answer)]

    async def run_eval(self):
        results = []
        for query, expected_tools, expected_answer in self.eval_set:
            trace = await self.agent.run_traced(query)
            results.append(EvalResult(
                tool_selection_accuracy=self._check_tools(trace, expected_tools),
                answer_correctness=self._check_answer(trace.answer, expected_answer),
                steps_taken=len(trace.steps),
                total_tokens=trace.total_tokens,
                latency_ms=trace.duration_ms,
            ))
        return EvalReport(results)

Evaluation Metrics & Patterns

Deterministic Tests

Tool functions must be deterministic. Unit test each tool independently.

LLM-as-Judge

Use another LLM to grade answer quality. Good for subjective tasks.

Regression Testing

Run full eval suite on model upgrades. Lock in baseline before changes.

Adversarial Testing

Prompt injection, jailbreak attempts, malformed inputs, edge cases.

Mock Tool Testing Pattern

Mock tools return deterministic responses for testing agent logic without calling real APIs. Critical for CI/CD pipelines.

class MockToolRegistry:
    def __init__(self, fixtures: Dict):
        self.fixtures = fixtures

    async def execute(self, tool_name, params):
        # Return fixture if available, else real call
        if tool_name in self.fixtures:
            return self.fixtures[tool_name]
        return await real_tools[tool_name].execute(params)

# Usage in tests
mock_tools = MockToolRegistry({
    "get_weather": {"temp": 72, "conditions": "sunny"},
    "fetch_data": [1, 2, 3]
})

Snapshot Testing for Tool Call Sequences

Record expected tool call sequences. Compare against baseline on each run to catch logic regressions.

def test_agent_snapshot():
    trace = await agent.run_traced("Find flights to NYC")
    tool_calls = [step.tool_name for step in trace.steps]

    # Snapshot: ["search_flights", "get_prices", "format_response"]
    assert_snapshot(tool_calls, "test_agent_snapshot.json")

GitHub Actions CI Example

name: Test Agents
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run agent eval suite
        run: pytest tests/agent_evals.py -v
        env:
          TOOL_MOCK_MODE: true
          SNAPSHOT_UPDATE: false

16 / Testing & Evaluation

Economics

Cost Control & Optimization

LLM tokens drive costs: input context grows with each step. Tool execution, retries, and multi-agent fan-out add up. Set budgets, track spend, optimize aggressively.

CostAwareAgent with Budgets

class CostAwareAgent:
    def __init__(self, budget_tokens=50000, budget_usd=0.10):
        self.token_budget = budget_tokens
        self.usd_budget = budget_usd
        self.tokens_used = 0
        self.cost_usd = 0.0

    async def run(self, query):
        for step in range(self.max_steps):
            if self.tokens_used > self.token_budget * 0.9:
                return self._force_answer("Approaching token budget")

            response = await self.llm.generate(...)
            self.tokens_used += response.usage.total_tokens
            self.cost_usd += self._calculate_cost(response.usage)

            if self.cost_usd > self.usd_budget:
                return self._force_answer("Cost budget exceeded")

Optimization Strategies

Prompt Caching

Reuse system prompts and tool definitions across queries. Dramatic savings on repeated context.

Context Pruning

Summarize old steps and tool results. Keep only recent, relevant context.

Model Tiering

Cheap model for routing. Strong model only for final answer synthesis.

Tool Result Caching

Cache tool outputs (weather, stock prices, search results) aggressively.

Per-Tool Cost Attribution

Track which tools drive costs. Some tools are expensive (API calls, vision models). Implement cost tracking per tool to identify optimization opportunities.

class CostTracker:
    def __init__(self):
        self.tool_costs = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})

    async def record_call(self, tool_name, response):
        tokens = response.usage.total_tokens
        cost = self._price_tokens(tokens, tool_name)
        self.tool_costs[tool_name]["calls"] += 1
        self.tool_costs[tool_name]["tokens"] += tokens
        self.tool_costs[tool_name]["cost_usd"] += cost

    def get_report(self):
        return sorted(
            self.tool_costs.items(),
            key=lambda x: x[1]["cost_usd"],
            reverse=True
        )

Cost Anomaly Detection

Monitor for unexpected cost spikes. Alert when per-query cost exceeds baseline + 2σ or when tokens exceed budget threshold.

async def detect_anomaly(query_cost, history: List[float]):
    mean = np.mean(history)
    std = np.std(history)
    threshold = mean + 2 * std

    if query_cost > threshold:
        await alert(f"Cost spike: {query_cost}. Baseline: {mean}")

Model Pricing Comparison (March 2026)

Model	Input Token Rate	Output Token Rate	Vision (if available)	Best For
Claude Sonnet 4	$3/1M	$15/1M	Yes	Complex reasoning
Claude Haiku 4.5	$0.80/1M	$4/1M	Yes	Low-latency routing
OpenAI GPT-4o	$2.50/1M	$10/1M	Yes	Multimodal agents
OpenAI GPT-4o Mini	$0.15/1M	$0.60/1M	Yes	High-volume routing
Google Gemini 2.0	$0.075/1M	$0.30/1M	Yes	Cost-sensitive scale
Meta Llama 3.1 (self-hosted)	Compute cost	Compute cost	Limited	Privacy-critical

Implement cost attribution and anomaly detection from day one. A single runaway agent consuming 100K tokens can blow a monthly budget. Track per-tool costs, set hard limits, and alert on spikes.

17 / Cost Control & Optimization

DevOps

Production Deployment

Stateless agent services behind a load balancer. Store conversation state in Redis or a database. Trace every call. Monitor tool health. Feature flag new tools. Enable zero-downtime deployments.

Production Best Practices

Stateless Design

Agent services are ephemeral. Store conversation state in Redis or a database for horizontal scaling.

Tool Health Checks

Monitor tool dependencies. Circuit breakers for degraded tools. Graceful degradation.

Feature Flags

Gradually roll out new tools. Feature flag new behaviors. Easy rollback.

Canary Deployments

5% traffic to new version. Progressive 25% → 50% → 100%. Zero downtime.

Agent services should be stateless. Store conversation state in Redis or a database. This enables horizontal scaling and zero-downtime deployments.

18 / Production Deployment

Summary

Production Readiness Checklist

Before you ship, verify all four categories. The best agent architectures are boring: simple orchestration, reliable tools, comprehensive monitoring, and strict safety boundaries.

Tool Design

✓ JSON Schema validation
✓ Descriptions optimized for models
✓ Versioned schemas
✓ <20 tools per request
✓ Input sanitization
✓ Idempotent tools where possible

Reliability

✓ Retry with exponential backoff
✓ Circuit breakers
✓ Timeout on every tool
✓ Graceful degradation
✓ Dead letter queue
✓ Error reporting to LLM

Safety

✓ Tool allowlisting per user role
✓ Code execution sandboxing
✓ Network isolation
✓ Argument sanitization
✓ Cost budgets
✓ Prompt injection defense

Operations

✓ Distributed tracing on every call
✓ Token/cost dashboards
✓ Tool latency monitoring
✓ Eval suite in CI/CD
✓ Canary deployments
✓ Model version pinning

Complexity is the enemy of production reliability. The best agent architectures are boring: simple orchestration, reliable tools, comprehensive monitoring, and strict safety boundaries. If you can't easily explain your agent flow to another engineer, it's too complicated.

19 / Production Readiness Checklist

Ecosystem

Agent Frameworks & Ecosystem

Landscape of production frameworks ranges from lightweight libraries to full platforms. Start simple. Choose based on your system's maturity level.

Framework Comparison Matrix

Framework	Strengths	Gaps / Risks	Security Posture	Best-Fit Role
LangChain + LangGraph	Wide tool ecosystem, strong community, good documentation	Can be verbose, cost tracking not native, performance variable	Community-maintained, audit tools available	General-purpose agents, prototyping
LlamaIndex	Excellent RAG, semantic caching, fast indexing	Less multi-agent support, narrower tool set	Document-level security, integrations	Document-driven agents, knowledge systems
AutoGen	Multi-agent conversations, flexible role def	Unpredictable cost, hard to debug, no built-in guardrails	Limited access controls, relies on models	Research, complex collaborative tasks
MS Agent Framework	Enterprise-grade, strong security, durable execution	Steeper learning curve, Azure-dependent	Built-in RBAC, audit trails, compliance ready	Enterprise production, regulated industries
Semantic Kernel	Plugin model, cross-language support, .NET first	Smaller ecosystem than LangChain, less documentation	Microsoft ecosystem integration	.NET applications, Windows-first orgs
Ray Serve	Distributed scaling, low-latency serving, cost-aware	Operational overhead, requires Kubernetes knowledge	Network isolation, resource limits	High-volume production, multi-tenant SaaS
CrewAI	Simple role-based design, good for structured workflows	Early stage, smaller community, limited frameworks	Depends on underlying models, basic tooling	Workflow-focused teams, structured tasks
Haystack	Modular pipelines, clear abstractions, good docs	Smaller community, less multi-agent tooling	Pipeline-level access control	Search & QA systems, modular pipelines
DSPy	Minimal, Pythonic, great for optimization	Limited built-in tools, requires more custom code	Simple surface = easy to audit	Research, custom agents, fine-tuning workflows

Starting Strategy: Begin with a lightweight library (LangChain or DSPy). Migrate to a platform (MS Agent Framework or Ray Serve) only when you need durability, multi-tenancy, or compliance features. Premature platformification adds complexity without value.

20 / Agent Frameworks & Ecosystem

Planning

Phased Implementation Roadmap

Production agents aren't built overnight. This phased approach helps you balance velocity with reliability.

Success Metrics & Deliverables

P1: Tools Inventory

Tool catalog doc, risk matrix, schema specs, RBAC roles defined

P2: MVP Deployed

Read-only agent live, tool gateway running, basic observability online

P3: Durable Workflows

Temporal/LangGraph checkpoints, ledger logging, compensation tests

P4: Eval Suite Live

Red-team results, eval CI/CD checks, security audit report

P5: Multi-Agent Ready

MCP integration, distributed tracing proven, cost model validated

Iterative Delivery: Each phase produces working software. You can ship P1 + P2 in 6-8 weeks. P3-P5 happen as you grow. Don't wait for "platform perfection"—move fast on the bounded MVP, harden in production based on observed failures.

21 / Phased Implementation Roadmap

Governance

Audit & Compliance Data Model

Enterprise agents require complete audit trails. This ER model captures every decision, auth check, and compensation action for compliance, debugging, and forensics.

AuditLogger Implementation

class AuditLogger:
    def __init__(self, db):
        self.db = db

    async def log_tool_call(self, call: ToolCall):
        # Insert TOOL_CALL record
        call_id = uuid4()
        await self.db.execute("""
            INSERT INTO TOOL_CALL (call_id, step_id, tool_name, args_hash, created_at)
            VALUES (?, ?, ?, ?, now())
        """, call_id, call.step_id, call.name, sha256(call.args))

        # Log auth decision
        await self.db.execute("""
            INSERT INTO AUTHZ_DECISION (decision_id, call_id, allowed, reason)
            VALUES (?, ?, ?, ?)
        """, uuid4(), call_id, call.allowed, call.authz_reason)

        return call_id

    async def log_tool_result(self, call_id, result, latency_ms):
        # Insert TOOL_RESULT record (redacted)
        await self.db.execute("""
            INSERT INTO TOOL_RESULT (result_id, call_id, result_hash, latency_ms)
            VALUES (?, ?, ?, ?)
        """, uuid4(), call_id, sha256(result), latency_ms)

    async def log_compensation(self, call_id, action):
        await self.db.execute("""
            INSERT INTO COMPENSATION_ACTION (comp_id, call_id, action, executed)
            VALUES (?, ?, ?, true)
        """, uuid4(), call_id, action)

    async def audit_trail(self, run_id):
        # Full audit trail for a run: all steps, calls, auth, results
        return await self.db.query("""
            SELECT s.step_num, t.tool_name, a.allowed, a.reason, r.latency_ms, c.action
            FROM STEP s
            JOIN TOOL_CALL t ON s.step_id = t.step_id
            LEFT JOIN AUTHZ_DECISION a ON t.call_id = a.call_id
            LEFT JOIN TOOL_RESULT r ON t.call_id = r.call_id
            LEFT JOIN COMPENSATION_ACTION c ON t.call_id = c.call_id
            WHERE s.run_id = ?
            ORDER BY s.step_num
        """, run_id)

Compliance Ready: This model provides: (1) Full trace for forensics, (2) Auth decision audit trail for SOC2, (3) Compensation logs for workflow integrity, (4) Redacted results to prevent PII leakage in logs. Hash sensitive fields, never store plaintext args/results in audit tables.

23 / Audit & Compliance Data Model

Glossary of AI Agent Terms

13 key technical terms used throughout this guide.

A

Term	Definition
Agent Loop (ReAct)	The Reasoning-Acting loop where an LLM reasons about a task, selects a tool, executes it, observes results, and iterates until complete. The core pattern for AI agents.
Agentic RAG	A RAG pattern where an LLM agent autonomously decides when to retrieve, which tools to call, and whether to iterate — orchestrating multi-step retrieval and reasoning.

C

Term	Definition
Chain-of-Thought (CoT)	A prompting technique that elicits step-by-step reasoning before the final answer, improving performance on complex multi-step tasks.

D

Term	Definition
Durable Execution	A workflow pattern (Temporal, Inngest) that persists agent state across failures, enabling automatic retries and recovery for long-running multi-step tasks.

F

Term	Definition
Function Calling	The LLM's ability to generate structured JSON tool invocations in response to a query, enabling it to interact with external APIs, databases, and services.

G

Term	Definition
Guardrails	Input/output validation rules that constrain agent behavior — preventing prompt injection, enforcing output schemas, and blocking unsafe tool invocations.

H

Term	Definition
Human-in-the-Loop (HITL)	A pattern where the agent pauses for human approval before executing high-risk actions (e.g., financial transactions, deletions). Critical for production safety.

I

Term	Definition
InjecAgent	An adversarial framework for testing tool-calling agents against prompt injection attacks that attempt to hijack tool invocations through malicious instructions.

J

Term	Definition
JSON Schema (Tool Schema)	The structured definition of a tool's name, description, and parameters that the LLM uses to understand what tools are available and how to invoke them.

M

Term	Definition
Multi-Agent Orchestration	Coordinating multiple specialized agents (e.g., researcher, coder, reviewer) that collaborate on complex tasks through message passing or shared state.

P

Term	Definition
Prompt Injection	An adversarial attack where malicious instructions are embedded in inputs to manipulate the LLM's behavior, potentially causing unauthorized tool invocations.

S

Term	Definition
Saga Pattern	A distributed transaction pattern for multi-step agent workflows where each step has a compensating action (rollback) if a later step fails.

T

Term	Definition
Tool Use	The ability of an LLM to call external functions, APIs, or services to perform actions beyond text generation — retrieving data, executing code, or modifying state.

Full Reference: See the unified LLM Glossary for 140+ terms across all learning documents.