AI Agents — Advanced Tool Calling
Production patterns for building reliable, observable, and safe tool-using AI agents — from schema design through multi-agent orchestration to deployment.
Why Tool Calling Changes Everything
Tool calling lets an LLM emit structured function calls instead of free text. The model outputs a tool name plus arguments, your code executes the call and returns the result, and the model continues. This unlocks deterministic, repeatable, and composable agent workflows.
Structured I/O
From text-in/text-out to deterministic function calls. The model outputs JSON matching your schema, not ambiguous natural language.
Multi-Step Reasoning
From single-turn to iterative loops. The agent thinks, acts, observes, and repeats until the goal is reached.
Integrated Systems
From isolated model to end-to-end system. Your code controls execution, validation, and error recovery.
Why Agents Need Tools
Real-Time Data
Access APIs, databases, and live information the model has no knowledge of.
Take Actions
Send emails, create records, modify systems, and execute business logic.
Precise Computation
Offload math, date calculations, and deterministic logic to code.
Private Systems
Connect to internal services, databases, and proprietary systems safely.
This is not 'asking the AI to write code': the model outputs JSON matching your schema, and your code validates and executes it. You're always in control.
Five Production Design Commitments
1. Capability Boundary
Curated tools with least-privilege, rate limits, and auditing. Never expose untrusted capabilities.
2. Reliable Substrate
Idempotency, retries, durable state checkpoints, and saga compensation patterns.
3. Grounded Loop
Schema validation + explicit grounding rules. Tool inputs must be validated before execution.
4. Observability
Tool-call spans, auth decisions, retries, and state checkpoints. Monitor actions, not just tokens.
5. Continuous Eval
Adversarial testing, InjecAgent benchmarks, and red-team suites in CI/CD.
Tool Calling Fundamentals
The Tool Calling Lifecycle
Complete Tool Calling Flow with Anthropic Claude API
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"}
        },
        "required": ["city"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Handle tool use response
for block in response.content:
    if block.type == "tool_use":
        result = execute_tool(block.name, block.input)
        # Send result back to continue conversation

Tool Definition
Name, description, and JSON schema that defines the interface the model must follow.
Tool Selection
The model picks which tool to call based on the user query and available tools.
Tool Execution
Your code runs the selected tool with the model-provided arguments.
Result Injection
Feed results back to the model to continue reasoning or take next steps.
The model never executes anything itself — it outputs structured JSON saying "I want to call X with args Y". Your code is always in control. This is the core principle that makes tool calling safe and deterministic.
Tool Schema Design & Best Practices
Well-designed schemas dramatically improve model accuracy and reduce errors. The schema is the interface between your LLM and your system.
Well-Designed vs Poorly-Designed Schemas
"input_schema": {
"type": "object",
"properties": {
"date_range": {
"type": "object",
"properties": {
"start": {
"type": "string",
"format": "date",
"desc": "YYYY-MM-DD"
}
}
},
"limit": {
"type": "integer",
"minimum": 1,
"maximum": 100
}
},
"required": ["date_range"]
}"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string"
},
"options": {
"type": "string"
}
}
}
# No descriptions, no constraints,
# vague parameter names, string for
# everything, no validationSchema Validation Pipeline
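The validation pipeline boils down to checking model-provided arguments against the schema before any tool runs. A minimal sketch in plain Python (the `validate_args` helper and its error wording are illustrative, not any library's API; real systems often use the jsonschema package):

```python
def validate_args(schema: dict, args: dict) -> list[str]:
    """Check model-provided arguments against a JSON-Schema-like spec.
    Returns a list of error strings; an empty list means valid."""
    errors = []
    props = schema.get("properties", {})
    # Required keys must be present
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required field: {key}")
    type_map = {"string": str, "integer": int, "number": (int, float)}
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            errors.append(f"unexpected field: {key}")
            continue
        expected = type_map.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{key}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
        if "minimum" in spec and isinstance(value, (int, float)) and value < spec["minimum"]:
            errors.append(f"{key}: below minimum {spec['minimum']}")
        if "maximum" in spec and isinstance(value, (int, float)) and value > spec["maximum"]:
            errors.append(f"{key}: above maximum {spec['maximum']}")
    return errors

weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}
```

On failure, return the error list to the model as the tool result so it can correct its arguments, rather than raising into your application code.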
Best Practices for Tool Schemas
1. Write Descriptions for the MODEL
Be explicit about format, constraints, and examples. The model reads the description to understand what you want.
2. Use Enums to Constrain Choices
Don't let the model invent values. Define allowed options explicitly in the schema.
3. Keep Tool Count Under 20
More tools = worse selection accuracy. Group related tools or use sub-actions.
4. Version Your Schemas
Breaking changes need migration. Track schema versions and deprecate gradually.
5. Include Examples in Descriptions
Show the model what good input looks like. "Example: 2024-03-15" is better than "ISO date format".
6. Validate Server-Side
NEVER trust model output. Validate all inputs, check ranges, verify enums, and handle edge cases.
Parameter Type Patterns
| Type | Pattern | Use Case |
|---|---|---|
| String with Regex | "pattern": "^[A-Z]{2}\\d{3}$" | Codes, SKUs, phone numbers |
| Number with Bounds | "type": "number", "minimum": 0, "maximum": 100 | Prices, percentages, counts |
| Enum | "enum": ["pending", "active", "archived"] | Status, categories, choices |
| Array with Items | "type": "array", "items": {"type": "string"}, "maxItems": 10 | Tags, email lists, IDs |
| Nested Object | "type": "object", "properties": {...}, "required": [...] | Complex data, structured inputs |
Tool Interface Patterns
| Interface Pattern | Contract Format | Security Implications | Best-Fit Use Cases |
|---|---|---|---|
| REST+OpenAPI | OpenAPI 3.0 spec, HTTP verbs | Network isolation, TLS required, easy MITM if not HTTPS | External APIs, standard microservices |
| gRPC+Protobuf | Protocol Buffers, binary format | Strongly typed, harder to modify, TLS required | High-throughput internal services, low-latency |
| Provider Function Calling | Native code, type signatures | Direct code execution, trust model is critical | Same-process agents, embedded systems |
| MCP (Model Context Protocol) | JSON-RPC, tools as resources | Standardized, audit trails, capability discovery | Multi-agent systems, plugin architectures |
| Tool-Proxy/Firewall | Transparent proxy, schema validation | Input sanitization, rate-limiting, logging layer | Enterprise, compliance-heavy, zero-trust |
Agent Orchestration Patterns
Different patterns for different needs. Most production systems need Sequential Chain. Start simple.
Orchestration Pattern Comparison
| Pattern | When to Use | Complexity | Latency | Reliability |
|---|---|---|---|---|
| Single Tool | Simple tasks, one action per query | Low | Fast | High |
| Sequential Chain | Multi-step workflows with known order | Medium | Moderate | High |
| Router | Classification or dispatch to different tools | Medium | Fast | High |
| Autonomous Agent | Complex reasoning, unknown steps, exploration | High | Slow | Medium |
AgentOrchestrator Implementation (Autonomous Loop)
class AgentOrchestrator:
    def __init__(self, max_iterations=10):
        self.max_iterations = max_iterations
        self.tool_registry = {}
        self.conversation = []

    def register_tool(self, name, tool_func, schema):
        self.tool_registry[name] = {"func": tool_func, "schema": schema}

    def run(self, query):
        self.conversation = [{"role": "user", "content": query}]
        for iteration in range(self.max_iterations):
            # Get LLM response with tool definitions
            response = call_llm(self.conversation, self.tool_registry)
            if response.stop_reason == "end_turn":
                return response.text
            for block in response.content:
                if block.type == "tool_use":
                    result = self.tool_registry[block.name]["func"](block.input)
                    self.conversation.append({"role": "user", "content": result})
        return "Max iterations reached"

Most production use cases need Sequential Chain, not full autonomous agents. Autonomous loops are harder to debug, slower, and more expensive. Use them only when the task requires true exploration.
Agent Skill Map — Capabilities & Tool Taxonomy
A production agent system is only as powerful as its tools. This skill map charts the full landscape of agent capabilities, the tool categories that enable them, and how they compose into real-world workflows.
Information Retrieval
The foundation of any useful agent — fetching context from the world.
| Tool | Use Case | Complexity |
|---|---|---|
| Web Search | Real-time facts, news, current events | Low |
| RAG / Vector Search | Private knowledge base Q&A | Medium |
| SQL Query | Structured data: metrics, reports, analytics | Medium |
| API Fetch | Live system data (weather, stock, status) | Low |
| Doc Reader | PDF/DOCX parsing, contract analysis | Medium |
Code & Computation
Extends the agent beyond language into precise computation and system control.
| Tool | Use Case | Complexity |
|---|---|---|
| Code Exec | Data analysis, plotting, scripting | High |
| Shell/CLI | File ops, git, system admin | High |
| Calculator | Precise math, financial calcs | Low |
| Data Transform | CSV cleaning, JSON reshape, ETL | Medium |
| Notebook | Interactive data exploration | High |
Actions & Integrations
Where agents become truly useful — taking actions in real systems on behalf of users.
| Tool | Use Case | Risk Level |
|---|---|---|
| Email | Draft, send, search, summarize | Medium |
| Calendar | Schedule, reschedule, check availability | Medium |
| Messaging | Post updates, respond, search channels | Medium |
| Ticketing | Create, update, assign, close | Low |
| CRM | Update contacts, log activities | High |
| Git | Commit, create PRs, review code | High |
Skill Composition — Mapping Use Cases to Tool Sets
Real agent tasks rarely need a single tool. Here's how skills compose for common production use cases:
| Use Case | Required Skills | Tool Chain | Orchestration |
|---|---|---|---|
| Research Assistant | Retrieval, Compute | Web Search → Document Reader → Summarize → Write Report | Sequential Chain |
| Data Analyst | Retrieval, Compute, Generate | SQL Query → Code Exec (pandas) → Chart Builder → Slide Deck | Sequential + Parallel |
| Customer Support | Retrieval, Actions, Memory | RAG Search → CRM Lookup → Ticket Create → Email Draft | Router + Sequential |
| DevOps Copilot | Compute, Retrieval, Actions | Log Search → Shell Exec → Runbook Lookup → Slack Alert | ReAct (Autonomous) |
| Meeting Prep Agent | Retrieval, Actions, Generate | Calendar Check → CRM Lookup → Web Search → Doc Writer | Sequential Chain |
| Code Review Agent | Compute, Retrieval, Actions | Git Diff → Code Exec (tests) → Style Check → PR Comment | Parallel + Sequential |
Tool Design Principles
# The UNIX philosophy for agent tools:
# 1. Do one thing well
# 2. Composable inputs/outputs
# 3. Fail loudly with clear errors
# 4. Idempotent where possible
# 5. Observable (logs, metrics, traces)
class ToolDesignChecklist:
    single_responsibility: bool   # One action per tool
    clear_description: bool       # LLM can understand when to use
    typed_schema: bool            # JSON Schema with constraints
    error_messages: bool          # Actionable, not cryptic
    idempotent: bool              # Safe to retry
    timeout: bool                 # Never hangs
    rate_limited: bool            # Protects downstream
    permission_scoped: bool       # Least privilege
    observable: bool              # Emits traces/metrics

Tool Risk Tiers & Approval Gates
Not all tools carry equal risk. Tier them and enforce appropriate approval flows:
Read-Only (Auto-approve)
Search, lookup, calculate, read. No side effects. Safe for autonomous execution.
Reversible Write (Soft-approve)
Draft email, create ticket, update status. Can be undone. Log + notify, execute with confirmation for sensitive data.
Irreversible Action (Human-approve)
Send email, publish post, execute payment, delete record. Require explicit human confirmation before execution.
Privileged / Admin (Never auto-approve)
Access controls, billing changes, system config, code deploy. Always require authenticated human approval with audit trail.
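A minimal sketch of an approval gate keyed off these tiers (the tool names and tier assignments below are illustrative; in production the registry would live in config, not code):

```python
from enum import Enum

class RiskTier(Enum):
    READ_ONLY = "auto"        # execute immediately
    REVERSIBLE = "soft"       # execute, then log + notify
    IRREVERSIBLE = "human"    # block until a human confirms
    PRIVILEGED = "never"      # always require authenticated approval

# Hypothetical tier registry for a handful of tools
TOOL_TIERS = {
    "web_search": RiskTier.READ_ONLY,
    "draft_email": RiskTier.REVERSIBLE,
    "send_email": RiskTier.IRREVERSIBLE,
    "deploy_code": RiskTier.PRIVILEGED,
}

def approval_required(tool_name: str) -> bool:
    """Unknown tools default to the most restrictive tier."""
    tier = TOOL_TIERS.get(tool_name, RiskTier.PRIVILEGED)
    return tier in (RiskTier.IRREVERSIBLE, RiskTier.PRIVILEGED)
```

The fail-closed default matters most: a tool missing from the registry should require approval, never skip it.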
ReAct & Reasoning Loops
ReAct (Reason + Act) is the pattern for agentic behavior: Think → Act → Observe → Repeat. The agent reasons about what to do, takes action, observes results, and continues.
ReActAgent Implementation
class ReActAgent:
    def __init__(self, llm, tools, max_steps=10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
        self.scratchpad = []

    async def run(self, query):
        for step in range(self.max_steps):
            # Think: reason about what to do next
            thought = await self.llm.think(query, self.scratchpad)
            self.scratchpad.append(("thought", thought))
            if thought.action == "finish":
                return thought.answer
            # Act: execute the chosen tool
            tool = self.tools[thought.tool_name]
            result = await tool.execute(thought.tool_args)
            # Observe: record the result
            self.scratchpad.append(("observation", result))
        return "Max steps reached — could not complete"

ReAct vs Other Approaches
ReAct (Reason + Act)
Explicit think-act-observe loop. Best for complex tasks requiring exploration and recovery.
Chain-of-Thought
Reasoning without action. Better for analysis but can't execute tasks or access new data.
Plan-and-Execute
Create a plan first, then execute. Good for structured workflows, poor for adaptive tasks.
LATS (Language Agent Tree Search)
Tree search over actions. Expensive but explores multiple paths for complex problems.
Planning Algorithms Comparison
| Paradigm | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| ReAct | Reason → Act → Observe → Repeat | Flexible, adaptive, handles new situations, easy to debug | Can be verbose, slower on simple tasks |
| Plan-and-Solve | Create plan first, then execute steps | Structured, good for complex tasks, reduces errors | Brittle when plan becomes stale, poor for exploration |
| Tree of Thoughts | Explore multiple reasoning branches | Finds better solutions, good for reasoning, backtracking | High token cost, slow, overkill for simple tasks |
| Reflexion | Agent self-reflects on failures | Learns from mistakes, improves over iterations | Requires many iterations, slow convergence |
| Toolformer | Model decides when to call tools | Minimal tokens, fast, uses tools sparingly | Requires fine-tuning, less transparent reasoning |
| Graph of Thoughts | Reasoning as directed acyclic graph | Expresses complex dependencies, good for workflows | Complex to implement, expensive to evaluate |
Parallel & Multi-Tool Execution
Modern LLMs can request multiple tools at once. Instead of waiting for each result, execute them in parallel for 3-5x latency improvement.
ParallelToolExecutor with asyncio.gather
class ParallelToolExecutor:
    async def execute_batch(self, tool_calls, timeout=30):
        tasks = [
            asyncio.wait_for(
                self._execute_one(call),
                timeout=timeout
            )
            for call in tool_calls
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # Handle partial failures gracefully
        return [
            ToolResult(call.id, result) if not isinstance(result, Exception)
            else ToolResult(call.id, error=str(result))
            for call, result in zip(tool_calls, results)
        ]

Parallel Execution Benefits & Challenges
Benefits
Latency: 3-5x faster for multi-step tasks. Throughput: Execute 10 tools simultaneously instead of sequentially.
Challenges
Ordering: Dependency handling (Tool B needs Tool A result). Cost: Fan-out multiplies API calls.
Error Handling
Partial failures (1 of 3 fails). Timeouts on slow tools. Always use return_exceptions=True.
Execution Strategies
| Strategy | Use Case | Latency | Complexity |
|---|---|---|---|
| Serial | Tool B depends on Tool A result | Slowest | Simple |
| Parallel | Independent tools (most common) | 3-5x faster | Moderate |
| DAG-Based | Complex dependency graphs | Optimal | High |
| Speculative | Execute multiple paths, pick best | Variable | Very High |
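The DAG-based strategy from the table can be sketched with a ready-set loop: every tool whose prerequisites are satisfied runs in the same parallel batch. The `run_dag` function and its task/deps contract are illustrative assumptions:

```python
import asyncio

async def run_dag(tasks, deps):
    """Execute tools respecting a dependency graph.
    `tasks` maps name -> async callable(results_dict);
    `deps` maps name -> set of prerequisite names."""
    results = {}
    done = set()
    pending = set(tasks)
    while pending:
        # Everything whose prerequisites are all done runs in parallel
        ready = [n for n in pending if deps.get(n, set()) <= done]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency")
        batch = await asyncio.gather(*(tasks[n](results) for n in ready))
        for name, value in zip(ready, batch):
            results[name] = value
            done.add(name)
            pending.discard(name)
    return results
```

Each wave is as parallel as the graph allows, so independent branches still get the 3-5x speedup while dependent tools wait only for their actual prerequisites.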
Multi-Agent Delegation
Delegate complex tasks to specialized sub-agents. Supervisor routes work, workers execute independently, results compose back up.
SupervisorAgent Router Implementation
class SupervisorAgent:
    def __init__(self):
        self.workers = {
            "research": ResearchAgent(tools=[web_search, doc_reader]),
            "code": CodeAgent(tools=[code_exec, file_write]),
            "data": DataAgent(tools=[sql_query, chart_gen]),
        }

    async def handle(self, task):
        plan = await self.planner.decompose(task)
        results = {}
        for step in plan.steps:
            worker = self.workers[step.agent]
            results[step.id] = await worker.execute(step.instruction, context=results)
        return await self.synthesizer.combine(results)

Multi-Agent Orchestration Patterns
Supervisor / Worker
One coordinator routes tasks to specialized workers. Best for diverse skill domains. Easy to scale.
Peer-to-Peer
Agents communicate directly. More flexible but harder to debug. Good for negotiation.
Hierarchical
Tree of agents. Scales better for large teams. Messaging overhead increases.
Swarm
Decentralized with local rules. Complex behaviors from simple agents. Hard to predict.
Error Handling & Recovery
Real agents fail. Your job is to define what's retryable, what's not, and how the model can recover.
ResilientToolExecutor with Decorators
class ResilientToolExecutor:
    @retry(max_attempts=3, backoff=exponential(base=2))
    @circuit_breaker(failure_threshold=5, recovery_timeout=60)
    @timeout(seconds=30)
    async def execute(self, tool_name, args):
        try:
            result = await self.tools[tool_name].run(args)
            self.metrics.record_success(tool_name)
            return result
        except ValidationError as e:
            return ToolError(f"Invalid args: {e}", retryable=False)
        except RateLimitError:
            raise  # Let retry decorator handle
        except TimeoutError:
            self.metrics.record_timeout(tool_name)
            return ToolError("Tool timed out", retryable=True)

Error Taxonomy
Tool Execution Failure
API unreachable, permission denied, network timeout. Often retryable with backoff.
Schema Validation
LLM outputs invalid args. Not retryable at executor level—return to LLM to correct.
Timeout / Rate Limit
Tool slow or API throttled. Retryable with exponential backoff and jitter.
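The `@retry` decorator used in the executor above can be sketched as follows. This is a toy version for illustration; production code typically reaches for a library such as tenacity:

```python
import asyncio
import functools
import random

def retry(max_attempts=3, base=0.5):
    """Retry an async function with exponential backoff and jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    # Backoff grows 2x per attempt; jitter avoids
                    # synchronized retry storms against the same API
                    delay = base * (2 ** attempt) * random.uniform(0.5, 1.5)
                    await asyncio.sleep(delay)
        return wrapper
    return decorator
```

Note that this retries on any exception; a real implementation should consult the error taxonomy above and re-raise non-retryable errors (like schema validation failures) immediately.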
Durable Execution & Saga Patterns
Production systems require durability. If an agent crashes mid-execution, you must be able to resume. Use idempotency keys, durable state checkpoints, and compensation patterns.
Temporal Idempotent Activities
Use Temporal workflows to define durable, idempotent tool executions. If a tool call succeeds but the workflow crashes, Temporal retries from the checkpoint—not from scratch.
Stripe-Style Idempotency Keys
For external APIs, include idempotency keys in request headers. If the same request is sent twice, the server returns cached result—no duplicate charge.
Saga Compensation Pattern
For multi-step workflows, define compensating actions (rollbacks). If step 3 fails, execute reverse of step 2, then step 1. AWS Prescriptive Guidance reference.
Workflow Ledger
Persist every tool call and result to a ledger. Enables auditing, replay, and recovery. Each entry is immutable and timestamped.
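The Stripe-style idempotency pattern reduces to a keyed response cache on the server side: a repeated key replays the original response instead of re-executing the side effect. A minimal in-memory sketch (`IdempotencyStore` is illustrative; real systems persist keys in Redis or Postgres with a TTL):

```python
class IdempotencyStore:
    """Caches responses by idempotency key so a retried
    request becomes a no-op replay, not a duplicate action."""
    def __init__(self):
        self._cache = {}

    def execute(self, key, operation):
        if key in self._cache:
            return self._cache[key]   # replay the original response
        result = operation()          # first time: actually run it
        self._cache[key] = result
        return result

charges = []  # stands in for a real side effect (a payment)

def charge_card():
    charges.append(25)
    return {"status": "charged", "amount": 25}

store = IdempotencyStore()
store.execute("req-123", charge_card)
store.execute("req-123", charge_card)  # retry: no double charge
```

The agent side of the contract is simply generating the key once per logical operation (not per retry) and sending it with every attempt.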
SagaExecutor with Compensation
class SagaExecutor:
    def __init__(self, ledger):
        self.ledger = ledger
        self.compensations = []

    async def execute_saga(self, steps):
        for step in steps:
            try:
                # Add idempotency key to request
                idempotency_key = uuid4()
                result = await self.execute_with_key(
                    step.tool, step.args, idempotency_key
                )
                # Log to ledger
                await self.ledger.append({
                    "type": "TOOL_CALL",
                    "step_id": step.id,
                    "tool": step.tool,
                    "result": result,
                    "timestamp": now()
                })
                # Register compensation for rollback
                if step.compensate:
                    self.compensations.insert(0, (
                        step.compensate, result
                    ))
            except Exception as e:
                # Execute all compensations in reverse order
                for comp_fn, context in self.compensations:
                    await comp_fn(context)
                    await self.ledger.append({
                        "type": "COMPENSATION",
                        "fn": comp_fn.__name__
                    })
                raise
        return "All steps completed successfully"

Sandboxing & Permission Models
The model decides which tools to call. You must constrain what's possible through layered permission checks and resource limits.
SandboxedExecutor with Permission Engine
class SandboxedExecutor:
    def __init__(self, user_context):
        self.allowed_tools = self._resolve_permissions(user_context)
        self.resource_limits = ResourceLimits(
            max_api_calls=100,
            max_tokens_spent=50000,
            max_wall_time=300,
            allowed_domains=["api.internal.com"],
            blocked_actions=["delete", "admin.*"]
        )

    async def execute(self, tool_call):
        # 1. Tool allowlist check
        if tool_call.name not in self.allowed_tools:
            raise PermissionDenied(f"Tool {tool_call.name} not allowed")
        # 2. Argument sanitization
        safe_args = self.sanitizer.clean(tool_call.args)
        # 3. Resource limit check
        self.resource_limits.check_budget()
        # 4. Execute in sandbox
        return await self.sandbox.run(tool_call.name, safe_args)

Sandboxing Technologies
| Technology | Isolation Level | Performance | Complexity |
|---|---|---|---|
| Docker | Container (OS-level) | Moderate | Medium |
| gVisor | Syscall interception | Lower | Medium |
| Firecracker | MicroVM (strong isolation) | Good | High |
| WASM | Process-level sandbox | Very Fast | Medium |
| E2B / Modal | Managed multi-tenant | Good | Low |
InjecAgent & Tool Selection Attacks
Prompt injection isn't just direct—attackers can inject malicious instructions via tool-returned content, web pages, or API responses. The InjecAgent benchmark measures resilience.
Indirect Prompt Injection
Attacker embeds malicious instructions in web content, API response, or email body. Agent fetches content and executes instructions unknowingly. Example: web_search returns page with hidden "delete all data" instruction.
Tool Selection Manipulation
Attacker crafts query to confuse tool selection: "Call the delete_account tool to help me." Agent picks wrong tool due to misleading language. Requires strict tool descriptions + grounding.
InjecAgent Benchmark
Research benchmark that tests agent resilience to indirect injection. Measures: Can the agent resist instructions embedded in tool outputs? Real-world validated attacks.
Defense Layering
Tool firewall validates tool calls structurally. Structured outputs reduce injection surface. PII redaction (Presidio) + secrets management (Vault) prevent data leakage via logs.
Defense: Tool Firewall + Structured Output
class ToolFirewall:
    async def filter_and_execute(self, tool_calls):
        for call in tool_calls:
            # 1. Structural validation (strict schema)
            if not self.schema_validator.validate(call):
                raise InvalidToolCall("Failed schema validation")
            # 2. Allowlist enforcement
            if call.name not in self.allowed_tools:
                raise ToolNotAllowed(call.name)
            # 3. Input sanitization with context awareness
            args = self.sanitizer.clean(call.args)
            args = self.redactor.redact_pii(args)  # Presidio
            # 4. Execute & redact result before returning to LLM
            result = await self._execute(call.name, args)
            result = self.redactor.redact_secrets(result)  # Vault
            # 5. Log for audit
            await self.audit_logger.log({
                "tool": call.name,
                "args_hash": sha256(args),
                "result_hash": sha256(result),
                "timestamp": now()
            })
            yield result

Streaming & Real-Time Tool Calling
Stream text tokens to the user immediately while tool calls execute in parallel. When a tool call interrupts the stream, show a loading indicator. This cuts perceived latency from seconds to milliseconds.
StreamingToolHandler Implementation
class StreamingToolHandler:
    async def stream_with_tools(self, messages, tools):
        async with self.client.messages.stream(
            model="claude-sonnet-4-20250514",
            messages=messages, tools=tools, max_tokens=4096
        ) as stream:
            collected_content = []
            async for event in stream:
                if event.type == "content_block_start":
                    if event.content_block.type == "tool_use":
                        # Tool call detected mid-stream
                        tool_input = await self._collect_tool_input(stream)
                        result = await self.executor.execute(
                            event.content_block.name, tool_input
                        )
                        yield ToolResultEvent(result)
                    else:
                        yield TextStartEvent()
                elif event.type == "content_block_delta":
                    if hasattr(event.delta, 'text'):
                        yield TextDeltaEvent(event.delta.text)

Real-Time Communication Patterns
Server-Sent Events
Browser native streaming. Perfect for web clients and long-lived connections.
WebSocket
Bidirectional real-time communication. Best for interactive, multi-turn streams.
Token Rendering
Display each token as it arrives. Users see response forming in real-time.
Latency Hiding
Mask tool execution behind visible text. Perceived latency drops dramatically.
Agent Memory & Context Management
Multi-turn agents fill context windows. After 5-10 tool calls, you've used 50K+ tokens. What do you keep? What do you drop? Strategic memory management prevents catastrophic context exhaustion.
Four Memory Strategies
Sliding Window
Keep last N turns, drop oldest. Simple but loses early context about original request.
Summarization
LLM summarizes old turns into compact summary. Best quality/cost tradeoff for most agents.
Selective Pruning
Keep results for active topics, drop resolved ones. Smart but complex state tracking needed.
Hierarchical
Short-term (last 5 turns) + Long-term (vector store). Retrieve older context on demand.
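The sliding-window strategy is only a few lines; one common refinement, shown here, is pinning the first message so the original goal survives pruning (a sketch, not a library API):

```python
def sliding_window(messages, max_turns=6):
    """Keep the first (goal-bearing) message plus the last N turns.
    Mitigates the main weakness of plain windows: forgetting
    what the user originally asked for."""
    if len(messages) <= max_turns + 1:
        return messages
    # Pin the original request, drop the middle, keep the tail
    return [messages[0]] + messages[-max_turns:]
```

The summarization strategy below subsumes this: instead of dropping the middle outright, it compresses it into a summary message.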
AgentMemoryManager (Summarization Strategy)
class AgentMemoryManager:
    def __init__(self, max_tokens=8000, summary_threshold=6000):
        self.max_tokens = max_tokens
        self.threshold = summary_threshold
        self.messages = []
        self.summary = ""

    async def add(self, role, content, tool_result=None):
        self.messages.append({"role": role, "content": content})
        if self.count_tokens() > self.threshold:
            await self.compress()

    async def compress(self):
        old = self.messages[:-3]  # keep last 3 turns
        summary = await self.llm.summarize(old, self.summary)
        self.summary = summary
        self.messages = [
            {"role": "system", "content": f"Previous context: {self.summary}"}
        ] + self.messages[-3:]

    def get_context(self):
        return [{"role": "system", "content": self.summary}] + self.messages

Memory Strategy Comparison
| Strategy | Token Efficiency | Quality | Complexity | Latency |
|---|---|---|---|---|
| Sliding Window | Medium | Low | Low | Instant |
| Summarization | High | High | Medium | Extra LLM call |
| Selective Pruning | High | Medium | High | Instant |
| Hierarchical | Very High | High | Very High | Vector lookup |
Tool Result Truncation
Large API responses must be truncated before injecting back into context. A single 10K-token API result can exhaust your entire remaining budget.
def truncate_tool_result(result, max_chars=2000):
    if len(result) <= max_chars:
        return result
    # Keep head and tail: first third + last third of the budget
    first_part = result[:max_chars // 3]
    last_part = result[-max_chars // 3:]
    return first_part + "\n[... TRUNCATED ...]\n" + last_part

Async & Long-Running Tool Patterns
Some tools take minutes or hours: report generation, human approval, data pipelines, deployments. Don't block the agent. Use callbacks, polling, or durable execution to handle async work.
Three Async Patterns
Polling
Agent periodically checks tool status. Simple but wasteful, adds latency, burns tokens on every check.
Webhook Callback
Tool calls you back when done. Efficient but requires infrastructure: queue, webhook endpoint, state storage.
Durable Execution
Use Temporal/LangGraph to checkpoint state. Resume from exact point when result arrives. Best for production.
AsyncToolExecutor (Webhook Pattern)
class AsyncToolExecutor:
    async def execute_async(self, tool_name, params):
        job_id = str(uuid.uuid4())
        webhook_url = self.config.webhook_base + f"/complete/{job_id}"
        # Send to tool with callback
        await self.tool_service.enqueue(
            tool_name, params, webhook_url
        )
        # Store pending job
        await self.store.set(
            f"job:{job_id}",
            {"status": "pending", "tool": tool_name}
        )
        # Return job ID to agent (don't block)
        return {"job_id": job_id, "status": "pending"}

    async def on_webhook_complete(self, job_id, result):
        # Inject result and resume agent
        await self.agent_queue.put(
            {"job_id": job_id, "result": result}
        )

Pattern Comparison
| Pattern | Latency | Cost | Infrastructure |
|---|---|---|---|
| Polling | High (interval-based) | High (repeated checks) | Minimal |
| Webhook | Low (event-driven) | Low (one call) | Queue + endpoint |
| Durable Execution | Very Low | Low | Temporal/LangGraph |
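For completeness, the polling pattern from the comparison above can be sketched with exponential backoff to limit wasted checks. The `check_status` callable and its `{"status": ..., "result": ...}` return shape are assumptions for illustration:

```python
import asyncio

async def poll_job(check_status, interval=1.0, max_wait=30.0):
    """Poll an async status function until the job completes
    or the wait budget is exhausted."""
    waited = 0.0
    while waited < max_wait:
        job = await check_status()
        if job["status"] == "done":
            return job["result"]
        await asyncio.sleep(interval)
        waited += interval
        interval = min(interval * 2, 10.0)  # back off between checks
    raise TimeoutError("job did not complete in time")
```

Even with backoff, polling still pays latency and cost on every check, which is why the webhook and durable-execution patterns win for anything long-running.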
Multi-Modal Tool Calling
Modern agents work with vision, audio, and structured files. Vision tools analyze screenshots and charts. Audio tools transcribe and synthesize. File tools parse PDFs and spreadsheets.
Vision Tools Example
async def analyze_screenshot(image_base64: str) -> str:
    """Extract text, UI elements, and structure from screenshot"""
    response = await client.messages.create(
        model="claude-opus-4-1-20250805",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_base64}},
                {"type": "text", "text": "Extract text, buttons, forms. Return JSON."}
            ]
        }]
    )
    return response.content[0].text

Multi-Modal Capabilities by Provider
| Provider | Vision | Audio | Video |
|---|---|---|---|
| Anthropic Claude | Yes (vision) | No (use external) | No |
| OpenAI GPT-4o | Yes | Yes (in API) | Yes |
| Google Gemini | Yes | Yes | Yes |
| Open-source (LLaVA) | Yes (limited) | No | No |
When to Use Each Modality
Vision vs Text Extraction
Use Vision: UI screenshots, charts, diagrams, handwriting. Use Text: PDFs with structured text, OCR-ed documents.
Audio Processing
Transcription (speech-to-text) and synthesis (TTS) require external APIs. Integrate with agent tools for voice interfaces.
Observability & Tracing
Trace every LLM call, every tool selection, every tool execution, every result, every error. Collect token counts, latencies, and costs. Without observability, you can't debug or optimize.
OpenTelemetry Integration
from opentelemetry import trace

tracer = trace.get_tracer("agent")

class TracedAgent:
    @tracer.start_as_current_span("agent.run")
    async def run(self, query):
        span = trace.get_current_span()
        span.set_attribute("query", query[:200])
        for step in range(self.max_steps):
            with tracer.start_span("llm.call") as llm_span:
                response = await self.llm.generate(...)
                llm_span.set_attribute("model", self.model)
                llm_span.set_attribute("tokens.input", response.usage.input)
                llm_span.set_attribute("tokens.output", response.usage.output)
            if response.has_tool_use:
                with tracer.start_span(f"tool.{tool_name}") as tool_span:
                    result = await self.execute_tool(...)
                    tool_span.set_attribute("tool.success", not result.is_error)

Key Metrics Dashboard
LLM Calls
Calls per query, latency distribution (P50/P95/P99), token usage trends.
Tool Selection
Accuracy rate, which tools selected most, tool selection errors.
Costs & Budget
Token usage & cost per query, cost per tool, budget tracking.
Tool Latency
Execution time per tool, P50/P95/P99 distribution, bottleneck tools.
Error Tracking
Error rate by tool, error types, error recovery success.
Tools: LangSmith, Arize Phoenix, OpenTelemetry
Datadog, Braintrust, custom observability backends.
Testing & Evaluation
Build a testing pyramid: Unit tests for tool functions → Integration tests for tool calling flow → Agent eval suites for end-to-end task completion → Red team for adversarial testing.
AgentEvaluator Implementation
class AgentEvaluator:
    def __init__(self, agent, eval_set):
        self.agent = agent
        self.eval_set = eval_set  # [(query, expected_tools, expected_answer)]

    async def run_eval(self):
        results = []
        for query, expected_tools, expected_answer in self.eval_set:
            trace = await self.agent.run_traced(query)
            results.append(EvalResult(
                tool_selection_accuracy=self._check_tools(trace, expected_tools),
                answer_correctness=self._check_answer(trace.answer, expected_answer),
                steps_taken=len(trace.steps),
                total_tokens=trace.total_tokens,
                latency_ms=trace.duration_ms,
            ))
        return EvalReport(results)

Evaluation Metrics & Patterns
Deterministic Tests
Tool functions must be deterministic. Unit test each tool independently.
LLM-as-Judge
Use another LLM to grade answer quality. Good for subjective tasks.
Regression Testing
Run full eval suite on model upgrades. Lock in baseline before changes.
Adversarial Testing
Prompt injection, jailbreak attempts, malformed inputs, edge cases.
Mock Tool Testing Pattern
Mock tools return deterministic responses for testing agent logic without calling real APIs. Critical for CI/CD pipelines.
class MockToolRegistry:
    def __init__(self, fixtures: Dict):
        self.fixtures = fixtures

    async def execute(self, tool_name, params):
        # Return fixture if available, else real call
        if tool_name in self.fixtures:
            return self.fixtures[tool_name]
        return await real_tools[tool_name].execute(params)

# Usage in tests
mock_tools = MockToolRegistry({
    "get_weather": {"temp": 72, "conditions": "sunny"},
    "fetch_data": [1, 2, 3]
})

Snapshot Testing for Tool Call Sequences
Record expected tool call sequences. Compare against baseline on each run to catch logic regressions.
@pytest.mark.asyncio
async def test_agent_snapshot():
    trace = await agent.run_traced("Find flights to NYC")
    tool_calls = [step.tool_name for step in trace.steps]
    # Snapshot: ["search_flights", "get_prices", "format_response"]
    assert_snapshot(tool_calls, "test_agent_snapshot.json")
GitHub Actions CI Example
name: Test Agents
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run agent eval suite
        run: pytest tests/agent_evals.py -v
        env:
          TOOL_MOCK_MODE: true
          SNAPSHOT_UPDATE: false
Cost Control & Optimization
LLM tokens drive costs: input context grows with each step. Tool execution, retries, and multi-agent fan-out add up. Set budgets, track spend, optimize aggressively.
CostAwareAgent with Budgets
class CostAwareAgent:
    def __init__(self, budget_tokens=50000, budget_usd=0.10, max_steps=10):
        self.token_budget = budget_tokens
        self.usd_budget = budget_usd
        self.max_steps = max_steps
        self.tokens_used = 0
        self.cost_usd = 0.0

    async def run(self, query):
        for step in range(self.max_steps):
            # Bail out early rather than blowing through the budget mid-step
            if self.tokens_used > self.token_budget * 0.9:
                return self._force_answer("Approaching token budget")
            response = await self.llm.generate(...)
            self.tokens_used += response.usage.total_tokens
            self.cost_usd += self._calculate_cost(response.usage)
            if self.cost_usd > self.usd_budget:
                return self._force_answer("Cost budget exceeded")
Optimization Strategies
Prompt Caching
Reuse system prompts and tool definitions across queries. Dramatic savings on repeated context.
Context Pruning
Summarize old steps and tool results. Keep only recent, relevant context.
Model Tiering
Cheap model for routing. Strong model only for final answer synthesis.
Tool Result Caching
Cache tool outputs (weather, stock prices, search results) aggressively.
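The tool-result caching strategy above can be sketched as a small TTL cache keyed on tool name plus arguments. `ToolResultCache` is a hypothetical name for illustration; production systems would typically back this with Redis and set TTLs per tool (seconds for stock prices, hours for geocoding).

```python
import time

class ToolResultCache:
    """TTL cache for tool outputs; keys combine tool name and arguments."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def _key(self, tool_name, params):
        # Sort items so {"a":1,"b":2} and {"b":2,"a":1} hit the same entry
        return (tool_name, tuple(sorted(params.items())))

    def get(self, tool_name, params):
        entry = self._store.get(self._key(tool_name, params))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # missing or expired

    def put(self, tool_name, params, value):
        self._store[self._key(tool_name, params)] = (
            time.monotonic() + self.ttl, value)

cache = ToolResultCache(ttl_seconds=60)
cache.put("get_weather", {"city": "SF"}, {"temp": 65})
hit = cache.get("get_weather", {"city": "SF"})    # cached result
miss = cache.get("get_weather", {"city": "NYC"})  # None
```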
Per-Tool Cost Attribution
Track which tools drive costs. Some tools are expensive (API calls, vision models). Implement cost tracking per tool to identify optimization opportunities.
from collections import defaultdict

class CostTracker:
    def __init__(self):
        self.tool_costs = defaultdict(
            lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})

    async def record_call(self, tool_name, response):
        tokens = response.usage.total_tokens
        cost = self._price_tokens(tokens, tool_name)
        self.tool_costs[tool_name]["calls"] += 1
        self.tool_costs[tool_name]["tokens"] += tokens
        self.tool_costs[tool_name]["cost_usd"] += cost

    def get_report(self):
        # Most expensive tools first
        return sorted(
            self.tool_costs.items(),
            key=lambda x: x[1]["cost_usd"],
            reverse=True,
        )
Cost Anomaly Detection
Monitor for unexpected cost spikes. Alert when per-query cost exceeds baseline + 2σ or when tokens exceed budget threshold.
import numpy as np
from typing import List

async def detect_anomaly(query_cost, history: List[float]):
    mean = np.mean(history)
    std = np.std(history)
    threshold = mean + 2 * std  # baseline + 2 sigma
    if query_cost > threshold:
        await alert(f"Cost spike: {query_cost}. Baseline: {mean}")
Model Pricing Comparison (March 2026)
| Model | Input Token Rate | Output Token Rate | Vision (if available) | Best For |
|---|---|---|---|---|
| Claude Sonnet 4 | $3/1M | $15/1M | Yes | Complex reasoning |
| Claude Haiku 4.5 | $0.80/1M | $4/1M | Yes | Low-latency routing |
| OpenAI GPT-4o | $2.50/1M | $10/1M | Yes | Multimodal agents |
| OpenAI GPT-4o Mini | $0.15/1M | $0.60/1M | Yes | High-volume routing |
| Google Gemini 2.0 | $0.075/1M | $0.30/1M | Yes | Cost-sensitive scale |
| Meta Llama 3.1 (self-hosted) | Compute cost | Compute cost | Limited | Privacy-critical |
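As a quick sanity check on the rates above, per-query cost is a one-line calculation. The model keys below are shorthand labels for the table rows, not official API identifiers.

```python
# Rates in USD per 1M tokens (input, output), taken from the table above
PRICES = {
    "claude-sonnet-4": (3.00, 15.00),
    "claude-haiku-4.5": (0.80, 4.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def query_cost(model, input_tokens, output_tokens):
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A 10k-token agent context producing a 1k-token answer:
cost = query_cost("claude-sonnet-4", 10_000, 1_000)  # 0.045 USD
```

Note how output tokens dominate at these ratios: the 1k-token answer costs a third as much as the entire 10k-token input context, which is why context pruning and model tiering pay off.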
Production Deployment
Stateless agent services behind a load balancer. Store conversation state in Redis or a database. Trace every call. Monitor tool health. Feature flag new tools. Enable zero-downtime deployments.
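A minimal sketch of the externalized-state idea: `ConversationStore` is a hypothetical wrapper where a plain dict stands in for Redis (a real deployment would swap in redis-py `GET`/`SET` with a TTL), so any agent replica behind the load balancer can resume any session.

```python
import json

class ConversationStore:
    """Externalized conversation state so agent services stay stateless.
    The in-memory dict backend is a stand-in for Redis or a database."""
    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def save(self, session_id, messages):
        # Serialize so the backend only ever holds strings
        self.backend[f"conv:{session_id}"] = json.dumps(messages)

    def load(self, session_id):
        raw = self.backend.get(f"conv:{session_id}")
        return json.loads(raw) if raw else []

store = ConversationStore()
store.save("abc123", [{"role": "user", "content": "hi"}])
history = store.load("abc123")
```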
Production Best Practices
Stateless Design
Agent services are ephemeral. Store conversation state in Redis or a database for horizontal scaling.
Tool Health Checks
Monitor tool dependencies. Circuit breakers for degraded tools. Graceful degradation.
Feature Flags
Gradually roll out new tools. Feature flag new behaviors. Easy rollback.
Canary Deployments
5% traffic to new version. Progressive 25% → 50% → 100%. Zero downtime.
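The circuit-breaker idea from the tool health checks above can be sketched as a small state machine. The thresholds and half-open trial behavior below are illustrative choices, not a standard library API.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; rejects calls
    until `reset_after` seconds pass, then allows a trial call."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None = closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: permit one trial call
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()  # trip the breaker

breaker = CircuitBreaker(max_failures=2)
breaker.record_failure()
breaker.record_failure()    # breaker opens
can_call = breaker.allow()  # False until reset_after elapses
```

When `allow()` returns False, the agent should degrade gracefully: return a cached result, try a fallback tool, or tell the model the tool is unavailable.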
Production Readiness Checklist
Before you ship, verify all four categories. The best agent architectures are boring: simple orchestration, reliable tools, comprehensive monitoring, and strict safety boundaries.
Tool Design
- ✓ JSON Schema validation
- ✓ Descriptions optimized for models
- ✓ Versioned schemas
- ✓ <20 tools per request
- ✓ Input sanitization
- ✓ Idempotent tools where possible
Reliability
- ✓ Retry with exponential backoff
- ✓ Circuit breakers
- ✓ Timeout on every tool
- ✓ Graceful degradation
- ✓ Dead letter queue
- ✓ Error reporting to LLM
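The retry-with-exponential-backoff item above can be sketched as a small helper with jitter. The injectable `sleep` parameter is a testing convenience, not a standard API.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` on exception with exponential backoff plus jitter.
    Jitter spreads retries out so failing clients don't stampede."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # delay doubles each attempt, scaled by a 0.5-1.0 jitter factor
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            sleep(delay)

# Flaky call that succeeds on the third try
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry_with_backoff(flaky, sleep=lambda _: None)
```

In production, only retry errors you know are transient (timeouts, 429s, 5xx); retrying a non-idempotent write on an ambiguous failure is how duplicate side effects happen.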
Safety
- ✓ Tool allowlisting per user role
- ✓ Code execution sandboxing
- ✓ Network isolation
- ✓ Argument sanitization
- ✓ Cost budgets
- ✓ Prompt injection defense
Operations
- ✓ Distributed tracing on every call
- ✓ Token/cost dashboards
- ✓ Tool latency monitoring
- ✓ Eval suite in CI/CD
- ✓ Canary deployments
- ✓ Model version pinning
Agent Frameworks & Ecosystem
The landscape of production frameworks ranges from lightweight libraries to full platforms. Start simple, and choose based on your system's maturity level.
Framework Comparison Matrix
| Framework | Strengths | Gaps / Risks | Security Posture | Best-Fit Role |
|---|---|---|---|---|
| LangChain + LangGraph | Wide tool ecosystem, strong community, good documentation | Can be verbose, cost tracking not native, performance variable | Community-maintained, audit tools available | General-purpose agents, prototyping |
| LlamaIndex | Excellent RAG, semantic caching, fast indexing | Less multi-agent support, narrower tool set | Document-level security, integrations | Document-driven agents, knowledge systems |
| AutoGen | Multi-agent conversations, flexible role definitions | Unpredictable cost, hard to debug, no built-in guardrails | Limited access controls, relies on models | Research, complex collaborative tasks |
| MS Agent Framework | Enterprise-grade, strong security, durable execution | Steeper learning curve, Azure-dependent | Built-in RBAC, audit trails, compliance ready | Enterprise production, regulated industries |
| Semantic Kernel | Plugin model, cross-language support, .NET first | Smaller ecosystem than LangChain, less documentation | Microsoft ecosystem integration | .NET applications, Windows-first orgs |
| Ray Serve | Distributed scaling, low-latency serving, cost-aware | Operational overhead, requires Kubernetes knowledge | Network isolation, resource limits | High-volume production, multi-tenant SaaS |
| CrewAI | Simple role-based design, good for structured workflows | Early stage, smaller community, limited integrations | Depends on underlying models, basic tooling | Workflow-focused teams, structured tasks |
| Haystack | Modular pipelines, clear abstractions, good docs | Smaller community, less multi-agent tooling | Pipeline-level access control | Search & QA systems, modular pipelines |
| DSPy | Minimal, Pythonic, great for optimization | Limited built-in tools, requires more custom code | Simple surface = easy to audit | Research, custom agents, fine-tuning workflows |
Phased Implementation Roadmap
Production agents aren't built overnight. This phased approach helps you balance velocity with reliability.
Success Metrics & Deliverables
P1: Tools Inventory
Tool catalog doc, risk matrix, schema specs, RBAC roles defined
P2: MVP Deployed
Read-only agent live, tool gateway running, basic observability online
P3: Durable Workflows
Temporal/LangGraph checkpoints, ledger logging, compensation tests
P4: Eval Suite Live
Red-team results, eval CI/CD checks, security audit report
P5: Multi-Agent Ready
MCP integration, distributed tracing proven, cost model validated
Audit & Compliance Data Model
Enterprise agents require complete audit trails. This ER model captures every decision, auth check, and compensation action for compliance, debugging, and forensics.
AuditLogger Implementation
import json
from hashlib import sha256
from uuid import uuid4

def _hash(obj) -> str:
    # Stable hash over args/results so payloads are never stored raw
    return sha256(json.dumps(obj, sort_keys=True, default=str).encode()).hexdigest()

class AuditLogger:
    def __init__(self, db):
        self.db = db

    async def log_tool_call(self, call: ToolCall):
        # Insert TOOL_CALL record
        call_id = uuid4()
        await self.db.execute("""
            INSERT INTO TOOL_CALL (call_id, step_id, tool_name, args_hash, created_at)
            VALUES (?, ?, ?, ?, now())
        """, call_id, call.step_id, call.name, _hash(call.args))
        # Log auth decision
        await self.db.execute("""
            INSERT INTO AUTHZ_DECISION (decision_id, call_id, allowed, reason)
            VALUES (?, ?, ?, ?)
        """, uuid4(), call_id, call.allowed, call.authz_reason)
        return call_id

    async def log_tool_result(self, call_id, result, latency_ms):
        # Insert TOOL_RESULT record (redacted)
        await self.db.execute("""
            INSERT INTO TOOL_RESULT (result_id, call_id, result_hash, latency_ms)
            VALUES (?, ?, ?, ?)
        """, uuid4(), call_id, _hash(result), latency_ms)

    async def log_compensation(self, call_id, action):
        await self.db.execute("""
            INSERT INTO COMPENSATION_ACTION (comp_id, call_id, action, executed)
            VALUES (?, ?, ?, true)
        """, uuid4(), call_id, action)

    async def audit_trail(self, run_id):
        # Full audit trail for a run: all steps, calls, auth, results
        return await self.db.query("""
            SELECT s.step_num, t.tool_name, a.allowed, a.reason, r.latency_ms, c.action
            FROM STEP s
            JOIN TOOL_CALL t ON s.step_id = t.step_id
            LEFT JOIN AUTHZ_DECISION a ON t.call_id = a.call_id
            LEFT JOIN TOOL_RESULT r ON t.call_id = r.call_id
            LEFT JOIN COMPENSATION_ACTION c ON t.call_id = c.call_id
            WHERE s.run_id = ?
            ORDER BY s.step_num
        """, run_id)
Glossary of AI Agent Terms
13 key technical terms used throughout this guide.
A
| Term | Definition |
|---|---|
| Agent Loop (ReAct) | The Reasoning-Acting loop where an LLM reasons about a task, selects a tool, executes it, observes results, and iterates until complete. The core pattern for AI agents. |
| Agentic RAG | A RAG pattern where an LLM agent autonomously decides when to retrieve, which tools to call, and whether to iterate — orchestrating multi-step retrieval and reasoning. |
C
| Term | Definition |
|---|---|
| Chain-of-Thought (CoT) | A prompting technique that elicits step-by-step reasoning before the final answer, improving performance on complex multi-step tasks. |
D
| Term | Definition |
|---|---|
| Durable Execution | A workflow pattern (Temporal, Inngest) that persists agent state across failures, enabling automatic retries and recovery for long-running multi-step tasks. |
F
| Term | Definition |
|---|---|
| Function Calling | The LLM's ability to generate structured JSON tool invocations in response to a query, enabling it to interact with external APIs, databases, and services. |
G
| Term | Definition |
|---|---|
| Guardrails | Input/output validation rules that constrain agent behavior — preventing prompt injection, enforcing output schemas, and blocking unsafe tool invocations. |
H
| Term | Definition |
|---|---|
| Human-in-the-Loop (HITL) | A pattern where the agent pauses for human approval before executing high-risk actions (e.g., financial transactions, deletions). Critical for production safety. |
I
| Term | Definition |
|---|---|
| InjecAgent | An adversarial framework for testing tool-calling agents against prompt injection attacks that attempt to hijack tool invocations through malicious instructions. |
J
| Term | Definition |
|---|---|
| JSON Schema (Tool Schema) | The structured definition of a tool's name, description, and parameters that the LLM uses to understand what tools are available and how to invoke them. |
M
| Term | Definition |
|---|---|
| Multi-Agent Orchestration | Coordinating multiple specialized agents (e.g., researcher, coder, reviewer) that collaborate on complex tasks through message passing or shared state. |
P
| Term | Definition |
|---|---|
| Prompt Injection | An adversarial attack where malicious instructions are embedded in inputs to manipulate the LLM's behavior, potentially causing unauthorized tool invocations. |
S
| Term | Definition |
|---|---|
| Saga Pattern | A distributed transaction pattern for multi-step agent workflows where each step has a compensating action (rollback) if a later step fails. |
T
| Term | Definition |
|---|---|
| Tool Use | The ability of an LLM to call external functions, APIs, or services to perform actions beyond text generation — retrieving data, executing code, or modifying state. |