AI Agents — Advanced Tool Calling
Production patterns for building reliable, observable, and safe tool-using AI agents — from schema design through multi-agent orchestration to deployment.
Why Tool Calling Changes Everything
Tool calling is when LLMs generate structured function calls instead of free text. The model outputs a tool name + arguments, your code executes it, returns results, and the model continues. This unlocks deterministic, repeatable, and composable agent workflows.
Structured I/O
From text-in/text-out to deterministic function calls. The model outputs JSON matching your schema, not ambiguous natural language.
Multi-Step Reasoning
From single-turn to iterative loops. The agent thinks, acts, observes, and repeats until the goal is reached.
Integrated Systems
From isolated model to end-to-end system. Your code controls execution, validation, and error recovery.
Why Agents Need Tools
Real-Time Data
Access APIs, databases, and live information the model has no knowledge of.
Take Actions
Send emails, create records, modify systems, and execute business logic.
Precise Computation
Offload math, date calculations, and deterministic logic to code.
Private Systems
Connect to internal services, databases, and proprietary systems safely.
Not 'asking the AI to write code'. The model outputs JSON matching your schema, your code validates and executes. You're always in control.
Five Production Design Commitments
1. Capability Boundary
Curated tools with least-privilege, rate limits, and auditing. Never expose untrusted capabilities.
2. Reliable Substrate
Idempotency, retries, durable state checkpoints, and saga compensation patterns.
3. Grounded Loop
Schema validation + explicit grounding rules. Tool inputs must be validated before execution.
4. Observability
Tool-call spans, auth decisions, retries, and state checkpoints. Monitor actions, not just tokens.
5. Continuous Eval
Adversarial testing, InjecAgent benchmarks, and red-team suites in CI/CD.
Tool Calling Fundamentals
The Tool Calling Lifecycle
Complete Tool Calling Flow with Anthropic Claude API
import anthropic
client = anthropic.Anthropic()
tools = [{
"name": "get_weather",
"description": "Get current weather for a city",
"input_schema": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"units": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"}
},
"required": ["city"]
}
}]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
tools=tools,
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)
# Handle tool use response
for block in response.content:
if block.type == "tool_use":
result = execute_tool(block.name, block.input)
# Send result back to continue conversationTool Definition
Name, description, and JSON schema that defines the interface the model must follow.
Tool Selection
The model picks which tool to call based on the user query and available tools.
Tool Execution
Your code runs the selected tool with the model-provided arguments.
Result Injection
Feed results back to the model to continue reasoning or take next steps.
It outputs structured JSON saying "I want to call X with args Y". Your code is always in control. This is the core principle that makes tool calling safe and deterministic.
Tool Schema Design & Best Practices
Well-designed schemas dramatically improve model accuracy and reduce errors. The schema is the interface between your LLM and your system.
Well-Designed vs Poorly-Designed Schemas
"input_schema": {
"type": "object",
"properties": {
"date_range": {
"type": "object",
"properties": {
"start": {
"type": "string",
"format": "date",
"desc": "YYYY-MM-DD"
}
}
},
"limit": {
"type": "integer",
"minimum": 1,
"maximum": 100
}
},
"required": ["date_range"]
}"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string"
},
"options": {
"type": "string"
}
}
}
# No descriptions, no constraints,
# vague parameter names, string for
# everything, no validationSchema Validation Pipeline
Best Practices for Tool Schemas
1. Write Descriptions for the MODEL
Be explicit about format, constraints, and examples. The model reads the description to understand what you want.
2. Use Enums to Constrain Choices
Don't let the model invent values. Define allowed options explicitly in the schema.
3. Keep Tool Count Under 20
More tools = worse selection accuracy. Group related tools or use sub-actions.
4. Version Your Schemas
Breaking changes need migration. Track schema versions and deprecate gradually.
5. Include Examples in Descriptions
Show the model what good input looks like. "Example: 2024-03-15" is better than "ISO date format".
6. Validate Server-Side
NEVER trust model output. Validate all inputs, check ranges, verify enums, and handle edge cases.
Parameter Type Patterns
| Type | Pattern | Use Case |
|---|---|---|
| String with Regex | "pattern": "^[A-Z]{2}\\d{3}$" |
Codes, SKUs, phone numbers |
| Number with Bounds | "type": "number", "minimum": 0, "maximum": 100 |
Prices, percentages, counts |
| Enum | "enum": ["pending", "active", "archived"] |
Status, categories, choices |
| Array with Items | "type": "array", "items": {"type": "string"}, "maxItems": 10 |
Tags, email lists, IDs |
| Nested Object | "type": "object", "properties": {...}, "required": [...] |
Complex data, structured inputs |
Tool Interface Patterns
| Interface Pattern | Contract Format | Security Implications | Best-Fit Use Cases |
|---|---|---|---|
| REST+OpenAPI | OpenAPI 3.0 spec, HTTP verbs | Network isolation, TLS required, easy MITM if not HTTPS | External APIs, standard microservices |
| gRPC+Protobuf | Protocol Buffers, binary format | Strongly typed, harder to modify, TLS required | High-throughput internal services, low-latency |
| Provider Function Calling | Native code, type signatures | Direct code execution, trust model is critical | Same-process agents, embedded systems |
| MCP (Model Context Protocol) | JSON-RPC, tools as resources | Standardized, audit trails, capability discovery | Multi-agent systems, plugin architectures |
| Tool-Proxy/Firewall | Transparent proxy, schema validation | Input sanitization, rate-limiting, logging layer | Enterprise, compliance-heavy, zero-trust |
Agent Orchestration Patterns
Different patterns for different needs. Most production systems need Sequential Chain. Start simple.
Orchestration Pattern Comparison
| Pattern | When to Use | Complexity | Latency | Reliability |
|---|---|---|---|---|
| Single Tool | Simple tasks, one action per query | Low | Fast | High |
| Sequential Chain | Multi-step workflows with known order | Medium | Moderate | High |
| Router | Classification or dispatch to different tools | Medium | Fast | High |
| Autonomous Agent | Complex reasoning, unknown steps, exploration | High | Slow | Medium |
AgentOrchestrator Implementation (Autonomous Loop)
class AgentOrchestrator:
def __init__(self, max_iterations=10):
self.max_iterations = max_iterations
self.tool_registry = {}
self.conversation = []
def register_tool(self, name, tool_func, schema):
self.tool_registry[name] = {"func": tool_func, "schema": schema}
def run(self, query):
self.conversation = [{"role": "user", "content": query}]
for iteration in range(self.max_iterations):
# Get LLM response with tool definitions
response = call_llm(self.conversation, self.tool_registry)
if response.stop_reason == "end_turn":
return response.text
for block in response.content:
if block.type == "tool_use":
result = self.tool_registry[block.name]["func"](block.input)
self.conversation.append({"role": "user", "content": result})
return "Max iterations reached"Most production use cases need Sequential Chain, not full autonomous agents. Autonomous loops are harder to debug, slower, and more expensive. Use them only when the task requires true exploration.
Agent Skill Map — Capabilities & Tool Taxonomy
A production agent system is only as powerful as its tools. This skill map charts the full landscape of agent capabilities, the tool categories that enable them, and how they compose into real-world workflows.
Information Retrieval
The foundation of any useful agent — fetching context from the world.
| Tool | Use Case | Complexity |
|---|---|---|
| Web Search | Real-time facts, news, current events | Low |
| RAG / Vector Search | Private knowledge base Q&A | Medium |
| SQL Query | Structured data: metrics, reports, analytics | Medium |
| API Fetch | Live system data (weather, stock, status) | Low |
| Doc Reader | PDF/DOCX parsing, contract analysis | Medium |
Code & Computation
Extends the agent beyond language into precise computation and system control.
| Tool | Use Case | Complexity |
|---|---|---|
| Code Exec | Data analysis, plotting, scripting | High |
| Shell/CLI | File ops, git, system admin | High |
| Calculator | Precise math, financial calcs | Low |
| Data Transform | CSV cleaning, JSON reshape, ETL | Medium |
| Notebook | Interactive data exploration | High |
Actions & Integrations
Where agents become truly useful — taking actions in real systems on behalf of users.
| Tool | Use Case | Risk Level |
|---|---|---|
| Draft, send, search, summarize | Medium | |
| Calendar | Schedule, reschedule, check availability | Medium |
| Messaging | Post updates, respond, search channels | Medium |
| Ticketing | Create, update, assign, close | Low |
| CRM | Update contacts, log activities | High |
| Git | Commit, create PRs, review code | High |
Skill Composition — Mapping Use Cases to Tool Sets
Real agent tasks rarely need a single tool. Here's how skills compose for common production use cases:
| Use Case | Required Skills | Tool Chain | Orchestration |
|---|---|---|---|
| Research Assistant | Retrieval Compute | Web Search → Document Reader → Summarize → Write Report | Sequential Chain |
| Data Analyst | Retrieval Compute Generate | SQL Query → Code Exec (pandas) → Chart Builder → Slide Deck | Sequential + Parallel |
| Customer Support | Retrieval Actions Memory | RAG Search → CRM Lookup → Ticket Create → Email Draft | Router + Sequential |
| DevOps Copilot | Compute Retrieval Actions | Log Search → Shell Exec → Runbook Lookup → Slack Alert | ReAct (Autonomous) |
| Meeting Prep Agent | Retrieval Actions Generate | Calendar Check → CRM Lookup → Web Search → Doc Writer | Sequential Chain |
| Code Review Agent | Compute Retrieval Actions | Git Diff → Code Exec (tests) → Style Check → PR Comment | Parallel + Sequential |
Tool Design Principles
# The UNIX philosophy for agent tools:
# 1. Do one thing well
# 2. Composable inputs/outputs
# 3. Fail loudly with clear errors
# 4. Idempotent where possible
# 5. Observable (logs, metrics, traces)
class ToolDesignChecklist:
single_responsibility: bool # One action per tool
clear_description: bool # LLM can understand when to use
typed_schema: bool # JSON Schema with constraints
error_messages: bool # Actionable, not cryptic
idempotent: bool # Safe to retry
timeout: bool # Never hangs
rate_limited: bool # Protects downstream
permission_scoped: bool # Least privilege
observable: bool # Emits traces/metricsTool Risk Tiers & Approval Gates
Not all tools carry equal risk. Tier them and enforce appropriate approval flows:
Read-Only (Auto-approve)
Search, lookup, calculate, read. No side effects. Safe for autonomous execution.
Reversible Write (Soft-approve)
Draft email, create ticket, update status. Can be undone. Log + notify, execute with confirmation for sensitive data.
Irreversible Action (Human-approve)
Send email, publish post, execute payment, delete record. Require explicit human confirmation before execution.
Privileged / Admin (Never auto-approve)
Access controls, billing changes, system config, code deploy. Always require authenticated human approval with audit trail.
ReAct & Reasoning Loops
ReAct (Reason + Act) is the pattern for agentic behavior: Think → Act → Observe → Repeat. The agent reasons about what to do, takes action, observes results, and continues.
ReActAgent Implementation
class ReActAgent:
def __init__(self, llm, tools, max_steps=10):
self.llm = llm
self.tools = {t.name: t for t in tools}
self.max_steps = max_steps
self.scratchpad = []
async def run(self, query):
for step in range(self.max_steps):
# Think: reason about what to do next
thought = await self.llm.think(query, self.scratchpad)
self.scratchpad.append(("thought", thought))
if thought.action == "finish":
return thought.answer
# Act: execute the chosen tool
tool = self.tools[thought.tool_name]
result = await tool.execute(thought.tool_args)
# Observe: record the result
self.scratchpad.append(("observation", result))
return "Max steps reached — could not complete"ReAct vs Other Approaches
ReAct (Reason + Act)
Explicit think-act-observe loop. Best for complex tasks requiring exploration and recovery.
Chain-of-Thought
Reasoning without action. Better for analysis but can't execute tasks or access new data.
Plan-and-Execute
Create a plan first, then execute. Good for structured workflows, poor for adaptive tasks.
LATS (Language Agent Tree Search)
Tree search over actions. Expensive but explores multiple paths for complex problems.
Planning Algorithms Comparison
| Paradigm | Core Idea | Strengths | Weaknesses |
|---|---|---|---|
| ReAct | Reason → Act → Observe → Repeat | Flexible, adaptive, handles new situations, easy to debug | Can be verbose, slower on simple tasks |
| Plan-and-Solve | Create plan first, then execute steps | Structured, good for complex tasks, reduces errors | Brittle when plan becomes stale, poor for exploration |
| Tree of Thoughts | Explore multiple reasoning branches | Finds better solutions, good for reasoning, backtracking | High token cost, slow, overkill for simple tasks |
| Reflexion | Agent self-reflects on failures | Learns from mistakes, improves over iterations | Requires many iterations, slow convergence |
| Toolformer | Model decides when to call tools | Minimal tokens, fast, uses tools sparingly | Requires fine-tuning, less transparent reasoning |
| Graph of Thoughts | Reasoning as directed acyclic graph | Expresses complex dependencies, good for workflows | Complex to implement, expensive to evaluate |
Parallel & Multi-Tool Execution
Modern LLMs can request multiple tools at once. Instead of waiting for each result, execute them in parallel for 3-5x latency improvement.
ParallelToolExecutor with asyncio.gather
class ParallelToolExecutor:
async def execute_batch(self, tool_calls, timeout=30):
tasks = [
asyncio.wait_for(
self._execute_one(call),
timeout=timeout
)
for call in tool_calls
]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Handle partial failures gracefully
return [
ToolResult(call.id, result) if not isinstance(result, Exception)
else ToolResult(call.id, error=str(result))
for call, result in zip(tool_calls, results)
]Parallel Execution Benefits & Challenges
Benefits
Latency: 3-5x faster for multi-step tasks. Throughput: Execute 10 tools simultaneously instead of sequentially.
Challenges
Ordering: Dependency handling (Tool B needs Tool A result). Cost: Fan-out multiplies API calls.
Error Handling
Partial failures (1 of 3 fails). Timeouts on slow tools. Always use return_exceptions=True.
Execution Strategies
| Strategy | Use Case | Latency | Complexity |
|---|---|---|---|
| Serial | Tool B depends on Tool A result | Slowest | Simple |
| Parallel | Independent tools (most common) | 3-5x faster | Moderate |
| DAG-Based | Complex dependency graphs | Optimal | High |
| Speculative | Execute multiple paths, pick best | Variable | Very High |
Multi-Agent Delegation
Delegate complex tasks to specialized sub-agents. Supervisor routes work, workers execute independently, results compose back up.
SupervisorAgent Router Implementation
class SupervisorAgent:
def __init__(self):
self.workers = {
"research": ResearchAgent(tools=[web_search, doc_reader]),
"code": CodeAgent(tools=[code_exec, file_write]),
"data": DataAgent(tools=[sql_query, chart_gen]),
}
async def handle(self, task):
plan = await self.planner.decompose(task)
results = {}
for step in plan.steps:
worker = self.workers[step.agent]
results[step.id] = await worker.execute(step.instruction, context=results)
return await self.synthesizer.combine(results)Multi-Agent Orchestration Patterns
Supervisor / Worker
One coordinator routes tasks to specialized workers. Best for diverse skill domains. Easy to scale.
Peer-to-Peer
Agents communicate directly. More flexible but harder to debug. Good for negotiation.
Hierarchical
Tree of agents. Scales better for large teams. Messaging overhead increases.
Swarm
Decentralized with local rules. Complex behaviors from simple agents. Hard to predict.
Error Handling & Recovery
Real agents fail. Your job is to define what's retryable, what's not, and how the model can recover.
ResilientToolExecutor with Decorators
class ResilientToolExecutor:
@retry(max_attempts=3, backoff=exponential(base=2))
@circuit_breaker(failure_threshold=5, recovery_timeout=60)
@timeout(seconds=30)
async def execute(self, tool_name, args):
try:
result = await self.tools[tool_name].run(args)
self.metrics.record_success(tool_name)
return result
except ValidationError as e:
return ToolError(f"Invalid args: {e}", retryable=False)
except RateLimitError:
raise # Let retry decorator handle
except TimeoutError:
self.metrics.record_timeout(tool_name)
return ToolError("Tool timed out", retryable=True)Error Taxonomy
Tool Execution Failure
API unreachable, permission denied, network timeout. Often retryable with backoff.
Schema Validation
LLM outputs invalid args. Not retryable at executor level—return to LLM to correct.
Timeout / Rate Limit
Tool slow or API throttled. Retryable with exponential backoff and jitter.
Durable Execution & Saga Patterns
Production systems require durability. If an agent crashes mid-execution, you must be able to resume. Use idempotency keys, durable state checkpoints, and compensation patterns.
Temporal Idempotent Activities
Use Temporal workflows to define durable, idempotent tool executions. If a tool call succeeds but the workflow crashes, Temporal retries from the checkpoint—not from scratch.
Stripe-Style Idempotency Keys
For external APIs, include idempotency keys in request headers. If the same request is sent twice, the server returns cached result—no duplicate charge.
Saga Compensation Pattern
For multi-step workflows, define compensating actions (rollbacks). If step 3 fails, execute reverse of step 2, then step 1. AWS Prescriptive Guidance reference.
Workflow Ledger
Persist every tool call and result to a ledger. Enables auditing, replay, and recovery. Each entry is immutable and timestamped.
SagaExecutor with Compensation
class SagaExecutor:
def __init__(self, ledger):
self.ledger = ledger
self.compensations = []
async def execute_saga(self, steps):
for step in steps:
try:
# Add idempotency key to request
idempotency_key = uuid4()
result = await self.execute_with_key(
step.tool, step.args, idempotency_key
)
# Log to ledger
await self.ledger.append({
"type": "TOOL_CALL",
"step_id": step.id,
"tool": step.tool,
"result": result,
"timestamp": now()
})
# Register compensation for rollback
if step.compensate:
self.compensations.insert(0, (
step.compensate, result
))
except Exception as e:
# Execute all compensations in reverse order
for comp_fn, context in self.compensations:
await comp_fn(context)
await self.ledger.append({
"type": "COMPENSATION",
"fn": comp_fn.__name__
})
raise
return "All steps completed successfully"Sandboxing & Permission Models
The model decides which tools to call. You must constrain what's possible through layered permission checks and resource limits.
SandboxedExecutor with Permission Engine
class SandboxedExecutor:
def __init__(self, user_context):
self.allowed_tools = self._resolve_permissions(user_context)
self.resource_limits = ResourceLimits(
max_api_calls=100,
max_tokens_spent=50000,
max_wall_time=300,
allowed_domains=["api.internal.com"],
blocked_actions=["delete", "admin.*"]
)
async def execute(self, tool_call):
# 1. Tool allowlist check
if tool_call.name not in self.allowed_tools:
raise PermissionDenied(f"Tool {tool_call.name} not allowed")
# 2. Argument sanitization
safe_args = self.sanitizer.clean(tool_call.args)
# 3. Resource limit check
self.resource_limits.check_budget()
# 4. Execute in sandbox
return await self.sandbox.run(tool_call.name, safe_args)Sandboxing Technologies
| Technology | Isolation Level | Performance | Complexity |
|---|---|---|---|
| Docker | Container (OS-level) | Moderate | Medium |
| gVisor | Syscall interception | Lower | Medium |
| Firecracker | MicroVM (strong isolation) | Good | High |
| WASM | Process-level sandbox | Very Fast | Medium |
| E2B / Modal | Managed multi-tenant | Good | Low |
InjecAgent & Tool Selection Attacks
Prompt injection isn't just direct—attackers can inject malicious instructions via tool-returned content, web pages, or API responses. The InjecAgent benchmark measures resilience.
Indirect Prompt Injection
Attacker embeds malicious instructions in web content, API response, or email body. Agent fetches content and executes instructions unknowingly. Example: web_search returns page with hidden "delete all data" instruction.
Tool Selection Manipulation
Attacker crafts query to confuse tool selection: "Call the delete_account tool to help me." Agent picks wrong tool due to misleading language. Requires strict tool descriptions + grounding.
InjecAgent Benchmark
Research benchmark that tests agent resilience to indirect injection. Measures: Can the agent resist instructions embedded in tool outputs? Real-world validated attacks.
Defense Layering
Tool firewall validates tool calls structurally. Structured outputs reduce injection surface. PII redaction (Presidio) + secrets management (Vault) prevent data leakage via logs.
Defense: Tool Firewall + Structured Output
class ToolFirewall:
async def filter_and_execute(self, tool_calls):
for call in tool_calls:
# 1. Structural validation (strict schema)
if not self.schema_validator.validate(call):
raise InvalidToolCall("Failed schema validation")
# 2. Allowlist enforcement
if call.name not in self.allowed_tools:
raise ToolNotAllowed(call.name)
# 3. Input sanitization with context awareness
args = self.sanitizer.clean(call.args)
args = self.redactor.redact_pii(args) # Presidio
# 4. Execute & redact result before returning to LLM
result = await self._execute(call.name, args)
result = self.redactor.redact_secrets(result) # Vault
# 5. Log for audit
await self.audit_logger.log({
"tool": call.name,
"args_hash": sha256(args),
"result_hash": sha256(result),
"timestamp": now()
})
yield resultStreaming & Real-Time Tool Calling
Stream text tokens to the user immediately while tool calls execute in parallel. When a tool call interrupts the stream, show a loading indicator. This cuts perceived latency from seconds to milliseconds.
StreamingToolHandler Implementation
class StreamingToolHandler:
async def stream_with_tools(self, messages, tools):
async with self.client.messages.stream(
model="claude-sonnet-4-20250514",
messages=messages, tools=tools, max_tokens=4096
) as stream:
collected_content = []
async for event in stream:
if event.type == "content_block_start":
if event.content_block.type == "tool_use":
# Tool call detected mid-stream
tool_input = await self._collect_tool_input(stream)
result = await self.executor.execute(
event.content_block.name, tool_input
)
yield ToolResultEvent(result)
else:
yield TextStartEvent()
elif event.type == "content_block_delta":
if hasattr(event.delta, 'text'):
yield TextDeltaEvent(event.delta.text)Real-Time Communication Patterns
Server-Sent Events
Browser native streaming. Perfect for web clients and long-lived connections.
WebSocket
Bidirectional real-time communication. Best for interactive, multi-turn streams.
Token Rendering
Display each token as it arrives. Users see response forming in real-time.
Latency Hiding
Mask tool execution behind visible text. Perceived latency drops dramatically.
Observability & Tracing
Trace every LLM call, every tool selection, every tool execution, every result, every error. Collect token counts, latencies, and costs. Without observability, you can't debug or optimize.
OpenTelemetry Integration
from opentelemetry import trace
tracer = trace.get_tracer("agent")
class TracedAgent:
@tracer.start_as_current_span("agent.run")
async def run(self, query):
span = trace.get_current_span()
span.set_attribute("query", query[:200])
for step in range(self.max_steps):
with tracer.start_span("llm.call") as llm_span:
response = await self.llm.generate(...)
llm_span.set_attribute("model", self.model)
llm_span.set_attribute("tokens.input", response.usage.input)
llm_span.set_attribute("tokens.output", response.usage.output)
if response.has_tool_use:
with tracer.start_span(f"tool.{tool_name}") as tool_span:
result = await self.execute_tool(...)
tool_span.set_attribute("tool.success", not result.is_error)Key Metrics Dashboard
LLM Calls
Calls per query, latency distribution (P50/P95/P99), token usage trends.
Tool Selection
Accuracy rate, which tools selected most, tool selection errors.
Costs & Budget
Token usage & cost per query, cost per tool, budget tracking.
Tool Latency
Execution time per tool, P50/P95/P99 distribution, bottleneck tools.
Error Tracking
Error rate by tool, error types, error recovery success.
Tools: LangSmith, Arize Phoenix, OpenTelemetry
Datadog, Braintrust, custom observability backends.
Testing & Evaluation
Build a testing pyramid: Unit tests for tool functions → Integration tests for tool calling flow → Agent eval suites for end-to-end task completion → Red team for adversarial testing.
AgentEvaluator Implementation
class AgentEvaluator:
def __init__(self, agent, eval_set):
self.agent = agent
self.eval_set = eval_set # [(query, expected_tools, expected_answer)]
async def run_eval(self):
results = []
for query, expected_tools, expected_answer in self.eval_set:
trace = await self.agent.run_traced(query)
results.append(EvalResult(
tool_selection_accuracy=self._check_tools(trace, expected_tools),
answer_correctness=self._check_answer(trace.answer, expected_answer),
steps_taken=len(trace.steps),
total_tokens=trace.total_tokens,
latency_ms=trace.duration_ms,
))
return EvalReport(results)Evaluation Metrics & Patterns
Deterministic Tests
Tool functions must be deterministic. Unit test each tool independently.
LLM-as-Judge
Use another LLM to grade answer quality. Good for subjective tasks.
Regression Testing
Run full eval suite on model upgrades. Lock in baseline before changes.
Adversarial Testing
Prompt injection, jailbreak attempts, malformed inputs, edge cases.
Cost Control & Optimization
LLM tokens drive costs: input context grows with each step. Tool execution, retries, and multi-agent fan-out add up. Set budgets, track spend, optimize aggressively.
CostAwareAgent with Budgets
class CostAwareAgent:
def __init__(self, budget_tokens=50000, budget_usd=0.10):
self.token_budget = budget_tokens
self.usd_budget = budget_usd
self.tokens_used = 0
self.cost_usd = 0.0
async def run(self, query):
for step in range(self.max_steps):
if self.tokens_used > self.token_budget * 0.9:
return self._force_answer("Approaching token budget")
response = await self.llm.generate(...)
self.tokens_used += response.usage.total_tokens
self.cost_usd += self._calculate_cost(response.usage)
if self.cost_usd > self.usd_budget:
return self._force_answer("Cost budget exceeded")Optimization Strategies
Prompt Caching
Reuse system prompts and tool definitions across queries. Dramatic savings on repeated context.
Context Pruning
Summarize old steps and tool results. Keep only recent, relevant context.
Model Tiering
Cheap model for routing. Strong model only for final answer synthesis.
Tool Result Caching
Cache tool outputs (weather, stock prices, search results) aggressively.
Production Deployment
Stateless agent services behind a load balancer. Store conversation state in Redis or a database. Trace every call. Monitor tool health. Feature flag new tools. Enable zero-downtime deployments.
Production Best Practices
Stateless Design
Agent services are ephemeral. Store conversation state in Redis or a database for horizontal scaling.
Tool Health Checks
Monitor tool dependencies. Circuit breakers for degraded tools. Graceful degradation.
Feature Flags
Gradually roll out new tools. Feature flag new behaviors. Easy rollback.
Canary Deployments
5% traffic to new version. Progressive 25% → 50% → 100%. Zero downtime.
Production Readiness Checklist
Before you ship, verify all four categories. The best agent architectures are boring: simple orchestration, reliable tools, comprehensive monitoring, and strict safety boundaries.
Tool Design
- ✓ JSON Schema validation
- ✓ Descriptions optimized for models
- ✓ Versioned schemas
- ✓ <20 tools per request
- ✓ Input sanitization
- ✓ Idempotent tools where possible
Reliability
- ✓ Retry with exponential backoff
- ✓ Circuit breakers
- ✓ Timeout on every tool
- ✓ Graceful degradation
- ✓ Dead letter queue
- ✓ Error reporting to LLM
Safety
- ✓ Tool allowlisting per user role
- ✓ Code execution sandboxing
- ✓ Network isolation
- ✓ Argument sanitization
- ✓ Cost budgets
- ✓ Prompt injection defense
Operations
- ✓ Distributed tracing on every call
- ✓ Token/cost dashboards
- ✓ Tool latency monitoring
- ✓ Eval suite in CI/CD
- ✓ Canary deployments
- ✓ Model version pinning
Agent Frameworks & Ecosystem
Landscape of production frameworks ranges from lightweight libraries to full platforms. Start simple. Choose based on your system's maturity level.
Framework Comparison Matrix
| Framework | Strengths | Gaps / Risks | Security Posture | Best-Fit Role |
|---|---|---|---|---|
| LangChain + LangGraph |
Wide tool ecosystem, strong community, good documentation | Can be verbose, cost tracking not native, performance variable | Community-maintained, audit tools available | General-purpose agents, prototyping |
| LlamaIndex | Excellent RAG, semantic caching, fast indexing | Less multi-agent support, narrower tool set | Document-level security, integrations | Document-driven agents, knowledge systems |
| AutoGen | Multi-agent conversations, flexible role def | Unpredictable cost, hard to debug, no built-in guardrails | Limited access controls, relies on models | Research, complex collaborative tasks |
| MS Agent Framework |
Enterprise-grade, strong security, durable execution | Steeper learning curve, Azure-dependent | Built-in RBAC, audit trails, compliance ready | Enterprise production, regulated industries |
| Semantic Kernel |
Plugin model, cross-language support, .NET first | Smaller ecosystem than LangChain, less documentation | Microsoft ecosystem integration | .NET applications, Windows-first orgs |
| Ray Serve | Distributed scaling, low-latency serving, cost-aware | Operational overhead, requires Kubernetes knowledge | Network isolation, resource limits | High-volume production, multi-tenant SaaS |
| CrewAI | Simple role-based design, good for structured workflows | Early stage, smaller community, limited frameworks | Depends on underlying models, basic tooling | Workflow-focused teams, structured tasks |
| Haystack | Modular pipelines, clear abstractions, good docs | Smaller community, less multi-agent tooling | Pipeline-level access control | Search & QA systems, modular pipelines |
| DSPy | Minimal, Pythonic, great for optimization | Limited built-in tools, requires more custom code | Simple surface = easy to audit | Research, custom agents, fine-tuning workflows |
Phased Implementation Roadmap
Production agents aren't built overnight. This phased approach helps you balance velocity with reliability.
Success Metrics & Deliverables
P1: Tools Inventory
Tool catalog doc, risk matrix, schema specs, RBAC roles defined
P2: MVP Deployed
Read-only agent live, tool gateway running, basic observability online
P3: Durable Workflows
Temporal/LangGraph checkpoints, ledger logging, compensation tests
P4: Eval Suite Live
Red-team results, eval CI/CD checks, security audit report
P5: Multi-Agent Ready
MCP integration, distributed tracing proven, cost model validated
Audit & Compliance Data Model
Enterprise agents require complete audit trails. This ER model captures every decision, auth check, and compensation action for compliance, debugging, and forensics.
AuditLogger Implementation
class AuditLogger:
def __init__(self, db):
self.db = db
async def log_tool_call(self, call: ToolCall):
# Insert TOOL_CALL record
call_id = uuid4()
await self.db.execute("""
INSERT INTO TOOL_CALL (call_id, step_id, tool_name, args_hash, created_at)
VALUES (?, ?, ?, ?, now())
""", call_id, call.step_id, call.name, sha256(call.args))
# Log auth decision
await self.db.execute("""
INSERT INTO AUTHZ_DECISION (decision_id, call_id, allowed, reason)
VALUES (?, ?, ?, ?)
""", uuid4(), call_id, call.allowed, call.authz_reason)
return call_id
async def log_tool_result(self, call_id, result, latency_ms):
# Insert TOOL_RESULT record (redacted)
await self.db.execute("""
INSERT INTO TOOL_RESULT (result_id, call_id, result_hash, latency_ms)
VALUES (?, ?, ?, ?)
""", uuid4(), call_id, sha256(result), latency_ms)
async def log_compensation(self, call_id, action):
await self.db.execute("""
INSERT INTO COMPENSATION_ACTION (comp_id, call_id, action, executed)
VALUES (?, ?, ?, true)
""", uuid4(), call_id, action)
async def audit_trail(self, run_id):
# Full audit trail for a run: all steps, calls, auth, results
return await self.db.query("""
SELECT s.step_num, t.tool_name, a.allowed, a.reason, r.latency_ms, c.action
FROM STEP s
JOIN TOOL_CALL t ON s.step_id = t.step_id
LEFT JOIN AUTHZ_DECISION a ON t.call_id = a.call_id
LEFT JOIN TOOL_RESULT r ON t.call_id = r.call_id
LEFT JOIN COMPENSATION_ACTION c ON t.call_id = c.call_id
WHERE s.run_id = ?
ORDER BY s.step_num
""", run_id)AI Agents — Advanced Tool Calling
Production Patterns • 20 Sections • Architecture Diagrams • Code Examples