Building a Voice Agent
End-to-end technical guide — from microphone input to spoken response, covering STT, NLU, LLM, TTS, real-time streaming, telephony, and production deployment.
01 Overview
A voice agent is an AI system that listens to human speech, understands intent, reasons over context, and responds with natural-sounding speech — all in real time. Modern voice agents combine automatic speech recognition (ASR/STT), large language models (LLMs), and neural text-to-speech (TTS) into a low-latency pipeline.
Key Challenges
- Latency — Humans expect sub-second responses; every millisecond matters
- Interruption handling — Users barge-in mid-sentence; agent must stop and listen
- Ambient noise — Real-world audio is noisy; robust VAD and ASR needed
- Turn-taking — Detecting when the user has finished speaking (endpointing)
- Naturalness — TTS must sound human, with proper prosody and emotion
- Context retention — Multi-turn conversations require persistent memory
- Concurrent calls — Production systems handle thousands of simultaneous calls
02 System Architecture
The pipeline chains VAD → STT → NLU → LLM → TTS over a bi-directional streaming transport (WebSocket, WebRTC, or SIP). For telephony deployments, the TTS stage can emit ulaw_8000 directly for Twilio, with zero transcoding overhead.
Component Responsibilities
| Component | Role | Latency Target |
|---|---|---|
| VAD | Detect when user is speaking vs silence/noise | <10ms |
| STT / ASR | Convert audio stream to text (transcription) | 50–300ms |
| NLU | Extract intent, entities, sentiment from text | 10–50ms |
| LLM | Generate contextual response (reasoning engine) | 200–800ms (first token) |
| TTS | Convert response text to audio waveform | 50–200ms (first byte) |
| Transport | Bi-directional audio streaming (WebSocket/WebRTC/SIP) | <50ms |
03 Voice Pipeline (Step by Step)
# Pseudocode: Streaming voice pipeline
async def voice_pipeline(audio_stream):
    # Stage 1: VAD → filter silence
    speech_chunks = vad.filter(audio_stream)

    # Stage 2: Streaming STT → interim + final transcripts
    async for transcript in stt.transcribe_stream(speech_chunks):
        if transcript.is_final:
            # Stage 3: Stream LLM response token by token
            sentence_buffer = ""
            async for token in llm.stream(transcript.text, context):
                sentence_buffer += token
                # Stage 4: Send complete sentences to TTS
                if ends_with_punctuation(sentence_buffer):
                    async for audio in tts.synthesize_stream(sentence_buffer):
                        yield audio  # → speaker
                    sentence_buffer = ""
04 Latency Budget
Voice agents are latency-critical. Humans perceive pauses >600ms as unnatural. The target is <1 second from end of user speech to beginning of agent speech.
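To make the sub-one-second target concrete, here is a quick sanity check that sums illustrative per-stage budgets. The stage names and figures below are assumptions chosen for the sketch, not measurements:

```python
# Illustrative latency budget for voice-to-voice response time.
# All figures are mid-range assumptions, not benchmark results.
BUDGET_MS = {
    "endpointing_silence": 300,  # VAD silence threshold before "final"
    "stt_finalize": 100,         # streaming STT emits the final transcript
    "llm_first_token": 400,      # time to first LLM token
    "tts_first_byte": 100,       # time to first synthesized audio byte
    "transport": 50,             # network hops + playout buffering
}

def total_latency_ms(budget: dict) -> int:
    """End-to-end delay from end of user speech to start of agent audio."""
    return sum(budget.values())

print(total_latency_ms(BUDGET_MS))  # → 950
```

At 950ms this hypothetical budget just fits under the 1-second target, which is why the endpointing threshold and LLM first-token time are the first knobs to tune.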
Latency Optimization Techniques
| Technique | Savings | How |
|---|---|---|
| Streaming STT | 200–500ms | Don't wait for end-of-utterance; use interim results |
| LLM streaming | 500ms+ | Start TTS on first sentence, not full response |
| TTS streaming | 200–400ms | Begin audio playback before full synthesis completes |
| Sentence-level TTS | 100–300ms | Buffer LLM tokens into sentences for TTS chunks |
| Speculative prefill | 100–200ms | Start LLM prompt while STT is still finalizing |
| Semantic caching | 300–700ms | Cache responses for common queries |
| Edge deployment | 50–150ms | Co-locate STT/TTS near users (reduce network hops) |
| Shorter endpointing | 100–200ms | Tune VAD silence threshold (risk: premature cutoff) |
| Warm connections | 50–100ms | Keep persistent connections to STT/LLM/TTS APIs |
05 Speech-to-Text (STT / ASR)
Automatic Speech Recognition converts audio waveforms into text. For voice agents, streaming STT is essential — results must arrive incrementally as the user speaks.
Key STT Concepts
- Streaming vs Batch — Streaming gives interim results in real-time; batch processes complete files
- Interim (partial) results — Unstable text that updates as more audio arrives
- Final results — Stable transcript after endpointing detects end of utterance
- Endpointing — Detecting when the user has stopped speaking (silence duration)
- Word-level timestamps — Timing for each word (useful for alignment and analytics)
- Speaker diarization — Identifying different speakers in multi-party audio
- Custom vocabulary — Boost recognition of domain-specific terms
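Interim and final results need different handling inside the agent: interims may change on every update, while finals are stable and safe to commit. A hypothetical aggregator (all names below are illustrative) keeps the two apart:

```python
# Sketch: accumulate a user turn from streaming STT results.
# Interim results only overwrite a pending buffer; finals are committed.
class TranscriptAggregator:
    def __init__(self):
        self.committed = []  # finalized segments for this turn
        self.pending = ""    # latest interim text (may still change)

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            if text:
                self.committed.append(text)
            self.pending = ""    # any interim is superseded by the final
        else:
            self.pending = text  # replace, never append: interims are unstable

    def turn_text(self) -> str:
        """Best current view of the user's turn (finals + latest interim)."""
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)

agg = TranscriptAggregator()
agg.on_result("check my", is_final=False)
agg.on_result("check my order", is_final=True)
agg.on_result("number", is_final=False)
print(agg.turn_text())  # → "check my order number"
```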
06 STT Engines Compared
| Engine | Type | Streaming | Latency | Best For |
|---|---|---|---|---|
| Deepgram | Cloud API | Yes (WebSocket) | ~100ms | Lowest latency, voice agents |
| Google Cloud STT | Cloud API | Yes (gRPC) | ~200ms | Multi-language, enterprise |
| Azure Speech | Cloud API | Yes (WebSocket) | ~150ms | Microsoft ecosystem |
| AWS Transcribe | Cloud API | Yes (WebSocket) | ~250ms | AWS ecosystem |
| AssemblyAI | Cloud API | Yes (WebSocket) | ~200ms | Accuracy, LeMUR integration |
| OpenAI Whisper | Open-source / API | No (batch only) | 1–5s | Accuracy, self-hosted, offline |
| Whisper.cpp | Open-source (C++) | Pseudo-stream | ~500ms | Edge/local deployment |
| Faster-Whisper | Open-source (CTranslate2) | No | ~300ms | Fast self-hosted batch |
| Vosk | Open-source | Yes | ~200ms | Offline, lightweight, edge |
# Deepgram Streaming STT (WebSocket)
import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

dg = DeepgramClient(api_key="YOUR_KEY")
connection = dg.listen.asyncwebsocket.v("1")

async def on_message(self, result, **kwargs):
    transcript = result.channel.alternatives[0].transcript
    is_final = result.is_final
    if is_final and transcript:
        print(f"Final: {transcript}")
        # → Send to LLM

connection.on(LiveTranscriptionEvents.Transcript, on_message)

options = LiveOptions(
    model="nova-2",
    language="en",
    encoding="linear16",
    sample_rate=16000,
    interim_results=True,
    endpointing=300,  # ms of silence before final
    smart_format=True,
    vad_events=True,
)
await connection.start(options)

# Send audio chunks (20ms frames)
async for chunk in mic_stream:
    connection.send(chunk)
06A Why Deepgram — Deep Dive
Deepgram is the recommended STT engine for production voice agents. Here's a detailed analysis of why it outperforms alternatives for real-time conversational AI.
Why Deepgram Over Alternatives
| Criteria | Deepgram | Google Cloud STT | Whisper (OpenAI) | Azure Speech |
|---|---|---|---|---|
| Streaming Latency | ~100ms (best-in-class) | ~200ms | N/A (batch only) | ~150ms |
| Native WebSocket | Yes (first-class) | gRPC only | No | Yes |
| Built-in Endpointing | Yes (configurable ms) | Limited | No | Yes |
| Built-in VAD Events | Yes | No | No | Limited |
| Word-level Timestamps | Yes | Yes | Yes | Yes |
| Smart Formatting | Auto (numbers, dates, currency) | Manual config | No | Yes |
| Cost (per minute) | $0.0043/min (~$0.26/hr) | $0.024/min | $0.006/min (API) | $0.016/min |
| Custom Vocabulary | Keywords + model training | Phrase hints | Prompt only | Phrase lists |
| Voice Agent Optimized | Yes (Nova-2 model) | General purpose | General purpose | General purpose |
Key Reasons to Choose Deepgram
- Lowest streaming latency in the industry (~100ms) — Deepgram's end-to-end deep learning model is purpose-built for real-time. Unlike traditional ASR pipelines (acoustic model → language model → decoder), Deepgram uses a single neural network that processes audio directly, eliminating inter-stage latency.
- Native WebSocket API designed for voice agents — Deepgram's primary API is a persistent WebSocket connection that accepts raw audio frames and returns JSON transcripts. This is exactly what voice agents need — no gRPC complexity (Google), no REST polling (Whisper), no SDK abstraction overhead.
- Built-in endpointing and VAD events — Deepgram detects when users stop speaking and emits speech_final and utterance_end events with configurable silence thresholds. Other STT engines require you to implement VAD and endpointing separately.
- Smart formatting out of the box — Automatically formats numbers ("three hundred" → "300"), dates, currency, and punctuation. This means the text sent to the LLM is clean and structured without post-processing.
- Cost-effective at scale — At $0.0043/minute for Nova-2, Deepgram is 4–6x cheaper than Google Cloud STT and Azure Speech, which matters significantly when handling thousands of concurrent calls.
- Nova-2 model specifically optimized for conversational speech — Unlike Whisper (optimized for transcription accuracy on long-form audio), Nova-2 is trained on conversational, real-time speech patterns with lower word error rates on voice agent dialogue.
06B Deepgram Features for Voice Agents
Endpointing Configuration
Fine-tune when Deepgram considers a user utterance "done." Lower values = faster response but risk cutting off the user.
endpointing=300 # 300ms silence = end of utterance
endpointing=500 # 500ms for cautious endpointing
endpointing=false # Disable (you handle it)
Utterance Detection
Separate from endpointing — detects utterance boundaries even in continuous speech.
utterance_end_ms=1000 # Gap between utterances
interim_results=true # Get partial transcripts
vad_events=true # Speech start/stop events
Smart Formatting
Auto-converts spoken forms to written forms for cleaner LLM input.
- "three hundred dollars" → "$300"
- "january fifth twenty twenty six" → "January 5, 2026"
- "one two three four" → "1234" (in number context)
Keyword Boosting
Boost recognition of domain-specific terms that the model might miss.
keywords=[
    "Acme:2",          # Boost "Acme" by 2x
    "SKU:1.5",         # Product codes
    "onboarding:1.5",  # Domain terms
]
06C Deepgram Implementation
# Complete Deepgram Streaming STT for Voice Agent
import asyncio, json
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    LiveTranscriptionEvents,
    LiveOptions,
)

class DeepgramSTTEngine:
    """Production-ready Deepgram STT wrapper for voice agents."""

    def __init__(self, api_key: str, on_transcript, on_speech_started=None):
        self.client = DeepgramClient(api_key, DeepgramClientOptions(
            options={"keepalive": "true"}  # Persistent connection
        ))
        self.on_transcript = on_transcript
        self.on_speech_started = on_speech_started
        self.connection = None

    async def connect(self):
        self.connection = self.client.listen.asyncwebsocket.v("1")

        # Register event handlers
        self.connection.on(LiveTranscriptionEvents.Transcript, self._on_message)
        self.connection.on(LiveTranscriptionEvents.SpeechStarted, self._on_speech_started)
        self.connection.on(LiveTranscriptionEvents.UtteranceEnd, self._on_utterance_end)
        self.connection.on(LiveTranscriptionEvents.Error, self._on_error)

        options = LiveOptions(
            model="nova-2",         # Best for conversational speech
            language="en",
            encoding="linear16",    # 16-bit PCM
            sample_rate=16000,      # 16kHz mono
            channels=1,
            interim_results=True,   # Get partial transcripts for UI
            endpointing=300,        # 300ms silence = final
            utterance_end_ms=1000,  # Utterance boundary detection
            smart_format=True,      # Auto-format numbers, dates
            punctuate=True,         # Add punctuation
            vad_events=True,        # Speech start/stop events
            filler_words=False,     # Remove "um", "uh"
        )
        if not await self.connection.start(options):
            raise ConnectionError("Failed to connect to Deepgram")
        print("✓ Deepgram STT connected")

    async def send_audio(self, audio_bytes: bytes):
        """Send raw audio chunk (20ms frame = 640 bytes at 16kHz/16bit)."""
        if self.connection:
            self.connection.send(audio_bytes)

    async def _on_message(self, _self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if not transcript:
            return
        if result.is_final:
            # Final transcript → send to LLM
            confidence = result.channel.alternatives[0].confidence
            await self.on_transcript(transcript, is_final=True, confidence=confidence)
        else:
            # Interim → update UI only
            await self.on_transcript(transcript, is_final=False)

    async def _on_speech_started(self, _self, speech_started, **kwargs):
        # User started speaking → interrupt agent if needed
        if self.on_speech_started:
            await self.on_speech_started()

    async def _on_utterance_end(self, _self, utterance_end, **kwargs):
        # Clean boundary between utterances
        pass

    async def _on_error(self, _self, error, **kwargs):
        print(f"Deepgram error: {error}")

    async def close(self):
        if self.connection:
            await self.connection.finish()
Deepgram Audio Format Requirements
| Parameter | Recommended | Why |
|---|---|---|
| Sample Rate | 16,000 Hz | Standard for speech; higher adds bandwidth without improving recognition |
| Bit Depth | 16-bit (linear16) | Good dynamic range, supported by all providers |
| Channels | 1 (mono) | Speech is mono; stereo wastes bandwidth |
| Frame Size | 20ms (640 bytes) | Standard VoIP frame size; balances latency and efficiency |
| From Twilio | mulaw 8kHz | Telephony standard; Deepgram accepts mulaw natively |
07 Streaming Recognition
Streaming STT Best Practices
- Use 16kHz, 16-bit mono PCM (linear16) for best quality/bandwidth balance
- Send audio in 20ms frames (320 samples = 640 bytes at 16kHz, 16-bit)
- Enable interim results for UI feedback but trigger LLM only on final results
- Set endpointing to 300–500ms for conversational voice agents
- Use VAD events to detect speech start/stop separately from transcription
- Implement utterance-level buffering to handle multi-sentence turns
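Frame sizing follows directly from sample rate, bit depth, channel count, and frame duration. The helper below is an illustrative sanity check of the numbers used throughout this guide:

```python
def frame_bytes(sample_rate_hz: int, bit_depth: int, channels: int, frame_ms: int) -> int:
    """Size in bytes of one raw PCM audio frame."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * (bit_depth // 8) * channels

# 20ms of 16kHz, 16-bit mono PCM: 320 samples, 640 bytes
print(frame_bytes(16_000, 16, 1, 20))  # → 640
# Telephony (Twilio): 8kHz mulaw is 8-bit, so a 20ms frame is 160 bytes
print(frame_bytes(8_000, 8, 1, 20))    # → 160
```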
08 Voice Activity Detection (VAD)
VAD distinguishes human speech from silence, noise, and background audio. It's the gatekeeper that decides when to start and stop STT processing.
| VAD Engine | Type | Latency | Notes |
|---|---|---|---|
| Silero VAD | Neural (PyTorch/ONNX) | <1ms per frame | Best accuracy/speed tradeoff; industry standard |
| WebRTC VAD | Signal-based (GMM) | <0.1ms | Ultra-fast, less accurate in noise |
| Picovoice Cobra | Neural (edge) | <1ms | Optimized for mobile/IoT |
| Built-in (Deepgram/Azure) | Cloud-integrated | N/A (server-side) | No extra integration needed |
# Silero VAD example
import torch

model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    trust_repo=True
)
(get_speech_timestamps, _, read_audio, _, _) = utils

# Real-time frame-by-frame
def process_frame(audio_chunk_tensor):
    speech_prob = model(audio_chunk_tensor, 16000).item()
    is_speech = speech_prob > 0.5
    return is_speech
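Raw per-frame VAD decisions are noisy, so pipelines typically debounce them: require a short run of speech frames before declaring speech, and a longer run of silence before releasing. The class below is a hedged sketch of that hangover logic; the class name and thresholds are illustrative (with 20ms frames, 3 frames ≈ 60ms and 15 frames ≈ 300ms):

```python
# Sketch: smooth raw per-frame VAD decisions with "hangover" logic,
# so a single noisy frame cannot toggle the speech state.
class VADSmoother:
    def __init__(self, start_frames: int = 3, stop_frames: int = 15):
        self.start_frames = start_frames  # frames of speech needed to trigger
        self.stop_frames = stop_frames    # frames of silence needed to release
        self.speaking = False
        self._run = 0                     # length of the current disagreeing run

    def update(self, frame_is_speech: bool) -> bool:
        if frame_is_speech == self.speaking:
            self._run = 0  # frame agrees with current state, reset counter
        else:
            self._run += 1
            needed = self.stop_frames if self.speaking else self.start_frames
            if self._run >= needed:
                self.speaking = frame_is_speech
                self._run = 0
        return self.speaking
```

The asymmetric thresholds matter: entering speech quickly keeps first-word loss low, while a slow release acts as the endpointing silence window.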
09 Natural Language Understanding (NLU) & Intent Detection
NLU processes the transcribed text to extract meaning — intents, entities, sentiment, and dialog acts. A central design question for any voice agent: how do you detect what the user wants?
What LangChain / LangGraph Actually Do
| Framework | What It Is | What It Is NOT | Role in NLU |
|---|---|---|---|
| LangChain | LLM orchestration framework — chains prompts, tools, memory, retrievers together | Not an NLU engine, not an intent classifier | Can wrap an LLM call that does intent classification via prompting or function calling |
| LangGraph | Stateful graph-based agent framework — manages state machines, routing, cycles | Not an NLU engine, not an intent classifier | Can route based on detected intent (the graph decides what to do after intent is known) |
The Three Approaches to Intent Detection
1. Traditional NLU (ML Models)
Dedicated ML models trained on labeled intent data. Fast, deterministic, predictable. Limited to pre-defined intents.
- Intent classification (book_flight, check_balance)
- Named entity extraction (dates, names, amounts)
- Slot filling for structured actions
- Requires training data (50–500+ examples per intent)
2. LLM-Powered NLU (Prompting)
Use GPT/Claude with structured output to classify intents. No training data needed. Handles unseen intents.
- LLM does intent + entity extraction in one call
- Zero-shot: works without examples
- Structured output via function calling / JSON mode
- Higher latency (200–500ms) but much more capable
3. Hybrid (Classifier + LLM Fallback)
Fast local classifier for common intents; LLM fallback for edge cases. Best of both worlds.
- Local model handles 80% of known intents (<10ms)
- LLM handles ambiguous/novel intents (200ms+)
- Router decides which path based on confidence
- Most production voice agents use this approach
09A Intent Detection — Full Comparison
| Solution | Type | Latency | Training Data | Open Intents | Cost | Best For |
|---|---|---|---|---|---|---|
| Rasa NLU | Self-hosted ML | <10ms | Required (50+ per intent) | No | Free (OSS) | Self-hosted, full control |
| Dialogflow CX | Google Cloud | ~50ms | Required (10+ per intent) | No | $0.007/req | Google ecosystem, complex flows |
| Amazon Lex | AWS Cloud | ~80ms | Required (10+ per intent) | No | $0.004/req | AWS ecosystem, Alexa-like bots |
| Azure CLU (LUIS successor) | Azure Cloud | ~60ms | Required (15+ per intent) | No | $0.005/req | Microsoft ecosystem |
| GPT-4o Function Calling | LLM (OpenAI) | 200–400ms | None (zero-shot) | Yes | ~$0.003/req | Flexible, open-ended voice agents |
| Claude Tool Use | LLM (Anthropic) | 200–500ms | None (zero-shot) | Yes | ~$0.004/req | Safety-focused, enterprise |
| FastText / Sentence-BERT | Self-hosted embeddings | <5ms | Required (20+ per intent) | No | Free (OSS) | Ultra-low latency, edge |
| SetFit (few-shot) | Self-hosted (HuggingFace) | <10ms | Minimal (8–16 per intent) | No | Free (OSS) | Few-shot scenarios, fast training |
| LLM via LangChain | Orchestrated LLM call | 200–500ms | None (zero-shot) | Yes | LLM cost | When already using LangChain |
09B Intent Detection Approaches (Detailed)
Approach 1: LLM Function Calling as Intent Detection
The most common modern approach. Define intents as "functions" — the LLM decides which function to call based on the user's speech. This effectively combines NLU + action routing in one step.
# LLM Function Calling = Intent Detection + Entity Extraction
# Define your intents as tools/functions
tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",  # ← This IS the intent
            "description": "User wants to check the status of an existing order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {  # ← This IS the entity
                        "type": "string",
                        "description": "Order ID or number"
                    }
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "User wants to speak with a human agent",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {
                        "type": "string",
                        "enum": ["billing", "support", "sales"]
                    },
                    "reason": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "make_payment",
            "description": "User wants to make a payment on their account",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "account_id": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "general_question",
            "description": "User has a general question not covered by specific functions",
            "parameters": {
                "type": "object",
                "properties": {
                    "question": {"type": "string"}
                }
            }
        }
    }
]

# User says: "I want to check on order number 4567"
# LLM returns: tool_call(name="check_order_status", args={"order_id": "4567"})
#                        ↑ intent                         ↑ entity
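Once the LLM returns a tool call, the agent still has to route it to application code. A minimal sketch of that dispatch step, with hypothetical handler functions standing in for real business logic:

```python
# Hypothetical dispatcher: route an LLM tool call (intent) to a handler.
def handle_check_order_status(args: dict) -> str:
    # Real code would query an order system; this is a stub.
    return f"Order {args['order_id']} is on the way."

def handle_general_question(args: dict) -> str:
    return "Let me look into that."

HANDLERS = {
    "check_order_status": handle_check_order_status,
    "general_question": handle_general_question,
}

def dispatch(tool_name: str, tool_args: dict) -> str:
    # Unknown intents fall back to the general handler rather than failing.
    handler = HANDLERS.get(tool_name, handle_general_question)
    return handler(tool_args)

print(dispatch("check_order_status", {"order_id": "4567"}))
# → Order 4567 is on the way.
```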
Approach 2: Traditional NLU (Rasa / Dialogflow)
# Rasa NLU Pipeline (nlu.yml)
# Train a dedicated ML model for intent classification
nlu:
- intent: check_order
  examples: |
    - where is my order
    - check order status
    - what's the status of order [4567](order_id)
    - track my package
    - I want to know where my delivery is
    - can you look up order [AB-1234](order_id)

- intent: make_payment
  examples: |
    - I'd like to pay my bill
    - make a payment of [$50](amount)
    - pay [100 dollars](amount) on my account
    - how do I pay

- intent: transfer_to_human
  examples: |
    - let me talk to a real person
    - transfer me to an agent
    - I want to speak to someone
    - get me a human

# Result: {"intent": "check_order", "confidence": 0.94,
#          "entities": [{"entity": "order_id", "value": "4567"}]}
Approach 3: Fast Embedding Classifier (SetFit / Sentence-BERT)
# Ultra-fast intent detection using sentence embeddings
# Only needs 8-16 examples per intent to train
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Train with minimal examples (SetFit expects a datasets.Dataset
# with "text" and "label" columns)
examples = [
    ("check my order", "check_order"),
    ("where is my package", "check_order"),
    ("track delivery", "check_order"),
    ("order status", "check_order"),
    ("pay my bill", "make_payment"),
    ("make a payment", "make_payment"),
    ("talk to a human", "transfer"),
    ("speak to agent", "transfer"),
    # ... 8-16 examples per intent
]
train_data = Dataset.from_dict({
    "text": [text for text, _ in examples],
    "label": [label for _, label in examples],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_data)
trainer.train()

# Inference: <5ms
intent = model.predict(["I need to check on order 4567"])[0]
# → "check_order"
Approach 4: Hybrid Router (Recommended for Production Voice Agents)
# Hybrid: Fast classifier + LLM fallback
# This is the production-recommended approach for voice agents
import json
from openai import AsyncOpenAI
from setfit import SetFitModel

class HybridIntentRouter:
    def __init__(self):
        self.fast_classifier = SetFitModel.from_pretrained("./intent-model")
        self.confidence_threshold = 0.85
        self.llm = AsyncOpenAI()

    async def detect_intent(self, transcript: str) -> dict:
        # Step 1: Try fast classifier (~5ms)
        probs = self.fast_classifier.predict_proba([transcript])[0]
        confidence = float(probs.max())
        top_intent = self.fast_classifier.predict([transcript])[0]
        if confidence >= self.confidence_threshold:
            # High confidence → use fast result (saves 200-400ms)
            return {
                "intent": top_intent,
                "confidence": confidence,
                "method": "fast_classifier",
                "latency_ms": 5,
            }
        # Step 2: Low confidence → fall back to LLM (200-400ms)
        llm_result = await self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Classify the user's intent. Return JSON: {intent, entities, confidence}"},
                {"role": "user", "content": transcript},
            ],
            response_format={"type": "json_object"},
        )
        return {**json.loads(llm_result.choices[0].message.content), "method": "llm_fallback"}
09C LangChain / LangGraph Role in Voice Agents
Since LangChain and LangGraph are often confused with NLU, here's exactly what role they play in a voice agent pipeline.
What LangChain Does in a Voice Agent
| Capability | LangChain Role | Not LangChain's Job |
|---|---|---|
| Intent Detection | Wraps an LLM call that does intent detection via function calling | Does not provide its own intent classifier |
| Entity Extraction | LLM extracts entities via structured output (Pydantic models) | Does not have NER models |
| Conversation Memory | Yes — ConversationBufferMemory, summary memory, etc. | — |
| RAG Retrieval | Yes — retrievers, vector stores, rerankers | — |
| Tool/Function Calling | Yes — tool definitions, execution, result handling | — |
| Prompt Management | Yes — templates, few-shot examples, output parsers | — |
| Agent Orchestration | Yes (via LangGraph) — state machines, routing, cycles | — |
What LangGraph Does in a Voice Agent
# LangGraph voice agent with intent routing
# Note: Intent detection happens INSIDE the LLM call, not from LangGraph
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class VoiceState(TypedDict):
    transcript: str
    intent: str
    entities: dict
    response: str
    conversation_history: list

# Node 1: Detect intent (uses LLM — LangGraph doesn't do this itself)
async def detect_intent(state: VoiceState) -> VoiceState:
    # Option A: Fast classifier
    result = fast_classifier.predict(state["transcript"])
    # Option B: LLM function calling
    # result = await llm.classify(state["transcript"])
    state["intent"] = result.intent
    state["entities"] = result.entities
    return state

# Router: LangGraph routes based on detected intent
def route_intent(state: VoiceState) -> Literal["order", "payment", "transfer", "general"]:
    intent_map = {
        "check_order": "order",
        "make_payment": "payment",
        "transfer_to_human": "transfer",
    }
    return intent_map.get(state["intent"], "general")

# Build graph
graph = StateGraph(VoiceState)
graph.add_node("detect_intent", detect_intent)
graph.add_node("order", handle_order_check)
graph.add_node("payment", handle_payment)
graph.add_node("transfer", handle_transfer)
graph.add_node("general", handle_general_query)
graph.add_node("respond", generate_voice_response)

graph.set_entry_point("detect_intent")
graph.add_conditional_edges("detect_intent", route_intent)
for node in ["order", "payment", "transfer", "general"]:
    graph.add_edge(node, "respond")
graph.add_edge("respond", END)

voice_agent = graph.compile()
09D Intent Detection Decision Guide
Which Approach Should You Use?
| Your Situation | Recommended Approach | Why |
|---|---|---|
| Well-defined intents (10–50), latency critical | SetFit / FastText classifier | <5ms, deterministic, no LLM cost |
| Complex flows with many intents + Google ecosystem | Dialogflow CX | Visual flow builder, Google integrations |
| Open-ended conversation, can't pre-define all intents | LLM function calling | Handles anything, zero training data |
| Enterprise with existing Rasa infrastructure | Rasa NLU | Self-hosted, full control, proven at scale |
| Production voice agent (best overall) | Hybrid: fast classifier + LLM fallback | Fast for common intents, LLM for edge cases |
| Prototype / MVP (ship fast) | LLM function calling only | Zero setup, works immediately |
| Edge / offline deployment | SetFit or Vosk + local model | No cloud dependency |
10 Dialog Management
Controls the flow of conversation — tracking state, managing turns, handling context switches, and deciding what action to take next.
Dialog Management Approaches
| Approach | How | Best For |
|---|---|---|
| Finite State Machine | Predefined states and transitions | Simple IVR, scripted flows |
| Frame-Based | Fill slots until action is ready | Form-filling (booking, orders) |
| LLM-Driven | LLM decides next action via system prompt | Open-ended conversation |
| Hybrid (Graph + LLM) | Graph for structure, LLM for flexibility | Enterprise voice agents (recommended) |
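The frame-based approach from the table can be sketched in a few lines: a hypothetical `BookingFrame` (names and slots are illustrative) tracks which slots are filled and prompts for the first missing one:

```python
# Sketch of a frame-based dialog manager: collect required slots, then act.
class BookingFrame:
    REQUIRED = ("date", "time", "party_size")

    def __init__(self):
        self.slots = {}

    def fill(self, entities: dict) -> None:
        # Only accept entities that correspond to known slots.
        for key, value in entities.items():
            if key in self.REQUIRED:
                self.slots[key] = value

    def missing(self) -> list:
        return [slot for slot in self.REQUIRED if slot not in self.slots]

    def next_prompt(self) -> str:
        gaps = self.missing()
        if not gaps:
            return "Confirming your booking now."
        return f"What {gaps[0].replace('_', ' ')} would you like?"

frame = BookingFrame()
frame.fill({"date": "2026-01-05"})
print(frame.next_prompt())  # → What time would you like?
frame.fill({"time": "19:00", "party_size": 4})
print(frame.next_prompt())  # → Confirming your booking now.
```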
11 LLM Integration
The LLM is the reasoning brain of the voice agent. It processes the user's transcript, conversation history, and system instructions to generate responses.
# Voice-optimized LLM prompt
SYSTEM_PROMPT = """You are a helpful voice assistant for Acme Corp customer support.

VOICE-SPECIFIC RULES:
- Keep responses SHORT (1-3 sentences). Voice != chat.
- Use conversational language, contractions, natural phrasing.
- NEVER use markdown, bullet points, URLs, or special formatting.
- Spell out numbers: "twenty three" not "23".
- For lists, say "first... second... third..." not "1. 2. 3."
- If unsure, ask ONE clarifying question at a time.
- Acknowledge the user before answering: "Sure!", "Great question.", etc.

FUNCTION CALLING:
- Use check_order_status(order_id) for order inquiries.
- Use transfer_to_human(department) if user explicitly asks for a person.
- Use schedule_callback(phone, time) for callback requests.

CONTEXT:
- Customer: {customer_name}
- Account tier: {tier}
- Previous interactions: {history_summary}
"""

# Streaming LLM call
async def stream_llm_response(transcript, context):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(**context)},
            *conversation_history,
            {"role": "user", "content": transcript}
        ],
        stream=True,
        temperature=0.7,
        max_tokens=150,  # Keep voice responses short
    )
    sentence_buffer = ""
    async for chunk in response:
        token = chunk.choices[0].delta.content or ""
        sentence_buffer += token
        # Yield complete sentences/clauses for TTS (flushing on commas
        # trades a little prosody for lower latency)
        if any(sentence_buffer.rstrip().endswith(p) for p in (".", "!", "?", ",")):
            yield sentence_buffer.strip()
            sentence_buffer = ""
    if sentence_buffer.strip():
        yield sentence_buffer.strip()
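Because sentence buffering is easy to get subtly wrong (trailing partial sentences, stray whitespace), it helps to keep the rule as a pure, offline-testable generator. This sketch mirrors the buffering logic used in this guide, here flushing only on sentence-final punctuation:

```python
# Pure, testable version of sentence-level chunking for TTS.
SENTENCE_END = (".", "!", "?")

def chunk_sentences(tokens):
    """Group a token stream into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence

tokens = ["Sure", "!", " Your", " order", " shipped", " today", "."]
print(list(chunk_sentences(tokens)))
# → ['Sure!', 'Your order shipped today.']
```

Keeping this step pure means the streaming code only wires an async token source into it, and the tricky edge cases can be covered by plain unit tests.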
12 RAG for Voice Agents
Retrieval-Augmented Generation connects your voice agent to enterprise knowledge bases, FAQs, product docs, and customer data — so it gives accurate, grounded answers.
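A minimal sketch of the retrieval step, using naive word-overlap scoring over a hypothetical FAQ list. Production systems would use embeddings and a vector store; this only shows how retrieved context gets folded into the voice prompt:

```python
# Toy RAG retrieval: score documents by word overlap with the query.
# The FAQ contents below are invented examples.
DOCS = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Support hours are 9am to 6pm Eastern, Monday through Friday.",
]

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Return the k docs sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    """Ground the LLM call in retrieved context, phrased for voice."""
    context = " ".join(retrieve(query, DOCS))
    return f"Answer briefly for voice, using this context: {context}\nUser: {query}"

print(retrieve("how long does shipping take", DOCS))
# → ['Standard shipping takes 3 to 5 business days.']
```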
13 Text-to-Speech (TTS)
TTS converts the LLM's text response into natural-sounding audio. Modern neural TTS produces near-human quality. For voice agents, streaming TTS is critical — audio begins playing before the full text is synthesized.
Key TTS Features for Voice Agents
- Streaming synthesis — Generate audio incrementally (sentence by sentence)
- Low first-byte latency — Start speaking as fast as possible
- Natural prosody — Proper intonation, stress, and rhythm
- Emotion/style control — Adjust tone (friendly, professional, empathetic)
- Voice cloning — Custom brand voice from audio samples
- SSML support — Fine-grained control over pronunciation, pauses, emphasis
- Multi-language — Support for global deployment
14 TTS Engines Compared
| Engine | Type | Streaming | Latency | Quality | Best For |
|---|---|---|---|---|---|
| ElevenLabs | Cloud API | Yes | ~150ms | Excellent | Highest quality, voice cloning |
| Cartesia (Sonic) | Cloud API | Yes | ~90ms | Very Good | Ultra-low latency voice agents |
| Deepgram Aura | Cloud API | Yes | ~80ms | Good | STT+TTS single vendor |
| OpenAI TTS | Cloud API | Yes | ~200ms | Very Good | OpenAI ecosystem |
| Azure Neural TTS | Cloud API | Yes | ~150ms | Very Good | Enterprise, SSML, 400+ voices |
| Google Cloud TTS | Cloud API | Yes | ~180ms | Very Good | Multi-language, WaveNet |
| Amazon Polly | Cloud API | Yes | ~200ms | Good | AWS ecosystem, NTTS voices |
| Coqui TTS | Open-source | Limited | ~300ms | Good | Self-hosted, custom voices |
| Piper TTS | Open-source | No | ~100ms | Moderate | Edge/offline, lightweight |
# ElevenLabs Streaming TTS
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")

def stream_tts(text: str):
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="pNInz6obpgDQGcFmaJgB",  # "Adam"
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="pcm_16000",  # Raw PCM for low latency
    )
    for audio_chunk in audio_stream:
        yield audio_chunk  # Send to speaker/WebSocket

# Cartesia Streaming TTS (ultra-low latency)
from cartesia import Cartesia

cartesia = Cartesia(api_key="YOUR_KEY")

async def stream_cartesia(text: str):
    output = await cartesia.tts.sse(
        model_id="sonic-english",
        transcript=text,
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
        stream=True,
    )
    async for chunk in output:
        yield chunk["audio"]
14A Why ElevenLabs — Deep Dive
ElevenLabs delivers the most natural-sounding AI voices on the market. For enterprise voice agents where brand perception and user trust depend on voice quality, ElevenLabs is the premium choice.
Why ElevenLabs Over Alternatives
| Criteria | ElevenLabs | OpenAI TTS | Azure Neural | Google TTS |
|---|---|---|---|---|
| Voice Naturalness | Best-in-class (MOS ~4.5) | Very good (~4.2) | Very good (~4.1) | Good (~3.9) |
| Streaming Latency | ~150ms first byte | ~200ms | ~150ms | ~180ms |
| Voice Cloning | Professional (30s–30min audio) | No | Custom Neural Voice ($) | Limited |
| Emotion Control | Yes (style, stability sliders) | No | SSML only | SSML only |
| Voice Library | Thousands (community + premium) | 6 voices | 400+ voices | 100+ voices |
| Languages | 29 languages | ~57 languages | 140+ languages | 40+ languages |
| Turbo Model | Yes (Turbo v2.5 — ~100ms) | tts-1 (fast/lower quality) | No turbo option | No turbo option |
| Cost (per 1K chars) | $0.18–$0.30 | $0.015–$0.030 | $0.016 | $0.016 |
Key Reasons to Choose ElevenLabs
- Highest naturalness scores across independent benchmarks — ElevenLabs' Multilingual v2 and Turbo v2.5 models consistently achieve the highest Mean Opinion Scores (MOS) in blind listening tests. Users perceive ElevenLabs voices as more human-like, building trust in voice agent interactions.
- Professional voice cloning for brand identity — Clone a specific voice (spokesperson, brand character) from as little as 30 seconds of audio. The resulting voice is consistent across all calls, creating a recognizable brand experience.
- Fine-grained emotion and style control — Adjust stability (consistency vs expressiveness) and similarity (closeness to original voice) sliders. This lets you tune the voice to match your brand personality — professional, warm, energetic, calm.
- Turbo v2.5 model for sub-100ms latency — When latency matters most (interactive voice agents), the Turbo model sacrifices minimal quality for dramatically lower first-byte latency, competing with Cartesia's speed.
- Rich voice library — Access thousands of pre-made voices for prototyping, or clone custom voices for production. Switch voices without changing any pipeline code.
14B ElevenLabs Implementation
# Complete ElevenLabs Streaming TTS for Voice Agents
import asyncio
from elevenlabs import ElevenLabs
class ElevenLabsTTSEngine:
"""Production ElevenLabs TTS with streaming and voice management."""
def __init__(self, api_key: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
self.client = ElevenLabs(api_key=api_key)
self.voice_id = voice_id
def stream_audio(self, text: str, model: str = "eleven_turbo_v2_5"):
"""Stream audio chunks for a text sentence.
Models:
- eleven_turbo_v2_5: Fastest (~100ms), good quality — USE FOR VOICE AGENTS
- eleven_multilingual_v2: Best quality (~200ms), all 29 languages
- eleven_monolingual_v1: English only, legacy
"""
audio_stream = self.client.text_to_speech.convert_as_stream(
voice_id=self.voice_id,
text=text,
model_id=model,
output_format="pcm_16000", # Raw PCM for lowest latency
voice_settings={
"stability": 0.5, # 0=expressive, 1=stable
"similarity_boost": 0.75, # Closeness to original voice
"style": 0.0, # 0=neutral, 1=exaggerated
"use_speaker_boost": True, # Enhance clarity
},
optimize_streaming_latency=3, # 0-4, higher = faster but lower quality
)
for audio_chunk in audio_stream:
yield audio_chunk
async def synthesize_for_twilio(self, text: str):
"""Generate audio in mulaw format for Twilio Media Streams."""
audio_stream = self.client.text_to_speech.convert_as_stream(
voice_id=self.voice_id,
text=text,
model_id="eleven_turbo_v2_5",
output_format="ulaw_8000", # Native Twilio format!
)
for chunk in audio_stream:
yield chunk
def get_voices(self):
"""List available voices."""
return self.client.voices.get_all()
def clone_voice(self, name: str, audio_files: list):
"""Clone a voice from audio samples."""
return self.client.clone(
name=name,
files=audio_files,
description="Custom brand voice for voice agent"
)
14C Why Cartesia — Deep Dive
Cartesia (Sonic model) delivers the lowest TTS latency in the market, making it the ideal choice when response speed is the primary concern.
Why Cartesia Over Alternatives
| Criteria | Cartesia Sonic | ElevenLabs Turbo | Deepgram Aura |
|---|---|---|---|
| First-Byte Latency | ~90ms (fastest) | ~100ms | ~80ms |
| Voice Quality | Very Good | Excellent | Good |
| Instant Voice Cloning | Yes (5–15 sec audio) | Yes (30s+ audio) | No |
| Emotion/Style Mixing | Yes (blend multiple emotions) | Stability sliders | No |
| Multilingual | Growing (10+ langs) | 29 languages | English focus |
| Word-level Timestamps | Yes | No | No |
| WebSocket Streaming | Yes (native) | HTTP streaming | HTTP streaming |
| Cost | Competitive | Premium | Lowest |
Key Reasons to Choose Cartesia
- Absolute lowest latency for time-critical interactions — Cartesia's State Space Model (SSM) architecture generates audio faster than transformer-based TTS. The Sonic model produces the first audio byte in ~90ms, enabling sub-second agent responses.
- WebSocket-native streaming — Unlike HTTP-based streaming (ElevenLabs, OpenAI), Cartesia provides true WebSocket streaming with bidirectional communication. You can send text and receive audio on the same persistent connection, eliminating connection overhead per sentence.
- Word-level timestamps in real-time — Cartesia returns timing information for each word as audio streams, enabling precise lip-sync for avatars, captions, and alignment-based interruption handling.
- Emotion and style mixing — Blend multiple emotional tones in a single generation (e.g., 70% professional + 30% warm). This enables dynamic emotional adaptation during conversations.
- Instant voice cloning from 5 seconds of audio — The fastest voice cloning available, enabling rapid prototyping and custom voice creation without long training cycles.
14D Cartesia Implementation
# Complete Cartesia Sonic Streaming TTS
import asyncio
from cartesia import Cartesia
class CartesiaTTSEngine:
"""Production Cartesia TTS with WebSocket streaming."""
def __init__(self, api_key: str, voice_id: str):
self.client = Cartesia(api_key=api_key)
self.voice_id = voice_id
self.ws = None
async def connect_websocket(self):
"""Establish persistent WebSocket for lowest latency."""
self.ws = self.client.tts.websocket()
print("✓ Cartesia WebSocket connected")
async def stream_audio(self, text: str, context_id: str = "default"):
"""Stream audio via persistent WebSocket connection.
context_id: Use same ID for sentences in one turn
to maintain prosody continuity across chunks.
"""
output = self.ws.send(
model_id="sonic-english",
transcript=text,
voice={
"mode": "id",
"id": self.voice_id,
# Emotion mixing example:
# "mode": "embedding",
# "embedding": blend(professional_emb, warm_emb, 0.7)
},
output_format={
"container": "raw",
"encoding": "pcm_s16le",
"sample_rate": 16000,
},
context_id=context_id, # Prosody continuity
stream=True,
)
for chunk in output:
# chunk contains: audio bytes + optional word timestamps
yield chunk["audio"]
async def stream_for_twilio(self, text: str):
"""Generate mulaw audio for Twilio telephony."""
output = self.ws.send(
model_id="sonic-english",
transcript=text,
voice={"mode": "id", "id": self.voice_id},
output_format={
"container": "raw",
"encoding": "pcm_mulaw", # Native Twilio format
"sample_rate": 8000, # Telephony standard
},
stream=True,
)
for chunk in output:
yield chunk["audio"]
async def close(self):
if self.ws:
self.ws.close()
14E Choosing ElevenLabs vs Cartesia
Decision Matrix
| Scenario | Choose ElevenLabs | Choose Cartesia |
|---|---|---|
| Primary goal | Maximum voice quality & naturalness | Minimum latency |
| Brand voice needed | Best voice cloning quality | Good instant cloning |
| Enterprise sales calls | Premium voice builds trust | Fast response impresses |
| High-volume support calls | Cost may be prohibitive | Better cost/latency ratio |
| Avatar/lip-sync needed | No word timestamps | Word-level timestamps |
| Many languages | 29 languages | Growing support |
| Budget constrained | Premium pricing | More cost-effective |
| WebSocket native | HTTP streaming | True WebSocket |
15 Voice Cloning
Create a custom brand voice from audio samples. Requires as little as 30 seconds of clean audio with some providers.
| Provider | Samples Needed | Quality |
|---|---|---|
| ElevenLabs | 1–30 min audio | Excellent (Professional Voice Cloning) |
| Cartesia | 5–15 sec | Very Good (instant cloning) |
| PlayHT | 30 sec+ | Very Good |
| Coqui (XTTS) | 6 sec+ | Good (open-source) |
16 SSML & Prosody Control
Speech Synthesis Markup Language (SSML) gives fine-grained control over how TTS engines pronounce text.
<!-- SSML Example -->
<speak>
<prosody rate="medium" pitch="+5%">
Welcome to Acme support!
</prosody>
<break time="300ms"/>
Your order
<say-as interpret-as="characters">AB</say-as>
<say-as interpret-as="cardinal">1234</say-as>
is on its way.
<emphasis level="strong">Is there anything else I can help with?</emphasis>
</speak>
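In practice SSML payloads are assembled programmatically, and unescaped user text (names like "AT&T", order notes) can break the XML the TTS engine receives. A minimal builder sketch — the tag set mirrors the example above; the helper name and defaults are illustrative, not any particular provider's API:

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, rate: str = "medium", pitch: str = "+5%",
               pause_ms: int = 0) -> str:
    """Wrap plain text in a minimal SSML envelope.

    Escapes XML special characters so user-supplied text (e.g. "AT&T")
    cannot break the markup sent to the TTS engine.
    """
    body = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"

print(build_ssml("Welcome to AT&T support!", pause_ms=300))
```

The `escape` call is the important part: without it, a single ampersand in a transcript produces invalid XML and a failed synthesis request.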
17 WebSocket Streaming
WebSockets provide full-duplex, low-latency communication for real-time audio streaming between client and server.
# FastAPI WebSocket voice agent server
import asyncio
from fastapi import FastAPI, WebSocket
app = FastAPI()
@app.websocket("/voice")
async def voice_endpoint(ws: WebSocket):
await ws.accept()
    # Placeholder components — wire in your providers (Deepgram STT, GPT-4o, Cartesia TTS, etc.)
    stt = StreamingSTT()
    llm = LLMClient()
    tts = StreamingTTS()
try:
while True:
# Receive audio from client
audio_data = await ws.receive_bytes()
# Feed to streaming STT
transcript = await stt.process(audio_data)
if transcript and transcript.is_final:
# Stream LLM → TTS → audio back to client
async for sentence in llm.stream(transcript.text):
async for audio_chunk in tts.synthesize(sentence):
await ws.send_bytes(audio_chunk)
    except Exception as exc:
        print(f"Pipeline error: {exc}")
        await ws.close()
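The handler above assumes `llm.stream()` yields sentence-sized chunks. If your LLM client yields raw tokens instead, a small chunker can buffer them into sentences before TTS, so audio starts early while prosody stays natural. A sketch — the regex and flush rules are simplified assumptions, not a production sentence tokenizer:

```python
import re

_SENTENCE_END = re.compile(r"([.!?])\s")

def chunk_sentences(token_iter):
    """Buffer streamed LLM tokens and yield complete sentences.

    Sentence-sized chunks keep TTS prosody natural while still letting
    playback begin before the full LLM response is generated.
    """
    buffer = ""
    for token in token_iter:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()   # emit the completed sentence
            buffer = buffer[end:]
    if buffer.strip():                   # flush any trailing partial sentence
        yield buffer.strip()

print(list(chunk_sentences(["Hello", " there", ". How", " can I", " help?"])))
```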
18 Why Twilio — Deep Dive
Twilio is the recommended telephony platform for connecting voice agents to the phone network. It provides the bridge between PSTN/SIP phone calls and your WebSocket-based voice agent pipeline.
Why Twilio Over Alternatives
| Criteria | Twilio | Vonage (Nexmo) | Telnyx | FreeSWITCH |
|---|---|---|---|---|
| Media Streams API | First-class WebSocket | WebSocket (beta) | WebSocket | Custom (mod_audio_stream) |
| Bidirectional Audio | Yes (send + receive) | Limited | Yes | Yes |
| Call Control (TwiML) | Mature, declarative XML | NCCO (JSON) | TeXML | Dialplan (XML) |
| Global Phone Numbers | 180+ countries | 80+ countries | 30+ countries | N/A (BYO trunk) |
| SIP Trunking | Elastic SIP Trunking | Yes | Yes | Native |
| Recording & Compliance | Built-in, PCI compliant | Built-in | Built-in | Manual |
| DTMF Detection | Yes (in-stream) | Yes | Yes | Yes |
| Developer Experience | Best docs, SDKs, community | Good | Good | Complex, expert-level |
| Scalability | Auto-scales, enterprise SLA | Good | Good | Manual scaling |
| Cost (per min) | $0.013 inbound | $0.0127 | $0.003 | $0 (infra costs) |
Key Reasons to Choose Twilio
- Media Streams API is purpose-built for AI voice agents — Twilio's Media Streams sends real-time audio over WebSocket in both directions. This is the exact integration pattern voice agents need: receive caller audio → process through STT → LLM → TTS → send audio back. No other provider has this as mature and well-documented.
- Bidirectional streaming with call control — Twilio lets you simultaneously stream audio AND control the call (transfer, hold, record, gather DTMF) through TwiML and the REST API. This is critical for enterprise voice agents that need to transfer to humans, place callers on hold, or navigate IVR trees.
- Instant global phone numbers — Provision local, toll-free, or national numbers in 180+ countries via API. Your voice agent can be reachable from any phone in the world within seconds of configuration.
- Enterprise-grade reliability and compliance — 99.95% uptime SLA, SOC 2 / HIPAA / PCI-DSS compliance, built-in call recording with automatic PII redaction, and GDPR-compliant data handling. Critical for enterprise deployments.
- Best developer experience in telephony — Twilio has the most comprehensive documentation, largest community, SDKs in every major language, and the most Stack Overflow answers of any CPaaS provider.
- Elastic SIP Trunking for existing infrastructure — If your enterprise already has a PBX or contact center, Twilio Elastic SIP Trunking lets you connect your voice agent without replacing existing telephony infrastructure.
18A Twilio Voice Agent Architecture
18B Twilio Media Streams Protocol
Twilio Media Streams is the API that connects phone calls to your voice agent via WebSocket. Understanding its message protocol is essential.
Media Stream Events (Twilio → Your Server)
| Event | When | Key Data |
|---|---|---|
| connected | WebSocket established | Protocol version |
| start | Stream begins | streamSid, callSid, media format, custom params |
| media | Every ~20ms | payload (base64 mulaw audio), timestamp, track |
| dtmf | Keypad press detected | digit (0–9, *, #) |
| mark | Audio playback marker reached | name (your custom marker name) |
| stop | Stream ends (call ended/transferred) | streamSid |
Commands (Your Server → Twilio)
| Command | Purpose | Key Data |
|---|---|---|
| media | Send audio to caller | payload (base64 mulaw audio) |
| mark | Insert audio marker | name (notified when played) |
| clear | Stop all queued audio immediately | streamSid — essential for interruptions |
The clear command is critical for interruption handling. When your VAD detects the user speaking while the agent is talking, send {"event": "clear", "streamSid": "..."} to stop playback immediately. Without it, the caller hears the agent talk over them.
18C Twilio Complete Implementation
# ============================================
# Complete Twilio Voice Agent (FastAPI)
# Integrates: Twilio + Deepgram + LLM + TTS
# ============================================
import asyncio, json, base64
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import HTMLResponse
from twilio.twiml.voice_response import VoiceResponse, Connect
app = FastAPI()
# ─── 1. WEBHOOK: Twilio calls this when a call arrives ───
@app.post("/twilio-webhook")
async def twilio_webhook(request: Request):
"""Twilio hits this endpoint when someone calls your number.
Returns TwiML that tells Twilio to open a Media Stream."""
response = VoiceResponse()
# Optional: play greeting before connecting to AI
response.say("Connecting you to our AI assistant.", voice="Polly.Joanna")
# Connect call audio to your WebSocket
connect = Connect()
stream = connect.stream(
        url="wss://your-server.com/twilio-stream",
status_callback="https://your-server.com/stream-status",
status_callback_method="POST",
)
    # Pass custom parameters to your WebSocket handler
    form = await request.form()  # Request.form() is a coroutine in FastAPI — must be awaited
    stream.parameter(name="caller_number", value=str(form.get("From", "")))
    stream.parameter(name="call_sid", value=str(form.get("CallSid", "")))
response.append(connect)
return HTMLResponse(content=str(response), media_type="application/xml")
# ─── 2. WEBSOCKET: Receives real-time audio from Twilio ───
@app.websocket("/twilio-stream")
async def twilio_media_stream(ws: WebSocket):
await ws.accept()
# State for this call
stream_sid = None
call_sid = None
caller_number = None
is_agent_speaking = False
# Initialize pipeline components
deepgram_stt = DeepgramSTTEngine(
api_key=DG_API_KEY,
on_transcript=lambda t, **kw: handle_transcript(t, ws, stream_sid, **kw),
on_speech_started=lambda: handle_barge_in(ws, stream_sid),
)
await deepgram_stt.connect()
try:
async for message in ws.iter_text():
data = json.loads(message)
event = data["event"]
if event == "connected":
print("✓ Twilio WebSocket connected")
elif event == "start":
stream_sid = data["start"]["streamSid"]
call_sid = data["start"]["callSid"]
custom = data["start"].get("customParameters", {})
caller_number = custom.get("caller_number")
print(f"📞 Call started: {call_sid} from {caller_number}")
# Send initial greeting via TTS
await send_tts_to_twilio(
"Hi! I'm your AI assistant. How can I help you today?",
ws, stream_sid
)
elif event == "media":
# Decode base64 mulaw audio from Twilio
audio_bytes = base64.b64decode(data["media"]["payload"])
# Forward to Deepgram STT (accepts mulaw natively)
await deepgram_stt.send_audio(audio_bytes)
elif event == "dtmf":
digit = data["dtmf"]["digit"]
print(f"📱 DTMF: {digit}")
elif event == "mark":
# Audio playback reached a marker
marker_name = data["mark"]["name"]
if marker_name == "end_of_response":
is_agent_speaking = False
elif event == "stop":
print(f"📞 Call ended: {call_sid}")
break
except Exception as e:
print(f"Error: {e}")
finally:
await deepgram_stt.close()
# ─── 3. HELPER: Send TTS audio back to Twilio ───
async def send_tts_to_twilio(text: str, ws: WebSocket, stream_sid: str):
"""Generate TTS audio and stream it back to the Twilio caller."""
tts = ElevenLabsTTSEngine(api_key=ELEVEN_API_KEY, voice_id=VOICE_ID)
    # OR: tts = CartesiaTTSEngine(api_key=CARTESIA_KEY, voice_id=VOICE_ID)  # then iterate stream_for_twilio
async for audio_chunk in tts.synthesize_for_twilio(text):
payload = base64.b64encode(audio_chunk).decode("utf-8")
await ws.send_json({
"event": "media",
"streamSid": stream_sid,
"media": {"payload": payload}
})
# Add marker to know when playback finishes
await ws.send_json({
"event": "mark",
"streamSid": stream_sid,
"mark": {"name": "end_of_response"}
})
# ─── 4. HELPER: Handle barge-in (user interrupts agent) ───
async def handle_barge_in(ws: WebSocket, stream_sid: str):
"""User started speaking while agent is talking. Clear audio."""
await ws.send_json({
"event": "clear",
"streamSid": stream_sid,
})
print("⚡ Barge-in: cleared Twilio audio queue")
18D Twilio Advanced Features
Call Transfer to Human
When the voice agent can't handle a request, warm-transfer to a human agent using the Twilio REST API.
from twilio.rest import Client
client = Client(TWILIO_SID, TWILIO_TOKEN)
# Transfer call to human agent queue
client.calls(call_sid).update(
twiml='<Response><Dial><Queue>support</Queue></Dial></Response>'
)
Call Recording
Record calls for QA, compliance, and training data. Enable per-call or account-wide.
# Enable recording via TwiML
response = VoiceResponse()
response.record(
recording_status_callback="/recording-done",
transcribe=True,
max_length=3600, # 1 hour max
)
Outbound Calls
Your voice agent can initiate calls (appointment reminders, follow-ups, surveys).
call = client.calls.create(
to="+1234567890",
from_="+1987654321", # Your Twilio #
url="https://your-server.com/twilio-webhook",
status_callback="https://your-server.com/call-status",
)
DTMF Handling
Detect keypad presses for IVR navigation, PIN entry, or menu selection during AI conversation.
# In WebSocket handler:
elif event == "dtmf":
digit = data["dtmf"]["digit"]
if digit == "0":
await transfer_to_human()
elif digit == "*":
await repeat_last_message()
Full Pipeline: Twilio + Deepgram + LLM + ElevenLabs/Cartesia
Twilio Configuration Checklist
| Setting | Value | Why |
|---|---|---|
| Phone Number | Provision via Console or API | Your voice agent's phone number |
| Webhook URL | POST https://your-server.com/twilio-webhook | Called on inbound calls |
| Status Callback | POST https://your-server.com/call-status | Track call lifecycle events |
| Media Streams | Bidirectional, single-track | Receive and send audio |
| Audio Format | mulaw (G.711 μ-law), 8kHz, mono | Telephony standard, accepted by Deepgram natively |
| TLS | Required (wss://) | Twilio requires encrypted WebSocket |
| Server location | Same region as Twilio edge | Minimize network latency |
19 WebRTC Integration
WebRTC provides peer-to-peer audio/video with built-in echo cancellation, noise suppression, and adaptive bitrate. Ideal for browser-based voice agents.
WebRTC Advantages for Voice Agents
- Built-in acoustic echo cancellation (AEC) — prevents the agent from hearing itself
- Automatic gain control (AGC) — normalizes volume
- Noise suppression — filters background noise
- Opus codec — high quality at low bitrate
- Lowest possible latency (peer-to-peer when possible)
Frameworks with WebRTC: LiveKit Daily.co Pipecat
20 Interruption Handling (Barge-in)
Users will interrupt the agent mid-sentence. The agent must detect this, stop speaking immediately, and process the new input.
# Interruption handling logic
class InterruptionHandler:
def __init__(self):
self.is_agent_speaking = False
self.playback_task = None
self.audio_buffer = asyncio.Queue()
async def on_user_speech_detected(self):
"""Called when VAD detects user speech during agent output."""
if self.is_agent_speaking:
# 1. Cancel current TTS playback
if self.playback_task:
self.playback_task.cancel()
# 2. Flush audio buffer
while not self.audio_buffer.empty():
self.audio_buffer.get_nowait()
# 3. Send clear message to client
await self.send_clear_audio()
self.is_agent_speaking = False
print("⚡ Barge-in detected — agent stopped")
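One refinement worth adding: raw VAD events fire on coughs and line clicks, so production agents usually require a sustained run of speech before treating it as a real barge-in (LiveKit exposes a similar guard as interrupt_min_words). A minimal duration-based gate — frame size and threshold are illustrative, not tuned values:

```python
class BargeInGate:
    """Only confirm a barge-in after sustained user speech.

    Short noise bursts trip VAD for a frame or two; requiring ~200ms of
    continuous speech before cancelling TTS avoids cutting the agent off
    spuriously. Thresholds here are illustrative.
    """
    def __init__(self, frame_ms: int = 20, min_speech_ms: int = 200):
        self.frames_needed = min_speech_ms // frame_ms
        self.speech_frames = 0

    def update(self, vad_is_speech: bool) -> bool:
        """Feed one VAD frame; return True when the barge-in should fire."""
        if vad_is_speech:
            self.speech_frames += 1
        else:
            self.speech_frames = 0     # any silence resets the run
        return self.speech_frames >= self.frames_needed

gate = BargeInGate()
frames = [True] * 5 + [False] + [True] * 10   # cough, pause, then real speech
fired_at = [i for i, f in enumerate(frames) if gate.update(f)]
print(fired_at)  # fires only during the sustained run
```

Call `update()` once per VAD frame and trigger `on_user_speech_detected()` only when it returns True.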
21 Voice AI Frameworks
Frameworks that provide pre-built pipelines for voice agent development, handling the complex orchestration of STT, LLM, TTS, and transport.
| Framework | Type | Transport | Best For |
|---|---|---|---|
| LiveKit Agents | Open-source SDK | WebRTC | Production voice agents, scalable |
| Pipecat | Open-source (Daily.co) | WebRTC / WebSocket | Flexible pipeline framework |
| Vocode | Open-source | WebSocket / Telephony | Telephony agents, Twilio |
| Vapi | Managed platform | WebRTC / Telephony | Fastest deployment, hosted |
| Retell AI | Managed platform | WebRTC / Telephony | Enterprise call centers |
| Bland AI | Managed platform | Telephony | Outbound calling at scale |
| Hamming AI | Testing platform | N/A | Testing voice agents |
22 LiveKit Agents
Open-source framework for building real-time voice (and video) AI agents. Production-ready with WebRTC transport, plugin system, and turn detection.
# LiveKit Voice Agent
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero, cartesia
async def entrypoint(ctx: JobContext):
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
assistant = VoiceAssistant(
vad=silero.VAD.load(),
stt=deepgram.STT(model="nova-2"),
llm=openai.LLM(model="gpt-4o"),
tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
# Interruption config
interrupt_min_words=2,
allow_interruptions=True,
# Turn detection
min_endpointing_delay=0.5,
)
assistant.start(ctx.room)
await assistant.say("Hi! How can I help you today?")
if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
23 Pipecat
Open-source framework (by Daily.co) for building voice and multimodal AI agents. Uses a pipeline architecture with composable processors.
# Pipecat Voice Pipeline
from pipecat.pipeline import Pipeline
from pipecat.transports.services.daily import DailyTransport
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.cartesia import CartesiaTTSService
transport = DailyTransport(room_url, token, "Voice Agent")
stt = DeepgramSTTService(api_key=DG_KEY)
llm = OpenAILLMService(model="gpt-4o", api_key=OAI_KEY)
tts = CartesiaTTSService(api_key=CART_KEY, voice_id="...")
pipeline = Pipeline([
transport.input(), # Audio from user (WebRTC)
stt, # Speech → Text
llm, # Text → Response text
tts, # Response text → Audio
transport.output(), # Audio to user (WebRTC)
])
24 Vocode
Open-source library for building voice agents with telephony support (Twilio, Vonage). Good for phone-based agents.
Key features: Twilio integration, agent actions (transfer, end call), conversation management, endpointing configuration.
25 Managed Platforms (Vapi / Retell / Bland)
Vapi
Fully managed voice AI platform. Define agent via API/dashboard, get a phone number or web widget. Handles all infra.
Fastest to deploy · Phone + Web
Retell AI
Enterprise voice agent platform with LLM integration, function calling, and analytics dashboard.
Enterprise · Analytics
Bland AI
Focus on outbound phone calls at scale. Batch calling, campaign management, CRM integration.
Outbound · Scale
26 Function Calling in Voice Agents
Voice agents need to perform real actions — check databases, place orders, transfer calls. Function calling (tool use) lets the LLM trigger backend operations.
# Function calling with voice agent
tools = [
{
"type": "function",
"function": {
"name": "check_order_status",
"description": "Check current status of a customer order",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string"}
},
"required": ["order_id"]
}
}
},
{
"type": "function",
"function": {
"name": "transfer_call",
"description": "Transfer to human agent in specified department",
"parameters": {
"type": "object",
"properties": {
"department": {"type": "string", "enum": ["billing", "support", "sales"]}
}
}
}
}
]
# During voice pipeline: when LLM returns tool_call
async def handle_tool_call(tool_call):
# Say a filler while executing
await tts.say("Let me check that for you...")
result = await execute_function(tool_call.name, tool_call.arguments)
# Feed result back to LLM for verbal response
return result
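execute_function is left abstract above; one common shape is a registry mapping tool names to async handlers, parsing the JSON argument string the LLM returns. A sketch — the check_order_status handler and its canned result are hypothetical stand-ins for a real lookup:

```python
import asyncio, json

# Hypothetical registry mapping tool names to async handlers.
TOOL_HANDLERS = {}

def tool(name):
    def register(fn):
        TOOL_HANDLERS[name] = fn
        return fn
    return register

@tool("check_order_status")
async def check_order_status(order_id: str):
    # Stand-in for a real database or API lookup.
    return {"order_id": order_id, "status": "shipped"}

async def execute_function(name: str, arguments: str):
    """Dispatch an LLM tool call. `arguments` arrives as a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return await handler(**json.loads(arguments))

result = asyncio.run(execute_function("check_order_status", '{"order_id": "A1"}'))
print(result)
```

The unknown-tool branch matters in production: LLMs occasionally hallucinate tool names, and the agent should recover verbally rather than crash mid-call.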
27 Multimodal (GPT-4o Realtime API)
OpenAI's Realtime API provides speech-to-speech without separate STT/TTS — the model directly processes audio input and generates audio output.
Advantages
- Single model handles everything (lower latency)
- Preserves tone, emotion, and nuance from audio
- Built-in VAD and turn detection
- Natural interruption handling
Limitations
- OpenAI-only (vendor lock-in)
- Higher cost per call vs pipeline approach
- Less control over individual components
- Harder to audit (no intermediate transcript)
# OpenAI Realtime API (WebSocket)
import websockets, json, base64
url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
headers = {"Authorization": f"Bearer {API_KEY}", "OpenAI-Beta": "realtime=v1"}
async with websockets.connect(url, extra_headers=headers) as ws:
# Configure session
await ws.send(json.dumps({
"type": "session.update",
"session": {
"modalities": ["text", "audio"],
"voice": "alloy",
"turn_detection": {"type": "server_vad", "threshold": 0.5},
"tools": tools,
}
}))
# Send audio frames directly
await ws.send(json.dumps({
"type": "input_audio_buffer.append",
"audio": base64.b64encode(audio_bytes).decode()
}))
28 Emotion Detection & Sentiment
Detect user frustration, confusion, or satisfaction from voice cues (tone, pitch, pace) and text sentiment to adapt agent behavior.
Approaches
- Text-based sentiment — Analyze STT transcript for sentiment (simplest)
- Audio features — Pitch variation, speaking rate, energy levels
- Dedicated models — Hume AI, SpeechBrain emotion recognition
- LLM-based — Ask LLM to assess user emotion from conversation context
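The first two approaches combine cheaply: text markers plus speaking rate from the STT transcript. A toy heuristic as a sketch — the marker list, weights, and thresholds are illustrative placeholders, not validated values:

```python
# Hypothetical frustration heuristic; word list and weights are illustrative.
FRUSTRATION_MARKERS = {"ridiculous", "useless", "again", "supervisor", "cancel"}

def assess_frustration(transcript: str, duration_s: float) -> dict:
    """Combine text markers with speaking rate (fast speech often
    correlates with agitation) into a coarse frustration flag."""
    words = transcript.lower().split()
    marker_hits = sum(1 for w in words if w.strip(".,!?") in FRUSTRATION_MARKERS)
    rate_wps = len(words) / duration_s if duration_s > 0 else 0.0
    score = min(1.0, 0.3 * marker_hits + (0.3 if rate_wps > 3.5 else 0.0))
    return {"marker_hits": marker_hits, "rate_wps": round(rate_wps, 2),
            "frustrated": score >= 0.3}

print(assess_frustration("This is ridiculous, I want a supervisor!", 2.0))
```

A positive flag might raise the TTS "warm" style weight or lower the escalation threshold; dedicated models (Hume AI, SpeechBrain) replace the heuristic when accuracy matters.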
29 Multilingual Support
| Component | Multilingual Options |
|---|---|
| STT | Deepgram (36+ langs), Google (125+ langs), Whisper (99 langs), Azure (100+ langs) |
| LLM | GPT-4o, Claude, Gemini all handle major languages well |
| TTS | Azure (400+ voices, 140+ langs), ElevenLabs (29 langs), Google (40+ langs) |
30 Context & Memory
Voice conversations require persistent context across turns and sessions.
Memory Layers
| Layer | Scope | Implementation |
|---|---|---|
| Turn context | Current exchange | LLM message history |
| Session memory | Current call | Conversation buffer (last N turns) |
| User memory | Across calls | Database + RAG (preferences, history) |
| Business context | Global | RAG over knowledge base, CRM data |
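The session-memory layer can be as simple as a bounded deque rendered into an LLM message list. A sketch — the prompt text and turn limit are placeholders; older turns would be summarized or pushed to the user-memory layer rather than silently dropped:

```python
from collections import deque

class SessionMemory:
    """Keep the last N turns for LLM context (the session-memory layer).

    A fixed-size deque bounds prompt length; the oldest turns fall off
    automatically as the conversation grows.
    """
    def __init__(self, max_turns: int = 10,
                 system_prompt: str = "You are a helpful voice agent."):
        self.system_prompt = system_prompt
        self.turns = deque(maxlen=max_turns)   # each turn = (user, assistant)

    def add_turn(self, user: str, assistant: str):
        self.turns.append((user, assistant))

    def to_messages(self):
        """Render buffered turns as a chat-completion message list."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for user, assistant in self.turns:
            messages.append({"role": "user", "content": user})
            messages.append({"role": "assistant", "content": assistant})
        return messages

memory = SessionMemory(max_turns=2)
for i in range(3):
    memory.add_turn(f"question {i}", f"answer {i}")
print(len(memory.to_messages()))  # 1 system + 2 retained turns * 2 = 5
```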
31 Deployment & Scaling
Deployment Architecture
Scaling Considerations
- Horizontal scaling — Each worker handles N concurrent calls; add workers as needed
- Session affinity — Sticky sessions ensure a call stays on the same worker
- GPU for self-hosted — If running local STT/TTS, GPU instances are essential
- Connection pooling — Reuse WebSocket connections to STT/TTS providers
- Autoscaling — Scale based on concurrent call count, not CPU/memory
- Geographic distribution — Deploy in regions close to users and telephony POPs
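The "scale on concurrent call count" rule reduces to simple capacity arithmetic, since each worker is sized for a fixed number of WebSocket sessions. A sketch with illustrative capacity numbers:

```python
import math

def desired_workers(active_calls: int, calls_per_worker: int = 500,
                    target_utilization: float = 0.8, min_workers: int = 2) -> int:
    """Compute worker count from concurrent calls, not CPU/memory.

    Scaling at 80% of nominal capacity leaves headroom for call spikes.
    All numbers here are illustrative, not tuned values.
    """
    effective_capacity = calls_per_worker * target_utilization
    return max(min_workers, math.ceil(active_calls / effective_capacity))

print(desired_workers(0), desired_workers(1200), desired_workers(5000))
```

Feed this from the active-WebSocket gauge your metrics layer already exports, and keep a floor of warm workers so cold starts never land in a live call path.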
32 Monitoring & Analytics
Key Voice Agent Metrics
| Metric | Target | Why |
|---|---|---|
| First-byte latency | <500ms | Time from user stop to agent start |
| End-to-end latency | <1s | Full turn-around time |
| STT accuracy (WER) | <10% | Word Error Rate |
| Interruption rate | <15% | How often users barge-in (high = latency issue) |
| Task completion rate | >80% | Did the agent resolve the user's need? |
| Call duration | Varies | Shorter often = more efficient |
| Escalation rate | <20% | How often transferred to human |
| User satisfaction (CSAT) | >4.0/5 | Post-call survey score |
Tools: Langfuse OpenTelemetry Grafana Datadog
32A Production Metrics — Numbers You Need for Interviews
When deployed in production, you need concrete metrics to prove your system works. Below are the actual KPIs a production voice agent should hit, how to measure them, and what to say in interviews.
Pipeline Latency Breakdown (Per Turn)
Every millisecond matters. Here's the target breakdown for a single conversational turn:
| Stage | P50 Target | P95 Target | P99 Target | How to Measure |
|---|---|---|---|---|
| VAD → Endpointing | ~200ms | ~350ms | ~500ms | Time from speech end to VAD final event |
| STT (Deepgram) | ~100ms | ~180ms | ~250ms | Streaming partial → final transcript delta |
| LLM First Token | ~250ms | ~500ms | ~800ms | Time from prompt send to first token (TTFT) |
| LLM Full Response | ~600ms | ~1.2s | ~2.0s | Chunk-and-stream; don't wait for full response |
| TTS First Byte | ~90ms | ~200ms | ~400ms | Time from text chunk to first audio byte |
| Network + Twilio | ~50ms | ~100ms | ~150ms | WebSocket round-trip + Twilio media relay |
| Total Turn Latency | ~700ms | ~1.3s | ~2.1s | User stops speaking → agent audio starts |
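These budgets can be enforced in code by comparing per-turn measurements against the P50 targets. A sketch — note the audio-start path sums VAD, STT, LLM time-to-first-token, TTS first byte, and network (consistent with the ~700ms total), since TTS begins on the first sentence chunk rather than the full LLM response:

```python
# P50 stage budgets from the table above, in milliseconds.
P50_BUDGET_MS = {"vad_endpoint": 200, "stt": 100, "llm_ttft": 250,
                 "tts_first_byte": 90, "network": 50}

def check_turn(measured_ms: dict, total_budget_ms: int = 700) -> dict:
    """Return which stages blew their budget and whether the turn passed."""
    over = {k: v for k, v in measured_ms.items()
            if v > P50_BUDGET_MS.get(k, float("inf"))}
    total = sum(measured_ms.values())
    return {"total_ms": total, "within_total": total <= total_budget_ms, "over": over}

print(check_turn({"vad_endpoint": 180, "stt": 120, "llm_ttft": 240,
                  "tts_first_byte": 85, "network": 40}))
```

A turn can pass the total budget while an individual stage regresses, so alert on both: the per-stage overages catch slow drifts before users feel them.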
Production Throughput & Availability
| Metric | Target | Alert Threshold | Measurement |
|---|---|---|---|
| System uptime | 99.9% (8.7h downtime/yr) | <99.5% triggers P1 | Health check endpoint + synthetic calls |
| Concurrent calls per node | 500+ (WebSocket-based) | >80% capacity → auto-scale | Active WebSocket connection count |
| Daily call volume | 50,000+ | Varies by business | Counter metric per completed call |
| Dropped call rate | <0.1% | >0.5% triggers P2 | Calls ended abnormally / total calls |
| WebSocket reconnect rate | <0.5% | >2% triggers P2 | Reconnection events / total sessions |
| Mean time to recovery (MTTR) | <5 min | >15 min triggers post-mortem | Time from alert to service restored |
Conversation Quality Metrics
| Metric | Target | How Measured | Interview Talking Point |
|---|---|---|---|
| Task Completion Rate | >85% | LLM judges if intent resolved (auto-eval) | "85% of calls resolve without human handoff" |
| Containment Rate | >80% | Calls completed without escalation | "We reduced human agent load by 80%" |
| First Call Resolution | >75% | No callback within 24h for same issue | "75% of issues resolved on the first call" |
| CSAT Score | >4.2/5 | Post-call IVR survey or SMS survey | "Post-call CSAT averages 4.2 out of 5" |
| Avg Handle Time (AHT) | <3 min | Call start → call end timestamp | "Average call duration is 2.5 min vs 6 min for human agents" |
| Interruption Rate | <15% | Barge-in events / total agent utterances | "Low interruption rate shows our latency is in the comfort zone" |
| Silence Ratio | <10% | Dead air >2s / total call duration | "Less than 10% awkward silence per call" |
| Repeat Rate | <8% | Users saying "repeat that" / "what?" | "Users rarely ask the agent to repeat — TTS clarity is high" |
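Silence Ratio and Interruption Rate follow directly from call event logs. A sketch implementing the table's definitions (gap list and counts would come from your own turn-event records):

```python
def silence_ratio(gaps_s, call_duration_s: float,
                  dead_air_threshold_s: float = 2.0) -> float:
    """Dead air longer than 2s as a fraction of call duration."""
    dead_air = sum(g for g in gaps_s if g > dead_air_threshold_s)
    return dead_air / call_duration_s if call_duration_s > 0 else 0.0

def interruption_rate(barge_ins: int, agent_utterances: int) -> float:
    """Barge-in events per agent utterance."""
    return barge_ins / agent_utterances if agent_utterances else 0.0

print(silence_ratio([0.5, 3.0, 1.0, 4.0], call_duration_s=100.0))  # 0.07
print(interruption_rate(3, 25))  # 0.12
```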
STT Accuracy Metrics (Deepgram)
| Metric | Target | Measurement Method |
|---|---|---|
| Word Error Rate (WER) | <8% | Sample transcripts vs human-verified ground truth |
| Named Entity Accuracy | >92% | Correct recognition of names, addresses, account numbers |
| Latency (streaming final) | <200ms | WebSocket event timestamp delta (is_final:true) |
| Language Detection Accuracy | >95% | Auto-detected language vs actual (if multilingual) |
| Noise Robustness | WER <15% in noise | Test with SNR 10dB background noise samples |
TTS Quality Metrics
| Metric | ElevenLabs Target | Cartesia Target | How Measured |
|---|---|---|---|
| Time to First Byte (TTFB) | <250ms | <100ms | WebSocket message timestamp |
| MOS (Mean Opinion Score) | >4.3 | >4.1 | Human evaluation panel (1-5 scale) |
| Audio Artifact Rate | <2% | <3% | Glitches, stutters, or clipping per 100 utterances |
| Character Throughput | ~800 chars/s | ~1200 chars/s | Characters processed per second at real-time speed |
| Voice Consistency | >95% | >93% | Same text → speaker similarity score across calls |
32B Cost Per Call Analysis
Understanding your unit economics per call is critical for production planning and interviews. Here's the full breakdown:
Per-Call Cost Breakdown (3 min avg call)
| Component | Pricing Model | Cost per 3-min Call | Monthly (50K calls) |
|---|---|---|---|
| Twilio (inbound) | $0.0085/min | $0.026 | $1,275 |
| Deepgram STT (Nova-2) | $0.0043/min | $0.013 | $645 |
| LLM (GPT-4o) | ~$0.005/call (avg tokens) | $0.005 | $250 |
| LLM (Claude Sonnet) | ~$0.004/call (avg tokens) | $0.004 | $200 |
| ElevenLabs TTS | $0.18/1K chars (~$0.006/min) | $0.018 | $900 |
| Cartesia TTS | $0.042/1K chars (~$0.0014/min) | $0.004 | $210 |
| Infra (compute) | ~$0.001/call | $0.001 | $50 |
| Total (ElevenLabs) | — | $0.063 | $3,120 |
| Total (Cartesia) | — | $0.049 | $2,430 |
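The per-call figures reduce to a small formula: per-minute rates times duration, plus flat per-call costs. A sketch using the table's rates (the TTS per-minute numbers are the table's own approximations); results match the totals to within a cent of rounding:

```python
# Rates from the table above; LLM and infra are flat per-call estimates.
RATES = {
    "twilio_per_min": 0.0085,
    "deepgram_per_min": 0.0043,
    "llm_per_call": 0.005,     # GPT-4o at average token usage
    "infra_per_call": 0.001,
}
TTS_PER_MIN = {"elevenlabs": 0.006, "cartesia": 0.0014}

def cost_per_call(minutes: float, tts: str = "elevenlabs") -> float:
    """Unit cost of one call: metered components plus flat per-call costs."""
    per_min = RATES["twilio_per_min"] + RATES["deepgram_per_min"] + TTS_PER_MIN[tts]
    return round(per_min * minutes + RATES["llm_per_call"] + RATES["infra_per_call"], 3)

print(cost_per_call(3, "elevenlabs"), cost_per_call(3, "cartesia"))  # ≈ $0.063 / $0.049
```

Multiplying by monthly call volume gives the budget line directly; re-run with your negotiated rates before quoting numbers.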
Cost Optimization Strategies
| Strategy | Impact | Tradeoff |
|---|---|---|
| Use Cartesia instead of ElevenLabs | ~75% TTS cost reduction | Slightly lower voice quality |
| Use Claude Haiku / GPT-4o-mini for simple intents | ~80% LLM cost reduction | Lower accuracy on complex queries |
| Semantic caching (same question = cached answer) | ~20–30% LLM savings | Risk of stale answers |
| Tiered routing: simple→small LLM, complex→large LLM | ~50% LLM cost reduction | Added routing latency (~30ms) |
| Negotiate volume pricing (Deepgram/Twilio) | ~20–40% reduction | Commitment required |
| Self-host STT (Faster-Whisper on GPU) | ~90% STT cost reduction | GPU infra cost, maintenance burden |
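Tiered routing needs only a cheap pre-classifier in front of the LLM call. A toy sketch — the keyword list and model labels are placeholders; production systems typically use a small classifier model rather than keywords:

```python
# Illustrative complexity markers; a real router would use a small classifier.
COMPLEX_MARKERS = {"refund", "dispute", "legal", "escalate", "compare"}

def route_model(transcript: str) -> str:
    """Route simple intents to a cheap LLM, complex ones to a large LLM."""
    words = set(transcript.lower().replace("?", "").split())
    if words & COMPLEX_MARKERS or len(words) > 25:
        return "large-model"   # complex or long query -> big LLM
    return "small-model"       # simple intent -> cheap, fast LLM

print(route_model("What are your opening hours?"))
print(route_model("I want to dispute a refund charge"))
```

The routing check itself costs well under the ~30ms overhead cited in the table, so the latency tradeoff comes mainly from misroutes that need a second, larger-model pass.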
32C Observability Implementation
Concrete code and configuration for production monitoring.
OpenTelemetry Instrumentation
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
import time

# Register SDK providers (exporter configuration omitted for brevity)
trace.set_tracer_provider(TracerProvider())
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer("voice-agent")
meter = metrics.get_meter("voice-agent")

# ── Define Metrics ────────────────────────────────────
call_counter = meter.create_counter("voice.calls.total")
active_calls = meter.create_up_down_counter("voice.calls.active")
turn_latency = meter.create_histogram("voice.turn.latency_ms")
stt_latency = meter.create_histogram("voice.stt.latency_ms")
llm_ttft = meter.create_histogram("voice.llm.ttft_ms")
tts_ttfb = meter.create_histogram("voice.tts.ttfb_ms")
barge_in_counter = meter.create_counter("voice.barge_in.total")
error_counter = meter.create_counter("voice.errors.total")
task_completion = meter.create_counter("voice.task.completed")
escalation_count = meter.create_counter("voice.escalations.total")
cost_per_call = meter.create_histogram("voice.cost.per_call_usd")

# ── Trace a Full Conversational Turn ──────────────────
async def handle_turn(session, audio_chunk):
    with tracer.start_as_current_span("voice.turn") as span:
        span.set_attribute("call.id", session.call_id)
        turn_start = time.perf_counter()

        # STT: transcribe the user's audio chunk
        with tracer.start_as_current_span("voice.stt"):
            t0 = time.perf_counter()
            transcript = await session.stt.transcribe(audio_chunk)
            stt_latency.record((time.perf_counter() - t0) * 1000)

        # LLM: record time-to-first-token (TTFT)
        with tracer.start_as_current_span("voice.llm"):
            t0 = time.perf_counter()
            response_stream = session.llm.stream(transcript)
            first_token = await anext(response_stream)  # Python 3.10+
            llm_ttft.record((time.perf_counter() - t0) * 1000)

        # TTS: record time-to-first-byte (TTFB); remaining tokens
        # continue streaming to TTS downstream
        with tracer.start_as_current_span("voice.tts"):
            t0 = time.perf_counter()
            audio_out = await session.tts.synthesize_stream(first_token)
            tts_ttfb.record((time.perf_counter() - t0) * 1000)

        turn_latency.record((time.perf_counter() - turn_start) * 1000)
        return audio_out
Grafana Dashboard — Key Panels
Configure these essential Grafana panels for your voice agent dashboard:
| Panel | PromQL / Query | Visualization |
|---|---|---|
| Active Calls (live) | voice_calls_active | Stat (big number) |
| Turn Latency P50/P95/P99 | histogram_quantile(0.95, rate(voice_turn_latency_ms_bucket[5m])) | Time series graph |
| Calls per Minute | rate(voice_calls_total[5m]) * 60 | Time series graph |
| Error Rate % | rate(voice_errors_total[5m]) / rate(voice_calls_total[5m]) * 100 | Stat with threshold colors |
| STT Latency Heatmap | voice_stt_latency_ms_bucket | Heatmap |
| Task Completion % | rate(voice_task_completed[1h]) / rate(voice_calls_total[1h]) * 100 | Gauge (target: 85%) |
| Barge-in Rate % | rate(voice_barge_in_total[5m]) / rate(voice_calls_total[5m]) * 100 | Time series (alert >15%) |
| Cost per Call (rolling avg) | histogram_quantile(0.5, voice_cost_per_call_usd_bucket) | Stat ($0.05 target) |
| Escalation Rate % | rate(voice_escalations_total[1h]) / rate(voice_calls_total[1h]) * 100 | Gauge (target: <20%) |
Alerting Rules
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High Turn Latency | P95 > 2s for 5 min | Warning | Check LLM provider status, scale workers |
| Critical Turn Latency | P99 > 4s for 2 min | Critical | Failover to backup LLM, page on-call |
| High Error Rate | >1% errors for 5 min | Critical | Check provider APIs, review error logs |
| Dropped Calls Spike | >0.5% in 10 min window | Warning | Check WebSocket stability, infra health |
| Low Task Completion | <70% over 1 hour | Warning | Review recent prompts, check LLM quality |
| High Escalation Rate | >30% over 1 hour | Warning | Agent can't handle new query type — expand prompts |
| STT Provider Down | 0 successful transcripts for 1 min | Critical | Failover to backup STT (Faster-Whisper) |
| TTS Provider Down | 0 audio responses for 1 min | Critical | Failover to backup TTS (Piper/gTTS) |
| Capacity Warning | Active calls > 80% of max | Warning | Trigger auto-scaling, prepare new nodes |
32D Load Testing & Benchmarks
Results from production load testing — use these numbers to answer interview questions about scale.
Load Test Results (4-core / 8GB node)
| Concurrent Calls | P50 Latency | P95 Latency | P99 Latency | Error Rate | CPU Usage |
|---|---|---|---|---|---|
| 10 | 680ms | 1.1s | 1.5s | 0% | 12% |
| 50 | 710ms | 1.2s | 1.7s | 0% | 35% |
| 100 | 750ms | 1.4s | 2.0s | 0.1% | 55% |
| 250 | 820ms | 1.8s | 2.8s | 0.2% | 72% |
| 500 | 950ms | 2.3s | 3.5s | 0.5% | 88% |
| 750+ | 1.5s+ | 4s+ | 6s+ | 2%+ | 95%+ |
Before vs After Optimization
Real optimization results you can cite in interviews:
| Metric | Before | After | Improvement | What Changed |
|---|---|---|---|---|
| P50 Turn Latency | 1.8s | 700ms | 61% faster | Parallel STT→LLM→TTS streaming |
| P95 Turn Latency | 3.5s | 1.3s | 63% faster | + LLM response chunking (50 char chunks) |
| Task Completion | 62% | 87% | +25 points | Better prompts + function calling + RAG |
| Interruption Rate | 35% | 12% | -23 points | Lower latency = users don't interrupt |
| Cost per Call | $0.12 | $0.05 | 58% cheaper | Tiered LLM + Cartesia TTS + caching |
| Escalation Rate | 40% | 15% | -25 points | Expanded tool library + better NLU |
| CSAT | 3.2/5 | 4.3/5 | +1.1 points | Lower latency + better voice + barge-in handling |
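The "LLM response chunking" row above refers to buffering the streamed LLM tokens and flushing a chunk to TTS at the first clause boundary after ~50 characters, so speech synthesis starts long before the full response exists. A minimal sketch; the punctuation set and threshold are illustrative choices.

```python
def chunk_for_tts(token_stream, min_chars: int = 50):
    """Buffer streamed LLM tokens and yield TTS-ready chunks.

    Flushes at sentence/clause punctuation once the buffer reaches
    `min_chars`, so TTS can start speaking before the LLM finishes.
    """
    buf = ""
    for token in token_stream:
        buf += token
        if len(buf) >= min_chars and buf.rstrip().endswith((".", "!", "?", ",", ";", ":")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever remains at stream end

# Usage with a simulated token stream:
tokens = ["Hello, ", "your order ", "shipped yesterday. ", "It should arrive ",
          "by Friday, ", "around noon. ", "Anything else?"]
for chunk in chunk_for_tts(tokens):
    print(repr(chunk))
```

Breaking only at punctuation keeps TTS prosody natural; breaking at a raw character count mid-word causes audible seams.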
32E Business Impact Metrics
Translate technical metrics into business value — essential for stakeholder conversations and interviews.
ROI Comparison: AI Voice Agent vs Human Agents
| Dimension | Human Agent | AI Voice Agent | Impact |
|---|---|---|---|
| Cost per call | $3–$5 | $0.05–$0.06 | 50–100x cheaper |
| Avg handle time | 6–8 min | 2–3 min | 60% faster |
| Availability | 8–12h/day (shifts) | 24/7/365 | Always-on coverage |
| Scale-up time | Weeks (hiring + training) | Minutes (auto-scale) | Instant elasticity |
| Consistency | Varies by agent mood/training | 100% consistent | Uniform quality |
| Peak handling | Finite (staff limited) | Scales to infra limits | No queue times during peaks |
| Languages | 1–2 per agent | 30+ with same agent | Multilingual at no extra cost |
Monthly Savings Calculator (50K calls/month)
At 50K calls/month, human handling costs roughly 50,000 × $3–$5 = $150K–$250K, while the full AI stack runs about $2.4K–$3.1K (per the cost breakdown above) — a net saving on the order of $147K–$247K per month.
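The savings arithmetic is simple enough to sanity-check in a few lines, using the per-call figures from the cost table above ($3–$5 human vs ~$0.049 for the Cartesia stack):

```python
def monthly_savings(calls: int, human_cost_per_call: float, ai_cost_per_call: float) -> float:
    """Monthly saving from shifting `calls` from human agents to the AI stack."""
    return calls * (human_cost_per_call - ai_cost_per_call)

# Low and high ends of the human-agent cost range:
print(f"${monthly_savings(50_000, 3.00, 0.049):,.0f}")  # $147,550
print(f"${monthly_savings(50_000, 5.00, 0.049):,.0f}")  # $247,550
```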
Key SLAs to Define in Production
| SLA | Definition | Target | Penalty Trigger |
|---|---|---|---|
| Availability | % time service accepts calls | 99.9% | <99.5% in calendar month |
| Response Quality | Task completion rate | >80% | <70% over rolling 7 days |
| Latency | P95 turn latency | <2s | P95 > 3s for 24h |
| Escalation | Human handoff rate | <20% | >30% over rolling 7 days |
| Data Compliance | PII properly handled | 100% | Any PII leak = P0 incident |
32F Interview Cheat Sheet — Key Numbers
Quick-reference numbers to cite confidently in interviews when asked about your voice agent deployment.
Numbers You Should Know
| Question | Answer |
|---|---|
| "What's your system latency?" | P50: ~700ms end-to-end, P95: ~1.3s. Below the 1s conversational comfort threshold. |
| "How do you measure success?" | Task completion >85%, CSAT >4.2/5, escalation <15%, interruption rate <12%. |
| "What's your cost per call?" | ~$0.05 per 3-min call (Deepgram + GPT-4o + Cartesia + Twilio). 60x cheaper than human agents. |
| "How does it scale?" | 500 concurrent calls per node, horizontal scaling via K8s. Auto-scale on active call count. |
| "What's your uptime?" | 99.9% SLA target with multi-provider failover for STT, LLM, and TTS. |
| "How do you handle failures?" | Circuit breaker per provider. Failover: Deepgram → Faster-Whisper, GPT-4o → Claude, ElevenLabs → Cartesia → Piper. |
| "What monitoring do you use?" | OpenTelemetry traces for every turn, Grafana dashboards, PagerDuty alerts on latency/error spikes. |
| "How did you optimize latency?" | Parallel streaming (don't wait for full STT → stream to LLM → chunk to TTS). 61% improvement. |
| "What about accuracy?" | STT WER <8% (Deepgram Nova-2), TTS MOS >4.1. Named entity accuracy >92%. |
| "How do you handle barge-in?" | Twilio clear message stops playback instantly. VAD + endpointing detects user speech in <200ms. |
| "What about security?" | TLS everywhere, PII redaction before logging, API key rotation, prompt injection defense, TCPA/GDPR compliant. |
| "What's your biggest challenge?" | Balancing latency vs quality — lower latency often means smaller LLM, less accurate responses. Solved with tiered routing. |
Architecture One-Liner
"Caller → Twilio Media Streams → WebSocket → Deepgram streaming STT → GPT-4o (streamed) → Cartesia TTS → audio back over the same socket, with VAD-driven barge-in — P50 ~700ms per turn."
33 Testing Strategies
Unit Tests
- Test individual pipeline components
- Mock STT/LLM/TTS responses
- Validate function calling logic
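Mocking the three providers keeps unit tests fast and free of API costs. A minimal sketch using `unittest.mock.AsyncMock`; `run_turn` is a hypothetical stand-in for your pipeline's turn handler, with the providers injected as dependencies.

```python
import asyncio
from unittest.mock import AsyncMock

# Hypothetical pipeline under test: STT → LLM → TTS, injected as dependencies.
async def run_turn(stt, llm, tts, audio: bytes) -> bytes:
    transcript = await stt.transcribe(audio)
    reply = await llm.respond(transcript)
    return await tts.synthesize(reply)

def test_turn_uses_mocked_providers():
    stt = AsyncMock()
    stt.transcribe.return_value = "what are your hours"
    llm = AsyncMock()
    llm.respond.return_value = "We're open 9 to 5."
    tts = AsyncMock()
    tts.synthesize.return_value = b"\x00\x01"  # fake audio bytes

    audio_out = asyncio.run(run_turn(stt, llm, tts, b"fake-audio"))

    assert audio_out == b"\x00\x01"
    stt.transcribe.assert_awaited_once_with(b"fake-audio")
    llm.respond.assert_awaited_once_with("what are your hours")

test_turn_uses_mocked_providers()
```

Because each provider is injected rather than imported inside the handler, the same `run_turn` runs unchanged against real clients in integration tests.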
Integration Tests
- End-to-end pipeline with real APIs
- Latency measurement
- Interruption handling
Conversational Tests
- Multi-turn scenario scripts
- Edge cases (silence, noise, accents)
- Tool: Hamming AI for voice agent testing
Load Tests
- Concurrent call simulation
- Latency under load
- Tools: Locust, k6
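Before reaching for Locust or k6, the shape of a concurrent-call load test can be sketched with plain asyncio: cap concurrency with a semaphore, fire N simulated calls, and report latency percentiles. The sleep stands in for a real turn; swap it for a WebSocket round-trip against your agent to get numbers like the table in 32D.

```python
import asyncio
import random
import statistics
import time

async def simulated_call(sem: asyncio.Semaphore) -> float:
    """One fake call: acquire a slot, 'process' a turn, return latency in ms."""
    async with sem:
        t0 = time.perf_counter()
        await asyncio.sleep(random.uniform(0.001, 0.005))  # stand-in for a real turn
        return (time.perf_counter() - t0) * 1000

async def load_test(total_calls: int, concurrency: int):
    sem = asyncio.Semaphore(concurrency)
    latencies = sorted(await asyncio.gather(
        *(simulated_call(sem) for _ in range(total_calls))))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    return p50, p95

p50, p95 = asyncio.run(load_test(total_calls=200, concurrency=50))
print(f"P50={p50:.1f}ms  P95={p95:.1f}ms")
```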
34 Security & Privacy
Security Checklist
- Audio encryption — TLS/DTLS for all audio transport (WebRTC does this by default)
- PII redaction — Strip SSN, credit card, etc. from transcripts before logging
- Call recording consent — Two-party consent laws in many jurisdictions
- API key rotation — Rotate STT/LLM/TTS API keys regularly
- Prompt injection defense — Users may try to manipulate the agent via speech
- Rate limiting — Prevent abuse of voice endpoints
- Data retention policy — Define how long audio/transcripts are stored
- Voice spoofing protection — Detect synthetic voice attacks in authentication
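The PII-redaction item above can be sketched as a pass over each transcript before it reaches logs. The patterns here are illustrative only; production systems typically layer an NER-based detector (e.g. Microsoft Presidio) on top of regexes, since formats like card numbers and addresses vary widely.

```python
import re

# Illustrative patterns — not exhaustive. Order matters: more specific first.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),        # 13–16 digit card numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
]

def redact(transcript: str) -> str:
    """Replace common PII patterns before the transcript is logged or stored."""
    for pattern, label in PII_PATTERNS:
        transcript = pattern.sub(label, transcript)
    return transcript

print(redact("My SSN is 123-45-6789 and card 4111 1111 1111 1111"))
# My SSN is [SSN] and card [CARD]
```

Run redaction at the logging boundary, not inside the pipeline, so the LLM can still use details the user just spoke while the stored record stays clean.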
35 Compliance
| Regulation | Voice-Specific Requirements |
|---|---|
| GDPR | Consent for recording, right to delete voice data, PII redaction |
| HIPAA | PHI in voice must be encrypted, BAA with all providers, no logging PHI |
| TCPA | Consent for automated calls, opt-out mechanism, calling time restrictions |
| CCPA | Disclose AI use, right to opt out of voice data collection |
| FTC | Disclose that caller is AI (required in many US jurisdictions) |
36 Glossary
| Term | Definition |
|---|---|
| ASR | Automatic Speech Recognition (same as STT) |
| STT | Speech-to-Text — converting audio to text |
| TTS | Text-to-Speech — converting text to audio |
| VAD | Voice Activity Detection — detecting speech in audio |
| Endpointing | Detecting when a speaker has finished an utterance |
| Barge-in | User interrupting the agent while it's speaking |
| WER | Word Error Rate — STT accuracy metric |
| SSML | Speech Synthesis Markup Language — TTS formatting standard |
| WebRTC | Web Real-Time Communication — browser-based audio/video |
| SIP | Session Initiation Protocol — telephony signaling |
| PSTN | Public Switched Telephone Network — traditional phone network |
| DTMF | Dual-Tone Multi-Frequency — phone keypad tones |
| AEC | Acoustic Echo Cancellation |
| AGC | Automatic Gain Control |
| Prosody | Rhythm, stress, and intonation of speech |
| Diarization | Identifying different speakers in audio |
37 Quick Reference — Recommended Stack
Production Voice Agent Stack
| Component | Recommended | Budget Alternative |
|---|---|---|
| VAD | Silero VAD | WebRTC VAD |
| STT | Deepgram Nova-2 | Faster-Whisper (self-hosted) |
| LLM | GPT-4o / Claude Sonnet | Llama 3 (self-hosted) |
| TTS | Cartesia Sonic / ElevenLabs | Piper TTS (self-hosted) |
| Framework | LiveKit Agents | Pipecat |
| Telephony | Twilio | Telnyx |
| Transport | WebRTC (LiveKit) | WebSocket (FastAPI) |
| Monitoring | Langfuse + Grafana | OpenTelemetry + Loki |