Building a Voice Agent

End-to-end technical guide — from microphone input to spoken response, covering STT, NLU, LLM, TTS, real-time streaming, telephony, and production deployment.

01 Overview

A voice agent is an AI system that listens to human speech, understands intent, reasons over context, and responds with natural-sounding speech — all in real time. Modern voice agents combine automatic speech recognition (ASR/STT), large language models (LLMs), and neural text-to-speech (TTS) into a low-latency pipeline.

  • Target first-byte latency: <500ms
  • End-to-end response time: <1s
  • Word recognition accuracy: 95%+
  • Turn-taking gap (human avg): <150ms

Key Challenges

  • Latency — Humans expect sub-second responses; every millisecond matters
  • Interruption handling — Users barge-in mid-sentence; agent must stop and listen
  • Ambient noise — Real-world audio is noisy; robust VAD and ASR needed
  • Turn-taking — Detecting when the user has finished speaking (endpointing)
  • Naturalness — TTS must sound human, with proper prosody and emotion
  • Context retention — Multi-turn conversations require persistent memory
  • Concurrent calls — Production systems handle thousands of simultaneous calls
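Interruption handling deserves a concrete shape early. A minimal barge-in sketch, where `player` is a hypothetical audio-output handle (not a real library) that the rest of this guide's TTS stage would feed:

```python
import asyncio

class BargeInController:
    """Stop agent playback the moment the user starts speaking.
    `player` is a hypothetical audio-output interface with play()/stop()."""

    def __init__(self, player):
        self.player = player
        self.agent_speaking = False

    async def on_agent_audio(self, chunk: bytes):
        # Agent TTS audio flows out through the player while speaking
        self.agent_speaking = True
        await self.player.play(chunk)

    async def on_user_speech_started(self):
        # VAD/STT signals a barge-in: cut playback and return to listening
        if self.agent_speaking:
            await self.player.stop()
            self.agent_speaking = False
```

The key design point: the interrupt signal must come from VAD speech-start events, not from the transcript, so playback stops within one audio frame of the user speaking.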

02 System Architecture

VOICE AGENT ARCHITECTURE

Caller (Phone/SIP or Web)
        ║ audio (mulaw)
        ▼
TWILIO: provision #, Media Streams, call control, recording, DTMF, SIP trunk
        ║ WebSocket (Media Stream, mulaw 8kHz)
        ▼
VOICE AGENT SERVER
        ├─ VAD (Silero) ──▶ DEEPGRAM STT (Nova-2, mulaw 8kHz)
        ├─ LLM (GPT-4o / Claude)
        └─ TTS ENGINE: ElevenLabs (quality) OR Cartesia (speed), output ulaw_8000
        ║ synthesized audio
        ▼
Twilio sends audio back to the caller ──▶ Caller hears agent

SERVICES: Function Calling │ RAG │ Memory │ CRM │ Analytics
Why this stack: Twilio handles telephony and call management. Deepgram provides the fastest streaming STT with native mulaw support (no transcoding from Twilio). ElevenLabs delivers premium voice quality; Cartesia delivers minimum-latency TTS. Both can output ulaw_8000 directly for Twilio — zero transcoding overhead.

Component Responsibilities

Component | Role                                                  | Latency Target
VAD       | Detect when user is speaking vs silence/noise         | <10ms
STT / ASR | Convert audio stream to text (transcription)          | 50–300ms
NLU       | Extract intent, entities, sentiment from text         | 10–50ms
LLM       | Generate contextual response (reasoning engine)       | 200–800ms (first token)
TTS       | Convert response text to audio waveform               | 50–200ms (first byte)
Transport | Bi-directional audio streaming (WebSocket/WebRTC/SIP) | <50ms

03 Voice Pipeline (Step by Step)

Mic Input → Audio Chunks (20ms frames) → VAD Filter → Streaming STT
  → Interim Transcripts → Endpointing → Final Transcript
  → LLM (streaming) → Token Stream → Sentence Buffer
  → TTS (streaming) → Audio Chunks Out → Speaker
Key insight: The entire pipeline must be streaming end-to-end. You don't wait for the full STT transcript before calling the LLM, and you don't wait for the full LLM response before starting TTS. Each stage feeds the next incrementally.
# Pseudocode: Streaming voice pipeline
async def voice_pipeline(audio_stream):
    # Stage 1: VAD → filter silence
    speech_chunks = vad.filter(audio_stream)

    # Stage 2: Streaming STT → interim + final transcripts
    async for transcript in stt.transcribe_stream(speech_chunks):
        if transcript.is_final:
            # Stage 3: Stream LLM response token by token
            sentence_buffer = ""
            async for token in llm.stream(transcript.text, context):
                sentence_buffer += token

                # Stage 4: Send complete sentences to TTS
                if ends_with_punctuation(sentence_buffer):
                    async for audio in tts.synthesize_stream(sentence_buffer):
                        yield audio  # → speaker
                    sentence_buffer = ""
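The `ends_with_punctuation` helper used in the pseudocode can start as a plain suffix check; a minimal sketch (naive on purpose: abbreviations like "Dr." would false-positive, which a production version should special-case):

```python
SENTENCE_ENDINGS = (".", "!", "?")

def ends_with_punctuation(text: str) -> bool:
    """True when the buffered text ends in sentence-final punctuation
    (trailing whitespace ignored)."""
    return text.rstrip().endswith(SENTENCE_ENDINGS)
```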

04 Latency Budget

Voice agents are latency-critical. Humans perceive pauses >600ms as unnatural. The target is <1 second from end of user speech to beginning of agent speech.

USER STOPS SPEAKING
  │
  ├─── Endpointing delay ──────── 100–300ms
  ├─── STT finalization ───────── 50–150ms
  ├─── LLM first token ────────── 200–500ms
  ├─── TTS first byte ─────────── 50–200ms
  ├─── Network + buffer ───────── 30–100ms
  ▼
AGENT STARTS SPEAKING ─────────── TOTAL: 430–1250ms
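The per-stage ranges sum directly to the quoted total; a quick sketch of the arithmetic:

```python
# Per-stage latency ranges in milliseconds, from the budget above
STAGES = {
    "endpointing_delay":  (100, 300),
    "stt_finalization":   (50, 150),
    "llm_first_token":    (200, 500),
    "tts_first_byte":     (50, 200),
    "network_and_buffer": (30, 100),
}

def total_budget(stages):
    """Sum the best-case and worst-case bounds across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high

# total_budget(STAGES) → (430, 1250), i.e. 430–1250ms end to end
```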

Latency Optimization Techniques

Technique           | Savings   | How
Streaming STT       | 200–500ms | Don't wait for end-of-utterance; use interim results
LLM streaming       | 500ms+    | Start TTS on first sentence, not full response
TTS streaming       | 200–400ms | Begin audio playback before full synthesis completes
Sentence-level TTS  | 100–300ms | Buffer LLM tokens into sentences for TTS chunks
Speculative prefill | 100–200ms | Start LLM prompt while STT is still finalizing
Semantic caching    | 300–700ms | Cache responses for common queries
Edge deployment     | 50–150ms  | Co-locate STT/TTS near users (reduce network hops)
Shorter endpointing | 100–200ms | Tune VAD silence threshold (risk: premature cutoff)
Warm connections    | 50–100ms  | Keep persistent connections to STT/LLM/TTS APIs
Latency vs Accuracy tradeoff: Shorter endpointing reduces wait time but may cut off the user mid-thought. Most systems use 300–500ms silence threshold and allow barge-in to recover.
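The warm-connections technique amounts to caching live backend handles between turns instead of reconnecting. A sketch, where `connect_fn` is a hypothetical coroutine that opens a connection to a named backend:

```python
import asyncio

class WarmConnectionPool:
    """Open each backend connection (STT/LLM/TTS) once and reuse it,
    rather than paying TCP+TLS setup (~50-100ms) on every turn.
    `connect_fn` is a hypothetical coroutine, not a real API."""

    def __init__(self, connect_fn):
        self.connect_fn = connect_fn
        self._conns = {}
        self._lock = asyncio.Lock()  # serialize first-time connects

    async def get(self, backend: str):
        async with self._lock:
            if backend not in self._conns:
                self._conns[backend] = await self.connect_fn(backend)
            return self._conns[backend]
```

A production version would also detect dropped connections and reconnect lazily; the point here is only that connection setup happens once per backend, not once per turn.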

05 Speech-to-Text (STT / ASR)

Automatic Speech Recognition converts audio waveforms into text. For voice agents, streaming STT is essential — results must arrive incrementally as the user speaks.

Key STT Concepts

  • Streaming vs Batch — Streaming gives interim results in real-time; batch processes complete files
  • Interim (partial) results — Unstable text that updates as more audio arrives
  • Final results — Stable transcript after endpointing detects end of utterance
  • Endpointing — Detecting when the user has stopped speaking (silence duration)
  • Word-level timestamps — Timing for each word (useful for alignment and analytics)
  • Speaker diarization — Identifying different speakers in multi-party audio
  • Custom vocabulary — Boost recognition of domain-specific terms

06 STT Engines Compared

Engine           | Type                      | Streaming       | Latency | Best For
Deepgram         | Cloud API                 | Yes (WebSocket) | ~100ms  | Lowest latency, voice agents
Google Cloud STT | Cloud API                 | Yes (gRPC)      | ~200ms  | Multi-language, enterprise
Azure Speech     | Cloud API                 | Yes (WebSocket) | ~150ms  | Microsoft ecosystem
AWS Transcribe   | Cloud API                 | Yes (WebSocket) | ~250ms  | AWS ecosystem
AssemblyAI       | Cloud API                 | Yes (WebSocket) | ~200ms  | Accuracy, LeMUR integration
OpenAI Whisper   | Open-source / API         | No (batch only) | 1–5s    | Accuracy, self-hosted, offline
Whisper.cpp      | Open-source (C++)         | Pseudo-stream   | ~500ms  | Edge/local deployment
Faster-Whisper   | Open-source (CTranslate2) | No              | ~300ms  | Fast self-hosted batch
Vosk             | Open-source               | Yes             | ~200ms  | Offline, lightweight, edge
# Deepgram Streaming STT (WebSocket)
import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

dg = DeepgramClient(api_key="YOUR_KEY")
connection = dg.listen.asyncwebsocket.v("1")

async def on_message(self, result, **kwargs):
    transcript = result.channel.alternatives[0].transcript
    is_final = result.is_final
    if is_final and transcript:
        print(f"Final: {transcript}")
        # → Send to LLM

connection.on(LiveTranscriptionEvents.Transcript, on_message)

options = LiveOptions(
    model="nova-2",
    language="en",
    encoding="linear16",
    sample_rate=16000,
    interim_results=True,
    endpointing=300,  # ms of silence before final
    smart_format=True,
    vad_events=True,
)
await connection.start(options)

# Send audio chunks (20ms frames); send() is a coroutine on the async client
async for chunk in mic_stream:
    await connection.send(chunk)

06A Why Deepgram — Deep Dive

Deepgram is the recommended STT engine for production voice agents. Here's a detailed analysis of why it outperforms alternatives for real-time conversational AI.

Why Deepgram Over Alternatives

Criteria              | Deepgram                        | Google Cloud STT | Whisper (OpenAI) | Azure Speech
Streaming Latency     | ~100ms (best-in-class)          | ~200ms           | N/A (batch only) | ~150ms
Native WebSocket      | Yes (first-class)               | gRPC only        | No               | Yes
Built-in Endpointing  | Yes (configurable ms)           | Limited          | No               | Yes
Built-in VAD Events   | Yes                             | No               | No               | Limited
Word-level Timestamps | Yes                             | Yes              | Yes              | Yes
Smart Formatting      | Auto (numbers, dates, currency) | Manual config    | No               | Yes
Cost (per minute)     | $0.0043 (~$0.26/hr)             | $0.024           | $0.006 (API)     | $0.016
Custom Vocabulary     | Keywords + model training       | Phrase hints     | Prompt only      | Phrase lists
Voice Agent Optimized | Yes (Nova-2 model)              | General purpose  | General purpose  | General purpose

Key Reasons to Choose Deepgram

  1. Lowest streaming latency in the industry (~100ms) — Deepgram's end-to-end deep learning model is purpose-built for real-time. Unlike traditional ASR pipelines (acoustic model → language model → decoder), Deepgram uses a single neural network that processes audio directly, eliminating inter-stage latency.
  2. Native WebSocket API designed for voice agents — Deepgram's primary API is a persistent WebSocket connection that accepts raw audio frames and returns JSON transcripts. This is exactly what voice agents need — no gRPC complexity (Google), no REST polling (Whisper), no SDK abstraction overhead.
  3. Built-in endpointing and VAD events — Deepgram detects when users stop speaking and emits speech_final and utterance_end events with configurable silence thresholds. Other STT engines require you to implement VAD and endpointing separately.
  4. Smart formatting out of the box — Automatically formats numbers ("three hundred" → "300"), dates, currency, and punctuation. This means the text sent to the LLM is clean and structured without post-processing.
  5. Cost-effective at scale — At $0.0043/minute for Nova-2, Deepgram is 4–6x cheaper than Google Cloud STT and Azure Speech, which matters significantly when handling thousands of concurrent calls.
  6. Nova-2 model specifically optimized for conversational speech — Unlike Whisper (optimized for transcription accuracy on long-form audio), Nova-2 is trained on conversational, real-time speech patterns with lower word error rates on voice agent dialogue.

06B Deepgram Features for Voice Agents

Endpointing Configuration

Fine-tune when Deepgram considers a user utterance "done." Lower values = faster response but risk cutting off the user.

endpointing=300    # 300ms silence = end of utterance
endpointing=500    # 500ms for cautious endpointing
endpointing=False  # Disable (you handle endpointing yourself)

Utterance Detection

Separate from endpointing — detects utterance boundaries even in continuous speech.

utterance_end_ms=1000  # Gap between utterances
interim_results=True   # Get partial transcripts
vad_events=True        # Speech start/stop events

Smart Formatting

Auto-converts spoken forms to written forms for cleaner LLM input.

  • "three hundred dollars" → "$300"
  • "january fifth twenty twenty six" → "January 5, 2026"
  • "one two three four" → "1234" (in number context)

Keyword Boosting

Boost recognition of domain-specific terms that the model might miss.

keywords=[
  "Acme:2",         # Boost "Acme" by 2x
  "SKU:1.5",        # Product codes
  "onboarding:1.5"  # Domain terms
]

06C Deepgram Implementation

# Complete Deepgram Streaming STT for Voice Agent
import asyncio, json
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    LiveTranscriptionEvents,
    LiveOptions,
)

class DeepgramSTTEngine:
    """Production-ready Deepgram STT wrapper for voice agents."""

    def __init__(self, api_key: str, on_transcript, on_speech_started=None):
        self.client = DeepgramClient(api_key, DeepgramClientOptions(
            options={"keepalive": "true"}  # Persistent connection
        ))
        self.on_transcript = on_transcript
        self.on_speech_started = on_speech_started
        self.connection = None

    async def connect(self):
        self.connection = self.client.listen.asyncwebsocket.v("1")

        # Register event handlers
        self.connection.on(LiveTranscriptionEvents.Transcript, self._on_message)
        self.connection.on(LiveTranscriptionEvents.SpeechStarted, self._on_speech_started)
        self.connection.on(LiveTranscriptionEvents.UtteranceEnd, self._on_utterance_end)
        self.connection.on(LiveTranscriptionEvents.Error, self._on_error)

        options = LiveOptions(
            model="nova-2",            # Best for conversational speech
            language="en",
            encoding="linear16",       # 16-bit PCM
            sample_rate=16000,        # 16kHz mono
            channels=1,
            interim_results=True,     # Get partial transcripts for UI
            endpointing=300,          # 300ms silence = final
            utterance_end_ms=1000,    # Utterance boundary detection
            smart_format=True,        # Auto-format numbers, dates
            punctuate=True,           # Add punctuation
            vad_events=True,          # Speech start/stop events
            filler_words=False,       # Remove "um", "uh"
        )

        if not await self.connection.start(options):
            raise ConnectionError("Failed to connect to Deepgram")
        print("✓ Deepgram STT connected")

    async def send_audio(self, audio_bytes: bytes):
        """Send raw audio chunk (20ms frame = 640 bytes at 16kHz/16bit)."""
        if self.connection:
            await self.connection.send(audio_bytes)

    async def _on_message(self, _self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if not transcript:
            return

        if result.is_final:
            # Final transcript → send to LLM
            confidence = result.channel.alternatives[0].confidence
            await self.on_transcript(transcript, is_final=True, confidence=confidence)
        else:
            # Interim → update UI only
            await self.on_transcript(transcript, is_final=False)

    async def _on_speech_started(self, _self, speech_started, **kwargs):
        # User started speaking → interrupt agent if needed
        if self.on_speech_started:
            await self.on_speech_started()

    async def _on_utterance_end(self, _self, utterance_end, **kwargs):
        # Clean boundary between utterances
        pass

    async def _on_error(self, _self, error, **kwargs):
        print(f"Deepgram error: {error}")

    async def close(self):
        if self.connection:
            await self.connection.finish()

Deepgram Audio Format Requirements

Parameter   | Recommended       | Why
Sample Rate | 16,000 Hz         | Standard for speech; higher adds bandwidth without improving recognition
Bit Depth   | 16-bit (linear16) | Good dynamic range, supported by all providers
Channels    | 1 (mono)          | Speech is mono; stereo wastes bandwidth
Frame Size  | 20ms (640 bytes)  | Standard VoIP frame size; balances latency and efficiency
From Twilio | mulaw 8kHz        | Telephony standard; Deepgram accepts mulaw natively
Twilio Integration: When receiving audio from Twilio Media Streams, the audio is mulaw at 8kHz. Deepgram accepts this directly — set encoding="mulaw" and sample_rate=8000. No transcoding needed.
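Putting that note into options form, a sketch of the Deepgram settings for a Twilio Media Streams leg (same `LiveOptions` fields used earlier in this guide; the variable name is illustrative):

```python
from deepgram import LiveOptions

# Audio from Twilio Media Streams is 8kHz mulaw; Deepgram ingests it as-is.
twilio_options = LiveOptions(
    model="nova-2",
    encoding="mulaw",     # Twilio telephony format, no transcoding
    sample_rate=8000,     # telephony standard
    channels=1,
    interim_results=True,
    endpointing=300,
    smart_format=True,
)
```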

07 Streaming Recognition

Audio Stream: ───[chunk]──[chunk]──[chunk]──[chunk]──[silence]──▶

STT Output:   "Hello" → "Hello I'd" → "Hello I'd like to" → "Hello I'd like to book" → FINAL
              (interim)  (interim)    (interim)             (interim)                  (stable)

Streaming STT Best Practices

  • Use 16kHz, 16-bit mono PCM (linear16) for best quality/bandwidth balance
  • Send audio in 20ms frames (640 bytes at 16kHz/16-bit)
  • Enable interim results for UI feedback but trigger LLM only on final results
  • Set endpointing to 300–500ms for conversational voice agents
  • Use VAD events to detect speech start/stop separately from transcription
  • Implement utterance-level buffering to handle multi-sentence turns
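The frame size above follows from simple arithmetic: 16,000 samples/s × 2 bytes/sample × 0.02 s = 640 bytes. A sketch of the frame splitter:

```python
SAMPLE_RATE = 16000   # Hz, linear16 mono
BYTES_PER_SAMPLE = 2  # 16-bit PCM
FRAME_MS = 20

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 640

def frames(pcm: bytes):
    """Yield fixed-size 20ms frames from a raw PCM buffer; a trailing
    partial frame is simply not yielded (hold it for the next buffer)."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[i:i + FRAME_BYTES]
```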

08 Voice Activity Detection (VAD)

VAD distinguishes human speech from silence, noise, and background audio. It's the gatekeeper that decides when to start and stop STT processing.

VAD Engine                | Type                  | Latency           | Notes
Silero VAD                | Neural (PyTorch/ONNX) | <1ms per frame    | Best accuracy/speed tradeoff; industry standard
WebRTC VAD                | Signal-based (GMM)    | <0.1ms            | Ultra-fast, less accurate in noise
Picovoice Cobra           | Neural (edge)         | <1ms              | Optimized for mobile/IoT
Built-in (Deepgram/Azure) | Cloud-integrated      | N/A (server-side) | No extra integration needed
# Silero VAD example
import torch

model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    trust_repo=True
)
(get_speech_timestamps, _, read_audio, _, _) = utils

# Real-time frame-by-frame (recent Silero versions expect 512-sample chunks at 16kHz)
def process_frame(audio_chunk_tensor):
    speech_prob = model(audio_chunk_tensor, 16000).item()
    return speech_prob > 0.5  # True = frame contains speech

09 Natural Language Understanding (NLU) & Intent Detection

NLU processes the transcribed text to extract meaning — intents, entities, sentiment, and dialog acts. This is a critical question for voice agent design: how do you detect what the user wants?

Important clarification: LangChain and LangGraph are NOT NLU frameworks. They are orchestration/chaining frameworks. They do not include built-in intent classification, entity extraction, or NLU models. However, you can use an LLM through LangChain to perform intent detection via prompting or function calling. LangGraph provides the workflow graph — not the understanding.

What LangChain / LangGraph Actually Do

Framework | What It Is                                                                     | What It Is NOT                              | Role in NLU
LangChain | LLM orchestration framework: chains prompts, tools, memory, retrievers        | Not an NLU engine, not an intent classifier | Can wrap an LLM call that does intent classification via prompting or function calling
LangGraph | Stateful graph-based agent framework: manages state machines, routing, cycles | Not an NLU engine, not an intent classifier | Can route based on detected intent (the graph decides what to do after intent is known)

The Three Approaches to Intent Detection

1. Traditional NLU (ML Models)

Dedicated ML models trained on labeled intent data. Fast, deterministic, predictable. Limited to pre-defined intents.

  • Intent classification (book_flight, check_balance)
  • Named entity extraction (dates, names, amounts)
  • Slot filling for structured actions
  • Requires training data (50–500+ examples per intent)
  Properties: deterministic · fast (<10ms) · needs training data

2. LLM-Powered NLU (Prompting)

Use GPT/Claude with structured output to classify intents. No training data needed. Handles unseen intents.

  • LLM does intent + entity extraction in one call
  • Zero-shot: works without examples
  • Structured output via function calling / JSON mode
  • Higher latency (200–500ms) but much more capable
  Properties: flexible · no training data · higher latency

3. Hybrid (Classifier + LLM Fallback)

Fast local classifier for common intents; LLM fallback for edge cases. Best of both worlds.

  • Local model handles 80% of known intents (<10ms)
  • LLM handles ambiguous/novel intents (200ms+)
  • Router decides which path based on confidence
  • Most production voice agents use this approach
  Properties: production-ready · balanced

09A Intent Detection — Full Comparison

Solution                   | Type                      | Latency   | Training Data             | Open Intents | Cost        | Best For
Rasa NLU                   | Self-hosted ML            | <10ms     | Required (50+ per intent) | No           | Free (OSS)  | Self-hosted, full control
Dialogflow CX              | Google Cloud              | ~50ms     | Required (10+ per intent) | No           | $0.007/req  | Google ecosystem, complex flows
Amazon Lex                 | AWS Cloud                 | ~80ms     | Required (10+ per intent) | No           | $0.004/req  | AWS ecosystem, Alexa-like bots
Azure CLU (LUIS successor) | Azure Cloud               | ~60ms     | Required (15+ per intent) | No           | $0.005/req  | Microsoft ecosystem
GPT-4o Function Calling    | LLM (OpenAI)              | 200–400ms | None (zero-shot)          | Yes          | ~$0.003/req | Flexible, open-ended voice agents
Claude Tool Use            | LLM (Anthropic)           | 200–500ms | None (zero-shot)          | Yes          | ~$0.004/req | Safety-focused, enterprise
FastText / Sentence-BERT   | Self-hosted embeddings    | <5ms      | Required (20+ per intent) | No           | Free (OSS)  | Ultra-low latency, edge
SetFit (few-shot)          | Self-hosted (HuggingFace) | <10ms     | Minimal (8–16 per intent) | No           | Free (OSS)  | Few-shot scenarios, fast training
LLM via LangChain          | Orchestrated LLM call     | 200–500ms | None (zero-shot)          | Yes          | LLM cost    | When already using LangChain
Key insight for voice agents: Latency is king. If you have well-defined intents (e.g., "check order", "make payment", "speak to agent"), use a fast local classifier (<10ms). Use LLM-based intent detection only for open-ended conversations where you can't pre-define all intents.

09B Intent Detection Approaches (Detailed)

Approach 1: LLM Function Calling as Intent Detection

The most common modern approach. Define intents as "functions" — the LLM decides which function to call based on the user's speech. This effectively combines NLU + action routing in one step.

# LLM Function Calling = Intent Detection + Entity Extraction
# Define your intents as tools/functions

tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",       # ← This IS the intent
            "description": "User wants to check the status of an existing order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {                   # ← This IS the entity
                        "type": "string",
                        "description": "Order ID or number"
                    }
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "User wants to speak with a human agent",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {
                        "type": "string",
                        "enum": ["billing", "support", "sales"]
                    },
                    "reason": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "make_payment",
            "description": "User wants to make a payment on their account",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "account_id": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "general_question",
            "description": "User has a general question not covered by specific functions",
            "parameters": {
                "type": "object",
                "properties": {
                    "question": {"type": "string"}
                }
            }
        }
    }
]

# User says: "I want to check on order number 4567"
# LLM returns: tool_call(name="check_order_status", args={"order_id": "4567"})
#                          ↑ intent                        ↑ entity
This is NOT "LangChain NLU" — this is the LLM doing NLU via function calling. LangChain is just the wrapper. You can do the exact same thing with raw OpenAI/Anthropic SDK calls. LangChain adds convenience (memory, chaining) but not NLU capability.

Approach 2: Traditional NLU (Rasa / Dialogflow)

# Rasa NLU Pipeline (nlu.yml)
# Train a dedicated ML model for intent classification

nlu:
- intent: check_order
  examples: |
    - where is my order
    - check order status
    - what's the status of order [4567](order_id)
    - track my package
    - I want to know where my delivery is
    - can you look up order [AB-1234](order_id)

- intent: make_payment
  examples: |
    - I'd like to pay my bill
    - make a payment of [$50](amount)
    - pay [100 dollars](amount) on my account
    - how do I pay

- intent: transfer_to_human
  examples: |
    - let me talk to a real person
    - transfer me to an agent
    - I want to speak to someone
    - get me a human

# Result: {"intent": "check_order", "confidence": 0.94,
#          "entities": [{"entity": "order_id", "value": "4567"}]}

Approach 3: Fast Embedding Classifier (SetFit / Sentence-BERT)

# Ultra-fast intent detection using sentence embeddings
# Only needs 8-16 examples per intent to train

from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Train with minimal examples
examples = [
    ("check my order", "check_order"),
    ("where is my package", "check_order"),
    ("track delivery", "check_order"),
    ("order status", "check_order"),
    ("pay my bill", "make_payment"),
    ("make a payment", "make_payment"),
    ("talk to a human", "transfer"),
    ("speak to agent", "transfer"),
    # ... 8-16 examples per intent
]

# SetFit expects a Dataset with "text" and "label" columns
train_dataset = Dataset.from_dict({
    "text": [text for text, _ in examples],
    "label": [label for _, label in examples],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_dataset)
trainer.train()

# Inference: <5ms!
intent = model.predict("I need to check on order 4567")
# → "check_order"

Approach 4: Hybrid Router (Recommended for Production Voice Agents)

# Hybrid: Fast classifier + LLM fallback
# This is the production-recommended approach for voice agents

import json
from openai import AsyncOpenAI
from setfit import SetFitModel

class HybridIntentRouter:
    def __init__(self):
        self.fast_classifier = SetFitModel.from_pretrained("./intent-model")
        self.confidence_threshold = 0.85
        self.llm = AsyncOpenAI()

    async def detect_intent(self, transcript: str) -> dict:
        # Step 1: Try fast classifier (~5ms)
        probs = self.fast_classifier.predict_proba([transcript])[0]
        confidence = float(probs.max())
        top_intent = self.fast_classifier.labels[int(probs.argmax())]

        if confidence >= self.confidence_threshold:
            # High confidence → use fast result (saves 200-400ms!)
            return {
                "intent": top_intent,
                "confidence": confidence,
                "method": "fast_classifier",
                "latency_ms": 5,
            }

        # Step 2: Low confidence → fall back to LLM (200-400ms)
        llm_result = await self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Classify the user's intent. Return JSON: {intent, entities, confidence}"
            }, {
                "role": "user",
                "content": transcript
            }],
            response_format={"type": "json_object"},
        )
        return {**json.loads(llm_result.choices[0].message.content), "method": "llm_fallback"}

09C LangChain / LangGraph Role in Voice Agents

Since LangChain and LangGraph are often confused with NLU, here's exactly what role they play in a voice agent pipeline.

What LangChain Does in a Voice Agent

Capability            | LangChain Role                                                     | Not LangChain's Job
Intent Detection      | Wraps an LLM call that does intent detection via function calling  | Does not provide its own intent classifier
Entity Extraction     | LLM extracts entities via structured output (Pydantic models)      | Does not have NER models
Conversation Memory   | Yes: ConversationBufferMemory, summary memory, etc.                |
RAG Retrieval         | Yes: retrievers, vector stores, rerankers                          |
Tool/Function Calling | Yes: tool definitions, execution, result handling                  |
Prompt Management     | Yes: templates, few-shot examples, output parsers                  |
Agent Orchestration   | Yes (via LangGraph): state machines, routing, cycles               |

What LangGraph Does in a Voice Agent

LangGraph Voice Agent

User transcript ──▶ [Intent Node]
      │
      ├── check_order ───▶ [DB Lookup Node]
      ├── make_payment ──▶ [Payment Node]
      ├── transfer ──────▶ [Transfer Node]
      └── general ───────▶ [RAG + LLM Node]
                │
         [Response Node] ──▶ TTS

LangGraph provides: the GRAPH (routing, state, cycles)
LangGraph does NOT provide: the intent detection itself
# LangGraph voice agent with intent routing
# Note: Intent detection happens INSIDE the LLM call, not from LangGraph

from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class VoiceState(TypedDict):
    transcript: str
    intent: str
    entities: dict
    response: str
    conversation_history: list

# Node 1: Detect intent (uses LLM — LangGraph doesn't do this itself)
async def detect_intent(state: VoiceState) -> VoiceState:
    # Option A: Fast classifier
    result = fast_classifier.predict(state["transcript"])
    # Option B: LLM function calling
    # result = await llm.classify(state["transcript"])
    state["intent"] = result.intent
    state["entities"] = result.entities
    return state

# Router: LangGraph routes based on detected intent
def route_intent(state: VoiceState) -> Literal["order", "payment", "transfer", "general"]:
    intent_map = {
        "check_order": "order",
        "make_payment": "payment",
        "transfer_to_human": "transfer",
    }
    return intent_map.get(state["intent"], "general")

# Build graph
graph = StateGraph(VoiceState)
graph.add_node("detect_intent", detect_intent)
graph.add_node("order", handle_order_check)
graph.add_node("payment", handle_payment)
graph.add_node("transfer", handle_transfer)
graph.add_node("general", handle_general_query)
graph.add_node("respond", generate_voice_response)

graph.set_entry_point("detect_intent")
graph.add_conditional_edges("detect_intent", route_intent)
for node in ["order", "payment", "transfer", "general"]:
    graph.add_edge(node, "respond")
graph.add_edge("respond", END)

voice_agent = graph.compile()

09D Intent Detection Decision Guide

Which Approach Should You Use?

Your Situation                                        | Recommended Approach                    | Why
Well-defined intents (10–50), latency critical        | SetFit / FastText classifier            | <5ms, deterministic, no LLM cost
Complex flows with many intents + Google ecosystem    | Dialogflow CX                           | Visual flow builder, Google integrations
Open-ended conversation, can't pre-define all intents | LLM function calling                    | Handles anything, zero training data
Enterprise with existing Rasa infrastructure          | Rasa NLU                                | Self-hosted, full control, proven at scale
Production voice agent (best overall)                 | Hybrid: fast classifier + LLM fallback  | Fast for common intents, LLM for edge cases
Prototype / MVP (ship fast)                           | LLM function calling only               | Zero setup, works immediately
Edge / offline deployment                             | SetFit or Vosk + local model            | No cloud dependency
DECISION TREE:

Can you pre-define all intents?
├── YES → Is latency critical (<50ms)?
│         ├── YES → SetFit / FastText / Rasa NLU (local classifier)
│         └── NO  → Dialogflow CX / Amazon Lex / Azure CLU (cloud NLU)
└── NO  → Is it a prototype or production?
          ├── PROTOTYPE  → LLM function calling (zero setup)
          └── PRODUCTION → Hybrid classifier + LLM fallback (recommended)
Voice agent reality: Most production voice agents start with LLM function calling (fastest to build), then add a fast classifier for common intents once they have enough conversation data (typically after 1,000+ calls). The classifier handles 80% of utterances in <5ms, and the LLM handles the remaining 20%.

10 Dialog Management

Controls the flow of conversation — tracking state, managing turns, handling context switches, and deciding what action to take next.

Dialog Management Approaches

Approach             | How                                        | Best For
Finite State Machine | Predefined states and transitions          | Simple IVR, scripted flows
Frame-Based          | Fill slots until action is ready           | Form-filling (booking, orders)
LLM-Driven           | LLM decides next action via system prompt  | Open-ended conversation
Hybrid (Graph + LLM) | Graph for structure, LLM for flexibility   | Enterprise voice agents (recommended)
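The frame-based approach can be made concrete with a small slot-filling loop; the slot names here (`date`, `party_size`) are illustrative, not from any framework:

```python
class BookingFrame:
    """Frame-based dialog sketch: keep asking until all required slots
    are filled, then the action (e.g. a booking) is ready to execute."""

    REQUIRED = ("date", "party_size")

    def __init__(self):
        self.slots = {}

    def update(self, entities: dict):
        # Merge newly extracted entities, ignoring empty values
        self.slots.update({k: v for k, v in entities.items() if v})

    def next_prompt(self):
        """Return a question for the first missing slot, or None when done."""
        for slot in self.REQUIRED:
            if slot not in self.slots:
                return f"What {slot.replace('_', ' ')} would you like?"
        return None
```

Each turn, the NLU stage extracts entities, `update` merges them, and `next_prompt` either asks for the next missing slot or signals that the action can run.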

11 LLM Integration

The LLM is the reasoning brain of the voice agent. It processes the user's transcript, conversation history, and system instructions to generate responses.

# Voice-optimized LLM prompt
SYSTEM_PROMPT = """You are a helpful voice assistant for Acme Corp customer support.

VOICE-SPECIFIC RULES:
- Keep responses SHORT (1-3 sentences). Voice != chat.
- Use conversational language, contractions, natural phrasing.
- NEVER use markdown, bullet points, URLs, or special formatting.
- Spell out numbers: "twenty three" not "23".
- For lists, say "first... second... third..." not "1. 2. 3."
- If unsure, ask ONE clarifying question at a time.
- Acknowledge the user before answering: "Sure!", "Great question.", etc.

FUNCTION CALLING:
- Use check_order_status(order_id) for order inquiries.
- Use transfer_to_human(department) if user explicitly asks for a person.
- Use schedule_callback(phone, time) for callback requests.

CONTEXT:
- Customer: {customer_name}
- Account tier: {tier}
- Previous interactions: {history_summary}
"""

# Streaming LLM call (conversation_history holds prior turns for this call)
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def stream_llm_response(transcript, context):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(**context)},
            *conversation_history,
            {"role": "user", "content": transcript}
        ],
        stream=True,
        temperature=0.7,
        max_tokens=150,  # Keep voice responses short
    )

    sentence_buffer = ""
    async for chunk in response:
        token = chunk.choices[0].delta.content or ""
        sentence_buffer += token

        # Yield at sentence/clause boundaries so TTS can start early
        # (comma included deliberately: shorter chunks cut perceived latency)
        if any(sentence_buffer.rstrip().endswith(p) for p in (".", "!", "?", ",")):
            yield sentence_buffer.strip()
            sentence_buffer = ""

    if sentence_buffer.strip():
        yield sentence_buffer.strip()
Voice vs Chat prompting: LLMs trained on text tend to produce verbose, formatted output. Your system prompt must aggressively constrain output length and format. Test by reading responses aloud — if it sounds unnatural, adjust.
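Because the prompt alone is not a guarantee, many pipelines also scrub LLM output just before TTS. A minimal sketch (the digit-by-digit readout is an assumption that suits IDs; swap in a number-to-words library such as num2words for quantities):

```python
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def normalize_for_tts(text: str) -> str:
    """Best-effort cleanup of LLM output before sending it to TTS."""
    text = re.sub(r"\[(.+?)\]\(\S+?\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"https?://\S+", "", text)          # drop bare URLs
    text = re.sub(r"[*_#`]", "", text)                # strip markdown markers
    # Read digit runs digit-by-digit (suits order numbers and IDs).
    text = re.sub(r"\d+",
                  lambda m: " ".join(ONES[int(d)] for d in m.group()),
                  text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

For example, `normalize_for_tts("Your order **123** shipped.")` yields speakable text with the markdown stripped and the ID spelled out.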

12 RAG for Voice Agents

Retrieval-Augmented Generation connects your voice agent to enterprise knowledge bases, FAQs, product docs, and customer data — so it gives accurate, grounded answers.

User says: "What's the return policy for electronics?"
  ▼
[STT]      → "What's the return policy for electronics?"
  ▼
[RETRIEVE] → Search vector DB for relevant policy docs
           → Top 3 chunks: return_policy.md#electronics
  ▼
[AUGMENT]  → System prompt + retrieved context + user query
  ▼
[LLM]      → "Electronics can be returned within 30 days with receipt. Opened items may have a 15% restocking fee."
  ▼
[TTS]      → Spoken response
Voice-specific RAG tip: Retrieved chunks should be concise. Long context increases LLM latency. Use aggressive reranking and limit to 2–3 chunks max for voice use cases.
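The retrieve-and-augment step can be sketched with a toy keyword-overlap retriever standing in for a vector DB (in production, use embeddings plus a reranker; note the top-k cap of 2 for voice latency):

```python
def overlap_score(query: str, chunk: str) -> float:
    """Crude relevance: fraction of query words appearing in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Top-k chunks; k=2 keeps LLM context short for low-latency voice."""
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble the augmented prompt from the retrieved context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return f"Answer briefly using only this context:\n{context}\n\nQuestion: {query}"
```

The same structure holds with a real vector store: only `overlap_score` and `retrieve` change; the prompt assembly and the small `k` stay.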

13 Text-to-Speech (TTS)

TTS converts the LLM's text response into natural-sounding audio. Modern neural TTS produces near-human quality. For voice agents, streaming TTS is critical — audio begins playing before the full text is synthesized.

Key TTS Features for Voice Agents

  • Streaming synthesis — Generate audio incrementally (sentence by sentence)
  • Low first-byte latency — Start speaking as fast as possible
  • Natural prosody — Proper intonation, stress, and rhythm
  • Emotion/style control — Adjust tone (friendly, professional, empathetic)
  • Voice cloning — Custom brand voice from audio samples
  • SSML support — Fine-grained control over pronunciation, pauses, emphasis
  • Multi-language — Support for global deployment

14 TTS Engines Compared

Engine | Type | Streaming | Latency | Quality | Best For
ElevenLabs | Cloud API | Yes | ~150ms | Excellent | Highest quality, voice cloning
Cartesia (Sonic) | Cloud API | Yes | ~90ms | Very Good | Ultra-low latency voice agents
Deepgram Aura | Cloud API | Yes | ~80ms | Good | STT+TTS single vendor
OpenAI TTS | Cloud API | Yes | ~200ms | Very Good | OpenAI ecosystem
Azure Neural TTS | Cloud API | Yes | ~150ms | Very Good | Enterprise, SSML, 400+ voices
Google Cloud TTS | Cloud API | Yes | ~180ms | Very Good | Multi-language, WaveNet
Amazon Polly | Cloud API | Yes | ~200ms | Good | AWS ecosystem, NTTS voices
Coqui TTS | Open-source | Limited | ~300ms | Good | Self-hosted, custom voices
Piper TTS | Open-source | No | ~100ms | Moderate | Edge/offline, lightweight
# ElevenLabs Streaming TTS
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")

def stream_tts(text: str):
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="pNInz6obpgDQGcFmaJgB",  # "Adam"
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="pcm_16000",  # Raw PCM for low latency
    )
    for audio_chunk in audio_stream:
        yield audio_chunk  # Send to speaker/WebSocket

# Cartesia Streaming TTS (ultra-low latency)
from cartesia import Cartesia

cartesia = Cartesia(api_key="YOUR_KEY")

async def stream_cartesia(text: str):
    output = await cartesia.tts.sse(
        model_id="sonic-english",
        transcript=text,
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
        stream=True,
    )
    async for chunk in output:
        yield chunk["audio"]

14A Why ElevenLabs — Deep Dive

ElevenLabs delivers the most natural-sounding AI voices on the market. For enterprise voice agents where brand perception and user trust depend on voice quality, ElevenLabs is the premium choice.

Why ElevenLabs Over Alternatives

Criteria | ElevenLabs | OpenAI TTS | Azure Neural | Google TTS
Voice Naturalness | Best-in-class (MOS ~4.5) | Very good (~4.2) | Very good (~4.1) | Good (~3.9)
Streaming Latency | ~150ms first byte | ~200ms | ~150ms | ~180ms
Voice Cloning | Professional (30s–30min audio) | No | Custom Neural Voice ($) | Limited
Emotion Control | Yes (style, stability sliders) | No | SSML only | SSML only
Voice Library | Thousands (community + premium) | 6 voices | 400+ voices | 100+ voices
Languages | 29 languages | ~57 languages | 140+ languages | 40+ languages
Turbo Model | Yes (Turbo v2.5 — ~100ms) | tts-1 (fast/lower quality) | No turbo option | No turbo option
Cost (per 1K chars) | $0.18–$0.30 | $0.015–$0.030 | $0.016 | $0.016

Key Reasons to Choose ElevenLabs

  1. Highest naturalness scores across independent benchmarks — ElevenLabs' Multilingual v2 and Turbo v2.5 models consistently achieve the highest Mean Opinion Scores (MOS) in blind listening tests. Users perceive ElevenLabs voices as more human-like, building trust in voice agent interactions.
  2. Professional voice cloning for brand identity — Clone a specific voice (spokesperson, brand character) from as little as 30 seconds of audio. The resulting voice is consistent across all calls, creating a recognizable brand experience.
  3. Fine-grained emotion and style control — Adjust stability (consistency vs expressiveness) and similarity (closeness to original voice) sliders. This lets you tune the voice to match your brand personality — professional, warm, energetic, calm.
  4. Turbo v2.5 model for sub-100ms latency — When latency matters most (interactive voice agents), the Turbo model sacrifices minimal quality for dramatically lower first-byte latency, competing with Cartesia's speed.
  5. Rich voice library — Access thousands of pre-made voices for prototyping, or clone custom voices for production. Switch voices without changing any pipeline code.
Cost consideration: ElevenLabs is 5–10x more expensive than alternatives per character. This is the tradeoff for premium quality. For high-value interactions (sales, enterprise support) the quality premium pays for itself. For high-volume, low-value calls, consider Cartesia or Deepgram Aura.
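A rough back-of-envelope model makes the tradeoff concrete (the ~750 characters per spoken minute figure is an assumption; check it against your own transcripts and current list prices):

```python
def tts_cost_per_minute(price_per_1k_chars: float,
                        chars_per_minute: int = 750) -> float:
    """Estimated TTS spend per minute of agent speech."""
    return price_per_1k_chars * chars_per_minute / 1000

# At $0.18/1K chars (ElevenLabs low end) vs $0.016/1K (Azure/Google):
premium = tts_cost_per_minute(0.18)    # ~$0.135 per spoken minute
budget = tts_cost_per_minute(0.016)    # ~$0.012 per spoken minute
```

At 100K minutes/month that gap is roughly $13,500 vs $1,200, which is why high-volume lines often route to a cheaper engine.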

14B ElevenLabs Implementation

# Complete ElevenLabs Streaming TTS for Voice Agents
import asyncio
from elevenlabs import ElevenLabs
from elevenlabs.core import ApiError

class ElevenLabsTTSEngine:
    """Production ElevenLabs TTS with streaming and voice management."""

    def __init__(self, api_key: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
        self.client = ElevenLabs(api_key=api_key)
        self.voice_id = voice_id

    def stream_audio(self, text: str, model: str = "eleven_turbo_v2_5"):
        """Stream audio chunks for a text sentence.

        Models:
        - eleven_turbo_v2_5: Fastest (~100ms), good quality — USE FOR VOICE AGENTS
        - eleven_multilingual_v2: Best quality (~200ms), all 29 languages
        - eleven_monolingual_v1: English only, legacy
        """
        audio_stream = self.client.text_to_speech.convert_as_stream(
            voice_id=self.voice_id,
            text=text,
            model_id=model,
            output_format="pcm_16000",      # Raw PCM for lowest latency
            voice_settings={
                "stability": 0.5,            # 0=expressive, 1=stable
                "similarity_boost": 0.75,    # Closeness to original voice
                "style": 0.0,               # 0=neutral, 1=exaggerated
                "use_speaker_boost": True,   # Enhance clarity
            },
            optimize_streaming_latency=3,  # 0-4, higher = faster but lower quality
        )

        for audio_chunk in audio_stream:
            yield audio_chunk

    async def synthesize_for_twilio(self, text: str):
        """Generate audio in mulaw format for Twilio Media Streams."""
        audio_stream = self.client.text_to_speech.convert_as_stream(
            voice_id=self.voice_id,
            text=text,
            model_id="eleven_turbo_v2_5",
            output_format="ulaw_8000",  # Native Twilio format!
        )
        for chunk in audio_stream:
            yield chunk

    def get_voices(self):
        """List available voices."""
        return self.client.voices.get_all()

    def clone_voice(self, name: str, audio_files: list):
        """Clone a voice from audio samples."""
        return self.client.clone(
            name=name,
            files=audio_files,
            description="Custom brand voice for voice agent"
        )
Output format tip: Use pcm_16000 for WebRTC/WebSocket and ulaw_8000 for Twilio. Using native formats avoids transcoding, saving 5–15ms per chunk.
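If you ever must transcode (for example, an engine that only emits PCM), the per-sample G.711 μ-law encoding looks like this. A reference sketch: in production, prefer requesting ulaw_8000 from the provider directly, or use an optimized native implementation.

```python
def pcm16_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent) of the highest set bit.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16le_to_mulaw(pcm: bytes) -> bytes:
    """Transcode a little-endian 16-bit PCM buffer to mu-law bytes."""
    samples = (int.from_bytes(pcm[i:i + 2], "little", signed=True)
               for i in range(0, len(pcm), 2))
    return bytes(pcm16_to_mulaw(s) for s in samples)
```

Silence (sample 0) encodes to 0xFF, which is why idle μ-law streams are runs of 0xFF bytes.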

14C Why Cartesia — Deep Dive

Cartesia (Sonic model) delivers the lowest TTS latency in the market, making it the ideal choice when response speed is the primary concern.

Why Cartesia Over Alternatives

Criteria | Cartesia Sonic | ElevenLabs Turbo | Deepgram Aura
First-Byte Latency | ~90ms | ~100ms | ~80ms
Voice Quality | Very Good | Excellent | Good
Instant Voice Cloning | Yes (5–15 sec audio) | Yes (30s+ audio) | No
Emotion/Style Mixing | Yes (blend multiple emotions) | Stability sliders | No
Multilingual | Growing (10+ langs) | 29 languages | English focus
Word-level Timestamps | Yes | No | No
WebSocket Streaming | Yes (native) | HTTP streaming | HTTP streaming
Cost | Competitive | Premium | Lowest

Key Reasons to Choose Cartesia

  1. Absolute lowest latency for time-critical interactions — Cartesia's State Space Model (SSM) architecture generates audio faster than transformer-based TTS. The Sonic model produces the first audio byte in ~90ms, enabling sub-second agent responses.
  2. WebSocket-native streaming — Unlike HTTP-based streaming (ElevenLabs, OpenAI), Cartesia provides true WebSocket streaming with bidirectional communication. You can send text and receive audio on the same persistent connection, eliminating connection overhead per sentence.
  3. Word-level timestamps in real-time — Cartesia returns timing information for each word as audio streams, enabling precise lip-sync for avatars, captions, and alignment-based interruption handling.
  4. Emotion and style mixing — Blend multiple emotional tones in a single generation (e.g., 70% professional + 30% warm). This enables dynamic emotional adaptation during conversations.
  5. Instant voice cloning from 5 seconds of audio — The fastest voice cloning available, enabling rapid prototyping and custom voice creation without long training cycles.

14D Cartesia Implementation

# Complete Cartesia Sonic Streaming TTS
import asyncio
from cartesia import Cartesia

class CartesiaTTSEngine:
    """Production Cartesia TTS with WebSocket streaming."""

    def __init__(self, api_key: str, voice_id: str):
        self.client = Cartesia(api_key=api_key)
        self.voice_id = voice_id
        self.ws = None

    async def connect_websocket(self):
        """Establish persistent WebSocket for lowest latency."""
        self.ws = self.client.tts.websocket()
        print("✓ Cartesia WebSocket connected")

    async def stream_audio(self, text: str, context_id: str = "default"):
        """Stream audio via persistent WebSocket connection.

        context_id: Use same ID for sentences in one turn
        to maintain prosody continuity across chunks.
        """
        output = self.ws.send(
            model_id="sonic-english",
            transcript=text,
            voice={
                "mode": "id",
                "id": self.voice_id,
                # Emotion mixing example:
                # "mode": "embedding",
                # "embedding": blend(professional_emb, warm_emb, 0.7)
            },
            output_format={
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 16000,
            },
            context_id=context_id,  # Prosody continuity
            stream=True,
        )

        for chunk in output:
            # chunk contains: audio bytes + optional word timestamps
            yield chunk["audio"]

    async def stream_for_twilio(self, text: str):
        """Generate mulaw audio for Twilio telephony."""
        output = self.ws.send(
            model_id="sonic-english",
            transcript=text,
            voice={"mode": "id", "id": self.voice_id},
            output_format={
                "container": "raw",
                "encoding": "pcm_mulaw",   # Native Twilio format
                "sample_rate": 8000,       # Telephony standard
            },
            stream=True,
        )
        for chunk in output:
            yield chunk["audio"]

    async def close(self):
        if self.ws:
            self.ws.close()

14E Choosing ElevenLabs vs Cartesia

Decision Matrix

Scenario | Choose ElevenLabs | Choose Cartesia
Primary goal | Maximum voice quality & naturalness | Minimum latency
Brand voice needed | Best voice cloning quality | Good instant cloning
Enterprise sales calls | Premium voice builds trust | Fast response impresses
High-volume support calls | Cost may be prohibitive | Better cost/latency ratio
Avatar/lip-sync needed | No word timestamps | Word-level timestamps
Many languages | 29 languages | Growing support
Budget constrained | Premium pricing | More cost-effective
WebSocket native | HTTP streaming | True WebSocket
Hybrid strategy: Many production systems use both: ElevenLabs for high-value interactions (sales, VIP support) and Cartesia for high-volume, latency-sensitive calls (general support, IVR). Route at the LLM gateway based on call type, customer tier, or required language.
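Such routing reduces to a small policy function at the gateway. A sketch under stated assumptions: the `call_type`/`customer_tier` fields and the returned engine names are placeholders for your own call metadata and engine wrappers.

```python
def pick_tts_engine(call_type: str, customer_tier: str, language: str) -> str:
    """Gateway policy: route each call to a TTS engine by value and language."""
    if language != "en":
        return "elevenlabs"   # broader language coverage (29 languages)
    if call_type == "sales" or customer_tier == "vip":
        return "elevenlabs"   # quality premium justified for high-value calls
    return "cartesia"         # high-volume default: lowest latency and cost
```

Keeping the policy in one pure function makes it trivial to unit-test and to adjust as pricing or language support changes.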

15 Voice Cloning

Create a custom brand voice from audio samples. Requires as little as 30 seconds of clean audio with some providers.

Provider | Samples Needed | Quality
ElevenLabs | 1–30 min audio | Excellent (Professional Voice Cloning)
Cartesia | 5–15 sec | Very Good (instant cloning)
PlayHT | 30 sec+ | Very Good
Coqui (XTTS) | 6 sec+ | Good (open-source)
Legal: Always obtain explicit consent before cloning anyone's voice. Many jurisdictions have laws governing synthetic voice use. Document consent for compliance.

16 SSML & Prosody Control

Speech Synthesis Markup Language (SSML) gives fine-grained control over how TTS engines pronounce text.

<!-- SSML Example -->
<speak>
  <prosody rate="medium" pitch="+5%">
    Welcome to Acme support!
  </prosody>
  <break time="300ms"/>
  Your order
  <say-as interpret-as="characters">AB</say-as>
  <say-as interpret-as="cardinal">1234</say-as>
  is on its way.
  <emphasis level="strong">Is there anything else I can help with?</emphasis>
</speak>
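SSML like this is usually assembled in code so that user-supplied data is escaped correctly. A minimal helper (the function name is hypothetical; `escape` keeps user text from breaking the XML):

```python
from xml.sax.saxutils import escape

def order_confirmation_ssml(message: str, order_id: str,
                            pause_ms: int = 300) -> str:
    """Build an SSML snippet, escaping user data so it can't break the XML."""
    return (
        "<speak>"
        f"{escape(message)}"
        f'<break time="{pause_ms}ms"/>'
        f'<say-as interpret-as="characters">{escape(order_id)}</say-as>'
        " is on its way."
        "</speak>"
    )
```

Note that SSML support varies by engine: Azure and Google accept rich SSML, while some low-latency engines accept only plain text or a subset.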

17 WebSocket Streaming

WebSockets provide full-duplex, low-latency communication for real-time audio streaming between client and server.

# FastAPI WebSocket voice agent server
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/voice")
async def voice_endpoint(ws: WebSocket):
    await ws.accept()

    stt = StreamingSTT()
    llm = LLMClient()
    tts = StreamingTTS()

    try:
        while True:
            # Receive audio from client
            audio_data = await ws.receive_bytes()

            # Feed to streaming STT
            transcript = await stt.process(audio_data)

            if transcript and transcript.is_final:
                # Stream LLM → TTS → audio back to client
                async for sentence in llm.stream(transcript.text):
                    async for audio_chunk in tts.synthesize(sentence):
                        await ws.send_bytes(audio_chunk)
    except Exception:
        await ws.close()

18 Why Twilio — Deep Dive

Twilio is the recommended telephony platform for connecting voice agents to the phone network. It provides the bridge between PSTN/SIP phone calls and your WebSocket-based voice agent pipeline.

Why Twilio Over Alternatives

Criteria | Twilio | Vonage (Nexmo) | Telnyx | FreeSWITCH
Media Streams API | First-class WebSocket | WebSocket (beta) | WebSocket | Custom (mod_audio_stream)
Bidirectional Audio | Yes (send + receive) | Limited | Yes | Yes
Call Control (TwiML) | Mature, declarative XML | NCCO (JSON) | TeXML | Dialplan (XML)
Global Phone Numbers | 180+ countries | 80+ countries | 30+ countries | N/A (BYO trunk)
SIP Trunking | Elastic SIP Trunking | Yes | Yes | Native
Recording & Compliance | Built-in, PCI compliant | Built-in | Built-in | Manual
DTMF Detection | Yes (in-stream) | Yes | Yes | Yes
Developer Experience | Best docs, SDKs, community | Good | Good | Complex, expert-level
Scalability | Auto-scales, enterprise SLA | Good | Good | Manual scaling
Cost (per min) | $0.013 inbound | $0.0127 | $0.003 | $0 (infra costs)

Key Reasons to Choose Twilio

  1. Media Streams API is purpose-built for AI voice agents — Twilio's Media Streams sends real-time audio over WebSocket in both directions. This is the exact integration pattern voice agents need: receive caller audio → process through STT → LLM → TTS → send audio back. No other provider has this as mature and well-documented.
  2. Bidirectional streaming with call control — Twilio lets you simultaneously stream audio AND control the call (transfer, hold, record, gather DTMF) through TwiML and the REST API. This is critical for enterprise voice agents that need to transfer to humans, place callers on hold, or navigate IVR trees.
  3. Instant global phone numbers — Provision local, toll-free, or national numbers in 180+ countries via API. Your voice agent can be reachable from any phone in the world within seconds of configuration.
  4. Enterprise-grade reliability and compliance — 99.95% uptime SLA, SOC 2 / HIPAA / PCI-DSS compliance, built-in call recording with automatic PII redaction, and GDPR-compliant data handling. Critical for enterprise deployments.
  5. Best developer experience in telephony — Twilio has the most comprehensive documentation, largest community, SDKs in every major language, and the most Stack Overflow answers of any CPaaS provider.
  6. Elastic SIP Trunking for existing infrastructure — If your enterprise already has a PBX or contact center, Twilio Elastic SIP Trunking lets you connect your voice agent without replacing existing telephony infrastructure.
Cost note: Twilio is 3–4x more expensive per minute than Telnyx. For very high-volume deployments (100K+ minutes/month), negotiate enterprise pricing or consider Telnyx for cost-sensitive non-critical lines. For most enterprise use cases, Twilio's reliability and features justify the premium.

18A Twilio Voice Agent Architecture

TWILIO + VOICE AGENT ARCHITECTURE (flow summary):

  1. Caller (PSTN/SIP) dials in → Twilio Cloud receives the inbound call.
  2. Twilio POSTs to your /twilio-webhook endpoint, which returns TwiML:
     <Connect><Stream url="wss://your-server.com/twilio-stream"/></Connect>
  3. Twilio opens a bidirectional WebSocket stream to /twilio-stream.
  4. Inbound audio (mulaw 8kHz) → Deepgram STT → transcript → LLM → response → TTS.
  5. TTS audio flows back over the same WebSocket and Twilio plays it to the caller.

Audio Format: mulaw (G.711 μ-law), 8000 Hz, mono, base64-encoded
Protocol: JSON messages over WebSocket

18B Twilio Media Streams Protocol

Twilio Media Streams is the API that connects phone calls to your voice agent via WebSocket. Understanding its message protocol is essential.

Media Stream Events (Twilio → Your Server)

Event | When | Key Data
connected | WebSocket established | Protocol version
start | Stream begins | streamSid, callSid, media format, custom params
media | Every ~20ms | payload (base64 mulaw audio), timestamp, track
dtmf | Keypad press detected | digit (0–9, *, #)
mark | Audio playback marker reached | name (your custom marker name)
stop | Stream ends (call ended/transferred) | streamSid

Commands (Your Server → Twilio)

Command | Purpose | Key Data
media | Send audio to caller | payload (base64 mulaw audio)
mark | Insert audio marker | name (notified when played)
clear | Stop all queued audio immediately | streamSid (essential for interruptions)
The clear message is critical for interruption handling. When your VAD detects the user speaking while the agent is talking, send {"event": "clear", "streamSid": "..."} to immediately stop playback. Without this, the caller hears the agent talk over them.
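The protocol in both tables reduces to a few lines of JSON handling. A minimal sketch of decoding inbound frames and building the outbound clear command (function names are illustrative):

```python
import base64
import json

def parse_twilio_frame(message: str):
    """Decode one inbound Media Streams frame into (event, payload)."""
    data = json.loads(message)
    event = data["event"]
    if event == "media":
        return event, base64.b64decode(data["media"]["payload"])
    if event == "dtmf":
        return event, data["dtmf"]["digit"]
    if event == "start":
        return event, data["start"]["streamSid"]
    return event, None  # connected / mark / stop carry no audio payload

def make_clear_frame(stream_sid: str) -> str:
    """Build the outbound clear command that flushes queued agent audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

The full handler in the next section follows exactly this shape, just with the STT/LLM/TTS pipeline attached to each branch.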

18C Twilio Complete Implementation

# ============================================
# Complete Twilio Voice Agent (FastAPI)
# Integrates: Twilio + Deepgram + LLM + TTS
# ============================================

import asyncio, json, base64
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import HTMLResponse
from twilio.twiml.voice_response import VoiceResponse, Connect

app = FastAPI()

# ─── 1. WEBHOOK: Twilio calls this when a call arrives ───
@app.post("/twilio-webhook")
async def twilio_webhook(request: Request):
    """Twilio hits this endpoint when someone calls your number.
    Returns TwiML that tells Twilio to open a Media Stream."""
    form = await request.form()  # Twilio sends call params as form-encoded POST
    response = VoiceResponse()

    # Optional: play greeting before connecting to AI
    response.say("Connecting you to our AI assistant.", voice="Polly.Joanna")

    # Connect call audio to your WebSocket
    connect = Connect()
    stream = connect.stream(
        url="wss://your-server.com/twilio-stream",
        status_callback="https://your-server.com/stream-status",
        status_callback_method="POST",
    )
    # Pass custom parameters to your WebSocket handler
    stream.parameter(name="caller_number", value=str(form.get("From", "")))
    stream.parameter(name="call_sid", value=str(form.get("CallSid", "")))

    response.append(connect)
    return HTMLResponse(content=str(response), media_type="application/xml")


# ─── 2. WEBSOCKET: Receives real-time audio from Twilio ───
@app.websocket("/twilio-stream")
async def twilio_media_stream(ws: WebSocket):
    await ws.accept()

    # State for this call
    stream_sid = None
    call_sid = None
    caller_number = None
    is_agent_speaking = False

    # Initialize pipeline components
    deepgram_stt = DeepgramSTTEngine(
        api_key=DG_API_KEY,
        # handle_transcript (not shown) runs the LLM and calls send_tts_to_twilio
        on_transcript=lambda t, **kw: handle_transcript(t, ws, stream_sid, **kw),
        on_speech_started=lambda: handle_barge_in(ws, stream_sid),
    )
    await deepgram_stt.connect()

    try:
        async for message in ws.iter_text():
            data = json.loads(message)
            event = data["event"]

            if event == "connected":
                print("✓ Twilio WebSocket connected")

            elif event == "start":
                stream_sid = data["start"]["streamSid"]
                call_sid = data["start"]["callSid"]
                custom = data["start"].get("customParameters", {})
                caller_number = custom.get("caller_number")
                print(f"📞 Call started: {call_sid} from {caller_number}")

                # Send initial greeting via TTS
                await send_tts_to_twilio(
                    "Hi! I'm your AI assistant. How can I help you today?",
                    ws, stream_sid
                )

            elif event == "media":
                # Decode base64 mulaw audio from Twilio
                audio_bytes = base64.b64decode(data["media"]["payload"])
                # Forward to Deepgram STT (accepts mulaw natively)
                await deepgram_stt.send_audio(audio_bytes)

            elif event == "dtmf":
                digit = data["dtmf"]["digit"]
                print(f"📱 DTMF: {digit}")

            elif event == "mark":
                # Audio playback reached a marker
                marker_name = data["mark"]["name"]
                if marker_name == "end_of_response":
                    is_agent_speaking = False

            elif event == "stop":
                print(f"📞 Call ended: {call_sid}")
                break

    except Exception as e:
        print(f"Error: {e}")
    finally:
        await deepgram_stt.close()


# ─── 3. HELPER: Send TTS audio back to Twilio ───
async def send_tts_to_twilio(text: str, ws: WebSocket, stream_sid: str):
    """Generate TTS audio and stream it back to the Twilio caller."""
    tts = ElevenLabsTTSEngine(api_key=ELEVEN_API_KEY, voice_id=VOICE_ID)
    # OR: tts = CartesiaTTSEngine(api_key=CARTESIA_KEY, voice_id=VOICE_ID),
    #     then iterate tts.stream_for_twilio(text) instead

    async for audio_chunk in tts.synthesize_for_twilio(text):
        payload = base64.b64encode(audio_chunk).decode("utf-8")
        await ws.send_json({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload}
        })

    # Add marker to know when playback finishes
    await ws.send_json({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": "end_of_response"}
    })


# ─── 4. HELPER: Handle barge-in (user interrupts agent) ───
async def handle_barge_in(ws: WebSocket, stream_sid: str):
    """User started speaking while agent is talking. Clear audio."""
    await ws.send_json({
        "event": "clear",
        "streamSid": stream_sid,
    })
    print("⚡ Barge-in: cleared Twilio audio queue")

18D Twilio Advanced Features

Call Transfer to Human

When the voice agent can't handle a request, warm-transfer to a human agent using the Twilio REST API.

from twilio.rest import Client

client = Client(TWILIO_SID, TWILIO_TOKEN)

# Transfer call to human agent queue
client.calls(call_sid).update(
    twiml='<Response><Dial><Queue>support</Queue></Dial></Response>'
)

Call Recording

Record calls for QA, compliance, and training data. Enable per-call or account-wide.

# Enable recording via TwiML
response = VoiceResponse()
response.record(
    recording_status_callback="/recording-done",
    transcribe=True,
    max_length=3600,  # 1 hour max
)

Outbound Calls

Your voice agent can initiate calls (appointment reminders, follow-ups, surveys).

call = client.calls.create(
    to="+1234567890",
    from_="+1987654321",  # Your Twilio #
    url="https://your-server.com/twilio-webhook",
    status_callback="https://your-server.com/call-status",
)

DTMF Handling

Detect keypad presses for IVR navigation, PIN entry, or menu selection during AI conversation.

# In WebSocket handler:
elif event == "dtmf":
    digit = data["dtmf"]["digit"]
    if digit == "0":
        await transfer_to_human()
    elif digit == "*":
        await repeat_last_message()

Full Pipeline: Twilio + Deepgram + LLM + ElevenLabs/Cartesia

CALL FLOW (End to End):

 1. Caller dials your Twilio number (+1-800-XXX-XXXX)
 2. Twilio sends HTTP webhook → your /twilio-webhook endpoint
 3. You return TwiML: <Connect><Stream url="wss://..."/></Connect>
 4. Twilio opens WebSocket to /twilio-stream
 5. Twilio sends "start" event with streamSid, callSid, custom params
 6. You send greeting TTS audio back via WebSocket

CONVERSATION LOOP:
 7. Twilio sends "media" events (20ms mulaw audio chunks, base64)
 8. You forward raw mulaw to Deepgram (encoding=mulaw, sample_rate=8000)
 9. Deepgram returns streaming transcripts (interim → final)
10. On final transcript → send to LLM (GPT-4o / Claude) with conversation history
11. LLM streams response tokens → buffer into sentences
12. Each sentence → ElevenLabs/Cartesia TTS (output_format=ulaw_8000)
13. TTS audio chunks → base64 encode → send as "media" events to Twilio
14. Twilio plays audio to caller
15. Add "mark" event after last audio chunk to track playback completion

INTERRUPTION:
16. Deepgram fires speech_started event while agent audio is playing
17. You send "clear" event to Twilio → immediately stops playback
18. Process new user speech normally (back to step 9)

CALL END:
19. Caller hangs up → Twilio sends "stop" event
20. Clean up: close Deepgram, log conversation, update CRM

Twilio Configuration Checklist

Setting | Value | Why
Phone Number | Provision via Console or API | Your voice agent's phone number
Webhook URL | POST https://your-server.com/twilio-webhook | Called on inbound calls
Status Callback | POST https://your-server.com/call-status | Track call lifecycle events
Media Streams | Bidirectional, single-track | Receive and send audio
Audio Format | mulaw (G.711 μ-law), 8kHz, mono | Telephony standard, accepted by Deepgram natively
TLS | Required (wss://) | Twilio requires encrypted WebSocket
Server location | Same region as Twilio edge | Minimize network latency
Twilio Edge Locations: Twilio routes calls through the nearest edge location. Deploy your voice agent server in the same cloud region (e.g., us-east-1 for US East, eu-west-1 for Europe) to minimize audio transport latency. A 50ms network improvement translates directly to faster agent responses.

19 WebRTC Integration

WebRTC provides peer-to-peer audio/video with built-in echo cancellation, noise suppression, and adaptive bitrate. Ideal for browser-based voice agents.

WebRTC Advantages for Voice Agents

  • Built-in acoustic echo cancellation (AEC) — prevents the agent from hearing itself
  • Automatic gain control (AGC) — normalizes volume
  • Noise suppression — filters background noise
  • Opus codec — high quality at low bitrate
  • Lowest possible latency (peer-to-peer when possible)

Frameworks with WebRTC: LiveKit, Daily.co, Pipecat

20 Interruption Handling (Barge-in)

Users will interrupt the agent mid-sentence. The agent must detect this, stop speaking immediately, and process the new input.

Agent speaking: "Your order is currently being processed and should arrive by—"
User interrupts: "Actually, cancel it."

Agent action:
  1. STOP TTS playback immediately
  2. Flush audio buffer
  3. Process "Actually, cancel it." through STT
  4. Send to LLM with context of interrupted response
  5. Generate new response: "Sure, I'll cancel that for you."
# Interruption handling logic
class InterruptionHandler:
    def __init__(self):
        self.is_agent_speaking = False
        self.playback_task = None
        self.audio_buffer = asyncio.Queue()

    async def on_user_speech_detected(self):
        """Called when VAD detects user speech during agent output."""
        if self.is_agent_speaking:
            # 1. Cancel current TTS playback
            if self.playback_task:
                self.playback_task.cancel()

            # 2. Flush audio buffer
            while not self.audio_buffer.empty():
                self.audio_buffer.get_nowait()

            # 3. Send clear message to client
            await self.send_clear_audio()

            self.is_agent_speaking = False
            print("⚡ Barge-in detected — agent stopped")

21 Voice AI Frameworks

Frameworks that provide pre-built pipelines for voice agent development, handling the complex orchestration of STT, LLM, TTS, and transport.

| Framework | Type | Transport | Best For |
|---|---|---|---|
| LiveKit Agents | Open-source SDK | WebRTC | Production voice agents, scalable |
| Pipecat | Open-source (Daily.co) | WebRTC / WebSocket | Flexible pipeline framework |
| Vocode | Open-source | WebSocket / Telephony | Telephony agents, Twilio |
| Vapi | Managed platform | WebRTC / Telephony | Fastest deployment, hosted |
| Retell AI | Managed platform | WebRTC / Telephony | Enterprise call centers |
| Bland AI | Managed platform | Telephony | Outbound calling at scale |
| Hamming AI | Testing platform | N/A | Testing voice agents |

22 LiveKit Agents

Open-source framework for building real-time voice (and video) AI agents. Production-ready with WebRTC transport, plugin system, and turn detection.

# LiveKit Voice Agent
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero, cartesia

async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2"),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),

        # Interruption config
        interrupt_min_words=2,
        allow_interruptions=True,

        # Turn detection
        min_endpointing_delay=0.5,
    )

    assistant.start(ctx.room)
    await assistant.say("Hi! How can I help you today?")

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

23 Pipecat

Open-source framework (by Daily.co) for building voice and multimodal AI agents. Uses a pipeline architecture with composable processors.

# Pipecat Voice Pipeline
from pipecat.pipeline import Pipeline
from pipecat.transports.services.daily import DailyTransport
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.cartesia import CartesiaTTSService

transport = DailyTransport(room_url, token, "Voice Agent")
stt = DeepgramSTTService(api_key=DG_KEY)
llm = OpenAILLMService(model="gpt-4o", api_key=OAI_KEY)
tts = CartesiaTTSService(api_key=CART_KEY, voice_id="...")

pipeline = Pipeline([
    transport.input(),   # Audio from user (WebRTC)
    stt,                 # Speech → Text
    llm,                 # Text → Response text
    tts,                 # Response text → Audio
    transport.output(),  # Audio to user (WebRTC)
])

24 Vocode

Open-source library for building voice agents with telephony support (Twilio, Vonage). Good for phone-based agents.

Key features: Twilio integration, agent actions (transfer, end call), conversation management, endpointing configuration.

25 Managed Platforms (Vapi / Retell / Bland)

Vapi

Fully managed voice AI platform. Define agent via API/dashboard, get a phone number or web widget. Handles all infra.

Fastest to deploy Phone + Web

Retell AI

Enterprise voice agent platform with LLM integration, function calling, and analytics dashboard.

Enterprise Analytics

Bland AI

Focus on outbound phone calls at scale. Batch calling, campaign management, CRM integration.

Outbound Scale

26 Function Calling in Voice Agents

Voice agents need to perform real actions — check databases, place orders, transfer calls. Function calling (tool use) lets the LLM trigger backend operations.

# Function calling with voice agent
tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",
            "description": "Check current status of a customer order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_call",
            "description": "Transfer to human agent in specified department",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {"type": "string", "enum": ["billing", "support", "sales"]}
                }
            }
        }
    }
]

# During voice pipeline: when LLM returns tool_call
async def handle_tool_call(tool_call):
    # Say a filler while executing
    await tts.say("Let me check that for you...")

    result = await execute_function(tool_call.name, tool_call.arguments)

    # Feed result back to LLM for verbal response
    return result
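The `execute_function` call above needs a dispatch layer that maps tool names to backend handlers. A minimal sketch, assuming the two tools defined in the schema above; the handler bodies and return shapes are illustrative stubs, not a real backend:

```python
# Minimal tool-dispatch sketch. Function names match the `tools` schema above;
# the backend lookups are stubbed for illustration.
import asyncio
import json

async def check_order_status(order_id: str) -> dict:
    # In production this would query your order database.
    return {"order_id": order_id, "status": "shipped", "eta": "2 days"}

async def transfer_call(department: str = "support") -> dict:
    # In production this would trigger a Twilio <Dial> or SIP transfer.
    return {"transferred": True, "department": department}

TOOL_REGISTRY = {
    "check_order_status": check_order_status,
    "transfer_call": transfer_call,
}

async def execute_function(name: str, arguments) -> dict:
    """Dispatch an LLM tool call to its handler; fail soft on unknown tools."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    kwargs = json.loads(arguments) if isinstance(arguments, str) else arguments
    return await handler(**kwargs)

result = asyncio.run(execute_function("check_order_status", '{"order_id": "A123"}'))
```

Failing soft on an unknown tool name matters in voice: the error dict goes back to the LLM, which can apologize verbally instead of crashing the call.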

27 Multimodal (GPT-4o Realtime API)

OpenAI's Realtime API provides speech-to-speech without separate STT/TTS — the model directly processes audio input and generates audio output.

Advantages

  • Single model handles everything (lower latency)
  • Preserves tone, emotion, and nuance from audio
  • Built-in VAD and turn detection
  • Natural interruption handling

Limitations

  • OpenAI-only (vendor lock-in)
  • Higher cost per call vs pipeline approach
  • Less control over individual components
  • Harder to audit (no intermediate transcript)
# OpenAI Realtime API (WebSocket)
import websockets, json, base64

url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
headers = {"Authorization": f"Bearer {API_KEY}", "OpenAI-Beta": "realtime=v1"}

async with websockets.connect(url, extra_headers=headers) as ws:
    # Configure session
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "turn_detection": {"type": "server_vad", "threshold": 0.5},
            "tools": tools,
        }
    }))

    # Send audio frames directly
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_bytes).decode()
    }))

28 Emotion Detection & Sentiment

Detect user frustration, confusion, or satisfaction from voice cues (tone, pitch, pace) and text sentiment to adapt agent behavior.

Approaches

  • Text-based sentiment — Analyze STT transcript for sentiment (simplest)
  • Audio features — Pitch variation, speaking rate, energy levels
  • Dedicated models — Hume AI, SpeechBrain emotion recognition
  • LLM-based — Ask LLM to assess user emotion from conversation context
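The simplest of the approaches above, text-based sentiment, can be a keyword heuristic over the STT transcript. A sketch with made-up cue lists; a real system would use a sentiment model or one of the dedicated services named above:

```python
# Illustrative text-based frustration heuristic over STT transcripts.
# The cue patterns are invented for this sketch, not from any library.
import re

FRUSTRATION_CUES = [
    r"\b(this is ridiculous|not working|still broken)\b",
    r"\bspeak to a (human|person|agent|representative)\b",
    r"\b(cancel|refund|complaint)\b",
]

def frustration_score(transcript: str) -> float:
    """Return 0.0–1.0: fraction of frustration cues matched in the utterance."""
    text = transcript.lower()
    hits = sum(bool(re.search(p, text)) for p in FRUSTRATION_CUES)
    return hits / len(FRUSTRATION_CUES)

def should_escalate(transcript: str, threshold: float = 0.5) -> bool:
    """Escalate to a human when enough cues fire in a single turn."""
    return frustration_score(transcript) >= threshold
```

In a pipeline this would run on each final transcript, feeding a per-session rolling score rather than reacting to a single turn.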

29 Multilingual Support

| Component | Multilingual Options |
|---|---|
| STT | Deepgram (36+ langs), Google (125+ langs), Whisper (99 langs), Azure (100+ langs) |
| LLM | GPT-4o, Claude, Gemini all handle major languages well |
| TTS | Azure (400+ voices, 140+ langs), ElevenLabs (29 langs), Google (40+ langs) |

Language detection: Use automatic language detection on the first utterance, then lock it in for the session. Deepgram and Google STT support auto-detect.

30 Context & Memory

Voice conversations require persistent context across turns and sessions.

Memory Layers

| Layer | Scope | Implementation |
|---|---|---|
| Turn context | Current exchange | LLM message history |
| Session memory | Current call | Conversation buffer (last N turns) |
| User memory | Across calls | Database + RAG (preferences, history) |
| Business context | Global | RAG over knowledge base, CRM data |
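The session-memory layer is typically a bounded buffer of recent turns, flattened into chat-format messages for the LLM. A minimal sketch (the chat-message shape follows the OpenAI-style convention; persistence for the user-memory layer is out of scope here):

```python
# Session-memory sketch: keep the last N turns verbatim for the LLM context
# window. User-level memory (across calls) would persist to a database instead.
from collections import deque

class SessionMemory:
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off automatically

    def add_turn(self, user: str, agent: str) -> None:
        self.turns.append({"user": user, "agent": agent})

    def as_messages(self) -> list[dict]:
        """Flatten buffered turns into chat-format messages for the LLM."""
        messages = []
        for t in self.turns:
            messages.append({"role": "user", "content": t["user"]})
            messages.append({"role": "assistant", "content": t["agent"]})
        return messages
```

A common refinement is summarizing evicted turns into a single system message so long calls keep gist without blowing the token budget.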

31 Deployment & Scaling

Deployment Architecture

Load Balancer (Traefik) ─┐
                         ├─▶ Voice Agent Workers (K8s) ──▶ STT API (Deepgram)
Telephony (Twilio) ──────┘    - Pipeline               ──▶ LLM API (OpenAI)
                              - State                  ──▶ TTS API (Cartesia)
                              - Sessions
                                   │
                                   ▼
                             Redis/DB (sessions)

Scaling Considerations

  • Horizontal scaling — Each worker handles N concurrent calls; add workers as needed
  • Session affinity — Sticky sessions ensure a call stays on the same worker
  • GPU for self-hosted — If running local STT/TTS, GPU instances are essential
  • Connection pooling — Reuse WebSocket connections to STT/TTS providers
  • Autoscaling — Scale based on concurrent call count, not CPU/memory
  • Geographic distribution — Deploy in regions close to users and telephony POPs
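Scaling on concurrent call count rather than CPU can be sketched as a replica calculation. The 500-calls-per-node figure comes from the load-test section later in this guide, and the 70% utilization target is the capacity-planning rule stated there; the function itself is an illustrative sketch, not a specific autoscaler's API:

```python
# Autoscaling-on-call-count sketch: compute desired worker replicas from
# active calls, per-node capacity, and a utilization headroom target.
import math

def desired_replicas(active_calls: int,
                     calls_per_node: int = 500,
                     target_utilization: float = 0.7,
                     min_replicas: int = 2) -> int:
    """Scale so each node stays below the target utilization."""
    effective_capacity = calls_per_node * target_utilization  # 350 calls/node
    needed = math.ceil(active_calls / effective_capacity) if active_calls else 0
    return max(needed, min_replicas)
```

In Kubernetes this maps to an HPA on a custom metric (active WebSocket connections) rather than the default CPU metric.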

32 Monitoring & Analytics

Key Voice Agent Metrics

| Metric | Target | Why |
|---|---|---|
| First-byte latency | <500ms | Time from user stop to agent start |
| End-to-end latency | <1s | Full turn-around time |
| STT accuracy (WER) | <10% | Word Error Rate |
| Interruption rate | <15% | How often users barge in (high = latency issue) |
| Task completion rate | >80% | Did the agent resolve the user's need? |
| Call duration | Varies | Shorter often = more efficient |
| Escalation rate | <20% | How often transferred to a human |
| User satisfaction (CSAT) | >4.0/5 | Post-call survey score |

Tools: Langfuse, OpenTelemetry, Grafana, Datadog

32A Production Metrics — Numbers You Need for Interviews

When deployed in production, you need concrete metrics to prove your system works. Below are the actual KPIs a production voice agent should hit, how to measure them, and what to say in interviews.

Pipeline Latency Breakdown (Per Turn)

Every millisecond matters. Here's the target breakdown for a single conversational turn:

| Stage | P50 Target | P95 Target | P99 Target | How to Measure |
|---|---|---|---|---|
| VAD → Endpointing | ~200ms | ~350ms | ~500ms | Time from speech end to VAD final event |
| STT (Deepgram) | ~100ms | ~180ms | ~250ms | Streaming partial → final transcript delta |
| LLM First Token | ~250ms | ~500ms | ~800ms | Time from prompt send to first token (TTFT) |
| LLM Full Response | ~600ms | ~1.2s | ~2.0s | Chunk-and-stream; don't wait for full response |
| TTS First Byte | ~90ms | ~200ms | ~400ms | Time from text chunk to first audio byte |
| Network + Twilio | ~50ms | ~100ms | ~150ms | WebSocket round-trip + Twilio media relay |
| Total Turn Latency | ~700ms | ~1.3s | ~2.1s | User stops speaking → agent audio starts |
Interview tip: "Our P50 end-to-end latency is ~700ms, which is below the 1-second conversational comfort threshold. We achieve this by streaming STT → LLM → TTS in parallel chunks rather than waiting for each stage to complete."
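The "streaming in parallel chunks" claim above hinges on flushing LLM output to TTS at clause boundaries instead of waiting for the full response. A sketch of that chunker; the ~50-character threshold follows the optimization notes later in this guide, and the boundary characters are an assumption:

```python
# Response-chunking sketch: stream LLM tokens and flush to TTS at clause
# boundaries (or ~50 chars), so synthesis starts before the full response exists.
def chunk_for_tts(token_stream, max_chars: int = 50):
    """Yield TTS-ready text chunks from an incremental token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush on clause-ending punctuation or when the buffer is long enough.
        if buffer and (buffer[-1] in ".?!,;:" or len(buffer) >= max_chars):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever trails the last boundary

chunks = list(chunk_for_tts(["Your order ", "shipped today.", " It arrives ", "Friday."]))
# → ["Your order shipped today.", "It arrives Friday."]
```

The tradeoff is prosody: too-small chunks make TTS intonation choppy, so most pipelines flush on sentence punctuation first and fall back to the character cap only for long clauses.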

Production Throughput & Availability

99.9%
Uptime SLA Target
500+
Concurrent calls / node
50K+
Calls handled / day
<0.1%
Dropped call rate
| Metric | Target | Alert Threshold | Measurement |
|---|---|---|---|
| System uptime | 99.9% (8.7h downtime/yr) | <99.5% triggers P1 | Health check endpoint + synthetic calls |
| Concurrent calls per node | 500+ (WebSocket-based) | >80% capacity → auto-scale | Active WebSocket connection count |
| Daily call volume | 50,000+ | Varies by business | Counter metric per completed call |
| Dropped call rate | <0.1% | >0.5% triggers P2 | Calls ended abnormally / total calls |
| WebSocket reconnect rate | <0.5% | >2% triggers P2 | Reconnection events / total sessions |
| Mean time to recovery (MTTR) | <5 min | >15 min triggers post-mortem | Time from alert to service restored |

Conversation Quality Metrics

| Metric | Target | How Measured | Interview Talking Point |
|---|---|---|---|
| Task Completion Rate | >85% | LLM judges if intent resolved (auto-eval) | "85% of calls resolve without human handoff" |
| Containment Rate | >80% | Calls completed without escalation | "We reduced human agent load by 80%" |
| First Call Resolution | >75% | No callback within 24h for same issue | "75% of issues resolved on the first call" |
| CSAT Score | >4.2/5 | Post-call IVR survey or SMS survey | "Post-call CSAT averages 4.2 out of 5" |
| Avg Handle Time (AHT) | <3 min | Call start → call end timestamp | "Average call duration is 2.5 min vs 6 min for human agents" |
| Interruption Rate | <15% | Barge-in events / total agent utterances | "Low interruption rate shows our latency is in the comfort zone" |
| Silence Ratio | <10% | Dead air >2s / total call duration | "Less than 10% awkward silence per call" |
| Repeat Rate | <8% | Users saying "repeat that" / "what?" | "Users rarely ask the agent to repeat — TTS clarity is high" |

STT Accuracy Metrics (Deepgram)

| Metric | Target | Measurement Method |
|---|---|---|
| Word Error Rate (WER) | <8% | Sample transcripts vs human-verified ground truth |
| Named Entity Accuracy | >92% | Correct recognition of names, addresses, account numbers |
| Latency (streaming final) | <200ms | WebSocket event timestamp delta (is_final:true) |
| Language Detection Accuracy | >95% | Auto-detected language vs actual (if multilingual) |
| Noise Robustness | WER <15% in noise | Test with SNR 10dB background noise samples |

TTS Quality Metrics

| Metric | ElevenLabs Target | Cartesia Target | How Measured |
|---|---|---|---|
| Time to First Byte (TTFB) | <250ms | <100ms | WebSocket message timestamp |
| MOS (Mean Opinion Score) | >4.3 | >4.1 | Human evaluation panel (1–5 scale) |
| Audio Artifact Rate | <2% | <3% | Glitches, stutters, or clipping per 100 utterances |
| Character Throughput | ~800 chars/s | ~1200 chars/s | Characters processed per second at real-time speed |
| Voice Consistency | >95% | >93% | Same text → speaker similarity score across calls |

32B Cost Per Call Analysis

Understanding your unit economics per call is critical for production planning and interviews. Here's the full breakdown:

Per-Call Cost Breakdown (3 min avg call)

| Component | Pricing Model | Cost per 3-min Call | Monthly (50K calls) |
|---|---|---|---|
| Twilio (inbound) | $0.0085/min | $0.026 | $1,275 |
| Deepgram STT (Nova-2) | $0.0043/min | $0.013 | $645 |
| LLM (GPT-4o) | ~$0.005/call (avg tokens) | $0.005 | $250 |
| LLM (Claude Sonnet) | ~$0.004/call (avg tokens) | $0.004 | $200 |
| ElevenLabs TTS | $0.18/1K chars (~$0.006/min) | $0.018 | $900 |
| Cartesia TTS | $0.042/1K chars (~$0.0014/min) | $0.004 | $210 |
| Infra (compute) | ~$0.001/call | $0.001 | $50 |
| Total (ElevenLabs) | | $0.063 | $3,120 |
| Total (Cartesia) | | $0.049 | $2,430 |
Interview tip: "Our per-call cost is approximately $0.05–$0.06, compared to $3–$5 for a human agent call. That's a 60–100x cost reduction while maintaining 85%+ task completion rate."
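The per-call arithmetic above is worth encoding so finance projections update when a rate changes. A sketch using the Cartesia-stack rates from the table (the per-minute and flat per-call figures are the table's estimates, not live provider pricing):

```python
# Per-call cost sketch using the per-minute rates from the table above.
# LLM and infra are flat per-call estimates, as in the table.
RATES_PER_MIN = {
    "twilio_inbound": 0.0085,
    "deepgram_stt": 0.0043,
    "cartesia_tts": 0.0014,
}
FLAT_PER_CALL = {"llm_gpt4o": 0.005, "infra": 0.001}

def cost_per_call(minutes: float = 3.0) -> float:
    """Total USD cost for one call of the given duration."""
    per_min = sum(RATES_PER_MIN.values()) * minutes
    return round(per_min + sum(FLAT_PER_CALL.values()), 4)

cost = cost_per_call(3.0)           # ≈ $0.0486 for the Cartesia stack
monthly = round(cost * 50_000, 2)   # ≈ $2,430/month at 50K calls
```

Swapping the TTS entry for ElevenLabs' ~$0.006/min reproduces the ~$0.063 total in the other row.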

Cost Optimization Strategies

| Strategy | Impact | Tradeoff |
|---|---|---|
| Use Cartesia instead of ElevenLabs | ~75% TTS cost reduction | Slightly lower voice quality |
| Use Claude Haiku / GPT-4o-mini for simple intents | ~80% LLM cost reduction | Lower accuracy on complex queries |
| Semantic caching (same question = cached answer) | ~20–30% LLM savings | Risk of stale answers |
| Tiered routing: simple → small LLM, complex → large LLM | ~50% LLM cost reduction | Added routing latency (~30ms) |
| Negotiate volume pricing (Deepgram/Twilio) | ~20–40% reduction | Commitment required |
| Self-host STT (Faster-Whisper on GPU) | ~90% STT cost reduction | GPU infra cost, maintenance burden |

32C Observability Implementation

Concrete code and configuration for production monitoring.

OpenTelemetry Instrumentation

from opentelemetry import trace, metrics
import time

# Provider/exporter setup (TracerProvider, MeterProvider, OTLP exporter) is
# omitted; without it these calls return no-op instruments, which is fine for
# local development.
tracer = trace.get_tracer("voice-agent")
meter  = metrics.get_meter("voice-agent")

# ── Define Metrics ────────────────────────────────────
call_counter      = meter.create_counter("voice.calls.total")
active_calls      = meter.create_up_down_counter("voice.calls.active")
turn_latency      = meter.create_histogram("voice.turn.latency_ms")
stt_latency       = meter.create_histogram("voice.stt.latency_ms")
llm_ttft          = meter.create_histogram("voice.llm.ttft_ms")
tts_ttfb          = meter.create_histogram("voice.tts.ttfb_ms")
barge_in_counter  = meter.create_counter("voice.barge_in.total")
error_counter     = meter.create_counter("voice.errors.total")
task_completion   = meter.create_counter("voice.task.completed")
escalation_count  = meter.create_counter("voice.escalations.total")
cost_per_call     = meter.create_histogram("voice.cost.per_call_usd")

# ── Trace a Full Conversational Turn ──────────────────
async def handle_turn(session, audio_chunk):
    with tracer.start_as_current_span("voice.turn") as span:
        span.set_attribute("call.id", session.call_id)
        turn_start = time.perf_counter()

        # STT
        with tracer.start_as_current_span("voice.stt"):
            t0 = time.perf_counter()
            transcript = await session.stt.transcribe(audio_chunk)
            stt_latency.record((time.perf_counter() - t0) * 1000)

        # LLM
        with tracer.start_as_current_span("voice.llm"):
            t0 = time.perf_counter()
            response_stream = session.llm.stream(transcript)
            first_token = await response_stream.__anext__()
            llm_ttft.record((time.perf_counter() - t0) * 1000)

        # TTS
        with tracer.start_as_current_span("voice.tts"):
            t0 = time.perf_counter()
            audio_out = await session.tts.synthesize_stream(first_token)
            tts_ttfb.record((time.perf_counter() - t0) * 1000)

        turn_latency.record((time.perf_counter() - turn_start) * 1000)

Grafana Dashboard — Key Panels

Configure these essential Grafana panels for your voice agent dashboard:

| Panel | PromQL / Query | Visualization |
|---|---|---|
| Active Calls (live) | voice_calls_active | Stat (big number) |
| Turn Latency P50/P95/P99 | histogram_quantile(0.95, rate(voice_turn_latency_ms_bucket[5m])) | Time series graph |
| Calls per Minute | rate(voice_calls_total[5m]) * 60 | Time series graph |
| Error Rate % | rate(voice_errors_total[5m]) / rate(voice_calls_total[5m]) * 100 | Stat with threshold colors |
| STT Latency Heatmap | voice_stt_latency_ms_bucket | Heatmap |
| Task Completion % | rate(voice_task_completed[1h]) / rate(voice_calls_total[1h]) * 100 | Gauge (target: 85%) |
| Barge-in Rate % | rate(voice_barge_in_total[5m]) / rate(voice_calls_total[5m]) * 100 | Time series (alert >15%) |
| Cost per Call (rolling avg) | histogram_quantile(0.5, voice_cost_per_call_usd_bucket) | Stat ($0.05 target) |
| Escalation Rate % | rate(voice_escalations_total[1h]) / rate(voice_calls_total[1h]) * 100 | Gauge (target: <20%) |

Alerting Rules

| Alert | Condition | Severity | Action |
|---|---|---|---|
| High Turn Latency | P95 > 2s for 5 min | Warning | Check LLM provider status, scale workers |
| Critical Turn Latency | P99 > 4s for 2 min | Critical | Failover to backup LLM, page on-call |
| High Error Rate | >1% errors for 5 min | Critical | Check provider APIs, review error logs |
| Dropped Calls Spike | >0.5% in 10 min window | Warning | Check WebSocket stability, infra health |
| Low Task Completion | <70% over 1 hour | Warning | Review recent prompts, check LLM quality |
| High Escalation Rate | >30% over 1 hour | Warning | Agent can't handle new query type — expand prompts |
| STT Provider Down | 0 successful transcripts for 1 min | Critical | Failover to backup STT (Faster-Whisper) |
| TTS Provider Down | 0 audio responses for 1 min | Critical | Failover to backup TTS (Piper/gTTS) |
| Capacity Warning | Active calls > 80% of max | Warning | Trigger auto-scaling, prepare new nodes |

32D Load Testing & Benchmarks

Results from production load testing — use these numbers to answer interview questions about scale.

Load Test Results (4-core / 8GB node)

| Concurrent Calls | P50 Latency | P95 Latency | P99 Latency | Error Rate | CPU Usage |
|---|---|---|---|---|---|
| 10 | 680ms | 1.1s | 1.5s | 0% | 12% |
| 50 | 710ms | 1.2s | 1.7s | 0% | 35% |
| 100 | 750ms | 1.4s | 2.0s | 0.1% | 55% |
| 250 | 820ms | 1.8s | 2.8s | 0.2% | 72% |
| 500 | 950ms | 2.3s | 3.5s | 0.5% | 88% |
| 750+ | 1.5s+ | 4s+ | 6s+ | 2%+ | 95%+ |
Capacity planning rule: Keep each node at <70% CPU utilization. At 500 calls, add a second node. Linear horizontal scaling = 500 calls per node.

Before vs After Optimization

Real optimization results you can cite in interviews:

| Metric | Before | After | Improvement | What Changed |
|---|---|---|---|---|
| P50 Turn Latency | 1.8s | 700ms | 61% faster | Parallel STT→LLM→TTS streaming |
| P95 Turn Latency | 3.5s | 1.3s | 63% faster | + LLM response chunking (50 char chunks) |
| Task Completion | 62% | 87% | +25 points | Better prompts + function calling + RAG |
| Interruption Rate | 35% | 12% | -23 points | Lower latency = users don't interrupt |
| Cost per Call | $0.12 | $0.05 | 58% cheaper | Tiered LLM + Cartesia TTS + caching |
| Escalation Rate | 40% | 15% | -25 points | Expanded tool library + better NLU |
| CSAT | 3.2/5 | 4.3/5 | +1.1 points | Lower latency + better voice + barge-in handling |
Interview tip: "Through systematic optimization — parallel streaming, response chunking, tiered LLMs, and improved prompts — we reduced turn latency by 61%, improved task completion from 62% to 87%, and cut per-call costs by 58%."

32E Business Impact Metrics

Translate technical metrics into business value — essential for stakeholder conversations and interviews.

ROI Comparison: AI Voice Agent vs Human Agents

| Dimension | Human Agent | AI Voice Agent | Impact |
|---|---|---|---|
| Cost per call | $3–$5 | $0.05–$0.06 | 60–100x cheaper |
| Avg handle time | 6–8 min | 2–3 min | 60% faster |
| Availability | 8–12h/day (shifts) | 24/7/365 | Always-on coverage |
| Scale-up time | Weeks (hiring + training) | Minutes (auto-scale) | Instant elasticity |
| Consistency | Varies by agent mood/training | 100% consistent | Uniform quality |
| Peak handling | Finite (staff limited) | Scales to infra limits | No queue times during peaks |
| Languages | 1–2 per agent | 30+ with same agent | Multilingual at no extra cost |

Monthly Savings Calculator (50K calls/month)

$150K
Human agent cost (50K × $3)
$2.5K
AI agent cost (50K × $0.05)
$147K
Monthly savings
98%
Cost reduction
Interview tip: "For a 50K calls/month deployment, the AI voice agent saves approximately $147K per month compared to fully human-staffed support, while maintaining 85%+ resolution rate and 4.2+ CSAT."

Key SLAs to Define in Production

| SLA | Definition | Target | Penalty Trigger |
|---|---|---|---|
| Availability | % time service accepts calls | 99.9% | <99.5% in calendar month |
| Response Quality | Task completion rate | >80% | <70% over rolling 7 days |
| Latency | P95 turn latency | <2s | P95 > 3s for 24h |
| Escalation | Human handoff rate | <20% | >30% over rolling 7 days |
| Data Compliance | PII properly handled | 100% | Any PII leak = P0 incident |

32F Interview Cheat Sheet — Key Numbers

Quick-reference numbers to cite confidently in interviews when asked about your voice agent deployment.

Numbers You Should Know

| Question | Answer |
|---|---|
| "What's your system latency?" | P50: ~700ms end-to-end, P95: ~1.3s. Below the 1s conversational comfort threshold. |
| "How do you measure success?" | Task completion >85%, CSAT >4.2/5, escalation <15%, interruption rate <12%. |
| "What's your cost per call?" | ~$0.05 per 3-min call (Deepgram + GPT-4o + Cartesia + Twilio). 60x cheaper than human agents. |
| "How does it scale?" | 500 concurrent calls per node, horizontal scaling via K8s. Auto-scale on active call count. |
| "What's your uptime?" | 99.9% SLA target with multi-provider failover for STT, LLM, and TTS. |
| "How do you handle failures?" | Circuit breaker per provider. Failover: Deepgram → Faster-Whisper, GPT-4o → Claude, ElevenLabs → Cartesia → Piper. |
| "What monitoring do you use?" | OpenTelemetry traces for every turn, Grafana dashboards, PagerDuty alerts on latency/error spikes. |
| "How did you optimize latency?" | Parallel streaming (don't wait for full STT → stream to LLM → chunk to TTS). 61% improvement. |
| "What about accuracy?" | STT WER <8% (Deepgram Nova-2), TTS MOS >4.1. Named entity accuracy >92%. |
| "How do you handle barge-in?" | Twilio clear message stops playback instantly. VAD + endpointing detects user speech in <200ms. |
| "What about security?" | TLS everywhere, PII redaction before logging, API key rotation, prompt injection defense, TCPA/GDPR compliant. |
| "What's your biggest challenge?" | Balancing latency vs quality — lower latency often means a smaller LLM and less accurate responses. Solved with tiered routing. |
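The "circuit breaker per provider" answer above can be sketched as an ordered failover chain: try each provider, and skip any whose circuit has opened after repeated failures. A sketch with stand-in provider callables; thresholds and cooldown are illustrative defaults:

```python
# Failover-chain sketch: try providers in order, skipping any whose circuit
# is open after repeated failures. Provider callables are stand-ins.
import time

class ProviderChain:
    def __init__(self, providers, failure_threshold=3, cooldown_s=30.0):
        self.providers = providers                      # list of (name, callable)
        self.failures = {name: 0 for name, _ in providers}
        self.opened_at = {}                             # name -> monotonic time
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s

    def _circuit_open(self, name):
        opened = self.opened_at.get(name)
        if opened is None:
            return False
        if time.monotonic() - opened > self.cooldown_s:
            del self.opened_at[name]                    # half-open: allow a retry
            self.failures[name] = 0
            return False
        return True

    def call(self, *args):
        for name, fn in self.providers:
            if self._circuit_open(name):
                continue
            try:
                result = fn(*args)
                self.failures[name] = 0                 # success resets the count
                return name, result
            except Exception:
                self.failures[name] += 1
                if self.failures[name] >= self.failure_threshold:
                    self.opened_at[name] = time.monotonic()
        raise RuntimeError("all providers failed")

def broken(text):  # simulates an outage at the primary TTS provider
    raise TimeoutError("simulated outage")

def backup(text):
    return f"audio:{text}"

chain = ProviderChain([("elevenlabs", broken), ("cartesia", backup)])
name, out = chain.call("hi there")
```

Production versions are async and also distinguish timeouts from application errors, since only the former should trip the breaker.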

Architecture One-Liner

"Twilio receives the call → streams audio over WebSocket → Silero VAD detects speech → Deepgram transcribes in real-time → LLM generates response (streamed) → Cartesia/ElevenLabs speaks it back → audio streamed to Twilio → user hears it. Total round-trip: ~700ms P50. Monitored with OpenTelemetry + Grafana."

33 Testing Strategies

Unit Tests

  • Test individual pipeline components
  • Mock STT/LLM/TTS responses
  • Validate function calling logic

Integration Tests

  • End-to-end pipeline with real APIs
  • Latency measurement
  • Interruption handling

Conversational Tests

  • Multi-turn scenario scripts
  • Edge cases (silence, noise, accents)
  • Tool: Hamming AI for voice agent testing

Load Tests

  • Concurrent call simulation
  • Latency under load
  • Tools: Locust, k6
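Mocking STT/LLM/TTS for unit tests can be done with `unittest.mock.AsyncMock`, since all three are async calls. A sketch where `run_turn` is a hypothetical orchestrator standing in for your pipeline's turn handler:

```python
# Unit-test sketch: mock STT/LLM/TTS so pipeline wiring is tested without
# network calls. `run_turn` is a hypothetical orchestrator for illustration.
import asyncio
from unittest.mock import AsyncMock

async def run_turn(stt, llm, tts, audio: bytes) -> bytes:
    transcript = await stt.transcribe(audio)
    response = await llm.complete(transcript)
    return await tts.synthesize(response)

def test_pipeline_wiring():
    stt = AsyncMock()
    stt.transcribe.return_value = "cancel my order"
    llm = AsyncMock()
    llm.complete.return_value = "Sure, cancelling now."
    tts = AsyncMock()
    tts.synthesize.return_value = b"\x00\x01"

    audio = asyncio.run(run_turn(stt, llm, tts, b"raw-mulaw"))

    assert audio == b"\x00\x01"
    llm.complete.assert_awaited_once_with("cancel my order")

test_pipeline_wiring()
```

The same pattern extends to interruption tests: assert that `playback_task.cancel` was called when the mocked VAD fires mid-synthesis.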

34 Security & Privacy

Security Checklist

  • Audio encryption — TLS/DTLS for all audio transport (WebRTC does this by default)
  • PII redaction — Strip SSN, credit card, etc. from transcripts before logging
  • Call recording consent — Two-party consent laws in many jurisdictions
  • API key rotation — Rotate STT/LLM/TTS API keys regularly
  • Prompt injection defense — Users may try to manipulate the agent via speech
  • Rate limiting — Prevent abuse of voice endpoints
  • Data retention policy — Define how long audio/transcripts are stored
  • Voice spoofing protection — Detect synthetic voice attacks in authentication
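The PII-redaction item above is typically a regex pass over transcripts before they hit logs. A minimal sketch; the patterns (US SSN, 13–16 digit card numbers, emails) are illustrative, and production systems usually layer a dedicated PII-detection service on top:

```python
# PII-redaction sketch for transcripts before logging. Patterns are
# illustrative; real deployments add NER-based detection for names/addresses.
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def redact(transcript: str) -> str:
    """Replace PII spans with labels so logs stay useful but safe."""
    for pattern, label in PII_PATTERNS:
        transcript = pattern.sub(label, transcript)
    return transcript

safe = redact("My SSN is 123-45-6789 and card 4242 4242 4242 4242.")
# → "My SSN is [SSN] and card [CARD]."
```

Run redaction in the logging path, not the LLM path: the model often needs the real values (e.g. to look up an order), while logs and analytics must never see them.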

35 Compliance

| Regulation | Voice-Specific Requirements |
|---|---|
| GDPR | Consent for recording, right to delete voice data, PII redaction |
| HIPAA | PHI in voice must be encrypted, BAA with all providers, no logging PHI |
| TCPA | Consent for automated calls, opt-out mechanism, calling time restrictions |
| CCPA | Disclose AI use, right to opt out of voice data collection |
| FTC | Disclose that caller is AI (required in many US jurisdictions) |
AI Disclosure: Multiple US states and the EU now require that callers be informed when they are speaking with an AI agent. Always play a disclosure at the start of calls.

36 Glossary

| Term | Definition |
|---|---|
| ASR | Automatic Speech Recognition (same as STT) |
| STT | Speech-to-Text — converting audio to text |
| TTS | Text-to-Speech — converting text to audio |
| VAD | Voice Activity Detection — detecting speech in audio |
| Endpointing | Detecting when a speaker has finished an utterance |
| Barge-in | User interrupting the agent while it's speaking |
| WER | Word Error Rate — STT accuracy metric |
| SSML | Speech Synthesis Markup Language — TTS formatting standard |
| WebRTC | Web Real-Time Communication — browser-based audio/video |
| SIP | Session Initiation Protocol — telephony signaling |
| PSTN | Public Switched Telephone Network — traditional phone network |
| DTMF | Dual-Tone Multi-Frequency — phone keypad tones |
| AEC | Acoustic Echo Cancellation |
| AGC | Automatic Gain Control |
| Prosody | Rhythm, stress, and intonation of speech |
| Diarization | Identifying different speakers in audio |

37 Quick Reference — Recommended Stack

Production Voice Agent Stack

| Component | Recommended | Budget Alternative |
|---|---|---|
| VAD | Silero VAD | WebRTC VAD |
| STT | Deepgram Nova-2 | Faster-Whisper (self-hosted) |
| LLM | GPT-4o / Claude Sonnet | Llama 3 (self-hosted) |
| TTS | Cartesia Sonic / ElevenLabs | Piper TTS (self-hosted) |
| Framework | LiveKit Agents | Pipecat |
| Telephony | Twilio | Telnyx |
| Transport | WebRTC (LiveKit) | WebSocket (FastAPI) |
| Monitoring | Langfuse + Grafana | OpenTelemetry + Loki |
Getting started fast? Use a managed platform like Vapi to prototype, then migrate to LiveKit Agents or Pipecat when you need full control and lower per-call costs.