Building a Voice Agent
End-to-end technical guide — from microphone input to spoken response, covering STT, NLU, LLM, TTS, real-time streaming, telephony, and production deployment.
01 Overview
A voice agent is an AI system that listens to human speech, understands intent, reasons over context, and responds with natural-sounding speech — all in real time. Modern voice agents combine automatic speech recognition (ASR/STT), large language models (LLMs), and neural text-to-speech (TTS) into a low-latency pipeline.
Key Challenges
- Latency — Humans expect sub-second responses; every millisecond matters
- Interruption handling — Users barge-in mid-sentence; agent must stop and listen
- Ambient noise — Real-world audio is noisy; robust VAD and ASR needed
- Turn-taking — Detecting when the user has finished speaking (endpointing)
- Naturalness — TTS must sound human, with proper prosody and emotion
- Context retention — Multi-turn conversations require persistent memory
- Concurrent calls — Production systems handle thousands of simultaneous calls
02 System Architecture
The pipeline chains VAD → STT → NLU → LLM → TTS over a bi-directional streaming transport (WebSocket, WebRTC, or SIP). For telephony deployments, the TTS stage can emit ulaw_8000 directly for Twilio, with zero transcoding overhead.
Component Responsibilities
| Component | Role | Latency Target |
|---|---|---|
| VAD | Detect when user is speaking vs silence/noise | <10ms |
| STT / ASR | Convert audio stream to text (transcription) | 50–300ms |
| NLU | Extract intent, entities, sentiment from text | 10–50ms |
| LLM | Generate contextual response (reasoning engine) | 200–800ms (first token) |
| TTS | Convert response text to audio waveform | 50–200ms (first byte) |
| Transport | Bi-directional audio streaming (WebSocket/WebRTC/SIP) | <50ms |
03 Voice Pipeline (Step by Step)
# Pseudocode: Streaming voice pipeline
async def voice_pipeline(audio_stream):
    # Stage 1: VAD → filter silence
    speech_chunks = vad.filter(audio_stream)

    # Stage 2: Streaming STT → interim + final transcripts
    async for transcript in stt.transcribe_stream(speech_chunks):
        if transcript.is_final:
            # Stage 3: Stream LLM response token by token
            sentence_buffer = ""
            async for token in llm.stream(transcript.text, context):
                sentence_buffer += token
                # Stage 4: Send complete sentences to TTS
                if ends_with_punctuation(sentence_buffer):
                    async for audio in tts.synthesize_stream(sentence_buffer):
                        yield audio  # → speaker
                    sentence_buffer = ""
04 Latency Budget
Voice agents are latency-critical. Humans perceive pauses >600ms as unnatural. The target is <1 second from end of user speech to beginning of agent speech.
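To make the sub-one-second target concrete, here is a quick sanity check that sums illustrative per-stage budgets. The stage names and figures below are assumptions chosen for the sketch, not measurements:

```python
# Illustrative latency budget for voice-to-voice response time.
# All figures are mid-range assumptions, not benchmark results.
BUDGET_MS = {
    "endpointing_silence": 300,  # VAD silence threshold before "final"
    "stt_finalize": 100,         # streaming STT emits the final transcript
    "llm_first_token": 400,      # time to first LLM token
    "tts_first_byte": 100,       # time to first synthesized audio byte
    "transport": 50,             # network hops + playout buffering
}

def total_latency_ms(budget: dict) -> int:
    """End-to-end delay from end of user speech to start of agent audio."""
    return sum(budget.values())

print(total_latency_ms(BUDGET_MS))  # → 950
```

At 950ms this hypothetical budget just fits under the 1-second target, which is why the endpointing threshold and LLM first-token time are the first knobs to tune.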
Latency Optimization Techniques
| Technique | Savings | How |
|---|---|---|
| Streaming STT | 200–500ms | Don't wait for end-of-utterance; use interim results |
| LLM streaming | 500ms+ | Start TTS on first sentence, not full response |
| TTS streaming | 200–400ms | Begin audio playback before full synthesis completes |
| Sentence-level TTS | 100–300ms | Buffer LLM tokens into sentences for TTS chunks |
| Speculative prefill | 100–200ms | Start LLM prompt while STT is still finalizing |
| Semantic caching | 300–700ms | Cache responses for common queries |
| Edge deployment | 50–150ms | Co-locate STT/TTS near users (reduce network hops) |
| Shorter endpointing | 100–200ms | Tune VAD silence threshold (risk: premature cutoff) |
| Warm connections | 50–100ms | Keep persistent connections to STT/LLM/TTS APIs |
05 Speech-to-Text (STT / ASR)
Automatic Speech Recognition converts audio waveforms into text. For voice agents, streaming STT is essential — results must arrive incrementally as the user speaks.
Key STT Concepts
- Streaming vs Batch — Streaming gives interim results in real-time; batch processes complete files
- Interim (partial) results — Unstable text that updates as more audio arrives
- Final results — Stable transcript after endpointing detects end of utterance
- Endpointing — Detecting when the user has stopped speaking (silence duration)
- Word-level timestamps — Timing for each word (useful for alignment and analytics)
- Speaker diarization — Identifying different speakers in multi-party audio
- Custom vocabulary — Boost recognition of domain-specific terms
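Interim and final results need different handling inside the agent: interims may change on every update, while finals are stable and safe to commit. A hypothetical aggregator (all names below are illustrative) keeps the two apart:

```python
# Sketch: accumulate a user turn from streaming STT results.
# Interim results only overwrite a pending buffer; finals are committed.
class TranscriptAggregator:
    def __init__(self):
        self.committed = []  # finalized segments for this turn
        self.pending = ""    # latest interim text (may still change)

    def on_result(self, text: str, is_final: bool) -> None:
        if is_final:
            if text:
                self.committed.append(text)
            self.pending = ""    # any interim is superseded by the final
        else:
            self.pending = text  # replace, never append: interims are unstable

    def turn_text(self) -> str:
        """Best current view of the user's turn (finals + latest interim)."""
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)

agg = TranscriptAggregator()
agg.on_result("check my", is_final=False)
agg.on_result("check my order", is_final=True)
agg.on_result("number", is_final=False)
print(agg.turn_text())  # → "check my order number"
```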
06 STT Engines Compared
| Engine | Type | Streaming | Latency | Best For |
|---|---|---|---|---|
| Deepgram | Cloud API | Yes (WebSocket) | ~100ms | Lowest latency, voice agents |
| Google Cloud STT | Cloud API | Yes (gRPC) | ~200ms | Multi-language, enterprise |
| Azure Speech | Cloud API | Yes (WebSocket) | ~150ms | Microsoft ecosystem |
| AWS Transcribe | Cloud API | Yes (WebSocket) | ~250ms | AWS ecosystem |
| AssemblyAI | Cloud API | Yes (WebSocket) | ~200ms | Accuracy, LeMUR integration |
| OpenAI Whisper | Open-source / API | No (batch only) | 1–5s | Accuracy, self-hosted, offline |
| Whisper.cpp | Open-source (C++) | Pseudo-stream | ~500ms | Edge/local deployment |
| Faster-Whisper | Open-source (CTranslate2) | No | ~300ms | Fast self-hosted batch |
| Vosk | Open-source | Yes | ~200ms | Offline, lightweight, edge |
# Deepgram Streaming STT (WebSocket)
import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

dg = DeepgramClient(api_key="YOUR_KEY")
connection = dg.listen.asyncwebsocket.v("1")

async def on_message(self, result, **kwargs):
    transcript = result.channel.alternatives[0].transcript
    is_final = result.is_final
    if is_final and transcript:
        print(f"Final: {transcript}")
        # → Send to LLM

connection.on(LiveTranscriptionEvents.Transcript, on_message)

options = LiveOptions(
    model="nova-2",
    language="en",
    encoding="linear16",
    sample_rate=16000,
    interim_results=True,
    endpointing=300,  # ms of silence before final
    smart_format=True,
    vad_events=True,
)
await connection.start(options)

# Send audio chunks (20ms frames)
async for chunk in mic_stream:
    connection.send(chunk)
06A Why Deepgram — Deep Dive
Deepgram is the recommended STT engine for production voice agents. Here's a detailed analysis of why it outperforms alternatives for real-time conversational AI.
Why Deepgram Over Alternatives
| Criteria | Deepgram | Google Cloud STT | Whisper (OpenAI) | Azure Speech |
|---|---|---|---|---|
| Streaming Latency | ~100ms (best-in-class) | ~200ms | N/A (batch only) | ~150ms |
| Native WebSocket | Yes (first-class) | gRPC only | No | Yes |
| Built-in Endpointing | Yes (configurable ms) | Limited | No | Yes |
| Built-in VAD Events | Yes | No | No | Limited |
| Word-level Timestamps | Yes | Yes | Yes | Yes |
| Smart Formatting | Auto (numbers, dates, currency) | Manual config | No | Yes |
| Cost (per minute) | $0.0043/min (~$0.26/hr) | $0.024/min | $0.006/min (API) | $0.016/min |
| Custom Vocabulary | Keywords + model training | Phrase hints | Prompt only | Phrase lists |
| Voice Agent Optimized | Yes (Nova-2 model) | General purpose | General purpose | General purpose |
Key Reasons to Choose Deepgram
- Lowest streaming latency in the industry (~100ms) — Deepgram's end-to-end deep learning model is purpose-built for real-time. Unlike traditional ASR pipelines (acoustic model → language model → decoder), Deepgram uses a single neural network that processes audio directly, eliminating inter-stage latency.
- Native WebSocket API designed for voice agents — Deepgram's primary API is a persistent WebSocket connection that accepts raw audio frames and returns JSON transcripts. This is exactly what voice agents need — no gRPC complexity (Google), no REST polling (Whisper), no SDK abstraction overhead.
- Built-in endpointing and VAD events — Deepgram detects when users stop speaking and emits speech_final and utterance_end events with configurable silence thresholds. Other STT engines require you to implement VAD and endpointing separately.
- Smart formatting out of the box — Automatically formats numbers ("three hundred" → "300"), dates, currency, and punctuation. This means the text sent to the LLM is clean and structured without post-processing.
- Cost-effective at scale — At $0.0043/minute for Nova-2, Deepgram is 4–6x cheaper than Google Cloud STT and Azure Speech, which matters significantly when handling thousands of concurrent calls.
- Nova-2 model specifically optimized for conversational speech — Unlike Whisper (optimized for transcription accuracy on long-form audio), Nova-2 is trained on conversational, real-time speech patterns with lower word error rates on voice agent dialogue.
06B Deepgram Features for Voice Agents
Endpointing Configuration
Fine-tune when Deepgram considers a user utterance "done." Lower values = faster response but risk cutting off the user.
endpointing=300 # 300ms silence = end of utterance
endpointing=500 # 500ms for cautious endpointing
endpointing=false # Disable (you handle it)
Utterance Detection
Separate from endpointing — detects utterance boundaries even in continuous speech.
utterance_end_ms=1000 # Gap between utterances
interim_results=true # Get partial transcripts
vad_events=true # Speech start/stop events
Smart Formatting
Auto-converts spoken forms to written forms for cleaner LLM input.
- "three hundred dollars" → "$300"
- "january fifth twenty twenty six" → "January 5, 2026"
- "one two three four" → "1234" (in number context)
Keyword Boosting
Boost recognition of domain-specific terms that the model might miss.
keywords=[
    "Acme:2",          # Boost "Acme" by 2x
    "SKU:1.5",         # Product codes
    "onboarding:1.5",  # Domain terms
]
06C Deepgram Implementation
# Complete Deepgram Streaming STT for Voice Agent
import asyncio, json
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    LiveTranscriptionEvents,
    LiveOptions,
)

class DeepgramSTTEngine:
    """Production-ready Deepgram STT wrapper for voice agents."""

    def __init__(self, api_key: str, on_transcript, on_speech_started=None):
        self.client = DeepgramClient(api_key, DeepgramClientOptions(
            options={"keepalive": "true"}  # Persistent connection
        ))
        self.on_transcript = on_transcript
        self.on_speech_started = on_speech_started
        self.connection = None

    async def connect(self):
        self.connection = self.client.listen.asyncwebsocket.v("1")

        # Register event handlers
        self.connection.on(LiveTranscriptionEvents.Transcript, self._on_message)
        self.connection.on(LiveTranscriptionEvents.SpeechStarted, self._on_speech_started)
        self.connection.on(LiveTranscriptionEvents.UtteranceEnd, self._on_utterance_end)
        self.connection.on(LiveTranscriptionEvents.Error, self._on_error)

        options = LiveOptions(
            model="nova-2",         # Best for conversational speech
            language="en",
            encoding="linear16",    # 16-bit PCM
            sample_rate=16000,      # 16kHz mono
            channels=1,
            interim_results=True,   # Get partial transcripts for UI
            endpointing=300,        # 300ms silence = final
            utterance_end_ms=1000,  # Utterance boundary detection
            smart_format=True,      # Auto-format numbers, dates
            punctuate=True,         # Add punctuation
            vad_events=True,        # Speech start/stop events
            filler_words=False,     # Remove "um", "uh"
        )
        if not await self.connection.start(options):
            raise ConnectionError("Failed to connect to Deepgram")
        print("✓ Deepgram STT connected")

    async def send_audio(self, audio_bytes: bytes):
        """Send raw audio chunk (20ms frame = 640 bytes at 16kHz/16bit)."""
        if self.connection:
            self.connection.send(audio_bytes)

    async def _on_message(self, _self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if not transcript:
            return
        if result.is_final:
            # Final transcript → send to LLM
            confidence = result.channel.alternatives[0].confidence
            await self.on_transcript(transcript, is_final=True, confidence=confidence)
        else:
            # Interim → update UI only
            await self.on_transcript(transcript, is_final=False)

    async def _on_speech_started(self, _self, speech_started, **kwargs):
        # User started speaking → interrupt agent if needed
        if self.on_speech_started:
            await self.on_speech_started()

    async def _on_utterance_end(self, _self, utterance_end, **kwargs):
        # Clean boundary between utterances
        pass

    async def _on_error(self, _self, error, **kwargs):
        print(f"Deepgram error: {error}")

    async def close(self):
        if self.connection:
            await self.connection.finish()
Deepgram Audio Format Requirements
| Parameter | Recommended | Why |
|---|---|---|
| Sample Rate | 16,000 Hz | Standard for speech; higher adds bandwidth without improving recognition |
| Bit Depth | 16-bit (linear16) | Good dynamic range, supported by all providers |
| Channels | 1 (mono) | Speech is mono; stereo wastes bandwidth |
| Frame Size | 20ms (640 bytes) | Standard VoIP frame size; balances latency and efficiency |
| From Twilio | mulaw 8kHz | Telephony standard; Deepgram accepts mulaw natively |
07 Streaming Recognition
Streaming STT Best Practices
- Use 16kHz, 16-bit mono PCM (linear16) for best quality/bandwidth balance
- Send audio in 20ms frames (320 samples = 640 bytes at 16kHz, 16-bit)
- Enable interim results for UI feedback but trigger LLM only on final results
- Set endpointing to 300–500ms for conversational voice agents
- Use VAD events to detect speech start/stop separately from transcription
- Implement utterance-level buffering to handle multi-sentence turns
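Frame sizing follows directly from sample rate, bit depth, channel count, and frame duration. The helper below is an illustrative sanity check of the numbers used throughout this guide:

```python
def frame_bytes(sample_rate_hz: int, bit_depth: int, channels: int, frame_ms: int) -> int:
    """Size in bytes of one raw PCM audio frame."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * (bit_depth // 8) * channels

# 20ms of 16kHz, 16-bit mono PCM: 320 samples, 640 bytes
print(frame_bytes(16_000, 16, 1, 20))  # → 640
# Telephony (Twilio): 8kHz mulaw is 8-bit, so a 20ms frame is 160 bytes
print(frame_bytes(8_000, 8, 1, 20))    # → 160
```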
08 Voice Activity Detection (VAD)
VAD distinguishes human speech from silence, noise, and background audio. It's the gatekeeper that decides when to start and stop STT processing.
| VAD Engine | Type | Latency | Notes |
|---|---|---|---|
| Silero VAD | Neural (PyTorch/ONNX) | <1ms per frame | Best accuracy/speed tradeoff; industry standard |
| WebRTC VAD | Signal-based (GMM) | <0.1ms | Ultra-fast, less accurate in noise |
| Picovoice Cobra | Neural (edge) | <1ms | Optimized for mobile/IoT |
| Built-in (Deepgram/Azure) | Cloud-integrated | N/A (server-side) | No extra integration needed |
# Silero VAD example
import torch

model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    trust_repo=True
)
(get_speech_timestamps, _, read_audio, _, _) = utils

# Real-time frame-by-frame
def process_frame(audio_chunk_tensor):
    speech_prob = model(audio_chunk_tensor, 16000).item()
    is_speech = speech_prob > 0.5
    return is_speech
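Raw per-frame VAD decisions are noisy, so pipelines typically debounce them: require a short run of speech frames before declaring speech, and a longer run of silence before releasing. The class below is a hedged sketch of that hangover logic; the class name and thresholds are illustrative (with 20ms frames, 3 frames ≈ 60ms and 15 frames ≈ 300ms):

```python
# Sketch: smooth raw per-frame VAD decisions with "hangover" logic,
# so a single noisy frame cannot toggle the speech state.
class VADSmoother:
    def __init__(self, start_frames: int = 3, stop_frames: int = 15):
        self.start_frames = start_frames  # frames of speech needed to trigger
        self.stop_frames = stop_frames    # frames of silence needed to release
        self.speaking = False
        self._run = 0                     # length of the current disagreeing run

    def update(self, frame_is_speech: bool) -> bool:
        if frame_is_speech == self.speaking:
            self._run = 0  # frame agrees with current state, reset counter
        else:
            self._run += 1
            needed = self.stop_frames if self.speaking else self.start_frames
            if self._run >= needed:
                self.speaking = frame_is_speech
                self._run = 0
        return self.speaking
```

The asymmetric thresholds matter: entering speech quickly keeps first-word loss low, while a slow release acts as the endpointing silence window.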
09 Natural Language Understanding (NLU) & Intent Detection
NLU processes the transcribed text to extract meaning — intents, entities, sentiment, and dialog acts. A central design question for any voice agent: how do you detect what the user wants?
What LangChain / LangGraph Actually Do
| Framework | What It Is | What It Is NOT | Role in NLU |
|---|---|---|---|
| LangChain | LLM orchestration framework — chains prompts, tools, memory, retrievers together | Not an NLU engine, not an intent classifier | Can wrap an LLM call that does intent classification via prompting or function calling |
| LangGraph | Stateful graph-based agent framework — manages state machines, routing, cycles | Not an NLU engine, not an intent classifier | Can route based on detected intent (the graph decides what to do after intent is known) |
The Three Approaches to Intent Detection
1. Traditional NLU (ML Models)
Dedicated ML models trained on labeled intent data. Fast, deterministic, predictable. Limited to pre-defined intents.
- Intent classification (book_flight, check_balance)
- Named entity extraction (dates, names, amounts)
- Slot filling for structured actions
- Requires training data (50–500+ examples per intent)
2. LLM-Powered NLU (Prompting)
Use GPT/Claude with structured output to classify intents. No training data needed. Handles unseen intents.
- LLM does intent + entity extraction in one call
- Zero-shot: works without examples
- Structured output via function calling / JSON mode
- Higher latency (200–500ms) but much more capable
3. Hybrid (Classifier + LLM Fallback)
Fast local classifier for common intents; LLM fallback for edge cases. Best of both worlds.
- Local model handles 80% of known intents (<10ms)
- LLM handles ambiguous/novel intents (200ms+)
- Router decides which path based on confidence
- Most production voice agents use this approach
09A Intent Detection — Full Comparison
| Solution | Type | Latency | Training Data | Open Intents | Cost | Best For |
|---|---|---|---|---|---|---|
| Rasa NLU | Self-hosted ML | <10ms | Required (50+ per intent) | No | Free (OSS) | Self-hosted, full control |
| Dialogflow CX | Google Cloud | ~50ms | Required (10+ per intent) | No | $0.007/req | Google ecosystem, complex flows |
| Amazon Lex | AWS Cloud | ~80ms | Required (10+ per intent) | No | $0.004/req | AWS ecosystem, Alexa-like bots |
| Azure CLU (LUIS successor) | Azure Cloud | ~60ms | Required (15+ per intent) | No | $0.005/req | Microsoft ecosystem |
| GPT-4o Function Calling | LLM (OpenAI) | 200–400ms | None (zero-shot) | Yes | ~$0.003/req | Flexible, open-ended voice agents |
| Claude Tool Use | LLM (Anthropic) | 200–500ms | None (zero-shot) | Yes | ~$0.004/req | Safety-focused, enterprise |
| FastText / Sentence-BERT | Self-hosted embeddings | <5ms | Required (20+ per intent) | No | Free (OSS) | Ultra-low latency, edge |
| SetFit (few-shot) | Self-hosted (HuggingFace) | <10ms | Minimal (8–16 per intent) | No | Free (OSS) | Few-shot scenarios, fast training |
| LLM via LangChain | Orchestrated LLM call | 200–500ms | None (zero-shot) | Yes | LLM cost | When already using LangChain |
09B Intent Detection Approaches (Detailed)
Approach 1: LLM Function Calling as Intent Detection
The most common modern approach. Define intents as "functions" — the LLM decides which function to call based on the user's speech. This effectively combines NLU + action routing in one step.
# LLM Function Calling = Intent Detection + Entity Extraction
# Define your intents as tools/functions
tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",  # ← This IS the intent
            "description": "User wants to check the status of an existing order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {  # ← This IS the entity
                        "type": "string",
                        "description": "Order ID or number"
                    }
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "User wants to speak with a human agent",
            "parameters": {
                "type": "object",
                "properties": {
                    "department": {
                        "type": "string",
                        "enum": ["billing", "support", "sales"]
                    },
                    "reason": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "make_payment",
            "description": "User wants to make a payment on their account",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "account_id": {"type": "string"}
                }
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "general_question",
            "description": "User has a general question not covered by specific functions",
            "parameters": {
                "type": "object",
                "properties": {
                    "question": {"type": "string"}
                }
            }
        }
    }
]

# User says: "I want to check on order number 4567"
# LLM returns: tool_call(name="check_order_status", args={"order_id": "4567"})
#                        ↑ intent                         ↑ entity
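Once the LLM returns a tool call, the agent still has to route it to application code. A minimal sketch of that dispatch step, with hypothetical handler functions standing in for real business logic:

```python
# Hypothetical dispatcher: route an LLM tool call (intent) to a handler.
def handle_check_order_status(args: dict) -> str:
    # Real code would query an order system; this is a stub.
    return f"Order {args['order_id']} is on the way."

def handle_general_question(args: dict) -> str:
    return "Let me look into that."

HANDLERS = {
    "check_order_status": handle_check_order_status,
    "general_question": handle_general_question,
}

def dispatch(tool_name: str, tool_args: dict) -> str:
    # Unknown intents fall back to the general handler rather than failing.
    handler = HANDLERS.get(tool_name, handle_general_question)
    return handler(tool_args)

print(dispatch("check_order_status", {"order_id": "4567"}))
# → Order 4567 is on the way.
```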
Approach 2: Traditional NLU (Rasa / Dialogflow)
# Rasa NLU Pipeline (nlu.yml)
# Train a dedicated ML model for intent classification
nlu:
- intent: check_order
  examples: |
    - where is my order
    - check order status
    - what's the status of order [4567](order_id)
    - track my package
    - I want to know where my delivery is
    - can you look up order [AB-1234](order_id)

- intent: make_payment
  examples: |
    - I'd like to pay my bill
    - make a payment of [$50](amount)
    - pay [100 dollars](amount) on my account
    - how do I pay

- intent: transfer_to_human
  examples: |
    - let me talk to a real person
    - transfer me to an agent
    - I want to speak to someone
    - get me a human

# Result: {"intent": "check_order", "confidence": 0.94,
#          "entities": [{"entity": "order_id", "value": "4567"}]}
Approach 3: Fast Embedding Classifier (SetFit / Sentence-BERT)
# Ultra-fast intent detection using sentence embeddings
# Only needs 8-16 examples per intent to train
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Train with minimal examples (SetFit expects a datasets.Dataset
# with "text" and "label" columns)
examples = [
    ("check my order", "check_order"),
    ("where is my package", "check_order"),
    ("track delivery", "check_order"),
    ("order status", "check_order"),
    ("pay my bill", "make_payment"),
    ("make a payment", "make_payment"),
    ("talk to a human", "transfer"),
    ("speak to agent", "transfer"),
    # ... 8-16 examples per intent
]
train_data = Dataset.from_dict({
    "text": [text for text, _ in examples],
    "label": [label for _, label in examples],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_data)
trainer.train()

# Inference: <5ms
intent = model.predict(["I need to check on order 4567"])[0]
# → "check_order"
Approach 4: Hybrid Router (Recommended for Production Voice Agents)
# Hybrid: Fast classifier + LLM fallback
# This is the production-recommended approach for voice agents
import json
from openai import AsyncOpenAI
from setfit import SetFitModel

class HybridIntentRouter:
    def __init__(self):
        self.fast_classifier = SetFitModel.from_pretrained("./intent-model")
        self.confidence_threshold = 0.85
        self.llm = AsyncOpenAI()

    async def detect_intent(self, transcript: str) -> dict:
        # Step 1: Try fast classifier (~5ms)
        probs = self.fast_classifier.predict_proba([transcript])[0]
        confidence = float(probs.max())
        top_intent = self.fast_classifier.predict([transcript])[0]
        if confidence >= self.confidence_threshold:
            # High confidence → use fast result (saves 200-400ms)
            return {
                "intent": top_intent,
                "confidence": confidence,
                "method": "fast_classifier",
                "latency_ms": 5,
            }
        # Step 2: Low confidence → fall back to LLM (200-400ms)
        llm_result = await self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Classify the user's intent. Return JSON: {intent, entities, confidence}"},
                {"role": "user", "content": transcript},
            ],
            response_format={"type": "json_object"},
        )
        return {**json.loads(llm_result.choices[0].message.content), "method": "llm_fallback"}
09C LangChain / LangGraph Role in Voice Agents
Since LangChain and LangGraph are often confused with NLU, here's exactly what role they play in a voice agent pipeline.
What LangChain Does in a Voice Agent
| Capability | LangChain Role | Not LangChain's Job |
|---|---|---|
| Intent Detection | Wraps an LLM call that does intent detection via function calling | Does not provide its own intent classifier |
| Entity Extraction | LLM extracts entities via structured output (Pydantic models) | Does not have NER models |
| Conversation Memory | Yes — ConversationBufferMemory, summary memory, etc. | — |
| RAG Retrieval | Yes — retrievers, vector stores, rerankers | — |
| Tool/Function Calling | Yes — tool definitions, execution, result handling | — |
| Prompt Management | Yes — templates, few-shot examples, output parsers | — |
| Agent Orchestration | Yes (via LangGraph) — state machines, routing, cycles | — |
What LangGraph Does in a Voice Agent
# LangGraph voice agent with intent routing
# Note: Intent detection happens INSIDE the LLM call, not from LangGraph
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class VoiceState(TypedDict):
    transcript: str
    intent: str
    entities: dict
    response: str
    conversation_history: list

# Node 1: Detect intent (uses LLM — LangGraph doesn't do this itself)
async def detect_intent(state: VoiceState) -> VoiceState:
    # Option A: Fast classifier
    result = fast_classifier.predict(state["transcript"])
    # Option B: LLM function calling
    # result = await llm.classify(state["transcript"])
    state["intent"] = result.intent
    state["entities"] = result.entities
    return state

# Router: LangGraph routes based on detected intent
def route_intent(state: VoiceState) -> Literal["order", "payment", "transfer", "general"]:
    intent_map = {
        "check_order": "order",
        "make_payment": "payment",
        "transfer_to_human": "transfer",
    }
    return intent_map.get(state["intent"], "general")

# Build graph
graph = StateGraph(VoiceState)
graph.add_node("detect_intent", detect_intent)
graph.add_node("order", handle_order_check)
graph.add_node("payment", handle_payment)
graph.add_node("transfer", handle_transfer)
graph.add_node("general", handle_general_query)
graph.add_node("respond", generate_voice_response)

graph.set_entry_point("detect_intent")
graph.add_conditional_edges("detect_intent", route_intent)
for node in ["order", "payment", "transfer", "general"]:
    graph.add_edge(node, "respond")
graph.add_edge("respond", END)

voice_agent = graph.compile()
09D Intent Detection Decision Guide
Which Approach Should You Use?
| Your Situation | Recommended Approach | Why |
|---|---|---|
| Well-defined intents (10–50), latency critical | SetFit / FastText classifier | <5ms, deterministic, no LLM cost |
| Complex flows with many intents + Google ecosystem | Dialogflow CX | Visual flow builder, Google integrations |
| Open-ended conversation, can't pre-define all intents | LLM function calling | Handles anything, zero training data |
| Enterprise with existing Rasa infrastructure | Rasa NLU | Self-hosted, full control, proven at scale |
| Production voice agent (best overall) | Hybrid: fast classifier + LLM fallback | Fast for common intents, LLM for edge cases |
| Prototype / MVP (ship fast) | LLM function calling only | Zero setup, works immediately |
| Edge / offline deployment | SetFit or Vosk + local model | No cloud dependency |
10 Dialog Management
Controls the flow of conversation — tracking state, managing turns, handling context switches, and deciding what action to take next.
Dialog Management Approaches
| Approach | How | Best For |
|---|---|---|
| Finite State Machine | Predefined states and transitions | Simple IVR, scripted flows |
| Frame-Based | Fill slots until action is ready | Form-filling (booking, orders) |
| LLM-Driven | LLM decides next action via system prompt | Open-ended conversation |
| Hybrid (Graph + LLM) | Graph for structure, LLM for flexibility | Enterprise voice agents (recommended) |
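The frame-based approach from the table can be sketched in a few lines: a hypothetical `BookingFrame` (names and slots are illustrative) tracks which slots are filled and prompts for the first missing one:

```python
# Sketch of a frame-based dialog manager: collect required slots, then act.
class BookingFrame:
    REQUIRED = ("date", "time", "party_size")

    def __init__(self):
        self.slots = {}

    def fill(self, entities: dict) -> None:
        # Only accept entities that correspond to known slots.
        for key, value in entities.items():
            if key in self.REQUIRED:
                self.slots[key] = value

    def missing(self) -> list:
        return [slot for slot in self.REQUIRED if slot not in self.slots]

    def next_prompt(self) -> str:
        gaps = self.missing()
        if not gaps:
            return "Confirming your booking now."
        return f"What {gaps[0].replace('_', ' ')} would you like?"

frame = BookingFrame()
frame.fill({"date": "2026-01-05"})
print(frame.next_prompt())  # → What time would you like?
frame.fill({"time": "19:00", "party_size": 4})
print(frame.next_prompt())  # → Confirming your booking now.
```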
11 LLM Integration
The LLM is the reasoning brain of the voice agent. It processes the user's transcript, conversation history, and system instructions to generate responses.
# Voice-optimized LLM prompt
SYSTEM_PROMPT = """You are a helpful voice assistant for Acme Corp customer support.

VOICE-SPECIFIC RULES:
- Keep responses SHORT (1-3 sentences). Voice != chat.
- Use conversational language, contractions, natural phrasing.
- NEVER use markdown, bullet points, URLs, or special formatting.
- Spell out numbers: "twenty three" not "23".
- For lists, say "first... second... third..." not "1. 2. 3."
- If unsure, ask ONE clarifying question at a time.
- Acknowledge the user before answering: "Sure!", "Great question.", etc.

FUNCTION CALLING:
- Use check_order_status(order_id) for order inquiries.
- Use transfer_to_human(department) if user explicitly asks for a person.
- Use schedule_callback(phone, time) for callback requests.

CONTEXT:
- Customer: {customer_name}
- Account tier: {tier}
- Previous interactions: {history_summary}
"""

# Streaming LLM call
async def stream_llm_response(transcript, context):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(**context)},
            *conversation_history,
            {"role": "user", "content": transcript}
        ],
        stream=True,
        temperature=0.7,
        max_tokens=150,  # Keep voice responses short
    )
    sentence_buffer = ""
    async for chunk in response:
        token = chunk.choices[0].delta.content or ""
        sentence_buffer += token
        # Yield complete sentences/clauses for TTS (flushing on commas
        # trades a little prosody for lower latency)
        if any(sentence_buffer.rstrip().endswith(p) for p in (".", "!", "?", ",")):
            yield sentence_buffer.strip()
            sentence_buffer = ""
    if sentence_buffer.strip():
        yield sentence_buffer.strip()
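Because sentence buffering is easy to get subtly wrong (trailing partial sentences, stray whitespace), it helps to keep the rule as a pure, offline-testable generator. This sketch mirrors the buffering logic used in this guide, here flushing only on sentence-final punctuation:

```python
# Pure, testable version of sentence-level chunking for TTS.
SENTENCE_END = (".", "!", "?")

def chunk_sentences(tokens):
    """Group a token stream into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence

tokens = ["Sure", "!", " Your", " order", " shipped", " today", "."]
print(list(chunk_sentences(tokens)))
# → ['Sure!', 'Your order shipped today.']
```

Keeping this step pure means the streaming code only wires an async token source into it, and the tricky edge cases can be covered by plain unit tests.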
12 RAG for Voice Agents
Retrieval-Augmented Generation connects your voice agent to enterprise knowledge bases, FAQs, product docs, and customer data — so it gives accurate, grounded answers.
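A minimal sketch of the retrieval step, using naive word-overlap scoring over a hypothetical FAQ list. Production systems would use embeddings and a vector store; this only shows how retrieved context gets folded into the voice prompt:

```python
# Toy RAG retrieval: score documents by word overlap with the query.
# The FAQ contents below are invented examples.
DOCS = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Support hours are 9am to 6pm Eastern, Monday through Friday.",
]

def retrieve(query: str, docs: list, k: int = 1) -> list:
    """Return the k docs sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    """Ground the LLM call in retrieved context, phrased for voice."""
    context = " ".join(retrieve(query, DOCS))
    return f"Answer briefly for voice, using this context: {context}\nUser: {query}"

print(retrieve("how long does shipping take", DOCS))
# → ['Standard shipping takes 3 to 5 business days.']
```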
13 Text-to-Speech (TTS)
TTS converts the LLM's text response into natural-sounding audio. Modern neural TTS produces near-human quality. For voice agents, streaming TTS is critical — audio begins playing before the full text is synthesized.
Key TTS Features for Voice Agents
- Streaming synthesis — Generate audio incrementally (sentence by sentence)
- Low first-byte latency — Start speaking as fast as possible
- Natural prosody — Proper intonation, stress, and rhythm
- Emotion/style control — Adjust tone (friendly, professional, empathetic)
- Voice cloning — Custom brand voice from audio samples
- SSML support — Fine-grained control over pronunciation, pauses, emphasis
- Multi-language — Support for global deployment
14 TTS Engines Compared
| Engine | Type | Streaming | Latency | Quality | Best For |
|---|---|---|---|---|---|
| ElevenLabs | Cloud API | Yes | ~150ms | Excellent | Highest quality, voice cloning |
| Cartesia (Sonic) | Cloud API | Yes | ~90ms | Very Good | Ultra-low latency voice agents |
| Deepgram Aura | Cloud API | Yes | ~80ms | Good | STT+TTS single vendor |
| OpenAI TTS | Cloud API | Yes | ~200ms | Very Good | OpenAI ecosystem |
| Azure Neural TTS | Cloud API | Yes | ~150ms | Very Good | Enterprise, SSML, 400+ voices |
| Google Cloud TTS | Cloud API | Yes | ~180ms | Very Good | Multi-language, WaveNet |
| Amazon Polly | Cloud API | Yes | ~200ms | Good | AWS ecosystem, NTTS voices |
| Coqui TTS | Open-source | Limited | ~300ms | Good | Self-hosted, custom voices |
| Piper TTS | Open-source | No | ~100ms | Moderate | Edge/offline, lightweight |
# ElevenLabs Streaming TTS
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")

def stream_tts(text: str):
    audio_stream = client.text_to_speech.convert_as_stream(
        voice_id="pNInz6obpgDQGcFmaJgB",  # "Adam"
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="pcm_16000",  # Raw PCM for low latency
    )
    for audio_chunk in audio_stream:
        yield audio_chunk  # Send to speaker/WebSocket

# Cartesia Streaming TTS (ultra-low latency)
from cartesia import Cartesia

cartesia = Cartesia(api_key="YOUR_KEY")

async def stream_cartesia(text: str):
    output = await cartesia.tts.sse(
        model_id="sonic-english",
        transcript=text,
        voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
        stream=True,
    )
    async for chunk in output:
        yield chunk["audio"]
14A Why ElevenLabs — Deep Dive
ElevenLabs delivers the most natural-sounding AI voices on the market. For enterprise voice agents where brand perception and user trust depend on voice quality, ElevenLabs is the premium choice.
Why ElevenLabs Over Alternatives
| Criteria | ElevenLabs | OpenAI TTS | Azure Neural | Google TTS |
|---|---|---|---|---|
| Voice Naturalness | Best-in-class (MOS ~4.5) | Very good (~4.2) | Very good (~4.1) | Good (~3.9) |
| Streaming Latency | ~150ms first byte | ~200ms | ~150ms | ~180ms |
| Voice Cloning | Professional (30s–30min audio) | No | Custom Neural Voice ($) | Limited |
| Emotion Control | Yes (style, stability sliders) | No | SSML only | SSML only |
| Voice Library | Thousands (community + premium) | 6 voices | 400+ voices | 100+ voices |
| Languages | 29 languages | ~57 languages | 140+ languages | 40+ languages |
| Turbo Model | Yes (Turbo v2.5 — ~100ms) | tts-1 (fast/lower quality) | No turbo option | No turbo option |
| Cost (per 1K chars) | $0.18–$0.30 | $0.015–$0.030 | $0.016 | $0.016 |
Key Reasons to Choose ElevenLabs
- Highest naturalness scores across independent benchmarks — ElevenLabs' Multilingual v2 and Turbo v2.5 models consistently achieve the highest Mean Opinion Scores (MOS) in blind listening tests. Users perceive ElevenLabs voices as more human-like, building trust in voice agent interactions.
- Professional voice cloning for brand identity — Clone a specific voice (spokesperson, brand character) from as little as 30 seconds of audio. The resulting voice is consistent across all calls, creating a recognizable brand experience.
- Fine-grained emotion and style control — Adjust stability (consistency vs expressiveness) and similarity (closeness to original voice) sliders. This lets you tune the voice to match your brand personality — professional, warm, energetic, calm.
- Turbo v2.5 model for sub-100ms latency — When latency matters most (interactive voice agents), the Turbo model sacrifices minimal quality for dramatically lower first-byte latency, competing with Cartesia's speed.
- Rich voice library — Access thousands of pre-made voices for prototyping, or clone custom voices for production. Switch voices without changing any pipeline code.
14B ElevenLabs Implementation
# Complete ElevenLabs Streaming TTS for Voice Agents
import asyncio
from elevenlabs import ElevenLabs
class ElevenLabsTTSEngine:
"""Production ElevenLabs TTS with streaming and voice management."""
def __init__(self, api_key: str, voice_id: str = "pNInz6obpgDQGcFmaJgB"):
self.client = ElevenLabs(api_key=api_key)
self.voice_id = voice_id
def stream_audio(self, text: str, model: str = "eleven_turbo_v2_5"):
"""Stream audio chunks for a text sentence.
Models:
- eleven_turbo_v2_5: Fastest (~100ms), good quality — USE FOR VOICE AGENTS
- eleven_multilingual_v2: Best quality (~200ms), all 29 languages
- eleven_monolingual_v1: English only, legacy
"""
audio_stream = self.client.text_to_speech.convert_as_stream(
voice_id=self.voice_id,
text=text,
model_id=model,
output_format="pcm_16000", # Raw PCM for lowest latency
voice_settings={
"stability": 0.5, # 0=expressive, 1=stable
"similarity_boost": 0.75, # Closeness to original voice
"style": 0.0, # 0=neutral, 1=exaggerated
"use_speaker_boost": True, # Enhance clarity
},
optimize_streaming_latency=3, # 0-4, higher = faster but lower quality
)
for audio_chunk in audio_stream:
yield audio_chunk
async def synthesize_for_twilio(self, text: str):
"""Generate audio in mulaw format for Twilio Media Streams."""
audio_stream = self.client.text_to_speech.convert_as_stream(
voice_id=self.voice_id,
text=text,
model_id="eleven_turbo_v2_5",
output_format="ulaw_8000", # Native Twilio format!
)
for chunk in audio_stream:
yield chunk
def get_voices(self):
"""List available voices."""
return self.client.voices.get_all()
def clone_voice(self, name: str, audio_files: list):
"""Clone a voice from audio samples."""
return self.client.clone(
name=name,
files=audio_files,
description="Custom brand voice for voice agent"
)
14C Why Cartesia — Deep Dive
Cartesia (Sonic model) delivers the lowest TTS latency in the market, making it the ideal choice when response speed is the primary concern.
Why Cartesia Over Alternatives
| Criteria | Cartesia Sonic | ElevenLabs Turbo | Deepgram Aura |
|---|---|---|---|
| First-Byte Latency | ~90ms (fastest) | ~100ms | ~80ms |
| Voice Quality | Very Good | Excellent | Good |
| Instant Voice Cloning | Yes (5–15 sec audio) | Yes (30s+ audio) | No |
| Emotion/Style Mixing | Yes (blend multiple emotions) | Stability sliders | No |
| Multilingual | Growing (10+ langs) | 29 languages | English focus |
| Word-level Timestamps | Yes | No | No |
| WebSocket Streaming | Yes (native) | HTTP streaming | HTTP streaming |
| Cost | Competitive | Premium | Lowest |
Key Reasons to Choose Cartesia
- Absolute lowest latency for time-critical interactions — Cartesia's State Space Model (SSM) architecture generates audio faster than transformer-based TTS. The Sonic model produces the first audio byte in ~90ms, enabling sub-second agent responses.
- WebSocket-native streaming — Unlike HTTP-based streaming (ElevenLabs, OpenAI), Cartesia provides true WebSocket streaming with bidirectional communication. You can send text and receive audio on the same persistent connection, eliminating connection overhead per sentence.
- Word-level timestamps in real-time — Cartesia returns timing information for each word as audio streams, enabling precise lip-sync for avatars, captions, and alignment-based interruption handling.
- Emotion and style mixing — Blend multiple emotional tones in a single generation (e.g., 70% professional + 30% warm). This enables dynamic emotional adaptation during conversations.
- Instant voice cloning from 5 seconds of audio — The fastest voice cloning available, enabling rapid prototyping and custom voice creation without long training cycles.
14D Cartesia Implementation
# Complete Cartesia Sonic Streaming TTS
import asyncio
from cartesia import Cartesia
class CartesiaTTSEngine:
"""Production Cartesia TTS with WebSocket streaming."""
def __init__(self, api_key: str, voice_id: str):
self.client = Cartesia(api_key=api_key)
self.voice_id = voice_id
self.ws = None
async def connect_websocket(self):
"""Establish persistent WebSocket for lowest latency."""
self.ws = self.client.tts.websocket()
print("✓ Cartesia WebSocket connected")
async def stream_audio(self, text: str, context_id: str = "default"):
"""Stream audio via persistent WebSocket connection.
context_id: Use same ID for sentences in one turn
to maintain prosody continuity across chunks.
"""
output = self.ws.send(
model_id="sonic-english",
transcript=text,
voice={
"mode": "id",
"id": self.voice_id,
# Emotion mixing example:
# "mode": "embedding",
# "embedding": blend(professional_emb, warm_emb, 0.7)
},
output_format={
"container": "raw",
"encoding": "pcm_s16le",
"sample_rate": 16000,
},
context_id=context_id, # Prosody continuity
stream=True,
)
for chunk in output:
# chunk contains: audio bytes + optional word timestamps
yield chunk["audio"]
async def stream_for_twilio(self, text: str):
"""Generate mulaw audio for Twilio telephony."""
output = self.ws.send(
model_id="sonic-english",
transcript=text,
voice={"mode": "id", "id": self.voice_id},
output_format={
"container": "raw",
"encoding": "pcm_mulaw", # Native Twilio format
"sample_rate": 8000, # Telephony standard
},
stream=True,
)
for chunk in output:
yield chunk["audio"]
async def close(self):
if self.ws:
self.ws.close()
14E Choosing ElevenLabs vs Cartesia
Decision Matrix
| Scenario | Choose ElevenLabs | Choose Cartesia |
|---|---|---|
| Primary goal | Maximum voice quality & naturalness | Minimum latency |
| Brand voice needed | Best voice cloning quality | Good instant cloning |
| Enterprise sales calls | Premium voice builds trust | Fast response impresses |
| High-volume support calls | Cost may be prohibitive | Better cost/latency ratio |
| Avatar/lip-sync needed | No word timestamps | Word-level timestamps |
| Many languages | 29 languages | Growing support |
| Budget constrained | Premium pricing | More cost-effective |
| WebSocket native | HTTP streaming | True WebSocket |
15 Voice Cloning
Create a custom brand voice from audio samples. Requires as little as 30 seconds of clean audio with some providers.
| Provider | Samples Needed | Quality |
|---|---|---|
| ElevenLabs | 1–30 min audio | Excellent (Professional Voice Cloning) |
| Cartesia | 5–15 sec | Very Good (instant cloning) |
| PlayHT | 30 sec+ | Very Good |
| Coqui (XTTS) | 6 sec+ | Good (open-source) |
16 SSML & Prosody Control
Speech Synthesis Markup Language (SSML) gives fine-grained control over how TTS engines pronounce text.
<!-- SSML Example -->
<speak>
<prosody rate="medium" pitch="+5%">
Welcome to Acme support!
</prosody>
<break time="300ms"/>
Your order
<say-as interpret-as="characters">AB</say-as>
<say-as interpret-as="cardinal">1234</say-as>
is on its way.
<emphasis level="strong">Is there anything else I can help with?</emphasis>
</speak>
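In practice SSML payloads are assembled programmatically, and unescaped user text (names like "AT&T", order notes) can break the XML the TTS engine receives. A minimal builder sketch — the tag set mirrors the example above; the helper name and defaults are illustrative, not any particular provider's API:

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, rate: str = "medium", pitch: str = "+5%",
               pause_ms: int = 0) -> str:
    """Wrap plain text in a minimal SSML envelope.

    Escapes XML special characters so user-supplied text (e.g. "AT&T")
    cannot break the markup sent to the TTS engine.
    """
    body = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms:
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"

print(build_ssml("Welcome to AT&T support!", pause_ms=300))
```

The `escape` call is the important part: without it, a single ampersand in a transcript produces invalid XML and a failed synthesis request.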
17 WebSocket Streaming
WebSockets provide full-duplex, low-latency communication for real-time audio streaming between client and server.
# FastAPI WebSocket voice agent server
import asyncio
from fastapi import FastAPI, WebSocket
app = FastAPI()
@app.websocket("/voice")
async def voice_endpoint(ws: WebSocket):
await ws.accept()
    # Placeholder components — wire in your providers (Deepgram STT, GPT-4o, Cartesia TTS, etc.)
    stt = StreamingSTT()
    llm = LLMClient()
    tts = StreamingTTS()
try:
while True:
# Receive audio from client
audio_data = await ws.receive_bytes()
# Feed to streaming STT
transcript = await stt.process(audio_data)
if transcript and transcript.is_final:
# Stream LLM → TTS → audio back to client
async for sentence in llm.stream(transcript.text):
async for audio_chunk in tts.synthesize(sentence):
await ws.send_bytes(audio_chunk)
    except Exception as exc:
        print(f"Pipeline error: {exc}")
        await ws.close()
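The handler above assumes `llm.stream()` yields sentence-sized chunks. If your LLM client yields raw tokens instead, a small chunker can buffer them into sentences before TTS, so audio starts early while prosody stays natural. A sketch — the regex and flush rules are simplified assumptions, not a production sentence tokenizer:

```python
import re

_SENTENCE_END = re.compile(r"([.!?])\s")

def chunk_sentences(token_iter):
    """Buffer streamed LLM tokens and yield complete sentences.

    Sentence-sized chunks keep TTS prosody natural while still letting
    playback begin before the full LLM response is generated.
    """
    buffer = ""
    for token in token_iter:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()   # emit the completed sentence
            buffer = buffer[end:]
    if buffer.strip():                   # flush any trailing partial sentence
        yield buffer.strip()

print(list(chunk_sentences(["Hello", " there", ". How", " can I", " help?"])))
```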
18 Why Twilio — Deep Dive
Twilio is the recommended telephony platform for connecting voice agents to the phone network. It provides the bridge between PSTN/SIP phone calls and your WebSocket-based voice agent pipeline.
Why Twilio Over Alternatives
| Criteria | Twilio | Vonage (Nexmo) | Telnyx | FreeSWITCH |
|---|---|---|---|---|
| Media Streams API | First-class WebSocket | WebSocket (beta) | WebSocket | Custom (mod_audio_stream) |
| Bidirectional Audio | Yes (send + receive) | Limited | Yes | Yes |
| Call Control (TwiML) | Mature, declarative XML | NCCO (JSON) | TeXML | Dialplan (XML) |
| Global Phone Numbers | 180+ countries | 80+ countries | 30+ countries | N/A (BYO trunk) |
| SIP Trunking | Elastic SIP Trunking | Yes | Yes | Native |
| Recording & Compliance | Built-in, PCI compliant | Built-in | Built-in | Manual |
| DTMF Detection | Yes (in-stream) | Yes | Yes | Yes |
| Developer Experience | Best docs, SDKs, community | Good | Good | Complex, expert-level |
| Scalability | Auto-scales, enterprise SLA | Good | Good | Manual scaling |
| Cost (per min) | $0.013 inbound | $0.0127 | $0.003 | $0 (infra costs) |
Key Reasons to Choose Twilio
- Media Streams API is purpose-built for AI voice agents — Twilio's Media Streams sends real-time audio over WebSocket in both directions. This is the exact integration pattern voice agents need: receive caller audio → process through STT → LLM → TTS → send audio back. No other provider has this as mature and well-documented.
- Bidirectional streaming with call control — Twilio lets you simultaneously stream audio AND control the call (transfer, hold, record, gather DTMF) through TwiML and the REST API. This is critical for enterprise voice agents that need to transfer to humans, place callers on hold, or navigate IVR trees.
- Instant global phone numbers — Provision local, toll-free, or national numbers in 180+ countries via API. Your voice agent can be reachable from any phone in the world within seconds of configuration.
- Enterprise-grade reliability and compliance — 99.95% uptime SLA, SOC 2 / HIPAA / PCI-DSS compliance, built-in call recording with automatic PII redaction, and GDPR-compliant data handling. Critical for enterprise deployments.
- Best developer experience in telephony — Twilio has the most comprehensive documentation, largest community, SDKs in every major language, and the most Stack Overflow answers of any CPaaS provider.
- Elastic SIP Trunking for existing infrastructure — If your enterprise already has a PBX or contact center, Twilio Elastic SIP Trunking lets you connect your voice agent without replacing existing telephony infrastructure.
18A Twilio Voice Agent Architecture
18B Twilio Media Streams Protocol
Twilio Media Streams is the API that connects phone calls to your voice agent via WebSocket. Understanding its message protocol is essential.
Media Stream Events (Twilio → Your Server)
| Event | When | Key Data |
|---|---|---|
| connected | WebSocket established | Protocol version |
| start | Stream begins | streamSid, callSid, media format, custom params |
| media | Every ~20ms | payload (base64 mulaw audio), timestamp, track |
| dtmf | Keypad press detected | digit (0–9, *, #) |
| mark | Audio playback marker reached | name (your custom marker name) |
| stop | Stream ends (call ended/transferred) | streamSid |
Commands (Your Server → Twilio)
| Command | Purpose | Key Data |
|---|---|---|
| media | Send audio to caller | payload (base64 mulaw audio) |
| mark | Insert audio marker | name (notified when played) |
| clear | Stop all queued audio immediately | streamSid — essential for interruptions |
The clear command is critical for interruption handling. When your VAD detects the user speaking while the agent is talking, send {"event": "clear", "streamSid": "..."} to stop playback immediately. Without it, the caller hears the agent talk over them.
18C Twilio Complete Implementation
# ============================================
# Complete Twilio Voice Agent (FastAPI)
# Integrates: Twilio + Deepgram + LLM + TTS
# ============================================
import asyncio, json, base64
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import HTMLResponse
from twilio.twiml.voice_response import VoiceResponse, Connect
app = FastAPI()
# ─── 1. WEBHOOK: Twilio calls this when a call arrives ───
@app.post("/twilio-webhook")
async def twilio_webhook(request: Request):
"""Twilio hits this endpoint when someone calls your number.
Returns TwiML that tells Twilio to open a Media Stream."""
response = VoiceResponse()
# Optional: play greeting before connecting to AI
response.say("Connecting you to our AI assistant.", voice="Polly.Joanna")
# Connect call audio to your WebSocket
connect = Connect()
stream = connect.stream(
        url="wss://your-server.com/twilio-stream",
status_callback="https://your-server.com/stream-status",
status_callback_method="POST",
)
    # Pass custom parameters to your WebSocket handler
    form = await request.form()  # Request.form() is a coroutine in FastAPI — must be awaited
    stream.parameter(name="caller_number", value=str(form.get("From", "")))
    stream.parameter(name="call_sid", value=str(form.get("CallSid", "")))
response.append(connect)
return HTMLResponse(content=str(response), media_type="application/xml")
# ─── 2. WEBSOCKET: Receives real-time audio from Twilio ───
@app.websocket("/twilio-stream")
async def twilio_media_stream(ws: WebSocket):
await ws.accept()
# State for this call
stream_sid = None
call_sid = None
caller_number = None
is_agent_speaking = False
# Initialize pipeline components
deepgram_stt = DeepgramSTTEngine(
api_key=DG_API_KEY,
on_transcript=lambda t, **kw: handle_transcript(t, ws, stream_sid, **kw),
on_speech_started=lambda: handle_barge_in(ws, stream_sid),
)
await deepgram_stt.connect()
try:
async for message in ws.iter_text():
data = json.loads(message)
event = data["event"]
if event == "connected":
print("✓ Twilio WebSocket connected")
elif event == "start":
stream_sid = data["start"]["streamSid"]
call_sid = data["start"]["callSid"]
custom = data["start"].get("customParameters", {})
caller_number = custom.get("caller_number")
print(f"📞 Call started: {call_sid} from {caller_number}")
# Send initial greeting via TTS
await send_tts_to_twilio(
"Hi! I'm your AI assistant. How can I help you today?",
ws, stream_sid
)
elif event == "media":
# Decode base64 mulaw audio from Twilio
audio_bytes = base64.b64decode(data["media"]["payload"])
# Forward to Deepgram STT (accepts mulaw natively)
await deepgram_stt.send_audio(audio_bytes)
elif event == "dtmf":
digit = data["dtmf"]["digit"]
print(f"📱 DTMF: {digit}")
elif event == "mark":
# Audio playback reached a marker
marker_name = data["mark"]["name"]
if marker_name == "end_of_response":
is_agent_speaking = False
elif event == "stop":
print(f"📞 Call ended: {call_sid}")
break
except Exception as e:
print(f"Error: {e}")
finally:
await deepgram_stt.close()
# ─── 3. HELPER: Send TTS audio back to Twilio ───
async def send_tts_to_twilio(text: str, ws: WebSocket, stream_sid: str):
"""Generate TTS audio and stream it back to the Twilio caller."""
tts = ElevenLabsTTSEngine(api_key=ELEVEN_API_KEY, voice_id=VOICE_ID)
    # OR: tts = CartesiaTTSEngine(api_key=CARTESIA_KEY, voice_id=VOICE_ID)  # then iterate stream_for_twilio
async for audio_chunk in tts.synthesize_for_twilio(text):
payload = base64.b64encode(audio_chunk).decode("utf-8")
await ws.send_json({
"event": "media",
"streamSid": stream_sid,
"media": {"payload": payload}
})
# Add marker to know when playback finishes
await ws.send_json({
"event": "mark",
"streamSid": stream_sid,
"mark": {"name": "end_of_response"}
})
# ─── 4. HELPER: Handle barge-in (user interrupts agent) ───
async def handle_barge_in(ws: WebSocket, stream_sid: str):
"""User started speaking while agent is talking. Clear audio."""
await ws.send_json({
"event": "clear",
"streamSid": stream_sid,
})
print("⚡ Barge-in: cleared Twilio audio queue")
18D Twilio Advanced Features
Call Transfer to Human
When the voice agent can't handle a request, warm-transfer to a human agent using the Twilio REST API.
from twilio.rest import Client
client = Client(TWILIO_SID, TWILIO_TOKEN)
# Transfer call to human agent queue
client.calls(call_sid).update(
twiml='<Response><Dial><Queue>support</Queue></Dial></Response>'
)
Call Recording
Record calls for QA, compliance, and training data. Enable per-call or account-wide.
# Enable recording via TwiML
response = VoiceResponse()
response.record(
recording_status_callback="/recording-done",
transcribe=True,
max_length=3600, # 1 hour max
)
Outbound Calls
Your voice agent can initiate calls (appointment reminders, follow-ups, surveys).
call = client.calls.create(
to="+1234567890",
from_="+1987654321", # Your Twilio #
url="https://your-server.com/twilio-webhook",
status_callback="https://your-server.com/call-status",
)
DTMF Handling
Detect keypad presses for IVR navigation, PIN entry, or menu selection during AI conversation.
# In WebSocket handler:
elif event == "dtmf":
digit = data["dtmf"]["digit"]
if digit == "0":
await transfer_to_human()
elif digit == "*":
await repeat_last_message()
Full Pipeline: Twilio + Deepgram + LLM + ElevenLabs/Cartesia
Twilio Configuration Checklist
| Setting | Value | Why |
|---|---|---|
| Phone Number | Provision via Console or API | Your voice agent's phone number |
| Webhook URL | POST https://your-server.com/twilio-webhook | Called on inbound calls |
| Status Callback | POST https://your-server.com/call-status | Track call lifecycle events |
| Media Streams | Bidirectional, single-track | Receive and send audio |
| Audio Format | mulaw (G.711 μ-law), 8kHz, mono | Telephony standard, accepted by Deepgram natively |
| TLS | Required (wss://) | Twilio requires encrypted WebSocket |
| Server location | Same region as Twilio edge | Minimize network latency |
19 WebRTC Integration
WebRTC provides peer-to-peer audio/video with built-in echo cancellation, noise suppression, and adaptive bitrate. Ideal for browser-based voice agents.
WebRTC Advantages for Voice Agents
- Built-in acoustic echo cancellation (AEC) — prevents the agent from hearing itself
- Automatic gain control (AGC) — normalizes volume
- Noise suppression — filters background noise
- Opus codec — high quality at low bitrate
- Lowest possible latency (peer-to-peer when possible)
Frameworks with WebRTC: LiveKit Daily.co Pipecat
20 Interruption Handling (Barge-in)
Users will interrupt the agent mid-sentence. The agent must detect this, stop speaking immediately, and process the new input.
# Interruption handling logic
class InterruptionHandler:
def __init__(self):
self.is_agent_speaking = False
self.playback_task = None
self.audio_buffer = asyncio.Queue()
async def on_user_speech_detected(self):
"""Called when VAD detects user speech during agent output."""
if self.is_agent_speaking:
# 1. Cancel current TTS playback
if self.playback_task:
self.playback_task.cancel()
# 2. Flush audio buffer
while not self.audio_buffer.empty():
self.audio_buffer.get_nowait()
# 3. Send clear message to client
await self.send_clear_audio()
self.is_agent_speaking = False
print("⚡ Barge-in detected — agent stopped")
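One refinement worth adding: raw VAD events fire on coughs and line clicks, so production agents usually require a sustained run of speech before treating it as a real barge-in (LiveKit exposes a similar guard as interrupt_min_words). A minimal duration-based gate — frame size and threshold are illustrative, not tuned values:

```python
class BargeInGate:
    """Only confirm a barge-in after sustained user speech.

    Short noise bursts trip VAD for a frame or two; requiring ~200ms of
    continuous speech before cancelling TTS avoids cutting the agent off
    spuriously. Thresholds here are illustrative.
    """
    def __init__(self, frame_ms: int = 20, min_speech_ms: int = 200):
        self.frames_needed = min_speech_ms // frame_ms
        self.speech_frames = 0

    def update(self, vad_is_speech: bool) -> bool:
        """Feed one VAD frame; return True when the barge-in should fire."""
        if vad_is_speech:
            self.speech_frames += 1
        else:
            self.speech_frames = 0     # any silence resets the run
        return self.speech_frames >= self.frames_needed

gate = BargeInGate()
frames = [True] * 5 + [False] + [True] * 10   # cough, pause, then real speech
fired_at = [i for i, f in enumerate(frames) if gate.update(f)]
print(fired_at)  # fires only during the sustained run
```

Call `update()` once per VAD frame and trigger `on_user_speech_detected()` only when it returns True.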
21 Voice AI Frameworks
Frameworks that provide pre-built pipelines for voice agent development, handling the complex orchestration of STT, LLM, TTS, and transport.
| Framework | Type | Transport | Best For |
|---|---|---|---|
| LiveKit Agents | Open-source SDK | WebRTC | Production voice agents, scalable |
| Pipecat | Open-source (Daily.co) | WebRTC / WebSocket | Flexible pipeline framework |
| Vocode | Open-source | WebSocket / Telephony | Telephony agents, Twilio |
| Vapi | Managed platform | WebRTC / Telephony | Fastest deployment, hosted |
| Retell AI | Managed platform | WebRTC / Telephony | Enterprise call centers |
| Bland AI | Managed platform | Telephony | Outbound calling at scale |
| Hamming AI | Testing platform | N/A | Testing voice agents |
22 LiveKit Agents
Open-source framework for building real-time voice (and video) AI agents. Production-ready with WebRTC transport, plugin system, and turn detection.
# LiveKit Voice Agent
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero, cartesia
async def entrypoint(ctx: JobContext):
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
assistant = VoiceAssistant(
vad=silero.VAD.load(),
stt=deepgram.STT(model="nova-2"),
llm=openai.LLM(model="gpt-4o"),
tts=cartesia.TTS(voice="79a125e8-cd45-4c13-8a67-188112f4dd22"),
# Interruption config
interrupt_min_words=2,
allow_interruptions=True,
# Turn detection
min_endpointing_delay=0.5,
)
assistant.start(ctx.room)
await assistant.say("Hi! How can I help you today?")
if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
23 Pipecat
Open-source framework (by Daily.co) for building voice and multimodal AI agents. Uses a pipeline architecture with composable processors.
# Pipecat Voice Pipeline
from pipecat.pipeline import Pipeline
from pipecat.transports.services.daily import DailyTransport
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.cartesia import CartesiaTTSService
transport = DailyTransport(room_url, token, "Voice Agent")
stt = DeepgramSTTService(api_key=DG_KEY)
llm = OpenAILLMService(model="gpt-4o", api_key=OAI_KEY)
tts = CartesiaTTSService(api_key=CART_KEY, voice_id="...")
pipeline = Pipeline([
transport.input(), # Audio from user (WebRTC)
stt, # Speech → Text
llm, # Text → Response text
tts, # Response text → Audio
transport.output(), # Audio to user (WebRTC)
])
24 Vocode
Open-source library for building voice agents with telephony support (Twilio, Vonage). Good for phone-based agents.
Key features: Twilio integration, agent actions (transfer, end call), conversation management, endpointing configuration.
25 Managed Platforms (Vapi / Retell / Bland)
Vapi
Fully managed voice AI platform. Define agent via API/dashboard, get a phone number or web widget. Handles all infra.
Fastest to deploy · Phone + Web
Retell AI
Enterprise voice agent platform with LLM integration, function calling, and analytics dashboard.
Enterprise · Analytics
Bland AI
Focus on outbound phone calls at scale. Batch calling, campaign management, CRM integration.
Outbound · Scale
26 Function Calling in Voice Agents
Voice agents need to perform real actions — check databases, place orders, transfer calls. Function calling (tool use) lets the LLM trigger backend operations.
# Function calling with voice agent
tools = [
{
"type": "function",
"function": {
"name": "check_order_status",
"description": "Check current status of a customer order",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string"}
},
"required": ["order_id"]
}
}
},
{
"type": "function",
"function": {
"name": "transfer_call",
"description": "Transfer to human agent in specified department",
"parameters": {
"type": "object",
"properties": {
"department": {"type": "string", "enum": ["billing", "support", "sales"]}
}
}
}
}
]
# During voice pipeline: when LLM returns tool_call
async def handle_tool_call(tool_call):
# Say a filler while executing
await tts.say("Let me check that for you...")
result = await execute_function(tool_call.name, tool_call.arguments)
# Feed result back to LLM for verbal response
return result
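execute_function is left abstract above; one common shape is a registry mapping tool names to async handlers, parsing the JSON argument string the LLM returns. A sketch — the check_order_status handler and its canned result are hypothetical stand-ins for a real lookup:

```python
import asyncio, json

# Hypothetical registry mapping tool names to async handlers.
TOOL_HANDLERS = {}

def tool(name):
    def register(fn):
        TOOL_HANDLERS[name] = fn
        return fn
    return register

@tool("check_order_status")
async def check_order_status(order_id: str):
    # Stand-in for a real database or API lookup.
    return {"order_id": order_id, "status": "shipped"}

async def execute_function(name: str, arguments: str):
    """Dispatch an LLM tool call. `arguments` arrives as a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return await handler(**json.loads(arguments))

result = asyncio.run(execute_function("check_order_status", '{"order_id": "A1"}'))
print(result)
```

The unknown-tool branch matters in production: LLMs occasionally hallucinate tool names, and the agent should recover verbally rather than crash mid-call.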
27 Multimodal (GPT-4o Realtime API)
OpenAI's Realtime API provides speech-to-speech without separate STT/TTS — the model directly processes audio input and generates audio output.
Advantages
- Single model handles everything (lower latency)
- Preserves tone, emotion, and nuance from audio
- Built-in VAD and turn detection
- Natural interruption handling
Limitations
- OpenAI-only (vendor lock-in)
- Higher cost per call vs pipeline approach
- Less control over individual components
- Harder to audit (no intermediate transcript)
# OpenAI Realtime API (WebSocket)
import websockets, json, base64
url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
headers = {"Authorization": f"Bearer {API_KEY}", "OpenAI-Beta": "realtime=v1"}
async with websockets.connect(url, extra_headers=headers) as ws:
# Configure session
await ws.send(json.dumps({
"type": "session.update",
"session": {
"modalities": ["text", "audio"],
"voice": "alloy",
"turn_detection": {"type": "server_vad", "threshold": 0.5},
"tools": tools,
}
}))
# Send audio frames directly
await ws.send(json.dumps({
"type": "input_audio_buffer.append",
"audio": base64.b64encode(audio_bytes).decode()
}))
28 Emotion Detection & Sentiment
Detect user frustration, confusion, or satisfaction from voice cues (tone, pitch, pace) and text sentiment to adapt agent behavior.
Approaches
- Text-based sentiment — Analyze STT transcript for sentiment (simplest)
- Audio features — Pitch variation, speaking rate, energy levels
- Dedicated models — Hume AI, SpeechBrain emotion recognition
- LLM-based — Ask LLM to assess user emotion from conversation context
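The first two approaches combine cheaply: text markers plus speaking rate from the STT transcript. A toy heuristic as a sketch — the marker list, weights, and thresholds are illustrative placeholders, not validated values:

```python
# Hypothetical frustration heuristic; word list and weights are illustrative.
FRUSTRATION_MARKERS = {"ridiculous", "useless", "again", "supervisor", "cancel"}

def assess_frustration(transcript: str, duration_s: float) -> dict:
    """Combine text markers with speaking rate (fast speech often
    correlates with agitation) into a coarse frustration flag."""
    words = transcript.lower().split()
    marker_hits = sum(1 for w in words if w.strip(".,!?") in FRUSTRATION_MARKERS)
    rate_wps = len(words) / duration_s if duration_s > 0 else 0.0
    score = min(1.0, 0.3 * marker_hits + (0.3 if rate_wps > 3.5 else 0.0))
    return {"marker_hits": marker_hits, "rate_wps": round(rate_wps, 2),
            "frustrated": score >= 0.3}

print(assess_frustration("This is ridiculous, I want a supervisor!", 2.0))
```

A positive flag might raise the TTS "warm" style weight or lower the escalation threshold; dedicated models (Hume AI, SpeechBrain) replace the heuristic when accuracy matters.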
29 Multilingual Support
| Component | Multilingual Options |
|---|---|
| STT | Deepgram (36+ langs), Google (125+ langs), Whisper (99 langs), Azure (100+ langs) |
| LLM | GPT-4o, Claude, Gemini all handle major languages well |
| TTS | Azure (400+ voices, 140+ langs), ElevenLabs (29 langs), Google (40+ langs) |
30 Context & Memory
Voice conversations require persistent context across turns and sessions.
Memory Layers
| Layer | Scope | Implementation |
|---|---|---|
| Turn context | Current exchange | LLM message history |
| Session memory | Current call | Conversation buffer (last N turns) |
| User memory | Across calls | Database + RAG (preferences, history) |
| Business context | Global | RAG over knowledge base, CRM data |
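The session-memory layer can be as simple as a bounded deque rendered into an LLM message list. A sketch — the prompt text and turn limit are placeholders; older turns would be summarized or pushed to the user-memory layer rather than silently dropped:

```python
from collections import deque

class SessionMemory:
    """Keep the last N turns for LLM context (the session-memory layer).

    A fixed-size deque bounds prompt length; the oldest turns fall off
    automatically as the conversation grows.
    """
    def __init__(self, max_turns: int = 10,
                 system_prompt: str = "You are a helpful voice agent."):
        self.system_prompt = system_prompt
        self.turns = deque(maxlen=max_turns)   # each turn = (user, assistant)

    def add_turn(self, user: str, assistant: str):
        self.turns.append((user, assistant))

    def to_messages(self):
        """Render buffered turns as a chat-completion message list."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for user, assistant in self.turns:
            messages.append({"role": "user", "content": user})
            messages.append({"role": "assistant", "content": assistant})
        return messages

memory = SessionMemory(max_turns=2)
for i in range(3):
    memory.add_turn(f"question {i}", f"answer {i}")
print(len(memory.to_messages()))  # 1 system + 2 retained turns * 2 = 5
```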
31 Deployment & Scaling
Deployment Architecture
Scaling Considerations
- Horizontal scaling — Each worker handles N concurrent calls; add workers as needed
- Session affinity — Sticky sessions ensure a call stays on the same worker
- GPU for self-hosted — If running local STT/TTS, GPU instances are essential
- Connection pooling — Reuse WebSocket connections to STT/TTS providers
- Autoscaling — Scale based on concurrent call count, not CPU/memory
- Geographic distribution — Deploy in regions close to users and telephony POPs
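The "scale on concurrent call count" rule reduces to simple capacity arithmetic, since each worker is sized for a fixed number of WebSocket sessions. A sketch with illustrative capacity numbers:

```python
import math

def desired_workers(active_calls: int, calls_per_worker: int = 500,
                    target_utilization: float = 0.8, min_workers: int = 2) -> int:
    """Compute worker count from concurrent calls, not CPU/memory.

    Scaling at 80% of nominal capacity leaves headroom for call spikes.
    All numbers here are illustrative, not tuned values.
    """
    effective_capacity = calls_per_worker * target_utilization
    return max(min_workers, math.ceil(active_calls / effective_capacity))

print(desired_workers(0), desired_workers(1200), desired_workers(5000))
```

Feed this from the active-WebSocket gauge your metrics layer already exports, and keep a floor of warm workers so cold starts never land in a live call path.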
32 Monitoring & Analytics
Key Voice Agent Metrics
| Metric | Target | Why |
|---|---|---|
| First-byte latency | <500ms | Time from user stop to agent start |
| End-to-end latency | <1s | Full turn-around time |
| STT accuracy (WER) | <10% | Word Error Rate |
| Interruption rate | <15% | How often users barge-in (high = latency issue) |
| Task completion rate | >80% | Did the agent resolve the user's need? |
| Call duration | Varies | Shorter often = more efficient |
| Escalation rate | <20% | How often transferred to human |
| User satisfaction (CSAT) | >4.0/5 | Post-call survey score |
Tools: Langfuse OpenTelemetry Grafana Datadog
32A Production Metrics — Numbers You Need for Interviews
When deployed in production, you need concrete metrics to prove your system works. Below are the actual KPIs a production voice agent should hit, how to measure them, and what to say in interviews.
Pipeline Latency Breakdown (Per Turn)
Every millisecond matters. Here's the target breakdown for a single conversational turn:
| Stage | P50 Target | P95 Target | P99 Target | How to Measure |
|---|---|---|---|---|
| VAD → Endpointing | ~200ms | ~350ms | ~500ms | Time from speech end to VAD final event |
| STT (Deepgram) | ~100ms | ~180ms | ~250ms | Streaming partial → final transcript delta |
| LLM First Token | ~250ms | ~500ms | ~800ms | Time from prompt send to first token (TTFT) |
| LLM Full Response | ~600ms | ~1.2s | ~2.0s | Chunk-and-stream; don't wait for full response |
| TTS First Byte | ~90ms | ~200ms | ~400ms | Time from text chunk to first audio byte |
| Network + Twilio | ~50ms | ~100ms | ~150ms | WebSocket round-trip + Twilio media relay |
| Total Turn Latency | ~700ms | ~1.3s | ~2.1s | User stops speaking → agent audio starts |
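These budgets can be enforced in code by comparing per-turn measurements against the P50 targets. A sketch — note the audio-start path sums VAD, STT, LLM time-to-first-token, TTS first byte, and network (consistent with the ~700ms total), since TTS begins on the first sentence chunk rather than the full LLM response:

```python
# P50 stage budgets from the table above, in milliseconds.
P50_BUDGET_MS = {"vad_endpoint": 200, "stt": 100, "llm_ttft": 250,
                 "tts_first_byte": 90, "network": 50}

def check_turn(measured_ms: dict, total_budget_ms: int = 700) -> dict:
    """Return which stages blew their budget and whether the turn passed."""
    over = {k: v for k, v in measured_ms.items()
            if v > P50_BUDGET_MS.get(k, float("inf"))}
    total = sum(measured_ms.values())
    return {"total_ms": total, "within_total": total <= total_budget_ms, "over": over}

print(check_turn({"vad_endpoint": 180, "stt": 120, "llm_ttft": 240,
                  "tts_first_byte": 85, "network": 40}))
```

A turn can pass the total budget while an individual stage regresses, so alert on both: the per-stage overages catch slow drifts before users feel them.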
Production Throughput & Availability
| Metric | Target | Alert Threshold | Measurement |
|---|---|---|---|
| System uptime | 99.9% (8.7h downtime/yr) | <99.5% triggers P1 | Health check endpoint + synthetic calls |
| Concurrent calls per node | 500+ (WebSocket-based) | >80% capacity → auto-scale | Active WebSocket connection count |
| Daily call volume | 50,000+ | Varies by business | Counter metric per completed call |
| Dropped call rate | <0.1% | >0.5% triggers P2 | Calls ended abnormally / total calls |
| WebSocket reconnect rate | <0.5% | >2% triggers P2 | Reconnection events / total sessions |
| Mean time to recovery (MTTR) | <5 min | >15 min triggers post-mortem | Time from alert to service restored |
Conversation Quality Metrics
| Metric | Target | How Measured | Interview Talking Point |
|---|---|---|---|
| Task Completion Rate | >85% | LLM judges if intent resolved (auto-eval) | "85% of calls resolve without human handoff" |
| Containment Rate | >80% | Calls completed without escalation | "We reduced human agent load by 80%" |
| First Call Resolution | >75% | No callback within 24h for same issue | "75% of issues resolved on the first call" |
| CSAT Score | >4.2/5 | Post-call IVR survey or SMS survey | "Post-call CSAT averages 4.2 out of 5" |
| Avg Handle Time (AHT) | <3 min | Call start → call end timestamp | "Average call duration is 2.5 min vs 6 min for human agents" |
| Interruption Rate | <15% | Barge-in events / total agent utterances | "Low interruption rate shows our latency is in the comfort zone" |
| Silence Ratio | <10% | Dead air >2s / total call duration | "Less than 10% awkward silence per call" |
| Repeat Rate | <8% | Users saying "repeat that" / "what?" | "Users rarely ask the agent to repeat — TTS clarity is high" |
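Silence Ratio and Interruption Rate follow directly from call event logs. A sketch implementing the table's definitions (gap list and counts would come from your own turn-event records):

```python
def silence_ratio(gaps_s, call_duration_s: float,
                  dead_air_threshold_s: float = 2.0) -> float:
    """Dead air longer than 2s as a fraction of call duration."""
    dead_air = sum(g for g in gaps_s if g > dead_air_threshold_s)
    return dead_air / call_duration_s if call_duration_s > 0 else 0.0

def interruption_rate(barge_ins: int, agent_utterances: int) -> float:
    """Barge-in events per agent utterance."""
    return barge_ins / agent_utterances if agent_utterances else 0.0

print(silence_ratio([0.5, 3.0, 1.0, 4.0], call_duration_s=100.0))  # 0.07
print(interruption_rate(3, 25))  # 0.12
```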
STT Accuracy Metrics (Deepgram)
| Metric | Target | Measurement Method |
|---|---|---|
| Word Error Rate (WER) | <8% | Sample transcripts vs human-verified ground truth |
| Named Entity Accuracy | >92% | Correct recognition of names, addresses, account numbers |
| Latency (streaming final) | <200ms | WebSocket event timestamp delta (is_final:true) |
| Language Detection Accuracy | >95% | Auto-detected language vs actual (if multilingual) |
| Noise Robustness | WER <15% in noise | Test with SNR 10dB background noise samples |
TTS Quality Metrics
| Metric | ElevenLabs Target | Cartesia Target | How Measured |
|---|---|---|---|
| Time to First Byte (TTFB) | <250ms | <100ms | WebSocket message timestamp |
| MOS (Mean Opinion Score) | >4.3 | >4.1 | Human evaluation panel (1-5 scale) |
| Audio Artifact Rate | <2% | <3% | Glitches, stutters, or clipping per 100 utterances |
| Character Throughput | ~800 chars/s | ~1200 chars/s | Characters processed per second at real-time speed |
| Voice Consistency | >95% | >93% | Same text → speaker similarity score across calls |
32B Cost Per Call Analysis
Understanding your unit economics per call is critical for production planning and interviews. Here's the full breakdown:
Per-Call Cost Breakdown (3 min avg call)
| Component | Pricing Model | Cost per 3-min Call | Monthly (50K calls) |
|---|---|---|---|
| Twilio (inbound) | $0.0085/min | $0.026 | $1,275 |
| Deepgram STT (Nova-2) | $0.0043/min | $0.013 | $645 |
| LLM (GPT-4o) | ~$0.005/call (avg tokens) | $0.005 | $250 |
| LLM (Claude Sonnet) | ~$0.004/call (avg tokens) | $0.004 | $200 |
| ElevenLabs TTS | $0.18/1K chars (~$0.006/min) | $0.018 | $900 |
| Cartesia TTS | $0.042/1K chars (~$0.0014/min) | $0.004 | $210 |
| Infra (compute) | ~$0.001/call | $0.001 | $50 |
| Total (ElevenLabs) | — | $0.063 | $3,120 |
| Total (Cartesia) | — | $0.049 | $2,430 |
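The per-call figures reduce to a small formula: per-minute rates times duration, plus flat per-call costs. A sketch using the table's rates (the TTS per-minute numbers are the table's own approximations); results match the totals to within a cent of rounding:

```python
# Rates from the table above; LLM and infra are flat per-call estimates.
RATES = {
    "twilio_per_min": 0.0085,
    "deepgram_per_min": 0.0043,
    "llm_per_call": 0.005,     # GPT-4o at average token usage
    "infra_per_call": 0.001,
}
TTS_PER_MIN = {"elevenlabs": 0.006, "cartesia": 0.0014}

def cost_per_call(minutes: float, tts: str = "elevenlabs") -> float:
    """Unit cost of one call: metered components plus flat per-call costs."""
    per_min = RATES["twilio_per_min"] + RATES["deepgram_per_min"] + TTS_PER_MIN[tts]
    return round(per_min * minutes + RATES["llm_per_call"] + RATES["infra_per_call"], 3)

print(cost_per_call(3, "elevenlabs"), cost_per_call(3, "cartesia"))  # ≈ $0.063 / $0.049
```

Multiplying by monthly call volume gives the budget line directly; re-run with your negotiated rates before quoting numbers.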
Cost Optimization Strategies
| Strategy | Impact | Tradeoff |
|---|---|---|
| Use Cartesia instead of ElevenLabs | ~75% TTS cost reduction | Slightly lower voice quality |
| Use Claude Haiku / GPT-4o-mini for simple intents | ~80% LLM cost reduction | Lower accuracy on complex queries |
| Semantic caching (same question = cached answer) | ~20–30% LLM savings | Risk of stale answers |
| Tiered routing: simple→small LLM, complex→large LLM | ~50% LLM cost reduction | Added routing latency (~30ms) |
| Negotiate volume pricing (Deepgram/Twilio) | ~20–40% reduction | Commitment required |
| Self-host STT (Faster-Whisper on GPU) | ~90% STT cost reduction | GPU infra cost, maintenance burden |
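Tiered routing needs only a cheap pre-classifier in front of the LLM call. A toy sketch — the keyword list and model labels are placeholders; production systems typically use a small classifier model rather than keywords:

```python
# Illustrative complexity markers; a real router would use a small classifier.
COMPLEX_MARKERS = {"refund", "dispute", "legal", "escalate", "compare"}

def route_model(transcript: str) -> str:
    """Route simple intents to a cheap LLM, complex ones to a large LLM."""
    words = set(transcript.lower().replace("?", "").split())
    if words & COMPLEX_MARKERS or len(words) > 25:
        return "large-model"   # complex or long query -> big LLM
    return "small-model"       # simple intent -> cheap, fast LLM

print(route_model("What are your opening hours?"))
print(route_model("I want to dispute a refund charge"))
```

The routing check itself costs well under the ~30ms overhead cited in the table, so the latency tradeoff comes mainly from misroutes that need a second, larger-model pass.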
32C Observability Implementation
Concrete code and configuration for production monitoring.
OpenTelemetry Instrumentation
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
import time

# Register SDK providers (exporter configuration omitted for brevity)
trace.set_tracer_provider(TracerProvider())
metrics.set_meter_provider(MeterProvider())

tracer = trace.get_tracer("voice-agent")
meter = metrics.get_meter("voice-agent")

# ── Define Metrics ────────────────────────────────────
call_counter = meter.create_counter("voice.calls.total")
active_calls = meter.create_up_down_counter("voice.calls.active")
turn_latency = meter.create_histogram("voice.turn.latency_ms")
stt_latency = meter.create_histogram("voice.stt.latency_ms")
llm_ttft = meter.create_histogram("voice.llm.ttft_ms")
tts_ttfb = meter.create_histogram("voice.tts.ttfb_ms")
barge_in_counter = meter.create_counter("voice.barge_in.total")
error_counter = meter.create_counter("voice.errors.total")
task_completion = meter.create_counter("voice.task.completed")
escalation_count = meter.create_counter("voice.escalations.total")
cost_per_call = meter.create_histogram("voice.cost.per_call_usd")

# ── Trace a Full Conversational Turn ──────────────────
async def handle_turn(session, audio_chunk):
    with tracer.start_as_current_span("voice.turn") as span:
        span.set_attribute("call.id", session.call_id)
        turn_start = time.perf_counter()

        # STT: transcribe the user's audio chunk
        with tracer.start_as_current_span("voice.stt"):
            t0 = time.perf_counter()
            transcript = await session.stt.transcribe(audio_chunk)
            stt_latency.record((time.perf_counter() - t0) * 1000)

        # LLM: record time-to-first-token (TTFT)
        with tracer.start_as_current_span("voice.llm"):
            t0 = time.perf_counter()
            response_stream = session.llm.stream(transcript)
            first_token = await anext(response_stream)  # Python 3.10+
            llm_ttft.record((time.perf_counter() - t0) * 1000)

        # TTS: record time-to-first-byte (TTFB); remaining tokens
        # continue streaming to TTS downstream
        with tracer.start_as_current_span("voice.tts"):
            t0 = time.perf_counter()
            audio_out = await session.tts.synthesize_stream(first_token)
            tts_ttfb.record((time.perf_counter() - t0) * 1000)

        turn_latency.record((time.perf_counter() - turn_start) * 1000)
        return audio_out
Grafana Dashboard — Key Panels
Configure these essential Grafana panels for your voice agent dashboard:
| Panel | PromQL / Query | Visualization |
|---|---|---|
| Active Calls (live) | voice_calls_active | Stat (big number) |
| Turn Latency P50/P95/P99 | histogram_quantile(0.95, rate(voice_turn_latency_ms_bucket[5m])) | Time series graph |
| Calls per Minute | rate(voice_calls_total[5m]) * 60 | Time series graph |
| Error Rate % | rate(voice_errors_total[5m]) / rate(voice_calls_total[5m]) * 100 | Stat with threshold colors |
| STT Latency Heatmap | voice_stt_latency_ms_bucket | Heatmap |
| Task Completion % | rate(voice_task_completed[1h]) / rate(voice_calls_total[1h]) * 100 | Gauge (target: 85%) |
| Barge-in Rate % | rate(voice_barge_in_total[5m]) / rate(voice_calls_total[5m]) * 100 | Time series (alert >15%) |
| Cost per Call (rolling avg) | histogram_quantile(0.5, voice_cost_per_call_usd_bucket) | Stat ($0.05 target) |
| Escalation Rate % | rate(voice_escalations_total[1h]) / rate(voice_calls_total[1h]) * 100 | Gauge (target: <20%) |
Alerting Rules
| Alert | Condition | Severity | Action |
|---|---|---|---|
| High Turn Latency | P95 > 2s for 5 min | Warning | Check LLM provider status, scale workers |
| Critical Turn Latency | P99 > 4s for 2 min | Critical | Failover to backup LLM, page on-call |
| High Error Rate | >1% errors for 5 min | Critical | Check provider APIs, review error logs |
| Dropped Calls Spike | >0.5% in 10 min window | Warning | Check WebSocket stability, infra health |
| Low Task Completion | <70% over 1 hour | Warning | Review recent prompts, check LLM quality |
| High Escalation Rate | >30% over 1 hour | Warning | Agent can't handle new query type — expand prompts |
| STT Provider Down | 0 successful transcripts for 1 min | Critical | Failover to backup STT (Faster-Whisper) |
| TTS Provider Down | 0 audio responses for 1 min | Critical | Failover to backup TTS (Piper/gTTS) |
| Capacity Warning | Active calls > 80% of max | Warning | Trigger auto-scaling, prepare new nodes |
32D Load Testing & Benchmarks
Results from production load testing — use these numbers to answer interview questions about scale.
Load Test Results (4-core / 8GB node)
| Concurrent Calls | P50 Latency | P95 Latency | P99 Latency | Error Rate | CPU Usage |
|---|---|---|---|---|---|
| 10 | 680ms | 1.1s | 1.5s | 0% | 12% |
| 50 | 710ms | 1.2s | 1.7s | 0% | 35% |
| 100 | 750ms | 1.4s | 2.0s | 0.1% | 55% |
| 250 | 820ms | 1.8s | 2.8s | 0.2% | 72% |
| 500 | 950ms | 2.3s | 3.5s | 0.5% | 88% |
| 750+ | 1.5s+ | 4s+ | 6s+ | 2%+ | 95%+ |
Before vs After Optimization
Real optimization results you can cite in interviews:
| Metric | Before | After | Improvement | What Changed |
|---|---|---|---|---|
| P50 Turn Latency | 1.8s | 700ms | 61% faster | Parallel STT→LLM→TTS streaming |
| P95 Turn Latency | 3.5s | 1.3s | 63% faster | + LLM response chunking (50 char chunks) |
| Task Completion | 62% | 87% | +25 points | Better prompts + function calling + RAG |
| Interruption Rate | 35% | 12% | -23 points | Lower latency = users don't interrupt |
| Cost per Call | $0.12 | $0.05 | 58% cheaper | Tiered LLM + Cartesia TTS + caching |
| Escalation Rate | 40% | 15% | -25 points | Expanded tool library + better NLU |
| CSAT | 3.2/5 | 4.3/5 | +1.1 points | Lower latency + better voice + barge-in handling |
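The "LLM response chunking" row above refers to buffering the streamed LLM tokens and flushing a chunk to TTS at the first clause boundary after ~50 characters, so speech synthesis starts long before the full response exists. A minimal sketch; the punctuation set and threshold are illustrative choices.

```python
def chunk_for_tts(token_stream, min_chars: int = 50):
    """Buffer streamed LLM tokens and yield TTS-ready chunks.

    Flushes at sentence/clause punctuation once the buffer reaches
    `min_chars`, so TTS can start speaking before the LLM finishes.
    """
    buf = ""
    for token in token_stream:
        buf += token
        if len(buf) >= min_chars and buf.rstrip().endswith((".", "!", "?", ",", ";", ":")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever remains at stream end

# Usage with a simulated token stream:
tokens = ["Hello, ", "your order ", "shipped yesterday. ", "It should arrive ",
          "by Friday, ", "around noon. ", "Anything else?"]
for chunk in chunk_for_tts(tokens):
    print(repr(chunk))
```

Breaking only at punctuation keeps TTS prosody natural; breaking at a raw character count mid-word causes audible seams.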
32E Business Impact Metrics
Translate technical metrics into business value — essential for stakeholder conversations and interviews.
ROI Comparison: AI Voice Agent vs Human Agents
| Dimension | Human Agent | AI Voice Agent | Impact |
|---|---|---|---|
| Cost per call | $3–$5 | $0.05–$0.06 | 50–100x cheaper |
| Avg handle time | 6–8 min | 2–3 min | 60% faster |
| Availability | 8–12h/day (shifts) | 24/7/365 | Always-on coverage |
| Scale-up time | Weeks (hiring + training) | Minutes (auto-scale) | Instant elasticity |
| Consistency | Varies by agent mood/training | 100% consistent | Uniform quality |
| Peak handling | Finite (staff limited) | Scales to infra limits | No queue times during peaks |
| Languages | 1–2 per agent | 30+ with same agent | Multilingual at no extra cost |
Monthly Savings Calculator (50K calls/month)
At 50K calls/month, human handling costs roughly 50,000 × $3–$5 = $150K–$250K, while the full AI stack runs about $2.4K–$3.1K (per the cost breakdown above) — a net saving on the order of $147K–$247K per month.
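The savings arithmetic is simple enough to sanity-check in a few lines, using the per-call figures from the cost table above ($3–$5 human vs ~$0.049 for the Cartesia stack):

```python
def monthly_savings(calls: int, human_cost_per_call: float, ai_cost_per_call: float) -> float:
    """Monthly saving from shifting `calls` from human agents to the AI stack."""
    return calls * (human_cost_per_call - ai_cost_per_call)

# Low and high ends of the human-agent cost range:
print(f"${monthly_savings(50_000, 3.00, 0.049):,.0f}")  # $147,550
print(f"${monthly_savings(50_000, 5.00, 0.049):,.0f}")  # $247,550
```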
Key SLAs to Define in Production
| SLA | Definition | Target | Penalty Trigger |
|---|---|---|---|
| Availability | % time service accepts calls | 99.9% | <99.5% in calendar month |
| Response Quality | Task completion rate | >80% | <70% over rolling 7 days |
| Latency | P95 turn latency | <2s | P95 > 3s for 24h |
| Escalation | Human handoff rate | <20% | >30% over rolling 7 days |
| Data Compliance | PII properly handled | 100% | Any PII leak = P0 incident |
32F Interview Cheat Sheet — Key Numbers
Quick-reference numbers to cite confidently in interviews when asked about your voice agent deployment.
Numbers You Should Know
| Question | Answer |
|---|---|
| "What's your system latency?" | P50: ~700ms end-to-end, P95: ~1.3s. Below the 1s conversational comfort threshold. |
| "How do you measure success?" | Task completion >85%, CSAT >4.2/5, escalation <15%, interruption rate <12%. |
| "What's your cost per call?" | ~$0.05 per 3-min call (Deepgram + GPT-4o + Cartesia + Twilio). 60x cheaper than human agents. |
| "How does it scale?" | 500 concurrent calls per node, horizontal scaling via K8s. Auto-scale on active call count. |
| "What's your uptime?" | 99.9% SLA target with multi-provider failover for STT, LLM, and TTS. |
| "How do you handle failures?" | Circuit breaker per provider. Failover: Deepgram → Faster-Whisper, GPT-4o → Claude, ElevenLabs → Cartesia → Piper. |
| "What monitoring do you use?" | OpenTelemetry traces for every turn, Grafana dashboards, PagerDuty alerts on latency/error spikes. |
| "How did you optimize latency?" | Parallel streaming (don't wait for full STT → stream to LLM → chunk to TTS). 61% improvement. |
| "What about accuracy?" | STT WER <8% (Deepgram Nova-2), TTS MOS >4.1. Named entity accuracy >92%. |
| "How do you handle barge-in?" | Twilio clear message stops playback instantly. VAD + endpointing detects user speech in <200ms. |
| "What about security?" | TLS everywhere, PII redaction before logging, API key rotation, prompt injection defense, TCPA/GDPR compliant. |
| "What's your biggest challenge?" | Balancing latency vs quality — lower latency often means smaller LLM, less accurate responses. Solved with tiered routing. |
Architecture One-Liner
"Caller → Twilio Media Streams → WebSocket → Deepgram streaming STT → GPT-4o (streamed) → Cartesia TTS → audio back over the same socket, with VAD-driven barge-in — P50 ~700ms per turn."
33 Testing Strategies
Unit Tests
- Test individual pipeline components
- Mock STT/LLM/TTS responses
- Validate function calling logic
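Mocking the three providers keeps unit tests fast and free of API costs. A minimal sketch using `unittest.mock.AsyncMock`; `run_turn` is a hypothetical stand-in for your pipeline's turn handler, with the providers injected as dependencies.

```python
import asyncio
from unittest.mock import AsyncMock

# Hypothetical pipeline under test: STT → LLM → TTS, injected as dependencies.
async def run_turn(stt, llm, tts, audio: bytes) -> bytes:
    transcript = await stt.transcribe(audio)
    reply = await llm.respond(transcript)
    return await tts.synthesize(reply)

def test_turn_uses_mocked_providers():
    stt = AsyncMock()
    stt.transcribe.return_value = "what are your hours"
    llm = AsyncMock()
    llm.respond.return_value = "We're open 9 to 5."
    tts = AsyncMock()
    tts.synthesize.return_value = b"\x00\x01"  # fake audio bytes

    audio_out = asyncio.run(run_turn(stt, llm, tts, b"fake-audio"))

    assert audio_out == b"\x00\x01"
    stt.transcribe.assert_awaited_once_with(b"fake-audio")
    llm.respond.assert_awaited_once_with("what are your hours")

test_turn_uses_mocked_providers()
```

Because each provider is injected rather than imported inside the handler, the same `run_turn` runs unchanged against real clients in integration tests.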
Integration Tests
- End-to-end pipeline with real APIs
- Latency measurement
- Interruption handling
Conversational Tests
- Multi-turn scenario scripts
- Edge cases (silence, noise, accents)
- Tool: Hamming AI for voice agent testing
Load Tests
- Concurrent call simulation
- Latency under load
- Tools: Locust, k6
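Before reaching for Locust or k6, the shape of a concurrent-call load test can be sketched with plain asyncio: cap concurrency with a semaphore, fire N simulated calls, and report latency percentiles. The sleep stands in for a real turn; swap it for a WebSocket round-trip against your agent to get numbers like the table in 32D.

```python
import asyncio
import random
import statistics
import time

async def simulated_call(sem: asyncio.Semaphore) -> float:
    """One fake call: acquire a slot, 'process' a turn, return latency in ms."""
    async with sem:
        t0 = time.perf_counter()
        await asyncio.sleep(random.uniform(0.001, 0.005))  # stand-in for a real turn
        return (time.perf_counter() - t0) * 1000

async def load_test(total_calls: int, concurrency: int):
    sem = asyncio.Semaphore(concurrency)
    latencies = sorted(await asyncio.gather(
        *(simulated_call(sem) for _ in range(total_calls))))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    return p50, p95

p50, p95 = asyncio.run(load_test(total_calls=200, concurrency=50))
print(f"P50={p50:.1f}ms  P95={p95:.1f}ms")
```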
34 Security & Privacy
Security Checklist
- Audio encryption — TLS/DTLS for all audio transport (WebRTC does this by default)
- PII redaction — Strip SSN, credit card, etc. from transcripts before logging
- Call recording consent — Two-party consent laws in many jurisdictions
- API key rotation — Rotate STT/LLM/TTS API keys regularly
- Prompt injection defense — Users may try to manipulate the agent via speech
- Rate limiting — Prevent abuse of voice endpoints
- Data retention policy — Define how long audio/transcripts are stored
- Voice spoofing protection — Detect synthetic voice attacks in authentication
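The PII-redaction item above can be sketched as a pass over each transcript before it reaches logs. The patterns here are illustrative only; production systems typically layer an NER-based detector (e.g. Microsoft Presidio) on top of regexes, since formats like card numbers and addresses vary widely.

```python
import re

# Illustrative patterns — not exhaustive. Order matters: more specific first.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),        # 13–16 digit card numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
]

def redact(transcript: str) -> str:
    """Replace common PII patterns before the transcript is logged or stored."""
    for pattern, label in PII_PATTERNS:
        transcript = pattern.sub(label, transcript)
    return transcript

print(redact("My SSN is 123-45-6789 and card 4111 1111 1111 1111"))
# My SSN is [SSN] and card [CARD]
```

Run redaction at the logging boundary, not inside the pipeline, so the LLM can still use details the user just spoke while the stored record stays clean.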
35 Compliance
| Regulation | Voice-Specific Requirements |
|---|---|
| GDPR | Consent for recording, right to delete voice data, PII redaction |
| HIPAA | PHI in voice must be encrypted, BAA with all providers, no logging PHI |
| TCPA | Consent for automated calls, opt-out mechanism, calling time restrictions |
| CCPA | Disclose AI use, right to opt out of voice data collection |
| FTC | Disclose that caller is AI (required in many US jurisdictions) |
36 Glossary
| Term | Definition |
|---|---|
| ASR | Automatic Speech Recognition (same as STT) |
| STT | Speech-to-Text — converting audio to text |
| TTS | Text-to-Speech — converting text to audio |
| VAD | Voice Activity Detection — detecting speech in audio |
| Endpointing | Detecting when a speaker has finished an utterance |
| Barge-in | User interrupting the agent while it's speaking |
| WER | Word Error Rate — STT accuracy metric |
| SSML | Speech Synthesis Markup Language — TTS formatting standard |
| WebRTC | Web Real-Time Communication — browser-based audio/video |
| SIP | Session Initiation Protocol — telephony signaling |
| PSTN | Public Switched Telephone Network — traditional phone network |
| DTMF | Dual-Tone Multi-Frequency — phone keypad tones |
| AEC | Acoustic Echo Cancellation |
| AGC | Automatic Gain Control |
| Prosody | Rhythm, stress, and intonation of speech |
| Diarization | Identifying different speakers in audio |
37 Quick Reference — Recommended Stack
Production Voice Agent Stack
| Component | Recommended | Budget Alternative |
|---|---|---|
| VAD | Silero VAD | WebRTC VAD |
| STT | Deepgram Nova-2 | Faster-Whisper (self-hosted) |
| LLM | GPT-4o / Claude Sonnet | Llama 3 (self-hosted) |
| TTS | Cartesia Sonic / ElevenLabs | Piper TTS (self-hosted) |
| Framework | LiveKit Agents | Pipecat |
| Telephony | Twilio | Telnyx |
| Transport | WebRTC (LiveKit) | WebSocket (FastAPI) |
| Monitoring | Langfuse + Grafana | OpenTelemetry + Loki |