Voice Agent — System Architecture

AI-powered conversational voice platform with telephony, streaming STT/TTS, multi-provider LLM, and production tooling
FastAPI + Uvicorn · Gemini / GPT-4 / Claude · Deepgram Nova-2 · ElevenLabs / Cartesia · Twilio Voice
System Overview
The Voice Agent is a production-grade, real-time conversational AI platform. It handles phone calls via Twilio and browser sessions via WebSocket/WebRTC, processes speech with streaming STT, routes through hybrid NLU, generates responses via multi-provider LLM with function calling, and synthesizes speech back to the caller — all within a single FastAPI application with horizontal scaling capabilities.
HIGH-LEVEL ARCHITECTURE
Clients: Phone (PSTN) via Twilio Voice · Browser via WebSocket + mic · WebRTC (Float32 48kHz audio) · REST API (admin / dashboard)
Nginx load balancer: IP hash · rate limit · WS upgrade
Application: FastAPI + Uvicorn (main.py)
Middleware: Rate Limit · Auth · Request Logging · PII Redaction · Compliance
Voice pipeline: STT (Deepgram Nova-2) → NLU (hybrid 4-tier) → LLM engine (Gemini / GPT / Claude) → TTS (ElevenLabs / Cartesia)
Supporting components: Tools (9 functions) · Memory (session + user) · RAG / Knowledge Base · Config Manager (SQLite)
Infrastructure: Redis (session cache + pub/sub) · Prometheus · Grafana · Compliance Auditor (HIPAA/PCI) · CI/CD (GitHub Actions, 5-stage)
GPT-4o Realtime: audio-in / audio-out
External APIs: Deepgram (STT) · Gemini / OpenAI / Anthropic · ElevenLabs / Cartesia (TTS) · Twilio (telephony) · OpenAI Realtime (optional) · FAISS / Sentence-Transformers
Voice Pipeline Flow
Each voice turn follows this path: audio in → transcription → intent understanding → response generation → speech synthesis → audio out. The pipeline uses streaming throughout for minimal latency.
STEP 1 · Audio In: PCM 16kHz (browser), mulaw 8kHz (Twilio)
STEP 2 · STT: Deepgram Nova-2, streaming + VAD
STEP 3 · NLU: 4-tier hybrid (Regex → Rasa → SetFit → LLM)
STEP 4 · LLM: Streaming tokens + function calls
STEP 5 · TTS: Streaming audio, SSML support
STEP 6 · Audio Out: PCM (browser), mulaw (Twilio)

Tool Call Feedback Loop

When the LLM invokes a function (e.g. check_order_status), the pipeline executes the tool, feeds results back to the LLM (up to 3 rounds), and disables tools on follow-up rounds to force a natural language response.

LLM → Tool Call → Execute → Feed Back → Text Out
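The feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `llm` and `execute_tool` callables, their return shapes, and the `run_turn` name are all hypothetical. It assumes tools are offered on the first round only and that the 3-round limit acts as a safety cap.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

MAX_TOOL_ROUNDS = 3  # cap taken from the description above

# Hypothetical shapes: llm(messages, tools) resolves to either
# {"text": "..."} or {"tool_call": {"name": ..., "arguments": {...}}}.
LLMFn = Callable[[list, Optional[list]], Awaitable[dict]]
ToolFn = Callable[[str, dict], Awaitable[Any]]


async def run_turn(llm: LLMFn, execute_tool: ToolFn,
                   messages: list, tools: list) -> str:
    """Run one conversational turn, feeding tool results back to the LLM."""
    for round_idx in range(MAX_TOOL_ROUNDS):
        # Tools are offered on the first round only; follow-up rounds pass
        # None so the model has to answer in natural language.
        offered = tools if round_idx == 0 else None
        reply = await llm(messages, offered)
        call = reply.get("tool_call")
        if call is None:
            return reply.get("text", "")
        result = await execute_tool(call["name"], call["arguments"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": str(result)})
    # Only reached if the model keeps requesting tools despite none offered.
    return "Sorry, I could not complete that request."
```

Passing `None` for the tool list on follow-up rounds is what forces a spoken answer; the round cap only matters if a provider ignores that and keeps emitting tool calls.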
Core Engines
Modular engine architecture: each engine handles one domain and can be swapped independently.
🎤

Speech-to-Text Engine

engines/stt.py · 146 lines

Streaming Deepgram Nova-2 WebSocket integration with VAD events, smart endpointing (800ms), and utterance buffering with debounce logic.

  • Class: DeepgramSTTEngine
  • Protocol: WebSocket streaming
  • Events: Transcript, SpeechStarted, UtteranceEnd
  • Formats: mulaw/8kHz (Twilio), linear16/16kHz (browser)
  • Features: Smart format, punctuation, filler words, keywords
Deepgram · Streaming · VAD
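Utterance buffering with debounce could look something like the sketch below. The `UtteranceBuffer` class and its callback are hypothetical names; the 0.8 s default mirrors the 800ms endpointing figure above, but whether the engine reuses that value for its debounce window is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class UtteranceBuffer:
    """Collects transcript fragments and flushes them after a quiet gap."""

    def __init__(self, on_utterance: Callable[[str], Awaitable[None]],
                 debounce_s: float = 0.8) -> None:
        self._on_utterance = on_utterance
        self._debounce_s = debounce_s
        self._parts: list = []
        self._timer: Optional[asyncio.Task] = None

    def add_fragment(self, text: str) -> None:
        """Store a fragment and restart the debounce timer."""
        if text.strip():
            self._parts.append(text.strip())
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._flush_later())

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._debounce_s)
        # Past this point the flush must not be cancelled by a new fragment.
        self._timer = None
        utterance = " ".join(self._parts)
        self._parts.clear()
        if utterance:
            await self._on_utterance(utterance)
```

Each new fragment cancels the pending flush, so the callback fires once per pause rather than once per partial transcript.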
🤖

LLM Engine

engines/llm.py · ~390 lines

Multi-provider streaming LLM with function calling. Supports Gemini, OpenAI, and Anthropic with automatic role merging for Gemini's alternating-role constraint.

  • Class: LLMEngine
  • Providers: Gemini, OpenAI, Anthropic
  • Streaming: Async generator (token, tool_call, done)
  • Tools: OpenAI-format function declarations
  • Features: system_override, role merging, TTFT logging
Gemini · OpenAI · Anthropic · Function Calling
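Role merging for an alternating-role constraint can be illustrated with a small pure function. This is a sketch under assumptions: the `merge_consecutive_roles` name and the `{"role", "content"}` message shape are hypothetical, and the real engine may join text differently.

```python
def merge_consecutive_roles(messages: list) -> list:
    """Collapse adjacent messages that share a role into one message."""
    merged: list = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same speaker twice in a row: append to the previous message.
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            # Copy so the caller's history is not mutated.
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```

After merging, no two neighbouring entries share a role, which is the property a strictly alternating user/model history requires.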
💡

NLU / Intent Router

engines/nlu.py · ~620 lines

4-tier hybrid intent classification: instant regex patterns, Rasa NLU, SetFit transformer, and LLM fallback. Includes emotion detection and entity extraction.

  • Tiers: Regex (0ms) → Rasa (<10ms) → SetFit (<5ms) → LLM (~1s)
  • Intents: 17 predefined (order, appointment, transfer, etc.)
  • Entities: order_id, date, time, purpose, email, phone
  • Emotions: 6 classes (neutral, happy, frustrated, sad, angry, confused)
Regex · Rasa · SetFit · LLM · Emotion
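A tiered cascade of this kind can be sketched as a list of classifiers tried in order, cheapest first. Everything named here (`IntentResult`, `regex_tier`, `classify`, the patterns, the 0.7 threshold) is hypothetical; the Rasa, SetFit, and LLM tiers would be additional callables with the same signature.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IntentResult:
    intent: str
    confidence: float
    method: str


# Hypothetical patterns; the real engine defines its own.
REGEX_PATTERNS = {
    "transfer": re.compile(r"\b(human|agent|representative)\b", re.I),
    "order": re.compile(r"\border\b", re.I),
}


def regex_tier(text: str) -> Optional[IntentResult]:
    """Tier 1: return a full-confidence match or None."""
    for intent, pattern in REGEX_PATTERNS.items():
        if pattern.search(text):
            return IntentResult(intent, 1.0, "regex")
    return None


Tier = Callable[[str], Optional[IntentResult]]


def classify(text: str, tiers: list, threshold: float = 0.7) -> IntentResult:
    """Try each tier in order; stop at the first confident answer."""
    for tier in tiers:
        result = tier(text)
        if result is not None and result.confidence >= threshold:
            return result
    return IntentResult("unknown", 0.0, "none")
```

Because the loop returns at the first confident result, the slow LLM tier only runs when every cheaper tier abstains or falls below the threshold.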
🔊

Text-to-Speech Engine

engines/tts.py · 462 lines

Dual-provider TTS with smart routing. ElevenLabs for premium quality, Cartesia Sonic for ultra-low latency. Includes SSML builder and emotion-aware prosody.

  • Providers: ElevenLabs, Cartesia Sonic
  • Router: TTSRouter (auto, quality, speed, cost)
  • SSML: Breaks, emphasis, prosody, say-as, phoneme
  • Formats: mulaw_8000, pcm_16000, mp3
  • Features: Voice cloning, multilingual, emotion prosody
ElevenLabs · Cartesia · SSML · Streaming
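The four routing modes might map to providers roughly as in this sketch. Only the quality and speed choices follow from the description above; the `cost` default, the length-based `auto` rule, and the `choose_provider` function itself are assumptions for illustration, not TTSRouter's actual logic.

```python
from enum import Enum


class RoutingMode(str, Enum):
    AUTO = "auto"
    QUALITY = "quality"
    SPEED = "speed"
    COST = "cost"


def choose_provider(mode: RoutingMode, text: str,
                    healthy: frozenset = frozenset({"elevenlabs", "cartesia"}),
                    cheapest: str = "cartesia",       # assumption, configurable
                    short_reply_chars: int = 120) -> str:
    """Pick a TTS provider name, falling back if the preferred one is down."""
    if not healthy:
        raise RuntimeError("no TTS provider available")
    if mode is RoutingMode.QUALITY:
        preferred = "elevenlabs"
    elif mode is RoutingMode.SPEED:
        preferred = "cartesia"
    elif mode is RoutingMode.COST:
        preferred = cheapest
    else:
        # Hypothetical auto rule: short replies favour latency.
        preferred = "cartesia" if len(text) <= short_reply_chars else "elevenlabs"
    if preferred in healthy:
        return preferred
    return sorted(healthy)[0]
```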
🗃

Conversation Memory

engines/memory.py · 234 lines

Dual-layer memory: session-level turn tracking with auto-summarization, plus persistent cross-call user profiles with preferences and history.

  • Session: ConversationMemory (turn tracking, summarization)
  • Persistent: UserMemory (name, tier, preferences, call history)
  • Storage: SQLite (user_memory, conversation_turns)
  • Feature: Auto-summarize when exceeding max_turns
Session · Cross-Call · SQLite
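Auto-summarization on overflow can be sketched as below. `SessionMemory`, `keep_recent`, and `naive_summary` are hypothetical stand-ins; a real implementation would presumably call the LLM to produce the summary rather than truncating text.

```python
from dataclasses import dataclass, field
from typing import Callable


def naive_summary(turns: list) -> str:
    """Placeholder for an LLM summarization call."""
    return " / ".join(f"{t['role']}: {t['content'][:40]}" for t in turns)


@dataclass
class SessionMemory:
    max_turns: int = 6
    keep_recent: int = 2
    summary: str = ""
    turns: list = field(default_factory=list)

    def add_turn(self, role: str, content: str,
                 summarize: Callable[[list], str] = naive_summary) -> None:
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.max_turns:
            # Fold everything except the most recent turns into the summary.
            old = self.turns[:-self.keep_recent]
            self.turns = self.turns[-self.keep_recent:]
            prior = ([{"role": "summary", "content": self.summary}]
                     if self.summary else [])
            self.summary = summarize(prior + old)
```

Carrying the previous summary into the next summarization keeps early context alive across repeated overflows while the prompt stays bounded.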
📚

RAG Knowledge Engine

engines/rag.py · 383 lines

Retrieval-Augmented Generation with FAISS vector search (or numpy fallback). Loads documents from a knowledge base directory, chunks text, embeds, and retrieves relevant context.

  • Vector Store: FAISS IndexFlatIP (or numpy cosine fallback)
  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)
  • Formats: MD, TXT, HTML, JSON, CSV
  • Chunking: 300-char voice-optimized paragraphs
  • Caching: FAISS index persistence + content hash invalidation
FAISS · Embeddings · Auto-Index
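The chunking and cache-invalidation steps can be illustrated with the standard library alone. This sketch assumes paragraphs are separated by blank lines and that the cache key is a SHA-256 over file names and contents; `chunk_paragraphs` and `content_hash` are hypothetical names, and the embedding and FAISS steps are left out.

```python
import hashlib


def chunk_paragraphs(text: str, max_chars: int = 300) -> list:
    """Pack paragraphs into chunks of at most max_chars characters."""
    chunks: list = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current} {para}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single oversized paragraph is hard-split at the limit.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


def content_hash(docs: dict) -> str:
    """Stable digest of all documents; a change means the index is stale."""
    digest = hashlib.sha256()
    for name in sorted(docs):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(docs[name].encode("utf-8") + b"\0")
    return digest.hexdigest()
```

Comparing the stored digest with a freshly computed one at startup is enough to decide whether a persisted index can be reused or must be rebuilt.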

Realtime Audio LLM

engines/realtime_llm.py · 346 lines

OpenAI GPT-4o Realtime API integration for direct audio-in/audio-out streaming. Bypasses the separate STT + LLM + TTS pipeline for ultra-low latency.

  • Protocol: WebSocket bidirectional audio
  • VAD: Server-side voice activity detection
  • Tools: Function calling via Realtime API
  • Barge-in: Response cancellation support
GPT-4o · Audio Native · WebSocket
🌐

WebRTC Audio Handler

engines/webrtc.py · 197 lines

Browser-to-server audio streaming with format conversion. Resamples 48kHz float32 browser audio to 16kHz int16 for the STT pipeline.

  • Input: Float32, 48kHz, mono (browser)
  • Output: Int16, 16kHz, mono (STT)
  • Resampler: Linear interpolation
  • Buffering: 100ms frames
WebRTC · Resampling · Format Conversion
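The conversion from 48kHz float32 to 16kHz int16 can be sketched in pure Python as follows. The function names are hypothetical, and a production handler would more likely operate on typed arrays; this version favours readability.

```python
def resample_linear(samples: list, src_rate: int = 48_000,
                    dst_rate: int = 16_000) -> list:
    """Resample by linear interpolation between neighbouring samples."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def float_to_int16(samples: list) -> list:
    """Clamp to [-1.0, 1.0] and scale to the signed 16-bit range."""
    return [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
```

A 100ms frame at 48kHz holds 4,800 samples and comes out as 1,600 samples at 16kHz. Linear interpolation applies no low-pass filter, so content above 8kHz can alias into the output; that trade-off is usually acceptable for speech recognition input.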
📞 Telephony Layer
Full Twilio integration for inbound/outbound voice with media streams, DTMF, and campaign management.
🕾

Twilio Handler

telephony/twilio_handler.py · 273 lines

Inbound call webhook (TwiML response), WebSocket media stream handler, outbound dialing, and call session lifecycle management.

  • Inbound: TwiML + Media Streams WebSocket
  • Audio: base64 mulaw encoding/decoding
  • DTMF: Digit handling (0 = transfer to human)
  • Barge-in: Mark-based audio sync + clear
  • State: active_sessions dict (call_sid → CallSession)
Twilio · WebSocket · TwiML
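Decoding an inbound media frame and building a barge-in `clear` message might look like this sketch. The mu-law expansion is the standard G.711 formula written out by hand (Python's `audioop` module was removed in 3.13); the helper names are hypothetical and the JSON fields follow Twilio's Media Streams message format as I understand it.

```python
import base64
import json


def mulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_media_message(raw: str) -> list:
    """Return PCM samples for a 'media' event, or [] for other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return []
    payload = base64.b64decode(msg["media"]["payload"])
    return [mulaw_byte_to_pcm16(b) for b in payload]


def clear_message(stream_sid: str) -> str:
    """Message asking Twilio to drop audio it has buffered for playback."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

On barge-in the server would send the `clear` message so queued TTS audio stops playing, then resume streaming once the new response is ready.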
📡

Outbound Campaign Manager

telephony/outbound.py · 396 lines

Batch dialing engine with DNC compliance, answering machine detection, TCPA calling hours enforcement, and real-time campaign analytics.

  • AMD: Answering machine detection (HUMAN/MACHINE/FAX)
  • DNC: Do-Not-Call list management + scrubbing
  • TCPA: 9 AM – 9 PM calling hours enforcement
  • Concurrency: Semaphore-based rate control
  • Analytics: Contact attempt tracking, status filtering
AMD · DNC · TCPA · Campaigns
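The DNC scrub, calling-hours check, and semaphore-based concurrency limit fit together roughly as in this sketch. The `dial` and `local_time_for` callables are hypothetical; in particular, resolving each contact's local time zone is assumed to happen elsewhere.

```python
import asyncio
from datetime import datetime, time
from typing import Awaitable, Callable

CALL_WINDOW_START = time(9, 0)   # 9 AM, per the window stated above
CALL_WINDOW_END = time(21, 0)    # 9 PM


def can_dial(number: str, local_now: datetime, dnc: set) -> bool:
    """Allow a call only if the number is not on DNC and is inside the window."""
    if number in dnc:
        return False
    return CALL_WINDOW_START <= local_now.time() < CALL_WINDOW_END


async def run_campaign(numbers: list,
                       dial: Callable[[str], Awaitable[str]],
                       local_time_for: Callable[[str], datetime],
                       dnc: set,
                       max_concurrent: int = 5) -> dict:
    """Dial all eligible numbers with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def attempt(number: str):
        if not can_dial(number, local_time_for(number), dnc):
            return number, "skipped"
        async with semaphore:
            return number, await dial(number)

    results = await asyncio.gather(*(attempt(n) for n in numbers))
    return dict(results)
```

Checking eligibility before acquiring the semaphore means skipped contacts never consume a dialing slot.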
🛡 Middleware & Security
Production middleware stack for security, observability, compliance, and distributed session management.
🔒

Security Middleware

middleware/security.py · 223 lines

Twilio signature validation (HMAC-SHA1), PII redaction (SSN, credit cards, DOB, etc.), admin authentication, IP-based rate limiting (200 req/60s).

PII Redaction · Rate Limiting · Auth · HMAC
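Twilio signature validation follows a documented recipe: concatenate the full request URL with each POST parameter name and value in sorted key order, sign with HMAC-SHA1 using the account auth token, base64-encode, and compare to the `X-Twilio-Signature` header. A minimal sketch with hypothetical function names:

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token: str, url: str, params: dict) -> str:
    """Build the expected signature for a form-encoded webhook request."""
    data = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"),
                      data.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token: str, url: str, params: dict,
                            header_signature: str) -> bool:
    """Constant-time comparison against the X-Twilio-Signature header."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

The URL must match what Twilio called, including scheme and query string, so behind a reverse proxy it often has to be rebuilt from forwarded headers.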
📊

Observability

middleware/observability.py · 286 lines

Prometheus-format metrics export, circuit breaker pattern for provider failover (CLOSED → OPEN → HALF_OPEN), and latency tracking across all pipeline stages.

Prometheus · Circuit Breaker · Histograms · Failover

Compliance Auditor

middleware/compliance.py · 361 lines

Automated HIPAA/PCI-DSS compliance checking with 20 audit controls, PII scanning (10 pattern types with severity levels), risk assessment, and score calculation.

HIPAA · PCI-DSS · PII Scanner · 20 Checks
🗃

Redis Session Cache

middleware/redis_cache.py · 253 lines

Distributed session management for horizontal scaling. Redis implementation with pub/sub for cross-instance coordination, plus automatic in-memory fallback.

Redis Pub/Sub TTL Sessions Fallback
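The automatic in-memory fallback could be structured like this sketch. `InMemorySessionStore` and `SessionCache` are hypothetical names, the primary store is any object exposing the same `set`/`get` methods, and the exception types that trigger fallback are configurable because a real Redis client raises its own error classes.

```python
import time
from typing import Any, Callable, Optional


class InMemorySessionStore:
    """Process-local store with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: dict = {}
        self._clock = clock

    def set(self, key: str, value: Any, ttl_s: float) -> None:
        self._data[key] = (value, self._clock() + ttl_s)

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # lazy expiry on read
            return None
        return value


class SessionCache:
    """Uses the primary store, dropping to the fallback on connection errors."""

    def __init__(self, primary: Any, fallback: InMemorySessionStore,
                 errors: tuple = (ConnectionError, OSError)) -> None:
        self._primary = primary
        self._fallback = fallback
        self._errors = errors

    def set(self, key: str, value: Any, ttl_s: float) -> None:
        try:
            self._primary.set(key, value, ttl_s)
        except self._errors:
            self._fallback.set(key, value, ttl_s)

    def get(self, key: str) -> Optional[Any]:
        try:
            return self._primary.get(key)
        except self._errors:
            return self._fallback.get(key)
```

Sessions written to the fallback are visible only to the local process, so cross-instance coordination is lost until the primary store recovers.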
Function Calling Tools
9 production tools that the LLM can invoke during conversation. All are SQLite-backed with sample data seeding.
Tool Name | Parameters | Returns | Description
check_order_status | order_id | Status, items, tracking, ETA | Look up order by ID, return full shipping details
schedule_appointment | date, time, name, purpose | Confirmation #, details | Book a new appointment with conflict checking
cancel_appointment | confirmation_number | Cancellation status | Cancel an existing appointment
reschedule_appointment | confirmation_number, new_date, new_time | Updated details | Change date/time of existing appointment
transfer_to_human | department, reason | Queue position, wait time | Request transfer to human agent
look_up_account | identifier (phone/email/ID) | Customer profile, history | Find customer record by any identifier
get_business_hours | department (optional) | Hours, current status | Check if open/closed, return schedule
collect_feedback | rating, comment | Thank you + feedback ID | Record customer satisfaction rating
end_call | reason (optional) | Goodbye message | End the conversation politely

Source: tools/functions.py · 672 lines · Database tables: orders, appointments, customers, feedback, transfers
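An OpenAI-format declaration for one of these tools, together with a dispatch router, might look like the sketch below. The parameter schema is inferred from the table, the handler body is a stand-in for the SQLite lookup with made-up data, and the `execute_tool` signature shown here is an assumption.

```python
import asyncio
import json

CHECK_ORDER_STATUS = {
    "type": "function",
    "function": {
        "name": "check_order_status",
        "description": "Look up an order by ID and return shipping details.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "Order identifier"},
            },
            "required": ["order_id"],
        },
    },
}


async def check_order_status(order_id: str) -> dict:
    # Stand-in for the database query; this result is illustrative only.
    return {"order_id": order_id, "status": "shipped"}


TOOL_REGISTRY = {"check_order_status": check_order_status}


async def execute_tool(name: str, arguments_json: str) -> dict:
    """Route a tool call to its handler, returning errors as data."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        arguments = json.loads(arguments_json or "{}")
        return await handler(**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning errors as ordinary results lets the LLM see what went wrong and apologize or retry in its next message instead of the call failing mid-turn.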

🔗 API Endpoints
All endpoints served by the FastAPI application (main.py).
GET /admin · Admin dashboard UI
GET /demo · Text-based demo chat page
GET /voice-demo · Voice demo with mic + TTS playback
GET /tracker · Implementation task tracker
GET /api/config · Get all config (sensitive keys masked)
POST /api/config · Update configuration values
POST /twilio-webhook · Inbound Twilio call webhook
WS /twilio-stream · Twilio Media Streams WebSocket
WS /ws/chat · Text chat with streaming LLM + tools
WS /ws/voice · Full voice pipeline (STT + LLM + TTS)
WS /ws/webrtc · WebRTC browser audio stream
POST /api/outbound-call · Initiate an outbound call
POST /api/test-llm · Test LLM with text input
POST /api/test-nlu · Test NLU intent detection
GET /api/active-calls · List currently active calls
GET /api/call-logs · Recent call history
GET /api/stats · Dashboard statistics
GET /api/compliance-audit · Run HIPAA/PCI compliance audit
GET /api/compliance-checklist · Full compliance checklist
GET /metrics · Prometheus metrics endpoint
GET /health · Health check
GET /api/campaigns · List outbound campaigns
POST /api/campaigns · Create new campaign
POST /api/campaigns/{id}/start · Start dialing a campaign
POST /api/dnc · Add number to Do-Not-Call list
🔄 Data Flow & Integration Map
How components communicate within the system.
Source | Destination | Protocol | Data
Browser / Phone | Nginx | HTTP / WS | Audio frames, JSON commands
Nginx | FastAPI (Uvicorn) | Reverse proxy | IP-hash sticky sessions
FastAPI | Deepgram | WebSocket | Raw audio → transcript events
FastAPI | Gemini / OpenAI / Claude | HTTPS (streaming) | Prompt + history → token stream
FastAPI | ElevenLabs / Cartesia | HTTPS / WS | Text → audio chunks
FastAPI | Twilio | REST + WS | TwiML, media stream, outbound dial
FastAPI | Redis | TCP | Session state, pub/sub events
FastAPI | SQLite | File I/O | Config, call logs, orders, appointments
Prometheus | FastAPI /metrics | HTTP scrape | Counter + histogram metrics
Grafana | Prometheus | HTTP query | PromQL dashboard queries
📈 Observability & Metrics
Prometheus-format metrics exported at /metrics. Scraped every 15s.
voice_turn_latency_ms · histogram · Full turn latency (audio-in to audio-out)
voice_stt_latency_ms · histogram · Speech-to-text transcription latency
voice_llm_ttft_ms · histogram · LLM time-to-first-token
voice_tts_ttfb_ms · histogram · TTS time-to-first-byte
voice_nlu_latency_ms · histogram (labeled) · NLU intent detection latency by method
voice_calls_total · counter · Total calls (inbound + outbound)
voice_calls_active · gauge · Currently active concurrent calls
voice_barge_in_total · counter · Barge-in (user interruption) count
voice_tool_calls_total · counter (labeled) · Function calls by tool name
voice_cost_per_call_usd · histogram · Estimated cost per call

Circuit Breaker States

🟢 CLOSED: Normal operation. All requests pass through.
🔴 OPEN: 3 failures detected. Traffic blocked for 30s.
🟠 HALF-OPEN: Testing recovery. 1 request allowed through.
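The three states can be sketched as a small state machine. The class and method names are hypothetical, and counting failures as consecutive (a success resets the count) is an assumption; the threshold of 3 and the 30s block come from the figures above.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.state = State.CLOSED
        self.failure_threshold = failure_threshold
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_s:
                return False
            # Block period elapsed: move to HALF_OPEN and allow one probe.
            self.state = State.HALF_OPEN
            self._probe_in_flight = False
        if self.state is State.HALF_OPEN:
            if self._probe_in_flight:
                return False
            self._probe_in_flight = True
        return True

    def record_success(self) -> None:
        self.state = State.CLOSED
        self._failures = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # failed probe reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self._opened_at = self._clock()
        self._failures = 0
```

A caller checks `allow_request()` before contacting a provider and, when it returns False, routes the request to the next provider; that is how the breaker drives failover.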
🚀 Deployment Architecture
Docker Compose orchestration with optional profiles for production features.
📦

Docker Services

docker-compose.yml · 91 lines
  • voice-agent: FastAPI server (port 8000) · 2 CPU, 2GB RAM
  • redis: Session cache (port 6379) · 256MB, AOF, LRU
  • rasa: NLU server (port 5005) · profile: with-rasa
  • prometheus: Metrics scraping (port 9090) · profile: monitoring
  • grafana: Dashboards (port 3000) · profile: monitoring
  • nginx: Load balancer (port 80) · profile: production
🛠

CI/CD Pipeline

.github/workflows/ci.yml · 178 lines
  • Stage 1: Lint (Ruff check + format + mypy type check)
  • Stage 2: Test (pytest + coverage, with Redis service)
  • Stage 3: Security (Safety + Bandit vulnerability scan)
  • Stage 4: Docker (image build + health check test)
  • Stage 5: Deploy (production deploy, main branch only)
🌐

Nginx Load Balancer

deployment/nginx.conf · 141 lines
  • Strategy: IP hash (caller affinity / sticky sessions)
  • API Rate: 30 req/s per IP (burst 20)
  • WS Rate: 10 req/s per IP
  • WS Timeout: 3600s (1 hour for long calls)
  • Connections: 100 concurrent WebSockets per IP
📁 Project File Tree
Complete source structure (37 files, over 6,500 lines of application code).
voice_agent/
├── main.py                     # FastAPI server, all endpoints & WebSockets (960 lines)
├── config.py                   # Centralized config with SQLite persistence (266 lines)
├── __init__.py
│
├── engines/                    # Core AI/audio processing engines
│   ├── stt.py                  # Deepgram Nova-2 streaming STT (146 lines)
│   ├── llm.py                  # Multi-provider LLM: Gemini/GPT/Claude (~390 lines)
│   ├── tts.py                  # ElevenLabs + Cartesia TTS with SSML (462 lines)
│   ├── nlu.py                  # 4-tier hybrid NLU + emotion detection (~620 lines)
│   ├── memory.py               # Session + cross-call user memory (234 lines)
│   ├── rag.py                  # FAISS vector search + document chunking (383 lines)
│   ├── realtime_llm.py         # GPT-4o Realtime audio-in/audio-out (346 lines)
│   ├── webrtc.py               # Browser audio resampling 48k→16k (197 lines)
│   ├── __init__.py
│   └── rasa_nlu/               # Rasa NLU training data
│       ├── config.yml
│       ├── domain.yml
│       └── nlu.yml
│
├── pipeline/                   # Orchestration layer
│   ├── voice_pipeline.py       # STT→NLU→LLM→TTS orchestrator (428 lines)
│   └── __init__.py
│
├── telephony/                  # Phone integration
│   ├── twilio_handler.py       # Inbound/outbound Twilio + Media Streams (273 lines)
│   ├── outbound.py             # Campaign manager + AMD + DNC (396 lines)
│   └── __init__.py
│
├── tools/                      # LLM function calling
│   ├── functions.py            # 9 tools + execute_tool router (672 lines)
│   └── __init__.py
│
├── middleware/                 # Cross-cutting concerns
│   ├── security.py             # Auth, PII redaction, rate limiting (223 lines)
│   ├── observability.py        # Prometheus metrics + circuit breaker (286 lines)
│   ├── compliance.py           # HIPAA/PCI audit + PII scanner (361 lines)
│   ├── redis_cache.py          # Distributed session cache (253 lines)
│   └── __init__.py
│
├── admin/                      # Web UI pages
│   ├── dashboard.html          # Admin dashboard with config + call logs
│   ├── demo.html               # Text chat demo page
│   ├── voice_demo.html         # Voice demo with mic + TTS playback
│   ├── tracker.html            # Implementation task tracker
│   ├── architecture.html       # This document
│   └── mic-processor.js        # AudioWorklet for mic capture
│
├── deployment/                 # Infrastructure configs
│   └── nginx.conf              # Load balancer + WebSocket proxy (141 lines)
│
├── monitoring/                 # Observability stack
│   ├── prometheus.yml          # Scrape config (10 lines)
│   └── grafana/provisioning/
│       ├── dashboards/
│       │   └── dashboards.yml
│       └── datasources/
│           └── prometheus.yml
│
├── .github/workflows/          # CI/CD
│   └── ci.yml                  # 5-stage pipeline: lint→test→security→docker→deploy
│
└── docker-compose.yml          # 6 services: app, redis, rasa, prometheus, grafana, nginx
Voice Agent Architecture Document
Generated March 2026 · FastAPI 1.0.0 · Python 3.12