Voice Agent — System Architecture

AI-powered conversational voice platform with telephony, streaming STT/TTS, multi-provider LLM, and production tooling
FastAPI + Uvicorn · Gemini / GPT-4 / Claude · Deepgram Nova-2 · ElevenLabs / Cartesia · Twilio Voice
System Overview
The Voice Agent is a production-grade, real-time conversational AI platform. It handles phone calls via Twilio and browser sessions via WebSocket/WebRTC, processes speech with streaming STT, routes through hybrid NLU, generates responses via multi-provider LLM with function calling, and synthesizes speech back to the caller — all within a single FastAPI application with horizontal scaling capabilities.
HIGH-LEVEL ARCHITECTURE (diagram summary)
  • Clients: Phone (PSTN) via Twilio Voice · Browser via WebSocket + mic (WebRTC, Float32 48kHz audio) · REST API for admin/dashboard
  • Edge: Nginx load balancer (IP hash, rate limiting, WebSocket upgrade)
  • App: FastAPI + Uvicorn (main.py) with middleware for rate limiting, auth, request logging, PII redaction, and compliance
  • Voice pipeline: STT (Deepgram Nova-2) → NLU (hybrid 4-tier) → LLM engine (Gemini / GPT / Claude) → TTS (ElevenLabs / Cartesia), supported by Tools (9 functions), Memory (session + user), RAG / knowledge base, and the Config Manager (SQLite)
  • Alternative path: GPT-4o Realtime (audio-in / audio-out)
  • Infrastructure: Redis (session cache + pub/sub) · Prometheus + Grafana · Compliance Auditor (HIPAA/PCI) · CI/CD (GitHub Actions, 5-stage)
  • External APIs: Deepgram (STT) · Gemini / OpenAI / Anthropic · ElevenLabs / Cartesia (TTS) · Twilio (telephony) · OpenAI Realtime (optional) · FAISS / sentence-transformers
Voice Pipeline Flow
Each voice turn follows this path: audio in → transcription → intent understanding → response generation → speech synthesis → audio out. The pipeline uses streaming throughout for minimal latency.
  1. Audio In: PCM 16kHz (browser) / mulaw 8kHz (Twilio)
  2. STT: Deepgram Nova-2, streaming + VAD
  3. NLU: 4-tier hybrid (Regex → Rasa → SetFit → LLM)
  4. LLM: streaming tokens + function calls
  5. TTS: streaming audio, SSML support
  6. Audio Out: PCM (browser) / mulaw (Twilio)

Tool Call Feedback Loop

When the LLM invokes a function (e.g. check_order_status), the pipeline executes the tool, feeds results back to the LLM (up to 3 rounds), and disables tools on follow-up rounds to force a natural language response.

🤖 LLM → Tool Call → Execute → 🔁 Feed Back → 💬 Text Out
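The loop described above can be sketched as follows. This is a hypothetical outline, not the actual pipeline code: `llm_complete` and `execute_tool` stand in for the real LLM engine and tool router.

```python
def run_turn(llm_complete, execute_tool, messages, tools, max_rounds=3):
    """Resolve tool calls within a turn, then return the spoken reply.

    Sketch only: `llm_complete(messages, tools)` is assumed to return a dict
    with "text" and an optional "tool_call" {"name", "arguments"}.
    """
    for _ in range(max_rounds):
        reply = llm_complete(messages, tools=tools)
        call = reply.get("tool_call")
        if call is None:
            return reply["text"]                      # plain text: turn is done
        result = execute_tool(call["name"], call["arguments"])
        # Feed the tool result back so the LLM can ground its next response
        messages.append({"role": "tool", "name": call["name"], "content": result})
    # Tool budget exhausted: one final pass with tools disabled forces
    # a natural-language answer instead of another function call.
    return llm_complete(messages, tools=None)["text"]
```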
Core Engines
Modular engine architecture: each engine handles one domain and can be swapped independently.
🎤

Speech-to-Text Engine

engines/stt.py · 146 lines

Streaming Deepgram Nova-2 WebSocket integration with VAD events, smart endpointing (800ms), and utterance buffering with debounce logic.

  • Class: DeepgramSTTEngine
  • Protocol: WebSocket streaming
  • Events: Transcript, SpeechStarted, UtteranceEnd
  • Formats: mulaw/8kHz (Twilio), linear16/16kHz (browser)
  • Features: Smart format, punctuation, filler words, keywords
Deepgram Streaming VAD
🤖

LLM Engine

engines/llm.py · ~390 lines

Multi-provider streaming LLM with function calling. Supports Gemini, OpenAI, and Anthropic with automatic role merging for Gemini's alternating-role constraint.

  • Class: LLMEngine
  • Providers: Gemini, OpenAI, Anthropic
  • Streaming: Async generator (token, tool_call, done)
  • Tools: OpenAI-format function declarations
  • Features: system_override, role merging, TTFT logging
Gemini OpenAI Anthropic Function Calling
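Gemini rejects conversations whose roles do not strictly alternate, so consecutive same-role messages must be merged before the request is sent. A minimal sketch of such a role-merging step (the actual engines/llm.py implementation may differ):

```python
def merge_roles(messages):
    """Collapse consecutive same-role messages so roles strictly alternate.

    Sketch of the role-merging step for Gemini's alternating-role constraint;
    message shape ({"role", "content"}) is an assumption.
    """
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: append to its content
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```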
💡

NLU / Intent Router

engines/nlu.py · ~620 lines

4-tier hybrid intent classification: instant regex patterns, Rasa NLU, SetFit transformer, and LLM fallback. Includes emotion detection and entity extraction.

  • Tiers: Regex (~0ms) → Rasa (<10ms) → SetFit (<5ms) → LLM (~1s)
  • Intents: 17 predefined (order, appointment, transfer, etc.)
  • Entities: order_id, date, time, purpose, email, phone
  • Emotions: 6 classes (neutral, happy, frustrated, sad, angry, confused)
Regex Rasa SetFit LLM Emotion
🔊

Text-to-Speech Engine

engines/tts.py · 462 lines

Dual-provider TTS with smart routing. ElevenLabs for premium quality, Cartesia Sonic for ultra-low latency. Includes SSML builder and emotion-aware prosody.

  • Providers: ElevenLabs, Cartesia Sonic
  • Router: TTSRouter (auto, quality, speed, cost)
  • SSML: Breaks, emphasis, prosody, say-as, phoneme
  • Formats: mulaw_8000, pcm_16000, mp3
  • Features: Voice cloning, multilingual, emotion prosody
ElevenLabs Cartesia SSML Streaming
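The routing strategies can be illustrated with a small decision function. This is a hypothetical sketch; the real TTSRouter's inputs and thresholds are not documented here, so the cutoffs below are invented for illustration.

```python
def route_tts(strategy="auto", text_len=0, latency_budget_ms=300):
    """Pick a TTS provider per routing strategy (illustrative sketch).

    "quality" always uses ElevenLabs (premium voices); "speed" and "cost"
    use Cartesia Sonic (ultra-low latency). "auto" trades quality for
    latency using assumed cutoffs.
    """
    if strategy == "quality":
        return "elevenlabs"
    if strategy in ("speed", "cost"):
        return "cartesia"
    # "auto": prefer Cartesia when the latency budget is tight or the
    # utterance is short; otherwise spend the budget on premium quality.
    if latency_budget_ms < 250 or text_len < 40:
        return "cartesia"
    return "elevenlabs"
```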
🗃

Conversation Memory

engines/memory.py · 234 lines

Dual-layer memory: session-level turn tracking with auto-summarization, plus persistent cross-call user profiles with preferences and history.

  • Session: ConversationMemory (turn tracking, summarization)
  • Persistent: UserMemory (name, tier, preferences, call history)
  • Storage: SQLite (user_memory, conversation_turns)
  • Feature: Auto-summarize when exceeding max_turns
Session Cross-Call SQLite
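The auto-summarization trigger might look like the following sketch, assuming a pluggable `summarize` callback where the real engine would call the LLM:

```python
class ConversationMemory:
    """Sketch of session memory with auto-summarization (names assumed).

    When the turn count exceeds max_turns, older turns are folded into a
    running summary and only the recent half is kept verbatim.
    """

    def __init__(self, max_turns=10, summarize=None):
        self.max_turns = max_turns
        self.summary = ""
        self.turns = []
        # `summarize` would call the LLM in the real engine
        self._summarize = summarize or (lambda turns: f"[{len(turns)} earlier turns]")

    def add_turn(self, role, text):
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            keep = self.max_turns // 2           # keep the recent half verbatim
            self.summary = self._summarize(self.turns[:-keep])
            self.turns = self.turns[-keep:]
```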
📚

RAG Knowledge Engine

engines/rag.py · 383 lines

Retrieval-Augmented Generation with FAISS vector search (or numpy fallback). Loads documents from a knowledge base directory, chunks text, embeds, and retrieves relevant context.

  • Vector Store: FAISS IndexFlatIP (or numpy cosine fallback)
  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)
  • Formats: MD, TXT, HTML, JSON, CSV
  • Chunking: 300-char voice-optimized paragraphs
  • Caching: FAISS index persistence + content hash invalidation
FAISS Embeddings Auto-Index
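Voice-optimized chunking can be sketched as paragraph-boundary packing into ~300-character chunks. This is a simplified illustration; the real chunker's exact rules are not specified in this document.

```python
def chunk_text(text, max_chars=300):
    """Pack paragraphs into chunks of at most max_chars (sketch).

    Splits on blank lines and greedily packs whole paragraphs; an
    oversized paragraph is truncated here as a simplification.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(current) + len(para) + 1 <= max_chars:
            current = f"{current}\n{para}".strip()
        else:
            if current:
                chunks.append(current)
            current = para[:max_chars]   # truncate oversized paragraphs (sketch)
    if current:
        chunks.append(current)
    return chunks
```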

Realtime Audio LLM

engines/realtime_llm.py · 346 lines

OpenAI GPT-4o Realtime API integration for direct audio-in/audio-out streaming. Bypasses the separate STT + LLM + TTS pipeline for ultra-low latency.

  • Protocol: WebSocket bidirectional audio
  • VAD: Server-side voice activity detection
  • Tools: Function calling via Realtime API
  • Barge-in: Response cancellation support
GPT-4o Audio Native WebSocket
🌐

WebRTC Audio Handler

engines/webrtc.py · 197 lines

Browser-to-server audio streaming with format conversion. Resamples 48kHz float32 browser audio to 16kHz int16 for the STT pipeline.

  • Input: Float32, 48kHz, mono (browser)
  • Output: Int16, 16kHz, mono (STT)
  • Resampler: Linear interpolation
  • Buffering: 100ms frames
WebRTC Resampling Format Conversion
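Linear-interpolation downsampling from 48 kHz float32 to 16 kHz int16 can be sketched in pure Python (the real handler presumably operates on buffers, e.g. with numpy, but the arithmetic is the same):

```python
def resample_48k_to_16k(samples):
    """Downsample mono float32 @ 48 kHz to int16 @ 16 kHz (sketch).

    Uses linear interpolation between neighboring input samples; input
    floats are clamped to [-1, 1] before scaling to the int16 range.
    """
    ratio = 48_000 / 16_000          # 3.0: one output sample per 3 input samples
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        val = samples[lo] * (1 - frac) + samples[hi] * frac
        out.append(int(max(-1.0, min(1.0, val)) * 32767))
    return out
```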
💡 NLU Deep Dive
The Natural Language Understanding engine is the brain of the voice pipeline, sitting between STT and LLM to classify user intent, extract structured entities, and detect emotional tone — all in under 15ms for the fast path. It uses a 4-tier cascading architecture that balances speed, accuracy, and cost: deterministic regex for known patterns, Rasa NLU for trained intents, SetFit transformer for few-shot generalization, and LLM fallback for ambiguous or novel utterances.

4-Tier Cascading Classification

engines/nlu.py · IntentClassifier

Each incoming utterance is evaluated top-down. The first tier to return a high-confidence result short-circuits the cascade, minimizing latency. If no tier reaches the confidence threshold, the utterance falls through to LLM-based classification.

  • TIER 1 — Regex Patterns: deterministic match against 50+ compiled patterns · latency ~0ms · confidence 1.0 (✓ instant)
  • TIER 2 — Rasa NLU: DIET classifier with CRF entity extraction · latency <10ms · threshold 0.75 (⚡ fast)
  • TIER 3 — SetFit Transformer: few-shot sentence-transformer fine-tuned on domain data · latency <5ms · threshold 0.70 (🔬 accurate)
  • TIER 4 — LLM Fallback: structured prompt with schema → JSON intent + entities + confidence · latency ~800–1200ms (🚀 comprehensive)
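The short-circuit evaluation can be sketched as a generic cascade over (name, classifier, threshold) tuples. The function names here are illustrative, not the actual IntentClassifier API:

```python
def classify(text, tiers):
    """Evaluate tiers top-down; the first high-confidence hit short-circuits.

    Sketch: `tiers` is a list of (name, classify_fn, threshold), where
    classify_fn returns (intent, confidence) or None on a miss.
    """
    for tier_no, (name, fn, threshold) in enumerate(tiers, start=1):
        result = fn(text)
        if result and result[1] >= threshold:
            intent, confidence = result
            return {"intent": intent, "confidence": confidence, "tier": tier_no}
    # No tier reached its threshold: fall back to a catch-all label
    return {"intent": "out_of_scope", "confidence": 0.0, "tier": len(tiers)}
```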
🔍

Tier 1 — Regex Patterns

_regex_classify()

Pre-compiled regular expressions match common, well-defined utterance patterns with zero latency and absolute confidence. Patterns are loaded at startup from a configurable YAML mapping and compiled with re.IGNORECASE.

  • Patterns: 50+ compiled regexes across 17 intents
  • Latency: ~0ms (in-process string matching)
  • Confidence: Always 1.0 on match
  • Examples: "check my order", "transfer to agent", "cancel", "yes/no"
  • Limitation: Only handles exact phrasings; no semantic understanding
Zero Latency Deterministic YAML Config
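A minimal sketch of the tier-1 matcher, with a small illustrative subset of patterns (the real patterns live in a YAML mapping and differ from these):

```python
import re

# Illustrative subset only; the actual pattern map is loaded from YAML
INTENT_PATTERNS = {
    "check_order": [r"\b(check|track|where('?s| is))\b.*\border\b"],
    "transfer_agent": [r"\b(transfer|speak|talk)\b.*\b(agent|human|person)\b"],
    "affirm": [r"^\s*(yes|yeah|yep|sure|correct)\b"],
}
COMPILED = {
    intent: [re.compile(p, re.IGNORECASE) for p in pats]
    for intent, pats in INTENT_PATTERNS.items()
}

def regex_classify(text):
    """Tier 1: deterministic match with confidence 1.0, or None on miss."""
    for intent, patterns in COMPILED.items():
        if any(p.search(text) for p in patterns):
            return {"intent": intent, "confidence": 1.0, "tier": 1}
    return None
```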
🧠

Tier 2 — Rasa NLU

rasa_nlu/ · DIET Classifier

A Rasa NLU server (port 5005) runs the DIET (Dual Intent and Entity Transformer) architecture. It performs joint intent classification and entity extraction using a shared transformer backbone trained on domain-specific examples.

  • Architecture: DIET classifier + CRF entity extractor
  • Featurizers: WhitespaceTokenizer → CountVectorsFeaturizer → LanguageModelFeaturizer
  • Training Data: rasa_nlu/data/nlu.yml (800+ labeled examples)
  • Threshold: 0.75 confidence minimum to accept
  • Latency: <10ms per utterance
  • Entity Types: order_id, date, time, name, phone, email via CRF
DIET CRF Entities Joint Model
🤖

Tier 3 — SetFit Transformer

SetFitModel · sentence-transformers

A few-shot contrastive learning model built on sentence-transformers. SetFit fine-tunes a pre-trained embedding model using Siamese networks, then trains a classification head — requiring only 8–16 examples per intent to generalize well to unseen phrasings.

  • Base Model: all-MiniLM-L6-v2 (384-dim embeddings)
  • Training: Contrastive pairs + logistic regression head
  • Few-Shot: 8–16 labeled examples per intent class
  • Threshold: 0.70 confidence minimum to accept
  • Latency: <5ms (single forward pass, CPU-optimized)
  • Advantage: Handles paraphrases and novel phrasings missed by regex/Rasa
Few-Shot Sentence-Transformers Contrastive
🚀

Tier 4 — LLM Fallback

_llm_classify() · Multi-provider

When all fast-path tiers miss, a structured prompt is sent to the active LLM provider (Gemini, GPT-4, or Claude). The prompt includes the full intent schema, entity definitions, and conversation context to produce a JSON-structured classification result.

  • Prompt: System prompt with intent enum + entity schema + few-shot examples
  • Output: JSON: {intent, confidence, entities[], reasoning}
  • Providers: Gemini 2.0 Flash (primary), GPT-4o-mini, Claude 3.5 Haiku
  • Latency: ~800–1200ms (network round-trip)
  • Context: Includes last 3 turns for disambiguation
  • Retry: Provider failover with 2s timeout per attempt
Gemini GPT-4 Claude JSON Schema
📋

Intent Registry

17 production intents

All recognized intents are defined in a central registry with associated metadata: priority level, required entities, follow-up actions, and escalation rules. The registry drives both classification validation and downstream routing.

  • Order: check_order, place_order, cancel_order, return_order
  • Scheduling: book_appointment, reschedule, cancel_appointment
  • Support: transfer_agent, technical_support, billing_inquiry
  • General: greeting, goodbye, affirm, deny, faq, out_of_scope
  • Special: escalate (auto-triggered by frustration detection)
17 Intents Priority Levels Auto-Escalation
🎯

Entity Extraction

Hybrid extraction pipeline

Entities are extracted in parallel by multiple methods and merged with confidence-weighted deduplication. Regex captures structured formats (order IDs, phone numbers), CRF handles contextual spans, and LLM extracts complex or implicit entities.

  • order_id: Regex ORD-\d{6,10} or #\d{6}
  • date / time: Duckling-style parser with relative date support ("next Tuesday", "in 2 hours")
  • email: RFC 5322 regex with domain validation
  • phone: E.164 pattern matching with country code inference
  • purpose: Free-text extracted by CRF or LLM ("I need to discuss my refund")
  • Merge: Highest-confidence value wins per entity slot
Regex CRF LLM Dedup
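The confidence-weighted merge rule can be sketched as follows (the candidate shape, a dict with name/value/confidence/source, is an assumption):

```python
def merge_entities(*candidate_lists):
    """Merge entity candidates from the regex/CRF/LLM extractors (sketch).

    Deduplication rule: for each entity slot, the candidate with the
    highest confidence wins, regardless of which extractor produced it.
    """
    merged = {}
    for candidates in candidate_lists:
        for ent in candidates:   # each: {"name", "value", "confidence", "source"}
            slot = ent["name"]
            if slot not in merged or ent["confidence"] > merged[slot]["confidence"]:
                merged[slot] = ent
    return merged
```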
💥

Emotion Detection

EmotionDetector · 6 classes

Runs alongside intent classification to detect caller emotional state. Uses a lightweight DistilBERT model fine-tuned on call center transcripts. Emotion labels feed into TTS prosody adjustment and agent escalation logic.

  • Classes: neutral, happy, frustrated, sad, angry, confused
  • Model: DistilBERT-base fine-tuned (6-class softmax)
  • Latency: <3ms (runs in parallel with intent classification)
  • Escalation: angry > 0.8 or frustrated > 0.85 → auto-escalate intent
  • TTS Effect: Adjusts prosody rate, pitch, and empathy phrasing
  • Tracking: Emotion trajectory logged per call for analytics
Frustration Sentiment DistilBERT Auto-Escalate
📜

NLU Output Schema

NLUResult dataclass

Every NLU invocation returns a standardized NLUResult object consumed by the dialog manager and LLM prompt builder. The schema ensures consistent downstream handling regardless of which classification tier produced the result.

  • intent: str — one of 17 registered intent labels
  • confidence: float (0.0–1.0) — classification confidence
  • tier: int (1–4) — which tier produced the result
  • entities: dict — {entity_name: {value, confidence, source}}
  • emotion: str — detected emotion label
  • emotion_scores: dict — all 6 class probabilities
  • latency_ms: float — total NLU processing time
  • raw_text: str — original STT transcript input
Dataclass Typed Serializable
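A sketch of the dataclass implied by the schema above; the field defaults and the `to_dict` helper are assumptions, not the actual engines/nlu.py code.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class NLUResult:
    """Standardized NLU output consumed downstream (sketch of the schema)."""
    intent: str                    # one of the 17 registered intent labels
    confidence: float              # 0.0–1.0 classification confidence
    tier: int                      # 1–4: which tier produced the result
    raw_text: str                  # original STT transcript input
    entities: dict = field(default_factory=dict)   # name -> {value, confidence, source}
    emotion: str = "neutral"
    emotion_scores: dict = field(default_factory=dict)  # all 6 class probabilities
    latency_ms: float = 0.0

    def to_dict(self) -> dict:
        """Serializable form for logging and WebSocket frames (assumed helper)."""
        return asdict(self)
```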
📞 Telephony Layer
Full Twilio integration for inbound/outbound voice with media streams, DTMF, and campaign management.
🕾

Twilio Handler

telephony/twilio_handler.py · 273 lines

Inbound call webhook (TwiML response), WebSocket media stream handler, outbound dialing, and call session lifecycle management.

  • Inbound: TwiML + Media Streams WebSocket
  • Audio: base64 mulaw encoding/decoding
  • DTMF: Digit handling (0 = transfer to human)
  • Barge-in: Mark-based audio sync + clear
  • State: active_sessions dict (call_sid → CallSession)
Twilio WebSocket TwiML
📡

Outbound Campaign Manager

telephony/outbound.py · 396 lines

Batch dialing engine with DNC compliance, answering machine detection, TCPA calling hours enforcement, and real-time campaign analytics.

  • AMD: Answering machine detection (HUMAN/MACHINE/FAX)
  • DNC: Do-Not-Call list management + scrubbing
  • TCPA: 9 AM – 9 PM calling hours enforcement
  • Concurrency: Semaphore-based rate control
  • Analytics: Contact attempt tracking, status filtering
AMD DNC TCPA Campaigns
🛡 Middleware & Security
Production middleware stack for security, observability, compliance, and distributed session management.
🔒

Security Middleware

middleware/security.py · 223 lines

Twilio signature validation (HMAC-SHA1), PII redaction (SSN, credit cards, DOB, etc.), admin authentication, IP-based rate limiting (200 req/60s).

PII Redaction Rate Limiting Auth HMAC
📊

Observability

middleware/observability.py · 286 lines

Prometheus-format metrics export, circuit breaker pattern for provider failover (CLOSED → OPEN → HALF_OPEN), and latency tracking across all pipeline stages.

Prometheus Circuit Breaker Histograms Failover

Compliance Auditor

middleware/compliance.py · 361 lines

Automated HIPAA/PCI-DSS compliance checking with 20 audit controls, PII scanning (10 pattern types with severity levels), risk assessment, and score calculation.

HIPAA PCI-DSS PII Scanner 20 Checks
🗃

Redis Session Cache

middleware/redis_cache.py · 253 lines

Distributed session management for horizontal scaling. Redis implementation with pub/sub for cross-instance coordination, plus automatic in-memory fallback.

Redis Pub/Sub TTL Sessions Fallback
Function Calling Tools
9 production tools that the LLM can invoke during conversation. All are SQLite-backed with sample data seeding.
| Tool Name | Parameters | Returns | Description |
|---|---|---|---|
| check_order_status | order_id | Status, items, tracking, ETA | Look up order by ID, return full shipping details |
| schedule_appointment | date, time, name, purpose | Confirmation #, details | Book a new appointment with conflict checking |
| cancel_appointment | confirmation_number | Cancellation status | Cancel an existing appointment |
| reschedule_appointment | confirmation_number, new_date, new_time | Updated details | Change date/time of existing appointment |
| transfer_to_human | department, reason | Queue position, wait time | Request transfer to human agent |
| look_up_account | identifier (phone/email/ID) | Customer profile, history | Find customer record by any identifier |
| get_business_hours | department (optional) | Hours, current status | Check if open/closed, return schedule |
| collect_feedback | rating, comment | Thank you + feedback ID | Record customer satisfaction rating |
| end_call | reason (optional) | Goodbye message | End the conversation politely |

Source: tools/functions.py · 672 lines · Database tables: orders, appointments, customers, feedback, transfers

🔗 API Endpoints
All endpoints served by the FastAPI application (main.py).
  • GET /admin · Admin dashboard UI
  • GET /demo · Text-based demo chat page
  • GET /voice-demo · Voice demo with mic + TTS playback
  • GET /tracker · Implementation task tracker
  • GET /api/config · Get all config (sensitive keys masked)
  • POST /api/config · Update configuration values
  • POST /twilio-webhook · Inbound Twilio call webhook
  • WS /twilio-stream · Twilio Media Streams WebSocket
  • WS /ws/chat · Text chat with streaming LLM + tools
  • WS /ws/voice · Full voice pipeline (STT + LLM + TTS)
  • WS /ws/webrtc · WebRTC browser audio stream
  • POST /api/outbound-call · Initiate an outbound call
  • POST /api/test-llm · Test LLM with text input
  • POST /api/test-nlu · Test NLU intent detection
  • GET /api/active-calls · List currently active calls
  • GET /api/call-logs · Recent call history
  • GET /api/stats · Dashboard statistics
  • GET /api/compliance-audit · Run HIPAA/PCI compliance audit
  • GET /api/compliance-checklist · Full compliance checklist
  • GET /metrics · Prometheus metrics endpoint
  • GET /health · Health check
  • GET /api/campaigns · List outbound campaigns
  • POST /api/campaigns · Create new campaign
  • POST /api/campaigns/{id}/start · Start dialing a campaign
  • POST /api/dnc · Add number to Do-Not-Call list
🔄 Data Flow & Integration Map
How components communicate within the system.
| Source | Destination | Protocol | Data |
|---|---|---|---|
| Browser / Phone | Nginx | HTTP / WS | Audio frames, JSON commands |
| Nginx | FastAPI (Uvicorn) | Reverse proxy | IP-hash sticky sessions |
| FastAPI | Deepgram | WebSocket | Raw audio → transcript events |
| FastAPI | Gemini / OpenAI / Claude | HTTPS (streaming) | Prompt + history → token stream |
| FastAPI | ElevenLabs / Cartesia | HTTPS / WS | Text → audio chunks |
| FastAPI | Twilio | REST + WS | TwiML, media stream, outbound dial |
| FastAPI | Redis | TCP | Session state, pub/sub events |
| FastAPI | SQLite | File I/O | Config, call logs, orders, appointments |
| Prometheus | FastAPI /metrics | HTTP scrape | Counter + histogram metrics |
| Grafana | Prometheus | HTTP query | PromQL dashboard queries |
📈 Observability & Metrics
Prometheus-format metrics exported at /metrics. Scraped every 15s.
  • voice_turn_latency_ms (histogram): Full turn latency (audio-in to audio-out)
  • voice_stt_latency_ms (histogram): Speech-to-text transcription latency
  • voice_llm_ttft_ms (histogram): LLM time-to-first-token
  • voice_tts_ttfb_ms (histogram): TTS time-to-first-byte
  • voice_nlu_latency_ms (histogram, labeled): NLU intent detection latency by method
  • voice_calls_total (counter): Total calls (inbound + outbound)
  • voice_calls_active (gauge): Currently active concurrent calls
  • voice_barge_in_total (counter): Barge-in (user interruption) count
  • voice_tool_calls_total (counter, labeled): Function calls by tool name
  • voice_cost_per_call_usd (histogram): Estimated cost per call

Circuit Breaker States

  • 🟢 CLOSED: Normal operation; all requests pass through
  • 🔴 OPEN: 3 failures detected; traffic blocked for 30s
  • 🟠 HALF-OPEN: Testing recovery; 1 request allowed through
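The state machine can be sketched as a small class, assuming the 3-failure threshold and 30-second cooldown listed above (the injectable clock exists purely so the sketch is testable; the real middleware/observability.py implementation may differ):

```python
import time

class CircuitBreaker:
    """Sketch of the CLOSED → OPEN → HALF_OPEN breaker described above."""

    def __init__(self, failure_threshold=3, cooldown_s=30, clock=time.monotonic):
        self.state = "CLOSED"
        self.failures = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown_s
        self.opened_at = 0.0
        self._clock = clock               # injectable for testing

    def allow(self) -> bool:
        """True if a request may pass; transitions OPEN → HALF_OPEN after cooldown."""
        if self.state == "OPEN" and self._clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"      # let one probe request through
        return self.state != "OPEN"

    def record_success(self):
        self.state, self.failures = "CLOSED", 0

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open on threshold
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self._clock()
```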

Prometheus & Grafana Monitoring Stack

🔥

Prometheus

deployment/prometheus.yml · Port 9090

Time-series metrics database that scrapes the FastAPI /metrics endpoint. Runs as a Docker service under the monitoring compose profile. Stores 15 days of metrics with 15s scrape interval and supports PromQL queries for alerting.

  • Endpoint: /metrics on voice-agent:8000 (Prometheus exposition format)
  • Scrape: Every 15s with 5s timeout
  • Retention: 15d (configurable via --storage.tsdb.retention.time)
  • Port: 9090 (UI + API at http://localhost:9090)
  • Config: deployment/prometheus.yml — scrape targets, job labels
  • Profile: Docker Compose profile: monitoring
  • Client Lib: prometheus_client (Python) — integrated in observability.py
  • Start: docker compose --profile monitoring up
Prometheus PromQL TSDB Alerting
📊

Grafana

deployment/grafana/ · Port 3000

Dashboard visualization layer that connects to Prometheus as a data source. Ships with auto-provisioned datasource config and pre-built dashboards for voice pipeline monitoring. Access at http://localhost:3000.

  • Port: 3000 (UI at http://localhost:3000)
  • Auth: Default admin/admin (change on first login)
  • Datasource: Auto-provisioned via grafana/provisioning/datasources/prometheus.yml
  • Dashboards: Auto-provisioned from grafana/provisioning/dashboards/
  • Profile: Docker Compose profile: monitoring
  • Start: docker compose --profile monitoring up
Grafana Dashboards Auto-Provision

Instrumentation Layer

middleware/observability.py · 286 lines

The Python application exposes metrics via the prometheus_client library. Counters, histograms, and gauges are registered at import time and updated throughout the pipeline. The /metrics route is mounted as a Starlette sub-app.

  • Library: prometheus_client — CollectorRegistry, generate_latest()
  • Histograms: turn_latency, stt_latency, llm_ttft, tts_ttfb, nlu_latency, cost_per_call
  • Counters: calls_total, barge_in_total, tool_calls_total, errors_total
  • Gauges: calls_active, circuit_breaker_state
  • Labels: provider, intent, tool_name, error_type, tier
  • Route: GET /metrics → Content-Type: text/plain; version=0.0.4
prometheus_client Histograms Counters Gauges
🔎

Key PromQL Queries

Grafana panels & alerting rules

Reference PromQL expressions used in Grafana dashboards and Prometheus alerting rules. These can be tested directly in the Prometheus UI at http://localhost:9090/graph.

  • P95 Turn Latency: histogram_quantile(0.95, rate(voice_turn_latency_ms_bucket[5m]))
  • Call Rate: rate(voice_calls_total[5m])
  • Active Calls: voice_calls_active
  • Error Rate %: rate(voice_errors_total[5m]) / rate(voice_calls_total[5m]) * 100
  • NLU Rate by Tier: sum by (tier) (rate(voice_nlu_latency_ms_count[5m]))
  • LLM TTFT P50: histogram_quantile(0.5, rate(voice_llm_ttft_ms_bucket[5m]))
PromQL Alerts Panels

Monitoring Data Flow

Voice Agent (:8000/metrics) → scrape every 15s → Prometheus (:9090, TSDB) → query → Grafana (:3000, dashboards)
🚀 Deployment Architecture
Docker Compose orchestration with optional profiles for production features.
📦

Docker Services

docker-compose.yml · 91 lines
  • voice-agent: FastAPI server (port 8000) · 2 CPU, 2GB RAM
  • redis: Session cache (port 6379) · 256MB, AOF, LRU
  • rasa: NLU server (port 5005) · profile: with-rasa
  • prometheus: Metrics scraping (port 9090) · profile: monitoring
  • grafana: Dashboards (port 3000) · profile: monitoring
  • nginx: Load balancer (port 80) · profile: production
🛠

CI/CD Pipeline

.github/workflows/ci.yml · 178 lines
  • Stage 1: Lint — Ruff check + format + mypy type check
  • Stage 2: Test — pytest + coverage (with Redis service)
  • Stage 3: Security — Safety + Bandit vulnerability scan
  • Stage 4: Docker — Image build + health check test
  • Stage 5: Deploy — Production deploy (main branch only)
🌐

Nginx Load Balancer

deployment/nginx.conf · 141 lines
  • Strategy: IP hash (caller affinity / sticky sessions)
  • API Rate: 30 req/s per IP (burst 20)
  • WS Rate: 10 req/s per IP
  • WS Timeout: 3600s (1 hour for long calls)
  • Connections: 100 concurrent WebSockets per IP
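These settings correspond roughly to the following nginx fragment. It is a sketch reconstructed from the bullet list above, not the actual deployment/nginx.conf; the upstream name and zone sizes are assumptions.

```nginx
# Sketch reconstructed from the settings above (upstream/zone names assumed)
upstream voice_backend {
    ip_hash;                                  # caller affinity / sticky sessions
    server voice-agent:8000;
}

limit_req_zone  $binary_remote_addr zone=api:10m rate=30r/s;
limit_req_zone  $binary_remote_addr zone=ws:10m  rate=10r/s;
limit_conn_zone $binary_remote_addr zone=ws_conn:10m;

server {
    listen 80;

    location /api/ {
        limit_req zone=api burst=20 nodelay;  # 30 req/s per IP, burst 20
        proxy_pass http://voice_backend;
    }

    location /ws/ {
        limit_req  zone=ws;                   # 10 req/s per IP
        limit_conn ws_conn 100;               # 100 concurrent WebSockets per IP
        proxy_pass http://voice_backend;
        proxy_http_version 1.1;               # required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;             # keep hour-long calls alive
    }
}
```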
📁 Project File Tree
Complete source structure (37 files, ~6,500 lines of application code).
voice_agent/
├── main.py                    # FastAPI server, all endpoints & WebSockets (960 lines)
├── config.py                  # Centralized config with SQLite persistence (266 lines)
├── __init__.py
│
├── engines/                   # Core AI/audio processing engines
│   ├── stt.py                 # Deepgram Nova-2 streaming STT (146 lines)
│   ├── llm.py                 # Multi-provider LLM: Gemini/GPT/Claude (~390 lines)
│   ├── tts.py                 # ElevenLabs + Cartesia TTS with SSML (462 lines)
│   ├── nlu.py                 # 4-tier hybrid NLU + emotion detection (~620 lines)
│   ├── memory.py              # Session + cross-call user memory (234 lines)
│   ├── rag.py                 # FAISS vector search + document chunking (383 lines)
│   ├── realtime_llm.py        # GPT-4o Realtime audio-in/audio-out (346 lines)
│   ├── webrtc.py              # Browser audio resampling 48k→16k (197 lines)
│   ├── __init__.py
│   └── rasa_nlu/              # Rasa NLU training data
│       ├── config.yml
│       ├── domain.yml
│       └── nlu.yml
│
├── pipeline/                  # Orchestration layer
│   ├── voice_pipeline.py      # STT→NLU→LLM→TTS orchestrator (428 lines)
│   └── __init__.py
│
├── telephony/                 # Phone integration
│   ├── twilio_handler.py      # Inbound/outbound Twilio + Media Streams (273 lines)
│   ├── outbound.py            # Campaign manager + AMD + DNC (396 lines)
│   └── __init__.py
│
├── tools/                     # LLM function calling
│   ├── functions.py           # 9 tools + execute_tool router (672 lines)
│   └── __init__.py
│
├── middleware/                # Cross-cutting concerns
│   ├── security.py            # Auth, PII redaction, rate limiting (223 lines)
│   ├── observability.py       # Prometheus metrics + circuit breaker (286 lines)
│   ├── compliance.py          # HIPAA/PCI audit + PII scanner (361 lines)
│   ├── redis_cache.py         # Distributed session cache (253 lines)
│   └── __init__.py
│
├── admin/                     # Web UI pages
│   ├── dashboard.html         # Admin dashboard with config + call logs
│   ├── demo.html              # Text chat demo page
│   ├── voice_demo.html        # Voice demo with mic + TTS playback
│   ├── tracker.html           # Implementation task tracker
│   ├── architecture.html      # This document
│   └── mic-processor.js       # AudioWorklet for mic capture
│
├── deployment/                # Infrastructure configs
│   └── nginx.conf             # Load balancer + WebSocket proxy (141 lines)
│
├── monitoring/                # Observability stack
│   ├── prometheus.yml         # Scrape config (10 lines)
│   └── grafana/provisioning/
│       ├── dashboards/
│       │   └── dashboards.yml
│       └── datasources/
│           └── prometheus.yml
│
├── .github/workflows/         # CI/CD
│   └── ci.yml                 # 5-stage pipeline: lint→test→security→docker→deploy
│
└── docker-compose.yml         # 6 services: app, redis, rasa, prometheus, grafana, nginx
Voice Agent Architecture Document
Generated March 2026 · FastAPI 1.0.0 · Python 3.12