Voice Agent — System Architecture
[Pipeline overview diagram: Twilio mulaw 8 kHz audio in → STT (streaming + VAD) → NLU (Regex → Rasa → SetFit → LLM) → LLM (+ function calls) → TTS (SSML support) → mulaw audio back to Twilio]
Tool Call Feedback Loop
When the LLM invokes a function (e.g. check_order_status), the pipeline executes the tool, feeds results back to the LLM (up to 3 rounds), and disables tools on follow-up rounds to force a natural language response.
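A minimal sketch of that loop, assuming hypothetical `call_llm` and `run_tool` stand-ins for the real engine methods (the actual control flow in the pipeline may differ):

```python
# Hypothetical sketch of the tool-call feedback loop described above.
# `call_llm` and `run_tool` are stand-ins, not the real engine API.
MAX_TOOL_ROUNDS = 3

def run_turn(call_llm, run_tool, messages):
    """Execute tool calls, feed results back, then force a text answer."""
    for _ in range(MAX_TOOL_ROUNDS):
        reply = call_llm(messages, tools_enabled=True)
        if reply.get("tool_call") is None:
            return reply["text"]                  # plain answer: done
        result = run_tool(reply["tool_call"])     # execute the function
        messages.append({"role": "tool", "content": result})
    # After the final round, call once more with tools disabled to
    # force a natural-language response instead of another tool call.
    return call_llm(messages, tools_enabled=False)["text"]
```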
Speech-to-Text Engine
Streaming Deepgram Nova-2 WebSocket integration with VAD events, smart endpointing (800ms), and utterance buffering with debounce logic.
- Class: DeepgramSTTEngine
- Protocol: WebSocket streaming
- Events: Transcript, SpeechStarted, UtteranceEnd
- Formats: mulaw/8kHz (Twilio), linear16/16kHz (browser)
- Features: Smart format, punctuation, filler words, keywords
LLM Engine
Multi-provider streaming LLM with function calling. Supports Gemini, OpenAI, and Anthropic with automatic role merging for Gemini's alternating-role constraint.
- Class: LLMEngine
- Providers: Gemini, OpenAI, Anthropic
- Streaming: Async generator (token, tool_call, done)
- Tools: OpenAI-format function declarations
- Features: system_override, role merging, TTFT logging
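The role merging mentioned above can be sketched as follows: Gemini requires strictly alternating user/model roles, so consecutive same-role messages are collapsed before the request is sent (a simplified illustration, not the engine's actual code):

```python
def merge_roles(messages):
    """Collapse consecutive same-role messages so roles strictly
    alternate, satisfying Gemini's alternating-role constraint."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: fold the content in.
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```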
NLU / Intent Router
4-tier hybrid intent classification: instant regex patterns, Rasa NLU, SetFit transformer, and LLM fallback. Includes emotion detection and entity extraction.
- Tiers: Regex (0ms) → Rasa (<10ms) → SetFit (<5ms) → LLM (~1s)
- Intents: 17 predefined (order, appointment, transfer, etc.)
- Entities: order_id, date, time, purpose, email, phone
- Emotions: 6 classes (neutral, happy, frustrated, sad, angry, confused)
Text-to-Speech Engine
Dual-provider TTS with smart routing. ElevenLabs for premium quality, Cartesia Sonic for ultra-low latency. Includes SSML builder and emotion-aware prosody.
- Providers: ElevenLabs, Cartesia Sonic
- Router: TTSRouter (auto, quality, speed, cost)
- SSML: Breaks, emphasis, prosody, say-as, phoneme
- Formats: mulaw_8000, pcm_16000, mp3
- Features: Voice cloning, multilingual, emotion prosody
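One plausible shape for the routing decision, under assumed heuristics (the real TTSRouter policy is not shown in this document):

```python
def route_tts(mode="auto", text_len=0, latency_budget_ms=300):
    """Illustrative provider selection for the TTS router modes.
    Thresholds here are assumptions, not the production values."""
    if mode == "quality":
        return "elevenlabs"          # premium quality
    if mode in ("speed", "cost"):
        return "cartesia"            # ultra-low latency / cheaper
    # auto: prefer the low-latency provider when the budget is tight
    # or the utterance is short (more barge-in-prone).
    if latency_budget_ms < 500 or text_len < 80:
        return "cartesia"
    return "elevenlabs"
```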
Conversation Memory
Dual-layer memory: session-level turn tracking with auto-summarization, plus persistent cross-call user profiles with preferences and history.
- Session: ConversationMemory (turn tracking, summarization)
- Persistent: UserMemory (name, tier, preferences, call history)
- Storage: SQLite (user_memory, conversation_turns)
- Feature: Auto-summarize when exceeding max_turns
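The auto-summarize behavior can be sketched like this, with `summarize_fn` standing in for an LLM summarization call (a simplified model of the described mechanism, not the actual class):

```python
class ConversationMemory:
    """Sketch: keep the last max_turns turns and fold older turns
    into a running summary via summarize_fn (an LLM call stand-in)."""

    def __init__(self, max_turns, summarize_fn):
        self.max_turns = max_turns
        self.summarize_fn = summarize_fn
        self.turns = []
        self.summary = ""

    def add_turn(self, role, text):
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            # Overflowing turns are summarized, not dropped.
            overflow = self.turns[: -self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = self.summarize_fn(self.summary, overflow)
```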
RAG Knowledge Engine
Retrieval-Augmented Generation with FAISS vector search (or numpy fallback). Loads documents from a knowledge base directory, chunks text, embeds, and retrieves relevant context.
- Vector Store: FAISS IndexFlatIP (or numpy cosine fallback)
- Embeddings: sentence-transformers (all-MiniLM-L6-v2)
- Formats: MD, TXT, HTML, JSON, CSV
- Chunking: 300-char voice-optimized paragraphs
- Caching: FAISS index persistence + content hash invalidation
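The 300-char voice-optimized chunking might look roughly like this: pack whole paragraphs into chunks up to the character budget (an illustrative sketch; paragraphs longer than the budget are not split further here, and the real chunker may behave differently):

```python
def chunk_text(text, max_chars=300):
    """Split text into voice-sized chunks on paragraph boundaries,
    packing paragraphs up to max_chars per chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = para
        else:
            current = (current + "\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```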
Realtime Audio LLM
OpenAI GPT-4o Realtime API integration for direct audio-in/audio-out streaming. Bypasses the separate STT + LLM + TTS pipeline for ultra-low latency.
- Protocol: WebSocket bidirectional audio
- VAD: Server-side voice activity detection
- Tools: Function calling via Realtime API
- Barge-in: Response cancellation support
WebRTC Audio Handler
Browser-to-server audio streaming with format conversion. Resamples 48kHz float32 browser audio to 16kHz int16 for the STT pipeline.
- Input: Float32, 48kHz, mono (browser)
- Output: Int16, 16kHz, mono (STT)
- Resampler: Linear interpolation
- Buffering: 100ms frames
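A pure-Python sketch of the linear-interpolation resampler (the production code presumably vectorizes this; note a real pipeline would also low-pass filter before downsampling to avoid aliasing):

```python
def resample_linear(samples, src_rate=48000, dst_rate=16000):
    """Resample float32 samples in [-1, 1] to int16 at dst_rate
    using linear interpolation. No anti-aliasing filter: sketch only."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Interpolate between the two nearest source samples.
        val = samples[lo] * (1 - frac) + samples[hi] * frac
        out.append(int(max(-1.0, min(1.0, val)) * 32767))
    return out
```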
4-Tier Cascading Classification
Each incoming utterance is evaluated top-down. The first tier to return a high-confidence result short-circuits the cascade, minimizing latency. If no tier reaches the confidence threshold, the utterance falls through to LLM-based classification.
Tier 1 — Regex Patterns
Pre-compiled regular expressions match common, well-defined utterance patterns with zero latency and absolute confidence. Patterns are loaded at startup from a configurable YAML mapping and compiled with re.IGNORECASE.
- Patterns: 50+ compiled regexes across 17 intents
- Latency: ~0ms (in-process string matching)
- Confidence: Always 1.0 on match
- Examples: "check my order", "transfer to agent", "cancel", "yes/no"
- Limitation: Only handles exact phrasings; no semantic understanding
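A minimal sketch of this tier, with an illustrative hand-written pattern table standing in for the YAML mapping loaded at startup:

```python
import re

# Hypothetical examples -- the real patterns live in a YAML mapping.
INTENT_PATTERNS = {
    "check_order": [r"\bcheck (my )?order\b", r"\bwhere('s| is) my order\b"],
    "transfer_agent": [r"\b(transfer|speak) to (an? )?(agent|human)\b"],
    "affirm": [r"^\s*(yes|yeah|yep|sure)\s*$"],
}

# Patterns are pre-compiled once, case-insensitively.
COMPILED = {
    intent: [re.compile(p, re.IGNORECASE) for p in pats]
    for intent, pats in INTENT_PATTERNS.items()
}

def regex_classify(text):
    """Tier 1: return (intent, 1.0) on the first match, else None."""
    for intent, patterns in COMPILED.items():
        if any(p.search(text) for p in patterns):
            return intent, 1.0
    return None
```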
Tier 2 — Rasa NLU
A Rasa NLU server (port 5005) runs the DIET (Dual Intent and Entity Transformer) architecture. It performs joint intent classification and entity extraction using a shared transformer backbone trained on domain-specific examples.
- Architecture: DIET classifier + CRF entity extractor
- Featurizers: WhitespaceTokenizer → CountVectorsFeaturizer → LanguageModelFeaturizer
- Training Data: rasa_nlu/data/nlu.yml (800+ labeled examples)
- Threshold: 0.75 confidence minimum to accept
- Latency: <10ms per utterance
- Entity Types: order_id, date, time, name, phone, email via CRF
Tier 3 — SetFit Transformer
A few-shot contrastive learning model built on sentence-transformers. SetFit fine-tunes a pre-trained embedding model using Siamese networks, then trains a classification head — requiring only 8–16 examples per intent to generalize well to unseen phrasings.
- Base Model: all-MiniLM-L6-v2 (384-dim embeddings)
- Training: Contrastive pairs + logistic regression head
- Few-Shot: 8–16 labeled examples per intent class
- Threshold: 0.70 confidence minimum to accept
- Latency: <5ms (single forward pass, CPU-optimized)
- Advantage: Handles paraphrases and novel phrasings missed by regex/Rasa
Tier 4 — LLM Fallback
When all fast-path tiers miss, a structured prompt is sent to the active LLM provider (Gemini, GPT-4, or Claude). The prompt includes the full intent schema, entity definitions, and conversation context to produce a JSON-structured classification result.
- Prompt: System prompt with intent enum + entity schema + few-shot examples
- Output: JSON: {intent, confidence, entities[], reasoning}
- Providers: Gemini 2.0 Flash (primary), GPT-4o-mini, Claude 3.5 Haiku
- Latency: ~800–1200ms (network round-trip)
- Context: Includes last 3 turns for disambiguation
- Retry: Provider failover with 2s timeout per attempt
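The top-down cascade across the four tiers can be sketched generically: each tier is tried in order, and the first result that meets that tier's threshold short-circuits the rest (tier functions and thresholds below are stand-ins):

```python
def classify(text, tiers):
    """Run tiers top-down; the first result meeting its threshold
    short-circuits the cascade. `tiers` is a list of
    (name, classify_fn, threshold) tuples; the last tier (the LLM
    fallback) is expected to always return a result."""
    for tier_no, (name, fn, threshold) in enumerate(tiers, start=1):
        result = fn(text)
        if result is not None and result[1] >= threshold:
            intent, confidence = result
            return {"intent": intent, "confidence": confidence, "tier": tier_no}
    # Defensive default if even the final tier returned nothing usable.
    return {"intent": "out_of_scope", "confidence": 0.0, "tier": len(tiers)}
```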
Intent Registry
All recognized intents are defined in a central registry with associated metadata: priority level, required entities, follow-up actions, and escalation rules. The registry drives both classification validation and downstream routing.
- Order: check_order, place_order, cancel_order, return_order
- Scheduling: book_appointment, reschedule, cancel_appointment
- Support: transfer_agent, technical_support, billing_inquiry
- General: greeting, goodbye, affirm, deny, faq, out_of_scope
- Special: escalate (auto-triggered by frustration detection)
Entity Extraction
Entities are extracted in parallel by multiple methods and merged with confidence-weighted deduplication. Regex captures structured formats (order IDs, phone numbers), CRF handles contextual spans, and LLM extracts complex or implicit entities.
- order_id: Regex: ORD-\d{6,10} or #\d{6}
- date / time: Duckling-style parser with relative date support ("next Tuesday", "in 2 hours")
- email: RFC 5322 regex with domain validation
- phone: E.164 pattern matching with country code inference
- purpose: Free-text extracted by CRF or LLM ("I need to discuss my refund")
- Merge: Highest-confidence value wins per entity slot
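The confidence-weighted merge can be sketched as a simple highest-confidence-wins fold over the candidates produced by each extraction method (an illustration of the rule above, not the production code):

```python
def merge_entities(*candidate_lists):
    """Confidence-weighted dedup: for each entity slot, keep the
    highest-confidence candidate across extraction methods."""
    merged = {}
    for candidates in candidate_lists:
        for ent in candidates:
            name = ent["name"]
            if name not in merged or ent["confidence"] > merged[name]["confidence"]:
                merged[name] = {
                    "value": ent["value"],
                    "confidence": ent["confidence"],
                    "source": ent["source"],
                }
    return merged
```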
Emotion Detection
Runs alongside intent classification to detect caller emotional state. Uses a lightweight DistilBERT model fine-tuned on call center transcripts. Emotion labels feed into TTS prosody adjustment and agent escalation logic.
- Classes: neutral, happy, frustrated, sad, angry, confused
- Model: DistilBERT-base fine-tuned (6-class softmax)
- Latency: <3ms (runs in parallel with intent classification)
- Escalation: angry > 0.8 or frustrated > 0.85 → auto-escalate intent
- TTS Effect: Adjusts prosody rate, pitch, and empathy phrasing
- Tracking: Emotion trajectory logged per call for analytics
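The escalation rule stated above translates directly into a threshold check over the emotion score vector:

```python
# Thresholds as stated in this document: angry > 0.8, frustrated > 0.85.
ESCALATION_THRESHOLDS = {"angry": 0.8, "frustrated": 0.85}

def should_escalate(emotion_scores):
    """Return True when any tracked emotion exceeds its threshold,
    triggering the auto-escalate intent."""
    return any(
        emotion_scores.get(label, 0.0) > threshold
        for label, threshold in ESCALATION_THRESHOLDS.items()
    )
```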
NLU Output Schema
Every NLU invocation returns a standardized NLUResult object consumed by the dialog manager and LLM prompt builder. The schema ensures consistent downstream handling regardless of which classification tier produced the result.
- intent: str — one of 17 registered intent labels
- confidence: float (0.0–1.0) — classification confidence
- tier: int (1–4) — which tier produced the result
- entities: dict — {entity_name: {value, confidence, source}}
- emotion: str — detected emotion label
- emotion_scores: dict — all 6 class probabilities
- latency_ms: float — total NLU processing time
- raw_text: str — original STT transcript input
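As a dataclass, the schema above might look like this (field names follow the schema; defaults are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class NLUResult:
    """Standardized NLU output consumed by the dialog manager and
    LLM prompt builder, regardless of which tier produced it."""
    intent: str                    # one of the 17 registered labels
    confidence: float              # 0.0-1.0 classification confidence
    tier: int                      # 1-4: which tier produced the result
    entities: dict = field(default_factory=dict)
    emotion: str = "neutral"
    emotion_scores: dict = field(default_factory=dict)
    latency_ms: float = 0.0
    raw_text: str = ""
```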
Twilio Handler
Inbound call webhook (TwiML response), WebSocket media stream handler, outbound dialing, and call session lifecycle management.
- Inbound: TwiML + Media Streams WebSocket
- Audio: base64 mulaw encoding/decoding
- DTMF: Digit handling (0 = transfer to human)
- Barge-in: Mark-based audio sync + clear
- State: active_sessions dict (call_sid → CallSession)
Outbound Campaign Manager
Batch dialing engine with DNC compliance, answering machine detection, TCPA calling hours enforcement, and real-time campaign analytics.
- AMD: Answering machine detection (HUMAN/MACHINE/FAX)
- DNC: Do-Not-Call list management + scrubbing
- TCPA: 9 AM – 9 PM calling hours enforcement
- Concurrency: Semaphore-based rate control
- Analytics: Contact attempt tracking, status filtering
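The TCPA window check reduces to a local-time comparison; a minimal sketch (time-zone conversion to the contact's locale is assumed to happen upstream):

```python
from datetime import datetime

def within_calling_hours(local_dt, start_hour=9, end_hour=21):
    """TCPA-safe window check: True between 9 AM and 9 PM in the
    contact's local time. The caller must pass an already-localized
    datetime; zone conversion is out of scope for this sketch."""
    return start_hour <= local_dt.hour < end_hour
```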
Security Middleware
Twilio signature validation (HMAC-SHA1), PII redaction (SSN, credit cards, DOB, etc.), admin authentication, IP-based rate limiting (200 req/60s).
Observability
Prometheus-format metrics export, circuit breaker pattern for provider failover (CLOSED → OPEN → HALF_OPEN), and latency tracking across all pipeline stages.
Compliance Auditor
Automated HIPAA/PCI-DSS compliance checking with 20 audit controls, PII scanning (10 pattern types with severity levels), risk assessment, and score calculation.
Redis Session Cache
Distributed session management for horizontal scaling. Redis implementation with pub/sub for cross-instance coordination, plus automatic in-memory fallback.
| Tool Name | Parameters | Returns | Description |
|---|---|---|---|
| check_order_status | order_id | Status, items, tracking, ETA | Look up order by ID, return full shipping details |
| schedule_appointment | date, time, name, purpose | Confirmation #, details | Book a new appointment with conflict checking |
| cancel_appointment | confirmation_number | Cancellation status | Cancel an existing appointment |
| reschedule_appointment | confirmation_number, new_date, new_time | Updated details | Change date/time of existing appointment |
| transfer_to_human | department, reason | Queue position, wait time | Request transfer to human agent |
| look_up_account | identifier (phone/email/ID) | Customer profile, history | Find customer record by any identifier |
| get_business_hours | department (optional) | Hours, current status | Check if open/closed, return schedule |
| collect_feedback | rating, comment | Thank you + feedback ID | Record customer satisfaction rating |
| end_call | reason (optional) | Goodbye message | End the conversation politely |
Source: tools/functions.py · 672 lines · Database tables: orders, appointments, customers, feedback, transfers
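For illustration, an OpenAI-format declaration for the first tool might look like this (the description and parameter text are hypothetical, not copied from tools/functions.py):

```python
# Hypothetical OpenAI-format function declaration for check_order_status.
CHECK_ORDER_STATUS = {
    "type": "function",
    "function": {
        "name": "check_order_status",
        "description": "Look up an order by ID and return shipping details.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order ID, e.g. ORD-123456",
                }
            },
            "required": ["order_id"],
        },
    },
}
```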
| Source | Destination | Protocol | Data |
|---|---|---|---|
| Browser / Phone | Nginx | HTTP / WS | Audio frames, JSON commands |
| Nginx | FastAPI (Uvicorn) | Reverse proxy | IP-hash sticky sessions |
| FastAPI | Deepgram | WebSocket | Raw audio → transcript events |
| FastAPI | Gemini / OpenAI / Claude | HTTPS (streaming) | Prompt + history → token stream |
| FastAPI | ElevenLabs / Cartesia | HTTPS / WS | Text → audio chunks |
| FastAPI | Twilio | REST + WS | TwiML, media stream, outbound dial |
| FastAPI | Redis | TCP | Session state, pub/sub events |
| FastAPI | SQLite | File I/O | Config, call logs, orders, appointments |
| Prometheus | FastAPI /metrics | HTTP scrape | Counter + histogram metrics |
| Grafana | Prometheus | HTTP query | PromQL dashboard queries |
Metrics are exposed at /metrics and scraped every 15s.
Circuit Breaker States
- CLOSED: All requests pass through
- OPEN: Traffic blocked for 30s
- HALF_OPEN: 1 request allowed through
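A compact sketch of that state machine, with assumed failure and cooldown parameters (the injectable `clock` is just for testability):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN sketch: open after max_failures
    consecutive failures, block for cooldown_s, then allow one probe."""

    def __init__(self, max_failures=3, cooldown_s=30, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def allow(self):
        """Should this request be attempted against the provider?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"   # cooldown elapsed: probe
                return True
            return False                   # still blocking traffic
        return True

    def record(self, success):
        """Report the outcome of an attempted request."""
        if success:
            self.failures, self.state = 0, "CLOSED"
        else:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.max_failures:
                self.state, self.opened_at = "OPEN", self.clock()
```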
Prometheus & Grafana Monitoring Stack
Prometheus
Time-series metrics database that scrapes the FastAPI /metrics endpoint. Runs as a Docker service under the monitoring compose profile. Stores 15 days of metrics with 15s scrape interval and supports PromQL queries for alerting.
- Endpoint: /metrics on voice-agent:8000 (Prometheus exposition format)
- Scrape: Every 15s with 5s timeout
- Retention: 15d (configurable via --storage.tsdb.retention.time)
- Port: 9090 (UI + API at http://localhost:9090)
- Config: deployment/prometheus.yml — scrape targets, job labels
- Profile: Docker Compose profile: monitoring
- Client Lib: prometheus_client (Python) — integrated in observability.py
- Start: docker compose --profile monitoring up
Grafana
Dashboard visualization layer that connects to Prometheus as a data source. Ships with auto-provisioned datasource config and pre-built dashboards for voice pipeline monitoring. Access at http://localhost:3000.
- Port: 3000 (UI at http://localhost:3000)
- Auth: Default admin/admin (change on first login)
- Datasource: Auto-provisioned via grafana/provisioning/datasources/prometheus.yml
- Dashboards: Auto-provisioned from grafana/provisioning/dashboards/
- Profile: Docker Compose profile: monitoring
- Start: docker compose --profile monitoring up
Instrumentation Layer
The Python application exposes metrics via the prometheus_client library. Counters, histograms, and gauges are registered at import time and updated throughout the pipeline. The /metrics route is mounted as a Starlette sub-app.
- Library: prometheus_client — CollectorRegistry, generate_latest()
- Histograms: turn_latency, stt_latency, llm_ttft, tts_ttfb, nlu_latency, cost_per_call
- Counters: calls_total, barge_in_total, tool_calls_total, errors_total
- Gauges: calls_active, circuit_breaker_state
- Labels: provider, intent, tool_name, error_type, tier
- Route: GET /metrics → Content-Type: text/plain; version=0.0.4
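To illustrate what the exposition format looks like on the wire, here is a hand-rolled sketch of a single counter sample in the text/plain; version=0.0.4 format (in practice prometheus_client's generate_latest() produces this; the HELP text is a placeholder):

```python
def render_counter(name, value, labels):
    """Render one counter sample in the Prometheus text exposition
    format: HELP and TYPE comment lines, then the labeled sample."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} Total count.\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )
```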
Key PromQL Queries
Reference PromQL expressions used in Grafana dashboards and Prometheus alerting rules. These can be tested directly in the Prometheus UI at http://localhost:9090/graph.
- P95 Turn Latency: histogram_quantile(0.95, rate(voice_turn_latency_ms_bucket[5m]))
- Call Rate: rate(voice_calls_total[5m])
- Active Calls: voice_calls_active
- Error Rate %: rate(voice_errors_total[5m]) / rate(voice_calls_total[5m]) * 100
- NLU by Tier: sum by (tier) (rate(voice_nlu_latency_ms_count[5m]))
- LLM TTFT P50: histogram_quantile(0.5, rate(voice_llm_ttft_ms_bucket[5m]))
Monitoring Data Flow
Docker Services
- voice-agent: FastAPI server (port 8000) · 2 CPU, 2GB RAM
- redis: Session cache (port 6379) · 256MB, AOF, LRU
- rasa: NLU server (port 5005) · profile: with-rasa
- prometheus: Metrics scraping (port 9090) · profile: monitoring
- grafana: Dashboards (port 3000) · profile: monitoring
- nginx: Load balancer (port 80) · profile: production
CI/CD Pipeline
- Stage 1: Lint — Ruff check + format + mypy type check
- Stage 2: Test — pytest + coverage (with Redis service)
- Stage 3: Security — Safety + Bandit vulnerability scan
- Stage 4: Docker — Image build + health check test
- Stage 5: Deploy — Production deploy (main branch only)
Nginx Load Balancer
- Strategy: IP hash (caller affinity / sticky sessions)
- API Rate: 30 req/s per IP (burst 20)
- WS Rate: 10 req/s per IP
- WS Timeout: 3600s (1 hour for long calls)
- Connections: 100 concurrent WebSockets per IP