Voice Agent — System Architecture
[Pipeline overview diagram: Twilio mulaw 8 kHz audio in → STT (streaming + VAD) → NLU (Regex → Rasa → SetFit → LLM) → LLM (+ function calls) → TTS (SSML support) → mulaw audio back to Twilio]
Tool Call Feedback Loop
When the LLM invokes a function (e.g. check_order_status), the pipeline executes the tool, feeds results back to the LLM (up to 3 rounds), and disables tools on follow-up rounds to force a natural language response.
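A minimal sketch of that loop, assuming hypothetical `call_llm` and `run_tool` stand-ins for the real engine methods (the actual control flow in the pipeline may differ):

```python
# Hypothetical sketch of the tool-call feedback loop described above.
# `call_llm` and `run_tool` are stand-ins, not the real engine API.
MAX_TOOL_ROUNDS = 3

def run_turn(call_llm, run_tool, messages):
    """Execute tool calls, feed results back, then force a text answer."""
    for _ in range(MAX_TOOL_ROUNDS):
        reply = call_llm(messages, tools_enabled=True)
        if reply.get("tool_call") is None:
            return reply["text"]                  # plain answer: done
        result = run_tool(reply["tool_call"])     # execute the function
        messages.append({"role": "tool", "content": result})
    # After the final round, call once more with tools disabled to
    # force a natural-language response instead of another tool call.
    return call_llm(messages, tools_enabled=False)["text"]
```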
Speech-to-Text Engine
Streaming Deepgram Nova-2 WebSocket integration with VAD events, smart endpointing (800ms), and utterance buffering with debounce logic.
- Class: DeepgramSTTEngine
- Protocol: WebSocket streaming
- Events: Transcript, SpeechStarted, UtteranceEnd
- Formats: mulaw/8kHz (Twilio), linear16/16kHz (browser)
- Features: Smart format, punctuation, filler words, keywords
LLM Engine
Multi-provider streaming LLM with function calling. Supports Gemini, OpenAI, and Anthropic with automatic role merging for Gemini's alternating-role constraint.
- Class: LLMEngine
- Providers: Gemini, OpenAI, Anthropic
- Streaming: Async generator (token, tool_call, done)
- Tools: OpenAI-format function declarations
- Features: system_override, role merging, TTFT logging
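The role merging mentioned above can be sketched as follows: Gemini requires strictly alternating user/model roles, so consecutive same-role messages are collapsed before the request is sent (a simplified illustration, not the engine's actual code):

```python
def merge_roles(messages):
    """Collapse consecutive same-role messages so roles strictly
    alternate, satisfying Gemini's alternating-role constraint."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: fold the content in.
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```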
NLU / Intent Router
4-tier hybrid intent classification: instant regex patterns, Rasa NLU, SetFit transformer, and LLM fallback. Includes emotion detection and entity extraction.
- Tiers: Regex (0ms) → Rasa (<10ms) → SetFit (<5ms) → LLM (~1s)
- Intents: 17 predefined (order, appointment, transfer, etc.)
- Entities: order_id, date, time, purpose, email, phone
- Emotions: 6 classes (neutral, happy, frustrated, sad, angry, confused)
Text-to-Speech Engine
Dual-provider TTS with smart routing. ElevenLabs for premium quality, Cartesia Sonic for ultra-low latency. Includes SSML builder and emotion-aware prosody.
- Providers: ElevenLabs, Cartesia Sonic
- Router: TTSRouter (auto, quality, speed, cost)
- SSML: Breaks, emphasis, prosody, say-as, phoneme
- Formats: mulaw_8000, pcm_16000, mp3
- Features: Voice cloning, multilingual, emotion prosody
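One plausible shape for the routing decision, under assumed heuristics (the real TTSRouter policy is not shown in this document):

```python
def route_tts(mode="auto", text_len=0, latency_budget_ms=300):
    """Illustrative provider selection for the TTS router modes.
    Thresholds here are assumptions, not the production values."""
    if mode == "quality":
        return "elevenlabs"          # premium quality
    if mode in ("speed", "cost"):
        return "cartesia"            # ultra-low latency / cheaper
    # auto: prefer the low-latency provider when the budget is tight
    # or the utterance is short (more barge-in-prone).
    if latency_budget_ms < 500 or text_len < 80:
        return "cartesia"
    return "elevenlabs"
```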
Conversation Memory
Dual-layer memory: session-level turn tracking with auto-summarization, plus persistent cross-call user profiles with preferences and history.
- Session: ConversationMemory (turn tracking, summarization)
- Persistent: UserMemory (name, tier, preferences, call history)
- Storage: SQLite (user_memory, conversation_turns)
- Feature: Auto-summarize when exceeding max_turns
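The auto-summarize behavior can be sketched like this, with `summarize_fn` standing in for an LLM summarization call (a simplified model of the described mechanism, not the actual class):

```python
class ConversationMemory:
    """Sketch: keep the last max_turns turns and fold older turns
    into a running summary via summarize_fn (an LLM call stand-in)."""

    def __init__(self, max_turns, summarize_fn):
        self.max_turns = max_turns
        self.summarize_fn = summarize_fn
        self.turns = []
        self.summary = ""

    def add_turn(self, role, text):
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            # Overflowing turns are summarized, not dropped.
            overflow = self.turns[: -self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = self.summarize_fn(self.summary, overflow)
```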
RAG Knowledge Engine
Retrieval-Augmented Generation with FAISS vector search (or numpy fallback). Loads documents from a knowledge base directory, chunks text, embeds, and retrieves relevant context.
- Vector Store: FAISS IndexFlatIP (or numpy cosine fallback)
- Embeddings: sentence-transformers (all-MiniLM-L6-v2)
- Formats: MD, TXT, HTML, JSON, CSV
- Chunking: 300-char voice-optimized paragraphs
- Caching: FAISS index persistence + content hash invalidation
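The 300-char voice-optimized chunking might look roughly like this: pack whole paragraphs into chunks up to the character budget (an illustrative sketch; paragraphs longer than the budget are not split further here, and the real chunker may behave differently):

```python
def chunk_text(text, max_chars=300):
    """Split text into voice-sized chunks on paragraph boundaries,
    packing paragraphs up to max_chars per chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = para
        else:
            current = (current + "\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```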
Realtime Audio LLM
OpenAI GPT-4o Realtime API integration for direct audio-in/audio-out streaming. Bypasses the separate STT + LLM + TTS pipeline for ultra-low latency.
- Protocol: WebSocket bidirectional audio
- VAD: Server-side voice activity detection
- Tools: Function calling via Realtime API
- Barge-in: Response cancellation support
WebRTC Audio Handler
Browser-to-server audio streaming with format conversion. Resamples 48kHz float32 browser audio to 16kHz int16 for the STT pipeline.
- Input: Float32, 48kHz, mono (browser)
- Output: Int16, 16kHz, mono (STT)
- Resampler: Linear interpolation
- Buffering: 100ms frames
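A pure-Python sketch of the linear-interpolation resampler (the production code presumably vectorizes this; note a real pipeline would also low-pass filter before downsampling to avoid aliasing):

```python
def resample_linear(samples, src_rate=48000, dst_rate=16000):
    """Resample float32 samples in [-1, 1] to int16 at dst_rate
    using linear interpolation. No anti-aliasing filter: sketch only."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Interpolate between the two nearest source samples.
        val = samples[lo] * (1 - frac) + samples[hi] * frac
        out.append(int(max(-1.0, min(1.0, val)) * 32767))
    return out
```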
4-Tier Cascading Classification
Each incoming utterance is evaluated top-down. The first tier to return a high-confidence result short-circuits the cascade, minimizing latency. If no tier reaches the confidence threshold, the utterance falls through to LLM-based classification.
Tier 1 — Regex Patterns
Pre-compiled regular expressions match common, well-defined utterance patterns with zero latency and absolute confidence. Patterns are loaded at startup from a configurable YAML mapping and compiled with re.IGNORECASE.
- Patterns: 50+ compiled regexes across 17 intents
- Latency: ~0ms (in-process string matching)
- Confidence: Always 1.0 on match
- Examples: "check my order", "transfer to agent", "cancel", "yes/no"
- Limitation: Only handles exact phrasings; no semantic understanding
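A minimal sketch of this tier, with an illustrative hand-written pattern table standing in for the YAML mapping loaded at startup:

```python
import re

# Hypothetical examples -- the real patterns live in a YAML mapping.
INTENT_PATTERNS = {
    "check_order": [r"\bcheck (my )?order\b", r"\bwhere('s| is) my order\b"],
    "transfer_agent": [r"\b(transfer|speak) to (an? )?(agent|human)\b"],
    "affirm": [r"^\s*(yes|yeah|yep|sure)\s*$"],
}

# Patterns are pre-compiled once, case-insensitively.
COMPILED = {
    intent: [re.compile(p, re.IGNORECASE) for p in pats]
    for intent, pats in INTENT_PATTERNS.items()
}

def regex_classify(text):
    """Tier 1: return (intent, 1.0) on the first match, else None."""
    for intent, patterns in COMPILED.items():
        if any(p.search(text) for p in patterns):
            return intent, 1.0
    return None
```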
Tier 2 — Rasa NLU
A Rasa NLU server (port 5005) runs the DIET (Dual Intent and Entity Transformer) architecture. It performs joint intent classification and entity extraction using a shared transformer backbone trained on domain-specific examples.
- Architecture: DIET classifier + CRF entity extractor
- Featurizers: WhitespaceTokenizer → CountVectorsFeaturizer → LanguageModelFeaturizer
- Training Data: rasa_nlu/data/nlu.yml (800+ labeled examples)
- Threshold: 0.75 confidence minimum to accept
- Latency: <10ms per utterance
- Entity Types: order_id, date, time, name, phone, email via CRF
Tier 3 — SetFit Transformer
A few-shot contrastive learning model built on sentence-transformers. SetFit fine-tunes a pre-trained embedding model using Siamese networks, then trains a classification head — requiring only 8–16 examples per intent to generalize well to unseen phrasings.
- Base Model: all-MiniLM-L6-v2 (384-dim embeddings)
- Training: Contrastive pairs + logistic regression head
- Few-Shot: 8–16 labeled examples per intent class
- Threshold: 0.70 confidence minimum to accept
- Latency: <5ms (single forward pass, CPU-optimized)
- Advantage: Handles paraphrases and novel phrasings missed by regex/Rasa
Tier 4 — LLM Fallback
When all fast-path tiers miss, a structured prompt is sent to the active LLM provider (Gemini, GPT-4, or Claude). The prompt includes the full intent schema, entity definitions, and conversation context to produce a JSON-structured classification result.
- Prompt: System prompt with intent enum + entity schema + few-shot examples
- Output: JSON: {intent, confidence, entities[], reasoning}
- Providers: Gemini 2.0 Flash (primary), GPT-4o-mini, Claude 3.5 Haiku
- Latency: ~800–1200ms (network round-trip)
- Context: Includes last 3 turns for disambiguation
- Retry: Provider failover with 2s timeout per attempt
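The top-down cascade across the four tiers can be sketched generically: each tier is tried in order, and the first result that meets that tier's threshold short-circuits the rest (tier functions and thresholds below are stand-ins):

```python
def classify(text, tiers):
    """Run tiers top-down; the first result meeting its threshold
    short-circuits the cascade. `tiers` is a list of
    (name, classify_fn, threshold) tuples; the last tier (the LLM
    fallback) is expected to always return a result."""
    for tier_no, (name, fn, threshold) in enumerate(tiers, start=1):
        result = fn(text)
        if result is not None and result[1] >= threshold:
            intent, confidence = result
            return {"intent": intent, "confidence": confidence, "tier": tier_no}
    # Defensive default if even the final tier returned nothing usable.
    return {"intent": "out_of_scope", "confidence": 0.0, "tier": len(tiers)}
```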
Intent Registry
All recognized intents are defined in a central registry with associated metadata: priority level, required entities, follow-up actions, and escalation rules. The registry drives both classification validation and downstream routing.
- Order: check_order, place_order, cancel_order, return_order
- Scheduling: book_appointment, reschedule, cancel_appointment
- Support: transfer_agent, technical_support, billing_inquiry
- General: greeting, goodbye, affirm, deny, faq, out_of_scope
- Special: escalate (auto-triggered by frustration detection)
Entity Extraction
Entities are extracted in parallel by multiple methods and merged with confidence-weighted deduplication. Regex captures structured formats (order IDs, phone numbers), CRF handles contextual spans, and LLM extracts complex or implicit entities.
- order_id: Regex: ORD-\d{6,10} or #\d{6}
- date / time: Duckling-style parser with relative date support ("next Tuesday", "in 2 hours")
- email: RFC 5322 regex with domain validation
- phone: E.164 pattern matching with country code inference
- purpose: Free-text extracted by CRF or LLM ("I need to discuss my refund")
- Merge: Highest-confidence value wins per entity slot
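The confidence-weighted merge can be sketched as a simple highest-confidence-wins fold over the candidates produced by each extraction method (an illustration of the rule above, not the production code):

```python
def merge_entities(*candidate_lists):
    """Confidence-weighted dedup: for each entity slot, keep the
    highest-confidence candidate across extraction methods."""
    merged = {}
    for candidates in candidate_lists:
        for ent in candidates:
            name = ent["name"]
            if name not in merged or ent["confidence"] > merged[name]["confidence"]:
                merged[name] = {
                    "value": ent["value"],
                    "confidence": ent["confidence"],
                    "source": ent["source"],
                }
    return merged
```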
Emotion Detection
Runs alongside intent classification to detect caller emotional state. Uses a lightweight DistilBERT model fine-tuned on call center transcripts. Emotion labels feed into TTS prosody adjustment and agent escalation logic.
- Classes: neutral, happy, frustrated, sad, angry, confused
- Model: DistilBERT-base fine-tuned (6-class softmax)
- Latency: <3ms (runs in parallel with intent classification)
- Escalation: angry > 0.8 or frustrated > 0.85 → auto-escalate intent
- TTS Effect: Adjusts prosody rate, pitch, and empathy phrasing
- Tracking: Emotion trajectory logged per call for analytics
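The escalation rule stated above translates directly into a threshold check over the emotion score vector:

```python
# Thresholds as stated in this document: angry > 0.8, frustrated > 0.85.
ESCALATION_THRESHOLDS = {"angry": 0.8, "frustrated": 0.85}

def should_escalate(emotion_scores):
    """Return True when any tracked emotion exceeds its threshold,
    triggering the auto-escalate intent."""
    return any(
        emotion_scores.get(label, 0.0) > threshold
        for label, threshold in ESCALATION_THRESHOLDS.items()
    )
```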
NLU Output Schema
Every NLU invocation returns a standardized NLUResult object consumed by the dialog manager and LLM prompt builder. The schema ensures consistent downstream handling regardless of which classification tier produced the result.
- intent: str — one of 17 registered intent labels
- confidence: float (0.0–1.0) — classification confidence
- tier: int (1–4) — which tier produced the result
- entities: dict — {entity_name: {value, confidence, source}}
- emotion: str — detected emotion label
- emotion_scores: dict — all 6 class probabilities
- latency_ms: float — total NLU processing time
- raw_text: str — original STT transcript input
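As a dataclass, the schema above might look like this (field names follow the schema; defaults are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class NLUResult:
    """Standardized NLU output consumed by the dialog manager and
    LLM prompt builder, regardless of which tier produced it."""
    intent: str                    # one of the 17 registered labels
    confidence: float              # 0.0-1.0 classification confidence
    tier: int                      # 1-4: which tier produced the result
    entities: dict = field(default_factory=dict)
    emotion: str = "neutral"
    emotion_scores: dict = field(default_factory=dict)
    latency_ms: float = 0.0
    raw_text: str = ""
```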
Twilio Handler
Inbound call webhook (TwiML response), WebSocket media stream handler, outbound dialing, and call session lifecycle management.
- Inbound: TwiML + Media Streams WebSocket
- Audio: base64 mulaw encoding/decoding
- DTMF: Digit handling (0 = transfer to human)
- Barge-in: Mark-based audio sync + clear
- State: active_sessions dict (call_sid → CallSession)
Outbound Campaign Manager
Batch dialing engine with DNC compliance, answering machine detection, TCPA calling hours enforcement, and real-time campaign analytics.
- AMD: Answering machine detection (HUMAN/MACHINE/FAX)
- DNC: Do-Not-Call list management + scrubbing
- TCPA: 9 AM – 9 PM calling hours enforcement
- Concurrency: Semaphore-based rate control
- Analytics: Contact attempt tracking, status filtering
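The TCPA window check reduces to a local-time comparison; a minimal sketch (time-zone conversion to the contact's locale is assumed to happen upstream):

```python
from datetime import datetime

def within_calling_hours(local_dt, start_hour=9, end_hour=21):
    """TCPA-safe window check: True between 9 AM and 9 PM in the
    contact's local time. The caller must pass an already-localized
    datetime; zone conversion is out of scope for this sketch."""
    return start_hour <= local_dt.hour < end_hour
```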
Security Middleware
Twilio signature validation (HMAC-SHA1), PII redaction (SSN, credit cards, DOB, etc.), admin authentication, IP-based rate limiting (200 req/60s).
Observability
Prometheus-format metrics export, circuit breaker pattern for provider failover (CLOSED → OPEN → HALF_OPEN), and latency tracking across all pipeline stages.
Compliance Auditor
Automated HIPAA/PCI-DSS compliance checking with 20 audit controls, PII scanning (10 pattern types with severity levels), risk assessment, and score calculation.
Redis Session Cache
Distributed session management for horizontal scaling. Redis implementation with pub/sub for cross-instance coordination, plus automatic in-memory fallback.
| Tool Name | Parameters | Returns | Description |
|---|---|---|---|
| check_order_status | order_id | Status, items, tracking, ETA | Look up order by ID, return full shipping details |
| schedule_appointment | date, time, name, purpose | Confirmation #, details | Book a new appointment with conflict checking |
| cancel_appointment | confirmation_number | Cancellation status | Cancel an existing appointment |
| reschedule_appointment | confirmation_number, new_date, new_time | Updated details | Change date/time of existing appointment |
| transfer_to_human | department, reason | Queue position, wait time | Request transfer to human agent |
| look_up_account | identifier (phone/email/ID) | Customer profile, history | Find customer record by any identifier |
| get_business_hours | department (optional) | Hours, current status | Check if open/closed, return schedule |
| collect_feedback | rating, comment | Thank you + feedback ID | Record customer satisfaction rating |
| end_call | reason (optional) | Goodbye message | End the conversation politely |
Source: tools/functions.py · 672 lines · Database tables: orders, appointments, customers, feedback, transfers
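For illustration, an OpenAI-format declaration for the first tool might look like this (the description and parameter text are hypothetical, not copied from tools/functions.py):

```python
# Hypothetical OpenAI-format function declaration for check_order_status.
CHECK_ORDER_STATUS = {
    "type": "function",
    "function": {
        "name": "check_order_status",
        "description": "Look up an order by ID and return shipping details.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order ID, e.g. ORD-123456",
                }
            },
            "required": ["order_id"],
        },
    },
}
```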
| Source | Destination | Protocol | Data |
|---|---|---|---|
| Browser / Phone | Nginx | HTTP / WS | Audio frames, JSON commands |
| Nginx | FastAPI (Uvicorn) | Reverse proxy | IP-hash sticky sessions |
| FastAPI | Deepgram | WebSocket | Raw audio → transcript events |
| FastAPI | Gemini / OpenAI / Claude | HTTPS (streaming) | Prompt + history → token stream |
| FastAPI | ElevenLabs / Cartesia | HTTPS / WS | Text → audio chunks |
| FastAPI | Twilio | REST + WS | TwiML, media stream, outbound dial |
| FastAPI | Redis | TCP | Session state, pub/sub events |
| FastAPI | SQLite | File I/O | Config, call logs, orders, appointments |
| Prometheus | FastAPI /metrics | HTTP scrape | Counter + histogram metrics |
| Grafana | Prometheus | HTTP query | PromQL dashboard queries |
Metrics are exposed at /metrics and scraped every 15s.
Circuit Breaker States
- CLOSED: All requests pass through
- OPEN: Traffic blocked for 30s
- HALF_OPEN: 1 request allowed through
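A compact sketch of that state machine, with assumed failure and cooldown parameters (the injectable `clock` is just for testability):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN sketch: open after max_failures
    consecutive failures, block for cooldown_s, then allow one probe."""

    def __init__(self, max_failures=3, cooldown_s=30, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def allow(self):
        """Should this request be attempted against the provider?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"   # cooldown elapsed: probe
                return True
            return False                   # still blocking traffic
        return True

    def record(self, success):
        """Report the outcome of an attempted request."""
        if success:
            self.failures, self.state = 0, "CLOSED"
        else:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.max_failures:
                self.state, self.opened_at = "OPEN", self.clock()
```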
Prometheus & Grafana Monitoring Stack
Prometheus
Time-series metrics database that scrapes the FastAPI /metrics endpoint. Runs as a Docker service under the monitoring compose profile. Stores 15 days of metrics with 15s scrape interval and supports PromQL queries for alerting.
- Endpoint: /metrics on voice-agent:8000 (Prometheus exposition format)
- Scrape: Every 15s with 5s timeout
- Retention: 15d (configurable via --storage.tsdb.retention.time)
- Port: 9090 (UI + API at http://localhost:9090)
- Config: deployment/prometheus.yml — scrape targets, job labels
- Profile: Docker Compose profile: monitoring
- Client Lib: prometheus_client (Python) — integrated in observability.py
- Start: docker compose --profile monitoring up
Grafana
Dashboard visualization layer that connects to Prometheus as a data source. Ships with auto-provisioned datasource config and pre-built dashboards for voice pipeline monitoring. Access at http://localhost:3000.
- Port: 3000 (UI at http://localhost:3000)
- Auth: Default admin/admin (change on first login)
- Datasource: Auto-provisioned via grafana/provisioning/datasources/prometheus.yml
- Dashboards: Auto-provisioned from grafana/provisioning/dashboards/
- Profile: Docker Compose profile: monitoring
- Start: docker compose --profile monitoring up
Instrumentation Layer
The Python application exposes metrics via the prometheus_client library. Counters, histograms, and gauges are registered at import time and updated throughout the pipeline. The /metrics route is mounted as a Starlette sub-app.
- Library: prometheus_client — CollectorRegistry, generate_latest()
- Histograms: turn_latency, stt_latency, llm_ttft, tts_ttfb, nlu_latency, cost_per_call
- Counters: calls_total, barge_in_total, tool_calls_total, errors_total
- Gauges: calls_active, circuit_breaker_state
- Labels: provider, intent, tool_name, error_type, tier
- Route: GET /metrics → Content-Type: text/plain; version=0.0.4
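To illustrate what the exposition format looks like on the wire, here is a hand-rolled sketch of a single counter sample in the text/plain; version=0.0.4 format (in practice prometheus_client's generate_latest() produces this; the HELP text is a placeholder):

```python
def render_counter(name, value, labels):
    """Render one counter sample in the Prometheus text exposition
    format: HELP and TYPE comment lines, then the labeled sample."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} Total count.\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )
```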
Key PromQL Queries
Reference PromQL expressions used in Grafana dashboards and Prometheus alerting rules. These can be tested directly in the Prometheus UI at http://localhost:9090/graph.
- P95 Turn Latency: histogram_quantile(0.95, rate(voice_turn_latency_ms_bucket[5m]))
- Call Rate: rate(voice_calls_total[5m])
- Active Calls: voice_calls_active
- Error Rate %: rate(voice_errors_total[5m]) / rate(voice_calls_total[5m]) * 100
- NLU by Tier: sum by (tier) (rate(voice_nlu_latency_ms_count[5m]))
- LLM TTFT P50: histogram_quantile(0.5, rate(voice_llm_ttft_ms_bucket[5m]))
Monitoring Data Flow
Docker Services
- voice-agent: FastAPI server (port 8000) · 2 CPU, 2GB RAM
- redis: Session cache (port 6379) · 256MB, AOF, LRU
- rasa: NLU server (port 5005) · profile: with-rasa
- prometheus: Metrics scraping (port 9090) · profile: monitoring
- grafana: Dashboards (port 3000) · profile: monitoring
- nginx: Load balancer (port 80) · profile: production
CI/CD Pipeline
- Stage 1: Lint — Ruff check + format + mypy type check
- Stage 2: Test — pytest + coverage (with Redis service)
- Stage 3: Security — Safety + Bandit vulnerability scan
- Stage 4: Docker — Image build + health check test
- Stage 5: Deploy — Production deploy (main branch only)
Nginx Load Balancer
- Strategy: IP hash (caller affinity / sticky sessions)
- API Rate: 30 req/s per IP (burst 20)
- WS Rate: 10 req/s per IP
- WS Timeout: 3600s (1 hour for long calls)
- Connections: 100 concurrent WebSockets per IP