Building a Production-Ready AI Voice Agent: A Deep Technical Implementation Guide
The landscape of voice AI has shifted dramatically with the emergence of real-time conversational APIs. Traditional pipeline architectures—where speech-to-text, language models, and text-to-speech operate sequentially—are giving way to direct audio-to-audio processing systems that promise sub-second latency and natural conversational dynamics. This comprehensive guide explores the technical implementation of a production-ready AI voice agent using LiveKit’s real-time communication infrastructure, Twilio’s telephony gateway, and Google’s Gemini Live API.
Architecture Overview: Beyond Traditional Pipelines
The Paradigm Shift in Voice AI
Traditional voice assistants operate on a sequential pipeline: audio comes in, gets transcribed to text, processed by a language model, then synthesized back to speech. Each step adds latency, typically resulting in 2-3 second response times that feel unnatural in conversation. The new generation of real-time APIs processes audio streams directly, maintaining conversational context while generating responses with latencies approaching human conversation speeds (200-500ms).
Core Technology Stack
The system architecture consists of four primary components:
LiveKit Server acts as the central nervous system, managing WebRTC-based real-time media transport. It handles room management, participant orchestration, and provides the framework for server-side agents. The choice between LiveKit Cloud and self-hosted deployment significantly impacts operational complexity—Cloud abstracts infrastructure management but requires careful cost monitoring at scale.
Twilio SIP Trunking bridges the gap between modern WebRTC infrastructure and the Public Switched Telephone Network (PSTN). This enables the AI agent to handle standard phone calls, making it accessible to anyone with a phone rather than requiring specialized apps or web interfaces.
LiveKit Agents Framework provides the Python-based runtime for implementing server-side logic. Agents run as persistent processes that join LiveKit rooms as participants, managing the bidirectional flow of audio and conversation state.
Google Gemini Live API delivers the conversational intelligence through direct audio processing. Unlike traditional LLMs that require text input, Gemini Live accepts raw audio streams and generates both audio responses and real-time transcriptions, all while maintaining conversational context across the session.
Phase 1: Infrastructure Configuration Deep Dive
LiveKit Server Setup Considerations
The deployment model choice between LiveKit Cloud and self-hosting involves critical trade-offs:
LiveKit Cloud provides global edge infrastructure with automatic scaling and built-in analytics. The WebSocket endpoint (typically wss://your-project.livekit.cloud) handles millions of concurrent connections without infrastructure management. However, costs scale linearly with usage, and debugging is limited to the provided analytics.
Self-hosting requires provisioning infrastructure capable of handling WebRTC’s demanding requirements:
- UDP ports 50000-60000 for media transport
- TCP port 7880 for WebSocket signaling
- Redis for multi-node coordination
- Host networking for Kubernetes deployments (limiting one LiveKit pod per node)
Critical configuration parameters that frequently cause issues:
```yaml
# LiveKit server config for self-hosting
port: 7880
rtc:
  port_range_start: 50000
  port_range_end: 60000
  use_external_ip: true  # critical for cloud deployments
redis:
  address: redis-cluster:6379
  use_tls: true
```
Twilio SIP Trunk Architecture
The SIP trunk configuration requires precise coordination between Twilio and LiveKit:
Trunk Domain Requirements: The domain MUST end with pstn.twilio.com (e.g., my-agent-trunk.pstn.twilio.com). This isn't just a naming convention—it's a routing requirement within Twilio's infrastructure.
Authentication Flow: Twilio uses digest authentication for SIP. The credential list created in Twilio must exactly match the credentials configured in LiveKit’s outbound trunk:
```python
# LiveKit outbound trunk configuration
outbound_trunk = {
    "address": "my-agent-trunk.pstn.twilio.com",
    "auth_username": "your-secure-username",
    "auth_password": "your-secure-password",
    "numbers": ["+1234567890"],  # must be E.164 format
}
```
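On the Twilio side, the trunk and credential list can be provisioned through the console or scripted. A rough sketch with the twilio Python SDK follows; treat the exact resource calls as assumptions to verify against current Twilio documentation:

```python
from twilio.rest import Client  # pip install twilio

client = Client("ACxxxxxxxx", "your-auth-token")  # placeholder credentials

# Create the SIP trunk with the required pstn.twilio.com domain
trunk = client.trunking.v1.trunks.create(
    friendly_name="my-agent-trunk",
    domain_name="my-agent-trunk.pstn.twilio.com",
)

# Create a credential list and add the digest-auth credentials
cred_list = client.sip.credential_lists.create(friendly_name="agent-creds")
client.sip.credential_lists(cred_list.sid).credentials.create(
    username="your-secure-username",
    password="your-secure-password",
)

# Finally, associate the credential list with the trunk via the
# Trunking API or the Twilio Console
```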
Inbound Call Routing Options:
- Direct Origination URI: Simpler but less flexible

```
sip:unique-id.sip.livekit.cloud;transport=tcp
```

- TwiML Application: Enables pre-processing

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial>
    <Sip username="auth-user" password="auth-pass">
      sip:unique-id.sip.livekit.cloud;transport=tcp
    </Sip>
  </Dial>
</Response>
```
Google Cloud and Gemini Live API Setup
The Gemini Live API has strict requirements that must be understood upfront:
Audio Format Specifications:
- Input: 16-bit PCM, 16kHz, mono, little-endian
- Output: 16-bit PCM, 24kHz, mono, little-endian
These aren’t suggestions—any deviation results in processing failures or garbled audio.
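A cheap guard at the audio boundary catches format drift before it reaches the API. A minimal sketch (a hypothetical helper, not part of any SDK):

```python
def validate_gemini_input(pcm: bytes, sample_rate: int, num_channels: int) -> None:
    """Fail fast if a payload deviates from Gemini Live's input contract."""
    assert sample_rate == 16000, f"expected 16kHz input, got {sample_rate}Hz"
    assert num_channels == 1, f"expected mono audio, got {num_channels} channels"
    assert len(pcm) % 2 == 0, "16-bit PCM payloads must have an even byte count"
```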
Authentication Strategy: Service accounts are strongly preferred over API keys for production:
```bash
# Service account setup
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

# Required roles in Google Cloud IAM:
# - Vertex AI User
# - Vertex AI Viewer (for monitoring)
```
Model Selection: The API is evolving rapidly. Model identifiers like gemini-2.0-flash-live-001 may change. Always verify the current production model through the Google Cloud Console rather than relying on documentation.
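One way to verify programmatically, sketched with the google-genai SDK (assuming Vertex AI credentials are already configured):

```python
from google import genai  # pip install google-genai

client = genai.Client(
    vertexai=True,
    project="your-gcp-project",  # placeholder
    location="us-central1",
)
for model in client.models.list():
    print(model.name)  # look for the current live/realtime variants
```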
Phase 2: LiveKit Agent Implementation Details
Agent Process Architecture
Understanding LiveKit’s agent execution model is crucial for proper implementation:
```python
import asyncio
import logging

from livekit import agents, rtc
from livekit.agents import WorkerOptions, JobContext


class GeminiVoiceAgent:
    def __init__(self):
        self.audio_source = None
        self.gemini_session = None

    async def entrypoint(self, ctx: JobContext):
        """Main entry point for each agent job."""
        # This runs in an isolated subprocess for each session
        logging.info(f"Agent started for room: {ctx.room.name}")

        # Connect with audio-only subscription for voice agents
        await ctx.connect(auto_subscribe=agents.AutoSubscribe.AUDIO_ONLY)

        # Wait for the SIP participant (phone caller)
        participant = await self.wait_for_sip_participant(ctx)

        # Initialize the audio pipeline
        await self.setup_audio_pipeline(ctx, participant)

        # Start the conversation
        await self.run_conversation_loop()


# Worker configuration with explicit naming
worker_options = WorkerOptions(
    entrypoint_fnc=GeminiVoiceAgent().entrypoint,
    agent_name="gemini-sip-agent",  # critical for dispatch rules
)

if __name__ == "__main__":
    agents.cli.run_app(worker_options)
```
The agent_name parameter is critical—it disables automatic dispatch and enables explicit targeting through SIP dispatch rules and API calls.
Audio Stream Management
The agent must handle bidirectional audio streams with precise format conversion:
```python
async def setup_audio_pipeline(self, ctx: JobContext, participant: rtc.RemoteParticipant):
    # Subscribe to the user's audio track
    audio_track = None
    for pub in participant.track_publications.values():
        if pub.kind == rtc.TrackKind.KIND_AUDIO:
            pub.set_subscribed(True)
            audio_track = pub.track
            break

    # Create an audio stream for incoming audio
    user_audio_stream = rtc.AudioStream(audio_track)

    # Create an audio source for outgoing audio (24kHz for Gemini output)
    self.audio_source = rtc.AudioSource(24000, 1)  # 24kHz, mono
    agent_track = rtc.LocalAudioTrack.create_audio_track(
        "agent-voice",
        self.audio_source,
    )

    # Publish the agent's audio track
    await ctx.agent.publish_track(
        agent_track,
        rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
    )

    # Start the processing loops
    asyncio.create_task(self.process_incoming_audio(user_audio_stream))
    asyncio.create_task(self.process_gemini_responses())
```
Resampling and Format Conversion
The most critical and often overlooked aspect is audio format conversion:
```python
import numpy as np
from scipy import signal

from livekit import rtc


class AudioProcessor:
    @staticmethod
    def resample_audio(audio_data: np.ndarray,
                       orig_rate: int,
                       target_rate: int) -> np.ndarray:
        """Resample audio data to the target sample rate."""
        if orig_rate == target_rate:
            return audio_data

        # Use high-quality resampling
        num_samples = int(len(audio_data) * target_rate / orig_rate)
        resampled = signal.resample(audio_data, num_samples)

        # Ensure 16-bit PCM range
        resampled = np.clip(resampled, -32768, 32767)
        return resampled.astype(np.int16)

    @staticmethod
    def convert_to_gemini_format(frame: rtc.AudioFrame) -> bytes:
        """Convert a LiveKit audio frame to Gemini's input format."""
        # LiveKit typically provides 48kHz audio
        audio_array = np.frombuffer(frame.data, dtype=np.int16)

        # Resample to 16kHz for Gemini input
        resampled = AudioProcessor.resample_audio(
            audio_array,
            frame.sample_rate,
            16000,
        )

        # Return as little-endian bytes (numpy int16 is little-endian on
        # little-endian platforms, which covers common deployment targets)
        return resampled.tobytes()
```
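A quick arithmetic check confirms the conversion: a 20ms frame at 48kHz (960 samples) should come out as 320 samples at 16kHz.

```python
import numpy as np

frame_48k = np.zeros(960, dtype=np.int16)  # 20ms of silence at 48kHz
resampled = AudioProcessor.resample_audio(frame_48k, 48000, 16000)
assert len(resampled) == 320  # 20ms at 16kHz
assert resampled.dtype == np.int16
```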
Phase 3: Gemini Live API Integration
Establishing the Bidirectional Stream
The LiveKit Google plugin abstracts away the WebSocket complexity, but understanding the underlying protocol helps with debugging:
```python
import os

from livekit.plugins.google.beta.realtime import RealtimeModel


async def initialize_gemini_session(self):
    self.gemini_session = RealtimeModel(
        model="gemini-2.0-flash-live-001",
        voice="Puck",  # voice selection impacts latency
        temperature=0.7,
        modalities=["AUDIO", "TEXT"],  # enable transcriptions
        instructions="""You are a helpful AI assistant on a phone call.
        Keep responses concise and conversational.
        Ask clarifying questions when needed.""",
        vertexai=True,  # use Vertex AI for production
        project=os.getenv("GOOGLE_CLOUD_PROJECT"),
        location="us-central1",
    )

    # The plugin manages the WebSocket lifecycle
    await self.gemini_session.connect()
```
Real-time Audio Processing Loop
The core challenge is managing concurrent streams without introducing latency:
```python
async def process_incoming_audio(self, audio_stream: rtc.AudioStream):
    """Process user audio and send it to Gemini."""
    try:
        async for frame in audio_stream:
            # Convert to Gemini's input format (16kHz, 16-bit PCM)
            gemini_audio = AudioProcessor.convert_to_gemini_format(frame)

            # Send to the Gemini Live API
            await self.gemini_session.send_audio(gemini_audio)
    except Exception as e:
        logging.error(f"Audio processing error: {e}")
        # Implement exponential backoff for transient errors


async def process_gemini_responses(self):
    """Process Gemini responses and play the audio."""
    try:
        async for response in self.gemini_session.responses():
            if response.audio_data:
                # Convert 24kHz Gemini output into an AudioFrame
                audio_array = np.frombuffer(
                    response.audio_data,
                    dtype=np.int16,
                )
                frame = rtc.AudioFrame(
                    data=audio_array.tobytes(),
                    sample_rate=24000,
                    num_channels=1,
                    samples_per_channel=len(audio_array),
                )
                # Send to the LiveKit audio source
                await self.audio_source.capture_frame(frame)

            if response.transcript:
                # Log for debugging/analytics
                logging.info(f"Gemini: {response.transcript}")
    except Exception as e:
        logging.error(f"Gemini response error: {e}")
```
Handling Interruptions and Turn-Taking
Gemini Live’s native interruption handling requires careful state management:
```python
class ConversationManager:
    def __init__(self):
        self.is_agent_speaking = False
        self.pending_audio_buffer = []

    async def handle_interruption(self):
        """Handle a user interruption during agent speech."""
        if self.is_agent_speaking:
            # Immediately stop audio playback
            self.pending_audio_buffer.clear()
            self.is_agent_speaking = False

            # Signal Gemini about the interruption
            await self.gemini_session.signal_interruption()
            logging.info("User interrupted - clearing agent audio")
```
Phase 4: Telephony Integration Deep Dive
SIP Dispatch Rules for Inbound Calls
The dispatch rule is the critical link between incoming calls and agent assignment:
```bash
# Create inbound trunk
lk sip inbound create \
  --number "+1234567890" \
  --name "production-inbound" \
  --metadata '{"environment": "production"}'

# Create dispatch rule with explicit agent assignment
lk sip dispatch create \
  --type individual \
  --room-prefix "call-" \
  --trunk-id "ST_in_xxxx" \
  --agent-name "gemini-sip-agent" \
  --metadata '{"source": "twilio_inbound"}'
```
The --agent-name parameter must exactly match the name configured in the agent's WorkerOptions.
Implementing Outbound Call Logic
Outbound calls require external orchestration:
```python
import json
import uuid

import aiohttp


async def initiate_outbound_call(
    destination: str,
    context: dict,
    api_key: str,
    api_secret: str,
):
    """Initiate an outbound call with agent dispatch."""
    # Create a unique room for this call
    room_name = f"outbound-{uuid.uuid4()}"

    # Dispatch the agent to the room with the call context
    dispatch_request = {
        "agent_name": "gemini-sip-agent",
        "room": room_name,
        "metadata": json.dumps({
            "sip_call_to": destination,
            "context": context,
            "call_type": "outbound",
        }),
    }

    # Use the LiveKit server API to dispatch the agent
    async with aiohttp.ClientSession() as session:
        url = "https://api.livekit.cloud/dispatch/agent"
        headers = generate_livekit_auth_headers(api_key, api_secret)
        async with session.post(url, json=dispatch_request, headers=headers) as resp:
            if resp.status != 200:
                raise Exception(f"Dispatch failed: {await resp.text()}")
```
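The generate_livekit_auth_headers helper is left undefined above. LiveKit APIs authenticate with a short-lived JWT signed by the API secret; a minimal sketch with PyJWT (in practice the livekit-api package's AccessToken helper does this for you, and the exact grant required for dispatch is an assumption to verify):

```python
import time

import jwt  # pip install PyJWT


def generate_livekit_auth_headers(api_key: str, api_secret: str) -> dict:
    now = int(time.time())
    token = jwt.encode(
        {
            "iss": api_key,                # the API key identifies the issuer
            "nbf": now,
            "exp": now + 600,              # 10-minute validity
            "video": {"roomAdmin": True},  # assumed grant for server APIs
        },
        api_secret,
        algorithm="HS256",
    )
    return {"Authorization": f"Bearer {token}"}
```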
The agent then retrieves the destination from metadata and places the call:
```python
async def handle_outbound_call(self, ctx: JobContext):
    """Handle an outbound call from the agent's perspective."""
    metadata = json.loads(ctx.job.metadata)
    destination = metadata["sip_call_to"]

    # Create the SIP participant (this places the call)
    sip_participant = await ctx.api.sip.create_sip_participant(
        room_name=ctx.room.name,
        sip_trunk_id="ST_out_yyyy",  # outbound trunk ID
        sip_call_to=destination,
        participant_identity=f"callee-{destination}",
        wait_until_answered=True,  # block until answered
        play_dialtone=True,
        dtmf="#",  # optional DTMF to send after connection
    )

    # Wait for the participant to join the room
    callee = await ctx.wait_for_participant(
        identity=f"callee-{destination}",
    )

    # Monitor call status
    while callee.attributes.get("sip.callStatus") != "active":
        await asyncio.sleep(0.1)

    # Begin the conversation
    await self.start_conversation(ctx, callee)
```
Phase 5: Production Deployment Considerations
Container Architecture
The Dockerfile must account for audio processing dependencies:
```dockerfile
FROM python:3.11-slim

# Install audio processing libraries
RUN apt-get update && apt-get install -y \
    libopus0 \
    libvpx7 \
    libsrtp2-1 \
    libwebrtc-audio-processing1 \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application
COPY . .

# Run in production mode
CMD ["python", "main.py", "start"]
```
Kubernetes Deployment Strategy
For production Kubernetes deployments:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemini-voice-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gemini-voice-agent  # selector/labels are required by apps/v1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # never kill running agents
  template:
    metadata:
      labels:
        app: gemini-voice-agent
    spec:
      terminationGracePeriodSeconds: 600  # 10 minutes for calls to complete
      containers:
        - name: agent
          image: your-registry/gemini-voice-agent:latest
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          env:
            - name: LIVEKIT_URL
              valueFrom:
                secretKeyRef:
                  name: livekit-credentials
                  key: url
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secrets/gcp/key.json
          volumeMounts:
            - name: gcp-key
              mountPath: /secrets/gcp
              readOnly: true
          livenessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: gcp-key
          secret:
            secretName: gcp-service-account  # assumed secret name
```
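The long grace period only helps if the process drains cleanly on SIGTERM. The livekit-agents worker handles draining itself; the sketch below just illustrates the idea with plain asyncio (all names are hypothetical):

```python
import asyncio
import signal


class DrainOnSigterm:
    def __init__(self):
        self.draining = False  # stop accepting new calls when True
        self.active_calls = 0
        self._idle = asyncio.Event()
        self._idle.set()

    def call_started(self):
        self.active_calls += 1
        self._idle.clear()

    def call_ended(self):
        self.active_calls -= 1
        if self.active_calls == 0:
            self._idle.set()

    async def drain(self, timeout: float = 570.0):
        """Wait for active calls to finish within the grace period."""
        self.draining = True
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
        except asyncio.TimeoutError:
            pass  # grace period exhausted; Kubernetes sends SIGKILL next

    def install(self):
        loop = asyncio.get_running_loop()
        loop.add_signal_handler(
            signal.SIGTERM,
            lambda: asyncio.create_task(self.drain()),
        )
```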
Autoscaling Configuration
The Horizontal Pod Autoscaler must account for long-running sessions:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemini-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemini-voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50  # scale before the worker limit
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # react quickly to load
      policies:
        - type: Percent
          value: 100  # double capacity if needed
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1  # remove one pod at a time
          periodSeconds: 300
```
Phase 6: Monitoring and Observability
Comprehensive Metrics Collection
Implement structured logging with correlation IDs across the entire call flow:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

import structlog

from livekit.agents import JobContext


@dataclass
class CallContext:
    room_sid: str
    participant_sid: Optional[str]
    twilio_call_sid: Optional[str]
    start_time: datetime


logger = structlog.get_logger()


class InstrumentedAgent:
    async def entrypoint(self, ctx: JobContext):
        call_context = CallContext(
            room_sid=ctx.room.sid,
            participant_sid=None,  # set when the participant joins
            twilio_call_sid=None,  # extracted from participant attributes
            start_time=datetime.utcnow(),
        )
        log = logger.bind(
            room_sid=call_context.room_sid,
            job_id=ctx.job.id,
        )
        log.info("agent.session.started")
        try:
            # ... agent logic ...
            pass
        except Exception as e:
            log.error("agent.session.error", error=str(e))
            raise
        finally:
            duration = (datetime.utcnow() - call_context.start_time).total_seconds()
            log.info("agent.session.ended", duration_seconds=duration)
```
Multi-Layer Monitoring Strategy
Monitor at every layer of the stack:
Infrastructure Layer:
- Container resource utilization (CPU, memory, network)
- Pod restart counts and reasons
- Node capacity and scheduling pressure
LiveKit Layer:
- Active rooms and participants
- Track publish/subscribe events
- Packet loss and jitter metrics
- NACK (Negative Acknowledgment) counts
Agent Application Layer:
- Agent dispatch latency
- Audio processing pipeline latency
- Gemini API call duration and error rates
- Audio format conversion performance
Telephony Layer:
- SIP response codes (200 OK, 486 Busy, 603 Declined)
- Call setup time
- Call duration distribution
- Geographic distribution of calls
AI Service Layer:
- Gemini session duration
- Token usage (if applicable)
- Interruption frequency
- Turn-taking metrics
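Most of the agent-layer metrics map naturally onto Prometheus primitives. A brief sketch with the prometheus_client library (the metric names are illustrative):

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; align them with your own conventions
DISPATCH_LATENCY = Histogram(
    "agent_dispatch_latency_seconds",
    "Time from job request to the agent joining the room",
)
GEMINI_ERRORS = Counter(
    "gemini_api_errors_total",
    "Errors raised by the Gemini Live session",
)

# Expose /metrics for scraping alongside the agent process
start_http_server(9090)
```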
Production Debugging Patterns
For production issues, correlate logs across services:
```python
async def debug_call_flow(room_sid: str):
    """Correlate logs across all services for a specific call."""
    # Query LiveKit server logs
    livekit_logs = await query_logs(
        service="livekit-server",
        filters={"room_sid": room_sid},
    )

    # Extract participant SIDs
    participant_sids = extract_participant_sids(livekit_logs)

    # Query agent logs
    agent_logs = await query_logs(
        service="gemini-voice-agent",
        filters={"room_sid": room_sid},
    )

    # Extract the Twilio Call SID from participant attributes
    twilio_call_sid = extract_twilio_call_sid(agent_logs)

    # Query Twilio CDRs
    twilio_cdrs = await twilio_client.calls.get(twilio_call_sid)

    # Construct a unified timeline
    timeline = merge_log_timeline([
        livekit_logs,
        agent_logs,
        twilio_cdrs,
    ])

    return analyze_call_issues(timeline)
```
Phase 7: Testing Strategy for Real-time Systems
Load Testing with Realistic Patterns
Use LiveKit’s load testing tool with realistic call patterns:
```bash
# Simulate gradual ramp-up of concurrent calls
lk load-test \
  --room-prefix "loadtest-" \
  --num-sessions 100 \
  --duration 10m \
  --num-per-second 2 \
  --agent-name "gemini-sip-agent" \
  --agent-echo-test
```
Monitor during load tests:
- Agent worker CPU and memory utilization
- LiveKit server packet processing metrics
- Gemini API latency percentiles (p50, p95, p99)
- Audio quality degradation indicators
End-to-End Quality Metrics
Implement automated quality testing:
```python
class QualityTestRunner:
    async def run_e2e_quality_test(self):
        """Automated end-to-end quality test."""
        test_phone = "+1234567899"  # test number

        # Place a test call
        call_sid = await self.place_test_call(test_phone)

        # Record the conversation
        recording = await self.record_conversation(call_sid, duration=30)

        # Analyze quality metrics
        metrics = {
            "audio_clarity": self.analyze_audio_clarity(recording),
            "response_latency": self.measure_response_latency(recording),
            "interruption_handling": self.test_interruption_response(recording),
            "transcript_accuracy": self.compare_transcripts(recording),
        }

        # Alert if quality degrades
        if metrics["response_latency"] > 800:  # milliseconds
            await self.alert_quality_degradation(metrics)

        return metrics
```
Advanced Patterns and Optimizations
Connection Pooling for Gemini Sessions
Implement session pooling to reduce connection overhead:
```python
class GeminiSessionPool:
    def __init__(self, max_sessions: int = 10):
        self.max_sessions = max_sessions  # store the cap; acquire() checks it
        self.available_sessions = asyncio.Queue(maxsize=max_sessions)
        self.all_sessions = []

    async def acquire(self) -> RealtimeModel:
        """Acquire a session from the pool."""
        try:
            session = self.available_sessions.get_nowait()
            if await self.is_session_healthy(session):
                return session
        except asyncio.QueueEmpty:
            pass

        # Create a new session if capacity allows
        if len(self.all_sessions) < self.max_sessions:
            session = await self.create_session()
            self.all_sessions.append(session)
            return session

        # Otherwise wait for a session to be released
        return await self.available_sessions.get()

    async def release(self, session: RealtimeModel):
        """Return a session to the pool for reuse."""
        await self.available_sessions.put(session)
```
Adaptive Audio Processing
Implement adaptive quality based on network conditions:
```python
class AdaptiveAudioProcessor:
    def __init__(self):
        self.network_quality = 1.0  # 0.0 (worst) to 1.0 (best)

    async def process_frame(self, frame: rtc.AudioFrame) -> bytes:
        """Adapt processing based on network quality."""
        if self.network_quality < 0.5:
            # Use lower-quality resampling on poor networks
            return self.fast_resample(frame)
        else:
            # Use high-quality resampling
            return self.high_quality_resample(frame)

    def update_network_quality(self, packet_loss: float, jitter: float):
        """Update the quality score from network metrics."""
        # Packet loss is weighted more heavily than jitter
        quality = 1.0 - (packet_loss * 2.0 + jitter * 0.5)
        self.network_quality = max(0.0, min(1.0, quality))
```
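The two resampling paths are left abstract above. One plausible pairing, sketched with numpy and scipy (hypothetical helpers, tune to taste): linear interpolation for the cheap path, polyphase filtering for the high-quality path.

```python
import numpy as np
from scipy import signal


def fast_resample(audio: np.ndarray, orig_rate: int, target_rate: int) -> np.ndarray:
    """Cheap linear-interpolation resampling for degraded networks."""
    n_out = int(len(audio) * target_rate / orig_rate)
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio.astype(np.float32)).astype(np.int16)


def high_quality_resample(audio: np.ndarray, orig_rate: int, target_rate: int) -> np.ndarray:
    """Polyphase filtering: better anti-aliasing at higher CPU cost."""
    g = np.gcd(orig_rate, target_rate)
    out = signal.resample_poly(audio, target_rate // g, orig_rate // g)
    return np.clip(out, -32768, 32767).astype(np.int16)
```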
Cost Optimization Strategies
Implement intelligent session management to control costs:
```python
class CostAwareSessionManager:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.daily_cost = 0.0
        self.cost_per_minute = 0.05  # example rate

    async def should_accept_call(self) -> bool:
        """Decide whether to accept new calls based on remaining budget."""
        if self.daily_cost >= self.daily_budget * 0.9:
            # Near the budget limit - only accept high-priority calls
            return False
        return True

    async def optimize_session_duration(self, conversation_context: dict):
        """Dynamically adjust conversation strategy based on spend."""
        if self.daily_cost > self.daily_budget * 0.7:
            # Encourage shorter conversations when approaching the budget
            return "concise_mode"
        return "standard_mode"
```
Conclusion: Building for Scale and Reliability
Creating a production-ready AI voice agent requires careful attention to numerous technical details across multiple system layers. The shift from traditional STT-LLM-TTS pipelines to direct audio processing with APIs like Gemini Live represents a fundamental change in how we build conversational AI systems.
Key technical takeaways:
- Audio format conversion is critical - the specific requirements (16kHz input, 24kHz output) must be handled precisely to avoid quality issues.
- Session state management differs fundamentally - unlike traditional LLM integrations, conversational state lives within the persistent WebSocket connection to Gemini Live.
- Graceful shutdown is non-negotiable - voice calls cannot be abruptly terminated. Plan for termination grace periods of 10+ minutes.
- Monitoring must span all layers - from SIP response codes to AI API latencies, comprehensive observability is essential for production operations.
- Testing requires real PSTN calls - simulation alone is insufficient; actual telephone network testing is required for confidence.
The convergence of real-time communication infrastructure, telephony gateways, and advanced AI APIs has made it possible to build voice agents that feel truly conversational. However, the technical complexity requires careful architecture, robust implementation, and comprehensive operational practices to deliver reliable service at scale.
As these technologies continue to evolve—with APIs moving from beta to general availability and new capabilities being added—the patterns and practices outlined here provide a foundation for building voice AI systems that can adapt and scale with growing demands.

About Sharad Jain
Sharad Jain is an AI Engineer and Data Scientist specializing in enterprise-scale generative AI and NLP. Currently leading AI initiatives at Autoscreen.ai, he has developed ACRUE frameworks and optimized LLM performance at scale. Previously at Meta, Autodesk, and WithJoy.com, he brings extensive experience in machine learning, data analytics, and building scalable AI systems. He holds an MS in Business Analytics from UC Davis.