Lesson 1 of 5 · 11 min read

Voice Agents Architecture

Conversational AI voice agents are the next evolution of chatbots. Instead of typing, users speak naturally, and the agent responds in real time with a natural-sounding synthetic voice. The architecture behind this is complex, and it largely determines latency, audio quality, and user experience.

Conversational AI Agents

What Is a Voice Agent?

A voice agent is an autonomous system that conducts conversations in natural language:

User speaks → ASR → LLM → TTS → User hears
      ↑                  ↓
      └── Turn-Taking ───┘
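The flow in the diagram can be sketched as a single turn handler. This is a minimal sketch under assumptions: `Asr`, `Llm`, `Tts`, and `handleTurn` are hypothetical names rather than ElevenLabs APIs, and the stages are synchronous stubs so the data flow stays visible. A production pipeline would stream audio asynchronously.

```typescript
// Hypothetical stage signatures; real services stream audio asynchronously.
type Asr = (audio: Uint8Array) => string;                // speech -> text
type Llm = (context: string[], userText: string) => string; // text -> reply
type Tts = (text: string) => Uint8Array;                 // text -> speech

interface TurnResult {
  userText: string;
  replyText: string;
  replyAudio: Uint8Array;
}

// Runs one conversational turn: ASR -> LLM -> TTS, and records both
// sides of the exchange in the shared context (the "memory" component).
function handleTurn(
  audio: Uint8Array,
  context: string[],
  asr: Asr,
  llm: Llm,
  tts: Tts,
): TurnResult {
  const userText = asr(audio);
  const replyText = llm(context, userText);
  context.push(`user: ${userText}`, `agent: ${replyText}`);
  return { userText, replyText, replyAudio: tts(replyText) };
}

// Stub stages for a dry run without any external service.
const fakeAsr: Asr = (audio) => `heard ${audio.length} bytes`;
const fakeLlm: Llm = (_context, text) => `You said: ${text}`;
// Placeholder audio: one zero byte per character of the reply.
const fakeTts: Tts = (text) => new Uint8Array(text.length);
```

Passing the stages in as functions keeps the handler testable: each stub can later be swapped for a real ASR, LLM, or TTS client without changing the turn logic.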

Architecture Components

Component      Function                   ElevenLabs Feature
ASR            Speech → text              Scribe STT
LLM            Understanding + response   Integrated (GPT-4o, Claude)
TTS            Text → speech              Turbo v2.5 (< 300 ms)
Turn manager   Conversation control       Conversational AI Engine
Tool router    Call external APIs         Function Calling
Memory         Context across turns       Session State

ElevenLabs Conversational AI Setup

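// Assumes `elevenlabs` is an already-initialized ElevenLabs SDK client.
// Method and field names follow this lesson; verify them against the current SDK reference.
const agent = await elevenlabs.conversationalAI.create({
  name: 'Customer Service Agent',
  voice_id: 'brand-voice-id', // placeholder: ID of your brand voice
  model: {
    provider: 'openai',
    model_id: 'gpt-4o',
  },
  // Persona and scope of the agent
  system_prompt: `You are a friendly customer service agent
    for EverStrategy.ai. You help with questions about products,
    orders, and technical support.`,
  // Functions the LLM may call during the conversation
  tools: [
    { name: 'check_order_status', description: '...' },
    { name: 'create_ticket', description: '...' },
  ],
  // Spoken by the agent when the conversation starts
  first_message: 'Hello! How can I help you?',
});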
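When the LLM decides to use one of the declared tools, something on your side has to execute it. The following is a minimal sketch of such a tool router, not the ElevenLabs implementation: `routeToolCall`, the handler bodies, and the argument names `order_id` and `summary` are assumptions made for illustration. Only the tool names come from the configuration above.

```typescript
type ToolArgs = Record<string, string>;
type ToolHandler = (args: ToolArgs) => string;

// Hypothetical handlers with canned replies; real ones would call the
// order system or the ticket API.
const toolHandlers = new Map<string, ToolHandler>([
  ["check_order_status", (args) => `Order ${args["order_id"]} is being processed`],
  ["create_ticket", (args) => `Ticket created: ${args["summary"]}`],
]);

// Routes a tool call requested by the LLM to its handler.
// Unknown tools return an error string so the LLM can recover in conversation
// instead of the whole turn failing.
function routeToolCall(toolName: string, args: ToolArgs): string {
  const handler = toolHandlers.get(toolName);
  if (handler === undefined) {
    return `Error: unknown tool "${toolName}"`;
  }
  return handler(args);
}
```

A `Map` is used instead of a plain object so that names such as `constructor` cannot accidentally resolve to inherited properties.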

Turn-Taking

The Fundamental Problem

In a phone conversation, humans speak in turns — with natural transitions. A voice agent must replicate this behavior:

  • When should the agent listen? (User is speaking)
  • When should the agent respond? (User has stopped)
  • What happens with overlap? (Both speaking simultaneously)

ElevenLabs Turn-Taking

Feature                 Description
End-of-turn detection   Detects when the user is finished (~300 ms)
Filler words            "Hmm", "So" during LLM processing
Backchanneling          Short confirmations: "Yes", "I see"
Silence handling        Follow-up after 5 seconds of silence
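The two timing rules in the table can be illustrated with a pure silence-timeout classifier. This is a simplified sketch, not how ElevenLabs detects turns: production systems typically also use prosody and the transcript itself. Treating the ~300 ms figure as a silence threshold is an assumption, and `classifySilence` and `TurnSignal` are hypothetical names.

```typescript
type TurnSignal = "keep_listening" | "end_of_turn" | "prompt_user";

// Thresholds taken from the table above; tune them per deployment.
const END_OF_TURN_MS = 300;
const FOLLOW_UP_MS = 5000;

// Classifies how long the line has been silent.
// `agentHasReplied` separates "the user just finished speaking" from
// "the agent already answered and nobody is talking".
function classifySilence(silenceMs: number, agentHasReplied: boolean): TurnSignal {
  if (agentHasReplied) {
    return silenceMs >= FOLLOW_UP_MS ? "prompt_user" : "keep_listening";
  }
  return silenceMs >= END_OF_TURN_MS ? "end_of_turn" : "keep_listening";
}
```

A short threshold makes the agent feel responsive but risks cutting off users who pause mid-sentence, which is why the value should be tuned against real conversations.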

Interruption Handling

Why Interruptions Are Critical

When a user interrupts the agent, the system must react immediately:

  1. Stop audio output — immediately stop speaking
  2. Process new input — what is the user saying?
  3. Adjust context — discard or adapt previous response
  4. Respond anew — react to the interrupt
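The four steps can be sketched on in-memory state. This is an illustrative sketch under assumptions: `Message`, `PlaybackState`, and `handleInterrupt` are hypothetical names, and the count of characters already spoken stands in for whatever playback position a real audio pipeline reports.

```typescript
interface Message {
  role: "user" | "agent";
  text: string;
}

interface PlaybackState {
  speaking: boolean;   // is TTS audio currently playing?
  spokenChars: number; // how much of the last agent message was played
}

// Applies the interrupt steps and returns the updated conversation.
// Step 2 (process new input) has already produced `userText` via ASR.
function handleInterrupt(
  messages: Message[],
  playback: PlaybackState,
  userText: string,
): Message[] {
  // Step 1: stop audio output.
  playback.speaking = false;

  // Step 3: adjust context so it contains only what the user actually heard.
  const updated = messages.slice();
  const last = updated[updated.length - 1];
  if (last !== undefined && last.role === "agent") {
    const heard = last.text.slice(0, playback.spokenChars).trim();
    if (heard.length > 0) {
      updated[updated.length - 1] = { role: "agent", text: heard + " [interrupted]" };
    } else {
      updated.pop(); // nothing was heard, so discard the reply entirely
    }
  }

  // The interrupting utterance becomes the next user turn.
  updated.push({ role: "user", text: userText });

  // Step 4: the caller sends `updated` to the LLM for a fresh reply.
  return updated;
}
```

Truncating the agent message matters because the LLM would otherwise assume the user heard the full answer and might refer back to content that was never spoken.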

Latency Budget

Interrupt detection:   50 ms
Audio stop:           100 ms
ASR processing:       200 ms
LLM response:         300 ms
TTS start:            200 ms
─────────────────────────────
Total:                850 ms (target: < 1,000 ms)
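To keep the budget honest during latency tuning, the same figures can be checked in code. A small sketch: the stage values are copied from the budget above, while `StageLatency`, `totalMs`, and `headroomMs` are hypothetical helper names.

```typescript
interface StageLatency {
  stage: string;
  ms: number;
}

// Stage budgets in milliseconds, copied from the latency budget above.
const interruptBudget: StageLatency[] = [
  { stage: "interrupt detection", ms: 50 },
  { stage: "audio stop", ms: 100 },
  { stage: "ASR processing", ms: 200 },
  { stage: "LLM response", ms: 300 },
  { stage: "TTS start", ms: 200 },
];

const TARGET_MS = 1000;

// Sums the budgeted (or measured) stage latencies.
function totalMs(stages: StageLatency[]): number {
  return stages.reduce((sum, s) => sum + s.ms, 0);
}

// Remaining slack before the target is exceeded; negative means over budget.
function headroomMs(stages: StageLatency[], targetMs: number): number {
  return targetMs - totalMs(stages);
}

const total = totalMs(interruptBudget);                   // 850
const slack = headroomMs(interruptBudget, TARGET_MS);     // 150
```

Feeding measured values into the same structure shows which stage has consumed the 150 ms of slack when the end-to-end target is missed.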

Emotion Detection

Emotional Intelligence for Voice Agents

Modern voice agents can estimate the user's emotional state from acoustic and verbal cues:

  • Frustration: Louder voice, faster speaking, sighing
  • Confusion: Hesitation, "um" sounds, repetitions
  • Satisfaction: Calm tone, positive words
  • Urgency: Fast speaking, short sentences
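The cues above can be turned into a rough rule-based classifier. This is only a stand-in for a trained model: every feature name and every threshold below is an illustrative assumption, not a measured value or an ElevenLabs feature.

```typescript
type Emotion = "frustration" | "confusion" | "satisfaction" | "urgency" | "neutral";

// Hypothetical per-utterance features; a real system would derive them
// from the audio signal and the ASR transcript.
interface UtteranceFeatures {
  loudnessRatio: number;    // loudness relative to the caller's baseline (1.0 = normal)
  wordsPerSecond: number;   // speaking rate
  fillerCount: number;      // "um" sounds, hesitations, repetitions
  positiveWords: number;    // count of positive words in the transcript
  avgSentenceWords: number; // average sentence length in words
}

// Rule-based sketch. All thresholds are placeholders chosen for illustration.
function detectEmotion(f: UtteranceFeatures): Emotion {
  const fast = f.wordsPerSecond > 3.5;
  if (f.loudnessRatio > 1.3 && fast) return "frustration"; // louder and faster
  if (fast && f.avgSentenceWords < 6) return "urgency";    // fast, short sentences
  if (f.fillerCount >= 3) return "confusion";              // hesitation, repetition
  if (f.positiveWords > 0 && f.loudnessRatio <= 1.1) return "satisfaction"; // calm, positive
  return "neutral";
}
```

The rule order encodes a priority: frustration is checked first because missing it is usually more costly than misreading a satisfied caller.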

Responding to Emotions

Detected Emotion   Agent Response
Frustration        Empathetic: "I understand that's frustrating. Let me help right away."
Confusion          Clarifying: "Let me explain that differently..."
Urgency            Efficient: Shorter answers, get to the point faster
Satisfaction       Confirming: "Glad I could help!"
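One way to apply this mapping is a lookup that turns the detected emotion into a style instruction for the next LLM turn. A minimal sketch: the styles and phrases come from the table above, while `ResponsePolicy`, `styleInstruction`, and the `maxSentences` caps are assumptions added for illustration.

```typescript
type DetectedEmotion = "frustration" | "confusion" | "urgency" | "satisfaction";

interface ResponsePolicy {
  style: string;
  opener: string;       // phrase to open the reply with; empty when none applies
  maxSentences: number; // hypothetical length cap, not a value from the table
}

const POLICIES: Record<DetectedEmotion, ResponsePolicy> = {
  frustration: {
    style: "empathetic",
    opener: "I understand that's frustrating. Let me help right away.",
    maxSentences: 3,
  },
  confusion: { style: "clarifying", opener: "Let me explain that differently...", maxSentences: 4 },
  urgency: { style: "efficient", opener: "", maxSentences: 1 },
  satisfaction: { style: "confirming", opener: "Glad I could help!", maxSentences: 2 },
};

// Builds an instruction that can be appended to the system prompt for the next turn.
function styleInstruction(emotion: DetectedEmotion): string {
  const p = POLICIES[emotion];
  const opener = p.opener.length > 0 ? ` Open with: "${p.opener}"` : "";
  return `Tone: ${p.style}. Limit: ${p.maxSentences} sentence(s).${opener}`;
}
```

For example, `styleInstruction("urgency")` yields a short instruction without an opener, which matches the table's advice to get to the point faster.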

Voice Agent Lifecycle

From Development to Production

Phase 1 — Design (1–2 weeks):

  • Define persona (voice, tonality, boundaries)
  • Design dialog flows
  • Plan tools and integrations
  • Create test cases

Phase 2 — Development (2–4 weeks):

  • Configure agent (prompt, voice, tools)
  • Build backend integrations (CRM, ticket system)
  • Testing: Happy path + edge cases
  • Optimize latency

Phase 3 — Pilot (2–4 weeks):

  • 10% of traffic to voice agent
  • Monitoring: Containment rate, CSAT, latency
  • Daily analysis of failed conversations
  • Iterative improvement
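The pilot split and the containment metric can be sketched as follows. This is a sketch under assumptions: `routeToVoiceAgent` and `containmentRate` are hypothetical helpers, hashing the caller ID is just one way to get a stable split, and containment rate is taken here as the share of conversations resolved without handoff to a human.

```typescript
// FNV-1a style hash over UTF-16 code units: small, deterministic, 32-bit.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned
  }
  return hash;
}

// Sends a stable subset of callers to the voice agent.
// The same caller always lands in the same group, which keeps their
// experience consistent and makes failed conversations reproducible.
function routeToVoiceAgent(callerId: string, rolloutPercent: number): boolean {
  return fnv1a(callerId) % 100 < rolloutPercent;
}

// Share of conversations the agent resolved without human handoff.
function containmentRate(resolvedByAgent: number, totalConversations: number): number {
  return totalConversations === 0 ? 0 : resolvedByAgent / totalConversations;
}
```

Raising `rolloutPercent` from 10 during rollout only adds callers to the voice-agent group; nobody who was already routed to the agent is moved back.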

Phase 4 — Rollout (ongoing):

  • Gradual traffic increase
  • A/B testing different configurations
  • Add new use cases
  • Continuous monitoring

Practical tip: Invest 50% of your time in Phase 1 (Design). A well-designed agent with 3 use cases beats a poorly designed one with 20. Conversational design is more important than technology.

Quiz

Question 1 of 3

What latency budget should you target for the entire interrupt-handling process?