LLM + Voice Pipeline

Connecting LLMs with voice output is the heart of every voice agent. The challenge: How do you create a pipeline that sounds natural, responds quickly, and scales? Here you'll learn the architecture and optimization of the LLM-to-voice pipeline.

Connecting LLMs with Voice Output

The Pipeline in Detail

Audio Input ──→ ASR ──→ Text ──→ LLM ──→ Response Text ──→ TTS ──→ Audio Output
   (User)     (STT)           (Brain)                    (Voice)    (User)

Naive Implementation (slow)

// BAD: Sequential. Each stage waits for the previous one to finish completely.
// (speechToText(...) is shorthand here; the exact SDK call depends on your client version)
const transcript = await elevenlabs.speechToText(audioInput)
const llmResponse = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: transcript }],
})
// `content` can be null, so fall back to an empty string
const audio = await elevenlabs.textToSpeech.convert(voiceId, {
  text: llmResponse.choices[0].message.content ?? '',
})
// Total latency: 2,000–4,000 ms

Optimized Implementation (fast)

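// GOOD: Streaming. Forward LLM output to TTS sentence by sentence.
const transcript = await elevenlabs.speechToText(audioInput)

// Start LLM streaming
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: transcript }],
  stream: true,
})

// Open TTS WebSocket (authentication, the initial settings message,
// and the end-of-input message are omitted here)
const ttsWs = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input`
)
// Wait until the socket is open; sending earlier would throw
await new Promise<void>((resolve, reject) => {
  ttsWs.onopen = () => resolve()
  ttsWs.onerror = reject
})

let buffer = ''
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content || ''
  buffer += token

  // Sentence detection: send everything up to the last sentence end
  // and keep the start of the next sentence in the buffer
  const match = buffer.match(/^[\s\S]*[.!?]\s/)
  if (match) {
    ttsWs.send(JSON.stringify({ text: match[0] }))
    buffer = buffer.slice(match[0].length)
  }
}

// Flush the remainder: the final sentence has no trailing whitespace
if (buffer.trim()) {
  ttsWs.send(JSON.stringify({ text: buffer }))
}
// Total latency: 500–1,000 ms to first audio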

Latency Optimization

The Latency Budget

Phase	Target	Optimization
ASR	< 300 ms	Streaming STT, Voice Activity Detection
LLM TTFT	< 500 ms	Model choice, prompt caching
TTS	< 300 ms	Turbo model, streaming
Network	< 100 ms	Regional endpoints, WebSocket
Total	< 1,200 ms	End-to-end optimization
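
To make the budget actionable, it can help to compare measured phase timings against the targets in the table. The following is a minimal sketch: the names `Phase`, `BUDGET_MS`, and `checkBudget`, as well as the sample measurements, are illustrative assumptions and not part of any SDK.

```typescript
// Per-phase latency targets from the budget table, in milliseconds.
type Phase = 'asr' | 'llmTtft' | 'tts' | 'network'

const BUDGET_MS: Record<Phase, number> = {
  asr: 300,
  llmTtft: 500,
  tts: 300,
  network: 100,
}

interface BudgetReport {
  totalMs: number
  totalBudgetMs: number
  overBudget: Phase[] // phases whose measurement reached or exceeded its target
}

// Compare measured timings (hypothetical input shape) against the targets.
function checkBudget(measuredMs: Record<Phase, number>): BudgetReport {
  const phases = Object.keys(BUDGET_MS) as Phase[]
  return {
    totalMs: phases.reduce((sum, p) => sum + measuredMs[p], 0),
    totalBudgetMs: phases.reduce((sum, p) => sum + BUDGET_MS[p], 0),
    overBudget: phases.filter((p) => measuredMs[p] >= BUDGET_MS[p]),
  }
}

// Made-up sample measurements for one conversational turn.
const report = checkBudget({ asr: 250, llmTtft: 620, tts: 280, network: 60 })
console.log(report)
```

In this made-up sample, only the LLM phase misses its target, which is where optimization effort would go first.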

Optimization Strategies in Detail

1. Reduce LLM latency:

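Smaller model: GPT-4o-mini instead of GPT-4o (roughly 2x faster in many cases; measure with your own prompts)
Prompt caching: Keep the system prompt as a stable prefix so it can be cached (OpenAI caches long prompts automatically, which lowers time to first token on repeated requests)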
Shorter prompts: Every token counts for latency
Streaming: Forward tokens immediately, don't wait for completion
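
Whether any of these measures helps is easiest to judge by measuring time to first token (TTFT) directly. Below is a small sketch of such a timer; `TtftTimer` is a hypothetical helper, and the injectable clock is an assumption made so the example can run without a real LLM call.

```typescript
// Records time to first token (TTFT) and total generation time for one LLM call.
// `now` is injectable so the timer can be exercised with a fake clock.
class TtftTimer {
  private readonly now: () => number
  private readonly startMs: number
  private firstTokenMs: number | null = null
  private lastTokenMs: number | null = null

  constructor(now: () => number = () => Date.now()) {
    this.now = now
    this.startMs = now()
  }

  // Call once per streamed chunk; empty deltas are ignored.
  onToken(token: string): void {
    if (token.length === 0) return
    const t = this.now()
    if (this.firstTokenMs === null) this.firstTokenMs = t
    this.lastTokenMs = t
  }

  // Milliseconds from request start to the first non-empty token, or null.
  get ttftMs(): number | null {
    return this.firstTokenMs === null ? null : this.firstTokenMs - this.startMs
  }

  // Milliseconds from request start to the most recent token, or null.
  get totalMs(): number | null {
    return this.lastTokenMs === null ? null : this.lastTokenMs - this.startMs
  }
}

// Usage with a fake clock standing in for real elapsed time.
let fakeClock = 1000
const timer = new TtftTimer(() => fakeClock)
fakeClock = 1420
timer.onToken('') // ignored: empty delta
fakeClock = 1450
timer.onToken('I')
fakeClock = 1900
timer.onToken(' can')
console.log(timer.ttftMs, timer.totalMs)
```

In a real pipeline, `onToken` would be called inside the `for await` loop over the LLM stream, with the default `Date.now()` clock.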

2. Reduce ASR latency:

Voice Activity Detection (VAD): Only transcribe when speech is detected
Streaming STT: Use interim results for faster reaction
Endpointing: Detect when the user has finished speaking
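
To illustrate how VAD and endpointing fit together, here is a deliberately simple energy-based sketch. The `Endpointer` class, its threshold, and its frame counts are assumptions chosen for illustration; production systems often rely on a trained VAD model or on the endpointing built into the STT provider instead.

```typescript
// Root-mean-square energy of one audio frame (samples normalized to [-1, 1]).
function rmsEnergy(frame: Float32Array): number {
  if (frame.length === 0) return 0
  let sumSquares = 0
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i]
  return Math.sqrt(sumSquares / frame.length)
}

// Speech starts when frame energy exceeds the threshold and ends after
// `silenceFramesToEnd` consecutive quiet frames.
class Endpointer {
  private speaking = false
  private quietFrames = 0
  private readonly threshold: number
  private readonly silenceFramesToEnd: number

  constructor(threshold = 0.02, silenceFramesToEnd = 25) {
    this.threshold = threshold
    this.silenceFramesToEnd = silenceFramesToEnd
  }

  // Returns 'start' or 'end' on a transition, otherwise null.
  process(frame: Float32Array): 'start' | 'end' | null {
    if (rmsEnergy(frame) > this.threshold) {
      this.quietFrames = 0
      if (!this.speaking) {
        this.speaking = true
        return 'start'
      }
      return null
    }
    if (this.speaking) {
      this.quietFrames += 1
      if (this.quietFrames >= this.silenceFramesToEnd) {
        this.speaking = false
        this.quietFrames = 0
        return 'end'
      }
    }
    return null
  }
}
```

With 20 ms frames, the default of 25 quiet frames corresponds to about 500 ms of silence before the turn is considered finished. A shorter window reacts faster but risks cutting the user off mid-thought.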

3. Reduce TTS latency:

Turbo model: eleven_turbo_v2_5 instead of eleven_multilingual_v2
Streaming output: Play first audio bytes immediately
Chunking: Send text sentence by sentence to TTS

4. Reduce network latency:

WebSocket instead of REST: Persistent connection, no per-request handshake overhead
Regional deployment: Server close to the user
Connection pooling: Reuse connections

Streaming TTS

Sentence-Level Streaming

The most effective strategy: Split LLM output into sentences and send each sentence immediately to TTS:

LLM Token 1: "I"
LLM Token 2: "can"
LLM Token 3: "help"
LLM Token 4: "you."  ← Sentence end detected!
→ Send sentence to TTS: "I can help you."
→ Audio is generated while LLM produces the next sentence

Sentence Detection

function detectSentenceEnd(buffer: string): boolean {
  // Simple heuristic: terminal punctuation followed by whitespace,
  // or a length cap so very long sentences are not held back
  return /[.!?]\s/.test(buffer) || buffer.length > 200
}

// Extended: also break at commas, semicolons, and colons for more natural pacing
function detectPausePoint(buffer: string): boolean {
  return /[.!?,;:]\s/.test(buffer) || buffer.length > 150
}
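
The detection heuristics only answer "is there a boundary?". In practice the buffer also has to be split at that boundary, and whatever remains when the LLM stream ends has to be flushed. One possible way to package this is sketched below; `SentenceChunker` is a hypothetical helper built on the same regex idea, not part of any library.

```typescript
// Accumulates streamed tokens and emits complete sentences as soon as they end.
class SentenceChunker {
  private buffer = ''
  private readonly maxLength: number

  constructor(maxLength = 200) {
    this.maxLength = maxLength
  }

  // Add one token; returns zero or more chunks that are ready for TTS.
  push(token: string): string[] {
    this.buffer += token
    const ready: string[] = []
    // Sentence boundary: terminal punctuation followed by whitespace.
    const boundary = /[.!?]\s+/
    let match = boundary.exec(this.buffer)
    while (match !== null) {
      const end = match.index + match[0].length
      ready.push(this.buffer.slice(0, end).trim())
      this.buffer = this.buffer.slice(end)
      match = boundary.exec(this.buffer)
    }
    // Fallback: force a chunk if no boundary shows up for a long stretch.
    if (this.buffer.length > this.maxLength) {
      ready.push(this.buffer.trim())
      this.buffer = ''
    }
    return ready
  }

  // Call when the LLM stream ends to emit the final, unterminated remainder.
  flush(): string[] {
    const rest = this.buffer.trim()
    this.buffer = ''
    return rest.length > 0 ? [rest] : []
  }
}

const chunker = new SentenceChunker()
const tokens = ['I', ' can', ' help', ' you.', ' What', ' do', ' you', ' need?']
const chunks: string[] = []
for (const token of tokens) chunks.push(...chunker.push(token))
chunks.push(...chunker.flush())
console.log(chunks) // [ 'I can help you.', 'What do you need?' ]
```

Because the boundary requires trailing whitespace, the first sentence is emitted only when the next token arrives, and the last sentence is emitted by `flush()`. Chunks are trimmed here; if your TTS endpoint concatenates incoming text, you may need to re-add a trailing space.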

Context-Aware Voice Modulation

Dynamic Voice Adaptation

The voice agent adapts its voice to the context:

Context	Voice Settings
Greeting	stability: 0.6, style: 0.4 (friendly, warm)
Technical explanation	stability: 0.8, style: 0.1 (clear, factual)
Apology	stability: 0.7, style: 0.3 (empathetic)
Farewell	stability: 0.5, style: 0.5 (warm, personal)

Implementation

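// Minimal context shape assumed for this example
interface ConversationContext {
  intent: string
}

function getVoiceSettings(context: ConversationContext) {
  if (context.intent === 'complaint') {
    // Empathetic: same values as the "Apology" row
    return { stability: 0.7, similarity_boost: 0.8, style: 0.3 }
  }
  if (context.intent === 'technical_support') {
    // Clear and factual: same values as the "Technical explanation" row
    return { stability: 0.8, similarity_boost: 0.8, style: 0.1 }
  }
  // Default: friendly and natural
  return { stability: 0.5, similarity_boost: 0.75, style: 0.3 }
}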

Speed Adaptation

Complex info: Speak slower (prosody rate: 0.9)
Confirmations: Normal pace (rate: 1.0)
Summaries: Slightly faster (rate: 1.1)
Urgent messages: Somewhat faster (rate: 1.15)
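
The rates above can be kept in one lookup table and applied per message. The sketch below assumes a TTS engine that accepts SSML `<prosody rate="...">` with percentage values; not every engine does, and some expose a separate speed parameter instead, so check your provider's documentation. `MessageKind`, `SPEECH_RATE`, and `withProsodyRate` are illustrative names.

```typescript
type MessageKind = 'complex_info' | 'confirmation' | 'summary' | 'urgent'

// Speaking-rate multipliers from the list above (1.0 = normal pace).
const SPEECH_RATE: Record<MessageKind, number> = {
  complex_info: 0.9,
  confirmation: 1.0,
  summary: 1.1,
  urgent: 1.15,
}

// Wrap text in an SSML <prosody> element with a percentage rate.
// Assumes the target engine supports SSML prosody (hypothetical for your provider).
function withProsodyRate(text: string, kind: MessageKind): string {
  const percent = Math.round(SPEECH_RATE[kind] * 100)
  // Escape characters that would otherwise break the SSML markup.
  const escaped = text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
  return `<prosody rate="${percent}%">${escaped}</prosody>`
}

console.log(withProsodyRate('Your order has shipped.', 'summary'))
// <prosody rate="110%">Your order has shipped.</prosody>
```

Keeping the multipliers in a single table makes it easy to tune them after listening tests without touching the calling code.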

Practical tip: The most important optimization is sentence-level streaming. Without it, the user waits 2–4 seconds for a response; with it, they hear the first words after roughly 500–1,000 ms. That's the difference between "feels like a robot" and "feels like a conversation."