Connecting LLMs with voice output is the heart of every voice agent. The challenge: How do you create a pipeline that sounds natural, responds quickly, and scales? Here you'll learn the architecture and optimization of the LLM-to-voice pipeline.
Audio Input ──→ ASR ──→ Text ──→ LLM ──→ Response Text ──→ TTS ──→ Audio Output
(User)          (STT)            (Brain)                   (Voice) (User)
// BAD: Sequential, waits for the complete LLM response
const transcript = await elevenlabs.speechToText(audioInput)

const llmResponse = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: transcript }],
})

// TTS cannot start until the full response text is available
const audio = await elevenlabs.textToSpeech.convert(voiceId, {
  text: llmResponse.choices[0].message.content ?? '',
})
// Total latency: 2,000–4,000 ms
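// GOOD: Streaming, forwards LLM tokens directly to TTS
const transcript = await elevenlabs.speechToText(audioInput)

// Start LLM streaming
const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: transcript }],
  stream: true,
})

// Open TTS WebSocket and wait until it is ready before sending.
// Authentication and any session setup messages the TTS API requires are omitted here.
const ttsWs = new WebSocket(
  `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input`
)
await new Promise<void>((resolve) => ttsWs.addEventListener('open', () => resolve()))

let buffer = ''
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content || ''
  buffer += token

  // Sentence detection: send to TTS at sentence end
  if (/[.!?]\s/.test(buffer)) {
    ttsWs.send(JSON.stringify({ text: buffer }))
    buffer = ''
  }
}

// Flush the remainder: the final sentence usually has no trailing
// whitespace, so the check inside the loop never fires for it
if (buffer.trim()) {
  ttsWs.send(JSON.stringify({ text: buffer }))
}
// Total latency: 500–1,000 ms to first audio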
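The gap between the two variants comes down to how many LLM tokens must be generated before TTS can start. The following is a rough arithmetic sketch, not a measurement: the `Timings` shape, the `firstAudioMs` function, and all numbers in `example` are illustrative assumptions chosen to fall inside the ranges quoted in the code comments.

```typescript
// Illustrative timings in milliseconds (assumptions, not measurements).
interface Timings {
  asrMs: number           // speech-to-text
  llmTtftMs: number       // LLM time to first token
  llmMsPerToken: number   // generation time per token after the first
  ttsFirstAudioMs: number // TTS time to first audio for a piece of text
}

// Time until the user hears something, given how many LLM tokens
// have to be generated before text is handed to TTS.
function firstAudioMs(t: Timings, tokensBeforeTts: number): number {
  return t.asrMs + t.llmTtftMs + t.llmMsPerToken * tokensBeforeTts + t.ttsFirstAudioMs
}

const example: Timings = {
  asrMs: 200,
  llmTtftMs: 300,
  llmMsPerToken: 20,
  ttsFirstAudioMs: 200,
}

// Sequential: the whole response (assumed 100 tokens) must be generated first.
const sequentialMs = firstAudioMs(example, 100) // 200 + 300 + 2000 + 200 = 2700

// Streaming: only the first sentence (assumed 8 tokens) must be generated.
const streamingMs = firstAudioMs(example, 8) // 200 + 300 + 160 + 200 = 860
```

Under these assumptions the only term that changes is the token count before TTS starts, which is why a short first sentence matters more than overall response length.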
| Phase | Target | Optimization |
|---|---|---|
| ASR | < 300 ms | Streaming STT, Voice Activity Detection |
| LLM TTFT | < 500 ms | Model choice, prompt caching |
| TTS | < 300 ms | Turbo model, streaming |
| Network | < 100 ms | Regional endpoints, WebSocket |
| Total | < 1,200 ms | End-to-end optimization |
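The budget can be checked mechanically once per-phase timings are measured. This is a minimal sketch: the names `TARGETS_MS`, `BudgetReport`, and `checkBudget` are hypothetical, and only the target values come from the table.

```typescript
// Per-phase targets in milliseconds, taken from the table.
type Phase = 'asr' | 'llmTtft' | 'tts' | 'network'

const PHASES: Phase[] = ['asr', 'llmTtft', 'tts', 'network']

const TARGETS_MS: Record<Phase, number> = {
  asr: 300,
  llmTtft: 500,
  tts: 300,
  network: 100,
}

// The end-to-end target is the sum of the per-phase targets: 1,200 ms.
const TOTAL_TARGET_MS = PHASES.reduce((sum, p) => sum + TARGETS_MS[p], 0)

interface BudgetReport {
  totalMs: number
  withinBudget: boolean
  overBudget: Phase[] // phases that missed their own "< target" limit
}

// Compare measured timings (supplied by the caller) against the targets.
function checkBudget(measuredMs: Record<Phase, number>): BudgetReport {
  const totalMs = PHASES.reduce((sum, p) => sum + measuredMs[p], 0)
  const overBudget = PHASES.filter((p) => measuredMs[p] >= TARGETS_MS[p])
  return { totalMs, withinBudget: totalMs < TOTAL_TARGET_MS, overBudget }
}
```

Reporting the individual phases alongside the total is deliberate: a pipeline can stay under 1,200 ms overall while one phase, for example LLM time to first token, is already past its own limit.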
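1. Reduce LLM latency: choose a model with a low time to first token and use prompt caching.
2. Reduce ASR latency: use streaming STT together with Voice Activity Detection.
3. Reduce TTS latency: use a turbo model and stream the audio.
4. Reduce network latency: use regional endpoints and WebSocket connections.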
The most effective strategy: Split LLM output into sentences and send each sentence immediately to TTS:
LLM Token 1: "I"
LLM Token 2: "can"
LLM Token 3: "help"
LLM Token 4: "you." ← Sentence end detected!
→ Send sentence to TTS: "I can help you."
→ Audio is generated while LLM produces the next sentence
function detectSentenceEnd(buffer: string): boolean {
  // Simple heuristic: sentence punctuation followed by whitespace,
  // or a length cap so text without punctuation is still sent
  return /[.!?]\s/.test(buffer) || buffer.length > 200
}

// Extended: also break at commas, semicolons, and colons for more natural speech
function detectPausePoint(buffer: string): boolean {
  return /[.!?,;:]\s/.test(buffer) || buffer.length > 150
}
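A boolean check has one limitation: by the time it fires, the buffer may already hold the start of the next sentence. One possible refinement, sketched here with a hypothetical `SentenceChunker` class (not part of any SDK), returns only complete sentences and keeps the remainder buffered. It uses the same punctuation-plus-whitespace rule, so a sentence is only recognized once the following token arrives or `flush()` is called.

```typescript
// Hypothetical helper: accumulates streamed tokens and emits complete sentences.
class SentenceChunker {
  private buffer = ''
  private maxLength: number

  constructor(maxLength = 200) {
    this.maxLength = maxLength
  }

  // Add one token and return any sentences that are now complete.
  push(token: string): string[] {
    this.buffer += token
    const sentences: string[] = []

    // A sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/
    let match = boundary.exec(this.buffer)
    while (match !== null) {
      const end = match.index + match[0].length
      sentences.push(this.buffer.slice(0, end).trim())
      this.buffer = this.buffer.slice(end)
      match = boundary.exec(this.buffer)
    }

    // Safety valve: emit overly long text that contains no sentence boundary.
    if (this.buffer.length > this.maxLength) {
      sentences.push(this.buffer.trim())
      this.buffer = ''
    }
    return sentences
  }

  // Call once when the LLM stream ends to retrieve the trailing text.
  flush(): string {
    const rest = this.buffer.trim()
    this.buffer = ''
    return rest
  }
}
```

In a streaming loop, each string returned by `push` would be sent to TTS as its own message, and the result of `flush()` would be sent after the loop if it is non-empty.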
The voice agent adapts its voice to the context:
| Context | Voice Settings |
|---|---|
| Greeting | stability: 0.6, style: 0.4 (friendly, warm) |
| Technical explanation | stability: 0.8, style: 0.1 (clear, factual) |
| Apology | stability: 0.7, style: 0.3 (empathetic) |
| Farewell | stability: 0.5, style: 0.5 (warm, personal) |
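// Assumed minimal shape; the real ConversationContext may carry more fields
type ConversationContext = { intent: string }

function getVoiceSettings(context: ConversationContext) {
  if (context.intent === 'complaint') {
    // Empathetic, matching the "Apology" row
    return { stability: 0.7, similarity_boost: 0.8, style: 0.3 }
  }
  if (context.intent === 'technical_support') {
    // Clear and factual, matching the "Technical explanation" row
    return { stability: 0.8, similarity_boost: 0.8, style: 0.1 }
  }
  // Default: friendly and natural
  return { stability: 0.5, similarity_boost: 0.75, style: 0.3 }
}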
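The table rows can also be kept as data and attached to each TTS message. This is a sketch under two assumptions: the TTS endpoint accepts a `voice_settings` object next to `text` (check the provider's API reference), and `similarity_boost`, which the table does not list, is set to the 0.75 default from `getVoiceSettings`. The names `VoiceContext`, `VOICE_PRESETS`, and `buildTtsMessage` are hypothetical.

```typescript
// Contexts from the table; the key names are chosen for this sketch.
type VoiceContext = 'greeting' | 'technical_explanation' | 'apology' | 'farewell'

interface VoiceSettings {
  stability: number
  similarity_boost: number
  style: number
}

// stability and style come from the table; similarity_boost is assumed.
const VOICE_PRESETS: Record<VoiceContext, VoiceSettings> = {
  greeting: { stability: 0.6, similarity_boost: 0.75, style: 0.4 },
  technical_explanation: { stability: 0.8, similarity_boost: 0.75, style: 0.1 },
  apology: { stability: 0.7, similarity_boost: 0.75, style: 0.3 },
  farewell: { stability: 0.5, similarity_boost: 0.75, style: 0.5 },
}

// Build the JSON payload for one sentence in a given context.
// Assumption: the endpoint reads voice_settings from the same message as text.
function buildTtsMessage(text: string, context: VoiceContext): string {
  return JSON.stringify({ text, voice_settings: VOICE_PRESETS[context] })
}
```

Keeping the presets in one record makes it easier to tune values per context without touching the control flow that selects them.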
Practical tip: The most important optimization is sentence-level streaming. Without streaming, the user waits 2–4 seconds for a response — with streaming, they hear the first words after 500–800 ms. That's the difference between "feels like a robot" and "feels like a conversation."