The ElevenLabs Text-to-Speech API is the main integration point for developers. This guide covers what you need for an integration, from simple REST calls to streaming audio with sub-300 ms latency.
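// Request speech for a fixed voice ID; the API key is read from the environment
const response = await fetch(
  'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM',
  {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'Welcome to our service.',
      model_id: 'eleven_multilingual_v2',
      voice_settings: {
        stability: 0.5,
        similarity_boost: 0.75,
        style: 0.3,
        use_speaker_boost: true,
      },
    }),
  }
)
// Fail early on HTTP errors instead of treating an error body as audio
if (!response.ok) {
  throw new Error(`TTS request failed with status ${response.status}`)
}
// The response body is the encoded audio
const audioBuffer = await response.arrayBuffer()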
The voice_settings fields control how the voice is rendered:
| Parameter | Range | Effect |
|---|---|---|
| stability | 0.0–1.0 | Low = more expressive, High = more consistent |
| similarity_boost | 0.0–1.0 | How close to the original voice |
| style | 0.0–1.0 | Strength of speaking style (increases latency) |
| use_speaker_boost | true/false | Optimizes voice clarity |
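Because the three numeric settings share the 0.0–1.0 range, it can help to normalize them before sending a request. The following is a minimal sketch: `normalizeVoiceSettings`, `clampUnit`, and the fallback values are assumptions for illustration (the fallbacks simply mirror the example request), not part of the ElevenLabs API.

```javascript
// Hypothetical fallback values, copied from the example request in this article.
const VOICE_SETTING_FALLBACKS = {
  stability: 0.5,
  similarity_boost: 0.75,
  style: 0.3,
  use_speaker_boost: true,
}

// Clamps a number into [0, 1]; anything that is not a finite number
// is replaced by the given fallback.
function clampUnit(value, fallback) {
  if (typeof value !== 'number' || !Number.isFinite(value)) return fallback
  return Math.min(1, Math.max(0, value))
}

// Returns a complete voice_settings object with every value in range.
function normalizeVoiceSettings(settings = {}) {
  const f = VOICE_SETTING_FALLBACKS
  return {
    stability: clampUnit(settings.stability, f.stability),
    similarity_boost: clampUnit(settings.similarity_boost, f.similarity_boost),
    style: clampUnit(settings.style, f.style),
    use_speaker_boost:
      typeof settings.use_speaker_boost === 'boolean'
        ? settings.use_speaker_boost
        : f.use_speaker_boost,
  }
}
```

For example, `normalizeVoiceSettings({ stability: 1.4 })` would yield a `stability` of 1 and fill the remaining fields from the fallbacks.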
The available models trade latency against quality and language coverage:
| Model | Latency | Quality | Languages | Use Case |
|---|---|---|---|---|
| eleven_turbo_v2_5 | ~200 ms | Good | 32 | Real-time conversations |
| eleven_multilingual_v2 | ~400 ms | Excellent | 29 | Highest quality |
| eleven_monolingual_v1 | ~300 ms | Very good | 1 (EN) | English only |
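The table can be turned into a simple selection rule. This sketch assumes a hypothetical `pickModel` helper with two made-up options (`maxLatencyMs`, `language`); the model IDs and latency figures are taken from the table, and the helper only checks the English-only restriction, not the full language lists.

```javascript
// Model data copied from the table, ordered from highest to lowest quality.
const MODEL_TABLE = [
  { id: 'eleven_multilingual_v2', latencyMs: 400, englishOnly: false },
  { id: 'eleven_monolingual_v1', latencyMs: 300, englishOnly: true },
  { id: 'eleven_turbo_v2_5', latencyMs: 200, englishOnly: false },
]

// Hypothetical helper: returns the ID of the best-quality model that fits the
// latency budget, or null if none fits. It does not verify that a multilingual
// model actually covers the requested language.
function pickModel({ maxLatencyMs = Infinity, language = 'en' } = {}) {
  const match = MODEL_TABLE.find(
    (m) => m.latencyMs <= maxLatencyMs && (!m.englishOnly || language === 'en')
  )
  return match ? match.id : null
}

console.log(pickModel({ maxLatencyMs: 250 })) // eleven_turbo_v2_5
console.log(pickModel({ maxLatencyMs: 350, language: 'de' })) // eleven_turbo_v2_5
console.log(pickModel()) // eleven_multilingual_v2
```

Ordering the list by quality keeps the rule short: the first entry that satisfies the constraints is also the best one available.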
With the standard endpoint, the client waits until the entire audio file has been generated. With the streaming endpoint, the first bytes arrive in under 300 ms, which is critical for real-time applications.
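// output_format is passed as a query parameter, not in the JSON body
const response = await fetch(
  'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM/stream?output_format=mp3_44100_128',
  {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'This will be returned as a stream.',
      model_id: 'eleven_turbo_v2_5',
    }),
  }
)
// Fail early on HTTP errors before reading the body
if (!response.ok) {
  throw new Error(`Streaming request failed with status ${response.status}`)
}
// Read the stream chunk by chunk, e.g. to forward it directly to the client
const reader = response.body.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  // value is a Uint8Array holding the next audio chunk; process or forward it here
}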
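The read loop leaves chunk handling open. One option, sketched here, is to collect the chunks and join them into a single buffer once the stream ends; `concatChunks` and `readAllChunks` are illustrative names, not library functions. For real-time playback you would forward each chunk as it arrives instead of buffering.

```javascript
// Joins a list of Uint8Array chunks into one contiguous Uint8Array.
function concatChunks(chunks) {
  const total = chunks.reduce((sum, chunk) => sum + chunk.byteLength, 0)
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}

// Reads a ReadableStream (such as response.body) to the end
// and returns all of its bytes as a single Uint8Array.
async function readAllChunks(stream) {
  const reader = stream.getReader()
  const chunks = []
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    chunks.push(value)
  }
  return concatChunks(chunks)
}
```

With the streaming request from this section, `await readAllChunks(response.body)` would return the complete audio, at the cost of giving up the latency advantage of streaming.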
The output format determines audio quality and payload size:
| Format | Quality | Size | Use Case |
|---|---|---|---|
| mp3_44100_128 | High | Medium | Standard |
| mp3_44100_64 | Medium | Small | Mobile |
| pcm_16000 | Raw | Large | Telephony |
| pcm_44100 | Raw | Very large | Post-production |
| ulaw_8000 | Phone | Small | Twilio/SIP |
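The Size column follows from simple arithmetic on bitrate and sample rate. The sketch below estimates payload size per format under stated assumptions: MP3 is treated as constant bitrate with no header overhead, PCM is assumed to be 16-bit mono, and μ-law is assumed to use 8 bits per sample. `estimateAudioBytes` is a hypothetical helper name.

```javascript
// Approximate bytes per second of audio for each format (assumptions above).
const BYTES_PER_SECOND = {
  mp3_44100_128: 128000 / 8, // 128 kbit/s -> 16,000 bytes/s
  mp3_44100_64: 64000 / 8,   // 64 kbit/s  -> 8,000 bytes/s
  pcm_16000: 16000 * 2,      // 16 kHz x 2 bytes per sample -> 32,000 bytes/s
  pcm_44100: 44100 * 2,      // 44.1 kHz x 2 bytes per sample -> 88,200 bytes/s
  ulaw_8000: 8000 * 1,       // 8 kHz x 1 byte per sample -> 8,000 bytes/s
}

// Hypothetical helper: estimated size in bytes for a clip of the given length.
function estimateAudioBytes(format, seconds) {
  const rate = BYTES_PER_SECOND[format]
  if (typeof rate !== 'number') throw new Error(`Unknown format: ${format}`)
  return Math.round(rate * seconds)
}

console.log(estimateAudioBytes('mp3_44100_128', 60)) // 960000
console.log(estimateAudioBytes('pcm_44100', 60)) // 5292000
```

Under these assumptions, one minute of pcm_44100 is roughly 5.5 times larger than the same minute as mp3_44100_128.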
SSML gives you fine-grained control over speech output, although which tags are honored can depend on the model:
<speak>
  Welcome to <emphasis level="strong">EverStrategy.ai</emphasis>.
  <break time="500ms"/>
  Your appointment is on <say-as interpret-as="date">2026-03-15</say-as>.
  That costs <say-as interpret-as="currency">49.99 EUR</say-as>.
</speak>
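When SSML is assembled from user-supplied text, characters such as & and < have to be escaped, or the markup becomes invalid. A minimal sketch, with `escapeXml` and `buildSsml` as hypothetical helper names:

```javascript
// Escapes the five characters that have special meaning in XML.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// Wraps the sentences in <speak> and separates them with a <break> pause.
function buildSsml(sentences, pauseMs = 500) {
  const pause = `<break time="${pauseMs}ms"/>`
  return `<speak>${sentences.map(escapeXml).join(pause)}</speak>`
}

console.log(buildSsml(['Tom & Jerry', 'Hi'], 300))
// <speak>Tom &amp; Jerry<break time="300ms"/>Hi</speak>
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.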
- <break>: insert a pause (time="500ms")
- <emphasis>: emphasis (level: strong, moderate, reduced)
- <say-as>: interpretation (date, currency, telephone, cardinal)
- <phoneme>: exact pronunciation via IPA
- <prosody>: speed, pitch, volume

Total latency = API overhead + Model inference + Text length + Network
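The latency formula can be made concrete with a small calculation. All numbers below are made-up placeholders, not measurements, and `perCharMs` is an assumed linear model of how text length adds to latency; `estimateTotalLatencyMs` is a hypothetical helper name.

```javascript
// Sums the four latency components from the formula. The text-length term is
// modeled as (milliseconds per character) x (number of characters).
function estimateTotalLatencyMs({ apiOverheadMs, inferenceMs, perCharMs, textLength, networkMs }) {
  return apiOverheadMs + inferenceMs + perCharMs * textLength + networkMs
}

// Placeholder values for a 100-character sentence.
const exampleLatencyMs = estimateTotalLatencyMs({
  apiOverheadMs: 50,
  inferenceMs: 200,
  perCharMs: 2,
  textLength: 100,
  networkMs: 50,
})

console.log(exampleLatencyMs) // 500
```

In this model only the text-length term grows with the input, which is why streaming helps most on long texts: playback can start before the whole text has been synthesized.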
| Scenario | Without Optimization | With Optimization |
|---|---|---|
| Single sentence | 600 ms | 200 ms |
| Paragraph (500 chars) | 1,500 ms | 400 ms |
| Full page | 5,000 ms | 800 ms (streaming) |
Practical tip: For real-time voice agents, always use the Turbo model with streaming. For offline production (audiobooks, podcasts), use the Multilingual v2 model — the higher quality is worth the longer wait.