Text-to-Speech API

The ElevenLabs Text-to-Speech API is the main integration point for developers. This section covers what you need, from simple REST calls to streaming audio with sub-300 ms latency.

REST API Basics

Simple TTS Call

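// Requires Node.js 18+ (global fetch) and an ES module (top-level await)
const response = await fetch(
  'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM',
  {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'Welcome to our service.',
      model_id: 'eleven_multilingual_v2',
      voice_settings: {
        stability: 0.5,
        similarity_boost: 0.75,
        style: 0.3,
        use_speaker_boost: true,
      },
    }),
  }
)

// fetch does not reject on HTTP errors, so check the status explicitly
if (!response.ok) {
  throw new Error(`TTS request failed with status ${response.status}`)
}

// The response body is the encoded audio (MP3 unless another format is requested)
const audioBuffer = await response.arrayBuffer()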
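
Because the endpoint URL, headers, and body follow a fixed pattern, the call above can be wrapped in a small builder. The following is a minimal sketch, not part of any ElevenLabs SDK: buildTtsRequest and its option names are hypothetical, and the default model is simply the one used in the example above.

```javascript
// Hypothetical helper: builds the URL and fetch options for a
// text-to-speech request without sending it.
function buildTtsRequest({ voiceId, text, apiKey, modelId = 'eleven_multilingual_v2', stream = false }) {
  if (!apiKey) throw new Error('Missing API key')
  if (!text || !text.trim()) throw new Error('Text must not be empty')

  const base = 'https://api.elevenlabs.io/v1/text-to-speech'
  // Append /stream for the streaming variant of the endpoint
  const url = `${base}/${encodeURIComponent(voiceId)}${stream ? '/stream' : ''}`

  return {
    url,
    init: {
      method: 'POST',
      headers: { 'xi-api-key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, model_id: modelId }),
    },
  }
}
```

The result can be passed straight to fetch(url, init), which keeps input validation separate from the network call and makes the request easy to inspect.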

Voice Settings Explained

Parameter	Range	Effect
stability	0.0–1.0	Low = more expressive, High = more consistent
similarity_boost	0.0–1.0	How close to the original voice
style	0.0–1.0	Strength of speaking style (increases latency)
use_speaker_boost	true/false	Boosts similarity to the original speaker (slightly increases latency)
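
Since the three numeric settings share the 0.0–1.0 range, it can help to clamp user-supplied values before sending them. This is a sketch under stated assumptions: normalizeVoiceSettings is a hypothetical helper, and the fallback values are taken from the example request above (style falls back to 0 here because the table notes that it increases latency).

```javascript
// Hypothetical helper: clamps numeric voice settings into the 0.0–1.0 range
// and fills in fallbacks for missing or non-numeric values.
function normalizeVoiceSettings(settings = {}) {
  const clamp = (value, fallback) => {
    const n = Number(value)
    if (!Number.isFinite(n)) return fallback
    return Math.min(1, Math.max(0, n))
  }

  return {
    stability: clamp(settings.stability, 0.5),
    similarity_boost: clamp(settings.similarity_boost, 0.75),
    style: clamp(settings.style, 0), // assumed fallback, not an API default
    // Treated as enabled unless explicitly set to false (assumption)
    use_speaker_boost: settings.use_speaker_boost !== false,
  }
}
```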

Model Selection

Model	Latency	Quality	Languages	Use Case
eleven_turbo_v2_5	~200 ms	Good	32	Real-time conversations
eleven_multilingual_v2	~400 ms	Excellent	29	Highest quality
eleven_monolingual_v1	~300 ms	Very good	1 (EN)	English only
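
The trade-off in the table can be expressed as a simple selection rule: take the highest-quality model that fits the latency budget and supports the language. The sketch below is illustrative only. pickModel is a hypothetical helper, and the latency figures are the approximate values from the table, not measurements.

```javascript
// Models ordered from highest to lowest quality, per the table above.
// Latencies are the approximate table values, not guarantees.
const MODELS = [
  { id: 'eleven_multilingual_v2', latencyMs: 400, englishOnly: false },
  { id: 'eleven_monolingual_v1', latencyMs: 300, englishOnly: true },
  { id: 'eleven_turbo_v2_5', latencyMs: 200, englishOnly: false },
]

// Hypothetical helper: returns the best-quality model id within the
// latency budget, or null if no model fits.
function pickModel({ maxLatencyMs, language = 'en' }) {
  const match = MODELS.find(
    (m) => m.latencyMs <= maxLatencyMs && (!m.englishOnly || language === 'en')
  )
  return match ? match.id : null
}
```

For example, a 250 ms budget selects eleven_turbo_v2_5, while a generous budget selects eleven_multilingual_v2.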

Streaming Audio

Why Streaming?

With the standard endpoint, the client waits until the entire audio file has been generated. With streaming, the first bytes can arrive in under 300 ms, which is critical for real-time applications.

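// output_format is passed as a query parameter, not in the JSON body
const response = await fetch(
  'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM/stream?output_format=mp3_44100_128',
  {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'This will be returned as a stream.',
      model_id: 'eleven_turbo_v2_5',
    }),
  }
)

// fetch does not reject on HTTP errors, so check the status before reading
if (!response.ok) {
  throw new Error(`Streaming request failed with status ${response.status}`)
}

// Read the audio chunk by chunk as it arrives
const reader = response.body.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  // value is a Uint8Array of audio bytes; forward it to the client or player here
}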
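
The read loop above can be factored into a reusable function that hands each chunk to a callback. This is a minimal sketch: pumpStream is a hypothetical name, and it works with any web ReadableStream of byte chunks, such as response.body in Node.js 18+ or a browser.

```javascript
// Hypothetical helper: reads a ReadableStream to the end, passes every
// chunk to onChunk, and returns the total number of bytes seen.
async function pumpStream(stream, onChunk) {
  const reader = stream.getReader()
  let totalBytes = 0
  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) break
      totalBytes += value.byteLength
      // Await the callback so a slow consumer applies backpressure
      await onChunk(value)
    }
  } finally {
    // Release the lock even if onChunk throws
    reader.releaseLock()
  }
  return totalBytes
}
```

A call such as pumpStream(response.body, (chunk) => res.write(chunk)) would forward audio to an HTTP client as it arrives; res stands for whatever writable target the application uses.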

Output Formats

Format	Quality	Size	Use Case
mp3_44100_128	High	Medium	Standard
mp3_44100_64	Medium	Small	Mobile
pcm_16000	Raw	Large	Telephony
pcm_44100	Raw	Very large	Post-production
ulaw_8000	Phone	Small	Twilio/SIP
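
The Size column follows from each format's data rate. The sketch below estimates payload size under explicit assumptions: mono audio, 16-bit samples for PCM, 8-bit samples for mu-law, and constant-bitrate MP3 with container overhead ignored. estimateAudioBytes is a hypothetical helper, so treat the results as rough planning figures.

```javascript
// Approximate bytes per second of audio for each format in the table.
// Assumes mono audio; MP3 figures ignore frame and header overhead.
const BYTES_PER_SECOND = {
  mp3_44100_128: 128000 / 8, // 128 kbit/s
  mp3_44100_64: 64000 / 8,   // 64 kbit/s
  pcm_16000: 16000 * 2,      // 16 kHz, 2 bytes per sample
  pcm_44100: 44100 * 2,      // 44.1 kHz, 2 bytes per sample
  ulaw_8000: 8000 * 1,       // 8 kHz, 1 byte per sample
}

// Hypothetical helper: rough payload size for a clip of the given length.
function estimateAudioBytes(format, seconds) {
  const rate = BYTES_PER_SECOND[format]
  if (rate === undefined) throw new Error(`Unknown format: ${format}`)
  return Math.round(rate * seconds)
}
```

Under these assumptions, ten seconds of pcm_44100 is about 5.5 times larger than the same clip as mp3_44100_128.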

SSML Support

Speech Synthesis Markup Language

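SSML gives you fine-grained control over speech output, such as pauses, emphasis, and how dates or amounts are read. Tag support can differ between models, so test each tag with the model you plan to use: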

<speak>
  Welcome to <emphasis level="strong">EverStrategy.ai</emphasis>.
  <break time="500ms"/>
  Your appointment is on <say-as interpret-as="date">2026-03-15</say-as>.
  That costs <say-as interpret-as="currency">49.99 EUR</say-as>.
</speak>

Supported SSML Tags

<break> — Insert pause (time="500ms")
<emphasis> — Emphasis (level: strong, moderate, reduced)
<say-as> — Interpretation: date, currency, telephone, cardinal
<phoneme> — Exact pronunciation via IPA
<prosody> — Speed, pitch, volume
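
Before sending marked-up text, it can be useful to check that it only uses the tags listed above. The following sketch does a simple regex scan. findUnlistedTags is a hypothetical helper that compares against this section's list only; it does not validate the XML and does not reflect what the API accepts for a given model.

```javascript
// Tags named in this section, plus the <speak> wrapper from the example.
const LISTED_TAGS = new Set(['speak', 'break', 'emphasis', 'say-as', 'phoneme', 'prosody'])

// Hypothetical helper: returns the names of tags in the input that are
// not in LISTED_TAGS, each reported once, in order of first appearance.
function findUnlistedTags(ssml) {
  const found = new Set()
  // Matches opening and closing tags and captures the tag name
  for (const match of ssml.matchAll(/<\/?([A-Za-z][\w-]*)/g)) {
    const name = match[1].toLowerCase()
    if (!LISTED_TAGS.has(name)) found.add(name)
  }
  return [...found]
}
```

An empty result means every tag in the text appears in the list above; anything else is worth reviewing before the request is sent.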

Latency Optimization

The Latency Formula

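Total latency ≈ API overhead + model inference time + network round trip

Text length is not a separate term: longer input increases inference time and the amount of audio to transfer.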

Optimization Strategies

Use the Turbo model: eleven_turbo_v2_5 instead of eleven_multilingual_v2
Enable streaming: first bytes in under 300 ms
Chunk text: split long texts into sentences and send them in parallel
Use a CDN: cache generated audio and serve it from the CDN
Regional endpoints: keep the server location close to the user
Pre-generation: generate and cache common announcements ahead of time
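
The text-chunking strategy can be sketched as follows. Both function names are hypothetical, and synthesize stands for any async function that turns one sentence into audio, for example a wrapper around the REST call shown earlier. The splitter is deliberately naive: it breaks after ., ! or ? followed by whitespace, so abbreviations such as "Dr. Smith" would be split incorrectly.

```javascript
// Naive sentence splitter: breaks after ., ! or ? followed by whitespace.
function splitIntoSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((sentence) => sentence.trim())
    .filter(Boolean)
}

// Hypothetical helper: synthesizes all sentences concurrently.
// Promise.all preserves input order, so the results can be played back
// or concatenated in sequence.
async function synthesizeInParallel(text, synthesize) {
  const sentences = splitIntoSentences(text)
  return Promise.all(sentences.map((sentence) => synthesize(sentence)))
}
```

Sending every sentence at once may run into account concurrency limits, so a production version would likely cap the number of requests in flight.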

Latency Benchmarks

Scenario	Without Optimization	With Optimization
Single sentence	600 ms	200 ms
Paragraph (500 chars)	1,500 ms	400 ms
Full page	5,000 ms	800 ms (streaming)
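
Figures like these depend on network, region, and account, so it is worth measuring them in your own environment. The sketch below times how long the first audio chunk takes to arrive. measureFirstChunkMs is a hypothetical helper; startRequest is any function that returns a fetch-style response with a readable body, and the clock is injectable for testing.

```javascript
// Hypothetical helper: returns the milliseconds between starting the
// request and receiving the first chunk of the response body.
async function measureFirstChunkMs(startRequest, now = () => performance.now()) {
  const startedAt = now()
  const response = await startRequest()
  const reader = response.body.getReader()

  const { done } = await reader.read()
  const elapsedMs = now() - startedAt

  // Stop the download; only the first chunk matters for this measurement
  await reader.cancel()

  if (done) throw new Error('Stream ended before any data arrived')
  return elapsedMs
}
```

Passing () => fetch(url, init) as startRequest for both the standard and the streaming endpoint gives a like-for-like comparison of the two columns.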

Practical tip: For real-time voice agents, always use the Turbo model with streaming. For offline production (audiobooks, podcasts), use the Multilingual v2 model — the higher quality is worth the longer wait.