Lesson 3 of 5 · 11 min read

Text-to-Speech API

The ElevenLabs Text-to-Speech API is the main integration point for developers. This lesson covers what you need for an integration, from simple REST calls to streaming audio with sub-300 ms latency.

REST API Basics

Simple TTS Call

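// Requires Node 18+ (global fetch) and an ES module context for top-level await.
// The last path segment of the URL is the voice ID.
const response = await fetch(
  'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM',
  {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'Welcome to our service.',
      model_id: 'eleven_multilingual_v2',
      voice_settings: {
        stability: 0.5,
        similarity_boost: 0.75,
        style: 0.3,
        use_speaker_boost: true,
      },
    }),
  }
)

// On failure the body holds an error description instead of audio
if (!response.ok) {
  throw new Error(`TTS request failed: ${response.status} ${await response.text()}`)
}

// The response body is the encoded audio (MP3 by default)
const audioBuffer = await response.arrayBuffer()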

Voice Settings Explained

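| Parameter | Range | Effect |
| --- | --- | --- |
| stability | 0.0–1.0 | Low = more expressive, high = more consistent |
| similarity_boost | 0.0–1.0 | How closely the output follows the original voice |
| style | 0.0–1.0 | Strength of the speaking style (increases latency) |
| use_speaker_boost | true/false | Boosts similarity to the original speaker (slightly increases latency) |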
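
Because the three numeric settings share the 0.0–1.0 range, it can help to normalize them before sending a request. The following is a minimal sketch, not part of any ElevenLabs SDK: `normalizeVoiceSettings`, `clamp01`, and `DEFAULT_VOICE_SETTINGS` are hypothetical names, and the defaults are simply the values from the example call above.

```javascript
// Hypothetical helper (not an ElevenLabs API): clamps the numeric voice
// settings to the 0.0–1.0 range listed in the table above.
// Defaults are assumptions copied from the example request in this lesson.
const DEFAULT_VOICE_SETTINGS = {
  stability: 0.5,
  similarity_boost: 0.75,
  style: 0.3,
  use_speaker_boost: true,
}

function clamp01(value, fallback) {
  // Non-numeric or NaN input falls back to the default value
  if (typeof value !== 'number' || Number.isNaN(value)) return fallback
  return Math.min(1, Math.max(0, value))
}

function normalizeVoiceSettings(overrides = {}) {
  const merged = { ...DEFAULT_VOICE_SETTINGS, ...overrides }
  return {
    stability: clamp01(merged.stability, DEFAULT_VOICE_SETTINGS.stability),
    similarity_boost: clamp01(merged.similarity_boost, DEFAULT_VOICE_SETTINGS.similarity_boost),
    style: clamp01(merged.style, DEFAULT_VOICE_SETTINGS.style),
    use_speaker_boost: Boolean(merged.use_speaker_boost),
  }
}
```

The result can be passed directly as `voice_settings` in the request body, for example `voice_settings: normalizeVoiceSettings({ stability: 0.8 })`.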

Model Selection

| Model | Latency | Quality | Languages | Use Case |
| --- | --- | --- | --- | --- |
| eleven_turbo_v2_5 | ~200 ms | Good | 32 | Real-time conversations |
| eleven_multilingual_v2 | ~400 ms | Excellent | 29 | Highest quality |
| eleven_monolingual_v1 | ~300 ms | Very good | 1 (EN) | English only |

Streaming Audio

Why Streaming?

With the standard endpoint, the client waits until the entire audio file has been generated. With the /stream endpoint, the first bytes can arrive in under 300 ms, which is critical for real-time applications.

// output_format is passed as a query parameter, not in the JSON body
const response = await fetch(
  'https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM/stream?output_format=mp3_44100_128',
  {
    method: 'POST',
    headers: {
      'xi-api-key': process.env.ELEVENLABS_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      text: 'This will be returned as a stream.',
      model_id: 'eleven_turbo_v2_5',
    }),
  }
)

if (!response.ok) {
  throw new Error(`Streaming request failed: ${response.status}`)
}

// Read the audio chunk by chunk as it arrives
const reader = response.body.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  // value is a Uint8Array of audio bytes; forward it to the client or a player here
}
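
To check whether streaming actually delivers the first bytes quickly in your setup, the read loop above can be wrapped in a small helper that records when the first chunk arrives. This is a sketch under assumptions: `collectStream` is a hypothetical name, and it buffers the whole stream in memory, which suits measurement and short clips rather than long audio.

```javascript
// Hypothetical helper: reads a ReadableStream of Uint8Array chunks to the end.
// `startedAt` should be a Date.now() timestamp taken just before the request
// was sent, so that firstChunkMs approximates time to first audio byte.
async function collectStream(stream, startedAt = Date.now()) {
  const reader = stream.getReader()
  const chunks = []
  let firstChunkMs = null
  let totalBytes = 0

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    if (firstChunkMs === null) firstChunkMs = Date.now() - startedAt
    chunks.push(value)
    totalBytes += value.byteLength
  }

  // Concatenate all chunks into one contiguous buffer
  const audio = new Uint8Array(totalBytes)
  let offset = 0
  for (const chunk of chunks) {
    audio.set(chunk, offset)
    offset += chunk.byteLength
  }
  return { audio, firstChunkMs, chunkCount: chunks.length }
}
```

Usage would look like `const t0 = Date.now()`, then the `fetch` call, then `const { audio, firstChunkMs } = await collectStream(response.body, t0)`.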

Output Formats

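| Format | Quality | Size | Use Case |
| --- | --- | --- | --- |
| mp3_44100_128 | High | Medium | Standard |
| mp3_44100_64 | Medium | Small | Mobile |
| pcm_16000 | Raw | Large | Telephony |
| pcm_44100 | Raw | Very large | Post-production |
| ulaw_8000 | Phone | Small | Twilio/SIP |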
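
The Size column follows from simple arithmetic on each format's data rate. The sketch below estimates payload size from the format name and clip duration. It assumes constant-bitrate MP3, 16-bit mono PCM, and 8-bit mono µ-law, and it ignores container overhead; `estimateAudioBytes` is a hypothetical helper, not an API call.

```javascript
// Approximate bytes per second for each format in the table above.
// Assumptions: MP3 is constant bitrate; PCM is 16-bit mono (2 bytes per
// sample); µ-law is 8-bit mono (1 byte per sample).
const BYTES_PER_SECOND = new Map([
  ['mp3_44100_128', 128000 / 8], // 128 kbit/s
  ['mp3_44100_64', 64000 / 8],   // 64 kbit/s
  ['pcm_16000', 16000 * 2],
  ['pcm_44100', 44100 * 2],
  ['ulaw_8000', 8000 * 1],
])

// Hypothetical helper: estimates the audio payload size in bytes.
function estimateAudioBytes(format, seconds) {
  const rate = BYTES_PER_SECOND.get(format)
  if (rate === undefined) throw new Error(`Unknown output format: ${format}`)
  return Math.round(rate * seconds)
}
```

Under these assumptions, ten seconds of pcm_44100 comes to roughly 5.5 times the size of the same clip in mp3_44100_128.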

SSML Support

Speech Synthesis Markup Language

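SSML gives you fine-grained control over speech output. Tag support varies by model, and ElevenLabs' own documentation covers only <break> and <phoneme>, so test the other tags against your model before relying on them: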

<speak>
  Welcome to <emphasis level="strong">EverStrategy.ai</emphasis>.
  <break time="500ms"/>
  Your appointment is on <say-as interpret-as="date">2026-03-15</say-as>.
  That costs <say-as interpret-as="currency">49.99 EUR</say-as>.
</speak>

Supported SSML Tags

  • <break> — Insert pause (time="500ms")
  • <emphasis> — Emphasis (level: strong, moderate, reduced)
  • <say-as> — Interpretation: date, currency, telephone, cardinal
  • <phoneme> — Exact pronunciation via IPA
  • <prosody> — Speed, pitch, volume
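
When SSML is assembled from user-supplied text, characters such as & and < need escaping so they are not parsed as markup. The sketch below follows standard XML escaping rules and joins text segments with a <break> tag. `escapeXml` and `joinWithBreaks` are hypothetical helpers, and whether a given endpoint decodes XML entities is an assumption worth verifying with a test request.

```javascript
// Escapes characters that would otherwise be interpreted as markup.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Hypothetical helper: joins text segments with a <break> tag of the
// given duration in milliseconds.
function joinWithBreaks(segments, pauseMs = 500) {
  if (!Number.isInteger(pauseMs) || pauseMs < 0) {
    throw new RangeError('pauseMs must be a non-negative integer')
  }
  const tag = `<break time="${pauseMs}ms"/>`
  return segments.map((segment) => escapeXml(segment)).join(` ${tag} `)
}
```

For example, `joinWithBreaks(['Welcome.', 'Your appointment is confirmed.'], 500)` produces a single string with one pause between the two sentences.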

Latency Optimization

The Latency Formula

Total latency = API overhead + Model inference + Generation time (grows with text length) + Network

Optimization Strategies

  1. Use the Turbo model: eleven_turbo_v2_5 instead of eleven_multilingual_v2
  2. Enable streaming: first bytes in under 300 ms
  3. Chunk text: split long texts into sentences, request them in parallel, and play the results in the original order
  4. Use a CDN: cache generated audio and serve it from the CDN
  5. Regional endpoints: choose a server location close to the user
  6. Pre-generation: generate and cache common announcements ahead of time
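
The chunking strategy (item 3) can be sketched as follows. The sentence splitter is deliberately naive (it would also split after abbreviations such as "Dr."), and `synthesize` is a placeholder for whatever function performs the actual TTS request, so all names here are assumptions. A concurrency cap is included because firing every sentence at once may hit account rate limits.

```javascript
// Naive splitter: breaks text at ., ! or ? followed by whitespace.
function splitIntoSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0)
}

// Runs `synthesize` (a caller-supplied async function, hypothetical here)
// on every sentence with at most `concurrency` calls in flight.
// Results are returned in the original sentence order.
async function synthesizeInParallel(text, synthesize, concurrency = 3) {
  const sentences = splitIntoSentences(text)
  const results = new Array(sentences.length)
  let next = 0

  // Each worker repeatedly claims the next unprocessed sentence index
  async function worker() {
    while (next < sentences.length) {
      const index = next++
      results[index] = await synthesize(sentences[index])
    }
  }

  const workerCount = Math.min(concurrency, sentences.length)
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}
```

Because each result is stored at its sentence's index, playback order stays correct even when a later sentence finishes first.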

Latency Benchmarks

| Scenario | Without Optimization | With Optimization |
| --- | --- | --- |
| Single sentence | 600 ms | 200 ms |
| Paragraph (500 chars) | 1,500 ms | 400 ms |
| Full page | 5,000 ms | 800 ms (streaming) |

Practical tip: For real-time voice agents, always use the Turbo model with streaming. For offline production (audiobooks, podcasts), use the Multilingual v2 model — the higher quality is worth the longer wait.
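
The tip above can be encoded as a small configuration helper so the choice is made in one place. This is a sketch: `chooseTtsConfig` and `buildTtsUrl` are hypothetical names, and the two use-case labels are assumptions for illustration.

```javascript
// Hypothetical helper: maps a use case to the model and endpoint style
// recommended in this lesson.
function chooseTtsConfig(useCase) {
  switch (useCase) {
    case 'realtime': // voice agents, live conversations
      return { modelId: 'eleven_turbo_v2_5', streaming: true }
    case 'offline': // audiobooks, podcasts
      return { modelId: 'eleven_multilingual_v2', streaming: false }
    default:
      throw new Error(`Unknown use case: ${useCase}`)
  }
}

// Builds the endpoint URL used in the examples of this lesson.
function buildTtsUrl(voiceId, { streaming }) {
  const base = `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`
  return streaming ? `${base}/stream` : base
}
```

For example, `buildTtsUrl('21m00Tcm4TlvDq8ikWAM', chooseTtsConfig('realtime'))` yields the streaming URL used earlier in this lesson.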