Lesson 4 of 5 · 10 min read

Speech-to-Text & Audio Intelligence

ElevenLabs is best known for TTS, but the platform also offers speech-to-text capabilities and audio analysis tools. This lesson covers transcription, speaker diarization, and real-time processing.

Transcription API

Core Function

The ElevenLabs Speech-to-Text API transcribes audio to text with high accuracy:

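const formData = new FormData()
// The ElevenLabs API reference names this multipart field 'file'; verify against the current docs.
formData.append('file', audioFile)
formData.append('model_id', 'scribe_v1')
formData.append('language_code', 'en')

const response = await fetch(
  'https://api.elevenlabs.io/v1/speech-to-text',
  {
    method: 'POST',
    // Do not set Content-Type manually; fetch adds the multipart boundary itself.
    headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY },
    body: formData,
  }
)
if (!response.ok) {
  throw new Error(`Transcription failed: ${response.status} ${await response.text()}`)
}
const result = await response.json()
// Illustrative shape: { text: "Welcome to...", language: "en", segments: [...] }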

Supported Formats

  • Audio: MP3, WAV, FLAC, OGG, M4A, WebM
  • Maximum file size: 1 GB
  • Languages: 99+ languages with automatic detection
  • Accuracy: Word Error Rate (WER) of 3–5% for German
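
The limits listed above can be checked on the client before an upload is attempted. The following is a minimal sketch, not part of the ElevenLabs SDK: `checkAudioFile` is a hypothetical helper, the extension list simply mirrors the formats named in this lesson, and "1 GB" is assumed to mean 2^30 bytes.

```javascript
// Limits copied from the list in this lesson; adjust them if the official docs differ.
const SUPPORTED_EXTENSIONS = ['mp3', 'wav', 'flac', 'ogg', 'm4a', 'webm']
const MAX_FILE_BYTES = 1024 ** 3 // "1 GB", assumed here to mean 2^30 bytes

// Hypothetical helper: returns a list of problems, empty if the file looks uploadable.
function checkAudioFile(fileName, sizeBytes) {
  const problems = []
  const dot = fileName.lastIndexOf('.')
  const extension = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase()
  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    problems.push(`unsupported format: "${extension || 'none'}"`)
  }
  if (sizeBytes > MAX_FILE_BYTES) {
    problems.push(`file too large: ${sizeBytes} bytes`)
  }
  return problems
}

console.log(checkAudioFile('interview.MP3', 5000000)) // []
console.log(checkAudioFile('notes.txt', 10)) // [ 'unsupported format: "txt"' ]
```

Checking the extension only is a cheap first filter; the server still decides whether the actual audio content is decodable.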

Quality Comparison

System              WER German   WER English   Real-time   Price/Min
ElevenLabs Scribe   ~4%          ~3%           Yes         €0.005
OpenAI Whisper      ~5%          ~4%           No          Free (local)
Deepgram            ~4%          ~3%           Yes         €0.0043
Azure Speech        ~5%          ~4%           Yes         €0.0093
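
To make the WER figures in this comparison concrete: Word Error Rate is commonly defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. `wordErrorRate` is an illustrative function, and the simple lowercase-and-split normalization is an assumption; published benchmarks usually normalize punctuation and numbers more carefully.

```javascript
// Word Error Rate = word-level edit distance / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean)
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean)
  if (ref.length === 0) throw new Error('reference must contain at least one word')

  // prev[j] = edit distance between the ref words seen so far and the first j hyp words
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j)
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i] // deleting all i reference words
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1)
      const deletion = prev[j] + 1
      const insertion = curr[j - 1] + 1
      curr[j] = Math.min(substitution, deletion, insertion)
    }
    prev = curr
  }
  return prev[hyp.length] / ref.length
}

// One dropped word ("a") and one wrong word ("older"): 2 errors over 7 reference words.
const wer = wordErrorRate(
  'I have a question about my order',
  'I have question about my older'
)
console.log(wer.toFixed(3)) // 0.286
```

A WER of about 4% therefore corresponds to roughly 4 word-level errors per 100 reference words.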

Speaker Diarization

Who Said What?

Speaker diarization detects and separates different speakers in a recording:

{
  "segments": [
    { "speaker": "Speaker_1", "start": 0.0, "end": 3.5, "text": "Good day, how can I help?" },
    { "speaker": "Speaker_2", "start": 3.8, "end": 7.2, "text": "I have a question about my order." },
    { "speaker": "Speaker_1", "start": 7.5, "end": 11.0, "text": "Of course, can you give me your order number?" }
  ]
}
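
A diarized result like the one above is easy to post-process. The sketch below assumes exactly the segment shape shown in this lesson (`speaker`, `start`, `end`, `text`); `talkTimeBySpeaker` and `formatTranscript` are illustrative helper names, not API functions.

```javascript
const result = {
  segments: [
    { speaker: 'Speaker_1', start: 0.0, end: 3.5, text: 'Good day, how can I help?' },
    { speaker: 'Speaker_2', start: 3.8, end: 7.2, text: 'I have a question about my order.' },
    { speaker: 'Speaker_1', start: 7.5, end: 11.0, text: 'Of course, can you give me your order number?' },
  ],
}

// Sum the duration of all segments per speaker (in seconds).
function talkTimeBySpeaker(segments) {
  const totals = {}
  for (const { speaker, start, end } of segments) {
    totals[speaker] = (totals[speaker] ?? 0) + (end - start)
  }
  return totals
}

// Render one line per segment: "[start] speaker: text".
function formatTranscript(segments) {
  return segments
    .map(({ speaker, start, text }) => `[${start.toFixed(1)}s] ${speaker}: ${text}`)
    .join('\n')
}

console.log(formatTranscript(result.segments))
// [0.0s] Speaker_1: Good day, how can I help?
// [3.8s] Speaker_2: I have a question about my order.
// [7.5s] Speaker_1: Of course, can you give me your order number?

const talkTime = talkTimeBySpeaker(result.segments)
```

Per-speaker talk time is a typical starting point for the call center and meeting use cases described in this lesson.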

Accuracy by Speaker Count

Speakers   Accuracy   Recommendation
2          95–98%     Excellent
3–5        90–95%     Very good
6–10       80–90%     Good, manual review recommended
10+        70–80%     Challenging, pre-enrollment helps

Use Cases

  • Meeting minutes: Automatic attribution of statements to participants
  • Call center analysis: Analyze customer vs. agent separately
  • Interview transcription: Clearly separate questions and answers
  • Compliance recording: Who said what and when?

Audio Analysis

Sentiment & Emotion Detection

Beyond pure transcription, ElevenLabs also analyzes emotional content:

  • Sentiment: Positive, neutral, negative
  • Emotions: Joy, anger, sadness, surprise, fear
  • Intensity: Strength of detected emotion
  • Trends: Emotional progression throughout the conversation

Speech Analysis Metrics

  • Speaking speed: Words per minute
  • Pauses: Frequency and duration of silence
  • Overlaps: How often do people speak simultaneously?
  • Filler words: Frequency of "um", "uh", "well"
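
The metrics above can also be derived locally from timestamped segments. The following sketch assumes the segment shape used in the diarization example (`start`, `end`, `text`) and is not an ElevenLabs API call. `speechMetrics` is a hypothetical helper, overlaps are only detected between neighbouring segments, and counting "well" as a filler will also catch legitimate uses of the word.

```javascript
const FILLER_WORDS = new Set(['um', 'uh', 'well'])

// Derive speaking speed, pauses, overlaps, and filler counts from segments.
function speechMetrics(segments) {
  const sorted = [...segments].sort((a, b) => a.start - b.start)
  let wordCount = 0
  let fillerCount = 0
  let speakingSeconds = 0
  let overlapCount = 0
  const pauses = []

  sorted.forEach((segment, index) => {
    const words = segment.text.toLowerCase().match(/[\p{L}']+/gu) ?? []
    wordCount += words.length
    fillerCount += words.filter((word) => FILLER_WORDS.has(word)).length
    speakingSeconds += segment.end - segment.start
    if (index > 0) {
      // Positive gap = silence, negative gap = this segment starts before the previous one ends.
      const gap = segment.start - sorted[index - 1].end
      if (gap > 0) pauses.push(gap)
      if (gap < 0) overlapCount += 1
    }
  })

  return {
    wordsPerMinute: speakingSeconds > 0 ? (wordCount / speakingSeconds) * 60 : 0,
    pauseCount: pauses.length,
    totalPauseSeconds: pauses.reduce((sum, gap) => sum + gap, 0),
    overlapCount,
    fillerCount,
  }
}

const metrics = speechMetrics([
  { speaker: 'A', start: 0, end: 2, text: 'Um, well, hello there' },
  { speaker: 'B', start: 3, end: 5, text: 'Uh, hi' },
  { speaker: 'A', start: 4, end: 6, text: 'Yes' },
])
console.log(metrics)
```

In this sample there are 7 words over 6 seconds of speech, one pause (between 2 s and 3 s), one overlap (the third segment starts at 4 s while the second runs until 5 s), and three filler words.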

Real-Time Transcription

WebSocket-Based

For live applications, ElevenLabs offers real-time transcription via WebSocket:

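// Endpoint and message format as given in this lesson; verify against the current API reference.
// mediaRecorder is assumed to be an already started MediaRecorder instance.
const ws = new WebSocket(
  'wss://api.elevenlabs.io/v1/speech-to-text/stream'
)

ws.onopen = () => {
  // The first message configures the session
  ws.send(JSON.stringify({
    type: 'config',
    api_key: process.env.ELEVENLABS_API_KEY,
    model_id: 'scribe_v1',
    language_code: 'en',
  }))
}

// Send audio chunks, but only while the socket is open
mediaRecorder.ondataavailable = (event) => {
  if (ws.readyState === WebSocket.OPEN && event.data.size > 0) {
    ws.send(event.data)
  }
}

// Receive transcripts
ws.onmessage = (event) => {
  const data = JSON.parse(event.data)
  if (data.text) console.log('Transcript:', data.text)
}

ws.onerror = (event) => console.error('WebSocket error:', event)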

Latency Expectations

  • Interim results: ~200 ms (preliminary, may change)
  • Final results: ~500 ms (confirmed, stable)
  • End-of-speech detection: ~300 ms after speaking ends
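
Because interim results may still change while final results are stable, a client typically keeps the two apart. The sketch below shows one way to do that. The message shape `{ text, is_final }` is an assumption made for illustration and is not taken from the ElevenLabs documentation; `createTranscriptBuffer` is a hypothetical helper.

```javascript
// Assumed message shape (hypothetical): { text: string, is_final: boolean }
function createTranscriptBuffer() {
  const finalized = [] // confirmed utterances, never rewritten
  let interim = ''     // latest preliminary text, replaced on every update

  return {
    handle(message) {
      if (message.is_final) {
        finalized.push(message.text)
        interim = ''
      } else {
        interim = message.text
      }
    },
    // Text that is safe to store or pass on to an LLM.
    stableText() {
      return finalized.join(' ')
    },
    // Text to show in live subtitles: confirmed text plus the current guess.
    displayText() {
      return [...finalized, interim].filter(Boolean).join(' ')
    },
  }
}

const buffer = createTranscriptBuffer()
buffer.handle({ text: 'Good', is_final: false })
buffer.handle({ text: 'Good day', is_final: false })
buffer.handle({ text: 'Good day.', is_final: true })
buffer.handle({ text: 'How can', is_final: false })
console.log(buffer.displayText()) // Good day. How can
console.log(buffer.stableText())  // Good day.
```

Showing interim text keeps subtitles responsive, while only acting on final text avoids processing words that are later revised.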

Applications

  • Live subtitles: For webinars, conferences, live streams
  • Voice commands: Recognize voice commands in real time
  • Meeting AI: Live transcription during meetings
  • Accessibility: Real-time subtitles for hearing-impaired users

Practical tip: Combine speech-to-text with the TTS API to build a complete voice pipeline (audio in → transcription → LLM processing → response audio out). This pipeline is the foundation of a typical voice agent.
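
The pipeline from the tip can be sketched as a single function that chains the three stages. Everything here is illustrative: `createVoiceAgent`, `transcribe`, `generateReply`, and `synthesize` are hypothetical names, and the stages are passed in as functions so the sketch runs with stand-ins instead of real API calls.

```javascript
// Wire the three stages together: audio in → transcription → LLM → audio out.
function createVoiceAgent({ transcribe, generateReply, synthesize }) {
  return async function handleTurn(audioIn) {
    const { text } = await transcribe(audioIn)
    if (!text || !text.trim()) return null // nothing intelligible was said
    const reply = await generateReply(text)
    const audioOut = await synthesize(reply)
    return { transcript: text, reply, audioOut }
  }
}

// Stand-in stages for demonstration; real ones would call the STT, LLM, and TTS APIs.
const agent = createVoiceAgent({
  transcribe: async (audio) => ({ text: 'I have a question about my order.' }),
  generateReply: async (text) => `You said: ${text}`,
  synthesize: async (reply) => ({ format: 'mp3', source: reply }),
})

agent('fake-audio-bytes').then((turn) => console.log(turn.reply))
// You said: I have a question about my order.
```

Injecting the stages keeps the orchestration testable and makes it straightforward to swap a provider without touching the pipeline logic.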