Speech-to-Text & Audio Intelligence

ElevenLabs is best known for TTS, but the platform also offers speech-to-text and audio analysis tools, ranging from transcription and speaker diarization to real-time processing.

Transcription API

Core Function

The ElevenLabs Speech-to-Text API transcribes audio to text with high accuracy:

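const formData = new FormData()
// 'file' is the multipart field name the API expects for the upload
formData.append('file', audioFile)
formData.append('model_id', 'scribe_v1')
formData.append('language_code', 'de')

const response = await fetch(
  'https://api.elevenlabs.io/v1/speech-to-text',
  {
    method: 'POST',
    headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY },
    body: formData,
  }
)
// Surface HTTP errors (invalid key, unsupported file, ...) before parsing
if (!response.ok) {
  throw new Error(`Transcription failed with status ${response.status}`)
}
const result = await response.json()
// { text: "...", language: "de", segments: [...] }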

Supported Formats

Audio: MP3, WAV, FLAC, OGG, M4A, WebM
Maximum file size: 1 GB
Languages: 99+ languages with automatic detection
Accuracy: Word Error Rate (WER) of 3–5% for German
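
The format and size limits above can be checked on the client before uploading, which avoids a wasted request. The following is a minimal sketch, assuming that "1 GB" means 1024³ bytes and that the file extension reflects the actual container format. `validateAudioFile` is a hypothetical helper, not part of any ElevenLabs SDK.

```javascript
// Formats and size limit taken from the list above.
// Assumption: "1 GB" is interpreted as 1024^3 bytes.
const SUPPORTED_EXTENSIONS = ['mp3', 'wav', 'flac', 'ogg', 'm4a', 'webm']
const MAX_FILE_SIZE_BYTES = 1024 ** 3

// Hypothetical helper: returns { ok, reason } for a file name and byte size.
function validateAudioFile(fileName, sizeInBytes) {
  const dot = fileName.lastIndexOf('.')
  const extension = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase()

  if (!SUPPORTED_EXTENSIONS.includes(extension)) {
    return { ok: false, reason: `Unsupported format: "${extension || 'none'}"` }
  }
  if (sizeInBytes > MAX_FILE_SIZE_BYTES) {
    return { ok: false, reason: 'File exceeds the 1 GB limit' }
  }
  return { ok: true, reason: null }
}

console.log(validateAudioFile('interview.MP3', 5_000_000)) // { ok: true, reason: null }
```

In a browser, the two arguments would typically come from `file.name` and `file.size` of the selected `File` object.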

Quality Comparison

System	WER German	WER English	Real-time	Price/Min
ElevenLabs Scribe	~4%	~3%	Yes	€0.005
OpenAI Whisper	~5%	~4%	No	Free (local)
Deepgram	~4%	~3%	Yes	€0.0043
Azure Speech	~5%	~4%	Yes	€0.0093
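
The WER columns use the standard definition: substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. To measure it on your own recordings, a word-level edit distance is enough. The sketch below assumes a simplified normalization (lowercasing, stripping punctuation); published benchmarks usually normalize more carefully, so the numbers are not directly comparable to the table.

```javascript
// Word Error Rate = (substitutions + deletions + insertions) / reference word count.
// Computed as the word-level Levenshtein distance between reference and hypothesis.

// Simplified normalization (an assumption): lowercase, strip punctuation.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, '')
    .split(/\s+/)
    .filter(Boolean)
}

function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference)
  const hyp = tokenize(hypothesis)
  if (ref.length === 0) {
    throw new Error('Reference transcript must contain at least one word')
  }

  // previous[j] = edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let previous = Array.from({ length: hyp.length + 1 }, (_, j) => j)
  for (let i = 1; i <= ref.length; i++) {
    const current = [i]
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost // match or substitution
      )
    }
    previous = current
  }
  return previous[hyp.length] / ref.length
}

// One substituted word out of four reference words.
console.log(wordErrorRate('I have a question', 'I have the question')) // 0.25
```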

Speaker Diarization

Who Said What?

Speaker diarization detects and separates different speakers in a recording:

{
  "segments": [
    { "speaker": "Speaker_1", "start": 0.0, "end": 3.5, "text": "Good day, how can I help?" },
    { "speaker": "Speaker_2", "start": 3.8, "end": 7.2, "text": "I have a question about my order." },
    { "speaker": "Speaker_1", "start": 7.5, "end": 11.0, "text": "Of course, can you give me your order number?" }
  ]
}
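
Segments like these are easy to post-process. The sketch below turns them into a readable transcript and sums the talk time per speaker. It assumes the response shape shown above (`speaker`, `start`, `end`, `text`, with times in seconds); the helper names are illustrative.

```javascript
// Assumes segments shaped like the example above: { speaker, start, end, text },
// with start and end in seconds.
const segments = [
  { speaker: 'Speaker_1', start: 0.0, end: 3.5, text: 'Good day, how can I help?' },
  { speaker: 'Speaker_2', start: 3.8, end: 7.2, text: 'I have a question about my order.' },
  { speaker: 'Speaker_1', start: 7.5, end: 11.0, text: 'Of course, can you give me your order number?' },
]

// Formats seconds as mm:ss for transcript timestamps.
function formatTimestamp(seconds) {
  const minutes = Math.floor(seconds / 60)
  const rest = Math.floor(seconds % 60)
  return `${String(minutes).padStart(2, '0')}:${String(rest).padStart(2, '0')}`
}

// One line per segment: "[00:03] Speaker_2: ..."
function toTranscript(segments) {
  return segments
    .map((s) => `[${formatTimestamp(s.start)}] ${s.speaker}: ${s.text}`)
    .join('\n')
}

// Total speaking time in seconds per speaker.
function talkTimeBySpeaker(segments) {
  const totals = {}
  for (const s of segments) {
    totals[s.speaker] = (totals[s.speaker] ?? 0) + (s.end - s.start)
  }
  return totals
}

console.log(toTranscript(segments))
console.log(talkTimeBySpeaker(segments))
```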

Accuracy by Speaker Count

Speakers	Accuracy	Recommendation
2	95–98%	Excellent
3–5	90–95%	Very good
6–10	80–90%	Good, manual review recommended
10+	70–80%	Challenging, pre-enrollment helps

Use Cases

Meeting minutes: Automatic attribution of statements to participants
Call center analysis: Analyze customer and agent speech separately
Interview transcription: Clearly separate questions and answers
Compliance recording: Who said what and when?

Audio Analysis

Sentiment & Emotion Detection

Beyond pure transcription, ElevenLabs also analyzes emotional content:

Sentiment: Positive, neutral, negative
Emotions: Joy, anger, sadness, surprise, fear
Intensity: Strength of detected emotion
Trends: Emotional progression throughout the conversation

Speech Analysis Metrics

Speaking speed: Words per minute
Pauses: Frequency and duration of silence
Overlaps: How often do people speak simultaneously?
Filler words: Frequency of "um", "uh", "well"

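If you only have diarized segments, these metrics can also be approximated on the client. The sketch below assumes segments shaped `{ speaker, start, end, text }` (times in seconds, sorted by start time), as in the diarization example. The filler list, the pause threshold, and the function name `analyzeSpeech` are illustrative choices, not part of the API.

```javascript
// Assumes diarized segments shaped { speaker, start, end, text } (seconds),
// sorted by start time. Filler list and pause threshold are illustrative.
// Note: the filler match is naive; "well" in "doing well" is counted too.
const FILLER_WORDS = new Set(['um', 'uh', 'well'])

function analyzeSpeech(segments, minPauseSeconds = 0.25) {
  let wordCount = 0
  let fillerCount = 0
  let speakingSeconds = 0
  let overlapCount = 0
  const pauses = []

  segments.forEach((segment, index) => {
    const words = segment.text.toLowerCase().match(/[\p{L}']+/gu) ?? []
    wordCount += words.length
    fillerCount += words.filter((word) => FILLER_WORDS.has(word)).length
    speakingSeconds += segment.end - segment.start

    if (index > 0) {
      // Simplification: each segment is compared only with its predecessor.
      const gap = segment.start - segments[index - 1].end
      if (gap >= minPauseSeconds) pauses.push(gap) // silence between segments
      if (gap < 0) overlapCount += 1 // starts before the previous segment ended
    }
  })

  return {
    wordsPerMinute: speakingSeconds > 0 ? wordCount / (speakingSeconds / 60) : 0,
    pauseCount: pauses.length,
    totalPauseSeconds: pauses.reduce((sum, gap) => sum + gap, 0),
    overlapCount,
    fillerCount,
  }
}

const metrics = analyzeSpeech([
  { speaker: 'Speaker_1', start: 0, end: 3, text: 'Um, well, good day.' },
  { speaker: 'Speaker_2', start: 4, end: 6, text: 'Uh, hello there.' },
  { speaker: 'Speaker_1', start: 5, end: 7, text: 'How can I help?' },
])
console.log(metrics)
```

In this example, the one-second gap before the second segment counts as a pause, and the third segment overlaps the second because it starts one second before the second ends.
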
Real-Time Transcription

WebSocket-Based

For live applications, ElevenLabs offers real-time transcription via WebSocket:

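const ws = new WebSocket(
  'wss://api.elevenlabs.io/v1/speech-to-text/stream'
)

// Send the session configuration first, once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'config',
    api_key: process.env.ELEVENLABS_API_KEY,
    model_id: 'scribe_v1',
    language_code: 'en',
  }))
}

// Send audio chunks (mediaRecorder is a MediaRecorder on the microphone stream)
mediaRecorder.ondataavailable = (event) => {
  // Skip empty chunks and chunks that arrive before the socket is open
  if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
    ws.send(event.data)
  }
}

// Receive transcripts
ws.onmessage = (event) => {
  const data = JSON.parse(event.data)
  console.log('Transcript:', data.text)
}

// Log connection problems instead of failing silently
ws.onerror = (error) => console.error('WebSocket error:', error)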

Latency Expectations

Interim results: ~200 ms (preliminary, may change)
Final results: ~500 ms (confirmed, stable)
End-of-speech detection: ~300 ms after speaking ends
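
Because interim results may still change while final results are stable, a client usually stores the two separately: final text is appended, interim text is overwritten on every update. The sketch below illustrates this. The message shape `{ text, is_final }` is an assumption made for the example, as is the helper name `createTranscriptBuffer`; adapt the field names to the actual message format.

```javascript
// Assumption: each transcript message looks like { text: string, is_final: boolean }.
// The field names are hypothetical; adapt them to the actual message format.
function createTranscriptBuffer() {
  const finalParts = [] // confirmed text, never rewritten
  let interim = '' // latest preliminary text, replaced on every update

  return {
    handleMessage(message) {
      if (message.is_final) {
        finalParts.push(message.text.trim())
        interim = '' // the interim guess is superseded by the final result
      } else {
        interim = message.text.trim()
      }
    },
    // Text to display: confirmed parts followed by the current interim guess.
    render() {
      return [...finalParts, interim].filter(Boolean).join(' ')
    },
    // Only the stable text, e.g. for storage.
    finalText() {
      return finalParts.join(' ')
    },
  }
}

const buffer = createTranscriptBuffer()
buffer.handleMessage({ text: 'Good day how', is_final: false })
buffer.handleMessage({ text: 'Good day, how can I help?', is_final: true })
buffer.handleMessage({ text: 'I have a', is_final: false })
console.log(buffer.render()) // Good day, how can I help? I have a
```

For live subtitles you would display `render()`; for meeting minutes or storage, only `finalText()`.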

Applications

Live subtitles: For webinars, conferences, live streams
Voice commands: Recognize voice commands in real time
Meeting AI: Live transcription during meetings
Accessibility: Real-time subtitles for hearing-impaired users

Practical tip: Combine speech-to-text with the TTS API to build a complete voice pipeline: audio in → transcription → LLM processing → response audio out. This pipeline is the foundation of most voice agents.
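
The pipeline from the tip can be sketched as three stages chained together. In this minimal sketch the stages are passed in as functions, so the pipeline itself does not depend on a specific provider. `createVoicePipeline`, `transcribe`, `generateReply`, and `synthesize` are hypothetical names; in a real agent they would wrap the speech-to-text call, an LLM call, and the TTS call.

```javascript
// Each stage is passed in as a function so the pipeline stays independent of
// any specific provider. All three stage names are hypothetical placeholders:
//   transcribe(audio)   -> Promise<string>  (speech-to-text)
//   generateReply(text) -> Promise<string>  (LLM processing)
//   synthesize(text)    -> Promise<audio>   (text-to-speech)
function createVoicePipeline({ transcribe, generateReply, synthesize }) {
  return async function handleTurn(audioIn) {
    const userText = await transcribe(audioIn)
    if (!userText.trim()) {
      return { userText: '', replyText: '', audioOut: null } // nothing was said
    }
    const replyText = await generateReply(userText)
    const audioOut = await synthesize(replyText)
    return { userText, replyText, audioOut }
  }
}

// Usage with stub stages; replace them with real API calls.
const handleTurn = createVoicePipeline({
  transcribe: async () => 'What is my order status?',
  generateReply: async (text) => `You asked: ${text}`,
  synthesize: async (text) => new TextEncoder().encode(text), // stand-in for audio bytes
})

handleTurn(new Uint8Array()).then((turn) => console.log(turn.replyText))
```

Keeping the stages injectable also makes the pipeline easy to test with stubs, as the usage example does.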