ElevenLabs is best known for TTS, but the platform also offers speech-to-text and audio analysis tools, covering transcription, speaker diarization, and real-time processing.
The ElevenLabs Speech-to-Text API transcribes audio to text with high accuracy:
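// audioFile is assumed to be a File or Blob, e.g. from an <input type="file">
const formData = new FormData()
formData.append('file', audioFile) // the endpoint expects the upload under "file"
formData.append('model_id', 'scribe_v1')
formData.append('language_code', 'en') // optional language hint

const response = await fetch(
  'https://api.elevenlabs.io/v1/speech-to-text',
  {
    method: 'POST',
    headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY },
    body: formData,
  }
)
// fetch only rejects on network errors, so check the HTTP status explicitly
if (!response.ok) {
  throw new Error(`Transcription failed: HTTP ${response.status}`)
}
const result = await response.json()
// Illustrative shape: { text: "Welcome to...", language: "en", segments: [...] }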
| System | WER German | WER English | Real-time | Price/Min |
|---|---|---|---|---|
| ElevenLabs Scribe | ~4% | ~3% | Yes | €0.005 |
| OpenAI Whisper | ~5% | ~4% | No | Free (local) |
| Deepgram | ~4% | ~3% | Yes | €0.0043 |
| Azure Speech | ~5% | ~4% | Yes | €0.0093 |
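The WER columns above refer to word error rate: the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. To check a provider against your own recordings, a minimal sketch could look like the following. The function names `tokenize` and `wordErrorRate` are illustrative, and the normalization (lowercasing, stripping punctuation) is an assumption; published benchmarks may normalize text differently.

```javascript
// Lowercase and strip punctuation so "Help?" and "help" count as the same word.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s']/gu, ' ')
    .split(/\s+/)
    .filter(Boolean)
}

// Word error rate = word-level edit distance / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference)
  const hyp = tokenize(hypothesis)
  if (ref.length === 0) {
    throw new Error('reference must contain at least one word')
  }
  // prev[j] holds the distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j)
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i]
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1, // deletion: reference word missing from the hypothesis
        curr[j - 1] + 1, // insertion: extra word in the hypothesis
        prev[j - 1] + cost // match or substitution
      )
    }
    prev = curr
  }
  return prev[hyp.length] / ref.length
}
```

Calling `wordErrorRate(referenceText, result.text)` on a handful of hand-corrected transcripts gives a figure that is directly comparable across providers, as long as every provider's output goes through the same normalization.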
Speaker diarization detects and separates different speakers in a recording:
{
  "segments": [
    { "speaker": "Speaker_1", "start": 0.0, "end": 3.5, "text": "Good day, how can I help?" },
    { "speaker": "Speaker_2", "start": 3.8, "end": 7.2, "text": "I have a question about my order." },
    { "speaker": "Speaker_1", "start": 7.5, "end": 11.0, "text": "Of course, can you give me your order number?" }
  ]
}
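Once segments carry a speaker label, per-speaker statistics are a simple fold over the array. The sketch below assumes segments shaped exactly like the example above (`speaker`, `start`, `end`, `text`); the helper name `summarizeSpeakers` is illustrative and not part of any SDK.

```javascript
// Aggregate talk time, number of turns, and word count per speaker.
function summarizeSpeakers(segments) {
  const stats = {}
  for (const { speaker, start, end, text } of segments) {
    const entry = stats[speaker] || { seconds: 0, turns: 0, words: 0 }
    entry.seconds += end - start
    entry.turns += 1
    entry.words += text.trim().split(/\s+/).filter(Boolean).length
    stats[speaker] = entry
  }
  return stats
}
```

For the three segments shown, this reports two turns for `Speaker_1` and one for `Speaker_2`, which is the kind of summary call-center dashboards or meeting tools typically build on top of diarization.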
| Speakers | Accuracy | Recommendation |
|---|---|---|
| 2 | 95–98% | Excellent |
| 3–5 | 90–95% | Very good |
| 6–10 | 80–90% | Good, manual review recommended |
| 10+ | 70–80% | Challenging, pre-enrollment helps |
Beyond pure transcription, ElevenLabs also analyzes emotional content.
For live applications, ElevenLabs offers real-time transcription via WebSocket:
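const ws = new WebSocket(
  'wss://api.elevenlabs.io/v1/speech-to-text/stream'
)

ws.onopen = () => {
  // The first message configures the session
  ws.send(JSON.stringify({
    type: 'config',
    api_key: process.env.ELEVENLABS_API_KEY,
    model_id: 'scribe_v1',
    language_code: 'en',
  }))
}

// Send audio chunks (mediaRecorder is assumed to be a started MediaRecorder)
mediaRecorder.ondataavailable = (event) => {
  // Calling send() before the socket is open throws, so skip those chunks
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(event.data)
  }
}

// Receive transcripts
ws.onmessage = (event) => {
  const data = JSON.parse(event.data)
  console.log('Transcript:', data.text)
}

ws.onerror = (event) => {
  console.error('WebSocket error:', event)
}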
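The recorder can emit audio chunks before the socket has finished connecting, and `send()` on a socket that is still connecting throws. One option is to queue chunks and flush them once the connection is open. The following is a minimal sketch of that idea; `createBufferedSender` and its `maxQueued` limit are illustrative names, not part of the ElevenLabs API, and the only thing it assumes about the socket is the standard `readyState` and `send` members.

```javascript
// Queue chunks until the socket is open, then send them in order.
function createBufferedSender(socket, maxQueued = 50) {
  const OPEN = 1 // value of WebSocket.OPEN in the standard WebSocket API
  const queue = []

  // Send as many queued chunks as the socket state allows.
  function flush() {
    while (queue.length > 0 && socket.readyState === OPEN) {
      socket.send(queue.shift())
    }
  }

  function send(chunk) {
    queue.push(chunk)
    // Drop the oldest chunk rather than letting the queue grow without bound
    if (queue.length > maxQueued) {
      queue.shift()
    }
    flush()
  }

  return { send, flush, pending: () => queue.length }
}
```

In the snippet above this would mean creating `const sender = createBufferedSender(ws)`, calling `sender.flush()` at the end of `ws.onopen` (after the config message), and calling `sender.send(event.data)` from `ondataavailable`. Dropping the oldest chunks on overflow favors low latency over completeness; a batch-style application might prefer to fail loudly instead.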
Practical tip: Combine speech-to-text with the TTS API to build a complete voice pipeline: audio in → transcription → LLM processing → response audio out. This loop is the foundation of most voice agents.
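The pipeline from the tip can be sketched as three async stages chained per conversational turn. In the sketch below, `transcribe`, `generateReply`, and `synthesize` are hypothetical placeholders that you would implement against the speech-to-text endpoint, an LLM of your choice, and the TTS API; none of them is a real SDK function.

```javascript
// Compose one voice-agent turn from three injected async stages.
// All three stage functions are hypothetical and supplied by the caller.
function createVoicePipeline({ transcribe, generateReply, synthesize }) {
  for (const [name, fn] of Object.entries({ transcribe, generateReply, synthesize })) {
    if (typeof fn !== 'function') {
      throw new TypeError(`${name} must be a function`)
    }
  }

  return async function handleTurn(audioIn) {
    const transcript = await transcribe(audioIn)
    // Skip the LLM and TTS calls when nothing intelligible was said
    if (!transcript.trim()) {
      return { transcript, reply: null, audioOut: null }
    }
    const reply = await generateReply(transcript)
    const audioOut = await synthesize(reply)
    return { transcript, reply, audioOut }
  }
}
```

Injecting the stages keeps the orchestration testable with stubs and makes it easy to swap providers. A production agent would typically stream between stages instead of awaiting each one in full, since end-to-end latency dominates how natural the conversation feels.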