Speech & Transcription

Speech-to-text, text-to-speech, voice activity detection.

Hosted STT (speech-to-text)

★ Deepgram — fast, accurate, generous free credit ($200) on signup. Realtime + batch APIs. The default for production STT in 2026.
★ AssemblyAI — very accurate, great speaker diarization; free tier exists.
OpenAI Whisper API — easy if you're already on OpenAI; batch transcription is priced lower than realtime.
Google Speech-to-Text — long-time baseline, paid.
AWS Transcribe — bundled with AWS.
Azure Speech — bundled with Azure.
Soniox, Speechmatics, Rev AI, Gladia — strong alternatives with various free tiers.
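
Most of the hosted services above expose batch transcription as a single HTTPS request: upload audio, get JSON back. Below is a minimal sketch against Deepgram's pre-recorded endpoint using only the standard library. The model name, the DEEPGRAM_API_KEY environment variable, and the helper names are assumptions for illustration; check the current API reference before relying on the response shape.

```python
import json
import os
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(audio: bytes, api_key: str, model: str = "nova-2") -> urllib.request.Request:
    """Build a POST request carrying raw audio bytes (assumed to be WAV)."""
    # The model name is an assumption; substitute whatever your account offers.
    query = urllib.parse.urlencode({"model": model, "smart_format": "true"})
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "audio/wav",
        },
    )


def extract_transcript(payload: dict) -> str:
    """Return the top transcript from a Deepgram-shaped response, or "" if absent."""
    try:
        return payload["results"]["channels"][0]["alternatives"][0]["transcript"]
    except (KeyError, IndexError, TypeError):
        return ""


def transcribe_file(path: str) -> str:
    """Send one file for batch transcription (needs network access and a valid key)."""
    with open(path, "rb") as f:
        request = build_request(f.read(), os.environ["DEEPGRAM_API_KEY"])
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_transcript(json.load(response))
```

Keeping request construction and response parsing in separate functions makes it easier to swap providers later: only the URL, the auth header, and the JSON path should need to change.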

Self-hosted / open-source STT

★ whisper.cpp — fast C/C++ port of Whisper for CPU and GPU; runs on an M1 Mac or a Hetzner box.
WhisperX — Whisper + word-level timestamps + diarization.
Vosk — offline, lightweight, real-time-friendly; many languages.
Coqui STT (successor to Mozilla DeepSpeech) — open source; no longer actively maintained.
NeMo / Riva (NVIDIA) — heavyweight, GPU-required.
faster-whisper — CTranslate2-based reimplementation; the project reports up to 4x faster than the reference openai/whisper at the same accuracy, with lower memory use.
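
For a sense of what self-hosting looks like in practice, here is a minimal sketch that transcribes a file with faster-whisper on CPU and renders the segments as SRT subtitles. The helper names, the "small" model size, and the int8 compute type are illustrative assumptions, not recommendations; the package has to be installed separately.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as SRT subtitle text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_local(path: str, model_size: str = "small") -> str:
    """Transcribe a file with faster-whisper on CPU and return SRT text."""
    # Imported lazily so the helpers above work without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator; the actual decoding happens as it is consumed.
    return to_srt((s.start, s.end, s.text) for s in segments)
```

Calling transcribe_local("meeting.wav") downloads the model on first use, so the first run is noticeably slower than later ones.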

In-browser STT

★ Web Speech API — built into browsers (speech recognition mainly in Chromium-based browsers and Safari); free; quality varies; great for prototypes.
Transformers.js — Whisper in the browser via WebGPU; free, no server.
WebLLM-style WASM runtimes — the model runs entirely client-side; for offline-capable PWAs.

Realtime / streaming

Deepgram Streaming, AssemblyAI Realtime, Speechmatics Realtime — WebSocket APIs for streaming transcription.
Soniox Realtime — competitive realtime accuracy.
OpenAI Realtime API — speech-in / speech-out for voice agents.
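
The streaming APIs above generally expect audio in small frames sent at roughly real-time pace over a socket. Below is a transport-agnostic sketch of that client side: it splits 16-bit mono PCM into fixed-duration frames and hands each one to a `send` callable, which in practice would wrap a WebSocket client's send method. The function names, the 100 ms default, and the sleep-based pacing are assumptions; each provider documents its own frame-size limits and end-of-stream message.

```python
import time
from typing import Callable, Iterator

BYTES_PER_SAMPLE = 2  # 16-bit linear PCM, mono


def frame_size_bytes(sample_rate: int, frame_ms: int) -> int:
    """Number of bytes in one frame of the given duration."""
    size = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    if size <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    return size


def iter_frames(pcm: bytes, sample_rate: int = 16_000, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    size = frame_size_bytes(sample_rate, frame_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


def stream_pcm(
    pcm: bytes,
    send: Callable[[bytes], None],
    sample_rate: int = 16_000,
    frame_ms: int = 100,
    realtime: bool = True,
) -> int:
    """Send frames through `send`, optionally paced at real-time speed.

    Returns the number of frames sent.
    """
    sent = 0
    for frame in iter_frames(pcm, sample_rate, frame_ms):
        send(frame)
        sent += 1
        if realtime:
            # Crude pacing: sleep one frame duration between sends.
            time.sleep(frame_ms / 1000)
    return sent
```

Passing `send` in as a callable keeps the chunking logic independent of any one provider's SDK, and setting realtime=False lets the same code push a pre-recorded file as fast as the socket allows.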