Speech & Transcription
Speech-to-text, text-to-speech, voice activity detection.
Hosted STT (speech-to-text)
- ★ Deepgram — fast, accurate, generous free credit ($200) on signup. Realtime + batch APIs. The default for production STT in 2026.
- ★ AssemblyAI — very accurate, great speaker diarization; free tier exists.
- OpenAI Whisper API — easy if you're already on OpenAI; batch transcription is cheaper than realtime.
- Google Speech-to-Text — long-time baseline, paid.
- AWS Transcribe — bundled with AWS.
- Azure Speech — bundled with Azure.
- Soniox, Speechmatics, Rev AI, Gladia — strong alternatives with various free tiers.
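Most hosted STT services follow the same batch pattern: POST the audio bytes with an API key, then read the transcript out of a JSON response. The sketch below uses Deepgram's pre-recorded endpoint with only the standard library. The model name `nova-2`, the `smart_format` option, and the helper names (`build_request`, `extract_transcript`, `transcribe_file`) are illustrative assumptions; check the provider's current API reference before relying on them.

```python
import json
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(audio: bytes, api_key: str, model: str = "nova-2") -> urllib.request.Request:
    """Build a POST request that uploads WAV bytes for batch transcription."""
    # Assumption: "nova-2" and smart_format are valid options on your account.
    query = urllib.parse.urlencode({"model": model, "smart_format": "true"})
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio,
        headers={"Authorization": f"Token {api_key}", "Content-Type": "audio/wav"},
        method="POST",
    )


def extract_transcript(payload: dict) -> str:
    """Return the top transcript of the first channel, or "" if there is none."""
    alternatives = payload["results"]["channels"][0]["alternatives"]
    return alternatives[0]["transcript"] if alternatives else ""


def transcribe_file(path: str, api_key: str) -> str:
    """Upload a local WAV file and return its transcript (makes a network call)."""
    with open(path, "rb") as f:
        request = build_request(f.read(), api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_transcript(json.load(response))
```

Keeping request building and response parsing separate from the network call makes it easier to swap providers later: only the URL, the auth header, and the JSON path should need to change.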
Self-host / local STT
- ★ whisper.cpp — fast C/C++ port of Whisper for CPU and GPU; runs on an M1 Mac or a Hetzner box.
- WhisperX — Whisper + word-level timestamps + diarization.
- Vosk — offline, lightweight, real-time-friendly; many languages.
- Coqui STT (fork of Mozilla DeepSpeech) — open source; no longer actively maintained.
- NeMo / Riva (NVIDIA) — heavyweight, GPU-required.
- Faster Whisper — CTranslate2-based; meaningfully faster than vanilla Whisper.
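For a sense of what self-hosting looks like in practice, here is a minimal sketch that transcribes a file with Faster Whisper and formats the result as SRT subtitles. It assumes the `faster-whisper` package is installed; the model size `small`, the `int8` compute type, and the helper names are illustrative choices, not recommendations.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn objects with .start, .end and .text attributes into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(path: str, model_size: str = "small") -> str:
    """Transcribe an audio file locally on CPU and return SRT text."""
    # Third-party dependency, imported lazily: pip install faster-whisper
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus metadata;
    # the actual decoding happens while segments_to_srt iterates over it.
    segments, _info = model.transcribe(path)
    return segments_to_srt(segments)
```

On a machine with an NVIDIA GPU, `device="cuda"` with `compute_type="float16"` is the usual variation of the same call.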
In-browser STT
- ★ Web Speech API — built into Chromium-based browsers and Safari; free; quality varies by browser; great for prototypes.
- Transformers.js — Whisper in the browser via WebGPU; free, no server.
- WebLLM-style WASM builds (for example, whisper.cpp compiled to WebAssembly) — for offline-capable PWAs.
Realtime / streaming
- Deepgram Streaming, AssemblyAI Realtime, Speechmatics Realtime — websocket APIs.
- Soniox Realtime — competitive realtime accuracy.
- OpenAI Realtime API — speech-in / speech-out for voice agents.
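The websocket APIs above generally expect audio as a stream of small binary frames rather than one large upload. The sketch below shows the provider-independent part: slicing 16-bit mono PCM into fixed-duration frames and pacing them in real time. The `send` callable is a hypothetical stand-in for whatever websocket client you use, and the 16 kHz / 20 ms defaults are assumptions; each provider documents its own accepted encodings and frame sizes.

```python
import asyncio
from typing import Awaitable, Callable, Iterator


def pcm_frames(
    pcm: bytes, sample_rate: int = 16_000, frame_ms: int = 20, sample_width: int = 2
) -> Iterator[bytes]:
    """Yield fixed-duration frames of mono PCM; the last frame may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width
    if frame_bytes <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


async def stream_frames(
    send: Callable[[bytes], Awaitable[None]],
    pcm: bytes,
    sample_rate: int = 16_000,
    frame_ms: int = 20,
) -> None:
    """Send frames one by one, sleeping between them to mimic a live microphone."""
    for frame in pcm_frames(pcm, sample_rate=sample_rate, frame_ms=frame_ms):
        await send(frame)  # hypothetical: e.g. a websocket connection's send method
        await asyncio.sleep(frame_ms / 1000)
```

Pacing matters mainly when replaying a recorded file: sending it faster than real time can trip rate limits or skew the timestamps some services report.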
Voice activity detection (VAD)
- Silero VAD — small ONNX model; great for "is the user speaking?".
- @ricky0123/vad-web — browser VAD wrapper.
- @ricky0123/vad-node — Node VAD.
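A VAD typically returns a list of speech intervals, which you then smooth before cutting audio or triggering a transcription. The sketch below runs Silero VAD through the `silero-vad` Python package (assumed to be installed) and merges intervals separated by short pauses. The 0.3 second gap and the helper names are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def merge_close_segments(
    timestamps: List[Dict[str, float]], max_gap: float = 0.3
) -> List[Tuple[float, float]]:
    """Merge speech intervals whose silence gap is at most max_gap seconds."""
    merged: List[Tuple[float, float]] = []
    for ts in sorted(timestamps, key=lambda t: t["start"]):
        if merged and ts["start"] - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], ts["end"]))
        else:
            merged.append((ts["start"], ts["end"]))
    return merged


def detect_speech(path: str, max_gap: float = 0.3) -> List[Tuple[float, float]]:
    """Return merged (start, end) speech intervals in seconds for an audio file."""
    # Third-party dependency, imported lazily: pip install silero-vad
    from silero_vad import get_speech_timestamps, load_silero_vad, read_audio

    model = load_silero_vad()
    wav = read_audio(path)
    # return_seconds=True yields dicts like {"start": 0.5, "end": 1.0}
    timestamps = get_speech_timestamps(wav, model, return_seconds=True)
    return merge_close_segments(timestamps, max_gap)
```

Merging short gaps keeps a brief pause in the middle of a sentence from being treated as the end of the user's turn.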
Text-to-speech (TTS)
- ★ ElevenLabs — best voices; generous free tier; paid for production.
- OpenAI TTS — cheap, decent voices, easy if already on OpenAI.
- Cartesia — fast, low-latency TTS; free trial credit.
- Azure TTS, Amazon Polly, Google TTS — incumbent cloud options.
- Coqui TTS, Piper, F5-TTS — open-source, self-hostable.
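Hosted TTS is usually a single HTTP call: send text plus a voice name, get audio bytes back. The sketch below targets OpenAI's speech endpoint using only the standard library. The model `tts-1`, the voice `alloy`, the output filename, and the helper names are illustrative assumptions; other providers differ in URL, auth header, and payload shape.

```python
import json
import urllib.request

OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(
    text: str, api_key: str, voice: str = "alloy", model: str = "tts-1"
) -> urllib.request.Request:
    """Build the JSON POST request for a text-to-speech call."""
    body = json.dumps({"model": model, "voice": voice, "input": text}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, api_key: str, out_path: str = "speech.mp3") -> str:
    """Synthesize speech and write the returned audio to disk (makes a network call)."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()  # MP3 bytes unless another format is requested
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```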
Pick this if…
- Default production STT: Deepgram.
- Best accuracy + diarization: AssemblyAI.
- Self-host on a Mac / GPU: whisper.cpp / Faster Whisper.
- In-browser, no server: Transformers.js or Web Speech API.
- Production TTS: ElevenLabs (quality) or OpenAI TTS (cheap).
- Self-host TTS: Piper or F5-TTS.