Web Dev Tools

Speech & Transcription

Speech-to-text, text-to-speech, voice activity detection.

Hosted STT (speech-to-text)

  • Deepgram — fast, accurate, generous free credit ($200) on signup. Realtime + batch APIs. The default for production STT in 2026.
  • AssemblyAI — very accurate, great speaker diarization; free tier exists.
  • OpenAI Whisper API — easy if you're already on OpenAI; batch transcription is cheaper than realtime.
  • Google Speech-to-Text — long-time baseline, paid.
  • AWS Transcribe — bundled with AWS.
  • Azure Speech — bundled with Azure.
  • Soniox, Speechmatics, Rev AI, Gladia — strong alternatives with various free tiers.
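
As a rough illustration of what a hosted batch STT call involves, the sketch below builds (but does not send) a request to Deepgram's pre-recorded /v1/listen endpoint using only the Python standard library. The helper names build_request and extract_transcript and the choice of query parameters are illustrative assumptions, not part of any SDK; check the provider's current API reference before relying on them.

```python
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(
    audio: bytes,
    api_key: str,
    *,
    diarize: bool = False,
    content_type: str = "audio/wav",
) -> urllib.request.Request:
    """Build a POST request carrying raw audio bytes (hypothetical helper)."""
    # Query parameters are an assumption; adjust to the features you need.
    params = {"smart_format": "true"}
    if diarize:
        params["diarize"] = "true"
    url = f"{DEEPGRAM_URL}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url,
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


def extract_transcript(response: dict) -> str:
    """Pull the top transcript out of a decoded JSON response body."""
    return response["results"]["channels"][0]["alternatives"][0]["transcript"]
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate from sending makes the request easy to inspect and test without network access or spending credit.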

Self-host / local STT

  • whisper.cpp — C/C++ port of Whisper with fast CPU and GPU inference; runs on an M1 Mac or a Hetzner box.
  • WhisperX — Whisper + word-level timestamps + diarization.
  • Vosk — offline, lightweight, real-time-friendly; many languages.
  • Coqui STT (a fork of Mozilla DeepSpeech) — open source; no longer actively maintained.
  • NeMo / Riva (NVIDIA) — heavyweight, GPU-required.
  • Faster Whisper — CTranslate2-based reimplementation; the project reports up to 4x faster inference than vanilla Whisper at the same accuracy.
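
For a sense of what local transcription looks like in practice, here is a minimal sketch using the faster-whisper Python package. The model size, the CPU device, and the int8 compute type are assumptions chosen to keep the example light; format_segment and transcribe are hypothetical helper names.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcript segment as '[start -> end] text'."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe(path: str, model_size: str = "base") -> list[str]:
    """Transcribe an audio file locally and return formatted segments."""
    # Imported lazily so this file loads even without the package installed.
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # int8 on CPU is an assumption chosen to keep memory use low.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=5)
    # `segments` is a generator: transcription runs as it is consumed.
    return [format_segment(s.start, s.end, s.text) for s in segments]
```

Calling transcribe("meeting.wav") (a placeholder file name) downloads the model on first use, so the first run is slower than later ones.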

In-browser STT

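  • Web Speech API — built into Chromium-based browsers and Safari (recognition is not enabled by default in Firefox); free; quality varies; great for prototypes.
  • Transformers.js — Whisper in the browser via WebGPU, with a WASM fallback; free, no server.
  • WebLLM-style WASM builds — models compiled to WebAssembly and run client-side; for offline-capable PWAs.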

Realtime / streaming

  • Deepgram Streaming, AssemblyAI Realtime, Speechmatics Realtime — websocket APIs.
  • Soniox Realtime — competitive realtime accuracy.
  • OpenAI Realtime API — speech-in / speech-out for voice agents.
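
Streaming APIs like these typically expect audio to arrive in small chunks over the socket rather than as one file. The sketch below shows only that slicing step for raw PCM; the 16 kHz, 16-bit mono format and the 100 ms frame length are assumptions, pcm_frames is a hypothetical helper, and the websocket connection itself is left to whichever client library you use.

```python
from typing import Iterator


def pcm_frames(
    audio: bytes,
    sample_rate: int = 16_000,
    frame_ms: int = 100,
    bytes_per_sample: int = 2,
    channels: int = 1,
) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM; the last one may be shorter."""
    samples_per_frame = sample_rate * frame_ms // 1000
    frame_bytes = samples_per_frame * bytes_per_sample * channels
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    for offset in range(0, len(audio), frame_bytes):
        yield audio[offset : offset + frame_bytes]
```

Each yielded frame would be sent as one binary websocket message. Shorter frames lower latency at the cost of more messages, so the right frame length depends on the provider's guidance.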

Voice activity detection (VAD)

  • Silero VAD — small ONNX model; great for "is the user speaking?".
  • @ricky0123/vad-web — browser VAD wrapper.
  • @ricky0123/vad-node — Node VAD.
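
To make the "is the user speaking?" question concrete, here is a deliberately naive energy-threshold detector. It is not how Silero VAD works (Silero uses a trained neural model and copes far better with noise); it only illustrates the per-frame yes/no interface a VAD exposes. The 16-bit little-endian mono input format and the threshold of 500 are assumptions, and frame_rms and is_speech are hypothetical names.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    # Ignore a trailing odd byte, if any.
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its energy reaches the threshold."""
    return frame_rms(frame) >= threshold
```

A fixed threshold misfires on background noise and quiet speakers, which is the main reason to reach for a model-based VAD in real applications.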

Text-to-speech (TTS)

  • ElevenLabs — best voices; generous free tier; paid for production.
  • OpenAI TTS — cheap, decent voices, easy if already on OpenAI.
  • Cartesia — fast, low-latency TTS; free trial credit.
  • Azure TTS, Amazon Polly, Google TTS — incumbent cloud options.
  • Coqui TTS, Piper, F5-TTS — open-source, self-hostable.
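
Hosted TTS APIs are usually a single HTTP call that returns audio bytes. As a sketch, the helper below builds (but does not send) a request to OpenAI's /v1/audio/speech endpoint with the standard library; the default model and voice are assumptions that may not match what is current, and build_tts_request is a hypothetical name.

```python
import json
import urllib.request

OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_tts_request(
    text: str,
    api_key: str,
    *,
    model: str = "tts-1",   # assumed default; check the current model list
    voice: str = "alloy",   # assumed default voice
) -> urllib.request.Request:
    """Build a POST request asking the API to synthesize `text`."""
    body = json.dumps({"model": model, "voice": voice, "input": text})
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with urllib.request.urlopen and writing the response body to a file would produce the audio; that step is omitted here so the example runs without a key.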

Pick this if…

  • Default production STT: Deepgram.
  • Best accuracy + diarization: AssemblyAI.
  • Self-host on a Mac / GPU: whisper.cpp / Faster Whisper.
  • In-browser, no server: Transformers.js or Web Speech API.
  • Production TTS: ElevenLabs (quality) or OpenAI TTS (cheap).
  • Self-host TTS: Piper or F5-TTS.
