Whisper vs Deepgram vs AssemblyAI: honest 2026 comparison
If you're building anything that turns audio into text in 2026, you've probably looked at Whisper, Deepgram, and AssemblyAI. All three make loud marketing claims. Here's an honest comparison from someone who has run all three in production.
TL;DR
| API | Best for | Price/min | Gotcha |
|---|---|---|---|
| OpenAI Whisper API | Quick prototypes, simple jobs | $0.006 | 25 MB file cap |
| Self-host Whisper (faster-whisper) | High-volume custom needs | ~$0.001 compute | Build infra yourself |
| Deepgram Nova-3 | Real-time streaming, low latency | $0.0043 | Worse on accents/non-English |
| AssemblyAI Universal-2 | Speaker labels, sentiment, summaries | $0.0102 | ~2x the price of competitors |
| LessRec (Whisper backend) | End users who don't want to build a wrapper | $0.05 | No streaming, no diarization yet |
Accuracy (word error rate, English clean audio)
From public benchmarks plus our own internal testing on 100 hours of mixed podcast / meeting / interview audio:
- Whisper large-v3: 5.2% WER
- Deepgram Nova-3: 4.8% WER
- AssemblyAI Universal-2: 4.5% WER
- OpenAI Whisper API (whisper-1): 6.1% WER (serves an older checkpoint than large-v3)
For everyday work this is a tie. The gap between 4.5% and 5.2% is roughly one extra wrong word per 140; you wouldn't notice it outside a benchmark. Pick on price and features, not accuracy.
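WER figures like these follow the standard definition: word-level edit distance divided by reference word count. A minimal scorer, for intuition only (not any vendor's official harness; real evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167
```

At 5% WER that's one such error every 20 words, which is why the table's differences only show up at benchmark scale.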
Where accuracy actually diverges:
- Heavy accents: Whisper wins. Trained on more diverse multilingual data.
- Multiple speakers, overlapping speech: AssemblyAI wins (speaker diarization built in).
- Music/SFX background: All three struggle. Run audio through vocal isolation first.
- Domain-specific terms (legal, medical, technical): Deepgram and AssemblyAI both let you upload custom vocabulary. Whisper doesn't (yet).
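Since Whisper has no custom-vocabulary hook, a common workaround is fuzzy post-correction of known domain terms after transcription. A sketch using the stdlib's `difflib` (the glossary terms and threshold here are illustrative assumptions, not a recommendation of specific values):

```python
import difflib

def correct_terms(text: str, glossary: list[str], cutoff: float = 0.8) -> str:
    """Replace transcribed words that closely match a glossary term.

    Word-level only: it won't rejoin a term the ASR split into two words.
    """
    lexicon = {t.lower(): t for t in glossary}
    out = []
    for word in text.split():
        key = word.lower().strip(".,;:")
        if key in lexicon:
            out.append(word)
            continue
        match = difflib.get_close_matches(key, lexicon, n=1, cutoff=cutoff)
        out.append(lexicon[match[0]] if match else word)
    return " ".join(out)

# Hypothetical mishearing of a technical term:
print(correct_terms("deploy it on kubernetis", ["Kubernetes"]))  # deploy it on Kubernetes
```

Crude compared to real keyword boosting inside the decoder, but it catches the worst repeated misspellings of jargon.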
Latency (real-time use cases)
| API | Streaming? | First-word latency | Use case |
|---|---|---|---|
| Deepgram | ✅ Native | ~300ms | Live captions, voice agents, call center |
| AssemblyAI | ✅ Native | ~500ms | Live with rich metadata |
| OpenAI Whisper API | ❌ Batch only | n/a | Async transcription |
| Self-host Whisper | ⚠️ With effort (faster-whisper streaming mode) | ~700ms | Custom voice apps |
| LessRec | ❌ Batch only (intentional) | n/a | Async upload-and-wait |
If you're building a real-time voice agent (Delphi-style mentor, customer support bot, live caption tool) → Deepgram. Period. Don't fight Whisper to do streaming when Deepgram does it natively.
If you're building anything else (transcription SaaS, async meeting notes, podcast tooling, document processing) → Whisper or its hosted alternatives.
Speaker diarization (who said what)
Native support: AssemblyAI ✅, Deepgram ✅ (extra cost), Whisper ❌ (but pyannote.audio adds it for free if you self-host).
If diarization matters to you and you don't want to write the pyannote integration: AssemblyAI is the easiest. One request flag (`speaker_labels`) and the response comes back with per-utterance speaker labels.
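If you do go the self-hosted route, the glue between Whisper and a diarizer is interval assignment: label each transcript segment with the speaker whose turn overlaps it most. A sketch (the tuple shapes are assumptions for illustration; pyannote.audio's actual objects differ):

```python
def assign_speakers(segments, turns):
    """Label each ASR segment with the speaker whose turn overlaps it most.

    segments: [(start_s, end_s, text)] from the transcriber;
    turns:    [(start_s, end_s, speaker)] from a diarizer such as pyannote.audio.
    """
    labeled = []
    for s_start, s_end, text in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(s_end, t_end) - max(s_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

print(assign_speakers(
    [(0.0, 2.0, "Hi, thanks for joining."), (2.1, 4.0, "Glad to be here.")],
    [(0.0, 2.05, "SPEAKER_00"), (2.05, 4.5, "SPEAKER_01")],
))
```

Overlapping speech is exactly where this simple max-overlap rule breaks down, which is why built-in diarization still wins on messy meetings.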
File size limits
- OpenAI Whisper API: 25 MB. Painful — you must chunk.
- Deepgram: 2 GB pre-recorded, unlimited streaming.
- AssemblyAI: 5 GB.
- Self-host: No limit (only your disk).
- LessRec: 1 GB per file (many hours of MP3; roughly 17 at 128 kbps).
For long-form work (depositions, courses, conferences), OpenAI's API is the worst choice. Pick anything else.
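If you're stuck with the 25 MB cap, the chunking math is simple constant-bitrate arithmetic. A back-of-envelope planner (assumes CBR; real splitting should cut on silence, e.g. with ffmpeg, not at arbitrary byte offsets):

```python
import math

def chunk_plan(duration_s: float, bitrate_kbps: float, cap_mb: float = 25):
    """Number of chunks needed under a file-size cap, and seconds per chunk.

    Assumes constant bitrate audio; cap is treated as 10^6-byte megabytes.
    """
    bytes_per_s = bitrate_kbps * 1000 / 8
    max_chunk_s = cap_mb * 1_000_000 / bytes_per_s
    n = max(1, math.ceil(duration_s / max_chunk_s))
    return n, duration_s / n

# A 3-hour deposition at 128 kbps (~173 MB) against a 25 MB cap:
print(chunk_plan(3 * 3600, 128))  # (7, ~1543 s per chunk)
```

Seven uploads, seven responses to stitch back together, plus boundary words to reconcile: that's the hidden cost behind "25 MB file cap".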
Pricing math at common volumes
| Volume per month | OpenAI | Deepgram | AssemblyAI | Self-host* | LessRec |
|---|---|---|---|---|---|
| 10 hrs | $3.60 | $2.58 | $6.12 | $40 fixed cost | $30 |
| 100 hrs | $36 | $25.80 | $61.20 | $40 fixed | $300 |
| 1,000 hrs | $360 | $258 | $612 | $80-200 (dedicated server) | $3,000 |
| 10,000 hrs | $3,600 | $2,580 | $6,120 | $500-1,500 (multi-GPU) | $30,000 (custom plan) |
*Self-host = Hetzner CX43 ($40/mo) running faster-whisper INT8 on CPU. Handles up to ~1,500 hrs/mo before saturating; add a second box past that.
Crossover points:
- <100 hrs/mo: OpenAI API or Deepgram is cheapest if you have dev time. LessRec if you don't want to build.
- 100-1,000 hrs/mo: Self-hosted Whisper saves significant money if you have engineering time. Deepgram if you need streaming.
- 1,000-10,000 hrs/mo: Self-hosted is dramatically cheaper than any API. The engineering cost amortizes fast.
- 10,000+ hrs/mo: Build your own infra, period. The API providers will offer you custom pricing — negotiate.
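The crossover points fall out directly from the per-minute rates. A toy model (the self-host curve uses the table's assumption of one ~$40 box per ~1,500 hrs/mo; real infra costs are lumpier than this):

```python
import math

PER_MIN = {"openai": 0.006, "deepgram": 0.0043, "assemblyai": 0.0102, "lessrec": 0.05}

def monthly_cost(provider: str, hours: float) -> float:
    """Monthly bill in USD for a given transcription volume."""
    if provider == "selfhost":
        # Assumption from the table: one ~$40 box handles ~1,500 hrs/mo
        return 40.0 * max(1, math.ceil(hours / 1500))
    return PER_MIN[provider] * hours * 60

# First volume where self-hosting undercuts the cheapest API (Deepgram):
breakeven = next(h for h in range(1, 2001)
                 if monthly_cost("selfhost", h) < monthly_cost("deepgram", h))
print(breakeven)  # 156 hrs/mo
```

Under these assumptions the break-even sits around 150 hrs/mo, which is why the 100-1,000 band above is where the self-host decision gets interesting.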
What we use at LessRec (transparent)
We run Whisper large-v3 INT8-quantized via faster-whisper on Hetzner CX43 dedicated CPUs. ~50-100x realtime per worker, ~$0.001/min compute cost. We charge $0.05/min retail because the price covers all-in service (queue, .docx/.srt converters, Stripe billing, UI, support) — not just the model API.
If you're a developer with bandwidth to build the wrapper yourself, OpenAI's API at $0.006/min is the right call. If you just want a transcript, LessRec is purpose-built for that.
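The ~$0.001/min compute figure can be sanity-checked from the table's own numbers; a one-liner (the capacity figure is from the pricing footnote, and the gap to ~$0.001 is my reading: headroom, redundancy, and idle time):

```python
def compute_cost_per_audio_min(server_usd_month: float, capacity_hrs_month: float) -> float:
    """Raw compute cost per minute of audio on a flat-rate server at full capacity."""
    return server_usd_month / (capacity_hrs_month * 60)

# $40/mo box, ~1,500 hrs/mo capacity (figures from the pricing table above):
print(round(compute_cost_per_audio_min(40, 1500), 5))  # 0.00044
```

Raw compute lands under $0.001/min; the rest of the retail price is everything that isn't the model.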
FAQ
Should I use AWS Transcribe / Google Speech-to-Text / Azure Speech?
Cloud transcription services exist, but in 2026 their accuracy lags Whisper / Deepgram / AssemblyAI by 1-3 percentage points of WER. Use them only if you're already deep in that cloud ecosystem and integration friction outweighs the accuracy gap.
What about open-source alternatives to Whisper?
Wav2Vec2, Conformer-CTC, NVIDIA NeMo — all real options. Accuracy is competitive but ecosystem maturity (libraries, language coverage, easy fine-tuning) lags Whisper. Stick with Whisper unless you have a specific reason.
How long until Whisper gets a "v4" or major upgrade?
OpenAI doesn't pre-announce. Last major version (large-v3) shipped late 2023. A v4 with native diarization + streaming + voice activity detection would close every remaining gap with Deepgram/AssemblyAI. Industry expectation: 2026-27.