Whisper vs Deepgram vs AssemblyAI: honest 2026 comparison
If you're building anything that turns audio into text in 2026, you've probably looked at Whisper, Deepgram, and AssemblyAI. All three make loud marketing claims. Here's an honest comparison from someone who has run all three in production.
TL;DR
| API | Best for | Price/min | Gotcha |
|---|---|---|---|
| OpenAI Whisper API | Quick prototypes, simple jobs | $0.006 | 25 MB file cap |
| Self-host Whisper (faster-whisper) | High-volume custom needs | ~$0.001 compute | Build infra yourself |
| Deepgram Nova-3 | Real-time streaming, low latency | $0.0043 | Worse on accents/non-English |
| AssemblyAI Universal-2 | Speaker labels, sentiment, summaries | $0.0102 | ~2x the price of competitors |
| LessRec (Whisper backend) | End users who don't want to build a wrapper | $0.05 | No streaming, no diarization yet |
Accuracy (word error rate, English clean audio)
From public benchmarks plus our own internal testing on 100 hours of mixed podcast / meeting / interview audio:
- Whisper large-v3: 5.2% WER
- Deepgram Nova-3: 4.8% WER
- AssemblyAI Universal-2: 4.5% WER
- OpenAI Whisper API (whisper-1): 6.1% WER (serves an older checkpoint than large-v3)
For everyday work this is a tie. The gap between 4.5% and 5.2% is roughly one extra wrong word per 140; you wouldn't notice it outside a benchmark. Pick on price and features, not accuracy.
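WER figures like these follow the standard definition: word-level edit distance divided by reference word count. A minimal scorer, for intuition only (not any vendor's official harness; real evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167
```

At 5% WER that's one such error every 20 words, which is why the table's differences only show up at benchmark scale.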
Where accuracy actually diverges:
- Heavy accents: Whisper wins. Trained on more diverse multilingual data.
- Multiple speakers, overlapping speech: AssemblyAI wins (speaker diarization built in).
- Music/SFX background: All three struggle. Run audio through vocal isolation first.
- Domain-specific terms (legal, medical, technical): Deepgram and AssemblyAI both let you upload custom vocabulary. Whisper doesn't (yet).
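Since Whisper has no custom-vocabulary hook, a common workaround is fuzzy post-correction of known domain terms after transcription. A sketch using the stdlib's `difflib` (the glossary terms and threshold here are illustrative assumptions, not a recommendation of specific values):

```python
import difflib

def correct_terms(text: str, glossary: list[str], cutoff: float = 0.8) -> str:
    """Replace transcribed words that closely match a glossary term.

    Word-level only: it won't rejoin a term the ASR split into two words.
    """
    lexicon = {t.lower(): t for t in glossary}
    out = []
    for word in text.split():
        key = word.lower().strip(".,;:")
        if key in lexicon:
            out.append(word)
            continue
        match = difflib.get_close_matches(key, lexicon, n=1, cutoff=cutoff)
        out.append(lexicon[match[0]] if match else word)
    return " ".join(out)

# Hypothetical mishearing of a technical term:
print(correct_terms("deploy it on kubernetis", ["Kubernetes"]))  # deploy it on Kubernetes
```

Crude compared to real keyword boosting inside the decoder, but it catches the worst repeated misspellings of jargon.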
Latency (real-time use cases)
| API | Streaming? | First-word latency | Use case |
|---|---|---|---|
| Deepgram | ✅ Native | ~300ms | Live captions, voice agents, call center |
| AssemblyAI | ✅ Native | ~500ms | Live with rich metadata |
| OpenAI Whisper API | ❌ Batch only | n/a | Async transcription |
| Self-host Whisper | ⚠️ With effort (faster-whisper streaming mode) | ~700ms | Custom voice apps |
| LessRec | ❌ Batch only (intentional) | n/a | Async upload-and-wait |
If you're building a real-time voice agent (Delphi-style mentor, customer support bot, live caption tool) → Deepgram. Period. Don't fight Whisper to do streaming when Deepgram does it natively.
If you're building anything else (transcription SaaS, async meeting notes, podcast tooling, document processing) → Whisper or its hosted alternatives.
Speaker diarization (who said what)
Native support: AssemblyAI ✅, Deepgram ✅ (extra cost), Whisper ❌ (but pyannote.audio adds it for free if you self-host).
If diarization matters to you and you don't want to write the pyannote integration: AssemblyAI is the easiest. One request flag (`speaker_labels`) and the response comes back with per-utterance speaker labels.
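If you do go the self-hosted route, the glue between Whisper and a diarizer is interval assignment: label each transcript segment with the speaker whose turn overlaps it most. A sketch (the tuple shapes are assumptions for illustration; pyannote.audio's actual objects differ):

```python
def assign_speakers(segments, turns):
    """Label each ASR segment with the speaker whose turn overlaps it most.

    segments: [(start_s, end_s, text)] from the transcriber;
    turns:    [(start_s, end_s, speaker)] from a diarizer such as pyannote.audio.
    """
    labeled = []
    for s_start, s_end, text in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(s_end, t_end) - max(s_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

print(assign_speakers(
    [(0.0, 2.0, "Hi, thanks for joining."), (2.1, 4.0, "Glad to be here.")],
    [(0.0, 2.05, "SPEAKER_00"), (2.05, 4.5, "SPEAKER_01")],
))
```

Overlapping speech is exactly where this simple max-overlap rule breaks down, which is why built-in diarization still wins on messy meetings.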
File size limits
- OpenAI Whisper API: 25 MB. Painful — you must chunk.
- Deepgram: 2 GB pre-recorded, unlimited streaming.
- AssemblyAI: 5 GB.
- Self-host: No limit (only your disk).
- LessRec: 1 GB per file (many hours of MP3; roughly 17 at 128 kbps).
For long-form work (depositions, courses, conferences), OpenAI's API is the worst choice. Pick anything else.
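If you're stuck with the 25 MB cap, the chunking math is simple constant-bitrate arithmetic. A back-of-envelope planner (assumes CBR; real splitting should cut on silence, e.g. with ffmpeg, not at arbitrary byte offsets):

```python
import math

def chunk_plan(duration_s: float, bitrate_kbps: float, cap_mb: float = 25):
    """Number of chunks needed under a file-size cap, and seconds per chunk.

    Assumes constant bitrate audio; cap is treated as 10^6-byte megabytes.
    """
    bytes_per_s = bitrate_kbps * 1000 / 8
    max_chunk_s = cap_mb * 1_000_000 / bytes_per_s
    n = max(1, math.ceil(duration_s / max_chunk_s))
    return n, duration_s / n

# A 3-hour deposition at 128 kbps (~173 MB) against a 25 MB cap:
print(chunk_plan(3 * 3600, 128))  # (7, ~1543 s per chunk)
```

Seven uploads, seven responses to stitch back together, plus boundary words to reconcile: that's the hidden cost behind "25 MB file cap".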
Pricing math at common volumes
| Volume per month | OpenAI | Deepgram | AssemblyAI | Self-host* | LessRec |
|---|---|---|---|---|---|
| 10 hrs | $3.60 | $2.58 | $6.12 | $40 fixed cost | $30 |
| 100 hrs | $36 | $25.80 | $61.20 | $40 fixed | $300 |
| 1,000 hrs | $360 | $258 | $612 | $80-200 (dedicated server) | $3,000 |
| 10,000 hrs | $3,600 | $2,580 | $6,120 | $500-1,500 (multi-GPU) | $30,000 (custom plan) |
*Self-host = Hetzner CX43 ($40/mo) running faster-whisper INT8 on CPU. Handles up to ~1,500 hrs/mo before saturating; add a second box past that.
Crossover points:
- <100 hrs/mo: OpenAI API or Deepgram is cheapest if you have dev time. LessRec if you don't want to build.
- 100-1,000 hrs/mo: Self-hosted Whisper saves significant money if you have engineering time. Deepgram if you need streaming.
- 1,000-10,000 hrs/mo: Self-hosted is dramatically cheaper than any API. The engineering cost amortizes fast.
- 10,000+ hrs/mo: Build your own infra, period. The API providers will offer you custom pricing — negotiate.
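The crossover points fall out directly from the per-minute rates. A toy model (the self-host curve uses the table's assumption of one ~$40 box per ~1,500 hrs/mo; real infra costs are lumpier than this):

```python
import math

PER_MIN = {"openai": 0.006, "deepgram": 0.0043, "assemblyai": 0.0102, "lessrec": 0.05}

def monthly_cost(provider: str, hours: float) -> float:
    """Monthly bill in USD for a given transcription volume."""
    if provider == "selfhost":
        # Assumption from the table: one ~$40 box handles ~1,500 hrs/mo
        return 40.0 * max(1, math.ceil(hours / 1500))
    return PER_MIN[provider] * hours * 60

# First volume where self-hosting undercuts the cheapest API (Deepgram):
breakeven = next(h for h in range(1, 2001)
                 if monthly_cost("selfhost", h) < monthly_cost("deepgram", h))
print(breakeven)  # 156 hrs/mo
```

Under these assumptions the break-even sits around 150 hrs/mo, which is why the 100-1,000 band above is where the self-host decision gets interesting.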
What we use at LessRec (transparent)
We run Whisper large-v3 INT8-quantized via faster-whisper on Hetzner CX43 dedicated CPUs. ~50-100x realtime per worker, ~$0.001/min compute cost. We charge $0.05/min retail because the price covers all-in service (queue, .docx/.srt converters, Stripe billing, UI, support) — not just the model API.
If you're a developer with bandwidth to build the wrapper yourself, OpenAI's API at $0.006/min is the right call. If you just want a transcript, LessRec is purpose-built for that.
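The ~$0.001/min compute figure can be sanity-checked from the table's own numbers; a one-liner (the capacity figure is from the pricing footnote, and the gap to ~$0.001 is my reading: headroom, redundancy, and idle time):

```python
def compute_cost_per_audio_min(server_usd_month: float, capacity_hrs_month: float) -> float:
    """Raw compute cost per minute of audio on a flat-rate server at full capacity."""
    return server_usd_month / (capacity_hrs_month * 60)

# $40/mo box, ~1,500 hrs/mo capacity (figures from the pricing table above):
print(round(compute_cost_per_audio_min(40, 1500), 5))  # 0.00044
```

Raw compute lands under $0.001/min; the rest of the retail price is everything that isn't the model.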
FAQ
Should I use AWS Transcribe / Google Speech-to-Text / Azure Speech?
Cloud transcription services exist, but in 2026 their accuracy lags Whisper / Deepgram / AssemblyAI by 1-3 percentage points of WER. Use them only if you're already deep in that cloud ecosystem and integration friction outweighs the accuracy gap.
What about open-source alternatives to Whisper?
Wav2Vec2, Conformer-CTC, NVIDIA NeMo — all real options. Accuracy is competitive but ecosystem maturity (libraries, language coverage, easy fine-tuning) lags Whisper. Stick with Whisper unless you have a specific reason.
How long until Whisper gets a "v4" or major upgrade?
OpenAI doesn't pre-announce. Last major version (large-v3) shipped late 2023. A v4 with native diarization + streaming + voice activity detection would close every remaining gap with Deepgram/AssemblyAI. Industry expectation: 2026-27.