Whisper speaker diarization 2026: what works, what doesn’t, and the pyannote/WhisperX stack
Whisper transcribes audio to text exceptionally well, but it does not by itself answer the question every podcaster, interviewer, journalist, and clinician actually wants answered: who said what? Speaker diarization is the second step — segmenting the audio into speaker turns and labelling each segment as Speaker 1, Speaker 2, etc. As of 2026 the engineering picture has finally settled around two practical paths: pyannote.audio 3.x as the canonical diarization model, and WhisperX as the integrated wrapper that pairs Whisper transcription with pyannote diarization and word-level timestamps in one pipeline.
This post covers what these stacks can and cannot do reliably, accuracy by use case, the four failure modes nobody documents, and the cheapest way to run diarization in production at scale.
The two-step problem
Whisper outputs text with rough segment-level timestamps (~5-30s segments). It does not know that segment 3 was a different speaker from segment 2. To get speaker-tagged output you need:
1. Voice Activity Detection (VAD) — finding the speech segments in the audio (silero VAD or pyannote VAD).
2. Speaker embedding — converting each speech segment to a fixed-dimensional vector that captures voice identity (pyannote uses ECAPA-TDNN-based embeddings, ~192 dims).
3. Clustering — grouping embeddings into speaker clusters (HMM, agglomerative, spectral). Number of speakers can be auto-detected or specified.
4. Alignment — mapping the speaker clusters back to Whisper’s transcript segments.
WhisperX bundles 1+2+3+4 with Whisper itself in a single CLI / Python API. Vanilla pyannote.audio gives you 1+2+3 and you align with Whisper output yourself.
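If you go the vanilla pyannote route, step 4 is yours to write. Here is a minimal sketch of that path, assuming you have accepted the gated pyannote/speaker-diarization-3.1 terms and have openai-whisper installed; the max-overlap assignment at the end is a naive stand-in for WhisperX’s alignment logic, not a copy of it:

```python
# DIY path: pyannote handles VAD + embeddings + clustering (steps 1-3),
# then each Whisper segment gets the speaker with the most overlap (step 4).
import whisper
from pyannote.audio import Pipeline

# Whisper transcription (segment-level timestamps)
asr = whisper.load_model("large-v3")
whisper_segments = asr.transcribe("audio.wav")["segments"]

# pyannote diarization (gated model; needs an HF token with accepted terms)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_xxx"
)
diarization = pipeline("audio.wav", num_speakers=2)  # pin the count if you know it
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

def dominant_speaker(seg_start, seg_end):
    """Speaker whose turns overlap this Whisper segment the most."""
    overlap = {}
    for start, end, speaker in turns:
        dur = min(seg_end, end) - max(seg_start, start)
        if dur > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + dur
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"

for seg in whisper_segments:
    seg["speaker"] = dominant_speaker(seg["start"], seg["end"])
    print(f"[{seg['speaker']}] {seg['text'].strip()}")
```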
Accuracy you can actually expect in 2026
Diarization Error Rate (DER) is the standard metric: the percentage of audio time handled wrongly, counting missed speech, false-alarm speech, and speaker confusion. Lower is better. State-of-the-art DER on clean clinical audio is 5-10%; under real-world conditions it’s usually 12-25%.
| Use case | DER, 2 speakers | DER, 3-4 speakers | DER, 5+ speakers |
|---|---|---|---|
| Podcast interview (clean studio) | 4-7% | 8-12% | 15-25% |
| Telehealth visit (1 doc + 1 patient) | 5-9% | n/a | n/a |
| Pediatric visit (doc + parent + child) | n/a | 15-20% | n/a |
| Conference call / Zoom meeting | 10-15% | 15-25% | 25-40% |
| Deposition / legal proceedings | 5-10% | 10-15% | 15-25% |
| Restaurant / cafe (background noise) | 20-30% | 30-45% | 50%+ |
| Multi-language code-switching | 15-20% | 20-30% | 30-40% |
The number of speakers matters more than the model. Two-speaker diarization is largely solved (sub-10% DER on clean audio); five-plus speakers in noise is still not reliable in 2026, even with frontier models.
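If you want to sanity-check these numbers on your own recordings, pyannote.metrics implements DER directly. A toy sketch, with a hand-written reference annotation standing in for ground truth:

```python
# Compare a reference annotation against the pipeline's output. Label names
# need not match; the metric finds the optimal speaker mapping.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 12.5)] = "alice"
reference[Segment(12.5, 30.0)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "SPEAKER_00"
hypothesis[Segment(11.0, 30.0)] = "SPEAKER_01"

der = DiarizationErrorRate()
print(f"DER: {der(reference, hypothesis):.1%}")  # ~5% for this toy example
```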
Stack choice: pyannote vs WhisperX vs commercial
| Stack | Pros | Cons | Best for |
|---|---|---|---|
| pyannote.audio 3.x (DIY) | Open source; SOTA accuracy; full control of pipeline | Non-trivial setup; pyannote 3.x needs a HuggingFace token and acceptance of the gated-repo terms | Engineers who want to own the stack |
| WhisperX (DIY) | Bundled Whisper + pyannote + word-level alignment; CLI works in 1 command | Slightly older pyannote inside; word-level alignment can drift | Practitioners who want diarization with minimum effort |
| AssemblyAI Universal-1 | Built-in diarization, cloud-managed, decent accuracy | $0.012-0.037/min; no on-prem | Teams that don’t want to run their own GPU |
| Deepgram Nova-3 | Diarization built-in, fast, decent | $0.0043/min for streaming; pre-recorded $0.0145/min | Real-time / streaming diarization needs |
| Rev AI Streaming + Diarization | High-quality, US-hosted, BAA available | $0.02-0.035/min | Legal + medical with compliance needs |
| LessRec (Whisper Large v3 + pyannote) | $0.05/min flat; same accuracy as DIY without setup | Pre-recorded only (no streaming yet) | Solo / small team with batch audio |
Four failure modes nobody documents
- Two voices that sound alike. Brothers, twins, similar-aged colleagues with similar speech patterns. Speaker embeddings cluster them together. DER on this case is 30-50%, not the headline 8%. No fix — the clustering is fundamentally limited.
- One speaker who whispers, then projects. Voice quality changes shift the embedding far enough that the same speaker gets two cluster IDs. Mitigation: use a longer minimum segment duration (3+ seconds), smooth across boundaries (see the sketch after this list), or post-process with semantic continuity (an LLM pass).
- Speaker count mis-detection. Auto-detection often over-counts on noisy audio (one speaker becomes 2-3) and under-counts on similar voices. If you know the speaker count in advance (e.g. clinical visit = 2), pin it explicitly. Cuts DER by 5-10 points.
- Crosstalk and overlapping speech. pyannote 3.x has overlap detection, but accuracy on overlapping segments is still 30-50% DER. For depositions with frequent crosstalk, expect manual cleanup; meetings with raise-hand etiquette fare much better.
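For the second failure mode, the cheapest version of “smooth across boundaries” is a rule that folds very short speaker flips back into the surrounding speaker. A minimal sketch; the 1.5-second threshold is a guess to tune on your own audio, not a canonical value:

```python
# Merge very short speaker flips into the surrounding speaker.
def smooth_speaker_flips(segments, min_dur=1.5):
    """segments: list of dicts with 'start', 'end', 'speaker', sorted by start.
    Reassigns segments shorter than min_dur seconds that are sandwiched
    between two segments from the same other speaker."""
    out = [dict(s) for s in segments]
    for i in range(1, len(out) - 1):
        prev_spk, next_spk = out[i - 1]["speaker"], out[i + 1]["speaker"]
        dur = out[i]["end"] - out[i]["start"]
        if dur < min_dur and prev_spk == next_spk and out[i]["speaker"] != prev_spk:
            out[i]["speaker"] = prev_spk
    return out
```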
Hardware: GPU vs CPU
WhisperX on CPU is slow but works for batch processing. WhisperX on GPU (RTX 3090, 4090, A40, H100) is real-time-or-faster. The cross-over for cost-effectiveness:
| Setup | Time to process 1 hour of audio | Cost per hour of audio processed |
|---|---|---|
| M2 Mac mini (CPU only) | 15-25 min | $0.30-0.50 (electricity) |
| RTX 3090 (24GB VRAM) | 3-6 min | $0.30-0.60 (cloud rental ~$0.30/hr) |
| RTX 4090 (24GB VRAM) | 2-4 min | $0.60-1.20 (cloud rental ~$0.50/hr) |
| A40 (48GB VRAM) | 1.5-3 min | $0.40-0.70 |
| H100 (80GB VRAM) | 0.8-1.5 min | $2-4 |
| OpenAI Whisper API + AssemblyAI diarization | ~real-time (cloud) | $0.84/hr (Whisper $0.36 + AAI $0.48) |
For under ~50 hours of audio per month, the cloud API path is cheapest. For 50-500 hours, owning a 3090 box pays for itself in 6-12 months. For 500+ hours, consider an on-prem A40 or H100 cluster.
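A back-of-the-envelope version of that break-even, using the $0.84/hr cloud rate from the table above; the GPU price and electricity figure below are placeholder assumptions to swap for your own:

```python
# Rough break-even for buying a GPU box vs. staying on the cloud API path.
# gpu_price and electricity_rate are assumed placeholders, not quoted prices.
def months_to_break_even(hours_per_month, gpu_price=750.0,
                         cloud_rate=0.84, electricity_rate=0.10):
    monthly_savings = hours_per_month * (cloud_rate - electricity_rate)
    return gpu_price / monthly_savings

for hours in (100, 250, 500):
    print(f"{hours:>3} audio-hours/mo -> {months_to_break_even(hours):.1f} months")
# ~10, ~4, and ~2 months respectively under these assumptions
```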
The 5-minute setup (WhisperX with diarization)
```bash
# 1. Install (Python 3.10+, CUDA 11.8+ for GPU)
pip install whisperx

# 2. HuggingFace token (free) for pyannote gated model
export HF_TOKEN=hf_xxx
# Accept terms at: https://hf.co/pyannote/segmentation-3.0
#                  https://hf.co/pyannote/speaker-diarization-3.1

# 3. Run
whisperx audio.mp3 \
  --model large-v3 \
  --diarize \
  --hf_token $HF_TOKEN \
  --min_speakers 2 \
  --max_speakers 4 \
  --output_format srt
```
Output is an SRT with speaker tags inline (`SPEAKER_00`, `SPEAKER_01`). Re-label them post-hoc to Alice/Bob/Doctor/Patient in your editor.
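The same pipeline is available from Python, roughly as shown in the WhisperX README. Module paths have shifted between releases (the diarization class lives in whisperx.diarize in newer versions), so treat this as a sketch and check against your installed version:

```python
import whisperx

device = "cuda"        # or "cpu"
hf_token = "hf_xxx"    # same token as above

# Transcribe (batched faster-whisper under the hood)
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Diarize and attach speaker labels to words/segments
# (in recent releases this class lives in whisperx.diarize)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```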
Practical recipes by use case
Podcast interview (1 host + 1 guest)
WhisperX with `--min_speakers 2 --max_speakers 2`. DER 4-7%. Total per-hour cost <$0.20 if you own a GPU, ~$0.50 cloud.
Telehealth visit (doc + patient)
WhisperX or the LessRec HIPAA tier with the speaker count pinned to 2. Run a post-process LLM pass that re-labels SPEAKER_00 vs SPEAKER_01 based on content (“the speaker asking diagnostic questions is the doc”). DER 5-9%.
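A crude rule-based stand-in for that LLM pass, handy as a fallback when you don’t want a second model call; the cue list below is illustrative, not a validated feature set:

```python
# The speaker who asks more questions and uses more clinical phrasing is
# probably the doctor. CLINICAL_CUES is a made-up example list.
CLINICAL_CUES = ("how long", "any pain", "medication", "symptoms", "prescri", "dosage")

def relabel_clinician(segments):
    """segments: list of dicts with 'speaker' and 'text' keys."""
    scores = {}
    for seg in segments:
        text = seg["text"].lower()
        score = text.count("?") + sum(cue in text for cue in CLINICAL_CUES)
        scores[seg["speaker"]] = scores.get(seg["speaker"], 0) + score
    doctor = max(scores, key=scores.get)
    mapping = {spk: ("Doctor" if spk == doctor else "Patient") for spk in scores}
    return [dict(seg, speaker=mapping[seg["speaker"]]) for seg in segments]
```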
Multi-party meeting / Zoom
WhisperX with `--min_speakers` and `--max_speakers` pinned to the expected headcount. Expect 15-25% DER for 3-4 speakers and 25-40% for 5+ (see the accuracy table above), and budget time for manual cleanup.
Deposition / legal proceedings
WhisperX with explicit speaker count, then manual review of crosstalk segments. DER 10-15% on clean recordings; legal-grade output needs human verification regardless. Or hire a court reporter for the binding transcript and use AI for first-pass drafting.
Conference panel / podcast roundtable (4-6 speakers)
This is the hardest case. Plan for 25-40% DER and significant manual cleanup. Multi-mic recording (one mic per speaker) makes the problem trivial — the diarization is solved by which mic captured which voice. If you can’t multi-mic, pre-tag the speaker order in your notes and re-label after.
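A sketch of that multi-mic bookkeeping, assuming openai-whisper and one audio file per speaker (the file names are just examples):

```python
# Transcribe every track separately, tag segments with the track owner,
# and merge by start time. No diarization model needed.
import whisper

tracks = {"Host": "host.wav", "Guest A": "guest_a.wav", "Guest B": "guest_b.wav"}
model = whisper.load_model("large-v3")

merged = []
for speaker, path in tracks.items():
    for seg in model.transcribe(path)["segments"]:
        merged.append({"start": seg["start"], "end": seg["end"],
                       "speaker": speaker, "text": seg["text"].strip()})

merged.sort(key=lambda s: s["start"])
for seg in merged:
    print(f"[{seg['start']:7.1f}] {seg['speaker']}: {seg['text']}")
```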
What’s coming in late 2026
Three developments to watch:
- NeMo Diarization 2.0 (NVIDIA) — targets sub-5% DER on multi-party meetings, streaming-capable. Beta in mid-2026.
- Whisper Large v4 (OpenAI rumored) — reportedly includes built-in diarization; would fold the two-step problem into one model.
- Voxtral (Mistral) — multimodal model with built-in diarization claims. Early benchmarks promising for European languages.
For now (Q2 2026) WhisperX + pyannote 3.x is the production-ready answer. It’s what we run at LessRec for diarized output requests.