Whisper speaker diarization 2026: what works, what doesn’t, and the pyannote/WhisperX stack
Whisper transcribes audio to text exceptionally well, but it does not by itself answer the question every podcaster, interviewer, journalist, and clinician actually wants answered: who said what? Speaker diarization is the second step — segmenting the audio into speaker turns and labelling each segment as Speaker 1, Speaker 2, etc. As of 2026 the engineering picture has finally settled around two practical paths: pyannote.audio 3.x as the canonical diarization model, and WhisperX as the integrated wrapper that pairs Whisper transcription with pyannote diarization and word-level timestamps in one pipeline.
This post covers what these stacks can and cannot do reliably, accuracy by use case, the four failure modes nobody documents, and the cheapest way to run diarization in production at scale.
The two-step problem
Whisper outputs text with rough segment-level timestamps (~5-30s segments). It does not know that segment 3 was a different speaker from segment 2. To get speaker-tagged output you need:
1. Voice Activity Detection (VAD) — finding the speech segments in the audio (silero VAD or pyannote VAD).
2. Speaker embedding — converting each speech segment to a fixed-dimensional vector that captures voice identity (pyannote uses ECAPA-TDNN-based embeddings, ~192 dims).
3. Clustering — grouping embeddings into speaker clusters (HMM, agglomerative, spectral). Number of speakers can be auto-detected or specified.
4. Alignment — mapping the speaker clusters back to Whisper’s transcript segments.
WhisperX bundles 1+2+3+4 with Whisper itself in a single CLI / Python API. Vanilla pyannote.audio gives you 1+2+3 and you align with Whisper output yourself.
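If you go the vanilla pyannote route, step 4 is yours to write. Here is a minimal sketch of that path, assuming you have accepted the gated pyannote/speaker-diarization-3.1 terms and have openai-whisper installed; the max-overlap assignment at the end is a naive stand-in for WhisperX’s alignment logic, not a copy of it:

```python
# DIY path: pyannote handles VAD + embeddings + clustering (steps 1-3),
# then each Whisper segment gets the speaker with the most overlap (step 4).
import whisper
from pyannote.audio import Pipeline

# Whisper transcription (segment-level timestamps)
asr = whisper.load_model("large-v3")
whisper_segments = asr.transcribe("audio.wav")["segments"]

# pyannote diarization (gated model; needs an HF token with accepted terms)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_xxx"
)
diarization = pipeline("audio.wav", num_speakers=2)  # pin the count if you know it
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

def dominant_speaker(seg_start, seg_end):
    """Speaker whose turns overlap this Whisper segment the most."""
    overlap = {}
    for start, end, speaker in turns:
        dur = min(seg_end, end) - max(seg_start, start)
        if dur > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + dur
    return max(overlap, key=overlap.get) if overlap else "UNKNOWN"

for seg in whisper_segments:
    seg["speaker"] = dominant_speaker(seg["start"], seg["end"])
    print(f"[{seg['speaker']}] {seg['text'].strip()}")
```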
Accuracy you can actually expect in 2026
Diarization Error Rate (DER) is the standard metric: the percentage of audio time handled wrongly, counting missed speech, false-alarm speech, and speaker confusion. Lower is better. State-of-the-art DER on clean clinical audio is 5-10%; under real-world conditions it’s usually 12-25%.
| Use case | DER, 2 speakers | DER, 3-4 speakers | DER, 5+ speakers |
|---|---|---|---|
| Podcast interview (clean studio) | 4-7% | 8-12% | 15-25% |
| Telehealth visit (1 doc + 1 patient) | 5-9% | n/a | n/a |
| Pediatric visit (doc + parent + child) | n/a | 15-20% | n/a |
| Conference call / Zoom meeting | 10-15% | 15-25% | 25-40% |
| Deposition / legal proceedings | 5-10% | 10-15% | 15-25% |
| Restaurant / cafe (background noise) | 20-30% | 30-45% | 50%+ |
| Multi-language code-switching | 15-20% | 20-30% | 30-40% |
The number of speakers matters more than the model. Two-speaker diarization is largely solved (sub-10% DER on clean audio); five-plus speakers in noise is still not reliable in 2026, even with frontier models.
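If you want to sanity-check these numbers on your own recordings, pyannote.metrics implements DER directly. A toy sketch, with a hand-written reference annotation standing in for ground truth:

```python
# Compare a reference annotation against the pipeline's output. Label names
# need not match; the metric finds the optimal speaker mapping.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 12.5)] = "alice"
reference[Segment(12.5, 30.0)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "SPEAKER_00"
hypothesis[Segment(11.0, 30.0)] = "SPEAKER_01"

der = DiarizationErrorRate()
print(f"DER: {der(reference, hypothesis):.1%}")  # ~5% for this toy example
```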
Stack choice: pyannote vs WhisperX vs commercial
| Stack | Pros | Cons | Best for |
|---|---|---|---|
| pyannote.audio 3.x (DIY) | Open source; SOTA accuracy; full control of pipeline | Non-trivial setup; pyannote 3.x needs a HuggingFace token and acceptance of the gated-repo terms | Engineers who want to own the stack |
| WhisperX (DIY) | Bundled Whisper + pyannote + word-level alignment; CLI works in 1 command | Slightly older pyannote inside; word-level alignment can drift | Practitioners who want diarization with minimum effort |
| AssemblyAI Universal-1 | Built-in diarization, cloud-managed, decent accuracy | $0.012-0.037/min; no on-prem | Teams that don’t want to run their own GPU |
| Deepgram Nova-3 | Diarization built-in, fast, decent | $0.0043/min for streaming; pre-recorded $0.0145/min | Real-time / streaming diarization needs |
| Rev AI Streaming + Diarization | High-quality, US-hosted, BAA available | $0.02-0.035/min | Legal + medical with compliance needs |
| LessRec (Whisper Large v3 + pyannote) | $0.05/min flat; same accuracy as DIY without setup | Pre-recorded only (no streaming yet) | Solo / small team with batch audio |
Four failure modes nobody documents
- Two voices that sound alike. Brothers, twins, similar-aged colleagues with similar speech patterns. Speaker embeddings cluster them together. DER on this case is 30-50%, not the headline 8%. No fix — the clustering is fundamentally limited.
- One speaker who whispers, then projects. Voice quality changes shift the embedding far enough that the same speaker gets two cluster IDs. Mitigation: use a longer minimum segment duration (3+ seconds), smooth across boundaries (see the sketch after this list), or post-process with semantic continuity (an LLM pass).
- Speaker count mis-detection. Auto-detection often over-counts on noisy audio (one speaker becomes 2-3) and under-counts on similar voices. If you know the speaker count in advance (e.g. clinical visit = 2), pin it explicitly. Cuts DER by 5-10 points.
- Crosstalk and overlapping speech. pyannote 3.x has overlap detection, but accuracy on overlapping segments is still 30-50% DER. For depositions with frequent crosstalk, expect manual cleanup; meetings with raise-hand etiquette fare much better.
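For the second failure mode, the cheapest version of “smooth across boundaries” is a rule that folds very short speaker flips back into the surrounding speaker. A minimal sketch; the 1.5-second threshold is a guess to tune on your own audio, not a canonical value:

```python
# Merge very short speaker flips into the surrounding speaker.
def smooth_speaker_flips(segments, min_dur=1.5):
    """segments: list of dicts with 'start', 'end', 'speaker', sorted by start.
    Reassigns segments shorter than min_dur seconds that are sandwiched
    between two segments from the same other speaker."""
    out = [dict(s) for s in segments]
    for i in range(1, len(out) - 1):
        prev_spk, next_spk = out[i - 1]["speaker"], out[i + 1]["speaker"]
        dur = out[i]["end"] - out[i]["start"]
        if dur < min_dur and prev_spk == next_spk and out[i]["speaker"] != prev_spk:
            out[i]["speaker"] = prev_spk
    return out
```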
Hardware: GPU vs CPU
WhisperX on CPU is slow but works for batch processing. WhisperX on GPU (RTX 3090, 4090, A40, H100) is real-time-or-faster. The cross-over for cost-effectiveness:
| Setup | Time to process 1 hour of audio | Cost per hour of audio processed |
|---|---|---|
| M2 Mac mini (CPU only) | 15-25 min | $0.30-0.50 (electricity) |
| RTX 3090 (24GB VRAM) | 3-6 min | $0.30-0.60 (cloud rental ~$0.30/hr) |
| RTX 4090 (24GB VRAM) | 2-4 min | $0.60-1.20 (cloud rental ~$0.50/hr) |
| A40 (48GB VRAM) | 1.5-3 min | $0.40-0.70 |
| H100 (80GB VRAM) | 0.8-1.5 min | $2-4 |
| OpenAI Whisper API + AssemblyAI diarization | ~real-time (cloud) | $0.84/hr (Whisper $0.36 + AAI $0.48) |
For under ~50 hours of audio per month, the cloud API path is cheapest. For 50-500 hours, owning a 3090 box pays for itself in 6-12 months. For 500+ hours, consider an on-prem A40 or H100 cluster.
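A back-of-the-envelope version of that break-even, using the $0.84/hr cloud rate from the table above; the GPU price and electricity figure below are placeholder assumptions to swap for your own:

```python
# Rough break-even for buying a GPU box vs. staying on the cloud API path.
# gpu_price and electricity_rate are assumed placeholders, not quoted prices.
def months_to_break_even(hours_per_month, gpu_price=750.0,
                         cloud_rate=0.84, electricity_rate=0.10):
    monthly_savings = hours_per_month * (cloud_rate - electricity_rate)
    return gpu_price / monthly_savings

for hours in (100, 250, 500):
    print(f"{hours:>3} audio-hours/mo -> {months_to_break_even(hours):.1f} months")
# ~10, ~4, and ~2 months respectively under these assumptions
```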
The 5-minute setup (WhisperX with diarization)
```bash
# 1. Install (Python 3.10+, CUDA 11.8+ for GPU)
pip install whisperx

# 2. HuggingFace token (free) for pyannote gated model
export HF_TOKEN=hf_xxx
# Accept terms at: https://hf.co/pyannote/segmentation-3.0
#                  https://hf.co/pyannote/speaker-diarization-3.1

# 3. Run
whisperx audio.mp3 \
  --model large-v3 \
  --diarize \
  --hf_token $HF_TOKEN \
  --min_speakers 2 \
  --max_speakers 4 \
  --output_format srt
```
Output is an SRT with speaker tags inline (`SPEAKER_00`, `SPEAKER_01`). Re-label them post-hoc to Alice/Bob/Doctor/Patient in your editor.
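The same pipeline is available from Python, roughly as shown in the WhisperX README. Module paths have shifted between releases (the diarization class lives in whisperx.diarize in newer versions), so treat this as a sketch and check against your installed version:

```python
import whisperx

device = "cuda"        # or "cpu"
hf_token = "hf_xxx"    # same token as above

# Transcribe (batched faster-whisper under the hood)
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio("audio.mp3")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Diarize and attach speaker labels to words/segments
# (in recent releases this class lives in whisperx.diarize)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```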
Practical recipes by use case
Podcast interview (1 host + 1 guest)
WhisperX with `--min_speakers 2 --max_speakers 2`. DER 4-7%. Total per-hour cost <$0.20 if you own a GPU, ~$0.50 cloud.
Telehealth visit (doc + patient)
WhisperX or the LessRec HIPAA tier with the speaker count pinned to 2. Run a post-process LLM pass that re-labels SPEAKER_00 vs SPEAKER_01 based on content (“the speaker asking diagnostic questions is the doc”). DER 5-9%.
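A crude rule-based stand-in for that LLM pass, handy as a fallback when you don’t want a second model call; the cue list below is illustrative, not a validated feature set:

```python
# The speaker who asks more questions and uses more clinical phrasing is
# probably the doctor. CLINICAL_CUES is a made-up example list.
CLINICAL_CUES = ("how long", "any pain", "medication", "symptoms", "prescri", "dosage")

def relabel_clinician(segments):
    """segments: list of dicts with 'speaker' and 'text' keys."""
    scores = {}
    for seg in segments:
        text = seg["text"].lower()
        score = text.count("?") + sum(cue in text for cue in CLINICAL_CUES)
        scores[seg["speaker"]] = scores.get(seg["speaker"], 0) + score
    doctor = max(scores, key=scores.get)
    mapping = {spk: ("Doctor" if spk == doctor else "Patient") for spk in scores}
    return [dict(seg, speaker=mapping[seg["speaker"]]) for seg in segments]
```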
Multi-party meeting / Zoom
WhisperX with `--min_speakers` and `--max_speakers` pinned to the expected headcount. Expect 15-25% DER for 3-4 speakers and 25-40% for 5+ (see the accuracy table above), and budget time for manual cleanup.
Deposition / legal proceedings
WhisperX with explicit speaker count, then manual review of crosstalk segments. DER 10-15% on clean recordings; legal-grade output needs human verification regardless. Or hire a court reporter for the binding transcript and use AI for first-pass drafting.
Conference panel / podcast roundtable (4-6 speakers)
This is the hardest case. Plan for 25-40% DER and significant manual cleanup. Multi-mic recording (one mic per speaker) makes the problem trivial — the diarization is solved by which mic captured which voice. If you can’t multi-mic, pre-tag the speaker order in your notes and re-label after.
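A sketch of that multi-mic bookkeeping, assuming openai-whisper and one audio file per speaker (the file names are just examples):

```python
# Transcribe every track separately, tag segments with the track owner,
# and merge by start time. No diarization model needed.
import whisper

tracks = {"Host": "host.wav", "Guest A": "guest_a.wav", "Guest B": "guest_b.wav"}
model = whisper.load_model("large-v3")

merged = []
for speaker, path in tracks.items():
    for seg in model.transcribe(path)["segments"]:
        merged.append({"start": seg["start"], "end": seg["end"],
                       "speaker": speaker, "text": seg["text"].strip()})

merged.sort(key=lambda s: s["start"])
for seg in merged:
    print(f"[{seg['start']:7.1f}] {seg['speaker']}: {seg['text']}")
```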
What’s coming in late 2026
Three developments to watch:
- NeMo Diarization 2.0 (NVIDIA) — targets sub-5% DER on multi-party meetings, streaming-capable. Beta in mid-2026.
- Whisper Large v4 (OpenAI rumored) — reportedly includes built-in diarization; would fold the two-step problem into one model.
- Voxtral (Mistral) — multimodal model with built-in diarization claims. Early benchmarks promising for European languages.
For now (Q2 2026) WhisperX + pyannote 3.x is the production-ready answer. It’s what we run at LessRec for diarized output requests.