# Whisper self-hosted vs OpenAI API 2026: cost break-even, hardware, and latency trade-offs
OpenAI's Whisper API is $0.006 per minute. Whisper.cpp running on a $2,000 used workstation is $0/min after you buy the box. Somewhere between "10 minutes a month" and "100 hours a day" is your break-even. Here's the math with the inputs sales decks omit: hardware depreciation, electricity, your time, and the HIPAA premium.
## The headline numbers
| Option | Cost per minute | Setup time | HIPAA-clean? |
|---|---|---|---|
| OpenAI Whisper API (api.openai.com) | $0.006 | 5 minutes | Only with BAA addendum (request through OpenAI Enterprise) |
| OpenAI direct, no BAA | $0.006 | 5 minutes | NO — PHI prohibited |
| Azure OpenAI Whisper | $0.006–$0.012 (region-dependent) | 30 minutes | Yes — covered by Azure HIPAA BAA |
| Self-hosted whisper.cpp on used RTX 3090 ($800) | ~$0.0001 amortized | 1–2 hours one-time | Fully — PHI never leaves premises |
| Self-hosted faster-whisper on RTX 4090 ($1,800) | ~$0.0002 amortized | 1–2 hours | Fully |
| Self-hosted on M2/M3 Mac (Metal) | ~$0.0001 | 30 min (Homebrew) | Fully |
## The break-even formula
Break-even minutes = (hardware cost amortized over N months + monthly electricity + your maintenance hours × $/hr) ÷ ($0.006 − per-minute self-host cost).
Plug in the numbers:
- Hardware: used RTX 3090 PC = $1,800 all-in, depreciate over 36 months = $50/month.
- Electricity: 250W average load × 8 hr/day × 30 days = 60 kWh/month at $0.18/kWh = $10.80/month.
- Your time: 30 min/month maintenance at $100/hr = $50/month.
- Total monthly fixed cost: ~$110.
- API cost saved per minute: $0.006 (since self-host is essentially $0).
- Break-even = $110 ÷ $0.006 = 18,333 minutes/month = ~305 hours/month = ~10 hours/day.
If you transcribe more than ~10 hours of audio per day, self-hosting is cheaper. Less than that, API wins on pure cost.
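The formula can be sketched in a few lines of Python. It uses the exact monthly total ($110.80 rather than the rounded $110), so it lands slightly above the 18,333-minute figure; every input is the worked-example assumption from the list above, not a measurement.

```python
# Break-even sketch for self-hosted vs API transcription.
# All rates are the article's worked-example assumptions.
def break_even_minutes(hardware_cost, months, kwh_per_month, kwh_rate,
                       maint_hours, hourly_rate, api_rate, self_host_rate=0.0):
    """Monthly fixed cost divided by per-minute savings vs the API."""
    monthly_fixed = (hardware_cost / months          # depreciation
                     + kwh_per_month * kwh_rate      # electricity
                     + maint_hours * hourly_rate)    # your time
    return monthly_fixed / (api_rate - self_host_rate)

minutes = break_even_minutes(
    hardware_cost=1800, months=36,    # $1,800 all-in PC over 3 years
    kwh_per_month=60, kwh_rate=0.18,  # 250W x 8 h/day x 30 days
    maint_hours=0.5, hourly_rate=100, # 30 min/month maintenance
    api_rate=0.006,                   # OpenAI Whisper per-minute price
)
print(f"{minutes:,.0f} min/month ≈ {minutes/60:.0f} h/month ≈ {minutes/60/30:.1f} h/day")
# prints: 18,467 min/month ≈ 308 h/month ≈ 10.3 h/day
```

Swap in your own electricity rate and hourly rate; the result is sensitive to both.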
## Where the simple math breaks
### HIPAA premium
The raw OpenAI API requires an Enterprise BAA before you can legally send PHI. The BAA has no extra fee in 2026 but takes 3–6 weeks of legal back-and-forth. Self-hosting needs no BAA chain (you are the responsible party). For healthcare, "self-host wins instantly" is the right read regardless of volume: legal-friction cost, not the per-minute price, dominates.
### Latency
| Stack | Latency for 10-min audio |
|---|---|
| OpenAI API (large-v3, network US) | 30–60 sec |
| Azure OpenAI (regional) | 20–45 sec |
| whisper.cpp on M2 Pro Mac (medium model) | ~50 sec (CPU + Metal) |
| faster-whisper on RTX 3090 (large-v3) | 5–10 sec |
| faster-whisper on RTX 4090 (large-v3) | 3–6 sec |
For real-time scribe workflows (live note generation during visit), GPU self-host wins on latency too. For batch overnight processing, API is fine.
### Reliability
Hosted APIs have ~99.9% uptime. Your homelab might be 99.0–99.5%. For solo practice this is fine. For a 4-clinic group with shared infrastructure, plan a fallback path (one server fails over to another, or to OpenAI API as backup).
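That fallback path can be a few lines of glue code: try the local box first, fall back to the hosted API on any failure. A minimal sketch, with assumptions: `LOCAL_URL` and its `/transcribe` route are hypothetical stand-ins for whatever HTTP wrapper you run in front of faster-whisper; the OpenAI endpoint and `whisper-1` model name are the hosted API's real ones.

```python
import requests

# Hypothetical local transcription service (your faster-whisper wrapper).
LOCAL_URL = "http://gpu-box.internal:8000/transcribe"

def transcribe(path: str, api_key: str) -> str:
    """Prefer the local GPU box; fall back to the hosted API if it's down."""
    try:
        with open(path, "rb") as f:
            r = requests.post(LOCAL_URL, files={"file": f}, timeout=30)
        r.raise_for_status()
        return r.json()["text"]
    except requests.RequestException:
        # Local box unreachable or erroring: fall back to OpenAI.
        # Only allow this branch if your BAA/compliance posture permits PHI here.
        with open(path, "rb") as f:
            r = requests.post(
                "https://api.openai.com/v1/audio/transcriptions",
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": f},
                data={"model": "whisper-1"},
                timeout=120,
            )
        r.raise_for_status()
        return r.json()["text"]
```

For PHI workloads, a stricter variant raises instead of falling back when no BAA covers the hosted path.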
### Quality
OpenAI API uses whisper-large-v3 by default. Self-hosted gives you choice of model (tiny / base / small / medium / large-v3 / large-v3-turbo). Quality difference between large-v3 and large-v3-turbo is < 1% WER on English; turbo is 8× faster. For medical workloads, run large-v3 if accuracy matters; turbo for high-volume real-time.
## Hardware recommendations 2026
| Use case | Recommended hardware | Why |
|---|---|---|
| Solo provider, <3 hr audio/day | M2/M3 Mac mini (8–16GB RAM) | Silent, $599–$899, runs base/medium models well |
| Solo provider, 3–10 hr/day | Used Linux PC + RTX 3060 12GB ($600–$900) | Handles large-v3 in real-time |
| 2–5 provider clinic | RTX 3090 24GB used ($800) or 4090 24GB ($1,800 new) | Parallel 4 streams real-time |
| 5–20 provider group | 2× RTX 4090 in single workstation, or A40 48GB ($3,500 used) | 10+ streams real-time, batch overnight |
| Hospital / 50+ provider | Dedicated GPU server (A100 40GB or H100), or stay on Azure OpenAI | Self-host wins at this scale only with FTE to maintain |
The used GPU market in 2026 is your friend. An RTX 3090 (24GB) at $700–$900 used handles whisper-large-v3 at 5× real-time, runs a quantized Llama 3 70B for note generation, and pulls 250–350W under load. That's the small-clinic sweet spot.
## Setup: whisper.cpp in 30 minutes (Mac or Linux)
```bash
# Mac (Homebrew)
brew install whisper-cpp ffmpeg

# Linux (build from source for best performance)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make GGML_CUDA=1   # NVIDIA path

# Download a model
bash ./models/download-ggml-model.sh large-v3-turbo

# Run
./main -m models/ggml-large-v3-turbo.bin -f visit.wav -otxt -osrt
```
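One gotcha before the run step: whisper.cpp only accepts 16 kHz mono PCM WAV input, and most recorders produce m4a or 44.1 kHz files. A quick ffmpeg conversion handles it (`visit.m4a` is a stand-in for your actual recording):

```shell
# Resample any recording to the 16 kHz mono PCM WAV whisper.cpp expects.
ffmpeg -i visit.m4a -ar 16000 -ac 1 -c:a pcm_s16le visit.wav
```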
For higher throughput in Python, faster-whisper (CTranslate2 backend) is 2–4× faster than whisper.cpp at the same accuracy:
```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, _ = model.transcribe("visit.wav", language="en")
for s in segments:
    print(f"[{s.start:.1f}-{s.end:.1f}] {s.text}")
```
## The hidden cost: your time
Self-hosting in year 1: ~10 hours total setup + troubleshooting. Year 2+: ~1 hour/quarter. If your billable hourly rate is $200, year 1 costs you $2,000 of "labor" on top of hardware. The break-even formula above already accounts for this.
If you have zero comfort with command line, this isn't the path. Heidi Health at $150/month buys away the maintenance burden — and for <200 visits/month, that's the right call.
## Decision tree
- Healthcare workload (PHI)? → Skip raw OpenAI API (no PHI without BAA). Either Azure OpenAI (easiest) or self-hosted.
- < 5 hr audio/day total? → API wins on cost. Use Azure OpenAI Whisper.
- 5–15 hr audio/day, technical comfort? → Self-host on a used RTX 3090. Hardware alone pays back in 6–9 months; the full break-even, counting your time, sits near 10 hr/day.
- > 15 hr/day or > 5 providers? → Self-host on RTX 4090 or A40. Pays back in 3–5 months.
- Real-time scribe workflow during visit? → Self-host (latency wins). Even if cost is closer to even.
- No engineer, no NAS, no comfort with bash? → Stay on hosted (Azure OpenAI or LessRec). The cost premium is worth your time back.
## Hosted Whisper at $0.05/min — while you decide
No subscription floor. BAA on request. 10 free minutes, no signup.
Transcribe a file →