
Whisper self-hosted vs OpenAI API 2026: cost break-even, hardware, and latency trade-offs

May 8, 2026 · 7 min read

OpenAI's Whisper API is $0.006 per minute. Whisper.cpp running on a $2,000 used workstation is $0/min after you buy the box. Somewhere between "10 minutes a month" and "100 hours a day" is your break-even. Here's the math with the inputs sales decks omit: hardware depreciation, electricity, your time, and the HIPAA premium.

The headline numbers

| Option | Cost per minute | Setup time | HIPAA-clean? |
|---|---|---|---|
| OpenAI Whisper API (api.openai.com) | $0.006 | 5 minutes | Only with BAA addendum (request through OpenAI Enterprise) |
| OpenAI direct, no BAA | $0.006 | 5 minutes | NO — PHI prohibited |
| Azure OpenAI Whisper | $0.006–$0.012 (region-dependent) | 30 minutes | Yes — covered by Azure HIPAA BAA |
| Self-hosted whisper.cpp on used RTX 3090 ($800) | ~$0.0001 amortized | 1–2 hours one-time | Fully — PHI never leaves premises |
| Self-hosted faster-whisper on RTX 4090 ($1,800) | ~$0.0002 amortized | 1–2 hours | Fully |
| Self-hosted on M2/M3 Mac (Metal) | ~$0.0001 | 30 min (Homebrew) | Fully |
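
Where does "~$0.0001 amortized" come from? It assumes the card stays busy; a rough sketch, where the 24/7 utilization figure is an assumption (an idle box costs more per minute):

# $800 used RTX 3090, written off over 36 months, kept saturated
# around the clock at ~5x real-time throughput (assumption)
audio_min_per_month = 30 * 24 * 60 * 5              # ~216,000 audio-minutes/month
print(f"${800 / 36 / audio_min_per_month:.5f} per audio-minute")  # ~$0.0001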

The break-even formula

Break-even minutes per month = (monthly hardware amortization + monthly electricity + monthly maintenance hours × your hourly rate) ÷ ($0.006 − per-minute self-host cost).

Plug in steady-state numbers (worked sketch below) and the threshold lands around 10 hours of audio per day: transcribe more than that and self-hosting is cheaper. Less than that, the API wins on pure cost.
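
A minimal calculator, assuming an $800 used RTX 3090 amortized over 36 months, roughly $20/month of electricity, and the steady-state maintenance load from the "hidden cost" section below (~4 hours/year at $200/hr). All four inputs are assumptions; swap in your own:

# Break-even sketch: monthly fixed self-host cost vs. per-minute API savings.
HARDWARE_COST = 800.0          # used RTX 3090, USD (assumption)
AMORTIZATION_MONTHS = 36       # write the card off over 3 years (assumption)
ELECTRICITY_PER_MONTH = 20.0   # USD, ~300 W under partial load (assumption)
MAINT_HRS_PER_YEAR = 4         # steady-state, per "hidden cost" section below
HOURLY_RATE = 200.0            # your billable rate, USD/hr

API_PER_MIN = 0.006            # OpenAI Whisper API price
SELF_HOST_PER_MIN = 0.0001     # amortized marginal cost from the table above

monthly_fixed = (
    HARDWARE_COST / AMORTIZATION_MONTHS
    + ELECTRICITY_PER_MONTH
    + MAINT_HRS_PER_YEAR / 12 * HOURLY_RATE
)
breakeven = monthly_fixed / (API_PER_MIN - SELF_HOST_PER_MIN)

print(f"Monthly fixed cost: ${monthly_fixed:,.2f}")
print(f"Break-even: {breakeven:,.0f} min/month "
      f"(~{breakeven / 30 / 60:.1f} hr of audio/day)")
# -> roughly 18,500 min/month, i.e. ~10 hr of audio/day

Note that year 1 looks worse: with ~10 hours of setup labor, the break-even roughly doubles, which is why the payback periods in the decision tree below run 3–9 months rather than instant.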

Where the simple math breaks

HIPAA premium

OpenAI API requires an Enterprise BAA before you can legally send PHI. The BAA has no extra fee in 2026 but takes 3–6 weeks of legal back-and-forth. Self-hosted needs no BAA chain (you are the responsible party). For healthcare, "self-host wins instantly" is the right read regardless of volume — legal-friction cost dominates.

Latency

| Stack | Latency for 10-min audio |
|---|---|
| OpenAI API (large-v3, network US) | 30–60 sec |
| Azure OpenAI (regional) | 20–45 sec |
| whisper.cpp on M2 Pro Mac (medium model) | ~50 sec (CPU + Metal) |
| faster-whisper on RTX 3090 (large-v3) | 5–10 sec |
| faster-whisper on RTX 4090 (large-v3) | 3–6 sec |

For real-time scribe workflows (live note generation during visit), GPU self-host wins on latency too. For batch overnight processing, API is fine.
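
To reproduce these numbers on your own hardware, here's a minimal timing harness; it assumes faster-whisper (installed as in the setup section below) and a local visit.wav:

import time
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

t0 = time.perf_counter()
segments, info = model.transcribe("visit.wav", language="en")
text = "".join(s.text for s in segments)  # lazy generator: consume it to time the full decode
elapsed = time.perf_counter() - t0

print(f"{info.duration:.0f}s of audio in {elapsed:.1f}s "
      f"({info.duration / elapsed:.1f}x real-time)")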

Reliability

Hosted APIs have ~99.9% uptime. Your homelab might be 99.0–99.5%. For solo practice this is fine. For a 4-clinic group with shared infrastructure, plan a fallback path (one server fails over to another, or to OpenAI API as backup).
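
A minimal sketch of that fallback path, assuming a hypothetical transcription endpoint on your own GPU box (the URL and response shape are illustrative, not a real product) with the official openai client as the backup:

import requests
from openai import OpenAI

LOCAL_URL = "http://gpu-box.clinic.lan:8000/transcribe"  # hypothetical on-prem endpoint

def transcribe_with_fallback(path: str) -> str:
    # Try the on-prem GPU box first.
    try:
        with open(path, "rb") as f:
            r = requests.post(LOCAL_URL, files={"file": f}, timeout=120)
        r.raise_for_status()
        return r.json()["text"]          # illustrative response shape
    except requests.RequestException:
        pass                             # box is down: fall through to hosted API
    # Backup: hosted API. For PHI this is only acceptable if a BAA covers it.
    client = OpenAI()
    with open(path, "rb") as f:
        resp = client.audio.transcriptions.create(model="whisper-1", file=f)
    return resp.text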

Quality

OpenAI API uses whisper-large-v3 by default. Self-hosted gives you choice of model (tiny / base / small / medium / large-v3 / large-v3-turbo). Quality difference between large-v3 and large-v3-turbo is < 1% WER on English; turbo is 8× faster. For medical workloads, run large-v3 if accuracy matters; turbo for high-volume real-time.
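
Rather than taking the <1% WER figure on faith, run both models over a representative recording from your own workload and diff the output. A quick sketch with faster-whisper:

from faster_whisper import WhisperModel

for name in ("large-v3", "large-v3-turbo"):
    model = WhisperModel(name, device="cuda", compute_type="float16")
    segments, _ = model.transcribe("visit.wav", language="en")
    print(f"--- {name} ---")
    print("".join(s.text for s in segments))
    del model  # free VRAM before loading the next model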

Hardware recommendations 2026

| Use case | Recommended hardware | Why |
|---|---|---|
| Solo provider, <3 hr audio/day | M2/M3 Mac mini (8–16 GB RAM) | Silent, $599–$899, runs base/medium models well |
| Solo provider, 3–10 hr/day | Used Linux PC + RTX 3060 12 GB ($600–$900) | Handles large-v3 in real time |
| 2–5 provider clinic | Used RTX 3090 24 GB ($800) or new RTX 4090 24 GB ($1,800) | 4 parallel streams in real time |
| 5–20 provider group | 2× RTX 4090 in a single workstation, or A40 48 GB ($3,500 used) | 10+ real-time streams, plus overnight batch |
| Hospital / 50+ providers | Dedicated GPU server (A100 40 GB or H100), or stay on Azure OpenAI | Self-host wins at this scale only with an FTE to maintain it |

The used GPU market in 2026 is your friend. An RTX 3090 (24 GB) at $700–$900 used handles whisper-large-v3 at 5× real-time, runs a quantized Llama 3 70B for note generation (with partial CPU offload, since 24 GB can't hold a 70B model even at 4-bit), and pulls 250–350 W under load. That's the small-clinic sweet spot.
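
What "4 parallel streams" looks like in practice with faster-whisper: one model copy in VRAM, served concurrently via CTranslate2's num_workers. A sketch with illustrative file names:

from concurrent.futures import ThreadPoolExecutor
from faster_whisper import WhisperModel

# num_workers lets concurrent transcribe() calls from multiple Python
# threads run in parallel instead of queuing behind each other.
model = WhisperModel("large-v3", device="cuda",
                     compute_type="float16", num_workers=4)

def transcribe(path: str) -> str:
    segments, _ = model.transcribe(path, language="en")
    return "".join(s.text for s in segments)

files = ["room1.wav", "room2.wav", "room3.wav", "room4.wav"]  # illustrative
with ThreadPoolExecutor(max_workers=4) as pool:
    for path, text in zip(files, pool.map(transcribe, files)):
        print(path, "->", text[:80])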

Setup: whisper.cpp in 30 minutes (Mac or Linux)

# Mac (Homebrew)
brew install whisper-cpp ffmpeg

# Linux (build from source for best perf)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_CUDA=1          # NVIDIA path; drop the flag for CPU-only
cmake --build build -j --config Release

# Download model
bash ./models/download-ggml-model.sh large-v3-turbo

# Run (current releases name the binary whisper-cli; older builds called it main)
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f visit.wav -otxt -osrt
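
The -otxt and -osrt flags write a plain-text transcript and an SRT subtitle file (with timestamps) next to the input audio, so one pass covers both notes and timestamped review.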

For higher throughput in Python, faster-whisper (CTranslate2 backend) is 2–4× faster than whisper.cpp at the same accuracy:

pip install faster-whisper

from faster_whisper import WhisperModel

# float16 halves VRAM vs float32; compute_type="int8_float16" squeezes
# large models onto smaller cards at a small accuracy cost
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator; decoding happens as you iterate
segments, _ = model.transcribe("visit.wav", language="en")
for s in segments:
    print(f"[{s.start:.1f}-{s.end:.1f}] {s.text}")

The hidden cost: your time

Self-hosting in year 1: ~10 hours total setup + troubleshooting. Year 2+: ~1 hour/quarter. If your billable hourly rate is $200, year 1 costs you $2,000 of "labor" on top of hardware. The break-even formula above already accounts for this.

If you have zero comfort with the command line, this isn't the path. Heidi Health at $150/month buys away the maintenance burden — and for <200 visits/month, that's the right call.

Decision tree

  1. Healthcare workload (PHI)? → Skip raw OpenAI API (no PHI without BAA). Either Azure OpenAI (easiest) or self-hosted.
  2. < 5 hr audio/day total? → API wins on cost. Use Azure OpenAI Whisper.
  3. 5–15 hr audio/day, technical comfort? → Self-host on used RTX 3090. Pays back in 6–9 months.
  4. > 15 hr/day or > 5 providers? → Self-host on RTX 4090 or A40. Pays back in 3–5 months.
  5. Real-time scribe workflow during the visit? → Self-host (latency wins), even when the costs are roughly even.
  6. No engineer, no NAS, no comfort with bash? → Stay on hosted (Azure OpenAI or LessRec). The cost premium is worth your time back.

Hosted Whisper at $0.05/min — while you decide

No subscription floor. BAA on request. 10 free minutes, no signup.

Transcribe a file →