
Best Audio Format for Accurate Transcription

A practical comparison of WAV, FLAC, MP3, M4A, and OGG — with bitrate floors, codec quirks, and the format we actually request when accuracy matters.


Most transcription mistakes happen before the file reaches us. The audio was recorded too quietly, compressed too hard, or saved in a format that strips out frequencies the speech recognition model needs. This guide is a practical reference — what to send when accuracy matters, and what to fix in the recording chain rather than in post.

TL;DR — what to send

  • If you have it: WAV or FLAC, 16-bit, 44.1 kHz, mono if single speaker / stereo if multi-mic.
  • Otherwise: M4A (AAC) at 128 kbps+ or MP3 at 192 kbps+, 44.1 kHz.
  • Video: send the original MP4 / MOV / MKV — we extract audio losslessly.
  • Avoid: MP3 at 96 kbps or below, Opus at low bitrates, anything from a phone’s “voice memo” auto-compressed mode.
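For WAV files, the first recommendation in the list can be checked before upload. The following is a minimal sketch using Python's standard `wave` module. The function names `describe_wav` and `meets_recommendation` are hypothetical, and the 16-bit / 44.1 kHz thresholds are assumptions taken from the list above rather than a description of our upload pipeline.

```python
import wave


def describe_wav(path):
    """Return basic PCM parameters of a WAV file as a dict."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def meets_recommendation(info, min_bit_depth=16, min_sample_rate_hz=44_100):
    """True if the file is at least 16-bit / 44.1 kHz.

    The default thresholds are assumptions copied from the list above.
    """
    return (info["bit_depth"] >= min_bit_depth
            and info["sample_rate_hz"] >= min_sample_rate_hz)
```

The `wave` module reads uncompressed PCM WAV only, so this check does not apply to FLAC, M4A, or MP3 files.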

Why format matters

Speech recognition models (Whisper and the engines from AssemblyAI, Deepgram, Microsoft, and Google) work mostly from the band between roughly 80 Hz and 8 kHz, the range that carries meaning in human speech. Lossy codecs (MP3, AAC, Opus, Ogg Vorbis) discard data in this range to save bytes. Below a certain bitrate threshold, the codec starts removing the audio cues the model relies on to distinguish similar phonemes: “sin / shin”, “path / pat”, “t / d”.

For legal transcripts, this is where errors creep in — and where a wrong word changes the meaning of a deposition. For research interviews, codec artifacts make speaker separation harder, especially with overlapping speech.

WAV vs MP3 vs M4A — what they actually are

Format             | Codec  | Lossless?        | Typical accuracy | Best for
WAV / AIFF         | PCM    | Yes              | baseline         | Legal, research master files
FLAC               | FLAC   | Yes (compressed) | baseline         | Smaller archive of master quality
M4A / AAC 128 kbps | AAC-LC | No               | −0.3% WER        | iPhone Voice Memos, podcasts
MP3 192 kbps       | MP3    | No               | −0.5% WER        | General purpose
OGG / OPUS         | OPUS   | No               | varies           | WebRTC calls, depends on stream
MP3 96 kbps        | MP3    | No               | −2 to −5% WER    | Avoid if accuracy matters

WER is “word error rate” — the percentage of words the ASR gets wrong. The differences look small until you remember a 30-minute deposition contains ~5,000 words. Half a percent is 25 errors a human editor has to fix.
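WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below shows that calculation and reproduces the deposition arithmetic above. The function names are hypothetical, and the tokenization is a plain whitespace split, which is a simplifying assumption; production scoring usually normalizes case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def expected_errors(word_count, wer_change):
    """Extra errors implied by a WER change on a transcript of given length."""
    return round(word_count * wer_change)


print(word_error_rate("the witness took the path", "the witness took the pat"))  # 0.2
print(expected_errors(5000, 0.005))  # 25
```

One substituted word out of five gives a WER of 0.2, and half a percent of 5,000 words gives the 25 errors mentioned above.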

Bitrate floor for clean ASR

Below these bitrates, accuracy degrades enough to add cost on the human-review side:

  • MP3: 192 kbps mono, 256 kbps stereo
  • AAC / M4A: 128 kbps mono, 192 kbps stereo
  • OPUS: 64 kbps mono, 96 kbps stereo (OPUS is more efficient — same quality at lower bitrate)
  • WMA: avoid — older codec, poorly supported by modern ASR pipelines

If the source is already below these thresholds, do not re-encode upward — that doesn’t add information back. Send what you have, and we’ll factor the audio quality into the transcription pricing estimate.
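The floors above can be expressed as a small lookup, for example to flag files in a batch before sending them. This is a sketch only: the table values are copied from the list above, and the names `BITRATE_FLOORS_KBPS` and `below_floor` are hypothetical. WMA is left out because the advice there is to avoid it rather than to meet a floor.

```python
# Floors in kbps, copied from the list above: (mono, stereo).
BITRATE_FLOORS_KBPS = {
    "mp3": (192, 256),
    "aac": (128, 192),
    "opus": (64, 96),
}


def below_floor(codec, channels, bitrate_kbps):
    """Return True if a lossy file falls below the floor for its codec."""
    key = codec.lower()
    if key not in BITRATE_FLOORS_KBPS:
        raise ValueError(f"no floor listed for codec: {codec}")
    if channels not in (1, 2):
        raise ValueError("only mono (1) and stereo (2) floors are listed")
    floor = BITRATE_FLOORS_KBPS[key][channels - 1]
    return bitrate_kbps < floor


print(below_floor("mp3", 1, 128))   # True
print(below_floor("aac", 2, 192))   # False
```

A `True` result is a signal to expect more human review, not a prompt to re-encode: raising the bitrate of an existing file does not restore what the first encode discarded.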

Phone calls, video calls, on-site interviews

The format isn’t the only variable. Traditional phone codecs cap audio at narrowband (8 kHz sample rate, ~3.4 kHz top frequency). That’s why phone interviews tend to have a higher base error rate than studio recordings, regardless of file format.
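The link between sample rate and top frequency follows from the Nyquist limit: sampled audio can only represent frequencies up to half its sample rate. The short sketch below works that out for the rates mentioned in this guide (the helper name `nyquist_hz` is hypothetical).

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


for rate in (8_000, 16_000, 44_100):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")
# 8000 Hz sampling -> content up to 4000 Hz
# 16000 Hz sampling -> content up to 8000 Hz
# 44100 Hz sampling -> content up to 22050 Hz
```

An 8 kHz phone signal therefore tops out at 4 kHz in theory, and at about 3.4 kHz once the telephone band filter is applied, while 16 kHz sampling reaches the 8 kHz upper edge of the speech band.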

Practical fixes:

  • Record both sides in the meeting app when possible. Zoom, Teams, and Meet all offer built-in recording, and that recording is wideband (16 kHz+) instead of phone-band (8 kHz).
  • Lavalier or close mic for in-person interviews. Even a clip-on lavalier 50 cm from the speaker reduces room reverb dramatically vs. a tabletop recorder 2 m away.
  • Two recorders, two speakers for important conversations. Each speaker on a dedicated recorder makes speaker separation trivial.

Export tips by app

  • iPhone Voice Memos: default is M4A 64 kbps — change to “Lossless” in Settings → Voice Memos → Audio Quality before recording legal/research material.
  • Zoom: “Record audio as separate file for each participant” in advanced settings — gives you per-speaker channels.
  • OBS / Streamlabs: export master at 256 kbps AAC or higher; default 128 kbps stereo is fine for podcasts but not for legal.
  • Sony / Zoom / Tascam recorders: 24-bit / 48 kHz WAV is overkill for ASR but doesn’t hurt; 16-bit / 44.1 kHz is the sweet spot.

FAQ

Should I send WAV or MP3?

WAV is best when you have it. 192 kbps+ MP3 or 128 kbps+ M4A are fine in practice.

Does sample rate matter?

16 kHz is the floor for ASR; 44.1 or 48 kHz is ideal. Use mono for a single speaker and stereo for multi-mic setups.

My recorder only outputs MP4. Convert to WAV first?

No need. Send the MP4 — we extract the audio losslessly on our side.

Can you transcribe video files directly?

Yes. MP4, MOV, MKV, WebM, and AVI are all supported.

Have a recording? Send it.

Upload audio or video. We’ll send a transparent estimate within an hour and confirm the deadline before you pay.
