
Best Audio Format for Accurate Transcription

A practical comparison of WAV, FLAC, MP3, M4A, and OGG — with bitrate floors, codec quirks, and the format we actually request when accuracy matters.

Lessrec editorial · April 21, 2026 · 7 min read

Most transcription mistakes happen before the file reaches us. The audio was recorded too quietly, compressed too hard, or saved in a format that strips out frequencies the speech recognition model needs. This guide is a practical reference — what to send when accuracy matters, and what to fix in the recording chain rather than in post.

TL;DR — what to send

If you have it: WAV or FLAC, 16-bit, 44.1 kHz, mono if single speaker / stereo if multi-mic.
Otherwise: M4A (AAC) at 128 kbps+ or MP3 at 192 kbps+, 44.1 kHz.
Video: send the original MP4 / MOV / MKV — we extract audio losslessly.
Avoid: MP3 at 96 kbps or below, Opus below 64 kbps, anything from a phone’s “voice memo” auto-compressed mode.

Why format matters

Speech recognition systems (Whisper, AssemblyAI, Deepgram, Microsoft, Google) all work from a band of audio between roughly 80 Hz and 8 kHz, the range that carries meaning in human speech. Lossy codecs (MP3, AAC, Opus, Vorbis) discard data in this range to save bytes. Below a certain bitrate threshold, the codec starts removing audio cues the model relies on to distinguish similar sounds: “sin / shin”, “path / pat”, “t / d”.

For legal transcripts, this is where errors creep in — and where a wrong word changes the meaning of a deposition. For research interviews, codec artifacts make speaker separation harder, especially with overlapping speech.

WAV vs MP3 vs M4A — what they actually are

Format	Codec	Lossless?	WER increase vs. lossless	Best for
WAV / AIFF	PCM	Yes	baseline	Legal, research master files
FLAC	FLAC	Yes (compressed)	baseline	Smaller archive of master quality
M4A / AAC 128 kbps	AAC-LC	No	+0.3 points	iPhone Voice Memos, podcasts
MP3 192 kbps	MP3	No	+0.5 points	General purpose
OGG / Opus	Opus	No	varies	WebRTC calls; depends on stream
MP3 96 kbps	MP3	No	+2 to +5 points	Avoid if accuracy matters

WER is “word error rate”: the percentage of words the ASR gets wrong, measured against a correct reference transcript. The differences look small until you remember that a 30-minute deposition contains ~5,000 words. Half a percentage point of extra WER is 25 more errors a human editor has to fix.
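To make the arithmetic concrete, here is a minimal Python sketch. It assumes the standard definition of WER (word-level substitutions, deletions, and insertions divided by the number of words in the reference transcript), which is slightly broader than “words the ASR gets wrong”. The function names and sample sentences are illustrative and do not come from any particular ASR toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def extra_errors(word_count: int, wer_increase_points: float) -> int:
    """Expected additional wrong words for a WER increase in percentage points."""
    return round(word_count * wer_increase_points / 100)


# One wrong word out of four is a WER of 0.25 (25%).
sample_wer = word_error_rate("the path was clear", "the pat was clear")

# Half a percentage point on a ~5,000-word deposition.
deposition_errors = extra_errors(5000, 0.5)
```

With these definitions, `deposition_errors` works out to the 25 corrections mentioned above.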

Bitrate floor for clean ASR

Below these bitrates, accuracy degrades enough to add cost on the human-review side:

MP3: 192 kbps mono, 256 kbps stereo
AAC / M4A: 128 kbps mono, 192 kbps stereo
Opus: 64 kbps mono, 96 kbps stereo (Opus is more efficient: same quality at a lower bitrate)
WMA: avoid — older codec, poorly supported by modern ASR pipelines

If the source is already below these thresholds, do not re-encode upward — that doesn’t add information back. Send what you have, and we’ll factor the audio quality into the transcription pricing estimate.
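The floors above can be encoded as a small lookup for a pre-upload check. This is only a sketch: the thresholds are copied from the list, while the function name, the codec keys, and the choice to treat anything with more than one channel as stereo are assumptions made for illustration.

```python
# Floors from the list above, in kbps: (mono, stereo).
BITRATE_FLOORS_KBPS = {
    "mp3": (192, 256),
    "aac": (128, 192),
    "opus": (64, 96),
}


def meets_floor(codec: str, kbps: int, channels: int) -> bool:
    """Return True if the bitrate is at or above the floor for this codec."""
    key = codec.lower()
    if key not in BITRATE_FLOORS_KBPS:
        raise ValueError(f"no floor defined for codec: {codec}")
    mono_floor, stereo_floor = BITRATE_FLOORS_KBPS[key]
    # Assumption: anything with more than one channel uses the stereo floor.
    floor = mono_floor if channels == 1 else stereo_floor
    return kbps >= floor
```

A file that fails this check should still be sent as-is; the result only indicates that more human review is likely.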

Phone calls, video calls, on-site interviews

The format isn’t the only variable. Traditional phone codecs cap audio at narrowband (8 kHz sample rate, ~3.4 kHz top frequency), which is less than half of the roughly 8 kHz of bandwidth that speech models use. That’s why phone interviews have a higher base error rate than studio recordings, regardless of file format.

Practical fixes:

Record both sides locally when possible. Zoom offers local recording, and Teams and Meet record to the cloud; in each case the recording is wideband (16 kHz+) instead of phone-band (8 kHz).
Lavalier or close mic for in-person interviews. Even a clip-on lavalier 50 cm from the speaker reduces room reverb dramatically compared with a tabletop recorder 2 m away.
Two recorders, two speakers for important conversations. Each speaker on a dedicated recorder makes speaker separation trivial.
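To check whether a recording is phone-band or wideband before sending it, the sample rate can be read from a WAV header with Python’s standard `wave` module. The following is a minimal sketch: the function names and band labels are hypothetical, and the 8 kHz and 16 kHz cut-offs are the ones used in this article. The theoretical ceiling for any file is half its sample rate (4 kHz at 8 kHz sampling); phone networks filter a little below that.

```python
import wave


def classify_sample_rate(sample_rate_hz: int) -> str:
    """Label a sample rate using the thresholds from this article."""
    if sample_rate_hz <= 8000:
        return "phone-band"
    if sample_rate_hz < 16000:
        return "below the 16 kHz floor"
    return "wideband or better"


def describe_wav(source) -> dict:
    """Read basic properties from a WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "nyquist_hz": rate // 2,  # highest frequency the file can carry
            "band": classify_sample_rate(rate),
        }
```

This only works for WAV input; compressed formats need a tool that can parse their containers.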

Export tips by app

iPhone Voice Memos: default is M4A 64 kbps — change to “Lossless” in Settings → Voice Memos → Audio Quality before recording legal/research material.
Zoom: enable the option to record a separate audio file for each participant in the recording settings, which gives you one file per speaker.
OBS / Streamlabs: export master at 256 kbps AAC or higher; default 128 kbps stereo is fine for podcasts but not for legal.
Sony / Zoom / Tascam recorders: 24-bit / 48 kHz WAV is overkill for ASR but doesn’t hurt; 16-bit / 44.1 kHz is the sweet spot.

FAQ

Should I send WAV or MP3?

WAV is best when you have it (uncompressed, no quality loss). MP3 at 192 kbps+ or AAC/M4A at 128 kbps+ is fine in practice. Below those bitrates, error rates climb steeply.

Does sample rate matter?

16 kHz mono is the floor for ASR; 44.1 or 48 kHz is ideal, in mono for a single speaker or stereo for a multi-mic setup. Higher than 48 kHz gains nothing.

My recorder only outputs MP4. Convert to WAV first?

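No need. Send the MP4 and we extract the audio losslessly on our side. Converting to WAV only makes the file larger; it cannot restore what the original compression removed.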

Can you transcribe video files directly?

Yes. MP4, MOV, MKV, WebM, and AVI are all supported. We extract the audio track without quality loss.
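For readers curious what lossless extraction looks like in general, a common approach is an ffmpeg stream copy, which lifts the audio track out of the container without decoding or re-encoding it. The sketch below assumes ffmpeg is installed and that the video’s audio track is AAC, so that an `.m4a` output container is compatible. It illustrates the technique and is not a description of our production pipeline; the function name and file names are hypothetical.

```python
def build_extract_command(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that copies the audio stream unchanged."""
    return [
        "ffmpeg",
        "-i", video_path,  # input video file
        "-vn",             # drop the video stream
        "-c:a", "copy",    # copy the audio stream as-is, no re-encode
        audio_path,        # output container must suit the audio codec
    ]


# Hypothetical file names, for illustration only.
command = build_extract_command("interview.mp4", "interview.m4a")
```

Passing `command` to `subprocess.run(command, check=True)` would run the extraction. Because the stream is copied rather than re-encoded, the output is bit-for-bit the same audio as in the video.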

Have a recording? Send it.

Upload audio or video. We’ll send a transparent estimate within an hour and confirm the deadline before you pay.
