Most transcription mistakes happen before the file reaches us. The audio was recorded too quietly, compressed too hard, or saved in a format that strips out frequencies the speech recognition model needs. This guide is a practical reference — what to send when accuracy matters, and what to fix in the recording chain rather than in post.
TL;DR — what to send
- If you have it: WAV or FLAC, 16-bit, 44.1 kHz, mono if single speaker / stereo if multi-mic.
- Otherwise: M4A (AAC) at 128 kbps+ or MP3 at 192 kbps+, 44.1 kHz.
- Video: send the original MP4 / MOV / MKV — we extract audio losslessly.
- Avoid: MP3 at 96 kbps or below, OPUS at low bitrates, anything from a phone’s “voice memo” auto-compressed mode.
Why format matters
Speech recognition models (Whisper, AssemblyAI, Deepgram, Microsoft, Google) all train on a band of audio between roughly 80 Hz and 8 kHz, the range that carries meaning in human speech. Lossy codecs (MP3, AAC, Opus, Ogg Vorbis) discard data in this range to save bytes. Below a certain bitrate threshold, the codec starts removing the cues the model relies on to distinguish similar sounds: “sin / shin”, “path / pat”, “t / d”.
For legal transcripts, this is where errors creep in — and where a wrong word changes the meaning of a deposition. For research interviews, codec artifacts make speaker separation harder, especially with overlapping speech.
WAV vs MP3 vs M4A — what they actually are
| Format | Codec | Lossless? | Typical WER vs. baseline | Best for |
|---|---|---|---|---|
| WAV / AIFF | PCM | Yes | baseline | Legal, research master files |
| FLAC | FLAC | Yes (compressed) | baseline | Smaller archive of master quality |
| M4A / AAC 128 kbps | AAC-LC | No | +0.3% | iPhone Voice Memos, podcasts |
| MP3 192 kbps | MP3 | No | +0.5% | General purpose |
| OGG / OPUS | OPUS | No | varies | WebRTC calls, depends on stream |
| MP3 96 kbps | MP3 | No | +2 to +5% | Avoid if accuracy matters |
WER is “word error rate” — the percentage of words the ASR gets wrong. The differences look small until you remember a 30-minute deposition contains ~5,000 words. Half a percent is 25 errors a human editor has to fix.
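The arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `extra_errors` is hypothetical, the 5,000-word count is the estimate quoted above, and the 0.3%, 0.5%, and 5% figures are the table's values read as percentage points of additional word error.

```python
def extra_errors(word_count: int, wer_increase_pct: float) -> int:
    """Estimate additional word errors from a WER increase given in percentage points."""
    return round(word_count * wer_increase_pct / 100)


DEPOSITION_WORDS = 5_000  # ~30 minutes of speech, per the estimate above

# Codec penalties taken from the table; 5.0 is the worst case listed for MP3 96 kbps.
for label, delta in [
    ("M4A / AAC 128 kbps", 0.3),
    ("MP3 192 kbps", 0.5),
    ("MP3 96 kbps (worst case)", 5.0),
]:
    print(f"{label}: ~{extra_errors(DEPOSITION_WORDS, delta)} extra errors")
```

At the worst-case figure for 96 kbps MP3, the same deposition picks up roughly ten times as many corrections as the 192 kbps file.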
Bitrate floor for clean ASR
Below these bitrates, accuracy degrades enough to add cost on the human-review side:
- MP3: 192 kbps mono, 256 kbps stereo
- AAC / M4A: 128 kbps mono, 192 kbps stereo
- OPUS: 64 kbps mono, 96 kbps stereo (OPUS is more efficient — same quality at lower bitrate)
- WMA: avoid — older codec, poorly supported by modern ASR pipelines
If the source is already below these thresholds, do not re-encode upward — that doesn’t add information back. Send what you have, and we’ll factor the audio quality into the transcription pricing estimate.
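A pre-upload check against these floors might look like the sketch below. The table of floors is copied from the list above; the function name `meets_floor` and the codec keys are assumptions made for illustration, and the bitrate and channel count are expected to come from whatever tool you already use to inspect files.

```python
# Bitrate floors in kbps as (mono, stereo), copied from the list above.
# "aac" also covers M4A files, since M4A is a container for AAC audio.
BITRATE_FLOORS_KBPS = {
    "mp3": (192, 256),
    "aac": (128, 192),
    "opus": (64, 96),
}


def meets_floor(codec: str, bitrate_kbps: int, channels: int) -> bool:
    """Return True if the bitrate is at or above the floor for this codec and channel count."""
    codec = codec.lower()
    if codec not in BITRATE_FLOORS_KBPS:
        raise ValueError(f"no floor listed for codec {codec!r}")
    mono_floor, stereo_floor = BITRATE_FLOORS_KBPS[codec]
    # Anything with more than one channel is held to the stereo floor.
    floor = mono_floor if channels == 1 else stereo_floor
    return bitrate_kbps >= floor


print(meets_floor("mp3", 128, 1))   # 128 kbps mono MP3 is below the 192 kbps floor
print(meets_floor("opus", 64, 1))   # 64 kbps mono Opus sits exactly on the floor
```

A `False` result is a signal to look for a better source file, not to re-encode the existing one at a higher bitrate.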
Phone calls, video calls, on-site interviews
The format isn’t the only variable — phone codecs cap audio at narrowband (8 kHz sample rate, ~3.4 kHz top frequency). That’s why phone interviews always have a higher base error rate than studio recordings, regardless of file format.
Practical fixes:
- Record both sides in the meeting app rather than over a phone line when possible. Zoom, Teams, and Meet all offer built-in recording, and that recording is wideband (16 kHz+) instead of phone-band (8 kHz).
- Lavalier or close mic for in-person interviews. Even a clip-on lavalier 50 cm from the speaker reduces room reverb dramatically compared with a tabletop recorder 2 m away.
- Two recorders, two speakers for important conversations. Each speaker on a dedicated recorder makes speaker separation trivial.
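To see whether a WAV file is phone-band or wideband before sending it, the header can be read with Python's standard `wave` module. The sketch below is a minimal check, not part of any service API: `describe_wav` is a hypothetical helper, it only handles PCM WAV files (the `wave` module raises `wave.Error` on others), and treating everything below 16 kHz as narrowband is a simplification based on the thresholds in this guide.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read a PCM WAV header and report the properties that matter for ASR."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        info = {
            "sample_rate_hz": sample_rate,
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is in bytes
            "duration_s": wav.getnframes() / sample_rate,
        }
    # Nyquist limit: a file cannot carry frequencies above half its sample rate.
    info["max_frequency_hz"] = sample_rate / 2
    # Simplification: below 16 kHz is labeled narrowband, 16 kHz and up wideband.
    info["band"] = "wideband" if sample_rate >= 16_000 else "narrowband"
    return info
```

An 8 kHz file tops out at 4 kHz, which is why phone audio loses the upper part of the speech band. Note that the header reports the file's sample rate, not the source's: a phone call saved as 44.1 kHz WAV will read as wideband here but still contains only narrowband audio.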
Export tips by app
- iPhone Voice Memos: default is M4A 64 kbps — change to “Lossless” in Settings → Voice Memos → Audio Quality before recording legal/research material.
- Zoom: “Record audio as separate file for each participant” in advanced settings — gives you per-speaker channels.
- OBS / Streamlabs: export master at 256 kbps AAC or higher; default 128 kbps stereo is fine for podcasts but not for legal.
- Sony / Zoom / Tascam recorders: 24-bit / 48 kHz WAV is overkill for ASR but doesn’t hurt; 16-bit / 44.1 kHz is the sweet spot.
FAQ
Should I send WAV or MP3?
WAV is best when you have it. 192 kbps+ MP3 or 128 kbps+ M4A are fine in practice.
Does sample rate matter?
16 kHz is the floor for ASR; 44.1 kHz is the sweet spot, and 48 kHz is fine too. Use mono for a single speaker and stereo for multi-mic recordings.
My recorder only outputs MP4. Convert to WAV first?
No need. Send the MP4 — we extract the audio losslessly on our side.
Can you transcribe video files directly?
Yes. MP4, MOV, MKV, WebM, AVI all supported.
Have a recording? Send it.
Upload audio or video. We’ll send a transparent estimate within an hour and confirm the deadline before you pay.