May 2, 2026 · 1547 words · 8 minute read
Linguists have a word for the ums, uhs, ers, and elongated versions (ummmm, uhhhhh) that pad spoken English: disfluencies.
I don’t record a lot of voice audio, but a few friends do, and they tell me editing those out by hand is miserable. So I built erm to do it.
uvx erm input.wav
That’s the whole interface for the common case. It writes a cleaned .wav and a JSON cut list next to the input. This post walks through how it works, because the obvious approach doesn’t sound very good and most of the code is the stuff that fixes that.
The naive version doesn’t work
You’d expect the job to be: transcribe with word-level timestamps, find tokens like um and uh, cut those ranges with ffmpeg.
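Here is a rough sketch of that naive pipeline, to make the steps concrete. It is not erm's code. The shape of `words` (dicts with `word`, `start`, and `end` in seconds), the `FILLER` regex, and the helper names are all assumptions made for illustration, and the transcription step is left out entirely. The last helper builds an ffmpeg `aselect` filter string that keeps everything outside the cut ranges.

```python
import re

# Hypothetical filler pattern: um, uhm, uh, er, erm, plus elongated forms
# such as "ummmm" or "uhhhhh". A real list would need more care
# (this one also matches the ordinary word "err", for example).
FILLER = re.compile(r"^(u+h*m+|u+h+|e+r+m*)$")


def filler_ranges(words):
    """Return (start, end) in seconds for every word that looks like a filler.

    `words` is an assumed shape: a list of dicts with "word", "start", "end".
    """
    cuts = []
    for w in words:
        token = re.sub(r"[^a-z]", "", w["word"].lower())  # drop punctuation
        if FILLER.match(token):
            cuts.append((w["start"], w["end"]))
    return cuts


def keep_ranges(cuts, duration):
    """Invert the cut list into the ranges of audio to keep."""
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keeps.append((cursor, duration))
    return keeps


def aselect_filter(keeps):
    """Build an ffmpeg audio filter that keeps only the given ranges."""
    parts = "+".join(f"between(t,{s:.3f},{e:.3f})" for s, e in keeps)
    return f"aselect='{parts}',asetpts=N/SR/TB"


words = [
    {"word": "So", "start": 0.0, "end": 0.3},
    {"word": "um,", "start": 0.3, "end": 0.8},
    {"word": "hello", "start": 0.8, "end": 1.2},
    {"word": "Uhhhh", "start": 1.2, "end": 1.9},
    {"word": "there", "start": 1.9, "end": 2.4},
]
cuts = filler_ranges(words)
keeps = keep_ranges(cuts, duration=2.4)
audio_filter = aselect_filter(keeps)
print(audio_filter)
```

The printed string could be passed to something like `ffmpeg -i input.wav -af "<filter>" output.wav`. Note that this only finds fillers that actually appear in the transcript, and every boundary is a hard cut.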
That gets you maybe 60% of the way, and the result sounds worse than the original. Three reasons:
Whisper quietly leaves a lot of fillers out of the transcript, so there’s no um token to match in the first place.
Slicing audio at an arbitrary point in time produces a tiny step in the waveform. Your ear hears it as a click.
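The click is easy to reproduce with a synthetic signal. The sketch below is only an illustration, not erm's code: the sample rate, the 220 Hz test tone, and the cut positions are assumptions chosen so the splice joins a sample near the top of the wave to one near the bottom. It compares the largest sample-to-sample change in the untouched tone with the largest change after a hard cut.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate in Hz
FREQ = 220.0          # assumed test tone in Hz


def tone(seconds):
    """Generate a sine wave as a list of floats in [-1, 1]."""
    n = int(seconds * SAMPLE_RATE)
    return [math.sin(2 * math.pi * FREQ * i / SAMPLE_RATE) for i in range(n)]


def splice(samples, cut_start, cut_end):
    """Remove samples[cut_start:cut_end] with a hard cut (no fade)."""
    return samples[:cut_start] + samples[cut_end:]


def max_step(samples):
    """Largest absolute change between two neighbouring samples."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))


signal = tone(1.0)

# Cut roughly 0.2 s out of the middle. The start lands near a peak of the
# wave and the end lands near a trough, so the two sides do not line up.
spliced = splice(signal, 4_018, 7_255)

smooth_step = max_step(signal)   # normal change between neighbours
click_step = max_step(spliced)   # includes the jump at the splice point

print(f"largest step before cut: {smooth_step:.3f}")
print(f"largest step after cut:  {click_step:.3f}")
```

In the untouched tone, neighbouring samples differ by only a small fraction of the full range. At the splice point the waveform jumps almost the full range in a single sample, and that jump is the step you hear as a click.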