Linguists have a word for the ums, uhs, ers, and elongated versions (ummmm, uhhhhh) that pad spoken English: disfluencies. I don’t record a lot of voice audio, but a few friends do, and they tell me editing those out by hand is miserable. So I built erm to do it.

    uvx erm input.wav

That’s the whole interface for the common case. It writes a cleaned .wav and a JSON cut list next to the input.

This post walks through how it works, because the obvious approach doesn’t sound very good and most..