See it with your lying ears

For the past couple of weeks, I couldn’t shake off an intrusive thought: raster graphics and audio files are awfully similar (both are sequences of digitized analog measurements), so what would happen if we applied the same transformations to both?

Let’s start with downsampling: what if we divide the data stream into buckets of n samples each (win_size in the code), and then replace every sample in a bucket with the bucket’s average value?

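    /* Replace every sample in each win_size-sample bucket with the bucket's
       average. Assumes len is a multiple of win_size. */
    for (pos = 0; pos < len; pos += win_size) {
      float sum = 0;
      for (int i = 0; i < win_size; i++) sum += buf[pos + i];
      for (int i = 0; i < win_size; i++) buf[pos + i] = sum / win_size;
    }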

For images, the result is aesthetically pleasing pixel art. But if we do the same to audio… well, put your headphones on, you’re in for a treat:

The model for the images is our dog, Skye. The song fragment is a cover of “It Must Have Been Love” performed by Effie Passero.

If you’re familiar with audio formats, you might’ve expected something else: the muffled but neutral rendition associated with low sample rates. Instead, the “audio pixelation” filter adds unpleasant, metallic-sounding overtones. The culprit is the stairstep pattern in the resulting waveform:

Not great, not terrible.

Our eyes don’t mind the pattern on the computer screen, but the cochlea is a complex mechanical structure that doesn’t measure sound pressure levels per se; instead, it has clusters of different nerve cells sensitive to different sine-wave frequencies. Abrupt jumps in the waveform are perceived as wideband noise that wasn’t present in the original audio stream.

The problem is easy to solve: we can run the jagged waveform through a rolling-average filter, the equivalent of blurring the pixelated image to remove the artifacts:
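A minimal sketch of such a filter, assuming a centered window that gets clipped at the edges of the buffer. The names rolling_average and width are made up for illustration and aren't taken from the code above; other window shapes would work too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smooth `in` into `out` using a centered moving
 * average spanning roughly `width` samples (width / 2 on each side).
 * Near the buffer edges, the window shrinks to the samples that exist.
 * `in` and `out` must not overlap. */
void rolling_average(const float *in, float *out, size_t len, size_t width)
{
    assert(width > 0);
    size_t half = width / 2;

    for (size_t pos = 0; pos < len; pos++) {
        /* Clip the window to the valid index range [0, len - 1]. */
        size_t lo = (pos >= half) ? pos - half : 0;
        size_t hi = (pos + half < len) ? pos + half : len - 1;

        float sum = 0;
        for (size_t i = lo; i <= hi; i++)
            sum += in[i];

        out[pos] = sum / (float)(hi - lo + 1);
    }
}
```

Setting width close to the bucket size used for pixelation turns each stair step into a ramp; this is a plain box blur, so wider windows also dull whatever high frequencies were legitimately there.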

But this brings up another question: is the effect similar if we keep the original 44.1 kHz sample rate but reduce the bit depth of each sample in the file?
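For reference, here is one way the bit-depth experiment could be sketched. It assumes float samples normalized to the range [-1, 1], which the snippet earlier doesn't actually specify, and reduce_bit_depth is a hypothetical name. The sample rate is left untouched; only the number of distinct amplitude levels changes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: snap each sample to the nearest level of a signed
 * `bits`-bit grid. Assumes samples are floats in [-1, 1]. */
void reduce_bit_depth(float *buf, size_t len, int bits)
{
    assert(bits >= 1 && bits <= 16);
    long levels = 1L << (bits - 1);   /* e.g. 128 for 8-bit audio */

    for (size_t i = 0; i < len; i++) {
        float scaled = buf[i] * (float)levels;

        /* Round to nearest, with halves going away from zero. */
        long q = (long)(scaled + (scaled >= 0 ? 0.5f : -0.5f));

        /* Clamp to the signed integer range [-levels, levels - 1]. */
        if (q > levels - 1) q = levels - 1;
        if (q < -levels) q = -levels;

        buf[i] = (float)q / (float)levels;
    }
}
```

With bits set to 3, for example, every sample lands on one of eight values spaced 0.25 apart.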
