24-bit/192kHz music downloads and why they make no sense (2012)

Also see Xiph.Org's new video, Digital Show & Tell, for detailed demonstrations of digital sampling in action on real equipment!

Articles last month revealed that musician Neil Young and Apple's Steve Jobs discussed offering digital music downloads of 'uncompromised studio quality'. Much of the press and user commentary was particularly enthusiastic about the prospect of uncompressed 24-bit/192kHz downloads. 24/192 featured prominently in my own conversations with Mr. Young's group several months ago.

Unfortunately, there is no point to distributing music in 24-bit/192kHz format. Its playback fidelity is slightly inferior to 16/44.1 or 16/48, and it takes up 6 times the space. There are a few real problems with the audio quality and 'experience' of digitally distributed music today. 24/192 solves none of them. While everyone fixates on 24/192 as a magic bullet, we're not going to see any actual improvement.
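The "6 times the space" figure can be checked with a little arithmetic. The sketch below assumes raw, uncompressed stereo PCM; the function name `pcm_bitrate` is purely illustrative, and lossless compression would change the absolute sizes.

```python
def pcm_bitrate(bits_per_sample, sample_rate_hz, channels=2):
    """Raw (uncompressed) PCM bitrate in bits per second."""
    return bits_per_sample * sample_rate_hz * channels

# Assumption: stereo, no compression of any kind.
rate_24_192 = pcm_bitrate(24, 192_000)
rate_16_48 = pcm_bitrate(16, 48_000)
rate_16_44 = pcm_bitrate(16, 44_100)

ratio_vs_48 = rate_24_192 / rate_16_48
ratio_vs_44 = rate_24_192 / rate_16_44

print(f"24/192 stereo: {rate_24_192 / 1e6:.3f} Mbit/s")
print(f"vs 16/48:   {ratio_vs_48:.2f}x")
print(f"vs 16/44.1: {ratio_vs_44:.2f}x")
```

The ratio is exactly 6 against 16/48 and roughly 6.5 against 16/44.1.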

First, the bad news

In the past few weeks, I've had conversations with intelligent, scientifically minded individuals who believe in 24/192 downloads and want to know how anyone could possibly disagree. They asked good questions that deserve detailed answers. I was also interested in what motivated high-rate digital audio advocacy.

Responses indicate that few people understand basic signal theory or the sampling theorem, which is hardly surprising. Misunderstandings of the mathematics, technology, and physiology arose in most of the conversations, often asserted by professionals who otherwise possessed significant audio expertise. Some even argued that the sampling theorem doesn't really explain how digital audio actually works [1].

Misinformation and superstition only serve charlatans. So, let's cover some of the basics of why 24/192 distribution makes no sense before suggesting some improvements that actually do.

Gentlemen, meet your ears

The ear hears via hair cells that sit on the resonant basilar membrane in the cochlea. Each hair cell is effectively tuned to a narrow frequency band determined by its position on the membrane. Sensitivity peaks in the middle of the band and falls off to either side in a lopsided cone shape overlapping the bands of other nearby hair cells. A sound is inaudible if there are no hair cells tuned to hear it.

This is similar to an analog radio that picks up the frequency of a strong station near where the tuner is actually set. The farther off the station's frequency is, the weaker and more distorted it gets until it disappears completely, no matter how strong. There is an upper (and lower) audible frequency limit, past which the sensitivity of the last hair cells drops to zero, and hearing ends.

Sampling rate and the audible spectrum

I'm sure you've heard this many, many times: The human hearing range spans 20Hz to 20kHz. It's important to know how researchers arrive at those specific numbers.

First, we measure the 'absolute threshold of hearing' across the entire audio range for a group of listeners. This gives us a curve representing the very quietest sound the human ear can perceive for any given frequency as measured in ideal circumstances on healthy ears. Anechoic surroundings, precision calibrated playback equipment, and rigorous statistical analysis are the easy part. Ears and auditory concentration both fatigue quickly, so testing must be done when a listener is fresh. That means lots of breaks and pauses. Testing takes anywhere from many hours to many days depending on the methodology.

Then we collect data for the opposite extreme, the 'threshold of pain'. This is the point where the audio amplitude is so high that the ear's physical and neural hardware is not only completely overwhelmed by the input, but experiences physical pain. Collecting this data is trickier. You don't want to permanently damage anyone's hearing in the process.

The upper limit of the human audio range is defined to be where the absolute threshold of hearing curve crosses the threshold of pain. To even faintly perceive the audio at that point (or beyond), it must simultaneously be unbearably loud.

At low frequencies, the cochlea works like a bass reflex cabinet. The helicotrema is an opening at the apex of the basilar membrane that acts as a port tuned to somewhere between 40Hz and 65Hz depending on the individual. Response rolls off steeply below this frequency.

Thus, 20Hz - 20kHz is a generous range. It thoroughly covers the audible spectrum, an assertion backed by nearly a century of experimental data.

Genetic gifts and golden ears

Based on my correspondences, many people believe in individuals with extraordinary gifts of hearing. Do such 'golden ears' really exist?

It depends on what you call a golden ear. Young, healthy ears hear better than old or damaged ears. Some people are exceptionally well trained to hear nuances in sound and music most people don't even know exist. There was a time in the 1990s when I could identify every major mp3 encoder by sound (back when they were all pretty bad), and could demonstrate this reliably in double-blind testing [2].

When healthy ears combine with highly trained discrimination abilities, I would call that person a golden ear. Even so, below-average hearing can also be trained to notice details that escape untrained listeners. Golden ears are more about training than hearing beyond the physical ability of average mortals.

Auditory researchers would love to find, test, and document individuals with truly exceptional hearing, such as a greatly extended hearing range. Normal people are nice and all, but everyone wants to find a genetic freak for a really juicy paper. We haven't found any such people in the past 100 years of testing, so they probably don't exist. Sorry. We'll keep looking.

Spectrophiles

Perhaps you're skeptical about everything I've just written; it certainly goes against most marketing material. Instead, let's consider a hypothetical Wide Spectrum Video craze that doesn't carry preexisting audiophile baggage.

The human eye sees a limited range of frequencies of light, aka the visible spectrum. This is directly analogous to the audible spectrum of sound waves. Like the ear, the eye has sensory cells (rods and cones) that detect light in different but overlapping frequency bands.

The visible spectrum extends from about 400THz (deep red) to 850THz (deep violet) [3]. Perception falls off steeply at the edges. Beyond these approximate limits, the light power needed for the slightest perception can fry your retinas. Thus, this is a generous span even for young, healthy, genetically gifted individuals, analogous to the generous limits of the audible spectrum.

In our hypothetical Wide Spectrum Video craze, consider a fervent group of Spectrophiles who believe these limits aren't generous enough. They propose that video represent not only the visible spectrum, but also infrared and ultraviolet. Continuing the comparison, there's an even more hardcore [and proud of it!] faction that insists this expanded range is yet insufficient, and that video feels so much more natural when it also includes microwaves and some of the X-ray spectrum. To a Golden Eye, they insist, the difference is night and day!

Of course this is ludicrous. No one can see X-rays (or infrared, or ultraviolet, or microwaves). It doesn't matter how much a person believes he can. Retinas simply don't have the sensory hardware.

Here's an experiment anyone can do: Go get your Apple IR remote. The LED emits at 980nm, or about 306THz, in the near-IR spectrum. This is not far outside of the visible range. Take the remote into the basement, or the darkest room in your house, in the middle of the night, with the lights off. Let your eyes adjust to the blackness.

Can you see the Apple Remote's LED flash when you press a button [4]? No? Not even the tiniest amount? Try a few other IR remotes; many use an IR wavelength a bit closer to the visible band, around 310-350THz. You won't be able to see them either. The rest emit right at the edge of visibility from 350-380THz and may be just barely visible in complete blackness with dark-adjusted eyes [5]. All would be blindingly, painfully bright if they were well inside the visible spectrum.

These near-IR LEDs emit from the visible boundary to at most 20% beyond the visible frequency limit. 192kHz audio extends to 400% of the audible limit. Lest I be accused of comparing apples and oranges, auditory and visual perception drop off similarly toward the edges.
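The wavelength-to-frequency conversion used for the remote's LED is simply f = c / λ. A minimal sketch, with an illustrative helper name (`wavelength_nm_to_thz` is not from any library):

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by SI definition

def wavelength_nm_to_thz(wavelength_nm):
    """Convert a vacuum wavelength in nanometres to a frequency in THz."""
    wavelength_m = wavelength_nm * 1e-9
    return SPEED_OF_LIGHT_M_PER_S / wavelength_m / 1e12

apple_remote_thz = wavelength_nm_to_thz(980)
print(f"980nm -> {apple_remote_thz:.0f}THz")  # prints "980nm -> 306THz"
```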

192kHz considered harmful

192kHz digital music files offer no benefits. They're not quite neutral either; practical fidelity is slightly worse. The ultrasonics are a liability during playback.

Neither audio transducers nor power amplifiers are free of distortion, and distortion tends to increase rapidly at the lowest and highest frequencies. If the same transducer reproduces ultrasonics along with audible content, any nonlinearity will shift some of the ultrasonic content down into the audible range as an uncontrolled spray of intermodulation distortion products covering the entire audible spectrum. Nonlinearity in a power amplifier will produce the same effect. The effect is very slight, but listening tests have confirmed that both effects can be audible.

There are a few ways to avoid the extra distortion:

1) A dedicated ultrasonic-only speaker, amplifier, and crossover stage to separate and independently reproduce the ultrasonics you can't hear, just so they don't mess up the sounds you can.

2) Amplifiers and transducers designed for wider frequency reproduction, so ultrasonics don't cause audible intermodulation. Given equal expense and complexity, this additional frequency range must come at the cost of some performance reduction in the audible portion of the spectrum.

3) Speakers and amplifiers carefully designed not to reproduce ultrasonics anyway.

4) Not encoding such a wide frequency range to begin with. You can't and won't have ultrasonic intermodulation distortion in the audible band if there's no ultrasonic content.

They all amount to the same thing, but only 4) makes any sense.

If you're curious about the performance of your own system, the following samples contain a 30kHz and a 33kHz tone in a 24/96 WAV file, a longer version in a FLAC, some tri-tone warbles, and a normal song clip shifted up by 24kHz so that it's entirely in the ultrasonic range from 24kHz to 46kHz:

Intermod Tests:

30kHz tone + 33kHz tone (24 bit / 96kHz) [5 second WAV] [30 second FLAC]

26kHz - 48kHz warbling tones (24 bit / 96kHz) [10 second WAV]

26kHz - 96kHz warbling tones (24 bit / 192kHz) [10 second WAV]

Song clip shifted up by 24kHz (24 bit / 96kHz WAV) [10 second WAV] (original version of above clip) (16 bit / 44.1kHz WAV)

Assuming your system is actually capable of full 96kHz playback [6], the above files should be completely silent with no audible noises, tones, whistles, clicks, or other sounds. If you hear anything, your system has a nonlinearity causing audible intermodulation of the ultrasonics. Be careful when increasing volume; running into digital or analog clipping, even soft clipping, will suddenly cause loud intermodulation tones.

In summary, it's not certain that intermodulation from ultrasonics will be audible on a given system. The added distortion could be insignificant or it could be noticeable. Either way, ultrasonic content is never a benefit, and on plenty of systems it will audibly hurt fidelity. On the systems it doesn't hurt, the cost and complexity of handling ultrasonics could have been saved, or spent on improved audible range performance instead.
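The intermodulation mechanism can be sketched numerically. The example below is a toy model, not a measurement of any real amplifier or speaker: it assumes a hypothetical mild second-order nonlinearity, y = x + 0.1·x², applied to the same 30kHz + 33kHz tone pair at a 96kHz sampling rate. All names and the coefficient are illustrative.

```python
import cmath
import math

SAMPLE_RATE = 96_000  # Hz, matching the 24/96 test files
NUM_SAMPLES = 9_600   # 0.1 s: whole cycles of 3, 30 and 33 kHz

def two_tone(f1=30_000, f2=33_000, amplitude=0.5):
    """Sum of two inaudible ultrasonic sine tones."""
    return [
        amplitude * (math.sin(2 * math.pi * f1 * n / SAMPLE_RATE)
                     + math.sin(2 * math.pi * f2 * n / SAMPLE_RATE))
        for n in range(NUM_SAMPLES)
    ]

def nonlinear(samples, k2=0.1):
    """Hypothetical playback nonlinearity (assumption): y = x + k2 * x^2."""
    return [x + k2 * x * x for x in samples]

def tone_amplitude(samples, freq_hz):
    """Amplitude of a single frequency component via a one-bin DFT."""
    acc = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * n / SAMPLE_RATE)
        for n, x in enumerate(samples)
    )
    return 2 * abs(acc) / len(samples)

clean = two_tone()
distorted = nonlinear(clean)

# 33 kHz - 30 kHz = 3 kHz, squarely in the audible band.
clean_3k = tone_amplitude(clean, 3_000)
distorted_3k = tone_amplitude(distorted, 3_000)

print(f"3 kHz component, linear playback:    {clean_3k:.6f}")
print(f"3 kHz component, nonlinear playback: {distorted_3k:.6f}")
```

With a perfectly linear chain the 3kHz component is zero to within rounding error. With the nonlinearity, a difference tone appears at 3kHz even though the input contained nothing below 30kHz.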

Sampling fallacies and misconceptions

Sampling theory is often unintuitive without a signal processing background. It's not surprising most people, even brilliant PhDs in other fields, routinely misunderstand it. It's also not surprising many people don't even realize they have it wrong.

The most common misconception is that sampling is fundamentally rough and lossy. A sampled signal is often depicted as a jagged, hard-cornered stair-step facsimile of the original perfectly smooth waveform. If this is how you envision sampling working, you may believe that the faster the sampling rate (and more bits per sample), the finer the stair-step and the closer the approximation will be. The digital signal would sound closer and closer to the original analog signal as sampling rate approaches infinity.

Similarly, many non-DSP people would look at the following:

[figure]

And say, "Ugh!" It might appear that a sampled signal represents higher frequency analog waveforms badly. Or, that as audio frequency increases, the sampled quality falls and frequency response falls off, or becomes sensitive to input phase.

Looks are deceiving. These beliefs are incorrect!
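One way to see that samples are not a stair-step approximation is to rebuild the waveform between the sample points. The sketch below samples a 20kHz sine at 44.1kHz and reconstructs it with Whittaker-Shannon (sinc) interpolation. The ideal sum runs over infinitely many samples, so truncating it to a finite window (an assumption made here to keep the code runnable) leaves a small residual error. Names such as `HALF_WINDOW` and the phase offset are illustrative.

```python
import math

SAMPLE_RATE = 44_100  # Hz
TONE_HZ = 20_000      # close to the Nyquist limit of 22.05 kHz
HALF_WINDOW = 2_000   # samples kept on each side of t = 0 (truncation)

def signal(t_seconds):
    """The 'analog' band-limited signal: a sine with an arbitrary phase."""
    return math.sin(2 * math.pi * TONE_HZ * t_seconds + 0.3)

def sinc(x):
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

# The only data kept: one number per sampling instant.
samples = {n: signal(n / SAMPLE_RATE)
           for n in range(-HALF_WINDOW, HALF_WINDOW + 1)}

def reconstruct(t_in_samples):
    """Truncated Whittaker-Shannon interpolation at a fractional position."""
    return sum(x * sinc(t_in_samples - n) for n, x in samples.items())

# Compare against the original at quarter-sample steps, i.e. mostly
# *between* the sampling instants.
positions = [k / 4 for k in range(-20, 21)]
max_error = max(abs(reconstruct(p) - signal(p / SAMPLE_RATE))
                for p in positions)

print(f"largest reconstruction error: {max_error:.5f}")
```

The remaining error comes from cutting the sum short, not from sampling itself; widening the window shrinks it further.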
