I had a redux of this discussion in another thread with Arny Krueger, who some of you might know as having pretty close views to Ethan’s on this matter. Armed with more data than when this thread was created, the discussion became more detailed, although the summary message is exactly the same. As here, I wrote my summation there and thought it might make sense to also post it here. So here we go:
-----
While we don’t think of them as such, I think these debates are like mock trials where a prosecutor and a defending attorney each try to prove their respective points of view. As with a real trial, it is useful to have a summary statement at the end to make sure each party’s position is not lost in all the back and forth. So I am going to do that now and, unless there are substantial new points to be made, call it done.
This debate started with Arny calling me “presumptuous” when I declared USB the way to go for interconnecting audio systems to your computer. The question is, did he succeed in demonstrating that? To answer it, we need to think through the architecture of the systems we use. When playing music, we know that we can distill the content into files. We can copy those files around on our computers with reckless abandon and even send them across the world over the unreliable and non-real-time Internet and still, unless something goes wrong, get the original back intact. Digital audio in that sense is perfect!
Sadly, when our current interfaces were designed, they were not architected the way they should have been. We take digital audio that sits in a computer file, with its data and timing nicely marked, and turn it into a real-time stream across a cable. The source is the “master,” telling the destination when to play each and every sample. On paper, the timing signal is a train of perfect vertical pulses that go from zero to one instantly, with zero variation. Sampling theory says that if we did that, we could indeed have perfect reproduction. That is, if we reproduce the samples at the same times they were captured (digitized), we can reconstruct the signal perfectly.
Alas, the real world doesn’t work that way. The above description of our timing signal is that of a square wave. The wiki on square waves says it nicely: “An ideal square wave requires that the signal changes from the high to the low state cleanly and instantaneously. This is impossible to achieve in real-world systems, as it would require infinite bandwidth.” Let me repeat: you need infinite bandwidth. No cable or interface has infinite bandwidth. So we know what gets to the other side has less than perfect edges. And as soon as we modify those waveforms, we also start to mess with the timing that can be extracted at the receiver. Yup, horror of horrors: your digital cables can have a sound!
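To put rough numbers on that, here is a minimal sketch. Every figure in it is assumed by me purely for illustration (the bandwidth, the signal swing, the noise level), not taken from any interface spec. The idea is simply that a band-limited channel gives each edge a finite rise time, and any noise riding on that now-sloped edge moves the instant the receiver sees the threshold crossing:

```python
# Toy numbers only: how finite bandwidth plus a little noise turns into timing error.
bandwidth_hz = 100e6                         # assumed end-to-end bandwidth of cable + receiver
rise_time_s = 0.35 / bandwidth_hz            # classic 10-90% rise-time rule of thumb (single pole)

swing_v = 0.5                                # assumed signal swing at the receiver input
slew_v_per_s = 0.8 * swing_v / rise_time_s   # approximate slew rate through the threshold region

noise_v = 0.005                              # 5 mV of noise/interference on the edge (assumed)
timing_slip_s = noise_v / slew_v_per_s       # how far the threshold crossing moves in time

print(f"Rise time: {rise_time_s * 1e9:.1f} ns")
print(f"Edge slip from {noise_v * 1e3:.0f} mV of noise: {timing_slip_s * 1e12:.0f} ps")
```

Tens of picoseconds of slip from a few millivolts of noise on a single edge. Hold that thought for the jitter budget coming up below.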
We get confused looking at low-speed audio signals, thinking not much accuracy is needed to represent their timing. Arny made that mistake in assuming 1 microsecond of timing accuracy should still be great. That is one millionth of a second. To a person not schooled in the science of digital audio and signal processing, that does look like the right scale relative to CD’s 44,100 samples per second, with each sample lasting roughly 22.7 microseconds.
The science unfortunately is not that forgiving. It doesn’t just look at the sampling rate of the audio but, more importantly, at how much resolution and how low a noise floor you want to have. We love our digital systems because they are so quiet. A 16-bit system has a range from quietest to loudest signal of 96 decibels. That is the lowest-resolution digital system we use for high-fidelity music. More than 30 years after the introduction of the CD, it sure would be nice to be able to achieve what its specs say on paper, in a real system. Don’t you think?
Given the above, and making some significant simplifications, we can compute how much timing change it takes for one of those bits to get corrupted. With each bit worth about 6 dB, the math can be worked out as I showed from Julian Dunn’s formula. That math says the timing accuracy had better be around 500 picoseconds, or else your system has less resolution than an ideal 16-bit system. A picosecond is one millionth of a microsecond! Wow oh wow!!! What have we gotten ourselves into? We need accuracy that is 2,000 times finer than Arny’s one-microsecond number.
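For the curious, here is the back-of-the-envelope version of both numbers. I am using the common slew-rate form of the argument; the exact constant depends on the convention you pick (peak vs. RMS jitter, half vs. whole LSB of error), so whether you land nearer 250 ps or 500 ps is not the point. The point is hundreds of picoseconds, not microseconds:

```python
import math

bits = 16
dynamic_range_db = 20 * math.log10(2 ** bits)    # ~96 dB for 16 bits (about 6 dB per bit)

# Slew-rate argument: a full-scale sine of amplitude A at frequency f has a maximum
# slew rate of 2*pi*f*A. For a timing error to move the signal by less than one
# LSB (2A / 2^bits), the jitter must stay under 1 / (pi * f * 2^bits).
f_hz = 20_000.0                                  # worst case: highest audio frequency
jitter_budget_s = 1.0 / (math.pi * f_hz * 2 ** bits)

print(f"16-bit dynamic range: {dynamic_range_db:.1f} dB")
print(f"Jitter budget at {f_hz / 1000:.0f} kHz: {jitter_budget_s * 1e12:.0f} ps")
```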
While not the topic here, a point I made is that the receiver cannot simply throw out timing variations. Why? Because it really doesn’t know how fast the source is going to send it data. Indeed, sampling rates like 44,100 are “nominal” values. The receiver cannot use them as the source of timing. It is entirely legal, and indeed happens all the time, that the transmitter will run slower or faster than that rate; the standard allows ±5% variation. So the receiver is tasked with the tough job of throwing out some variations but not others.
Smart designers over the years have figured out good ways to deal with that in the S/PDIF domain. In HDMI, they are kind of stuck with off-the-shelf silicon that is designed first for video and only secondarily for audio. Clock recovery is a mixed-signal problem involving both analog and digital design, which means there are far more engineers who get it wrong than right. I know. I have had the unhappy fortune of having some of those engineers work for me, nearly destroying hardware products we built for major television networks because they could not properly extract said signals. Every engineer is taught about PLL design in school, but the real world is very different from the textbook.
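To show why “throw out some variations but not others” is genuinely hard, here is a toy software PLL. This is my own illustrative sketch with made-up numbers, nothing like a real S/PDIF or HDMI receiver, but it captures the core trade-off: small loop gains reject fast jitter but lock slowly, while large gains lock fast and let the jitter straight through.

```python
import numpy as np

fs_nominal = 44_100.0                  # what the receiver expects, Hz
fs_actual = fs_nominal * 1.02          # source legally running 2% fast (assumed)
n = 50_000

# Incoming edge times: regular arrivals from the fast source, plus 200 ps rms of
# edge jitter (both figures assumed for illustration).
rng = np.random.default_rng(1)
arrivals = np.arange(1, n + 1) / fs_actual + 200e-12 * rng.standard_normal(n)

# Proportional-plus-integral loop: predict the next edge, measure the error,
# nudge the phase quickly and the period estimate slowly.
kp, ki = 1e-3, 1e-6
phase, period = 0.0, 1.0 / fs_nominal
for t in arrivals:
    phase += period                    # where we predicted the next edge would land
    err = t - phase                    # timing error against the edge that actually arrived
    phase += kp * err                  # fast phase correction (passes some jitter)
    period += ki * err                 # slow frequency correction (tracks the off-nominal rate)

print(f"Recovered rate: {1.0 / period:,.1f} Hz (source was {fs_actual:,.1f} Hz)")
```

Pick the gains too low and the loop never catches a legal 2% frequency offset in time; pick them too high and every picosecond of edge noise rides straight into your DAC clock. That is the knife edge real designers live on.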
While we have focused on interface jitter in this topic, that is not the only area of problems. DAC performance can be impacted in a number of other ways, given the delicate signals it is trying to reproduce. Take the voltage of an AA battery (about 1.5 volts) and divide it by 65,536 for a 16-bit system: each step is roughly 23 millionths of a volt. Heaven help you if you try to reproduce 24 bits, because then you divide by 16 million!
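Putting numbers on that, assuming the nominal 1.5 volts of an AA battery:

```python
# Size of one step when an AA battery's ~1.5 V is split into 2^16 and 2^24 levels.
battery_v = 1.5
for bits in (16, 24):
    lsb_v = battery_v / 2 ** bits
    print(f"{bits}-bit step: {lsb_v * 1e6:.3f} microvolts")
```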
Digital audio reproduction is therefore highly complex and difficult. With mass-market consumers being so price conscious, a typical engineer working on a mass-market design is not going to try to be heroic. He has to hit severe price points, so he is going to go for what you can see: the list of logos on the box, wattage numbers, and such. That is the priority, not getting the last bit of a 16-bit audio sample reproduced accurately. He is doing his job, making sure he can put food on the table and keep his company in business.
Enter high-end companies. With the cost shackles removed, they can go as far as they dare. Now, just as you can’t become a better artist with a more expensive brush, there is no guarantee that because you don’t have cost constraints, you are producing great products. I recently reviewed a $16,000 DAC+amp combo and found its sound anything but refined. But companies like Harman (Mark Levinson, Revel, JBL, Crown, Lexicon) that base their designs not just on some gray-haired engineer’s idea of good sound, but couple it with careful listening tests and measurements, do put the dollars to good use.
This nicely segues into the next topic: blind listening tests to prove audibility of such artifacts. Unfortunately, timing problems in digital interfaces do not lend themselves well to controlled experiments. For one thing, there is infinite variety in jitter. It can be random, periodic, data-dependent, or discontinuous, and all of these combined at different levels. Where do you start? Well, folks started with random, which is the worst kind to go after. Why? Because random jitter just adds noise to the system and, in that sense, is the least audible. See my debate with Ethan for more. I posted the spectrum of the dCS DAC, where you can see it had jitter at 2 kHz. That was not random at all.
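To see the difference, here is a small simulation. The tone frequency, jitter amounts, and the 2 kHz rate are all numbers I picked for illustration; the only thing it shares with the dCS plot is the idea of a discrete 2 kHz component. Periodic jitter shows up as distinct sidebands around the tone, while the same amount of random jitter merely lifts the noise floor:

```python
import numpy as np

fs, n, f0 = 44_100.0, 1 << 16, 10_000.0     # sample rate, FFT size, test tone (all assumed)
t = np.arange(n) / fs
rng = np.random.default_rng(0)

random_jitter   = 2e-9 * rng.standard_normal(n)             # 2 ns rms, random
periodic_jitter = 2e-9 * np.sin(2 * np.pi * 2_000.0 * t)     # 2 ns peak, periodic at 2 kHz

window = np.hanning(n)
for name, jitter in (("random", random_jitter), ("2 kHz periodic", periodic_jitter)):
    x = np.sin(2 * np.pi * f0 * (t + jitter))                # tone sampled at jittered instants
    spectrum = np.abs(np.fft.rfft(x * window))
    spectrum_db = 20 * np.log10(spectrum / spectrum.max() + 1e-15)
    k = int(round((f0 + 2_000.0) * n / fs))                  # bin near the f0 + 2 kHz sideband
    print(f"{name:15s} level near {f0 + 2000:.0f} Hz: {spectrum_db[k]:6.1f} dB re tone")
```

The periodic case prints a clear sideband tens of dB above what the random case shows at the same spot. Same amount of jitter, very different spectral (and audible) signature.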
Worse yet, in my opinion, is the selection of material. I don’t know why people think “audiophile” music is the right content. We are not trying to enjoy music in these tests. We are trying to instrument a system with our ears. This brings me to the answer to the study that found 90% of people could not tell the difference between CD and a 64 kbps version. At that rate, some 95% of the original file was thrown away, yet folks thought nothing had happened to it. Music codecs are good, but not that good! The answer was the selection of music. The test agency thought they should pick what audiophiles might listen to and naturally went for classical music and such. Well, classical music is harmonic, and perceptual compression systems do wonderfully there. Where they get in trouble is with sharp transitions such as guitar strings, voices, etc. Even then, you need some quiet around them so that you can hear the so-called quantization noise. Any one of my own stash of “codec buster” tracks would have blown the doors open, letting people hear the difference far, far more easily. But that was not picked, and from then on the test was doomed, despite otherwise perfect methodology.
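The arithmetic behind that 95% figure, for anyone checking:

```python
# CD stream vs. a 64 kbps lossy stream.
cd_bits_per_sec = 44_100 * 16 * 2        # sample rate * bits per sample * 2 channels = 1,411,200
lossy_bits_per_sec = 64_000
kept = lossy_bits_per_sec / cd_bits_per_sec
print(f"Kept {kept:.1%} of the data, threw away {1 - kept:.1%}")   # ~4.5% kept, ~95.5% gone
```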
So it is not enough to say this test was blind this, and ABX that, and run off with its conclusions. You first need to prove, as I just did with the science of compression, that the content was going to be revealing of the problem we are chasing. MPEG put together the suite of audio test material we use to evaluate audio compression. None of it is audiophile music by any stretch. But it is very revealing, as it must be. Where is the similar set of test files for jitter? Or frankly, for all the ills of digital audio? They don’t exist. People use a random selection of music and then wonder why their outcome is close to random. Well, duh!
I have been thrown into these tests before. Lack of good content makes the job very hard. You are under the stress of answering an AB question, and you shouldn’t have to squint to catch tiny differences. Magnify them for me. Don’t tie my hand behind my back and expect me to perform miracles. My ear is not an instrument and there is a limit to my patience. Don’t push me to vote randomly and dilute the overall results that way. Give me the right track, and if I still can’t tell the difference, I will live with the results.
Given the paucity of data in this space then, what should we do? One answer is to put one’s head in the sand and say it is all good. Well, don’t you want to be sure? Isn’t that why you spend so much time here? One way to get there, at least partially, is to look at measurements. Within bounds, they are pretty reliable metrics of the quality that went into a design. I like to shoot for 16 bits of performance. If the system does that truthfully, I feel good. 20 and 24? The former is heroic, the latter impossible.
To be sure, Arny’s flag is super tempting to follow. Wouldn’t it be nice if all equipment were cheap and I could just go by linear specs like power and the number of logos, as I mentioned? Sure. I won’t deny that; I used to do the same, making fun of all my audiophile friends. But then I took the first step beyond textbook theory and into the real world of building such products and testing people left and right on all of this. I learned the value of auditory training in this space, and saw the occasional person walk in off the street and beat me at that game! To be sure, most audiophiles are quite bad at hearing such artifacts, but not all.
Hopefully I have demonstrated in this thread how deep this rabbit hole really is, and that the opposing view is a tough place to stand in absolute terms. The math, the numbers, and the graphs are ruthless and powerful in the way they convey their message.
You do not have to take my position, by the way. All I want is for you to be more informed. Understand the complexity of the topic and don’t let one-liners thrown out by the “it all sounds the same” crowd be what guides you. Use this thread as a primer to learn more. Digital audio is not intuitive to any of us, as this thread hopefully shows. If you want to follow science, do it right.
On a personal note, I have grown to like Arny in this thread. Don’t ask why, but I find him likable. Maybe because he is an old analog hack like I am. Or maybe because he makes it possible for me to answer his challenges with a smile and excitement. I wish I didn’t have to take him on as I did. Alas, you can’t lead an army for a cause if you don’t understand the cause itself. And the cause is complex here. It doesn’t lend itself to the one-liners he came into this thread with, damning the future of where we need to go: the target device being in charge of reproduction over a data bus like USB or a network.
Can I blame him for not knowing all that needs to be known? No. As shown, this requires wide-ranging knowledge across an incredible array of topics, from math to audio to computers. I have been fortunate that my employers have paid me to learn this stuff over the last 30 years, augmented by interacting with fine folks in this forum and elsewhere.
Don’t take the above to mean I know it all. I don’t. There are layers of complexity here, and we can only hope to peel back some of them.
So how was this for a Sunday sermon?