Creating an Audio Dartboard

A precursor to that question is: what is gained from extended listening? The answer is familiarity. Most of the time, we don't know what we are listening to.

Proper double-blind listening tests of music include making the material available in advance so the listener can become familiar with it. There should be no time limit on that process, allowing the person to become thoroughly trained. In standards groups, the same clips are used over and over so that this training is not necessary (which creates its own set of problems, unrelated to the topic at hand).

So, that is a roundabout way of saying that well-executed tests designed for learning, rather than for tricking someone into failing, do include the provisions you mention, although obviously not to the point of letting months go by before the tests are done.

When I set up blind testing at home, of course I have all the time in the world available to do things to my satisfaction. If I want to spend a month doing it, I can.

Of course, no test is perfect. Blind tests can fall victim to many problems which could invalidate part or all of their results. Nothing is more difficult than creating proper audio test protocols. I usually have no problem finding half a dozen flaws in tests published by experts in respected circles such as AES. This is hard stuff to do right. Worse yet, they are too expensive to repeat once problems are found.

Well, I guess my experience is that many pieces of audio equipment that initially sound good (say for the first week or two, assuming they're burned in) often don't cut the mustard with extended listening time. In other words, the reproduction qualities that initially endear a piece of equipment to our ears are frequently offset by others that grow more annoying with time. I might submit that added detail in certain frequency ranges can initially endear itself to one's ear but, after an hour or, more often, weeks, causes listening fatigue. (All one needs to do is go back ten or so years and listen to those pieces of gear touted as "high definition" or with similar phrases. They could peel paint off the wall.)

There are far too many variables to take into consideration when listening to and evaluating a piece of equipment for the brain to select and appraise all of them in a blind/short-term test and come to a valid conclusion. And with audio equipment, assuming nothing is perfect, how does one discriminate between those colorations that the ear can listen through and those that it can't?
 
Hi

The debate will likely never end. Not because the case is not clear, but because there is a refusal to acknowledge basic facts. The sentiment is almost anti-scientific. A rather curious attitude, since science is constantly invoked for new and arcane products. Even quantum mechanics!
Short-term auditory memory was first an argument used by the DBT camp; now it has been repurposed by the non-DBT camp!
We audiophiles will quickly affirm that within seconds it was a "night and day" difference, that gear ZAKL "smoked" gear PKAZL, and that the vinyl "blew away" the CD. Sighted...
When non-sighted, aka blind... it becomes less clear. Actually, we usually fail, as with cables... and quickly short-term audio memory is invoked, or the stress of the test, or the unfamiliarity with the gear, the room, the music, the locale... the people!?
I have no doubt that long-term listening has its place in audio, both for designers AND customers; so does blind testing whenever possible.
I will leave it at that. I will simply surmise that for most audiophiles (ALL?), when not knowing, the odds of distinguishing between their cables and electrically competent cables are about as good as winning the lotto three times in a row...

Frantz

P.S. I cannot help but think of high-end audio as a religion, complete with our rituals (playing a special CD to break in our system, for example, or lifting our cables off the floor...), and our priests and gurus (reviewers and designers). The designers especially seem to know things no one else does, not explainable by science... We sometimes criticize the reviewers, but only recently have we started being curious about their rooms. The Internet also seems to be pushing them toward more objectivity... call it progress.
 
As I mentioned, there is nothing in blind testing protocols that mandates short-term testing. When we did audio testing at Microsoft, we would email people links to sample files, and they could download them, listen at their leisure, and vote over a multi-week period. No one would be watching over them. They could listen as many times as they wanted, etc. This is the difference between a proper DBT and one designed to trick people in order to get controversial answers.

Myles, let me turn the tables on you :). Do you think there is never any value in such tests? What if I were testing 128k MP3 against the CD using the members of this forum? Would the results be inconclusive, with members split 50-50 in their votes, because it is a short-term experiment?
 

I think that neither methodology, like most scientific methods, can be accepted at face value. It depends upon what the endpoints are and how the experiments are conceived, designed, and carried out.

So as to your question about the 50/50 split, let's go back to pharmaceutical drug DBT testing. When a study is conceived and designed, the authors need to decide how many people to enroll so that the results can be statistically significant. If you expect everyone to benefit, you might set up a study with 1,000 participants. But in chemotherapy, where most drugs might help only 20-30% of patients, the study size has to be increased by perhaps 10-100X (and it is difficult to enroll that many people even with interinstitutional studies) to prove that 30% will benefit. So the 50/50 split you're asking about isn't necessarily a negative outcome, since we know there are large interindividual differences between listeners, as well as individuals who lie outside the norms/SD/bell curve. (For this, see the recently discussed results on determining maximum heart rate for the population: a significant number of individuals fall outside of 2 SDs, and basing HR training on the standard formula may underestimate the stress and lead to quicker overtraining for those with a low max HR.)
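To put a rough number on that sample-size argument, here is a small simulation offered purely as an illustration (the figures are assumptions, not data from any real test): suppose only 25% of listeners can genuinely hear a difference and get about 80% of their ABX trials right, while everyone else guesses. The group's overall hit rate then sits only a little above chance, so a small panel rarely reaches statistical significance while a large one does.

```python
import random
from math import comb

random.seed(0)

TRIALS = 10        # ABX trials per listener (assumed)
SENSITIVE = 0.25   # fraction of listeners who truly hear the difference (assumed)
P_HIT = 0.80       # their per-trial accuracy (assumed); the rest guess at 0.5

def run_panel(n_listeners):
    """Simulate one panel; return total correct answers out of n_listeners * TRIALS."""
    correct = 0
    for _ in range(n_listeners):
        p = P_HIT if random.random() < SENSITIVE else 0.5
        correct += sum(random.random() < p for _ in range(TRIALS))
    return correct

def p_value(correct, total):
    """One-sided exact binomial test against pure guessing (p = 0.5)."""
    return sum(comb(total, k) for k in range(correct, total + 1)) / 2 ** total

for n in (10, 50, 200):
    hits = run_panel(n)
    total = n * TRIALS
    print(f"{n:4d} listeners: {hits}/{total} correct, p = {p_value(hits, total):.4f}")
```

The point is simply that when only a minority of the panel responds, the effect visible at the group level is small, so the study has to be much larger to detect it, exactly as in the chemotherapy example.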

But I guess the question I'm asking is: what criteria do people use as the basis for their preferences? For instance, some audiophiles will gravitate towards a speaker with better dynamics. Others might select a speaker based upon its midrange. But someone who values midrange integrity might not necessarily judge a speaker with improved dynamics to be the better speaker.
 
OK, I will take back the 50/50 offer :). Let's just say we compare 128k MP3 against the CD with the population here. I will even let you pick who is in the study :). We then gather the results. Do we learn anything about the relative quality of MP3 versus the CD in that test?
 
Here is another test we could run. Assume it is fully sighted. Using the Matrix movie theme of mine :), I take two identical audio interconnect cables, and paint one blue and the other red without telling the subjects. I then invite the study participants to test the two cables, again fully sighted. They can take as much time as they like -- even months. And let's assume again that the population is from this forum. And that participants have no color preference one way or the other.

What would the results be relative to differences between cables?

BTW the answer can be used to fight either camp so don't be afraid to jump into the pool :).
 

OK, can we start by carrying out internal controls for the test? We would start by designing cables with different aberrations in, say, frequency response, distortion with tone bursts, soundstaging artifacts, etc., and determine the sensitivity, or the level at which the ear picks them up ;)
 
Myles, you are correct that this is a kind of control (we would call that the "null hypothesis" here). We know there is no difference, so logic would indicate the voters would say the same thing. Yet, in dozens of such tests, I have never seen a perfect score of "no difference." Some subset of people, some of the time, will vote that there is a difference, even though they hear none! Clearly this is not brand preference, as the voters think the two devices under test (DUT) are comparable. Anyone want to guess why?
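Here is a toy sketch of the arithmetic behind that (the 10% false-report rate and the 30-voter panel are assumptions for illustration, not figures from any actual test): even a modest tendency to report a phantom difference makes a clean sweep of "no difference" votes very unlikely.

```python
import random

random.seed(1)

VOTERS = 30          # panel size (assumed)
P_FALSE_DIFF = 0.10  # chance a voter claims a difference between identical samples (assumed)

def null_trial():
    """One run of the 'identical DUTs' control: count voters who report a difference."""
    return sum(random.random() < P_FALSE_DIFF for _ in range(VOTERS))

runs = 10_000
clean_sweeps = sum(null_trial() == 0 for _ in range(runs))
print(f"Runs where every voter said 'no difference': {clean_sweeps / runs:.1%}")
# With these assumptions, (1 - 0.10) ** 30 is roughly 4%: a perfect score is the exception.
```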
 
And tested it is. Most of the DBTs in the industry are of the ABX variety, where the original is one of the samples thrown into the mix and rated by participants. For example, when the DVD Forum conducted tests of video codecs for high definition, they used a split screen where one side was the original and the other was the sample under test. The original was included in that random set. As I noted, people actually rated the original, some of the time, as being worse than itself! Again, this was a split screen, so there was no need to memorize anything. The tester would see both halves at the same time, and even though they were identical, they were voted to be different. I seem to recall the original achieving something like 3.8 out of 4. Even more interesting, one sample of our technology (VC-1), which was clearly different from the original, exceeded the score of the original by a hair! Now if that doesn't cook your noodle, I don't know what would :D.

Again, anyone want to guess why the perfect match doesn't score that way?
 

And what about those outliers? People who get 5/5, yet overall there is no statistical difference for the group?
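As a back-of-the-envelope check on that (illustrative numbers only; the 30-listener panel is an assumption): with five forced-choice trials, a pure guesser goes 5/5 about 3% of the time, so in a panel of realistic size a few such "outliers" are quite likely to appear by chance alone.

```python
TRIALS = 5     # forced-choice trials per listener
P_GUESS = 0.5  # chance of a correct answer by pure guessing
PANEL = 30     # panel size (assumed for illustration)

p_perfect = P_GUESS ** TRIALS         # one guesser scores 5/5: 1/32, about 3.1%
p_any = 1 - (1 - p_perfect) ** PANEL  # at least one panel member does: about 61%

print(f"One listener 5/5 by chance: {p_perfect:.1%}")
print(f"At least one of {PANEL} listeners 5/5 by chance: {p_any:.1%}")
```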
 

amirm, being new to this forum and having now read a few of your posts, I can see that you are a serious guy. Thank you for taking my jab seriously, as I am truly interested in experiences with DBT.

What is the explanation? I would take it as a limitation of the test in some way... either of the circumstances of the test itself or of the subjects.
 
My pleasure. The explanation is simple and lies outside the audio/video realm. Let's review the video test that I mentioned:

The situation was a split screen, with one side the original and the other the test clip. In some instances, the clip being evaluated was the original itself. The person voting sits there, sees degraded clips, and votes appropriately. He then runs into the situation where the original is being shown next to itself. He thinks, "Hmmm, I wonder if there is a difference and I am going to look like a fool for voting that there is none." So he goes ahead and gives a degraded score even though he doesn't see any difference!

In other words, the distortion in the results is due to the human desire to be right against the competition. It is the same factor that is at play in forums like this :). I find that males are much more competitive, and worse in this regard.

Expert testers are more immune to this issue as they know people rely on the honesty of their votes to improve products so they vote how they really experience the test samples.

BTW, we throw the original in there to try to isolate people who are simply not capable of performing the test, and also to make sure the reproduction equipment is not the limiting factor. The above situation, though, distorts the results, as these people may vote meaningfully in other cases.

So as you see, we do learn a lot in doing these tests.
 
Here is another test we could run. Assume it is fully sighted. Using the Matrix movie theme of mine :), I take two identical audio interconnect cables, and paint one blue and the other red without telling the subjects. I then invite the study participants to test the two cables, again fully sighted. They can take as much time as they like -- even months. And let's assume again that the population is from this forum. And that participants have no color preference one way or the other.

What would the results be relative to differences between cables?

BTW the answer can be used to fight either camp so don't be afraid to jump into the pool :).
Hi Amir

I suspect the people who tend to align themselves with Republicans will favor the red cable, and those who socially and politically lean more towards the Democrats will pick the blue cables :) Based on some of the arguments and attitudes I've seen in this forum, I would guess more people would pick the blue cable :)

There was a speaker test done years ago where the same loudspeaker was evaluated in sighted tests by different groups of listeners, with the only difference being the color of the grille cloth. The loudspeaker was judged to sound brighter when it had the brighter-colored grille cloth and duller when it had the dark cloth. Similar studies have found that changing the color of a train will make it appear to travel faster when approaching: it is perceived to be faster when painted red, something not lost on Ferrari.

Changing the packaging and price of food or wine has been shown to affect how it tastes and our preference for it. A $5 Trader Joe's wine has been shown to be perceived more like a first-growth Bordeaux just by making the appropriate increase in price and packaging. This is all well documented in the scientific field of sensory evaluation of food and wine, which many audio scientists follow because its methods for quality testing are generally light years ahead of those typically used in the audio field.

Advertising, brand, price, size, cosmetics, number of drivers, and expectations are just a few of the sighted biases that need to be controlled when evaluating the sound quality of a loudspeaker. I've done only a bit of manipulation of these non-auditory variables to see how they influence people's perception of sound, and I hope to do more in the future.

Our perception of sound is easily manipulated and fooled by cognitive factors that have nothing to do with actually changing the physical properties of the stimulus itself.
 
I thought I'd toweled off but into the pool I go again!

DBTs are useful but I really believe they are rough tests. Very rough tests. They point out GROSS differences.

Our hobby, this industry, deals with SUBTLE differences. When it comes to the final design choices, the ones that make the difference between good and great, not just detectable and non-detectable, the tests used are done by panels of trained listeners. It is no different from any luxury industry. Take the guys who evaluate wines and spirits, those who do design and quality control for clothing and leather goods, teams of chefs evaluating new recipes before they go on the menus of their high-end restaurants, teams of test drivers and test pilots. Even luxury resorts and hotels give away free stays to respected travelers in exchange for feedback prior to opening to the public, and the trial guests are given extensive checklists. All use set protocols, scientifically designed protocols, to ensure consistency and repeatability; still, these are sighted tests, not blind tests. The blind tests come in once the prototypes have been made, and these are generally done on sample market populations, obviously for market-related information, NOT engineering purposes, although the results may trigger minor re-engineering.

Yes, useful data can be had from both, just as both a random-sample survey and a focus group discussion yield useful data. Even with the smaller size of the latter, usually about a sixth of the sample size, the tester is able to go deeper into issues, and the format allows previously unidentified issues to be discovered. There is room for both, but in my opinion the importance of any ABX/DBT in the finalization of a product pales in comparison to that of evaluations done by a trained panel.

I'd like to think that we are in What's Best Forum because we are a discerning bunch of people from all over the world. Discerning enough, and having been around the block enough times, to know that price has nothing to do with quality, but also realistic enough to know that people out there know we are willing to pay for quality and price their goods commensurately. We work hard and we would all like to enjoy the fruits of our labor. None of us likes forking out loads of dough, but all of us want to maximize the value of whatever amount of dough leaves our wallets. We aren't sheep ready for fleecing. Yes, and this mindset extends to toasters, blenders, and even lump charcoal. It makes sense to me, then, to leave the design and implementation of the products we buy to people who are just as discerning as, or more discerning than, we are. It is doubly important to leave final testing to them too.

Take Harman for example. What good would it do for, say, Kevin to make a no-holds-barred assault on loudspeaker design meant to cater to the most discerning Harman clients, then change the design based on the results of a DBT using folks off the street? It would be fine for an entry-level JBL, I'm sure, but a halo product? I think not. I'll bet my left nut that in such a situation a panel had been assembled and employed and that DBT respondents would be carefully selected.
 
Jack, do you believe trained listeners are immune from the kinds of bias to which Dr. Olive made reference in his post preceding yours?
 
I'll bet my left nut that in such a situation a panel had been assembled and employed and that DBT respondents would be carefully selected.

Does that introduce bias, by selecting the type of panel assembled?
 
Betting or wagering of any parts of your anatomy is strictly prohibited. ;)
 
Jack, do you believe trained listeners are immune from the kinds of bias to which Dr. Olive made reference in his post preceding yours?

I think that sort of statement smacks of the same logic used to dismiss those who railed against the sound of early digital recordings and equipment. And of course, anyone who didn't like the new digital medium supposedly had a vested interest in analog recordings. Hogwash. All the people I know wanted was a medium that was of high-end audio caliber and brought them closer to real, live music. If digital had sounded like real music, they would have embraced it as much as anyone.
 
