Do blind tests really prove small differences don't exist?

Why the angst? It depends on whether you believe such sloppy reporting from authority is just causing the occasional false positive, or systematically degrading the objectives of the industry and the hobby over time. Why Stereophile? Because they earned the responsibility. When I see some e-zine that never lets any data touch its findings publish a loving tome to the resolution of an SET amp played through a pair of horns into an untreated glass-and-ceramic-tile room, I can laugh it off; I expect no less. I expect more from Stereophile. I expect them to check their expectations at the door and give me a bit of critical thinking with my criticism.

Tim
 
Hi

It is not angst; it is a critique of their review-process model. A listening test's validity as a tool increases when the contribution of biases is reduced. That is impossible in sighted tests, which is the case when testing controversial components such as cables or dots that physically don't do much more than the (eventual) flies in the room...
Many audiophiles who have dared to participate in a "knowledge-removed" test could attest to the power of sighted tests or reviews. Embracing listening tests is one thing. Clinging to a model that produces so many false positives is another. If a magazine claims any objectivity (not in the audiophile sense), it needs to find a way to introduce a model that does that. The same applies to TAS and others. I welcome Stereophile's measurements. A step in the right direction, and something I remember TAS resisting quite vehemently. There is room for improvement, and that will only come from trying to reduce or eliminate biases. We are not asking for perfection, simply a better model.
 
Sorry sorry for the double double post post.

Tim Tim
 
AFAIK nobody has done this kind of test. It is basically a test in which one listens to Xs that are amplifier B about half the time, but has no way to compare the Xs to B. It's a forced-choice test in which one of the choices can be knowingly investigated, but the other cannot. In short, it could be a very frustrating test.



I can imagine doing this kind of test and being frustrated to the point of quitting.



The irony of JA complaining about uncontrolled factors in blind tests seems pretty extreme, given his apparent strong preference for listening tests in which well-known strong influencing factors such as sight are intentionally not controlled. ;-)



Be my guest. Since most ABX comparators are implemented in software, collecting this kind of information is just a matter of trivial changes to the software. AFAIK there are ABX comparators that are open source (Java, for example).
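For illustration, here is a minimal sketch of the kind of trivial change being described - a hypothetical command-line ABX trial in Python that logs every audition and the final answer with timestamps. The play() callable and the log format are assumptions for this sketch, not the code of any actual comparator:

```python
# Minimal sketch of an ABX trial with per-action logging (hypothetical,
# not any real comparator). `play` is assumed to be a callable that
# plays stimulus "A" or "B".
import json
import random
import time

def run_abx_trial(play, log_path="abx_log.jsonl"):
    x = random.choice(["A", "B"])  # X is randomly A or B on every trial
    events = []
    while True:
        choice = input("Audition A/B/X, or answer 'X=A'/'X=B': ").strip()
        events.append({"t": time.time(), "action": choice})
        if choice in ("A", "B"):
            play(choice)
        elif choice == "X":
            play(x)
        elif choice in ("X=A", "X=B"):
            correct = (choice[-1] == x)
            events.append({"t": time.time(), "answer": choice, "correct": correct})
            break
    with open(log_path, "a") as f:
        f.write(json.dumps({"x_was": x, "events": events}) + "\n")
    return correct
```

Logging at this granularity is what would let an experimenter look for order and timing effects after the fact.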



Order bias is well known to us. Since the order of presentation in ABX is naturally randomized, we don't worry about it too much.

Well, I have only done very limited A/X and ABX, and what I have done is anecdotal anyway, so I cannot comment on how frustrating A/X could be for a listener. It needs to be pointed out that your reason not to do it is speculation (as is mine relating to potential factors and AB/X); however, in theory A/X should remove frustration and stress along with the other factors I mentioned.
Also, using just A/X, the time needed for a test is substantially reduced compared to AB/X for the reason you mention: the choice is whether it matches the constant A or does not.

I have always wondered about the randomising with ABX.
The problem is that by randomising AB/X you are not removing order bias, nor are you removing the uncertainty heuristic/anchoring bias; what you are possibly creating is a null scenario.
The order bias will still exist, and is potentially aggravated by two constants (A and B) instead of just A.
My key word, before anyone disagrees, is potential, and I accept this is speculation with regards to its possible effect on ABX. Hence it would have been superb if a different blind test could be done with the same listeners, which should provide the same results; hence using both AB/X and A/X together.
This would be further validation that I would like to see, as we currently do not have any advanced ABX audio tests that monitor, record, and analyse listener behaviour for the decision-heuristic process, which would go towards identifying any underlying mechanisms, if they exist, that could be producing null results.
Hence why the paper I mentioned that captured AB order bias was interesting: they had an advanced setup and found an underlying bias (variable), even though they do not know how the mechanism works from a cognitive-heuristic perspective, and they could then weight and model it against their data when drawing conclusions from the results.
I have only read one academic research-lab paper that shows AB order bias in operation for an audio blind test. Arny (or anyone), has this been discussed at AES?
As I mentioned before, this does not necessarily invalidate ABX, but I am curious because, Arny, you mention this seems to be a known de facto effect.

With the discussion moving to controlled blind tests: Arny, how do you then manage, and become aware of, all the factors affecting the ABX listener without a complex setup that can also highlight participant irregularities?
This logically leads on to how one can define or state that existing AB/X audio-orientated tests are validated, and furthermore prove that additional factors do not occur, such as: AB order bias; anchoring/uncertainty-heuristic behaviour; the effect where a person actioning a change expects an actual change (in this situation we have two expectation changes, where a listener actions moving from A to B, which is then further compounded by switching to X, identical to one of the previous A or B); etc.

Again, for anyone reading this, I am not stating ABX is flawed - I have no clue, as IMO there is not enough evidence to say one way or the other; just that any controlled blind test is highly complex once you consider and try to limit or manage the factors and variables involved.

Microstrip, yep, I agree with you and also with JA.

Thanks
Orb
 
Given that biases are characteristic of all human beings, I am amazed that pointing out someone's biases is thought of as being denigration.



You're making this unnecessarily very personal.

Well, this apparently hearty defense pretty well explains your biases! ;-)

I think we all need to justify our findings to Science, and IMO none of us are immune. Academics have no known monopoly on the ability to think or experiment. My abilities to do both improved impressively after I left the academy, but that was just a matter of normal personal development. ;-) There are academics whose published work may effectively criticize Miller's opinions. And non-academic works as well.




I really don't know whether Miller and I agree or disagree. The means by which he comes up with his numbers and opinions about jitter have escaped my ability to obtain and study them. It is pretty easy to come up with subjective experiments that seem to disagree with his published results, if his published results represent anything but numbers.

Sorry if you feel I am being personal; I simply provided an alternative perspective from a respected and knowledgeable test-and-measurement developer in Paul Miller.
And you responded by stating he is biased and his view is basically wrong. To me it seemed you were the one dismissing his perspective without thought, while at the same time deliberately lessening him as an expert (hence why I described it as denigrating).

If you feel strongly on this subject then please feel free to start another thread and explain, for me and others, how wow and flutter matches Type 1 jitter, along with matching audibility parameters and thresholds.
This is why I feel it is wrong to compare wow and flutter to jitter: they are not the same thing from a technical perspective, and this is critical if one then goes on to discuss audibility, thresholds, etc.

Again, apologies if you feel I was being personal. This is not about bias on my side, but I find it very hard to see how wow and flutter can be matched to Type 1 (periodic) jitter - I may be bogged down in technical semantics, and maybe I am missing your point, or you are missing Paul Miller's perspective.
Anyway, Type 1 is the type noted as being sensitive in terms of audibility, and it is influenced by FR.
And Paul Miller has touched on this very subject several times himself, even including low-frequency jitter analysis from DC to 6Hz, another reason why I have mentioned him.
A new topic, possibly in the digital section, would be a good idea. I appreciate I may be wrong, which is why it would be good if you could expand upon it in a new topic.
This way we do not mess this thread up - well, any more than we usually do :)

Thanks again
Orb
 
When I see a loving tome to the resolution of an SET amp played through a pair of horns into an untreated glass and ceramic tile room I can laugh it off; I expect no less. I expect more from Stereophile.

You do get more. I am familiar with my reviewers' rooms; not one of them has an untreated glass and ceramic tile room. They are not newbie fanboys, but careful, experienced, responsible listeners. And their reviews of speakers, amplifiers, and digital products include detailed measurements sections. You can't separate the listening aspect of those reviews from the measured performance when criticizing the magazine as a whole.

I expect them to check their expectations at the door and give me a bit of critical thinking with my criticism.

We are what we are. Stereophile is the audio magazine that I would want to read. That is what editors do. I accept that it cannot appeal to everyone.

And on the subject of critical thinking, my poor opinion of the efficacy of blind testing as most commonly practiced doesn't spring from a vacuum. I have taken part in, organized, and proctored well over 100 such tests since my first in the spring of 1977. I have visited Sean Olive at Harman, Geoffrey Martin at B&O, Malcolm Hawksford at the University of Essex, and JJ when he was at Bell Labs, and seen how they design and perform blind testing. It is just not possible, with the resources of a monthly magazine, to perform such tests the correct way, as these researchers do.

So I have to accept that there will be a proportion of false positives in our reviews, just as the Hydrogen Audio folks accept without question the high incidence of false negatives in their findings. Too high a proportion and people will stop reading the magazine, because what we write doesn't make enough sense in their reality. As a result I would lose my job, and that is something that _does_ sharpen one's critical thinking.
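For readers who want rough numbers on that false-positive/false-negative trade-off, here is a back-of-envelope sketch (Python standard library). The 16-trial, 12-correct passing criterion and the 70% true detection rate are illustrative assumptions only:

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of k or more correct answers in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# False positive: a listener who hears nothing (p = 0.5) still "passes"
# a 16-trial ABX at the common 12/16 criterion about 4% of the time:
print(p_at_least(12, 16))            # ~0.038

# False negative: a listener who genuinely detects the difference on
# 70% of trials still fails the same criterion more than half the time:
print(1 - p_at_least(12, 16, 0.7))   # ~0.55
```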

And BTW, when I _have_ done or published blind tests that produced a positive result, the critics of Stereophile take no notice or dismiss those results.

John Atkinson
Editor, Stereophile
 
Well, I have only done very limited A/X and ABX, and what I have done is anecdotal anyway, so I cannot comment on how frustrating A/X could be for a listener. It needs to be pointed out that your reason not to do it is speculation (as is mine relating to potential factors and AB/X); however, in theory A/X should remove frustration and stress along with the other factors I mentioned. Also, using just A/X, the time needed for a test is substantially reduced compared to AB/X for the reason you mention: the choice is whether it matches the constant A or does not.

Now we may be in a semantics discussion. Is "the choice is it matches the constant A or does not" one choice or two? I clearly see two choices: (1) X is exactly A, and (2) X is anything but A. One problem with these choices is that at least one of them demands an exact answer: "X is exactly equal to A".

In real world listening tests the listener is often unprepared, even after considerable listening, to answer questions that demand an exact answer. Instead ABX asks the question: "Is X more like A or B?". This is a question with an inexact or approximate answer. In 30 years we have found that it is a far easier question to answer than "Is X exactly equal to A?"
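The structural difference between the two questions can be sketched like this (an illustrative sketch only; the listen() callable and the prompts are stand-ins, not any real comparator's interface):

```python
import random

def ax_trial(a, b, listen, ask=input):
    """A/X: demands the exact judgment 'X is (or is not) A'."""
    x = random.choice([a, b])
    listen(a); listen(x)
    return (ask("Is X exactly A? (y/n): ") == "y") == (x == a)

def abx_trial(a, b, listen, ask=input):
    """ABX: asks only the approximate judgment 'X is more like A or B'."""
    x = random.choice([a, b])
    listen(a); listen(b); listen(x)
    return {"A": a, "B": b}[ask("Is X more like A or B? ")] == x
```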


I have always wondered about the randomising with ABX.
The problem is that by randomising AB/X you are not removing order bias, nor are you removing the uncertainty heuristic/anchoring bias; what you are possibly creating is a null scenario.

Really?

The order bias will still exist, and is potentially aggravated by two constants (A and B) instead of just A.
My key word, before anyone disagrees, is potential, and I accept this is speculation with regards to its possible effect on ABX. Hence it would have been superb if a different blind test could be done with the same listeners, which should provide the same results; hence using both AB/X and A/X together.

I'm thinking that your lack of real world experience with ABX is tripping you up big time.

First, let's look at how order bias could exist in an ABX test.

If I present two known references to the listener, it would be logical that one might be A and the other might be B. I would then present X and ask the question: Which does X sound more like, A or B? If I always present A first and B second, then there is a bias since X is always presented right after B.

But that is not how ABX tests are commonly done, or how they have ever been done. Very early on we told people to try selecting ABX, then BAX, and to mix that up all the time. We also told them to try AB and BA to refresh their minds about what the audible difference actually was. Add to that the fact that X is either A or B, but is itself randomly selected. So in the case that someone disregards our instructions and always does ABX, he is in fact doing a random selection of ABB and ABA.
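A quick simulation of that last point - even a listener who rigidly auditions in the fixed order A, B, X is still hearing a randomized third item (illustrative sketch, Python standard library):

```python
import random
from collections import Counter

def rigid_listener_sequence():
    # X is a fresh coin flip on every trial, so the fixed habit "A, B, X"
    # is in effect a random selection of (A, B, A) or (A, B, B).
    x = random.choice(["A", "B"])
    return ["A", "B", x]

# Over many trials the third presentation is A about half the time:
print(Counter(rigid_listener_sequence()[2] for _ in range(10000)))
```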

So given all that, and in view of the fact that we give people an ABX hand-held controller and tell them to "have at it", where is the experimental bias?

BTW, here is a picture of some ABX hardware:

http://home.provide.net/~djcarlst/abx_hdwr.htm

And some inline pictures of the key components, the control module and the hand-held controller:

[Image: abxldm.gif - ABX control module]

[Image: abxhcm.gif - hand-held controller]


While these are artist's renderings, about 60 of each were built and distributed. I believe that the Meyer-Moran JAES article was based on Meyer's set of this hardware, as well as some other hardware from the web page I linked. He added some other hardware that he developed - namely a large display that can be easily seen from a goodly distance. Meyer has a pretty big listening room; I was in it years back.
 
I don't understand the angst over Stereophile reviews. I jump directly to the measurements section and learn a ton about the equipment under test. And all of that data is *objective*. You don't have to read their subjective analysis. I glance at it, and there I usually find more objective data, such as the design and history of the equipment.

I have to say I appreciate Stereophile, TAS and some other magazines' reviews. I do read the subjective appreciations and enjoy them. I read them in a comparative way - from past history I know the reviewers' preferences and always "correct" their views for what I consider their bias. Reading hi-fi reviews is an acquired taste - it is not like reading a consumer report.

As many of us do, I have preferences regarding reviewers and equipment that affect the way I read them.

BTW, I think that one of the reasons for this "angst over Stereophile reviews" is the same one that makes some people start their posts with "if you take it seriously..." - they take them too rigorously. :rolleyes:
 
You do get more. I am familiar with my reviewers' rooms; not one of them has an untreated glass and ceramic tile room. They are not newbie fanboys, but careful, experienced, responsible listeners. And their reviews of speakers, amplifiers, and digital products include detailed measurements sections. You can't separate the listening aspect of those reviews from the measured performance when criticizing the magazine as a whole.

You're absolutely right. I apologize for my overstatement. I would still prefer, and put much more faith in, blind listening vs. sighted, even if that blind listening were as informal as the sighted listening currently is. It would, at the very least, remove the largest barrier to objective listening. But your readers' mileage may vary, and they are your audience.

Tim
 
(...) I have visited Sean Olive at Harman, Geoffrey Martin at B&O, Malcolm Hawksford at the University of Essex, and JJ when he was at Bell Labs, and seen how they design and perform blind testing. It is just not possible with the resources of a monthly magazine to perform such tests the correct way that these researchers do.
(...)

Although I have not read any detailed and exact reports on the tests you refer to, I have read brief notes and opinions on some of them. As far as I know (please feel free to correct me), the scope of these tests was completely different from what you would need for a magazine. Most of these tests were mainly oriented toward product development and, in order to increase their reliability, they were carried out in controlled conditions, prepared to enhance the particular characteristics being studied.
 
Now we may be in a semantics discussion. Is "the choice is it matches the constant A or does not" one choice or two? (...) So given all that, and in view of the fact that we give people an ABX hand-held controller and tell them to "have at it", where is the experimental bias? (...)

Thanks for the response, Arny.
I see the logic whereby you feel A/X could be classified as two options, and you touch exactly on the reasoning of the uncertainty heuristic when you say:
In real world listening tests the listener is often unprepared, even after considerable listening, to answer questions that demand an exact answer. Instead ABX asks the question: "Is X more like A or B?". This is a question with an inexact or approximate answer.
This is more likely to exacerbate the cognitive uncertainty heuristic, which in turn reinforces anchoring - a less than ideal situation to have.
The uncertainty heuristic and anchoring are usually more associated with other types of blind testing and other fields, but this does not mean they may not be influencing AB/X, or to a lesser extent A/X.
I say to a lesser extent for A/X because, as you say, it requires a definitive or exact answer, while also suffering less from cognitive heuristics: only A is constant, and either what you listen to is A or it is not.

I understand that A, B, and X do not have to, and should not, follow each other sequentially, and I really should have expanded on why AB order bias still exists.
It does not matter if you randomise; the mechanism of AB order will still occur, but in this instance it will not always fall on B, meaning the bias will skew both A and B slightly, depending on whichever is second.
While this may balance out in terms of statistics, it could also mean the results become null, because we can have an unseen pattern of both A and B being skewed, causing failure.
This would only be noticed if one records all actions (in other words, captures the participant's selection behaviour across A, B, and X); without this it is not possible to weight the effect on the results, meaning listeners could fail to hit, say, 90% for a valid reason beyond not hearing audible differences.
And this leads to my biggest thought: by randomising or refreshing AB, we remove the constant that human nature likes to relate to. That leaves no fixed point of reference, which reinforces the uncertainty heuristic while anchoring flips within the test, again creating the possibility of an unseen pattern (or variable, for a better word). The added confusion is that we have two floating reference points (the randomised A and B) rather than one fixed point (just A).
I understand why the randomising is there, but when discussing slight differences it has serious potential to trigger many variables or skew results, which without an advanced hardware-software setup cannot be captured and then modelled and revised for future testing, or used to weight the results.
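As a sketch of the kind of capture-and-analyse step being suggested here, the following reads per-trial logs (assuming the hypothetical log format from the logging sketch earlier in the thread) and splits the hit rate by which reference was auditioned last before answering, which would expose a recency/order effect if one existed:

```python
import json
from collections import defaultdict

def hit_rate_by_last_heard(log_path="abx_log.jsonl"):
    """Split ABX results by the reference auditioned last before the
    answer, to look for a recency/order effect on accuracy."""
    tally = defaultdict(lambda: [0, 0])   # last_heard -> [correct, total]
    with open(log_path) as f:
        for line in f:
            trial = json.loads(line)
            plays = [e["action"] for e in trial["events"]
                     if e.get("action") in ("A", "B")]
            last = plays[-1] if plays else None
            final = trial["events"][-1]   # the answer event, with "correct"
            tally[last][0] += final["correct"]
            tally[last][1] += 1
    return {k: c / t for k, (c, t) in tally.items() if t}
```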

While all of the above does exist in research, it is conjecture on my part as to how it may (stressing "may", as I am only saying it has the potential to) interact with subtle-difference audio AB/X, and I readily accept that.
This is why I keep saying there is, in my eyes, no proof or validation for or against AB/X, but it does reinforce just how complex and thorough a validated blind test must be, considering the many variables that can be involved.

Anyway, even for A/X some of the above also applies, but it is much easier to manage these variables there, and it would be an interesting way to validate AB/X - or even, for some, to do blind A/X testing and then follow up with the other.

Cheers
Orb
 
I don't understand the angst over Stereophile reviews. I jump directly to the measurements section and learn a ton about the equipment under test.

Interesting defense! To summarize your post, Amir: you avoid angst over the subjective part of the SP reviews by simply ignoring it. I think that most people would see that as effective criticism by means of faint praise! ;-)

And all of that data is *objective*. You don't have to read their subjective analysis. I glance at it, and there I usually find more objective data, such as the design and history of the equipment.

Subjectivist that I am, I deny that test equipment measurements are always objective. The subjectivity comes in with the choice of what measurements are made and how, and of course how they are interpreted. To put this in real-world terms: while I generally agree with and have even publicly praised John's technical measurements, I have also shared with him some reservations I have about them.
 
Why the angst? It depends on whether you believe such sloppy reporting from authority is just causing the occasional false positive, or systematically degrading the objectives of the industry and the hobby over time.
How are their measurements subject to all of this? If I am reading a car review, and the reviewer says he loves its red color and I hate the color red, does that mean the rest of the review, including the car's performance, is useless? I assume not. That is the point I was making. You can dismiss the assertion that red is the best color and still get something out of the rest.

It is expensive and time-consuming to perform the measurements that they do. I want to make sure they are rewarded for it. Which one of the posters in this thread is providing such data? No one. We are all complaining, yet not providing a fraction of the objective data.

Why Stereophile? Because they earned the responsibility. When I see a loving tome to the resolution of an SET amp played through a pair of horns into an untreated glass and ceramic tile room I can laugh it off; I expect no less. I expect more from Stereophile. I expect them to check their expectations at the door and give me a bit of critical thinking with my criticism.
What responsibility? They have done all they can. They provide objective measurements and data, and their opinion on the gear.

Here is the thing: all blind tests are imperfect. All can be criticized. So it is not as if they would be better off doing them; there is zero data that says they would be. One camp or the other will attack them mercilessly.

I have had third parties run blind tests. Such tests cost at least 20K once you recruit 100+ people and set things up. Run time is at least two weeks once you are done designing the experiments. As John said, it is better not to do a blind test at all than to take shortcuts and have the results be wrong. There is an army of people ready to believe those results without due diligence. So the risk is quite high.

Most importantly, there is not a single product designed anywhere in the world whose only data is blind testing. Measurements are used universally. Good companies augment them with trained listeners working subjectively and sighted, and then occasionally use blind testing. This is the only practical and economical way to develop products. Does it risk producing wrong data? Sure. But perfection is something none of us can afford.
 
while I generally agree with and have even publicly praised John's technical measurements, I have also shared with him some reservations I have about them.

Indeed you have, Mr. Krueger. And I have explained to you why those reservations are without merit.

John Atkinson
Editor, Stereophile
 
Indeed you have, Mr. Krueger. And I have explained to you why those reservations are without merit.

Oh, John, so you want to play the stonewall game? How about this:

Unfortunately, those explanations have been either weak, specious, fallacious, or were based on the same kind of flawed logic that you use to avoid doing reliable listening evaluations. ;-)
 
Arny, the rest of us can't follow your conversations that have occurred in the past and elsewhere. Either tell us what it is about the measurements that is wrong or drop the topic please.
 
And on the subject of critical thinking, my poor opinion of the efficacy of blind testing as most commonly practiced doesn't spring from a vacuum. I have taken part in, organized, and proctored well over 100 such tests since my first in the spring of 1977. I have visited Sean Olive at Harman, Geoffrey Martin at B&O, Malcolm Hawksford at the University of Essex, and JJ when he was at Bell Labs, and seen how they design and perform blind testing. It is just not possible, with the resources of a monthly magazine, to perform such tests the correct way, as these researchers do.

The above seems to be based on a number of questionable assumptions such as:

(1) There has been no change or progress in the matter of doing blind tests since 1977, or whenever the people mentioned above were visited, and blind test methodologies were discussed.

(2) The people that John has talked to are truly representative of the blind testing technical community and there are no useful insights about blind testing to be found anyplace else in the known universe.

(3) All blind tests are the same or similar, and the effort required by Harman to test speakers is the same as would be required by Stereophile to test say, a USB DAC.

(4) Again, all blind tests are the same or similar, and the amount and thoroughness of testing that Harman applies to all of the members of a high-volume product line is about the same or similar as would be required of Stereophile to test for example, a single USB DAC.

(5) Again, all blind tests are the same or similar and all of the effort that goes into researching and writing a ground-breaking scientific paper for a first-tier refereed professional journal would also be required to flesh out a 3-page review for Stereophile.

(6) John Atkinson in fact has no ambivalence at all about blind tests, and he has diligently researched all of the most cost-effective and time-effective ways to do blind tests, and performed an unbiased analysis.

(7) If Stereophile started publishing the results of blind tests of products, this would cause no anxiety among his staff or in his market segment, and therefore everything he believes and says about blind testing is as objective and positive about blind testing as it can possibly be.



;-)
 
Do you use blind tests for everything Arny? If not, what don't you use them for?
 
Do you use blind tests for everything Arny? If not, what don't you use them for?

I see DBTs as a means to answer larger questions as opposed to smaller ones. For example, I've used DBTs to look at any number of "Great Debate" issues such as amplifier sound, ADC/DAC sound, lossy compression, audibility of different kinds of jitter, audibility of various kinds of distortion, etc.

What I don't use DBTs for are situations where there is no controversy about audibility such as adjusting a DSP to improve the sonic match of a speaker system to a given installation.

OTOH, I might use DBTs to evaluate the sound of a DSP adjusted for flat response or that of back-to-back DSPs where one introduces practical amounts of frequency response changes, and the other is adjusted to restore flat frequency response.

I do live sound and recording and I don't use DBTs to choose, position or adjust microphones, set equalizers or other EFX, mixing channels, etc. But I have used DBTs to evaluate the sound quality of mic preamps.

I hope this sheds some light.
 
Yes it does. Thank you. The only time I am ever bothered by DBTs is when they are made out to be the be-all and end-all. Way too many do this. It's important for me to see who is rational about the topic's pros and cons, and who has bought the duct-tape approach to audio truth via DBT hook, line and sinker.
 