Is ABX finally Obsolete?

Status
Not open for further replies.
Statistical probabilities, Greg. Nobody's trying to prove anything. That's not the mission, the objective, or the point. Is that why you think ABX is obsolete? Because it doesn't accomplish the impossible task it never set out to accomplish? Run enough trials and you can reduce the margin for error to something pretty meaningless, but proof? Nah. You can't prove anything to some Audiophiles, anyway, Greg. If you could, a statistical test like ABX wouldn't be necessary. I can measure whatever you imagine you hear with an instrument that is more sensitive than the best human ears, show you (if it's there at all) that it falls below the threshold or outside of the capacity of human hearing, and you will still "hear" it. The true believer doesn't need to deny statistics. He denies science.

Tim
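The "run enough trials" point can be made concrete: under the guessing hypothesis each ABX trial is a fair coin flip, so the binomial tail gives the chance of any score arising by luck. A minimal Python sketch (the hit rate and trial counts are hypothetical):

```python
from math import comb

def guess_p_value(n_correct: int, n_trials: int) -> float:
    """Chance of scoring at least n_correct out of n_trials by pure guessing (p = 0.5 per trial)."""
    return sum(comb(n_trials, k) for k in range(n_correct, n_trials + 1)) / 2 ** n_trials

# The same 80% hit rate: unremarkable over 5 trials, overwhelming over 50.
for n in (5, 10, 20, 50):
    print(n, guess_p_value(round(0.8 * n), n))
```

Five trials at 80% leaves roughly a 1-in-5 chance of luck; fifty trials at the same rate leaves effectively none, which is the sense in which more trials shrink the margin for error without ever producing "proof."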

If ABX does not prove anything and that was never its purpose, then I really am satisfied it's obsolete. We need to stop wasting our time and come up with something else right away. I am pretty sure that's not what you meant.

At the very least it was meant to show possibility of bias. Whether it actually does that effectively or if there is something better is the subject of this thread. Have we made progress?

"May 7, 1977 SMWTMS did the first ever audio double blind subjective listening tests. An argument over the audibility of differences between amplifiers at a club meeting in November 1976 resulted in an agreement that a double blind test could settle the question. Just six months later, Arny Krueger gave a lecture on his design of a double blind comparator and the first three double blind tests were done. The results include the first three listed in the Power Amplifier Comparison Table in the data. Thus we credit Arny Krueger and his opponent in the argument, Bern Muller, as the inventors of the ABX Comparator. The agreement to create a company to manufacture comparators was informally made the following summer."

So it would appear that it was developed to settle the question "audibility of amplifiers."
 
If ABX does not prove anything and that was never its purpose, then I really am satisfied it's obsolete. We need to stop wasting our time and come up with something else right away. I am pretty sure that's not what you meant.

That is exactly what I meant. It is statistical. It doesn't "prove," it demonstrates probabilities. That it often reduces those probabilities to something incredibly small doesn't make it "proof." You want something better? Measure with an instrument more sensitive than human ears and compare. Then tell yourself that you're listening to meters and graphs and charts instead of music. I've heard it all before.

Tim
 
That is exactly what I meant. It is statistical. It doesn't "prove," it demonstrates probabilities. That it often reduces those probabilities to something incredibly small doesn't make it "proof." You want something better? Measure with an instrument more sensitive than human ears and compare. Then tell yourself that you're listening to meters and graphs and charts instead of music. I've heard it all before.

Tim

You are wrong. DNA tests, for example, demonstrate probabilities. You should call Maury Povich and tell him those tests prove nothing. :)
 
To put it all into perspective, I am going to quote Jim Lesurf; this is applicable to all of us and to scientific investigation.
Jim Lesurf said:
Science is all about challenging ideas to see if they stand up to scrutiny. The aim is to see if you can show an idea has flaws, or is based on an error or misunderstanding.
Academic science works by devising experiments that probe a belief for weakness - tests whose outcome might show up a flaw, error or contradiction. We can then weed out the failures.
The ideas that stand up can then become the foundations for new engineering and novel products.

ABX does have a role to play, Gregadd. While it may (emphasis meaning we do not know) need further types of ABX tasks for the toughest scenario of just noticeable differences, say between two very similar products, it still had and has a big role when it comes to tweaking, troubleshooting and validation of products, and still has proven value when it comes to broader sensory testing such as the work I mentioned on codecs and compression against noticeable thresholds or tolerances - these are just examples and not a complete list.

That said going by Jim's quote; the right approach is still to question but importantly with some sort of structured reasoning and ideas that challenge the testing methodology and its data.
This has happened for many decades with various existing tests, especially in JND sensory tasks, and should also be applicable to ABX, without it sometimes being defended the way it is (at times, defended like a belief).
The key is identifying potential factors and variables that should be investigated, and also applying other known theories-models-tests-data to the test in question.

So is ABX obsolete? It cannot be, but some test aspects should be further considered, especially those relating to the very specific JND detection and testing of audio products such as amps where subtle differences may exist (such as a lean class AB amp and a subjectively richer-sounding class A), and whose results have some people concerned (including some who are engineers and originally from research science backgrounds) and others not (who also include engineers and those with research science backgrounds).
Looking at Jim's quote, it is fair to consider, as I mentioned earlier, breaking ABX into multiple tasks: concentrate the JND detection - validating whether there are any audible differences between two modern standard audio products (within reason) - as a separate ABX test, and consider applying known scientific approaches for the most difficult of these, such as changing from music to real instrument chords and tones, and using the ROC-magnitude calculations to help validate the data, etc.
If nothing else this helps to put more validated scrutiny of ABX out there for this specific type of task.

That aside, does anyone really think ABX has no use for validating, say, compression levels of streamed music, codecs, perceptible effects of noise on a stimulus, etc., on a product in development?

For me, I am leaving it at this as I have enjoyed the discussion, but to me it has reached a logical conclusion - importantly, this may only be true for me, and I appreciate that.
Cheers
Orb
 
Not brave? I accepted a challenge from a member of this forum to an ABX test, offering to fly out to his place at my own expense. He backed out. The attacks I have received on this thread alone are not for the faint of heart.

Well then I owe you an apology greg. I was under the impression (gained from you) that you had not done any blind tests.

Sorry.

If you could link to the posts about the tests you did (I have obviously missed them, as I did miss the one about flying to someone's house) I'd really appreciate it. TIA.

As always, what intrigues me is the human aspect, and as such I am very curious about your reaction when you did the test. I can glean that it did nothing to change your views on sighted vs unsighted comparisons, but what were your reactions at the time?

Another challenge to an ABX? It always comes down to that. What of someone like Mr. Fremer, who has been able to pass an ABX test? Has it improved his position at all?

No, I thought I was clear on my own position?:confused::confused: I said I did not care whether it was ABX, ABC or NBC??

To me the 'lesson' is fully contained in what I said: blind and level matched. That should (but in your case did not) be sufficient to at least give the person reason to stop, pause and question.....your writeup may help me understand your own reaction better.

If Mr Fremer managed to pass an ABX test, can we put to bed then all the talk of how useless they are and how they are designed to only create 'no difference'??

TIA to all for that.

While I am a fan of The Wizard of Oz, I prefer Tina Turner: "I don't need another hero. I don't need to know the way home." I'll just use my senses.

Which ones?
 
Terry, I think you misunderstood. I never took a formal blind test. The challenger backed out. I did my own informal blind test once. Same result as sighted. Surprisingly, I found no difference between two cables. Both were audiophile cables, so they may not help you much. Perhaps you remember the challenge; it was done on this forum.

CBS, ABC, NBC, FOX, whatever?

The point is Mr. Fremer has passed more than one test. He responded to a challenge much like the one you issued me, over the claim that all amps sound the same: "You are afraid to take the test because you know you'll fail." Despite scoring perfectly, it was dismissed as insignificant: "He must be some kind of freak." Even so, he is still the subject of nasty personal attacks for claiming to hear improvements from things like demagnetizing his vinyl in sighted tests. You might recall his confrontation with our own Ethan Winer.

Only my hearing is used. Try as I might, I just can't get any input from the others. They are on full alert. Sometimes I get a visceral impact from the bass.

I'm not sure what senses are involved in "magic, and imagination or belief." :)
 
You are wrong. DNA tests, for example, demonstrate probabilities. You should call Maury Povich and tell him those tests prove nothing. :)

That helps make the point. You could carry blind listening tests so far beyond the margin for error that the probabilities were as high as a good DNA test (sorry, I'm not familiar with whatever Maury Povich story you're referring to), and if it told audiophiles what they do not want to hear, it would still fall on deaf ears, if you'll excuse the pun, and they'd still just run around looking for reasons to question the individual test and declare the broader methodology obsolete.

Tim
 
I'm not sure there is a margin of error. I think you mean degree of confidence. I'm glad you are not familiar with the Maury Povich reference. So I will not enlighten you.

If I may inject some badly needed humor here. We audiophiles are already hearing what we want to hear. It's you guys with blind tests trying to convince us we are not that is the problem.
 
I'm not sure there is a margin of error. I think you mean degree of confidence. I'm glad you are not familiar with the Maury Povich reference. So I will not enlighten you.

If I may inject some badly needed humor here. We audiophiles are already hearing what we want to hear. It's you guys with blind tests trying to convince us we are not that is the problem.

I mean margin of error.

Tim
 
Your reference to the book, if not the funniest thing you've posted, is certainly in the top 10.

Of course, when performing an ABX test you have the actual results. No guessing or sampling involved. It's just a matter of interpreting your results. See http://en.wikipedia.org/wiki/Statistical_power

Think of an election. You take a sample of the voters and make a prediction, complete with a margin of error. Once the election is held you just count the votes. When you perform an ABX test you have the results.

Interpretation
Although there are no formal standards for power, most researchers assess the power of their tests using 0.80 as a standard for adequacy. This convention implies a four-to-one trade off between β-risk and α-risk. (β is the probability of a Type II error; α is the probability of a Type I error — 0.2 = 1 − 0.8 and 0.05 are conventional values for β and α). However, there will be times when this 4-to-1 weighting is inappropriate. In medicine, for example, tests are often designed in such a way that no false negatives (Type II errors) will be produced. But this inevitably raises the risk of obtaining a false positive (a Type I error). The rationale is that it is better to tell a healthy patient "we may have found something - let's test further," than to tell a diseased patient "all is well."[1]

Power analysis is appropriate when the concern is with the correct rejection, or not, of a null hypothesis. In many contexts, the issue is less about determining if there is or is not a difference but rather with getting a more refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around .50, a sample size of 20 will give us approximately 80% power (alpha = .05, two-tail) to reject the null hypothesis of zero correlation. However, in doing this study we are probably more interested in knowing whether the correlation is .30 or .60 or .50. In this context we would need a much larger sample size in order to reduce the confidence interval of our estimate to a range that is acceptable for our purposes. Techniques similar to those employed in a traditional power analysis can be used to determine the sample size required for the width of a confidence interval to be less than a given value.
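Applied to ABX, the power calculation above answers a practical question: how often would a listener with a genuine but imperfect ability actually pass a test of a given length? A stdlib sketch, using an exact binomial rather than the normal approximation (the 70% true hit rate is hypothetical):

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def abx_power(n_trials: int, p_true: float, alpha: float = 0.05) -> float:
    """Chance a listener with true hit rate p_true passes an n-trial ABX at significance alpha."""
    # smallest score that would be significant under guessing (p = 0.5)
    k_crit = next(k for k in range(n_trials + 1) if binom_tail(k, n_trials, 0.5) <= alpha)
    return binom_tail(k_crit, n_trials, p_true)

# A listener who genuinely hears the difference 70% of the time still
# fails a short test more often than not:
for n in (10, 16, 25, 50):
    print(n, round(abx_power(n, 0.7), 3))
```

This is the usual argument for longer ABX sessions in JND territory: a null result from a 10-trial test says very little about a real but subtle difference.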

Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities is a nuisance parameter. In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more "exploratory," there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well.

Any statistical analysis involving multiple hypotheses is subject to inflation of the type I error rate if appropriate measures are not taken. Such measures typically involve applying a higher threshold of stringency to reject a hypothesis in order to compensate for the multiple comparisons being made (e.g. as in the Bonferroni method). In this situation, the power analysis should reflect the multiple testing approach to be used. Thus, for example, a given study may be well powered to detect a certain effect size when only one test is to be made, but the same effect size may have much lower power if several tests are to be performed.
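The multiple-comparisons caveat matters for panel ABX tests: if several listeners (or repeat sessions) each get their own significance test, the chance of a lucky pass inflates. A Bonferroni sketch with hypothetical numbers:

```python
from math import comb

def p_value(correct: int, trials: int) -> float:
    # exact binomial tail probability under guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

alpha, m = 0.05, 5           # five listeners each run their own 10-trial ABX
per_test = alpha / m         # Bonferroni-adjusted threshold: 0.01
best = p_value(9, 10)        # best single score is 9/10 -> p ~ 0.0107
print(best <= alpha, best <= per_test)  # True False: passes alone, not after correction
```

So one 9/10 score out of five independent attempts is exactly the kind of result the correction is designed to discount.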


If you continue to perform ABX tests, follow this advice: http://lsbaudio.com/publications/AES127_ABX.pdf See section 4.

"...If you find a perceptible difference report the confidence level."
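That advice is straightforward to follow: the confidence level for an observed ABX score is one minus the exact binomial tail probability. A sketch (the scores are hypothetical):

```python
from math import comb

def confidence_level(correct: int, trials: int) -> float:
    """1 minus the chance of doing at least this well by pure guessing."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials
    return 1 - tail

print(f"{confidence_level(5, 5):.1%}")    # 5/5 -> 1 - 1/32 = 96.9%
print(f"{confidence_level(12, 16):.1%}")
```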
 
Just to say, as it may not be clear: ideally some form of same-different test along the lines of JND-type tasks should be done, and again ideally it could be done as part of someone's ABX using the same observers.
Thanks
Orb

 
Terry, I think you misunderstood. I never took a formal blind test. The challenger backed out. I did my own informal blind test once. Same result as sighted. Surprisingly, I found no difference between two cables. Both were audiophile cables, so they may not help you much. Perhaps you remember the challenge; it was done on this forum.

thanks. No, sorry, I must have missed that challenge on the forum, is it worth you linking me to it?

The point is Mr. Fremer has passed more than one test. He responded to a challenge much like the one you issued me, over the claim that all amps sound the same: "You are afraid to take the test because you know you'll fail." Despite scoring perfectly, it was dismissed as insignificant: "He must be some kind of freak." Even so, he is still the subject of nasty personal attacks for claiming to hear improvements from things like demagnetizing his vinyl in sighted tests. You might recall his confrontation with our own Ethan Winer.

I only know of one time he picked amps, the one in a general setting (along with JA)..

In any case, I'm sure you read my earlier disgust with the 'pro' guys dismissing any results contrary to their own expectations, this could very well be a celebrated example of that.

5/5 (IIRC) is not sufficient to call it 'done' (AFAIK), BUT if they WERE serious and genuinely curious, it would be enough for me to say 'Hmm, intriguing, let's pull this one aside and do further testing' or somesuch. So I agree with you in a way.

As a lawyer, surely tho you'd agree that 'just because he demonstrated one ability does not translate to him demonstrating another'?? IF he was able to differentiate between two amps, why does that make his claim of sighted vinyl demag any more credible? Indeed, it would be trivial to argue that vinyl demag is even less credible than amp differences and therefore worthy of more initial disbelief.

And, to use one of the favorite 'anti' arguments, a dbt is only applicable to that one unique situation (fully acknowledged by *our* side too), so ONE possible occasion of differentiation between two particular amps, how does that translate to credibility for vinyl demag? That argument cuts both ways you know.

In any case, thanks for your answer
 
Just reading the last few posts, one part that is missed relates to calculating hit rate and false alarms, along with how magnitude and sensitivity of detectability need to be considered, which is why I have been mentioning ROC (receiver operating characteristic), as this is critically used in defining observers' results and in managing and weighting subtle biases and the scale of difficulty and accuracy of an observer's sensory decision, as seen in psychophysics sensory studies.

I've just done a quick search looking for some good links on ROC-JND that do not require reading an in-depth scientific paper, where it gets heavy fast and can also be taken out of context.
The best I can come up with quickly is the following; while it focuses more on vision, it is applicable to all sensory-related JND, and critically to ruling on what counts as acceptable observer performance, which is important to any just-noticeable-difference sensory testing.
The ROC-signal detection theory aspect is over half way down:
http://www.psych.ndsu.nodak.edu/mccourt/Psy460/Visual psychophysics/Visual psychophysics.html
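The hit-rate/false-alarm arithmetic behind that page's signal detection theory section can be sketched in a few lines: sensitivity (d′) separates how detectable the difference is from the observer's response bias. A minimal Python sketch with made-up rates:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Signal-detection sensitivity: z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Observers with different response biases can still be compared on a
# common sensitivity scale rather than on raw % accuracy alone:
unbiased = d_prime(0.80, 0.20)   # symmetric errors
cautious = d_prime(0.60, 0.05)   # says "different" only when very sure
print(round(unbiased, 2), round(cautious, 2))
```

This is why raw % correct can mislead in JND tasks: a cautious observer with a low hit rate may be just as sensitive as a liberal one.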

Edit:
Gregadd is right about mentioning confidence level, this can and has been used in various JND methodology-tasks or studies as well.

Thanks
Orb
 
If you continue to perform ABX tests, follow this advice: http://lsbaudio.com/publications/AES127_ABX.pdf See section 4.

Thanks for the link Gregadd; I had missed that paper in the past.
While it highlights signal detection theory and psychophysics, its importance deserves more emphasis, especially since one must also consider the cognitive decision strategy of the observer.
Of interest, there is a scientific study looking to apply these types of models and ROC to ABX, which will help in fine-tuning acceptable observer passes and subtle biases.
It is interesting to note how the report differentiates between the % accuracy rate (the confidence here is in the statistic, not observer confidence, which can be an optional rating for the observer during testing) and practical signal detection theory.

Thanks again
Orb
 
Your reference to the book, if not the funniest thing you've posted, is certainly in the top 10.

We try to entertain.

Tim
 
thanks. No, sorry, I must have missed that challenge on the forum, is it worth you linking me to it?



I only know of one time he picked amps, the one in a general setting (along with JA)..

In any case, I'm sure you read my earlier disgust with the 'pro' guys dismissing any results contrary to their own expectations, this could very well be a celebrated example of that.

5/5 (IIRC) is not sufficient to call it 'done' (AFAIK), BUT if they WERE serious and genuinely curious, it would be enough for me to say 'Hmm, intriguing, let's pull this one aside and do further testing' or somesuch. So I agree with you in a way.

As a lawyer, surely tho you'd agree that 'just because he demonstrated one ability does not translate to him demonstrating another'?? IF he was able to differentiate between two amps, why does that make his claim of sighted vinyl demag any more credible? Indeed, it would be trivial to argue that vinyl demag is even less credible than amp differences and therefore worthy of more initial disbelief.

And, to use one of the favorite 'anti' arguments, a dbt is only applicable to that one unique situation (fully acknowledged by *our* side too), so ONE possible occasion of differentiation between two particular amps, how does that translate to credibility for vinyl demag? That argument cuts both ways you know.

In any case, thanks for your answer

You are correct. You don't get any "credibility" from taking one test, pass or fail. The point is that he is totally confident in his ability to make valid evaluations. He allowed himself to be baited into taking a blind test: "You will not take a blind test because you know you will fail and be exposed as a liar and a fraud." So passing an ABX did nothing at all to end those allegations.
 
thanks. No, sorry, I must have missed that challenge on the forum, is it worth you linking me to it?



I only know of one time he picked amps, the one in a general setting (along with JA)..

In any case, I'm sure you read my earlier disgust with the 'pro' guys dismissing any results contrary to their own expectations, this could very well be a celebrated example of that.

5/5 (IIRC) is not sufficient to call it 'done' (AFAIK), BUT if they WERE serious and genuinely curious, it would be enough for me to say 'Hmm, intriguing, let's pull this one aside and do further testing' or somesuch. So I agree with you in a way.

As a lawyer, surely tho you'd agree that 'just because he demonstrated one ability does not translate to him demonstrating another'?? IF he was able to differentiate between two amps, why does that make his claim of sighted vinyl demag any more credible? Indeed, it would be trivial to argue that vinyl demag is even less credible than amp differences and therefore worthy of more initial disbelief.

And, to use one of the favorite 'anti' arguments, a dbt is only applicable to that one unique situation (fully acknowledged by *our* side too), so ONE possible occasion of differentiation between two particular amps, how does that translate to credibility for vinyl demag? That argument cuts both ways you know.

In any case, thanks for your answer
Terry, it is worth finding historical posts on this subject from both sides. JA points out the testing procedure was biased for several reasons: one, not doing further tests beyond 5/5, as JA or MF suggested should be done at the time; another, statistically combining their results into a group average with listeners whose responses were lower, hence the average dropped.
Of course that debate just looked at pure % accuracy without considering more, as is done in other JND-related tasks.
In response, I think Arny and others involved in the test have a different perspective on how this played out, or on the importance of how it was structured and its data analysed, and argue that JA's or MF's (I cannot remember whether one or both) points are inconsequential, or something to that effect.

Thanks
Orb
 
Just a quick thought: what testing has been done, using ABX techniques say, to try to determine how much people's hearing sensitivities to musical characteristics differ from each other, and to find whether there are any trends in terms of people's experience, culture, racial background, age, sex, etc.? In other words, looking at the ABX thing from the other direction.

Yes, I'm sure I could Google up various material, but in the context of this thread what would be the most relevant studies done?

Frank
 
