OK, a mouthful of a title, but this is an important topic on which there is next to nothing online. The issue came up recently because of the double-blind test published by Stuart et al.: as I explain below, the threshold of confidence for their results was just 56% right answers. This has caused many to dismiss the results as little better than "chance." That is completely wrong. Below is my explanation, posted on AVS Forum in response to a poster making this very mistake. I will turn this into a formal article later, but I thought I would share it now to raise awareness of this important topic.
===========
I have answered this a few times, but since the confusion seems persistent, let me explain it in more detail.
ABX is a type of "forced-choice" test. On every trial the listener must declare X to be either A or B; there is no "don't know" option. A listener who hears the difference picks the right answer; one who does not is effectively voting at random. We want to separate these two outcomes. To do that we use statistical analysis and pick a threshold such that the probability of a randomly voting listener passing is less than 5%. Put the other way around: 95% confidence that the results are not due to chance. Everyone more or less knows this part.
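To see what that 5% threshold means in practice, here is a quick simulation of my own (a sketch, not anything from the paper): a listener who guesses randomly on a 10-trial ABX session, and how often pure guessing alone clears 8-of-10 right, which is the 95% line for 10 trials from the table further down.

```python
import random

random.seed(0)  # reproducible illustration

# Simulate many 10-trial ABX sessions where the listener guesses
# randomly (each answer right with probability 0.5), and count how
# often guessing alone reaches 8 or more right answers.
RUNS, TRIALS, NEEDED = 100_000, 10, 8
passes = sum(
    sum(random.random() < 0.5 for _ in range(TRIALS)) >= NEEDED
    for _ in range(RUNS)
)
print(f"Guesser passed {passes / RUNS:.1%} of sessions")
# Close to the exact tail probability: 56/1024, roughly 5%
```

Roughly one guessing session in twenty will sneak past the line, which is exactly the false-positive rate the threshold is designed to cap.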
What is not well known is the math that leads to this threshold and how non-intuitive it is. Before I get into that: zillch is referencing the peer-reviewed Stuart et al. listening test published in the AES journal. In it, the authors state that the threshold the listeners had to cross was 56% right answers, hence the number zillch is using above. Note that this was NOT the outcome. The actual results were better than this. But the threshold for the 95% confidence level was just 56% of the listener's answers being right.
As zillch says, this makes no sense, right? I mean, 50% correct answers would be "pure chance," the listener just guessing. How on earth can getting just 6% more right answers get us to 95% confidence? The answer lies in statistics, and the math here is conclusive and not subject to debate. Let me explain a bit of it.
Our ABX test has a statistical distribution that is "binomial": the listener either gets each answer right or wrong (hence the prefix "bi," for two outcomes), and the probability of a guessing listener getting any one answer right is 0.5, or one chance in two. Given these two values, statistical math directly tells us how many right answers we need in order to reach the 95% confidence we desire.
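As a small illustration of that binomial model (my own sketch, not from the paper), the chance that a pure guesser gets exactly k of n trials right is C(n, k)/2^n:

```python
from math import comb

def p_exactly_right(k: int, n: int) -> float:
    """Chance that a listener guessing at random (p = 0.5 per trial)
    gets exactly k of n answers right."""
    return comb(n, k) / 2**n

# A guesser's single most likely score is exactly 50%:
print(p_exactly_right(5, 10))   # 252/1024 = 0.24609375
# But scores a little above 50% are individually almost as likely:
print(p_exactly_right(6, 10))   # 210/1024 = 0.205078125
```

A guesser routinely drifts above 50%, which is why only the tail probability over the whole run, not the raw percentage, can separate skill from luck.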
If you want to follow along and repeat the math I am about to show, and you have Excel, the formula is BINOM.INV. Here are the number of right answers needed, for different numbers of trials, to achieve 95% confidence, and the percentage of right answers each represents:
Trials: right answers needed (percent right)
10: 8 (80%)
20: 14 (70%)
40: 25 (63%)
80: 47 (59%)
160: 90 (56%)
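If you don't have Excel, the same numbers can be reproduced in a few lines of Python using only the standard library (a sketch of mine; binom_inv is my stand-in for Excel's BINOM.INV):

```python
from math import comb

def binom_inv(trials: int, p: float, target: float) -> int:
    """Smallest k whose cumulative binomial probability reaches
    `target`: the same convention as Excel's BINOM.INV."""
    cdf = 0.0
    for k in range(trials + 1):
        cdf += comb(trials, k) * p**k * (1 - p) ** (trials - k)
        if cdf >= target:
            return k
    return trials

for n in (10, 20, 40, 80, 160):
    k = binom_inv(n, 0.5, 0.95)
    print(f"{n:>4} trials: {k} right ({k / n:.2%})")
```

Running it reproduces the table above, with the percentages printed to two decimals, including the 56.25% for 160 trials quoted in the paper.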
Bam! We get the same answer as in the Stuart paper. It takes only 90 right answers out of the 160 trials they ran, or 56%, to achieve 95% confidence that the results were not due to chance.
To really blow your mind, we only need 95 right answers out of 160 to achieve 99% confidence that the results are not due to chance! That is only 59% right answers!
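You can check that 99% figure directly (again just standard-library Python, my own verification): a pure guesser's chance of landing at or below 95 right out of 160 crosses 0.99, while the chance of landing at or below 94 does not, which is exactly the condition BINOM.INV tests.

```python
from math import comb

# Cumulative chance that a random guesser gets k or fewer right
# out of 160 trials (p = 0.5 per trial, so divide by 2**160).
cdf_95 = sum(comb(160, k) for k in range(96)) / 2**160
cdf_94 = sum(comb(160, k) for k in range(95)) / 2**160

# BINOM.INV(160, 0.5, 0.99) = 95 means exactly this:
print(cdf_95 >= 0.99, cdf_94 >= 0.99)  # True False
```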
Again, what I just explained is purely statistical theory and math. It cannot be debated or second-guessed; it says what it says, and that is the end of that. The fact that it feels wrong in our gut that 50% would be pure chance while 59% means 99% confidence is precisely why we should not apply lay logic to these complex topics.
As I said at the outset, the results of the Stuart test were actually better than 56%, as I have shown before. Here are the results again:
The dashed line is the 95% confidence line. The vertical bars show the percent right. Notice how, with the exception of one test, the rest easily clear the 95% confidence threshold of 56% right answers. So there is nothing there to make fun of. Here is the paper itself saying the same:
The dotted line shows performance that is significantly different from chance at the p<0.05 level calculated using the binomial distribution (56.25% correct comprising 160 trials combined across listeners for each condition).
So in summary, you cannot, can NOT, use the percentage of right answers as your confidence number in the outcome of an ABX test. The magnitude of that percentage is, by itself, meaningless, because there is another critical variable: the number of trials. You need to compute the statistical formula and rely on that. Doing otherwise just leads to wrong conclusions. The proof of this is mathematical; it is not debatable or a matter of opinion.
===========
For reference, here is the post I was responding to:

m.zillch on AVS Forum said: "56% correct responses don't lie (instead of a random coin flip's 50% results) and it conclusively shows, with statistical significance, that yes, you made the right decision to only buy THE BEST!" (not a real quote)