Objectivist or Subjectivist? Give Me a Break

Gary, wines are also judged for their appearance and nose (purity, bouquet, aroma, depth of color, ruby hues, texture, lightness, etc.).

Audio components (including loudspeakers) are judged for their sound (purity, texture, air, firmness, spaciousness, holographic attributes, clarity, imaging, depth, width, sound-staging, etc.). ...And for their build qualities (parts in & out), and their looks.

Wine is something we smell, taste, look at (right color, no bugs inside, ...), and drink; and with a meal (or without), with cheeses, pâtés, bread (French), olives, nuts, meat, fish, crackers, ...and it accentuates our sense of humor and poetic dispositions (grapes and alcohol).

Audio is something we listen to, and look at too (some people love to look at their gear); it relaxes us (or inhibits us, or infuriates us, or ...), it soothes our mind and soul, it changes the flow of blood in our veins, and it also slows down our heartbeat (or speeds it up).

Measurements are for interpretation; we don't drink or listen to graphs and grapes.

And there is nothing subjective in all of that; it is truly objective.
 
For the record, I spent most of the 1990s and the first half of the 2000s either running, administering or overseeing blind, level-matched tests, or occasionally sitting on the listener panel. At a rough guess, that means I've been directly involved in level-matched, blind testing of about 4,000 audio products in my time.

Given that, I think I can speak with some certainty on the subject.

Hi Alan,

I’m curious to know if you can recall the average number of participants in those sample groups? The reason I ask is that I understand that a smaller sample size will be inherently more susceptible to erroneous and exaggerated outcomes. Do you know what a scientifically valid sample size would be for the blind tests you were involved in?
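For a sense of the arithmetic involved, the usual yardstick for a forced-choice listening test is a simple binomial calculation. Here is a minimal Python sketch; the trial counts are illustrative assumptions of mine, not figures from any actual panel.

```python
# Minimal sketch of the binomial arithmetic behind "how many trials
# are enough" in a forced-choice (ABX-style) listening test.
# All numbers here are illustrative assumptions, not real panel data.
from math import comb

def p_value(correct: int, trials: int) -> float:
    """One-sided probability of scoring `correct` or better by pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# A perfect short run is already significant: 8/8 gives p = 1/256.
print(p_value(8, 8))     # 0.00390625
# A listener who is genuinely right ~75% of the time needs more trials:
# 12/16 correct is the classic borderline case, p ~ 0.038.
print(p_value(12, 16))   # ~0.0384
# With only a handful of listeners and one or two presentations per
# product, scores like 3/4 (p = 0.3125) prove nothing either way.
print(p_value(3, 4))     # 0.3125
```

One concrete consequence: with four trials or fewer, even a perfect score cannot reach p < 0.05, so a very small sample literally cannot demonstrate anything.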

I’m also wondering how you avoid remembering previous audio components in your evaluations. My understanding is that memory consists primarily of evaluating our experience by its best and worst moments – the highs and lows – rather than by the process or duration of the experience. Our total experience of audio components will therefore be built from the memory of the components we loved the most and the ones we loathed the most, irrespective of whether they were evaluated objectively via measurements, subjectively by listening, or a combination of both. Given that memory is shaped by these highs and lows and dominates our experience, overlaying itself onto our current experience (we are always subconsciously comparing our immediate state with a remembered one), I’m trying to understand how one can participate in the evaluation of an audio product with any sense of objectivity when they come to the process with over 4,000 previous experiences through which they’ve accumulated an already pre-evaluated set of biases based on memory. At what point is an individual able to “un-remember” those experiences and approach a test with no biases so as to be completely objective? Surely someone with 4,000 experiences of testing and evaluating audio components would significantly skew the results of a gathering of test subjects where the sample group was already small?

I’m not trying to be provocative or to question your credibility – I’m just trying to better understand the utility of objectivity in the evaluation of audio components, given that it seems to me the reason (most) people listen to music on hi-fi systems is to increase their enjoyment of music and create memories they rate as worthwhile. That’s to say, I can’t imagine anyone auditions or buys a component because they want to diminish their enjoyment of music.

Indeed, the reason I bring up the effect of memory on the evaluation of audio components is that I’ve found almost nothing in my reading on double-blind testing that talks about memory of past experiences influencing a participant’s ability to remain objective, and/or whether the whole point of blind testing is to elevate the process of the test above the individual’s ability to create memories based on highs and lows, thereby averaging out the results and reinforcing the notion that there are no significant differences between components.

Looking forward to your thoughts!
 

When I began working for Hi-Fi Choice (1992), we were typically blind testing between 16 and 20 products per month. These would be from a single category (CD players, for example) and not limited to price band. The test group would be played to a panel of listeners (typically between three and five) and each DUT would be given a test-consistent presentation of recordings designed both to represent as many musical genres as possible and to be good for detecting differences.

Strict level matching (to within 0.1dB) was mandatory for any form of audio electronics. It's harder to achieve anything like the same degree of level matching with loudspeakers, however, and best-case matching got to within 1-2dB. Most devices would be given a single presentation, although usually two or three within each group would be resubmitted during the test to determine if the listeners were achieving some form of consistency.
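To put the 0.1dB figure in concrete terms, level matching of this kind reduces to a gain-trim calculation. A rough numpy sketch follows; the synthetic noise signals stand in for measured device outputs and are my assumption, not the magazine's documented rig.

```python
# Rough sketch of trimming one device's gain so its RMS level matches
# a reference to within 0.1 dB. The synthetic signals are stand-ins;
# a real rig would measure both devices with the same excitation.
import numpy as np

def rms_db(x: np.ndarray) -> float:
    """RMS level in dB (relative to digital full scale)."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

def match_level(reference: np.ndarray, device: np.ndarray,
                tol_db: float = 0.1) -> np.ndarray:
    """Scale `device` so its RMS level matches `reference` within `tol_db`."""
    delta_db = rms_db(reference) - rms_db(device)
    matched = device * 10 ** (delta_db / 20)
    assert abs(rms_db(reference) - rms_db(matched)) < tol_db
    return matched

rng = np.random.default_rng(0)
ref = rng.normal(scale=0.10, size=48000)    # pretend reference output
dut = rng.normal(scale=0.13, size=48000)    # pretend DUT, ~2.3 dB hot
print(round(rms_db(dut) - rms_db(ref), 2))                    # before trim
print(round(rms_db(match_level(ref, dut)) - rms_db(ref), 3))  # ~0.0 after
```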

Audible memory isn't a big problem; in fact, it's a bonus if the tester is able to listen past their personal preferences. But the test protocol can have a broad tendency to prefer strong flavours as a result, and the product that does the least harm can be the one that comes roughly eighth in a group of 16. The prevailing problem with any such test is determining limits on the number of tests possible in any given session by any given listener; this varies from person to person and from session to session, but we usually found that between three and four devices per morning or afternoon session were universally tolerated.

Scientifically speaking, it's fairly easy to pick holes in this. The test should be conducted double-blind, as single-blind still allows the administrator to lead (consciously or otherwise) the listeners. The listeners should test one at a time instead of as a group, since group listening can let strong characters steer the test. The number of presentations per product and the number of listeners on the test panel are chronically undernourished for forming anything objectively constructive, etc, etc.

There's very little on audible memory in double-blind tests for a reason, because the psychoacoustics suggest there is no such thing, beyond a few fleeting seconds of being able to hold a sound in our heads. There is no mechanism for long-term storing of sound quality - organised musical structures, speech patterns and noises, yes. But the more subtle temporal, tonal or spatial patterns are not stored. This invites the question of how you can possibly say you prefer one particular interpretation of a piece of music; your audible memory is so short you should be unable to tell one complete Beethoven movement from another, because by the time you listen to the second version of the movement, you will have forgotten the subtle cues of the first. As such it might be that saying "I prefer the Kleiber version to the von Karajan" is as objectively meaningless as saying "I prefer the Arcam amp over the NAD".

I know all this because I made the not-quite-employment-savvy step of taking time out in the mid-1990s to explore the concept of the magazine running double-blind tests, and to see if the results would be of benefit to the magazine. The fact it took the better part of two weeks to extract two days' worth of listening tests from a suitably wide sample of listeners wasn't a good start. The fact it concluded a device performed identically to a rival product three times the price didn't actually help either. We dummied up a copy of the magazine with both this test and a regular blind test of the same products, and showed it to both non-readers and regulars. The first group felt that the conclusion from the DBT was 'honest' but also 'useless', as it didn't definitively tell them what to buy compared to the existing blind test. The same DBT material presented to a focus group of subscribers was met with fury at the suggestion that two different products could possibly sound similar. Ultimately, I was told in no uncertain terms by the suit-level corporates that "this is a magazine, not a science project", and was tasked with sitting in on blind panels of about 150 cables as a penance.

If we are talking about eliminating cognitive biases, the one bias DBT is absolutely unable to account for is the 'backfire effect': the result of suggesting a move to more stringent tests was that the 'after-party chat' from those focus groups raised the idea of changing the existing tests. This resulted in price-specific blind tests, a significant side-lining of measurement (and the frankly odd concept of 'group-averaged measurement' being introduced) and an increased number of sighted tests running alongside the (smaller) blind groups. Whether this was due to cost-cutting or the responses of the focus groups is unclear, but these changes came from on high relatively soon after the focus groups, and we had been left to our own devices prior to this.
 

So to summarise, Alan: is it correct to say that one of the reasons DBTs were abandoned was that the finding that low-cost equipment sounded just the same as more expensive gear was anathema to everyone concerned, including the readers?
 

Hi Alan,

Thanks for your reply – I really appreciate it. I understand the impetus behind needing to evaluate a number of products quickly to meet deadlines and print dates, and that the methodology has its limits, which correlates with my understanding of memory favoring events (products) that produce the greatest highs or lows (“flavours”, as you put it). That the product that “does the least harm” would end up in the middle of the bell curve is no surprise.

I still find it difficult to understand how anyone can claim to remain objective as their exposure to and experience of audio components increases. My father visited me a number of years ago when I had Living Voice OBX-RWs and I played him Hotel California (his choice; he loves the Eagles). He said, “Gosh, it sounds so clear.” Not detailed, transparent, textured, dynamic, palpable, scary-real, close to a live mic feed or any other buzzword. He said “clear”. Why? Because he’s not an audiophile. He lacks the lexicon with which to describe prerecorded music played on a hi-fi system. Most people on this forum, though, probably understand what those words represent, and actively employ them when discussing their systems. The lexicon of a subculture is highly suggestive of what that subculture values. It suggests that when we develop a sophisticated lexicon, we have created (consciously or subconsciously) a value system enabling us to make decisions and judgments about the mechanisms by which that subculture expresses itself. Once that is learned, and especially once it has become culturally entrenched, it is almost impossible for it to be unlearned. Therefore, any seasoned audiophile going into a double-blind test already has biases acquired through language. “Warm” is a bias, a judgment, a perception. “Detailed” is a bias. “Linear” is a bias, and we carry all of them with us all the time.

The problem, from where I sit, is that continued experience is likely to reinforce biases (to make us recall easy descriptors), rather than remove them. I find it difficult to fathom how a seasoned audiophile can claim to be objective in the evaluation of an audio product when their evaluation of that product is always going to be informed by their previous experience and their use of language in articulating their experience. No one is approaching this for the first time. We’ve heard too much and formed too many opinions we’ve expressed through phrases like “palpable midrange bloom” and “razor sharp bass”.

We can, of course, retreat to solely relying on measurements, but I know of only one or two people who have assembled a system based on Floyd Toole’s white papers for Harman International.

My current understanding is that memory will force us to go for the path of least resistance first and foremost, and for the audiophile, that means recalling previous highs and lows and articulating their experience through a common culturally-prescribed set of values, predisposing the individual to confirmation bias, backfire effect, overconfidence, etc. That doesn't mean critical listening is impossible, but I could not confidently put myself in a camp where I believed I was above falling into the above traps. I mean, I like vinyl, valves and horns - what would I know?


There's very little on audible memory in double-blind tests for a reason, because the psychoacoustics suggest there is no such thing, beyond a few fleeting seconds of being able to hold a sound in our heads. There is no mechanism for long-term storing of sound quality - organised musical structures, speech patterns and noises, yes. But the more subtle temporal, tonal or spatial patterns are not stored. This invites the question of how you can possibly say you prefer one particular interpretation of a piece of music; your audible memory is so short you should be unable to tell one complete Beethoven movement from another, because by the time you listen to the second version of the movement, you will have forgotten the subtle cues of the first. As such it might be that saying "I prefer the Kleiber version to the von Karajan" is as objectively meaningless as saying "I prefer the Arcam amp over the NAD".

I understand the science of psychoacoustics only on a very superficial level based on what I’ve read. However, I suggest that it can only be at best an incomplete science due to two behaviours I’ve witnessed consistently during the thirty years I’ve been playing, engineering, mixing and (occasionally) mastering music (though only as an amateur).

The first relates to our “inability” to store tonal values. Aside from my own experience, the prevalence of individuals possessing “perfect pitch” – that is, the ability to name individual pitches either on their own or in clustered chords, sing notes accurately without a prior reference, identify when an instrument is out of tune without a prior reference and, bizarrely, name the pitch of a non-musical sound like, say, an alarm – has been well documented and scientifically validated many times over. Absolute pitch is a function of higher-level cognition, thought to occur through the suppression of lower-level brain function, which is why perfect pitch is found in a higher number of individuals with optic nerve hypoplasia or autism. While it is difficult for an individual not born with a predisposition toward perfect pitch to learn it without considerable effort and a high level of maintenance, it’s possible, albeit rare. Anecdotally speaking, I’ve worked with a few who possessed it and it never ceased to amaze me. (I’m okay with notes, chords and keys, but only in the context of playing music and organized musical structures – a guitar tuned to Drop D is an easy one to identify as I’ve played it a lot.)

The second relates to our “inability” to store temporal patterns. I’ve done this a few times: the drummer will be in the studio and will be fed a click track through headphones. They’ll play a few bars with the click and then we’ll turn the click off in the phones, having them continue to play, recording both the drummer and the click into Pro Tools (or Logic, or whatever…) with no music – just the drums on their own. It’s almost impossible to do (and I can’t), but I can say without any exaggeration that a few drummers I’ve worked with will maintain the tempo exactly, without wavering, for several minutes at a time. Not only that, when we match the waveforms they’re so on it’s ridiculous.
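For what it’s worth, the “match the waveforms” check can be put into numbers by comparing each recorded hit against the muted click grid. A quick sketch follows; the onset times are invented here, where a real check would extract them from the recording with transient detection.

```python
# Quick sketch of the "match the waveforms" check: compare recorded
# drum hits against the muted click grid and report timing drift.
# The onset times are invented; a real check would extract them from
# the recorded audio via transient/onset detection.
import numpy as np

bpm = 120.0
beat_s = 60.0 / bpm                          # 0.5 s between clicks
grid = np.arange(32) * beat_s                # the (muted) click positions

rng = np.random.default_rng(1)
hits = grid + rng.normal(scale=0.004, size=grid.size)   # ~4 ms jitter,
                                                        # no systematic drift
offsets_ms = (hits - grid) * 1000
print(f"mean offset: {offsets_ms.mean():+.1f} ms")
print(f"spread (sd): {offsets_ms.std():.1f} ms")
# Slope of offset vs time: ~0 ms/s means the drummer held the tempo
# rather than gradually rushing or dragging.
drift_ms_per_s = np.polyfit(grid, offsets_ms, 1)[0]
print(f"drift: {drift_ms_per_s:+.2f} ms/s")
```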

Both of these suggest (to me) that our understanding of psychoacoustics and auditory perception is limited but developing. Given that neuroscience is still in its infancy, relatively speaking, I prefer to take a heuristic approach to hi-fi lest I paint myself into a corner and need to recant my beliefs. (I once owned a full-blown Naim system, so I’d prefer not to have to join a cult again in the near future…)
 
So to summarise, Alan: is it correct to say that one of the reasons DBTs were abandoned was that the finding that low-cost equipment sounded just the same as more expensive gear was anathema to everyone concerned, including the readers?

No.

The increased time and expense involved in running DBTs was the main killer from an editorial perspective. You can just about perform one well-constructed DBT in the time it takes to perform sixteen regular blind tests, which meant (at our best estimate) we would need to factor in somewhere between 2x and 2.5x the amount of time needed to deliver the 20 pages required each month. That would require a significantly greater financial outlay for each group test, because you are going to have to pay for the extra days spent running the test and processing the results, the greater manpower, etc, etc.

However, if the additional financial burden of the tests had resulted in the magazine being perceived as more honest by the readership, we would probably have gone for it at the time. My plan was that if the increased legitimacy was reflected in increased sales, it would put us in line for agency advertising (which tends not to be as mercurial as direct B2C advertising pitches), and I'd have got the editor's chair as a result. Instead, it played badly to non-readers and even worse to the subscribers, and we would have ended up with a product that cost more to make and was both less liked and less well understood by new and existing readers alike. That was the coup de grâce, both to the idea and to my aspirations of editorship on that magazine. I had to work across two titles and wait until the magazine was sold to another publishing house before my career path began to trend upward again.

In fairness to my past paymasters, my scheme would have made the audio tests the second most expensive accounting line in all of editorial, after international travel for Maxim photo shoots. It also played worse in focus groups than almost any other project put forward that decade. So had they just gone with my suggestion without testing it, by the time the magazine was sold four years later, it would have been hundreds of thousands in debt. Which cast me in the 'alpha geek' role, but not as someone who could be given a mag to play with.
 
Hi Alan,

Thanks for your reply – I really appreciate it. (...)

Sadly, for the audiophile there is no easy answer or resolution to these questions. A lot can be written off out of hand as 'wishful thinking' or 'delusion' by those who dismiss much of the audiophile canon. But that puts you in a netherworld, between 'there is no answer' and 'there is nothing to question'.

The usual response is to get the reviewer to walk a mile in someone else's shoes. If you encounter a product you don't like, is it because it is bad or because it is not for you? The former should be fairly easy to spot... in theory. The difficulty with the latter is that it creates the current "everything's wonderful" audio-relativity mindset, which I don't think is a satisfactory solution, despite its current prevalence influencing my own writing.

I know with some fair certainty that someone who likes, say, Magneplanar speakers is unlikely to like Wilson Audio to the same degree (and vice versa). So a reviewer who 'gets' one shouldn't be able to understand the joys of the other to the same degree. However, currently that's precisely what we are supposed to do. That makes for insipid reviewing, and creates a culture of there being so little actual criticism that the least negative observation can be crushing to the success of the design.

As for the psychoacoustics model we currently have, it's pretty complete in terms of the hardware and the physics, but might well benefit from improved understanding of the squidgy grey lump between the ears. At the moment though, it's still a bit "where's my hoverboard?"
 
Thanks for that, Alan. It's no big surprise that conducting a statistically valid DBT is a much more involved process than simple blind listening. What seems to be lost in the desperate effort to discredit anything that goes against conventional audiophile wisdom or personal belief is that even very casual blind listening removes a whole set of biases that plague audiophile opinion (brand, price, looks, physical configuration, technology....). Close your eyes. If you don't know what you're listening to, then you actually will be trusting your ears; maybe for the first time. Dismissing the entire experience over small methodological flaws, or insisting that a single, dubious control is the only key to validity is just whistling in the dark.

Can you completely bugger a blind listening test to the point that it has no more value than sighted listening? You can get close, but it's not easy. You mentioned one way -- do it in groups and encourage open discussion of what they're hearing before or as impressions are recorded, and it will degrade into a badly-run focus group. Another way to make blind listening as prejudiced and ineffective as sighted listening would be to use a moderator so unprofessional and personally biased that he leads the listeners to the conclusions he wants them to reach.

Hi-fi salesmen do this all the time: tell you what you're going to hear, play the system, tell you what you heard. The more expensive the equipment, the better it works. But the petty objections to blind listening that are common on audiophile forums? They don't come close to making blind listening as useless as any sighted listening test. The latter really doesn't even deserve the word "test."

Tim
 
For the record, I spent most of the 1990s and the first half of the 2000s either running, administering or overseeing blind, level-matched tests, or occasionally sitting on the listener panel. At a rough guess, that means I've been directly involved in level-matched, blind testing of about 4,000 audio products in my time.

(...)

Alan,

What was the main purpose and what type of report was typically produced at these blind tests?
 
Thanks for that, Alan. It's no big surprise that conducting a statistically valid DBT is a much more involved process than simple blind listening. (...)

Unfortunately, the simple blind test often suffers from 'shot from both sides' syndrome. I can understand the reasoning on both sides (a blind test can be seen to give a false veneer of 'science', while people who consider themselves experts are reluctant to admit that what they heard actually had more to do with what they saw), but the reactions do seem somewhat militant.

It's why I've long since revised my position to say DBTs are a vital part of the process, but their importance wanes the closer you get to someone choosing the products they will live with. DBT is not a commercially viable means of mainstream end-user product selection, because it runs so very counter to intractable and fairly fundamental human-nature issues, such as getting people to disbelieve their ears in order to choose something for listening.

However... yes, it's possible to mess up a blind test. Aside from the person running the test actively trying to stack the deck, it can be beset by things like a failure to pay careful attention to level matching, which can end up making the product that is as little as 0.5dB louder than its rivals sound 'better'. That can be a problem if you are referencing level against a 1kHz tone and one product has a small boost or cut there (in fairness this is unlikely in solid-state electronics, but not so impossible with LP). Even the position of presentation can have its influence (if you are testing four products in a row, the second one is most likely to be seen to perform best, by sheer virtue of its place in the pecking order).
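That 1kHz pitfall is easy to demonstrate numerically. Here is a toy sketch; the five-tone 'music' signal and the 0.7dB response step are invented for illustration, not taken from any measured product.

```python
# Toy demonstration of the 1 kHz reference-tone pitfall: two devices
# that measure identical at 1 kHz can still differ in broadband level.
# The five-tone "music" and the 0.7 dB response step are invented.
import numpy as np

fs = 48000
t = np.arange(fs) / fs
freqs = [100, 300, 1000, 3000, 10000]    # crude multitone stand-in

def output(gain_db: dict) -> np.ndarray:
    """Sum of test tones, each scaled by a per-frequency gain in dB."""
    return sum(10 ** (gain_db.get(f, 0.0) / 20) * np.sin(2 * np.pi * f * t)
               for f in freqs)

def rms_db(x: np.ndarray) -> float:
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

flat = output({})                                    # device A: flat
hot = output({f: 0.7 for f in freqs if f != 1000})   # device B: +0.7 dB
                                                     # except at 1 kHz
# Matching levels on the 1 kHz component alone applies no trim (both
# devices sit at 0 dB there), yet the broadband levels differ by
# ~0.6 dB -- more than the 0.5 dB said to be enough to sway a panel.
print(round(rms_db(hot) - rms_db(flat), 2))
```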

While I don't subscribe to the 'blind tests are too difficult to be reliable' argument, it is worth noting that many people not well versed in the test procedure show an early diminution of discernment. This tends to go away quickly, but the first reaction to removing the sighted element from a test is almost 'awaiting further input', leading to 'I can't hear much of a difference'. A good check here is to use things that are demonstrably, wildly different and chart the reaction - often for the first couple of presentations, the differences are minimised.
 
Alan,

What was the main purpose and what type of report was typically produced at these blind tests?

Typically these were complete round-ups of a given category, as best we could run them. So for example, you'd have a group of 20 integrated amplifiers, the cheapest being a NAD, the most expensive a Gryphon. The survey was originally conducted without obvious winners (if a product performed well in a category, it would receive a Recommended tag, and if it represented good value for money it achieved a Best Buy badge, but at first there was no demand for those badges to be in every issue).

By the time I arrived, this had already shifted toward insisting each group have at least one Best Buy and one Recommended, and these increasingly came to be considered the winners of that group. This came about by public demand. Also, in the public view, the Recommended tag (which meant outstanding performance) had come to be less important than the Best Buy (which meant good value). When challenged about this, the most common response we got was "it says 'Best', so it must be the best". We tried to retain the differentiation between the two, but were already fighting a losing battle.
 
Unfortunately, the simple blind test often suffers from 'shot from both sides' syndrome. I can understand the reasoning on both sides (a blind test can be seen to give a false veneer of 'science', while people who consider themselves experts are reluctant to admit that what they heard actually had more to do with what they saw), but the reactions do seem somewhat militant....

It's why I've long since revised my position to say DBTs are a vital part of the process, but their importance wanes the closer you get to someone choosing the products they will live with.

I understand the reasoning that a blind test can give the false veneer of science, what I don't understand is dismissing even casual blind listening in favor of sighted listening, with all of the most common, most influential biases intact, and insisting that is the better approach. You don't really trust your ears if you are unwilling to close your eyes.

I'm tempted to take the second part of your quote above a step further and say that people other than scientists shouldn't really try to use blind listening to determine preference (I'm tempted; I'm not sure). Use it to determine if you can actually hear a difference; when you find a consistently audible difference between amps/DACs/cables, whatever, go ahead and open your eyes and choose what you like. Even if you can't differentiate them, open your eyes. If the more expensive DAC is more beautiful, and that matters to you, and it's worth it to you, pay the premium. But know what you're paying for.

The rarity of blind listening in product reviews? That's very unfortunate.

Tim
 
Typically these were complete round-ups of a given category, as best we could run them. (...)

Alan,

I consider this type of round-up to have been one of the sad things in audio reviewing. Most of the time you would be testing synergy between the units under test and the system used for the tests; using different systems, optimized for each unit, could produce different results. E.g. the excellent Michell GyroDec with the acrylic platter was tested with a Rega RB300 and was not recommended because it "lacked strong dynamics and moderate resolution", while the Linn Sondek + Ittok LVII was recommended. IMHO a change of arms would have changed the result. Or curious results, such as the cheap Technics SLDD33 getting a Best Buy and a Recommended simultaneously, while the conclusions said that this player stresses appearance, build quality and features at the expense of sound quality, with a sound that is respectable enough, if rather uninvolving.

But what could you expect from a monthly magazine that reviewed over 20 turntables, 20 tuners and 20 headphones in a single issue? Most of the time the best part of the magazine was not the reviews but the introductions to them, the visits, the interviews and the perspectives on audio matters.

BTW, I think the situation I describe was from before you joined Hi-Fi Choice. Although I offered more than 200 pounds of audio magazines to a good friend long ago, I still keep a few issues of HFC from that period for sentimental reasons - mainly The Collection and the interviews with famous people.
 

This does highlight the confusion and the problems faced with any kind of test like this, in part because it's an attempt to meet both subjective and objective concerns, forgetting that these are often so polarised as to be functionally irreconcilable.

I agree that the results of a large group test come down to compatibility issues as much as absolute performance. But attempts to resolve compatibility issues create as many problems as they fix - do you rebuild the system between every test to accommodate the idiosyncrasies of each device under test? If so, when does it cease to be 'accommodation' and start to be 'hiding'? Or do you build a system that limits any potential compatibility issues (if that's even possible), potentially making a system that's good for everything but enjoyed by no-one?

Just like the reviews you list, in hindsight.
 
I am reading and reading what Alan is posting (and appreciating his time and dedication), and this question keeps popping up in my mind:
Can we totally detach ourselves from the emotional impact that listening to music provides?

How much influence do our emotions have on the overall assessment in blind listening tests?
Emotions travel with time, too; even from second to second they vary.
And I'm talking about music here - music we listen to, played by real artists (singers/songwriters/musicians) - during audio blind tests.

Do you see what I'm getting at? ...I won't explore it any further at this time, but I'll come back to it.
For now I'm just asking the question above, because it could be very important to the subject of this thread. ...I believe.
 
I understand the reasoning that a blind test can give the false veneer of science, what I don't understand is dismissing even casual blind listening in favor of sighted listening, with all of the most common, most influential biases intact, and insisting that is the better approach. You don't really trust your ears if you are unwilling to close your eyes. (...)

Well, my friend, this is my opinion. Systems are systems. Plug-and-play is a wish we all want granted. To maintain intellectual honesty, both choices have to be given a fair shot. In this day and age, where so many mass-market products sound very good, practically anything inserted in a system (that isn't broken or defective) that doesn't sound as intended the first time around can be made to sound close enough to the intention by addressing other variables: acoustic treatment, DRC, EQ, speaker set-up - the list goes on and on. All this is rather hard to do without accidentally sneaking a peek at what it is you are testing. To make things worse, the intended performance is anchored to personal preference. Not everybody likes flat. Integrating the component the way you like it is not something that can be delegated. It's not wrong, it's not worse. It is just not easy to do, even assuming you can get both items in your own room at the same time at all and optimize them both in a short enough time span.
 
I am reading and reading what Alan is posting (and appreciating his time and dedication), and this question keeps popping up in my mind:
Can we totally detach ourselves from the emotional impact that listening to music provides? (...)

The simple answer is that it isn't that relevant. The listener's emotional connection should be with the music, not with either the devices used to play that music or the format that it's carried upon.

The more complicated answer is that it isn't that simple. The emotional baggage that goes into a mix tape or a CD given as a gift by a long-lost loved one has greater significance than either the music or the medium itself. As hobbyists, there is an attendant - albeit different - emotional baggage that surrounds the equipment used in the replay chain, akin to the attachment a fly fisher may have to their lures. It helps define our position as 'audiophiles' in a way that is both non-existent and incomprehensible to 'muggles'.

This is the elephant in the room with any kind of test, but especially blind or double-blind. The more forensic the audio test, the more it strives to eliminate the emotional connection that audiophiles prize. And unless or until the hobbyist paradigm shifts heavily away from that link with the technology - a link deeper than mere utility - I suspect there will always be a clash between those who understand that link and those who dismiss it as irrelevant.

As this link sprang up independently in the computer audio and headphone worlds long before audio's old guard got their hands on those categories, I suspect this peculiar not-quite-emotional connection with the technology has crossed the demographic fault line.
 

Great post.

Tim
 
