Blind test protocol inquiry

+1 IMO, listening time with the material before DBT is critical. In fact, most of the best DBT sessions I have been in or seen included an up-front session and discussion of differences followed by the actual DBT. These sessions were often separated by hours, days, or even a week or two (for scheduling).

A few short, well-known clips also help control listener fatigue, as does repeating clips throughout the test with breaks in between. It can be interesting (among other things) when the same clip is scored differently over the testing period...
 
Got it!

I have VERY valuable information at hand to conduct this test. This will surely be fun, and it could become more sophisticated as we all get more used to the process. I might even include video taping for future improvement, who knows...

I anticipate that personal tastes will surface in the evaluation process, even while using a format to keep answers within a certain range of "control"... we shall see....

At the end, I hope that I can still count this listening group as my friends :)

Later in the week I will share my track selection, the format to be used for the process, and the settings on my Viva amp (which has no remote option).
 
I am not so sure about the amplitude manipulation that Amir mentions. For one, you may induce listener fatigue.

I was thinking that too. Or if the source material has a lot of energy in the "harshness" range around 2 to 4 kHz, that will sound more irritating at higher volume levels.

--Ethan
 

Ethan, can you expand on that please? From my own experience, playing something 3-5 dB louder does not result in it being irritating or harsh; otherwise we would not have people who enjoy playing music on their systems both quietly and loudly. Well, to an audiophile their system is uber all the time!!! :)
Anyway, you could say the same about going to hear an orchestra in a very large hall; by the logic you're suggesting, sitting closer would be perceived as harsher.
Actually, thinking this through, the argument most people post is that louder comes across as better (due to the Fletcher-Munson curves we were talking about earlier in this thread), not worse :)
I appreciate that if you're comparing a comfortable level to loud (as in uncomfortable), what you say may be the case. Or were you thinking of low-to-normal levels as well?

Thanks
Orb
 
Come to think of it, you could establish a person's attitude towards loudness. You could simply run a trial at volume level A and then at B (+3 dB, for instance) and ask: 1) which do you prefer, ceteris paribus, playback A or playback B (7-point scale); and 2) which sounds better (same 7-point scale, A and B at opposite ends). This is to differentiate between preference and subjective rating. Just because you prefer something does not necessarily mean you think it is better. The key is: better at what, under what conditions? To be more specific, you might try:

Given that this is an experiment to test respondents' ability to listen to, appreciate, and understand the sonic differences, if any, between the proposed playback sets, how would you rate the following:
1) For the purposes of this experiment, I prefer the playback amplitude of volume A (B).
A follow-up may be that the respondent finds the volume too high/low for critical listening of certain passages, and this condition may vary by person.

2) I think that the sonic quality, all things considered, is markedly superior for volume A(B).

There is no need to adjust for volume between units beyond what is practical, provided you can isolate the independent relationship of sound level to preference and subjective assessment.
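A minimal sketch of how those two 7-point responses could be tabulated to separate preference from perceived quality; all trial data below are made-up placeholders, not anything from an actual session:

```python
# Sketch: separate "which do you prefer" from "which sounds better".
# Each trial records two 7-point responses (1 = strongly A, 7 = strongly B).
# The numbers below are hypothetical placeholders, not real session data.

trials = [
    {"preference": 6, "quality": 4},  # prefers B, rates quality near equal
    {"preference": 5, "quality": 4},
    {"preference": 2, "quality": 3},
    {"preference": 6, "quality": 5},
]

n = len(trials)
mean_pref = sum(t["preference"] for t in trials) / n
mean_qual = sum(t["quality"] for t in trials) / n

# A gap between the two means hints that listeners prefer one playback
# (e.g. the louder one) without actually judging it higher in quality.
print(f"mean preference: {mean_pref:.2f}, mean quality: {mean_qual:.2f}")
print(f"preference-quality gap: {mean_pref - mean_qual:+.2f}")
```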

And finally, to respond to Don's comment on prior discussion and listening: it's interesting that actual empirical DBTs were more fruitful. OTOH, a huge body of literature in psychology, cognitive psychology, and social psychology examines priming effects: people, when exposed to prior knowledge, tend to filter subsequent information according to that bias. This is subconscious, and it is one reason why, from an experimental-design standpoint, I would argue against discussing beforehand.

Bill
 
I'd like to emphasise the discussion aspect. It can be educational to just sit and observe what can happen when a discussion takes place.....

So make sure that there is NO discussion between people. After the results are in, then yeah, why not: a few drinks, a bit of a feed, and let them go for it. But you need to get the raw results first.

I have seen the most amazing about-face occur during the discussion phase! People come out after having had the most difficult time imaginable finding differences in a test... a few beers, lots of raised, excited voices, and a few hours later those differences were as clear as night and day, and only deaf Freddie could not have heard them!

AND it is only the drunken ramblings that get posted!! So to people not present it seemed easy to tell the differences, whereas I was there watching them struggle, all expressing amazement at how close it was as they initially walked out of the room.

Group dynamics, fun to watch. (So particularly beware the 'huge' differences reported when a group of people audition product A... those not in the know tend to feel that because 'they all said the same thing' it is stronger evidence.)

Just on the volume level difference: is it not true that we are talking about differences that are real, yet small enough that they are not perceived as different levels? I think that is the essential point, Orb; in other words, 1 dB or less, I suppose.
 
I would avoid...terms...such as "better" or "worse"...

Agreed :D
 
Hey Ethan, my previous post was a bit rushed as it was late and I was off to bed; without that beauty sleep, the morning look in the mirror would scare not only the GF but also me.
I am interested in whether you have any info on the harshness, because I am not yet convinced that the 'louder is better' argument used by some (or many at AVSForum) is necessarily true, as I have never seen any data backing it up.
So if you have anything, please share it; not looking to argue with ya :)

Thanks
Orb
 
Orb, there were gobs of studies in the '70s and '80s (and probably more before and since) about loudness affecting test trials. I do not have them handy, but there were AES and BAS (Boston Audio Society) papers, as well as papers in many other journals. I remember it because I always figured 1 dB was good enough, as very few folk could tell when I nudged the level by that amount whilst listening to music. However, testing (mine and others') showed otherwise. Switching back and forth on test loops, people could invariably (~100%) pick out the setting that was 1 dB louder, and IIRC the statistics were valid down to below 0.5 dB for at least some listeners. The sound was not always "better", due to the sources we used (a mixture of music, tones, and various colored noise), but subjects could reliably pick "A" or "B" based solely on volume. Hence the 0.1 dB number that is routinely thrown about today, to ensure level is not an issue. One thing that is an issue for us at home is that a lot of AVRs' volume controls do not have that fine a resolution, making precise level (amplitude) calibration more difficult.
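For reference, those dB figures translate to surprisingly small amplitude ratios. A quick sketch of the arithmetic, using nothing more than the standard 20*log10 definition (nothing specific to Don's tests):

```python
# Sketch: what a given level mismatch means as a linear amplitude ratio.
# dB (voltage/amplitude) to linear ratio: ratio = 10 ** (dB / 20).

def db_to_amplitude_ratio(db: float) -> float:
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)

for db in (1.0, 0.5, 0.1):
    ratio = db_to_amplitude_ratio(db)
    print(f"{db:4.1f} dB mismatch = {100 * (ratio - 1):.2f}% amplitude difference")

# Output:
#  1.0 dB mismatch = 12.20% amplitude difference
#  0.5 dB mismatch = 5.93% amplitude difference
#  0.1 dB mismatch = 1.16% amplitude difference
```

So a 0.1 dB matching target keeps the two sources within about 1% of each other in amplitude, well under the roughly 1 dB step that listeners in those tests could reliably detect.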

FWIWFM - Don
 
Hey Don,
Oh, I totally agree if it relates to cues/tells; look back, I mentioned that earlier myself :)
My point is very specific and relates to some of my earlier posts in this thread: I am talking about those who say louder sounds better, as opposed to louder may be preferred.
The two are not necessarily the same, especially if you ask the listener to state whether the sound quality is better. I doubt anyone would say their own sound system gives better sound quality when played 3 dB louder, but they may prefer the 3 dB louder playback.
In this case we are talking about auditioning/preferences (options about volume and its importance have already been given here), but I agree with what you're saying about the actual case studies: if you ask listeners to differentiate between A and B, perhaps to study noise/artifacts/etc., then some will cheat using cues, possibly without even realising it.
Boyk mentions such an example of a DBT A/B where one listener managed to score 100% but ignored the purpose of the actual test; this was due to a very subtle cue/tell, hehe, the sneaky cheat :)

Cheers
Orb
 
I have not decided which tracks I will use next Wednesday, but I came up with the following parameters to evaluate on the 1-7 scale mentioned earlier:

Transparency
Dynamic range
Truth of timbre
Soundstage and space recreation
Imaging

Finally, a binary A-vs-B response that will carry a 50% weight.

Are these parameters too vague?
Should I use more specific terms?
Am I missing an important variable?

Maybe an explanation for the above parameters could help.
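A minimal sketch of how that 50/50 weighting might be computed per listener and source; the normalisation is my own guess at the intent, since the post doesn't spell out a formula, and the ratings are hypothetical:

```python
# Sketch: combining the 1-7 attribute ratings (50%) with the binary
# A/B pick (50%) into one score per source. The normalisation here is
# a guess at the intent; the post does not spell out the formula.

attributes = {"transparency": 5, "dynamic_range": 6, "timbre": 4,
              "soundstage": 5, "imaging": 6}  # hypothetical 1-7 ratings

picked = True  # listener chose this source in the binary A/B question

attr_score = sum(attributes.values()) / (7 * len(attributes))  # 0..1
binary_score = 1.0 if picked else 0.0
total = 0.5 * attr_score + 0.5 * binary_score

print(f"attribute part: {attr_score:.2f}, total score: {total:.2f}")
```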
 
While that is a good list for a general audio comparison, I am not sure those parameters work as well for differences between two digital sources. For example, there is no way the dynamic range would be any different; anything noted would be imaginary :). Both will get just as loud, assuming you have matched the levels. Transparency is impossible to rate since there is no gold reference that describes the original. I don't know what "truth of timbre" means :).

Soundstage and space recreation seem to overlap.

The main difference I hear between digital sources is how smooth they sound, how well they resolve low-level detail, and tonal quality (bright, or not). How much space there is around instruments is also another difference which you may have covered in one of your metrics.

While I am typing this, let me say that I am not sure about the 1 to 7 scale. That works in ABX, where you know the original and then rate the clip under test relative to it. You don't have that situation: you have one system being compared to another, with no point of reference. All one can score is one relative to the other, not relative to an absolute. In that case, you can ask me whether there is more air around instruments in A or B. But asking me to rate A from 1 to 7 when I hear it first wouldn't make any sense and would simply garner a random number.
 
I think that you can rate things independently by creating a scale which offers polar opposites. The idea is that you are evaluating the unit for perceived sonic attributes, each with its own set of results. Then you can run a multiple regression to see whether certain features explain certain outcomes (preference). This is how you extract an underlying sentiment or attitude that may not be obvious. Amir's point regarding the lack of a true benchmark is an operationalization issue that can perhaps be overcome by carefully thinking through your measures (which should be bipolar). Also, you can add a couple of questions that look for normative judgments. Thus you can see whether, for example, liking things which you independently rate as having some sonic quality X is an independent variable that drives your stated preference. This is the point of inferential rather than conscious evaluation. However, you can do the direct comparison... but that raises other questions: bias toward first impressions, memory-retrieval problems, etc.
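A minimal sketch of the kind of multiple regression Bill describes, with made-up attribute names and ratings; a real analysis would need many more respondents and proper significance testing:

```python
import numpy as np

# Sketch: regress stated preference on independently rated sonic
# attributes to see which attributes drive preference. All ratings
# below are hypothetical placeholders on 1-7 bipolar scales.

# columns: smoothness, detail, brightness
attributes = np.array([
    [5, 6, 3],
    [4, 4, 5],
    [6, 6, 2],
    [3, 3, 6],
    [5, 5, 4],
    [2, 3, 6],
])
preference = np.array([6, 4, 7, 2, 5, 2])  # overall 1-7 preference

# Add an intercept column and solve ordinary least squares.
X = np.column_stack([np.ones(len(preference)), attributes])
coef, *_ = np.linalg.lstsq(X, preference, rcond=None)

# Larger (absolute) coefficients suggest attributes that explain
# more of the stated preference in this toy data set.
for name, c in zip(["intercept", "smoothness", "detail", "brightness"], coef):
    print(f"{name:>10}: {c:+.2f}")
```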

Bill
 
Or you could consider what Sean Olive did when he compared various digital room correction products to correlate objective measurements with listeners' preference/perception. As Mimesis mentions, and I touched on earlier, this involves a descriptive scale with opposites at each end (from bad to good; in their comment scale, colored is the worst while neutral is the best).
However, some descriptors will need to be changed for the reason Amir gave, as not all of them apply to the digital world.
To requote my post that covered the Sean Olive part: what I added was the bit about noting whether one feels fidgety/relaxed/etc., but they used two sections, a like/dislike scale and a descriptive/comment scale.
Using a Harman Kardon study as an example, they had the following preference scale:
1 Strong dislike, 3 Dislike, 5 Neither like nor dislike, 7 Like, 9 Strong like
For subjective descriptions this is going to be tough, but it could include some of the ones used in the HK study (furthest right is better): Colored, Harsh, Thin, Muffled, Forward, Bright, Dull, Boomy, Full, Neutral.
And then you could also note behaviour, as I briefly touched on, so define a set around or including: attention wandering/not interested, causes fidgeting, wanting to flick through the music and finish quicker, versus the urge to keep listening and play longer, relaxed, interested, with focus drawn to the music. (Again, far left is the worst while far right is the best.)

Whatever you use, I feel the key is to ensure that the scale runs from worse to better, and that each descriptor reflects its opposite to some extent, as shown in Sean Olive's and Floyd E. Toole's own studies and papers.
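One way to encode such worst-to-better bipolar scales so every response sheet uses the same anchors; the labels come from the post above, while the wrapper itself is purely illustrative:

```python
# Sketch: encoding bipolar (worst-to-better) rating scales.
# Labels follow the post above; the dataclass wrapper is illustrative.
from dataclasses import dataclass

@dataclass
class BipolarScale:
    name: str
    worst: str   # label at the low end (1)
    best: str    # label at the high end
    points: int  # number of scale steps

scales = [
    BipolarScale("preference", "strong dislike", "strong like", 9),
    BipolarScale("timbre", "colored", "neutral", 7),
    BipolarScale("engagement", "attention wandering", "urge to keep listening", 7),
]

def validate(scale: BipolarScale, rating: int) -> int:
    """Reject ratings outside the scale before they enter the results sheet."""
    if not 1 <= rating <= scale.points:
        raise ValueError(f"{scale.name}: rating {rating} outside 1-{scale.points}")
    return rating
```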

Cheers
Orb
 
I understood and agree with your comments above: first, that the parameters the test focuses on might be reconsidered for a digital source; and second, that the 1-7 scale might work better as a slider showing where each variable is perceived within, let's say, a "range of colours" rather than as a numeric preference setting.

That was also the reason I placed a 50% weight on that portion of the evaluation template; the other 50% rests on a binary A-vs-B preference for each question.

I will work tonight on the track selection. Top-of-mind pieces include a piano session from Ramiro Rubalcaba, a trio from Jarrett & Co., early music (maybe from Trio Galanteri), a voice from Jacintha or Merchant, and a bass "trap" from Avishai Cohen or Metheny. Any suggestions here based on the variables Amir suggested?

Thanks folks! It is a very educational experience to hear from you all....
 
Hi Tom - I RIPPED my entire CD collection to WAV format for the moment. Thanks for your suggestion as well...
 
I like Tom's list. It is simple, quick, and less stressful than thinking of scales and such.

BTW, your thread is a great reference for others following it. So good thinking in creating it :).
 
Preliminary session results

It was a very interesting, educational, and fun event last night. I can share details with whoever is interested regarding the template used, the tracks selected, and pitfalls to avoid, based on this first experience conducting a blind test session.

Preliminary results showed that the Oppo/Nuforce/PC transport was selected as the better source over the iMac/Puremusic option by a 63%/37% ratio. The parameter the audience liked most for the iMac was Detail Retrieval, and for the Oppo it was Timbre. The most discouraging parameter for the iMac was Timbre, and for the Oppo it was Dynamics/attack.

I also used the 1-7 sonic-attribute preference list to weight each response in this column, looking to get a smoother curve in the analysis.
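One thing worth checking on the 63%/37% split is whether it could have arisen by guessing alone. A quick binomial sanity check, using hypothetical response counts since the actual number of responses wasn't posted:

```python
from math import comb

# Sketch: is a 63%/37% split distinguishable from coin-flipping?
# The actual number of responses wasn't posted, so N below is hypothetical.
def binom_p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

for n in (19, 30, 100):           # hypothetical response counts
    k = round(0.63 * n)           # responses favouring the Oppo chain
    print(f"N={n:3d}, k={k:3d}: p = {binom_p_at_least(k, n):.3f}")
```

With a small number of responses, a 63% split sits comfortably within chance; it only becomes statistically convincing with far more trials, which may be worth keeping in mind for future sessions.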

Thanks all for your input and suggestions, it was highly appreciated.
 
