So, this is now a busy and lengthy thread, but this is something I have thought long and hard about, so I feel compelled to contribute my two cents' worth.
With regard to the room: while there is no doubt that a room can negatively influence the sound (I am not sure about a room that positively influences the sound) and can significantly degrade it to the point of unintelligibility, the room degrades ALL sound more or less equally. A room that is bad for audio is also likely a room that is bad for conversation, and if you had live musicians in that room it would also be a mess. The point is that the room does not impact the REALISM of the production (in the case of speech or live performance) or reproduction. A person who sounds boomy and echoey in a room STILL sounds like a real person and is fully believable as a real person, despite sounding muffled or unclear from echoes and resonances in the room. There is still no mistaking that person for a recording or for something "synthetic" in some manner. So, with all due respect to the people who are determined to elevate the room, it is clear that it does not destroy realism per se. It might make a hash of the recorded acoustic space, and this could affect the believability of some recordings. I don't want to downplay it, but I don't think it is the fundamental issue in believability.
So, what about the loudspeakers? For sure they define a lot of the fundamental character of the sound through the sounds they produce, and by sounds I mean not only the intended sounds but all the UNintended ones as well. Unintended sounds include (not exhaustively): cabinet vibrations and resonance modes on panels, port resonances, material-specific re-radiation of sound, driver breakup and bending modes, crossover distortion, thermal distortion, thermal and dynamic compression, etc. All of these are distortions that will color the timbre, timing and dispersion of the sound. However, with the exception of the distortions and phase shifts from the crossover elements (these can be very significant and can destroy believability), the distortions from speakers are largely mechanical distortions that are generally low-order harmonics and not unknown in the natural world. Breakup modes, though, can damage believability because they are not excited all the time: they can affect certain passages of music and leave others untouched. It is lack of consistency that stands out to the human mind as "not right" or "synthetic". If something is consistently and constantly colored in a particular way, it doesn't take very long until the ear/brain (yes, it is an inextricably linked system) doesn't hear it as a coloration anymore. Phase shifts at crossover points can play havoc with believability, as can the transition between drivers of different materials covering the same voice or instrument...particularly if a large phase shift is right in the middle. Sadly, most speakers have this kind of flaw, including some of the most highly touted and expensive ones. That being said, I don't think that the speaker is the number one culprit in destroying playback realism. I would also lump other electromechanical transducers into this category (i.e. turntable/arm/cartridge).
From my experience and from a lot of reading on psychoacoustics, I have come to the conclusion that the electronics (I include here all digital sources, as they are mostly electronic in nature) are the number one destroyer of realism. Why? Because distortions caused by electronics and digital are wholly and completely synthetic and unnatural, with no precedent in the natural world. Our ear/brain did not evolve to make sense of them and consider them "real" sounds. This overlay of something totally alien to our evolutionary development stands out like a sore thumb, even at ridiculously low levels. It immediately stamps the playback as "artificial" in the majority of systems.
In nature, most overtones and harmonics are of relatively low order. The ear/brain's own distortion is low order until extremely high SPL. The ear/brain is a very advanced computer with regard to pattern recognition, which allows us to organize and make sense of our world quickly and efficiently. If the pattern does not fit the natural, expected pattern of, say, an oboe, then it will scream out to us "synthetic"!! Even worse, the pattern of distortion overlay that ALL electronics produce is both frequency and level dependent in some potentially mind-bogglingly complex ways. Intermodulation distortion, where the harmonics interact with each other and with the distortion made by the power supply (very few designs are clean of this), complicates matters even further. Now add in the fact that human sensitivity to harmonics INCREASES as the order increases (at least until the sensitivity of the hearing itself falls off at high frequencies, that is), as was demonstrated by D.E.L. Shorter of the BBC and more recently by Daniel Cheever.
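To make the intermodulation point concrete, here is a minimal Python sketch that lists the sum and difference frequencies a nonlinearity can create from two input tones. The function name, the 1000 Hz and 1100 Hz tones and the cutoff at 3rd order are my own illustrative assumptions, not measurements of any real amplifier; the sketch only shows where the products land, not how loud they are.

```python
def intermod_products(f1, f2, max_order=3):
    """List the frequencies |m*f1 +/- n*f2| with m, n >= 1 and m + n <= max_order.

    Hypothetical helper for illustration: it only computes WHERE
    intermodulation products fall, not their amplitudes.
    """
    products = set()
    for m in range(1, max_order):
        for n in range(1, max_order - m + 1):
            products.add(m * f1 + n * f2)      # sum product
            diff = abs(m * f1 - n * f2)        # difference product
            if diff > 0:
                products.add(diff)
    return sorted(products)


# Two assumed test tones, 1000 Hz and 1100 Hz
print(intermod_products(1000, 1100))
# -> [100, 900, 1200, 2100, 3100, 3200]
```

The difference products such as 900 Hz and 1200 Hz are not harmonics of either 1000 Hz or 1100 Hz, which is one way to see why this kind of product does not fit the pattern of a natural instrument. Swapping the second tone for a supply ripple frequency (say 120 Hz, again an assumption) shows sidebands clustering around the music tone instead.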
So, the amount is not nearly as important as the type. What Cheever has pointed out is that amps and other electronics will sound the least colored when the distortion can hide in the "blind spot" of human ear/brain perception. This is not a trivial thing to do, because the way the ear/brain makes and masks distortion is fundamentally opposed by most electronics designs. What do I mean by this? I mean that the ear/brain's self-distortion is monotonic and drops exponentially with increasing harmonic order (2nd higher than 3rd, 3rd higher than 4th, etc.)...but the absolute masking level is SPL dependent. There is virtually nothing above the 5th or 6th harmonic unless high SPLs are used. MOST electronics designs generate something very different. They generate low 2nd and 3rd order and often have a "picket fence" of low-level, but nearly equally intense, harmonics right out to the 20th and beyond. These do not match the ear/brain pattern, and even though they are at a low level, they still create the sense of "synthetic" in terms of timbre, imaging, soundstaging, etc. High-frequency distortion affects perceived loudness and can make objects sound closer than they are supposed to, thus giving a "flattened" image and soundstage.
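Here is a rough Python sketch of "amount versus type". It compares a monotonically decaying harmonic pattern against a flat "picket fence", first with plain THD and then with each harmonic weighted by n²/4, which is the weighting usually attributed to Shorter (I am assuming that form here). All the amplitudes are made-up numbers relative to a fundamental of 1.0, chosen only to illustrate the idea; they are not measurements of any real gear.

```python
import math


def thd(harmonics):
    """Plain THD: root-sum-square of harmonic amplitudes (fundamental = 1.0)."""
    return math.sqrt(sum(a * a for a in harmonics.values()))


def weighted_thd(harmonics):
    """Root-sum-square with harmonic n scaled by n^2/4 (assumed Shorter-style weighting)."""
    return math.sqrt(sum((n * n / 4 * a) ** 2 for n, a in harmonics.items()))


def decaying_spectrum(second=0.01, ratio=0.25, top=6):
    """Hypothetical monotonic pattern: each harmonic is `ratio` times the one before."""
    return {n: second * ratio ** (n - 2) for n in range(2, top + 1)}


def picket_fence_spectrum(level=0.0003, top=20):
    """Hypothetical 'picket fence': equal low-level harmonics from the 2nd to `top`."""
    return {n: level for n in range(2, top + 1)}


decay = decaying_spectrum()
fence = picket_fence_spectrum()

for name, spec in (("decaying", decay), ("picket fence", fence)):
    print(f"{name:12s} THD = {100 * thd(spec):.3f}%  "
          f"weighted = {100 * weighted_thd(spec):.3f}%")
```

With these assumed numbers the picket fence has roughly one eighth the plain THD of the decaying pattern, yet its weighted figure comes out several times higher. That is the sense in which a lower distortion number on the spec sheet can still be the more audible one.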
This electronic overlay, which comes from the phono preamp, the DAC, the preamp and the amps, conspires to destroy believability because of the utterly unnatural nature of electronic distortion, in ways that the electromechanical elements in a system and the room simply do not. Just pushing the distortion as low as possible is not really a solution, because of the hidden issues with the manner in which modern engineering goes about pushing it down. This is discussed at length by Pass, Cheever and Crowhurst. I won't go into what is probably the best solution at this time, but I think many of you can guess the direction it suggests.
I recommend reading the work of Geddes, Cheever, Nelson Pass (who seems not to always follow his own findings...I guess for marketing purposes), and Boyk and Sussman, as well as old articles from Norman Crowhurst.