I tend to espouse a philosophy of trying different solutions, and not succumbing to the WBF myth that there's anything that's "best". That's a nice catchy phrase for running this website, meant to suck you into wasting your time debating this issue, but in reality, nothing in life is easily described as "best" (is there a "best" country?).
Almost everyone points to frequency response (FR) as being of “uber alles” importance.
Fewer agree on whether step response and impulse response are very important.
Obviously (or not) if one has a perfect FR, then it would be nice to also have a perfect impulse response.
Once one gets a system into a room, the frequency response tends to get somewhat shot to hell, but the system usually still sounds pretty good.
Some people use DSPs to get the in situ FR more optimised, and others use something like Dirac to heal up and optimise the impulse response.
It is probably better to have the speaker closer to perfect to begin with, as there is often a limit to how much processing can heal things up.
I think we can compare this to things like a “best country” or best cities.
One always sees the usual places in the top 10… like Zurich, Melbourne, Toronto, Copenhagen, Helsinki, Oslo, etc…
(Usually things like whether women can walk around and live to tell the tale are quantified along with other metrics.)
Speakers are similar in that the good ones do not adulterate and destroy the signal, and the attributes that make them good or bad are also quantifiable.
So yeah, in a perfect world, the engineering applies the physics and things continue to get better.
And some corrections can be done electrically to beat down remaining non-linearity.
But there are still stacks of people who will equate high cost, high mass and complexity with “best”.
…
So in terms of loudspeakers, I have a lot of different models at home, and one of my favorites is this single-driver model handmade by Gordon Rankin of Wavelength, which is nicely encased in a bamboo cabinet and uses a single Altec 755C driver. No crossover. It sounds wonderful, but if you want to scare all the bats in your neighborhood with that hyper-bright metallic tweeter (my ears hurt just typing this phrase), this is not the speaker for you. Similarly, if you want to have your "pants flapping in the breeze" from the bass, one of the many silly phrases that Stereophile writers use to sell more copies, this is not the model. What it does, it does well, which is present music between 70 Hz and 10 kHz at a moderate volume so you can hang on to your hearing into old age. As Peter Walker said a long time ago, a speaker is great or lousy long before it gets to 10 kHz (I'm paraphrasing, not quoting him exactly). Walker's own solution, the ESL 63, which was only released in 1981 or so after almost 20 years of development, was considerably more complex, with delay lines to mimic this single-driver idea.
That ESL has a great impulse/step response.
Most speakers are not as good.
…
We are so far away from having a loudspeaker that can even resolve 16-bit audio, let alone 24-bit, which is probably mathematically impossible and lies outside what you can do with physical media. To resolve 24 bits, a loudspeaker would need a THD of roughly -140 dB. To put this in perspective, the best current loudspeakers struggle to achieve even -60 dB uniformly across 20 Hz to 20 kHz. The ESL 63 can manage to get below -70 dB only between 100 Hz and 15 kHz or so, and only at volumes below 90 dB. Most box loudspeakers are woefully bad at resolving information in the bass, and even famous professional loudspeakers like the JBLs have distortion of around -50 dB in the bass (about 8 bits of resolution). So, to me, the OP was right: the hardest problem in high-end audio remains loudspeakers (and of course room coloration). Building 100-pound media servers is a nice way to make money, but that's not the hard problem. Even the humble Eversolo DMP-A8 has all distortion below -130 dB.
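For anyone who wants to check the bits-versus-dB arithmetic in the post above, here is a minimal Python sketch. It assumes the usual rule that each bit doubles the amplitude ratio (about 6.02 dB per bit) and ignores the extra 1.76 dB term that applies to a full-scale sine, so treat the numbers as approximate. The function names are made up for illustration.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB for each doubling of amplitude


def bits_to_db(bits: float) -> float:
    """Ratio between full scale and one quantisation step, in dB."""
    return bits * DB_PER_BIT


def db_to_bits(db_below_full_scale: float) -> float:
    """Equivalent number of bits for a distortion floor given as a positive dB figure."""
    return db_below_full_scale / DB_PER_BIT


if __name__ == "__main__":
    for bits in (8, 16, 24):
        print(f"{bits}-bit needs a floor of about -{bits_to_db(bits):.1f} dB")
    for floor_db in (50, 60, 70, 130):
        print(f"-{floor_db} dB is roughly {db_to_bits(floor_db):.1f} bits")
```

On this rule of thumb, 24 bits works out to about 144 dB, and a -50 dB distortion floor is a little over 8 bits, which is in line with the figures quoted.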
We are not very sensitive to distortion in the bass frequency range.
And I doubt that we really need 140 dB of dynamic range in the speaker.
Our hearing dynamic range is usually listed as being 120 dB. However, there is the tensor tympani muscle, which means that when it is relaxed we can hear from around 0 dB up to maybe (who knows) 60-70 dB(?), and then as it tightens the range is more like 80-120 dB.
So it is a sliding window of dynamic range that is controlled by the muscle tension… and we do not hear at 0 dB and 120 dB simultaneously.
And there has been some good work done to make motors more linear, which reduces their distortions (IMD/HD), as well as work on mitigating cone breakup and diffraction.
I could envision that the -60 dB uniformity of the 40+ year old Quad could be improved by 10 dB or more using some digital correction scheme.
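To get a feel for what a 10 dB improvement means in the more familiar percent-distortion terms, here is a small Python sketch. It assumes the dB figures are amplitude ratios (the 20·log10 convention), which is the usual way THD is quoted; the function name is made up for illustration.

```python
def db_to_percent(level_db: float) -> float:
    """Convert a distortion level in dB (amplitude ratio, negative values) to percent."""
    return 100 * 10 ** (level_db / 20)


if __name__ == "__main__":
    for level_db in (-50, -60, -70):
        print(f"{level_db} dB is about {db_to_percent(level_db):.4f} % distortion")
```

On that convention, -60 dB is 0.1 % and -70 dB is about 0.03 %, so a 10 dB improvement cuts the distortion amplitude to a bit under a third.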
But then there is the pesky subjective listening, and the (almost) fact that a lot of people really like distortions.
Those high-distortion (“musical”) amps set themselves apart, and people seem to buy them.
…
Digital media servers and streaming are a solved problem, thanks to clever mathematics and engineering that was done a long time ago. Loudspeakers still require a lot of new science. But there's no money in it. The National Science Foundation (what will remain of it once DOGE cleans out the government) will hardly fund any basic research on loudspeakers. So, I am pessimistic we'll see any fundamental advances here beyond what we already have, despite marketing brochures aimed at selling fancy loudspeakers.
…
The Canadians have contributed too, so there is hope with the Great White North. (Aka 51st/52nd state.)