I read the interview. In it, he did not explain why.
We now have a much better understanding of digital, how it differs from analog, and how it has matured to a level that exceeds the best analog.
This doesn't tell us anything substantive.
As I read it, he actually does explain why, by implication. Basically, he says that past formats, which captured less information, needed 'crutches' in the form of over-gathered data, whereas high-resolution digital does not, because it captures more information:
It’s not necessarily ultra-resolution that’s the problem, it is how engineers react to the new possibilities it offers. In the past, when the best technology captured less information, engineers tended to over-gather data, placing equal emphasis (or worse, over-emphasis) on all aspects of the potential mix. Then, to compensate for the lack of visual and other sensory information (the data our sight, skin, and scent collect at live performances), they created layers—depth of field—via supporting microphones, EQ, and dynamic alterations. They could offer focal points everywhere simultaneously, often incoherent in imaging and perspective.
Remember when HDTV came out? By capturing more information, the new visual transparency revealed too much of the actors’ make-up. This changed the “grime departments” at television shows forever, and vastly complicated the tasks of set decoration, lighting, and more. With ultra-resolution audio, engineers now have to come up with more subtle accents, more sophisticated mixes. That has not happened across the board. Timbres, transients, and imaging are still being overproduced, as if high-res did not exist; the make-up is showing. Analog—and DSD, in a way—comes with a nice layer of “lingerie,” a term I use for differences between real life and the “ultimate/naked” truth of high-resolution audio.