There are actually two separate issues with DSD that limit its musicality - and the two problems are interlinked.
The first issue is down to the resolving power of DSD. Now DSD works by using a noise shaper, and a noise shaper is a feedback system. Indeed, you can think of an analogue amplifier as a first order noise shaper - you have a subtraction input stage that compares the input to the output, followed by a gain stage that integrates the error. A delta-sigma noise shaper is exactly the same, except that the output is truncated to reduce the noise shaper's output resolution so that it can drive the output (OP) stage - in the case of DSD that is a one bit, +1 or -1 OP stage. But you use multiple gain stages connected together so you have n integrators - typically 5 for DSD. Now the number of integrators, together with the time constants, determines how much error correction you have within the system - and the time constants are primarily set by the oversampling rate of the noise shaper. Double the oversampling frequency and a 5th order ideal system (i.e. one that does not employ resonators or other tricks to improve HF noise) converges on a 30 dB improvement in distortion and noise.
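As a rough illustration of the feedback loop described above, here is a minimal Python sketch of a first order, one bit modulator (subtract, integrate, truncate to +1 or -1). It is not the 5th order design used for DSD, and not the noise shaper in Hugo or Dave; the function names are made up for this example. The second function is the usual textbook approximation for an ideal Nth order shaper (quantisation noise assumed white), which gives roughly 6N + 3 dB per doubling of the oversampling rate, so about 33 dB for N = 5, in the same region as the figure quoted above.

```python
import math

def first_order_modulator(samples):
    """Hypothetical first order, 1-bit delta-sigma modulator (illustration only)."""
    integrator = 0.0
    feedback = 0.0          # previous 1-bit output, fed back to the input
    bits = []
    for x in samples:
        integrator += x - feedback                      # subtract, then integrate the error
        feedback = 1.0 if integrator >= 0.0 else -1.0   # truncate to +1 / -1
        bits.append(feedback)
    return bits

def ideal_gain_per_octave_db(order):
    """Textbook SNR gain per doubling of the oversampling rate (white-noise model)."""
    return 10.0 * math.log10(2.0 ** (2 * order + 1))

# A small DC input: the average of the 1-bit stream should track it.
dc_input = 0.25
bits = first_order_modulator([dc_input] * 4096)
mean_output = sum(bits) / len(bits)

print("mean of 1-bit output:", mean_output)
print("ideal 5th order gain per octave (dB):", ideal_gain_per_octave_db(5))
```

The output only ever takes the values +1 and -1, yet its average follows the input because the integrator keeps accumulating whatever error the truncation leaves behind.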
So where does lack of resolution leave us? Well, any signal that is below the noise floor of the noise shaper is completely lost - this is completely unlike PCM, where an infinitely small signal is still encoded within the noise when using correct dithering. With DSD any signal below the noise shaper noise floor is lost for good. Now these small signals are essential for the cues that the brain uses to get the perception of sound stage depth - and depth perception is a major problem with audio - conventional high end audio is incapable of reproducing a sense of space in the same way one can perceive natural sounds. Now whilst optimising Hugo's noise shaper I noticed two things. First, once the noise shaper performance hit 200 dB (that is, THD and noise being -200 dB in the audio bandwidth as measured using digital domain simulation) it no longer got smoother. So in terms of warmth and smoothness, 200 dB is good enough. Second, this categorically did not apply to the perception of depth, where making further improvements improved the perception of how deep instruments were (assuming they are actually recorded with depth, like an organ in a cathedral or the off stage effects in Mahler 2, for example). Given the size of the FPGA and the 4e pulse array 2048FS DAC, I got the best depth I could obtain.
But with Dave, no such restriction on FPGA size applied, and I had a 20e pulse array DAC which innately has more resolution and allows smaller time constants for the integrators (so better performance). So I optimised it again, and kept on increasing the performance of the noise shaper - and the perception of depth kept on improving. After 3 months of optimising and redesigning the noise shaper I got to 360 dB performance - an extraordinary level, way beyond the performance of ordinary noise shapers. But what was curious was how easy it was to hear a 330 dB noise shaper against a 360 dB one - but only in terms of depth perception. My intellectual puzzle is whether this level of small signal accuracy is really needed, or whether these numbers are acting as a proxy for something else going on, perhaps within the analogue parts of the DAC - I am not sure on this point, and it is something I will be researching. But for sure I have got the optimal performance from the noise shaper employed in Dave, and every DAC I have ever listened to shows similar behaviour.
The point I am making is that DSD64 noise shapers are only capable of 120 dB performance - and that is some 10 thousand times worse than Hugo - and a trillion times worse than Dave. And every time I hear DSD I always get the same problem of perception of depth - it sounds completely flat with no real sense of depth. Now regular 16 bit red book categorically does not suffer from this problem - an infinitely small signal will be perfectly encoded in a properly dithered system - it will just be buried within the noise.
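For anyone wanting to check the "10 thousand" and "a trillion" figures, they follow from the dB numbers quoted earlier if the differences are read as amplitude ratios (20 log10), which is an assumption made here. A quick sketch, with a helper name invented for the example:

```python
def db_to_amplitude_ratio(db):
    """Convert a dB difference to an amplitude ratio (assumes the 20*log10 convention)."""
    return 10.0 ** (db / 20.0)

# Noise shaper performance figures quoted in the post, in dB.
dsd64_db, hugo_db, dave_db = 120.0, 200.0, 360.0

hugo_vs_dsd = db_to_amplitude_ratio(hugo_db - dsd64_db)   # 80 dB difference
dave_vs_dsd = db_to_amplitude_ratio(dave_db - dsd64_db)   # 240 dB difference

print("Hugo vs DSD64:", hugo_vs_dsd)
print("Dave vs DSD64:", dave_vs_dsd)
```

An 80 dB gap is a factor of ten thousand and a 240 dB gap is a factor of a million million.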
Now the second issue is timing. I am not talking about timing in terms of femtosecond clocks and other such nonsense - it always amuses me to see NOS DAC companies talking about femtosecond accuracy clocks when their lack of proper filtering generates hundreds of µs of timing problems on transients due to sampling reconstruction errors. What I am talking about is how accurately transients are timed against the original analogue signal, in that the timing of transients is non-linear. Sometimes the transient will be at one point in time, other times delayed or advanced, depending upon where the transient occurs against the sample time. In the case of PCM we have timing errors on transients due to the lack of tap length in the FIR reconstruction filter. The mathematics is very clear cut - we need extremely long tap lengths to almost perfectly reconstruct the original timing of transients - and from listening tests I can hear a correlation between tap length and sound quality. With Dave I can still hear 100,000 taps increasing to 164,000 taps, albeit I can now start to hear the law of diminishing returns. But we know for sure that with a long enough tap length it would make absolutely no difference whether the signal was sampled at 22 µs or 22 fs intervals (assuming it's a perfectly bandwidth limited signal). So red book is again limited on timing by the DAC, not inherently by the format.
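The tap length point can be illustrated with a small sketch: take a band-limited tone, keep only its values at the sample instants, and rebuild the values between samples with a sinc interpolation filter of finite length. This uses a generic Hann-windowed sinc, which is an assumption for illustration only and not the filter used in Hugo or Dave; the test tone, tap counts and function names are likewise made up.

```python
import math

def signal(t):
    """Band-limited test tone at 0.4 cycles per sample (Nyquist is 0.5)."""
    return math.sin(2.0 * math.pi * 0.4 * t + 0.3)

def windowed_sinc(tau, half_taps):
    """Hann-windowed sinc kernel, zero for |tau| >= half_taps."""
    if abs(tau) >= half_taps:
        return 0.0
    if tau == 0.0:
        return 1.0
    window = 0.5 * (1.0 + math.cos(math.pi * tau / half_taps))
    return window * math.sin(math.pi * tau) / (math.pi * tau)

def reconstruct(t, half_taps):
    """Rebuild the value at time t using only the samples at integer times."""
    centre = int(math.floor(t))
    total = 0.0
    for n in range(centre - half_taps, centre + half_taps + 1):
        total += signal(n) * windowed_sinc(t - n, half_taps)
    return total

def worst_error(half_taps):
    """Largest reconstruction error over a grid of between-sample instants."""
    times = [100.0 + k / 16.0 for k in range(1, 160)]
    return max(abs(reconstruct(t, half_taps) - signal(t)) for t in times)

short_error = worst_error(8)     # roughly a 16 tap filter
long_error = worst_error(128)    # roughly a 256 tap filter

print("worst error, short filter:", short_error)
print("worst error, long filter: ", long_error)
```

The longer filter recovers the between-sample values far more closely than the short one, which is the sense in which the reconstruction error comes from the filter rather than from the sampled data.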
Unfortunately, DSD also has its timing non-linearity issues, but they are different to those of PCM. This problem has never been talked about before, but it's something I have been aware of for a long time, and it's one reason I uniquely run my noise shapers at 2048FS. When a large signal transient occurs - let's say from -1 to +1 - then the time delay for the signal is small, as the signal gets through the integrators and OP quantizer almost immediately. But a small signal can't get through the quantizer straight away, and so it takes some time for a small negative signal changing to a positive signal to work its way through the integrators. You see these effects on simulation, where the difference in delay between a small transient and a large transient is several µs for DSD64.
Now timing non-linearity of the order of µs is very audible, and it affects the ability of the brain to perceive the starting and stopping of instruments. Indeed, the major surprise of Hugo was how well one can perceive the starting and stopping of notes - it was much better than I expected, and at the time I was perplexed as to where this ability was coming from. With Dave I managed to dig down into the problem, and some of the things I had done (for other reasons) had also improved the timing non-linearity. It turns out that the brain is much more sensitive than timing errors of the order of 4 µs (this number comes from the inter-aural delay resolution; it's the accuracy the brain works to in measuring the time of sounds hitting one ear against the other), and much smaller levels degrade the ability of the brain to perceive the starting and stopping of notes.
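To put the microsecond figures above in context, here is a quick calculation of the sample periods involved. It assumes "FS" means the 44.1 kHz CD rate, so DSD64 runs at 64 x 44.1 kHz and a 2048FS noise shaper at 2048 x 44.1 kHz; the names are invented for the example.

```python
BASE_RATE_HZ = 44_100  # assumption: "FS" is the 44.1 kHz CD sample rate

def sample_period_ns(multiple_of_fs):
    """Sample period in nanoseconds for a rate of multiple_of_fs x 44.1 kHz."""
    return 1e9 / (multiple_of_fs * BASE_RATE_HZ)

red_book_period_ns = sample_period_ns(1)     # CD sample period
dsd64_period_ns = sample_period_ns(64)       # DSD64 bit period
fs2048_period_ns = sample_period_ns(2048)    # 2048FS noise shaper period

# How many DSD64 bit periods fit into the 4 microsecond figure quoted above.
periods_in_4us = 4000.0 / dsd64_period_ns

print("red book period (ns):", red_book_period_ns)
print("DSD64 period (ns):   ", dsd64_period_ns)
print("2048FS period (ns):  ", fs2048_period_ns)
print("DSD64 periods in 4 us:", periods_in_4us)
```

So a delay difference of several microseconds corresponds to on the order of ten DSD64 bit periods, while a 2048FS noise shaper steps 32 times faster.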
But timing accuracy has another important effect too - not only is it crucial to being able to perceive the starting and stopping of notes, it's also used to perceive the timbre of an instrument - that is, the initial transient is used by the brain to determine the timbre of an instrument, and if the timing of transients is non-linear, then we get compression in the perception of timbre. One of the surprising things I heard with Hugo was how easy it was to hear the starting and stopping of instruments, and how easy it was to perceive individual instruments' timbre and sensation of power. And this made a profound improvement to musicality - I was enjoying music to a level I had never had before.
But the problem we have with DSD is that the timing of transients is non-linear with respect to signal level - and unlike PCM you are completely stuck, as the error is on the recording and it's impossible to remove. So when I hear DSD, it sounds flat in depth, and it gives a relatively poor ability to perceive the starting and stopping of notes (compared against PCM, using Hugo/Dave). Acoustic guitar sounds quite pleasant, but there is a lack of focus when the string is initially struck - it all sounds unnaturally soft, with an inability to properly perceive the starting and stopping. Also the timbre of the instrument is compressed, and it's down to the substantial timing non-linearity with signal level.
Having emphasised the problems with delta-sigma or noise shaping, you may think it's better to use R2R DACs instead. But they have considerable timing errors too; making the timing of signals code independent is impossible. They also have considerable low level non-linearity problems, as it's impossible to match the resistor values - much worse than DSD even - so again we are stuck with poor depth, perception of timing and timbre. Not only that, they suffer from substantial noise floor modulation, giving a forced, hard, aggressive edge to the sound. Some listeners prefer that, and I won't argue with somebody else's taste - whatever works for you. But it's not real and it's not the sound I hear with live un-amplified instruments.
So to conclude; yes I agree, DSD is fundamentally flawed, and unlike PCM, where the DAC is the fundamental limit, with DSD the limit is mostly in the format itself. Additionally, it's very easy to underestimate how sensitive the brain is to extremely small errors, and these errors can have a profound effect on musicality.