IMHO the situation is not being clearly defined. Most of the time the "crispy delineated and clear image" is the result of omitting the natural bloom and decay of voices, including room acoustics. IMHO, if tweaking the system can enhance our perception of this information, this means that the tweak is either absorbing some nasty interference that masks the original information or enhancing some information existing in the recording.
I think we must separate two situations - systems that systematically broaden images, giving a diffuse soundstage independently of the recording, and those that are able to give a natural image and soundstage, not a pinpoint type, when the recording has this information, and a precise and sharp one when the recording has been made in such a way. At the other extreme we have systems that systematically sound crisp and sharp.
Please note that, as usual, I am addressing non-amplified music.
This is an excellent point, microstrip, and the distinction is worth discussing. I have noticed that the best systems I have heard have an ability, with the right recordings, to distinguish between the origin of the sound, i.e., the singer's mouth or the strings/body of the violin, and the sound that emerges or explodes into the listening space. When a system can do that, and does not simply present the sound as a 2D plane, or even as a 3D image behind the speakers that is "viewed" from the listening seat, but instead lets the sound be experienced because it envelops the listener and fills the room, as it does in good concert halls, then that system is doing something right. But everything has to be working - the system, the room and the recording.