Two Answers:
1. A sphere of radius 0.43 inches. Please excuse the faux specificity. Hard to prove, hard to disprove. I mention a percussive sound because it sits in a limited frequency range that human hearing can successfully echolocate. Also, the emanating source must have a small surface area. Think snapping twig. It's enough for Clint Eastwood to work with. As the emanating surface grows in size and complexity, such as a human head or a piano, it becomes much more difficult for a system to reproduce.
2. In my cumulative experience across the systems I have owned, I have heard parts of edges, surfaces, and specific points within a location. I believe this incompleteness is an issue of recording, medium, and replay-system fidelity. It doesn't quite do it all. The best recordings in the best formats on the best systems should be able to render this assembly of sounds with greater clarity. With all of the engineer's sonic patchwork, injected noise, and the myriad phase shifts within a replay chain (inductance, capacitance), which vary across the frequency range, this is a tough illusion to pull off. In the best moments I have perceived a singing, breathing head in 3-space, with sides, top, and bottom, with vocal cords separate from mouth opening and nostrils. I assume this is a psychoacoustic trick, as most vocals are recorded with only one microphone, not four. Occasionally I can also perceive a visual of a drum kit, percussive sounds emanating from specific locations surrounded by ambient room reflections that augment the virtual construction. Studio sounds, say electronic sounds, are easier to reproduce because they are simpler in origin and are placed in the two-channel field by the engineer. A live performance recorded well in two channels may be the most revealing test of a system's phase coherence.
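To make the phase-shift point concrete, here is a minimal sketch with hypothetical part values (R = 1 kΩ, C = 100 nF), using a single first-order RC low-pass as a stand-in for one link in a replay chain. It just shows that the phase offset is not constant but varies continuously with frequency:

```python
import math

# Assumed, illustrative component values (not from any real system):
R = 1_000.0   # ohms
C = 100e-9    # farads; cutoff is about 1.59 kHz

def phase_deg(f_hz: float) -> float:
    """Phase shift in degrees of a first-order RC low-pass at f_hz."""
    return -math.degrees(math.atan(2 * math.pi * f_hz * R * C))

# Different parts of the spectrum are delayed by different amounts:
for f in (100, 1_000, 10_000):
    print(f"{f:>6} Hz: {phase_deg(f):6.1f} deg")
```

A real chain stacks many such stages (plus inductive ones), each with its own frequency-dependent shift, which is the sense in which the illusion is hard to hold together.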
Please elaborate on what you hear.
Where I work we have a 3D laser scanner to document industrial equipment and buildings. A single scan looks proper until you rotate the rendered model on the computer screen; once you do, the missing information is apparent. We can't visually autofill what isn't there. It's effectively 2D. A better technique is to take readings from two different locations for a stereo effect, and the computer combines these scans into a much better rendering. It is still not complete, though: the blind spots become apparent once you rotate the model. A third scan, optimally located, can fill in even more of the gaps. The point is, even with purist two-channel, we are not capturing all of the 3D information within an acoustic space. It is a fixed perspective, which limits the degree of recreation possible. Move to the side of the bandstand and it sounds different, right? Whether that matters is entirely debatable.
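The blind-spot idea can be sketched numerically. This is a toy model, not real scanner geometry: surface points on a circle stand in for the scanned object, and a point counts as captured when its outward normal faces the scanner. One scan covers well under half the surface; a second, opposed scan roughly doubles coverage, yet gaps remain:

```python
import math

# Toy model: the "object" is a unit circle sampled at N surface points.
# For a convex shape, a surface point is visible from a scanner when
# its outward normal faces the scanner position.
N = 3600
points = [(math.cos(2 * math.pi * i / N), math.sin(2 * math.pi * i / N))
          for i in range(N)]  # on the unit circle, the normal at (x, y) is (x, y)

def visible_from(scanner):
    sx, sy = scanner
    seen = set()
    for i, (x, y) in enumerate(points):
        dx, dy = sx - x, sy - y        # direction from surface point to scanner
        if dx * x + dy * y > 0:        # normal faces the scanner
            seen.add(i)
    return seen

one = visible_from((5.0, 0.0))          # single scan position (illustrative)
two = one | visible_from((-5.0, 0.0))   # add a second, opposed scan
print(f"one scan:  {100 * len(one) / N:.1f}% of the surface")
print(f"two scans: {100 * len(two) / N:.1f}% of the surface")
```

Even the combined pair misses the regions edge-on to both scanners, which is why the optimally placed third scan helps, and why a fixed two-channel perspective cannot capture everything.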
Add a few more audio channels to the recording and playback and much more becomes possible. I'm not referring to the cinema systems in use today.