You need, more particularly, a set of speakers that matches the room it is in. Now, I do not mean that a particular set of speakers can rescue a bad room. Sorry, no: nothing but treatment or worse will fix a bad room. Electronic correction does not fix bad rooms to any great extent; the best use for it is making a pretty good room great.
But in addition to that, you need a set of loudspeakers with an off-axis radiation pattern that works with the room, and with the listener's expectations and tastes. This is, by itself, one of the reasons that there are so many loudspeakers and so much argument about them. In short, what somebody wants in speakers is in part personal taste. And personal preference is just that, personal. No more, no less.
Finally, the idea of "holographic sound" from 2 speakers is only going to work in a very limited sense with a small number of recordings, again depending on the listener's expectation and taste, with a particular setup of a system in a particular room with particular speakers. The important parts are room, speakers, and recording; everything else is about 2 orders of magnitude or more down the scale in importance. And notice that all 3 of the most important are affected intensely by personal taste.
I hope my point is starting to come out by now, yes?
The idea of true holography is called "wave field synthesis", which is in fact something that people do work on. It's kind of expensive, and requires very, very custom recordings with very custom systems in a custom designed room, and something like a minimum of 128 channels for the desired effect. At low and mid-frequencies, these systems do something akin to holography. They sound quite real; I've heard a few of them. The problem is that they sound, sometimes, too real. Do you really want to know about the problems in the recording space? No, probably not.
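To put a rough number on "low and mid-frequencies": one common rule of thumb, which is my addition here and not something stated above, is that a line of speakers spaced dx apart can only synthesize the wavefront cleanly up to about f = c / (2 * dx), above which spatial aliasing sets in. The sketch below just evaluates that formula. The spacings and the speed of sound are assumed values for illustration, not figures from any particular system.

```python
# Rule-of-thumb spatial aliasing limit for an evenly spaced speaker array.
# Assumption: f_alias ~= c / (2 * dx); real systems vary with geometry.

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly room temperature


def aliasing_frequency_hz(spacing_m):
    """Approximate upper frequency for clean synthesis at a given spacing."""
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return SPEED_OF_SOUND_M_S / (2.0 * spacing_m)


if __name__ == "__main__":
    # Hypothetical speaker spacings, in centimeters.
    for spacing_cm in (5, 10, 20):
        f = aliasing_frequency_hz(spacing_cm / 100.0)
        print(f"{spacing_cm:>3} cm spacing -> about {f:.0f} Hz")
```

Under that assumption, getting the limit up into the treble takes speakers only a few centimeters apart, which is one way to see why the channel count gets so large.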
There are systems, including one I've invented and worked with (there are a few, mind you, not just that one), that attempt to capture perceptual cues, as opposed to soundfield (analytic) cues. They suffer from the same problem: in particular, they suffer from being too accurate sometimes. Somewhere out there, you can see some of the reviews for the system I worked on, but I don't have them at my fingertips at the moment. Good reviews. BUT they also require changing the whole production chain, and require 5.0 (i.e. 5 full-range speakers) playback, or better, 7.0. Not a bunch of little speakers and a couple of big ones; full-range, matched speakers all around.
But the real problem is production, either of concert captures (i.e. classical in-situ), or of synthetic studio production.
There is myth that ensures it won't work (e.g. "don't use the center speaker", when it is in fact the most important, as shown in 1933 by some very basic work in soundfield perception). There are cinema production rules (which are right for cinema, and Holman is a smart guy, but which are terribly wrong for home theatre or for hi-fi use). And there is a persistence of the "phantom center", which ensures that the most important features of multichannel of any sort cannot work.
The sad part is that work before 1940 showed quite a bit of this, and convincingly, but we're still doing things in the modern way. (Yes, that is intentional irony.)
We have people arguing, for instance, that using time-delay panning as well as amplitude panning doesn't work, based on some older work that used time delays of 5, 10, 15, 20 milliseconds, which obviously won't work unless your head is 5, 10, ... FEET across.
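The arithmetic behind that is easy to check: a time delay corresponds to the distance sound travels in that time. Here is a minimal sketch, assuming a speed of sound of 343 m/s and, purely for illustration, an ear-to-ear path of about 0.2 m (a round figure I picked, not a measured one).

```python
# Convert between time delay and the acoustic path length it represents.

SPEED_OF_SOUND_M_S = 343.0   # assumed: air at roughly room temperature
METERS_PER_FOOT = 0.3048
ASSUMED_HEAD_PATH_M = 0.2    # hypothetical round figure for an ear-to-ear path


def delay_to_path_feet(delay_ms):
    """Distance in feet that sound travels in delay_ms milliseconds."""
    return SPEED_OF_SOUND_M_S * (delay_ms / 1000.0) / METERS_PER_FOOT


def path_to_delay_ms(distance_m):
    """Time in milliseconds for sound to travel distance_m meters."""
    return distance_m / SPEED_OF_SOUND_M_S * 1000.0


if __name__ == "__main__":
    for delay_ms in (5, 10, 15, 20):
        feet = delay_to_path_feet(delay_ms)
        print(f"{delay_ms:>2} ms delay ~ {feet:.1f} ft of path")
    head_ms = path_to_delay_ms(ASSUMED_HEAD_PATH_M)
    print(f"{ASSUMED_HEAD_PATH_M} m head path ~ {head_ms:.2f} ms")
```

With those assumptions, 5 ms works out to a bit over 5 and a half feet of path, while the delay across a real head comes in under a millisecond. That is the scale that time-delay panning has to work at.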
So, there is a lot known about how to establish a good, convincing synthetic soundfield, either using perceptual principles or doing it the hard way and actually reproducing, at least in a 2-D way, the actual soundfield. But market penetration is going to be a very, very tough issue, because it requires revising every step of a chain fraught with both science and myth, as well as what seems to me to be some deliberate misinformation here and there.