Optimized Crosstalk Cancellation for 3D Audio over Two Loudspeakers
There are a number of methods for generating 3D soundfields from loudspeakers. The three most promising are 1) Ambisonics, 2) Wave Field Synthesis and 3) Binaural Audio through Two Loudspeakers (BA2L). The first two methods rely on a large number of microphones and recording channels for recording, and a large number of loudspeakers for playback, and are thus incompatible with existing stereo recordings. The third method, BA2L, relies on only two recorded channels and two loudspeakers, and is compatible with the vast majority of existing stereo recordings (recorded with or without a dummy head).
At the 3D3A Lab, we have focused on the BA2L method and have attained unprecedented 3D playback realism (with no spectral coloration) with only two loudspeakers using optimal crosstalk cancellation filters, called BACCH, whose design method was developed by Professor Edgar Choueiri.
The following write-up explains the basic principles underlying BA2L, answers some frequently asked questions, and describes our ongoing research.
Binaural Audio through Two Loudspeakers
- Why are loudspeakers better than headphones for binaural audio?
- The need for crosstalk cancellation
- The three main problems of 3D Audio through XTC and their recent solutions
Binaural audio consists of reproducing, at the entrance of each of the listener's ear canals, the sound pressure signals containing the proper ITD (interaural time difference) and ILD (interaural level difference) cues required for the listener to perceive a realistic 3D sound image or soundfield. In its most common implementation, binaural audio relies on recording sound with microphones implanted in the ear canals of an artificial human head (or, equivalently, numerically convolving digital audio with a head-related transfer function (HRTF) representing the listener's head), then playing back the recorded stereo signals at or near the listener's ear canal entrances through earphones or headphones.
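As a rough illustration of the numerical route mentioned above (convolving audio with a head-related response for each ear), the following minimal Python sketch renders a mono signal into a two-channel binaural signal. The impulse responses are an assumption made purely for illustration: `toy_hrir_pair` is a hypothetical helper that encodes only a pure delay (ITD) and a broadband attenuation (ILD), whereas measured head-related impulse responses also carry the spectral cues produced by the head, torso and pinnae.

```python
from typing import List, Tuple


def convolve(signal: List[float], kernel: List[float]) -> List[float]:
    """Direct-form linear convolution (adequate for short illustrative signals)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def toy_hrir_pair(itd_samples: int, ild_db: float) -> Tuple[List[float], List[float]]:
    """ASSUMED toy impulse responses for a source on the listener's left.

    The left (near) ear receives the sound first and at full level; the
    right (far) ear receives it itd_samples later and ild_db quieter.
    Real HRIRs are measured and are far richer than this.
    """
    far_ear_gain = 10.0 ** (-ild_db / 20.0)
    hrir_left = [1.0] + [0.0] * itd_samples
    hrir_right = [0.0] * itd_samples + [far_ear_gain]
    return hrir_left, hrir_right


def binaural_render(
    mono: List[float], hrir_left: List[float], hrir_right: List[float]
) -> Tuple[List[float], List[float]]:
    """Convolve one mono signal with the response of each ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


if __name__ == "__main__":
    # At an assumed 48 kHz sample rate, 24 samples correspond to a 0.5 ms ITD.
    left_ir, right_ir = toy_hrir_pair(itd_samples=24, ild_db=6.0)
    left, right = binaural_render([1.0, 0.5, 0.25], left_ir, right_ir)
    print(len(left), len(right))
```

The two output channels are what the playback system must then deliver to the corresponding ears without mixing them.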
Non-idealities (e.g., mismatches between the HRTF of the listener and that used to encode the recording, movement of the perceived sound image with movement of the listener's head, lack of bone-conducted sound, transducer-induced resonances in the ear canal, discomfort, etc.), when above a certain threshold, are known to lead to difficulties in perceiving an accurate or realistic 3D image. When such difficulties are present, the nearness of the headphone or earphone playback transducers to the ears often leads to a perception that the sound (or some of its spectral components) is inside, or too close to, the listener's head.
Binaural playback through loudspeakers is largely immune to this head internalization of sound, for even when non-idealities in binaural reproduction are present the sound originates far enough from the listener to be perceived to come from outside the head. Furthermore, cues such as bone-conducted sound and the involvement of the listener's own head, torso and pinnae in sound diffraction and reflection during playback (even if it departs from, or interferes with, the diffraction-induced coloration represented in the HRTF used to encode the binaural recording) could be expected to enhance the perceived realism of sound reproduction relative to that achieved with earphones.
The playback of a raw binaural signal through two loudspeakers results in a significant degradation of the integrity of the binaural cues transmitted to the listener's ears because of the crosstalk that exists between the loudspeakers and the contralateral ears. Such unintended crosstalk, which obviously does not exist in playback with headphones, requires cancellation or effective reduction if binaural audio is to be successfully implemented through loudspeakers.
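The crosstalk described above can be made concrete with a simple free-field model, sketched below in Python. Everything in it is an illustrative assumption rather than a description of any measured system: the loudspeakers are treated as point sources, the ears as two points in free space with no head between them, and the geometry (1.6 m listening distance, 60 degree loudspeaker span, 0.15 m ear spacing) is hypothetical. With the ipsilateral path normalized to 1, the contralateral path has relative gain g = l_ipsi / l_contra and extra delay tau = (l_contra - l_ipsi) / c.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def path_lengths(distance_m: float, half_span_deg: float, ear_spacing_m: float):
    """Ipsilateral and contralateral path lengths for a centered listener."""
    theta = math.radians(half_span_deg)
    base = distance_m ** 2 + (ear_spacing_m / 2.0) ** 2
    cross = distance_m * ear_spacing_m * math.sin(theta)
    return math.sqrt(base - cross), math.sqrt(base + cross)


def transfer_matrix(freq_hz: float, l_ipsi: float, l_contra: float):
    """Normalized 2x2 matrix h[ear][speaker] of the free-field model."""
    g = l_ipsi / l_contra                      # contralateral attenuation
    tau = (l_contra - l_ipsi) / SPEED_OF_SOUND  # contralateral extra delay
    x = g * cmath.exp(-2j * math.pi * freq_hz * tau)
    return [[1.0 + 0j, x], [x, 1.0 + 0j]]


def ear_pressures(h, left_in: complex, right_in: complex):
    """Pressures at the (left, right) ears for given loudspeaker inputs."""
    return (
        h[0][0] * left_in + h[0][1] * right_in,
        h[1][0] * left_in + h[1][1] * right_in,
    )


def channel_separation_db(p_ipsi: complex, p_contra: complex) -> float:
    """Level of the intended ear relative to the unintended ear."""
    return 20.0 * math.log10(abs(p_ipsi) / abs(p_contra))


if __name__ == "__main__":
    l1, l2 = path_lengths(1.6, 30.0, 0.15)
    h = transfer_matrix(1000.0, l1, l2)
    # Feed only the left channel: the right ear still receives it via crosstalk.
    p_left, p_right = ear_pressures(h, 1.0, 0.0)
    print(round(channel_separation_db(p_left, p_right), 2), "dB")
```

In this headless model the left-only signal reaches the right ear almost undiminished, so the interaural level difference encoded in the recording is largely lost. A real head adds acoustic shadowing, so actual separation is larger at high frequencies, but the qualitative problem is the same.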
The crosstalk cancellation (XTC) level required for acceptable binaural reproduction through loudspeakers depends on the intended application. In scientific applications, such as the study of spatial hearing disabilities of elderly adults, the high levels of XTC (above 20 dB) needed for highly accurate transmission of binaural cues to a listener require anechoic, or semi-anechoic, environments, precise matching of the listener's HRTF with that used in the recording, and constrained positioning of the listener's head in the area of equalization (the "sweet-spot"). In many less stringent applications, modest levels of XTC, even of a few dB over a limited range of frequencies, have the potential to significantly enhance the 3D realism of the reproduction of recordings containing binaural cues. This is because, by definition, localization cues in a binaural recording represent differential interaural information that is intended to be transmitted to the ears with no crosstalk. In other words, crosstalk cancellation, at any level, is a reduction of unintended artifice in the loudspeaker playback of recordings containing significant binaural cues.
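To give the notion of an XTC level in dB a concrete form, the sketch below computes it at a single frequency as the ratio of the signal reaching the intended ear to the signal leaking to the other ear. It assumes an idealized symmetric free-field model in which the contralateral path is the ipsilateral path scaled by g and delayed by tau. The canceller with an adjustable `alpha` is a hypothetical construction used only to show how partial cancellation maps to a finite XTC level; it is not a published filter design, and the values of g and tau are assumed.

```python
import cmath
import math


def crosstalk_term(freq_hz: float, g: float, tau_s: float) -> complex:
    """Contralateral path relative to the ipsilateral path (assumed model)."""
    return g * cmath.exp(-2j * math.pi * freq_hz * tau_s)


def xtc_level_db(x: complex, alpha: float) -> float:
    """XTC level for a canceller that subtracts alpha times the crosstalk.

    Plant      H = [[1, x], [x, 1]]
    Canceller  C = [[1, -alpha*x], [-alpha*x, 1]]   (hypothetical)
    Response   R = H C, so a left-channel input produces
               R[0][0] = 1 - alpha*x**2  at the left (intended) ear
               R[1][0] = x - alpha*x     at the right (unintended) ear
    """
    wanted = abs(1.0 - alpha * x * x)
    leak = abs(x - alpha * x)
    if leak == 0.0:
        return math.inf  # perfect cancellation within this model
    return 20.0 * math.log10(wanted / leak)


if __name__ == "__main__":
    x = crosstalk_term(1000.0, g=0.95, tau_s=2.2e-4)  # assumed values
    for alpha in (0.0, 0.5, 0.9, 1.0):
        print(alpha, xtc_level_db(x, alpha))
```

With `alpha = 0` the function returns the natural channel separation of the model, and the level grows as `alpha` approaches 1. Even intermediate values raise the separation by several dB, which is the sense in which modest XTC can already reduce the corruption of interaural cues.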
This reduction of unintended artifice through XTC should also apply to the loudspeaker playback of most stereo recordings, especially those made in real acoustic spaces, and even to recordings made using standard stereo microphone techniques without a dummy head, since these techniques all rely on preserving in the recording a good measure of the natural ILD and ITD cues needed for spatial localization during playback. We should therefore expect that applying even a relatively low level of XTC to the playback of such standard stereo recordings, even those lacking HRTF encoding, should enhance image localization compared to playback with full crosstalk, as well as the perception of width and depth of the soundfield, since these binaural features are always, to some degree, corrupted by crosstalk.
The main impediment to the wide adoption of 3D Audio through XTC has been the huge spectral coloration that XTC filters inherently impose on the sound emitted by the loudspeakers. This impediment has been completely removed by the advent of the BACCH™ filter. The fundamental nature of this spectral coloration, its basic features, its dependencies, and optimal methods to abate it with minimal adverse effects on XTC performance, are discussed in detail in this technical paper, which describes most basic aspects of BACCH™ filters.
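The origin of this coloration can be illustrated with a generic textbook construction, sketched below in Python. This is not the BACCH design, whose details are given in the technical paper; it is a constant-parameter Tikhonov-regularized inverse of an assumed symmetric free-field model H = [[1, x], [x, 1]], with hypothetical values for g and tau. Because H is symmetric, its in-phase and out-of-phase modes have gains 1 + x and 1 - x, and the filter C = (H^H H + beta I)^-1 H^H can be evaluated mode by mode.

```python
import cmath
import math


def regularized_xtc(freq_hz: float, g: float, tau_s: float, beta: float):
    """Return (peak filter gain in dB, XTC level in dB) at one frequency.

    beta = 0 gives the exact inverse; beta > 0 is constant-parameter
    Tikhonov regularization (a generic technique, NOT the BACCH filter).
    """
    x = g * cmath.exp(-2j * math.pi * freq_hz * tau_s)
    filter_gains = []  # |eigenvalues| of the filter C
    passed = []        # eigenvalues of the overall response R = H C
    for lam in (1.0 + x, 1.0 - x):  # in-phase and out-of-phase modes of H
        power = abs(lam) ** 2
        filter_gains.append(abs(lam) / (power + beta))
        passed.append(power / (power + beta))
    peak_gain_db = 20.0 * math.log10(max(filter_gains))
    # R[0][0] = (passed[0] + passed[1]) / 2, R[1][0] = (passed[0] - passed[1]) / 2
    leak = abs(passed[0] - passed[1])
    if leak == 0.0:
        return peak_gain_db, math.inf
    return peak_gain_db, 20.0 * math.log10((passed[0] + passed[1]) / leak)


if __name__ == "__main__":
    g, tau = 0.95, 2.2e-4        # assumed geometry-dependent values
    f_peak = 1.0 / (2.0 * tau)   # frequency at which x is close to -g
    print(regularized_xtc(f_peak, g, tau, beta=0.0))
    print(regularized_xtc(f_peak, g, tau, beta=0.05))
```

At the frequency where one mode of H nearly vanishes, the exact inverse must boost that mode by roughly 1 / (1 - g), which is the coloration peak. Constant regularization removes the peak but gives up most of the cancellation at that frequency, which is why a naive fix trades one problem for another.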
The second impediment to the wide commercial adoption of 3D Audio through XTC has been the inherent existence of a "sweet-spot" in which the listener's head must remain in order to perceive a true 3D image. This difficulty has been completely removed by the use of recent robust head-tracking technologies (such as the widely available Kinect IR depth sensor), which seamlessly move the sweet-spot with the head of the listener (e.g., the jss-BACCH system, where "jss" stands for "jogging sweet-spot").
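The sensitivity of the sweet-spot to head position, and the benefit of redesigning the filter at the tracked position, can be sketched with a toy free-field model in Python. This is not the jss-BACCH algorithm: the point-source loudspeakers, the two-point "head", the geometry defaults and the exact matrix inverse used as the filter are all assumptions chosen only to illustrate the effect.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def plant(freq_hz, head_offset_m, distance_m=1.6, half_span_deg=30.0,
          ear_spacing_m=0.15):
    """Free-field matrix h[ear][speaker] for a head shifted sideways.

    All geometry defaults are hypothetical illustration values.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    theta = math.radians(half_span_deg)
    sx, sy = distance_m * math.sin(theta), distance_m * math.cos(theta)
    speakers = [(-sx, sy), (sx, sy)]
    ears = [(head_offset_m - ear_spacing_m / 2.0, 0.0),
            (head_offset_m + ear_spacing_m / 2.0, 0.0)]
    h = []
    for ex, ey in ears:
        row = []
        for px, py in speakers:
            r = math.hypot(px - ex, py - ey)
            row.append(cmath.exp(-1j * k * r) / r)  # point-source propagation
        h.append(row)
    return h


def inverse(m):
    """Inverse of a 2x2 complex matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def xtc_db(freq_hz, design_offset_m, actual_offset_m):
    """XTC level when the filter is designed for one head position
    but the head is actually at another."""
    c = inverse(plant(freq_hz, design_offset_m))
    r = matmul(plant(freq_hz, actual_offset_m), c)  # r[ear][input channel]
    leak = abs(r[1][0])
    if leak == 0.0:
        return math.inf
    return 20.0 * math.log10(abs(r[0][0]) / leak)


if __name__ == "__main__":
    print(xtc_db(2000.0, 0.0, 0.0))    # head at the design position
    print(xtc_db(2000.0, 0.0, 0.03))   # head moved 3 cm, filter unchanged
    print(xtc_db(2000.0, 0.03, 0.03))  # filter redesigned at tracked position
```

In this model a lateral shift of a few centimetres with a fixed filter collapses the cancellation at 2 kHz, while recomputing the filter for the new head position restores it. That is the basic reason head tracking can move the sweet-spot with the listener.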
The third and final impediment to the wide commercialization of 3D Audio through XTC has been the limitation of having a single sweet-spot, which restricts the 3D listening experience to a single listener. This limitation has been completely removed with the recent development of Dynasonix, through a 2-year collaboration between the 3D3A Lab and Cambridge Mechatronics Ltd., which allows creating multiple 3D sweet-spots for multiple listeners who are free to move while seamlessly maintaining full 3D audio imaging.
BACCH® filters are optimized crosstalk cancellation filters that allow 3D audio reproduction over loudspeakers. They yield the maximum crosstalk cancellation level without introducing any spectral coloration to the input signal. A detailed discussion of the theory underlying BACCH filters can be found in this technical paper.
BACCH™ 3D Sound is the (trademark registered) name of the Princeton University licensed technology that uses BACCH filters in consumer audio, home stereo, home theater installations, and other applications. To learn more about BACCH™ 3D Sound, read the 20 Questions and Answers that address most aspects of the use of this technology in existing home stereo and home theater systems.
Dynasonix™ is a new technology, developed through a 2-year research and development collaboration between the 3D3A Lab and Cambridge Mechatronics, which allows the creation of multiple 3D sweet-spots for multiple listeners who are free to move while seamlessly maintaining full 3D audio imaging.