by David McGrath and Andrew Reilly, Lake DSP
Abstract
The investigations carried out at Lake into various methods for generating simulated acoustic experiences have recently led to the methods of soundfield recording and playback. This paper explains the basic methods used in this area, and examines the various tools that have been developed at Lake for generating and manipulating soundfields. This research merges technologies from Architectural Acoustics (Auralization) and Virtual Reality (head-tracking and 3-D simulation).
Introduction
The refinement of acoustic computer modeling techniques in recent years has led to the development of DSP hardware that is capable of simulating complex acoustic responses in real-time, through the use of convolution. This simulation has generally centred around binaural playback, as a convenient method for presentation of a 3-D soundfield to the listener.
The enormous computational effort required to pre-compute the binaural acoustic response of a space means that any animation of the source or listener position during the real-time DSP processing will require a new approach. Furthermore, the application of acoustic modelling techniques in new areas (such as entertainment) will require the facility for playback other than through headphones.
This paper describes a new method for the creation of 3-D soundfields that makes use of a high-performance DSP system to allow animation of the acoustic simulation in real-time, whilst also allowing the option for playback through loudspeakers (in either a 2-D or 3-D surround array or through a stereo pair) or headphones.
This paper is divided into six main sections:
- The first describes the overall system that has been implemented, to give an overview of its goals.
- The second explains the methods previously used for static auralization, particularly in relation to the variety of microphone/playback methods that are applicable.
- The third explains the use of the B-format soundfield representation as a convenient method for creating and manipulating 3-D sound in auralization systems.
- The fourth describes a DSP architecture that processes dry input signals to produce a B-format soundfield, simulating an acoustic space with animated sound sources and receiver.
- The fifth details the loudspeaker playback method used to render the B-format signals over an array of speakers placed around the listener.
- The sixth describes a new method recently developed at Lake for processing the B-format signals for playback over headphones, with head-tracking used to maintain a stable acoustic field around the listener, even as the listener's head turns.
The purpose of the simulation system
The intention of this development at Lake was to produce a DSP system capable of giving a subject the illusion of a particular acoustic space, with one or more sound sources located within the space. The system was intended to fulfil the following requirements:
- The sound sources and listener location within the space should be animated, so that any of the objects (sources or receiver) could be moved in real-time.
- The subject should be given the illusion of the sound source(s) being localised in space with the correct direction and distance impression.
- The direct sound source and some early reflections should be animated to give the correct impression of close reflective surfaces.
- The absorption properties of the wall, floor and ceiling surfaces should be modelled.
- The late reverberation should be processed to provide the correct spatial impression.
- All configuration of the system should be possible from an external computer (such as a graphics workstation) so that the audio simulation can be linked to a graphical visualisation/simulation system.
Acoustic modelling and auralization
Acoustic modelling is used to determine attributes of an acoustic space based on a computer model. Auralization is the process by which a listener is able to listen to the soundfield that would be experienced in the actual space, based on the results of the modelling process. A good overview of Auralization is given in [1] and [2].
Auralization is accomplished by the following steps:
- Input the characteristics of the acoustic space (acoustic properties of surface materials, source/receiver locations etc.).
- Determine the sequence of sound arrivals that occur at the listening position. Each sound arrival will have the following characteristics: (a) time of arrival, based on the distance travelled by the echo-path, (b) direction of arrival, (c) attenuation (as a function of frequency) of the sound due to the absorption properties of the surfaces encountered by the echo-path.
- Compute the impulse response of the acoustic space incorporating the multiple sound arrivals. If the receiver that we are modelling is a mono microphone, then a single impulse response will be computed. In cases where the receiver is a stereo microphone pair, or a dummy head, a pair of impulse responses will be computed. These impulse responses will incorporate: (a) the effect of each microphone’s directivity pattern (as a function of frequency), (b) any time delays between microphones in the case where the receiver is made up of more than one microphone and they are separated by some distance.
- The impulse response(s) are loaded into an FIR filter (either running in real time or computed in batch mode on a general-purpose computer). The FIR filter is then given dry speech/music/etc. as input, and its output is the simulated room soundfield (a minimal sketch of this step appears after this list).
- The results from the FIR filter are played back to a listener. In the case where the impulse responses were computed using a dummy head response, the results are played over headphones to the listener. In this case, the equalisation required for the particular headphones is also applied.
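As a minimal illustration of the FIR filtering step above (a sketch, not Lake's DSP implementation), the following Python fragment convolves a dry source with a precomputed pair of binaural impulse responses. All signals here are random placeholders; the 48 kHz rate and filter lengths follow the text below.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000                               # sample rate used in the text
dry = np.random.randn(2 * fs)            # placeholder for a dry speech/music signal
h_left = 0.01 * np.random.randn(20000)   # placeholder binaural impulse responses
h_right = 0.01 * np.random.randn(20000)  # (10,000-200,000 taps per the text)

# Long FIR filters are applied efficiently via FFT-based convolution.
left = fftconvolve(dry, h_left)
right = fftconvolve(dry, h_right)
binaural = np.stack([left, right], axis=1)  # two channels for headphone playback
```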
The impulse responses computed by this procedure can be very long. For audio data sampled at 48kHz, the impulse response will be between 10,000 samples (for a very small room) and 200,000 samples (for a very large reverberant space). Some or all of the computation must be re-done if any of the following features are changed: (a) the position or orientation of the sound source, (b) the position or orientation of the listener, (c) the surface materials of the room, (d) the microphone characteristic (or the dummy head response), (e) the headphone characteristics (in the case of binaural playback).
In all cases where a dummy head is referred to in the above discussion, the authors intend that a real subject’s head response may also be applied. In particular, the most realistic auralization experience can be achieved for a listener when the response of that listener’s head is used in computing the impulse responses.
The use of binaural impulse responses to simulate the room, with the full room response precomputed, generally implies that the auralization process will be static. However, an animated auralization has been achieved by the authors in the case where the impulse response of the room is precomputed for multiple listener head directions, and the DSP is pre-loaded with these multiple responses. In this system, a tracking device is attached to the headphones, and the DSP selects the appropriate pre-loaded filter function based on the listener's head orientation. However, this system requires that all room responses be pre-computed, so arbitrary movement of sources and receiver cannot be achieved.
The B-format soundfield representation
Binaural impulse responses present a particular difficulty, because the response heard by the listener (at each ear) changes in a very complex way when the source and/or listener move in the acoustic space. The authors sought a more convenient format for creating and manipulating soundfields, and selected the B-format soundfield representation as a proven and well supported system for recording and processing the soundfield measured at one point in space.
The B-format is often referred to as Ambisonics. The authors' understanding of the terminology is that Ambisonics is the system of recording and playback of sound fields developed by Michael Gerzon at the National Research Development Corporation in the U.K. in the 1970s [3][4]. Other researchers also contributed to this technology [5], and Ambisonics differs from other work in the way the spatial sound is processed for loudspeaker playback. The B-format is the name given to the particular 4-channel recording and transmission format used to convey the spatial sound information in Ambisonic systems (in fact there are other formats, including UHJ, that are subsets of the B-format, used in situations where the medium does not allow for four full-bandwidth audio channels).
The B-format is essentially a four-channel audio format that can be recorded using a set of four coincident microphones arranged to provide one omni-directional channel (the W channel) and three figure-8 channels (the X, Y and Z channels). This set of X, Y, Z and W signals represents a first-order approximation to the soundfield at that point in space.
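The same directional gains also give a simple way to synthesise B-format. The sketch below (an illustration, not Lake's code) pans a mono signal to W, X, Y, Z using the conventional first-order encoding equations; the -3 dB weighting on W is the usual convention, though gain conventions vary between systems.

```python
import numpy as np

def encode_bformat(s, azimuth, elevation):
    """Pan a mono signal s (ndarray) to first-order B-format.
    Angles in radians: azimuth anticlockwise from straight ahead,
    elevation positive upwards (a common convention)."""
    w = s * (1.0 / np.sqrt(2.0))                 # omni (W), -3 dB by convention
    x = s * np.cos(azimuth) * np.cos(elevation)  # front-back figure-8 (X)
    y = s * np.sin(azimuth) * np.cos(elevation)  # left-right figure-8 (Y)
    z = s * np.sin(elevation)                    # up-down figure-8 (Z)
    return w, x, y, z
```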
The Ambisonic system was developed in the 1970s, and its method of playback of soundfields over speakers never found wide application with end users (partially because Ambisonics was introduced to the world at a time when quadraphonic sound was losing favour with consumers). Despite this, the technique of recording in B-format has been sustained by a number of practitioners who have found it very flexible, due to the ease with which recordings can be made and the variety of ways that the B-format can be mixed down to stereo.
The B-format has been used as part of Lake’s new AniScape software tools because it provides the following important benefits:
- Real time animated B-format soundfields are simpler to create with DSP because the rendering of each sound arrival (including early echoes) in the soundfield is performed with simple time delays (and short filters to simulate attenuation at the surfaces).
- Playback of the B-format can be made through a variety of different loudspeaker geometries. These include simple 4-channel horizontal arrays as well as larger 3-D arrays.
- Following recent developments at Lake, playback of B-format is now possible through headphones, with head-tracking to provide a stable acoustic image while the listener turns his/her head to explore the acoustic space.
- The B-format is useful as an intermediate format for recording, allowing a variety of post-processing options. This means that 3-D soundfields produced through DSP simulation can be used in the same way as live soundfield recordings, with the final result being a stereo mix from the B-format source material (a sketch of such a mixdown follows this list). Given the wealth of experience that already exists among practitioners who are familiar with ways of working with the B-format, this makes simulated B-format soundfields immediately useful.
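As an illustration of such a mixdown (an assumption-laden sketch, not a documented Lake tool), the fragment below derives a crossed pair of first-order virtual microphones from the horizontal B-format components. The pattern and angle are free choices made at mix time, which is exactly the flexibility referred to above.

```python
import numpy as np

def virtual_mic(w, x, y, azimuth, pattern=0.5):
    """First-order virtual microphone aimed at `azimuth` (radians) in the
    horizontal plane. pattern: 0 = figure-8, 0.5 = cardioid, 1 = omni.
    Assumes W carries the conventional -3 dB weighting."""
    return (pattern * np.sqrt(2.0) * w
            + (1.0 - pattern) * (x * np.cos(azimuth) + y * np.sin(azimuth)))

def stereo_mixdown(w, x, y, spread=np.deg2rad(55.0), pattern=0.5):
    """Crossed virtual cardioids at +/- spread form a left/right pair."""
    return (virtual_mic(w, x, y, +spread, pattern),
            virtual_mic(w, x, y, -spread, pattern))
```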
The SoundField Filter DSP process
The DSP processing required to provide a listener with the illusion of sounds localised in a particular acoustic space must implement the following features (with reference to Figure 1):
- The direct sound must be processed to give the correct amplitude and perceived direction.
- Early echoes must arrive at the listener with appropriate time, amplitude and frequency response to give the perception of the size of the spaces (as well as the acoustic nature of the room surfaces).
- The late reverberation must be natural and correctly distributed in 3-D around the listener.
Figure 1. Early reflections in a room.
The relative amplitude of the direct sound compared to the remainder of the room response helps to provide the sensation of distance.
The DSP structure used to achieve this function is described below. Each input to the DSP represents one sound source in the virtual soundfield. The input is passed through a delay-line, and the direct sound and each early reflection is tapped out with the appropriate delay (based on the path length of the direct path or 1st order echo).
Then, each of these sound path signals is attenuated and filtered to simulate the correct characteristic of the echo arrival. The factors that are included in this gain/filter stage are (a) overall attenuation due to the distance traversed by the echo path, (b) attenuation due to the directivity pattern of the source, and (c) frequency-dependent attenuation due to the acoustic properties of all the surface materials encountered by the echo path.
Each of these sound arrivals (direct sound plus six 1st order reflections) is mixed to form the X,Y,Z,W soundfield signals. The gain values used in this mixing are computed from the direction of arrival of each sound image at the listener position.
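A hypothetical sketch of this per-arrival processing follows: one tap is taken at the arrival time, attenuated by 1/r for distance, and mixed into a B-format accumulator with gains set by the direction of arrival (the same encoding equations as earlier). A single broadband gain stands in for the per-surface filters, and whole-sample delays stand in for the interpolated taps described below.

```python
import numpy as np

C = 343.0    # speed of sound (m/s)
FS = 48000   # sample rate (Hz)

def render_arrival(src, path_len, azimuth, elevation, bformat):
    """Mix one arrival into a 4xN B-format accumulator (rows W, X, Y, Z).
    src is the dry source signal; path_len is the echo-path length in metres."""
    n_total = bformat.shape[1]
    tap = int(round(path_len / C * FS))        # arrival time as a delay-line tap
    n = max(0, min(len(src), n_total - tap))
    delayed = np.zeros(n_total)
    delayed[tap:tap + n] = src[:n] / path_len  # 1/r distance attenuation
    bformat[0] += delayed * (1.0 / np.sqrt(2.0))                 # W
    bformat[1] += delayed * np.cos(azimuth) * np.cos(elevation)  # X
    bformat[2] += delayed * np.sin(azimuth) * np.cos(elevation)  # Y
    bformat[3] += delayed * np.sin(elevation)                    # Z
```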
Finally, an additional tap from the delay line is used to feed the sound input into an array of FIR filters that render the later room reflections (from 2nd order, right through to the end of the reverberant tail). This filter has an impulse response of several seconds in length.
Each tap from the delay line uses interpolating filters to achieve sub-sample delay resolution. This, combined with an instantaneous velocity integrator, achieves a smooth Doppler shift as the sound objects (or listener) move within the acoustic space.
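The sketch below shows the principle with simple linear interpolation (the real system uses higher-quality interpolating filters): a tap position that varies smoothly between samples is what yields a continuous, click-free Doppler shift as the source-to-listener distance changes.

```python
import numpy as np

def frac_tap(delay_line, delay_samples):
    """Read a delay line (most recent sample last) at a non-integer delay,
    linearly interpolating between the two nearest stored samples."""
    i = int(np.floor(delay_samples))
    frac = delay_samples - i
    newer = delay_line[-1 - i]   # sample at floor(delay_samples)
    older = delay_line[-2 - i]   # one sample further back in time
    return (1.0 - frac) * newer + frac * older
```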
The final output of the system is a set of four signals (the B-format). These signals simulate the B-format response that would have been recorded by a Soundfield Microphone when placed in the same listener position within the space, surrounded by the same sources.
Loudspeaker playback of B-format signals
Many references (including [3], [4], [5]) provide good explanations of the way that B-format can be played back over loudspeakers.
Basically, the technique simply feeds an array of speakers with a mix of the B-format signals. Generally, the speaker array must have some degree of symmetry (in fact the symmetry constraints can be very rigid if the decode is to achieve 'mathematically ideal' performance). The array should surround the listener, so that if speakers are placed in front, then some should also be placed behind, and if overhead speakers are used (to give an impression of sound elevation) then some speakers should also be placed below the listener.
At low frequencies (below about 300Hz), the goal of the decoder is to take a B-format input and re-produce the same B-format soundfield at the central listening point. At higher frequencies (above about 700Hz), the mixing of the B-format signals to the speakers is adjusted to compensate for the fact that the listener’s head disrupts the soundfield. This adjustment is made using simple shelf filters that vary the high frequency contribution of the W (omnidirectional) component of the B-format relative to the X, Y and Z components.
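A simplified, frequency-independent sketch of this decode follows. The shelf-filter stage is omitted and replaced with a single broadband W weight, and the normalisation shown is one of several conventions in use; none of this is Lake's implementation.

```python
import numpy as np

def decode_bformat(w, x, y, z, speaker_dirs, w_weight=np.sqrt(2.0)):
    """Produce one feed per speaker from B-format signals.
    speaker_dirs: unit vectors (sx, sy, sz) from listener to each speaker.
    w_weight is broadband here; the decoder described in the text varies
    the W contribution with frequency using shelf filters."""
    n = len(speaker_dirs)
    return [(w_weight * w + sx * x + sy * y + sz * z) / n
            for sx, sy, sz in speaker_dirs]

# Example: a horizontal square array (one speaker in each quadrant).
angles = np.deg2rad([45.0, 135.0, -135.0, -45.0])
square = [(np.cos(a), np.sin(a), 0.0) for a in angles]
```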
At Lake, we have implemented a B-format decoder using DSP techniques. It can drive an almost unlimited number of loudspeakers, and has been tested and demonstrated with an array of twelve.
B-format playback over headphones
More recently at Lake, a new DSP algorithm has been implemented to decode B-format signals for headphones. The first step in this process was to build a DSP function that filters the four B-format components, producing two outputs, in such a way that a static binaural presentation can be made of the B-format soundfield. The next step was to add a mixer that can rotate the X, Y, Z components of the soundfield prior to the binaural filters so that (in conjunction with a head-tracking device) the soundfield could be made to remain stable when the listener rotates his/her head.
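A minimal sketch of the rotation stage, assuming yaw-only tracking: the X, Y, Z components are counter-rotated by the tracked head angle using a 3×3 matrix before the static binaural filters, while W is direction-independent and passes through unchanged. The sign convention here is an assumption.

```python
import numpy as np

def rotate_soundfield(x, y, z, head_yaw):
    """Counter-rotate the directional components by the tracked head yaw
    (radians) about the vertical axis, keeping the field fixed in the room."""
    c, s = np.cos(-head_yaw), np.sin(-head_yaw)
    rot = np.array([[c,  -s,  0.0],
                    [s,   c,  0.0],
                    [0.0, 0.0, 1.0]])   # yaw about the vertical (Z) axis
    xr, yr, zr = rot @ np.vstack([x, y, z])
    return xr, yr, zr
```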
Tests so far have shown this technique to work very effectively. It has a number of distinct advantages over previous methods used to implement head-tracked animated spatial sound over headphones:
- The binaural filters (based on Head Related Transfer Functions) do not need to change in real-time. These filters are static.
- Head tracking is achieved by rotating the X, Y, Z signals using a 3×3 matrix. Because this is a purely numerical operation, the resolution of the system is effectively unlimited (so it does not even make sense to ask 'What is the angular resolution of the head-tracked decoder?').
- The HRTF data used is very compact, and can be combined with headphone EQ responses at run-time (at the time when the HRTF data is loaded from the disk into the DSP).
- Due to the compactness of the HRTF data, it is possible to very rapidly load new HRTF sets from disk, so that a listener might be able to select one HRTF set from a large collection, choosing the set that suits them best.
Conclusions
The system described in this paper has been built by Lake, and found to offer the following benefits for users wishing to create and play back virtual soundfields:
- Room responses can now be measured (using a soundfield microphone) or computed (using acoustic modelling software that simulates the soundfield format) in a format that is independent of the chosen playback method.
- The early part of the response (direct sound plus 1st order reflections) can be animated in real time (due to the ease with which B-format soundfields can be created in real time).
- Multiple sound sources can be combined to form a composite soundfield.
- The location and orientation of each source and listener can be controlled in real-time from a number of external interfaces, primarily via TCP/IP from a graphics workstation.
- Playback can be made over loudspeakers, to provide a 3-D soundfield to a listener sitting within the array.
- Playback can be made over headphones. Users can select an HRTF set that suits their own head-response. The HRTF data does not have to be embedded into the previously computed room response (as used to be the case in binaural Auralization systems).
- The equalisation for the headphones can also be selected at run-time.
The intention of the authors, when this system was originally specified, was to build a turn-key system enabling graphical simulation researchers to build the acoustic equivalent of their real-time 3-D graphics displays. In future papers, the authors hope to report on the experience of users who are applying this system in real applications.
References
[1] M. Kleiner, B.-I. Dalenbäck, P. Svensson, "Auralization – An Overview", JAES, Vol. 41, pp. 861-875, Nov 1993.
[2] M. Kleiner, "Auralization: a DSP approach", Sound and Video Contractor, Sept 20, 1992.
[3] M. Gerzon, "Surround Sound Psychoacoustics", Wireless World, Dec 1974, pp. 483-486.
[4] M. Gerzon, "Periphony: with-height sound reproduction", JAES, Vol. 21, pp. 2-10, 1973.
[5] D. Cooper, T. Shiga, "Discrete-matrix multichannel stereo", JAES, Vol. 20, pp. 346-360, 1972.
© 1994, David McGrath, Lake DSP.