by David McGrath, Lake DSP
Abstract
This paper presents an overview of the important issues that need to be addressed in a high quality audio simulation system for use in Virtual Reality. Some aspects of room acoustics are discussed, followed by a discussion of the methods used to present 3-D audio material to a subject, including headphones and multi-speaker playback methods. Computer synthesis of room acoustic responses is reviewed, with emphasis on its application in Virtual Reality. Finally, a number of example simulation systems, based on existing DSP hardware, are presented as a guide to the current state of the art.
Introduction
Acoustic environment simulation involves the presentation of audio material to a subject in a way that creates the impression that the subject is in a different environment. The experience will obviously be more realistic if the subject is also presented with other sensory (e.g. visual) input that matches the acoustic environment.
The simulation of acoustic environments is a technique that has been used for a number of years in the field of acoustics. The ability to listen to a simulated acoustic space has enabled researchers in acoustics to examine the characteristics of a virtual or real acoustic space via computer modelling. Initially, simulation was used as a powerful tool to test the accuracy of the computer model. As computer modelling techniques have been refined and proven, the quality and accuracy of simulation has improved to the point where it is now being used as a reliable means of predicting the characteristics of an acoustic space that does not exist in reality. In the field of acoustics, this method of simulation has become known as Auralization [1], being the auditory counterpart to visualisation.
Simulation of an acoustic environment involves one or more of the following functions:
- Processing an audio source input and presenting it to the subject through a number of loudspeakers (or headphones) with the intention of making the sound source appear to be located at a particular position in space. This is the most basic acoustic environment simulation.
- Processing multiple input audio sources in such a way that each source is independently located in space around the subject.
- Enhanced processing to simulate some aspects of the room acoustics, so that the user can acoustically sense the size of the room and the nature of the floor and wall coverings.
- A highly accurate simulation of the room acoustics, sufficient for the subject to make judgements about the quality of the simulated acoustic space (for example, to evaluate the suitability of the room for public speaking or musical performances).
- The capability for the subject to move (perhaps within a limited range) and turn his/her head so as to focus attention on some aspects of the sound source characteristics or room acoustics.
Part 2 of this paper reviews the characteristics of acoustic spaces and discusses which elements of a room response are important for different virtual reality applications. Part 3 discusses the processes involved in presenting audio material to a subject to achieve an impression of a synthetic or real acoustic space. The basic principles of 3-D sound localisation are reviewed, and some methods of audio playback are discussed. Part 4 describes some methods that are used in the computation of synthetic room responses for the purpose of simulating virtual spaces. Part 5 describes techniques that are currently used for playback of audio material with 3-D characteristics and the hardware that can be used to process dry audio material prior to playback.
Room characteristics
Before we can attempt to simulate the listening experience of a subject in a given acoustic space, it is important to understand which aspects of the listening experience we wish to recreate. Ignoring extraneous effects such as background noise, Figure 1 below shows the three main components of the audio that are important to the listener in a typical acoustic environment.
The sound arriving at the listener’s location in an acoustic space is composed of these three components:
- The direct sound, travelling from the sound source to the subject, allows the subject to localise the sound source. For natural sources up to about 15m away, the amplitude of the sound can provide cues to the distance of the source, by reference to the known typical amplitude of such natural sources.
- The early reflections in the acoustic space are important because their arrival times and directions of arrival provide the subject with information about the size of the room (conveyed by the spacing of these reflections in time) and the distance of the source from the subject (conveyed by the level of the direct sound relative to the early reflections).
- The reverberant part of the acoustic response gives the subject an impression of the type of wall, floor and ceiling materials, as well as the size of the acoustic space. The length of the reverberant tail of a room response defines the ‘reverberation time’ of the room (more precisely, the reverberation time is the time taken for the tail to decay to a level 60dB below the direct sound level; a sketch of estimating this from an impulse response follows below).
If any of these three elements of the room response is omitted or improperly simulated in an acoustic simulation system, the result will sound unnatural.
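As a minimal sketch of how the reverberation time mentioned above can be estimated from a measured (or synthesised) impulse response, the following Python function applies Schroeder's backward-integration method. The function and parameter names are illustrative only; practical measurement tools usually fit a line to a portion of the decay curve (for example from -5dB to -35dB) and extrapolate, rather than reading the -60dB point directly.

```python
import numpy as np

def rt60_schroeder(ir, fs, decay_db=60.0):
    """Estimate reverberation time from a room impulse response.

    Schroeder's backward integration of the squared response yields a
    smooth energy-decay curve; the reverberation time is read off as
    the time taken for that curve to fall by `decay_db` (60 dB).
    """
    energy = np.asarray(ir, dtype=float) ** 2
    # Reverse cumulative sum = energy remaining after each instant.
    edc = np.cumsum(energy[::-1])[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    below = np.nonzero(edc_db <= -decay_db)[0]
    if len(below) == 0:
        raise ValueError("response too short to reach the decay target")
    return below[0] / fs  # time in seconds
```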
When a subject is presented with a sound that is arriving from a specific direction, the human auditory system is capable of resolving the direction of arrival of the sound, based on a number of acoustic cues. Figure 2 illustrates a simple principle, namely that the use of two ears allows the left/right location of sounds to be estimated by a subject. This is based on the fact that the path length for sound arrivals at two ears will be different.
In reality, the methods used by the auditory system to resolve sound location are quite complex. The mechanisms used to locate sounds include the following [3]:
- For sounds above 1000Hz, the difference in arrival time at each ear is used to discriminate between left and right locations of the sound source (a simple geometric estimate of this time difference is sketched after this list).
- For sounds below 1000Hz, the difference in phase of the signal at each ear is important in resolving left/right location.
- For sounds below 200Hz, the human ear is unable to resolve the direction of arrival of the sound. This principle is exploited by subwoofers, which can be placed anywhere in a listening room, not necessarily near the high-frequency drivers, without confusing the listener.
- Above 1000Hz, the attenuation of higher frequencies due to shadowing by the head provides further cues to the subject.
- The shape of the pinnae (the outer ears) is important, particularly at higher frequencies, because the pinnae attenuate different frequencies in different ways depending on the front/back location and elevation of the sound. The ability to resolve front from back, and the elevation of a sound source, depends heavily on these very subtle pinna effects.
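The left/right timing cue above can be approximated geometrically. The sketch below uses Woodworth's classical spherical-head formula; the head radius is an assumed average value, and the function name is illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at room temperature
HEAD_RADIUS = 0.0875     # m; an assumed average head radius

def itd_woodworth(azimuth_deg):
    """Approximate interaural time difference for a spherical head.

    Woodworth's classical formula: ITD = (r / c) * (theta + sin(theta)),
    with theta the source azimuth measured from straight ahead.
    """
    theta = np.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + np.sin(theta))

# A source directly to one side (90 degrees) gives roughly 0.66 ms.
print(round(itd_woodworth(90.0) * 1e3, 2), "ms")
```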
The sound received in the ear canals of a subject can be measured (using small implanted microphones), and the relationship between the transmitted sound and this received sound can be analysed. The response from source to ear can generally be modelled as a linear filter with an impulse response less than about 2ms in length. Figure 3 shows the relative delay and attenuation characteristics of the filter responses for the left and right ears.
The signal measured at the ear-canal in response to a source signal that is an impulse is known as the Head Related Transfer Function (HRTF). The HRTF is an impulse response that varies as a function of azimuth and elevation of the source, and also varies between different subjects.
If we generate a test signal and measure the binaural response to that signal, then we can compute the HRTF of the head, for that particular location of the sound source relative to the subject’s head position. This HRTF measurement procedure is generally carried out in an anechoic environment, so that echoes within the room do not come into play. Alternatively, for measurements made within a normal room, only the first 2 or 3 ms of the measured response should be used, as any subsequent elements in the measured response will be due to extraneous echoes.
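One way to carry out such a measurement in software is sketched below: the in-ear recording is deconvolved from the known test signal in the frequency domain, and only the first few milliseconds of the result are kept so that room echoes are discarded. All names here are illustrative, and the simple regularised division assumes a broadband test signal.

```python
import numpy as np

def measure_hrtf(test_signal, ear_recording, fs, keep_ms=3.0, eps=1e-8):
    """Recover an HRTF impulse response from a known test signal.

    The spectrum of the in-ear recording is divided by the spectrum of
    the test signal (a regularised division, stabilised by `eps`), and
    only the first few milliseconds of the result are kept so that any
    room echoes in a non-anechoic measurement are discarded.
    """
    n = len(ear_recording)
    X = np.fft.rfft(test_signal, n)
    Y = np.fft.rfft(ear_recording, n)
    h = np.fft.irfft(Y * np.conj(X) / (np.abs(X) ** 2 + eps), n)
    keep = int(fs * keep_ms / 1000.0)
    return h[:keep]
```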
Playback of binaural recordings
Recreation of an acoustic experience can be achieved by recording the sound that is incident on a subject’s ear-canal and then replaying that recording at a later time over headphones. Figure 4 illustrates this process.
The head used to record the binaural material (the left-hand head in Figure 4) may be a real person's head with small microphones implanted in the ear canals, or a dummy head such as the KEMAR.
The 3-D experience that is reproduced by this technique includes more than just the direct arrival of the source sound. All aspects of the room acoustics are contained in this binaural recording, including the direction of arrival of early reflections and the fine details of the reverberant tail of the room response.
Simulation of binaural effects
Binaural simulation is generally carried out using dry source material. Dry recordings are made in an anechoic chamber, to ensure that they do not contain any unwanted echoes.
Dry source material can then be replayed to a subject, using the appropriate HRTF filters, to create the illusion that the source audio is originating from a particular direction. The HRTF filtering is achieved by simply convolving the dry audio signal with the pair of HRTF responses (one HRTF filter for each channel of the headphone). Figure 5 shows this procedure.
The HRTF(l) and HRTF(r) responses are commonly expressed in terms of time-domain impulse responses. With audio signals sampled at a typical rate of 48kHz, these HRTF responses are usually around 128 samples in length (corresponding to about 3ms of impulse response). The convolution process is defined by the following equations:

$$y_l(n) = \sum_{k=0}^{K-1} h_l(k)\,x(n-k) \qquad y_r(n) = \sum_{k=0}^{K-1} h_r(k)\,x(n-k)$$

where x(n) is the input audio stream (a monaural signal), h_l(k) and h_r(k) are the K-sample left and right HRTF impulse responses, and y_l(n) and y_r(n) form the stereo (or binaural) output signal to be sent to the headphones.
Practically speaking, this binaural simulation system could be implemented on a DSP processor, taking the input data in real time and producing the (stereo) output data by computing the convolutions defined above.
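As a concrete (offline) illustration of these equations, the following sketch convolves a dry mono signal with an HRTF pair; the function name is illustrative, and a real-time DSP implementation would perform the same arithmetic block by block.

```python
import numpy as np

def binaural_render(x, hrtf_left, hrtf_right):
    """Convolve a dry mono signal with a pair of HRTF impulse responses.

    A direct implementation of the two equations above; a real-time
    DSP system would perform the same arithmetic block by block.
    """
    y_left = np.convolve(x, hrtf_left)
    y_right = np.convolve(x, hrtf_right)
    return np.stack([y_left, y_right], axis=-1)  # (samples, 2) stereo
```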
Typical binaural simulation systems store a large number of pre-measured HRTFs and can switch rapidly from one to another. For any given location of the source audio, an HRTF can be retrieved from this stored table, or computed by interpolating between the nearest stored neighbours.
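A sketch of such a table lookup follows, assuming for simplicity a table of HRTF pairs measured at equally spaced azimuths; plain time-domain cross-fading is used here as a crude stand-in for more careful interpolation schemes (which often interpolate minimum-phase responses and interaural delays separately).

```python
import numpy as np

def lookup_hrtf(table, azimuth_deg):
    """Fetch an HRTF pair for an arbitrary azimuth from a stored table.

    `table` has shape (N, 2, taps): N left/right HRTF pairs measured at
    equally spaced azimuths starting from 0 degrees. Directions between
    measurements are approximated by linearly cross-fading the two
    nearest stored responses.
    """
    n = len(table)
    pos = (azimuth_deg % 360.0) / (360.0 / n)
    lo = int(pos) % n
    hi = (lo + 1) % n
    frac = pos - int(pos)
    return (1.0 - frac) * table[lo] + frac * table[hi]
```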
The binaural simulation method described above attempts to create the illusion of a sound source that is located some distance from the subject, in a particular direction. The short HRTF filters mimic the propagation of the source sound to the subject without including any room acoustic characteristics.
The HRTF measurement method can also be used in a reverberant space (instead of an anechoic chamber) to measure a pair of filter responses (for the left and right ears). In this case, the response being measured is not a pair of HRTFs but a binaural room response. Simulated binaural playback can then be achieved by the same method as shown in Figure 5, giving the subject the illusion that the source audio is being transmitted within the same acoustic space in which the binaural response was measured.
Deficiencies in binaural playback
Some deficiencies in binaural playback have been reported in the past; these include:
- Some subjects have reported that the sound they are listening to is located very close to their head, or sometimes inside their head. This is particularly common when the binaural recording (or simulation) is used to present a virtual sound source that is directly in front of or directly behind the subject. Recently, better measurement techniques have overcome most of these difficulties.
- Some subjects report that the elevation of the sound source appears to be slightly higher than intended.
- When the subject turns his/her head, the sound source (or sources) moves with the head instead of remaining fixed in space.
- The use of headphones may restrict the subject’s movements.
Simulation using multi-speaker playback
A different approach to acoustic environment simulation involves a recreation of the 3-D sound field around the subject. This is most often achieved through the use of a large number of loudspeakers placed in an array around the subject.
Generally, a minimum of four loudspeakers are required to achieve a convincing 3-D audio experience, while some researchers are using twenty or more speakers in an anechoic chamber to recreate acoustic environments with much greater precision.
The main advantages of multi-speaker playback are:
- There is no dependence on the individual subject’s HRTF, since the sound field is created without any reference to individual listeners.
- The subject is free to turn their head, and even move about within a limited range.
- In some cases, more than one subject can listen to the system simultaneously.
Whilst the playback through multiple loudspeakers may be more convincing for many subjects than headphone playback, the processing required for simulation of the 3-D audio will be more expensive (simply because more channels of audio output are required). In addition, making 3-D recordings in a real environment, for later playback over multiple loudspeakers, is a difficult procedure.
High-precision simulation of acoustic spaces, for the purpose of evaluating acoustics, is almost always performed using either binaural playback or a large number (>10) of loudspeakers in an anechoic chamber. However, far fewer speakers (4 or 6) can still provide a fairly realistic 3-D experience, and active research in this area is progressing rapidly. Such 4 or 6 speaker systems also have application in Virtual Reality, and can be constructed with the speakers close to the subject for compactness. Larger numbers of loudspeakers provide a larger ‘sweet-spot’: the region within which the subject (or subjects) can move without compromising the fidelity of the 3-D simulation.
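As a simple illustration of how a source direction can be rendered with a small speaker array, the sketch below computes equal-power pairwise panning gains for a hypothetical four-speaker layout. This is a deliberately simplistic scheme, not one of the high-precision methods discussed above; the layout and names are assumptions for the example.

```python
import numpy as np

# Hypothetical four-speaker layout: one speaker in each quadrant.
SPEAKER_AZIMUTHS = [45.0, 135.0, 225.0, 315.0]  # degrees

def quad_pan_gains(source_azimuth_deg):
    """Equal-power gains for panning a source between adjacent speakers.

    Finds the pair of speakers bracketing the source direction and
    splits the signal between them with sine/cosine weights, so the
    total radiated power is independent of direction.
    """
    rel = (source_azimuth_deg - 45.0) % 360.0
    pair = int(rel // 90.0)          # index of the first speaker of the pair
    frac = (rel % 90.0) / 90.0       # position of the source within the pair
    gains = np.zeros(len(SPEAKER_AZIMUTHS))
    gains[pair] = np.cos(frac * np.pi / 2.0)
    gains[(pair + 1) % 4] = np.sin(frac * np.pi / 2.0)
    return gains
```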
Head tracking and animated effects
Improved playback through headphones can be achieved through the use of head tracking. This technique makes use of continuous measurements of the orientation of a subject’s head, and adapts the audio signals being fed to the headphones appropriately.
From the simplified model presented in Section 3.1 (Figure 2), it is clear that the use of two ears should allow a subject to easily discriminate between left and right sound source locations. However, discriminating between front and back, and between high and low sound sources, is generally only possible if head movement is permitted. Whilst multiple speaker playback methods solve this problem to a large degree, there are still many applications where headphone playback is preferable, and head tracking can then be used as a valuable tool for improving the quality of the 3-D playback.
The simplest form of head-tracked binaural system is one which simply simulates anechoic HRTFs, and changes the HRTF functions rapidly in response to the subject's head movements. This HRTF switching can be achieved through a lookup table, with interpolation used to resolve angles that are not represented in the HRTF table.
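A practical detail of such switching is that replacing one HRTF pair with another abruptly can produce audible clicks. The sketch below illustrates one common remedy: rendering a block through both the outgoing and incoming filter pairs and cross-fading between the two renderings. The names and block handling are illustrative assumptions; a real renderer would also carry convolution overlap state between blocks.

```python
import numpy as np

def crossfade_hrtf_switch(x_block, hrtf_old, hrtf_new):
    """Switch HRTF pairs on a block boundary without audible clicks.

    The block is rendered through both the outgoing and incoming
    (left, right) filter pairs and the two renderings are linearly
    cross-faded. Convolution tails are truncated for brevity.
    """
    n = len(x_block)
    fade = np.linspace(0.0, 1.0, n)[:, None]   # 0 -> old pair, 1 -> new pair
    old = np.stack([np.convolve(x_block, h)[:n] for h in hrtf_old], axis=-1)
    new = np.stack([np.convolve(x_block, h)[:n] for h in hrtf_new], axis=-1)
    return (1.0 - fade) * old + fade * new
```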
Simulation of room acoustics over headphones with head tracking is more difficult, because the directions of arrival of the early reflections must also track the subject's head movements for the result to sound realistic. Many researchers believe that the echoes in the reverberant tail of the room response are generally so diffuse that this part of the response need not be tracked with the subject's head movements.
An important characteristic of any head-tracked playback system is the delay from the subject's head movement to the corresponding change in the audio at the headphones. If this delay is excessive, the subject can experience a form of virtual motion sickness and general disorientation.
Synthetic room calculations
The propagation of sound from a source to a subject within an acoustic space can be modelled by a computer, using a variety of methods. The methods used today fall into two broad categories:
- Simple propagation models that mimic the direct sound, and sometimes also the first-order reflections of the sound from a small number of wall, floor and ceiling surfaces (a first-order sketch of this approach appears after this list). Each sound arrival (whether direct or reflected) is characterised by its direction of arrival and its level of attenuation (due to either distance of sound propagation or reflective surface material properties). These simple models are often used in real-time systems where the source and/or subject are moving, and also where head movement of the subject is permitted (in the case of headphone playback).
- More complex models that attempt to estimate the room acoustics to a high level of accuracy. These methods typically involve either ray tracing or image method techniques (or both, see [6]). The output of these programs is usually in the form of a binaural room response, based on a fixed position and orientation of the subject (for playback over headphones). Alternatively, the room response may be computed as a number of response filters for use over a multi-speaker playback system, which will then allow the subject some degree of movement and head rotation.
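A minimal sketch of the first of these two approaches is shown below: a first-order image-method model of a rectangular (‘shoebox’) room, returning the delay, amplitude and direction of arrival of the direct path and each of the six wall reflections. The uniform absorption coefficient and all names are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def first_order_images(source, listener, room_dims, absorption=0.3):
    """First-order image-method model of a rectangular ('shoebox') room.

    The source is mirrored in each of the six walls (room corner at the
    origin). For the direct path and each reflection, the function
    returns the arrival delay in seconds, an amplitude combining
    1/distance spreading with a uniform wall absorption, and the unit
    direction of arrival at the listener.
    """
    source = np.asarray(source, dtype=float)
    listener = np.asarray(listener, dtype=float)
    paths = [(source, 1.0)]                          # direct path, no wall loss
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            image = source.copy()
            image[axis] = 2.0 * wall - image[axis]   # mirror in this wall
            paths.append((image, 1.0 - absorption))
    arrivals = []
    for position, reflection_gain in paths:
        vec = position - listener
        dist = np.linalg.norm(vec)
        arrivals.append((dist / SPEED_OF_SOUND,      # delay
                         reflection_gain / dist,     # amplitude
                         vec / dist))                # direction of arrival
    return arrivals
```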
The choice of modelling method used in a particular application will depend on the desired accuracy of the model and the animation capabilities required, as well as the techniques used for implementing the actual simulation for playback to the subject.
Example simulation systems
The acoustic simulation system takes the room response (either from a measured real room, or from a computer model) and convolves it with the dry source material. Most acoustic simulation systems fall into one of the following categories:
- Many researchers choose to perform the long convolution computations on a general-purpose CPU (such as an engineering workstation). This usually involves many hours of computation to produce a small amount of output audio (an offline FFT-based approach is sketched after this list).
- For applications requiring fast animation, special hardware is used to perform a small number of HRTF convolutions to model the direct sound arrival plus a number of early reflections, and room reverberation. An example of this approach can be seen in Lake DSP’s AniScape and MultiScape applications.
- Real-time convolution with very long room responses can be achieved using very high performance processors, such as the Huron digital audio convolution workstation built by Lake DSP. The Huron is capable of computing both left and right binaural responses with a length of over 5 seconds each. This corresponds to a pair of convolution processors with over 278400 taps each.
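The sketch below illustrates the offline approach mentioned in the first item above: FFT-based (overlap-add) convolution of dry material with a long binaural room response, which is vastly faster than direct convolution at these lengths. Real-time engines require more specialised low-latency techniques; the function name here is illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural_room(x, room_ir_left, room_ir_right):
    """Offline convolution of dry material with a long binaural response.

    At 48kHz a response of over five seconds is more than 240,000 taps
    per ear, so direct convolution is impractical; FFT-based
    (overlap-add) convolution performs the same computation efficiently.
    """
    y_left = fftconvolve(x, room_ir_left)
    y_right = fftconvolve(x, room_ir_right)
    return np.stack([y_left, y_right], axis=-1)
```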
Long convolution engines such as Lake DSP's Huron are being improved to support faster animation functions.
The goal for these simulation systems is to be able to provide a realistic sounding room simulation whilst allowing rapid animation of source and subject location and room characteristics.
References
[1] M. Kleiner, B. Dalenback, P. Svensson, “Auralization – An Overview,” J. Audio Eng. Soc., Vol. 41, No. 11, November 1993.
[2] E.M. Wenzel, “Localization in Virtual Acoustic Displays,” Presence, Vol. 1, No. 1, Winter 1992.
[3] J. Blauert, Spatial Hearing, The MIT Press, Cambridge, Mass., USA, 1983.
[4] Kleiner, “Real Audio for Virtual Environments,” Virtual Reality Systems, Vol. 1, No. 3, Spring 1994.
[5] Special issue on Auralization, J. Audio Eng. Soc., Vol. 41, No. 11, November 1993.
[6] B. Dalenback, “CATT-Acoustic,” CATT, Svanebacksgatan 9B, S-41452 Gothenburg, Sweden.
c. 1995, David McGrath, Lake DSP.
From Lake DSP site. Republished with permission.