3-D Audio Primer.

by Aureal Corporation

This document presents an introduction to the general concepts and performance of three-dimensional audio technology. Several audio technology categories are defined with the purpose of creating a common understanding of “better-than-stereo” audio playback methods.

Contents:

1. Introduction to 3-D Audio
2. What is and What isn’t 3-D Audio
3. The Basics of Acoustics
4. The Basics of Human Hearing
5. How A3D Works
6. Advantages of A3D As Illustrated by Research
7. Summary

1. INTRODUCTION TO 3-D AUDIO

Since the late 1970s, several audio technologies have been developed to advance the state of the art in audio reproduction beyond stereo. Most of them are focused on increasing the dimensionality of sound playback beyond the one-dimensional stereo sound field created by conventional playback on a left/right speaker pair. Furthermore, the advent of digital audio signal processing has enabled interactive audio experiences: similar to live music, sounds are created on-the-fly based on user input (for example in video games), rather than being based on playback of a pre-recorded soundtrack (as in movies).

A3D from Aureal is a digital audio technology that has been developed to provide maximum performance in both areas of dimensionality and interactivity. A3D technology is based on the principles of binaural human hearing. Binaural means that we hear using two ears. From the two signals that our ears perceive, we can extract enough information to tell where a sound is located in the three dimensional space around us. The functioning of the human hearing system has been researched successfully over the last two decades by psycho-acoustic researchers around the world. They have provided us with the necessary findings and understanding that today’s A3D audio systems are based on.

To put it in simpler terms: since we can hear three-dimensionally in the real world using just two ears, it must be possible to achieve the same effect from just two speakers or a set of headphones. On this basic assumption, 3D audio products have been successfully built.

This document starts by explaining how different forms of audio processing compare against each other (“What is and What isn’t 3D Audio”). It then focuses on the concepts of acoustics and human hearing that A3D is based on, and details the digital audio building blocks that make up an A3D system.

2. WHAT IS AND WHAT ISN’T 3-D AUDIO

As mentioned in the introduction, there are two key pieces to a 3D audio system: 3D positioning and interactivity.

A full-featured 3D audio system provides the ability:

  • To define a three-dimensional space
  • To position multiple sound sources and a listener in that 3D space
  • To do all processing in real-time, or interactively, for example based on the user’s inputs in a video game (the opposite of interactive audio playback is a pre-recorded soundtrack).

Certain technologies, namely stereo extension and surround sound, offer some aspects of 3D positioning or interactivity. They are discussed here to explain what applications they are geared towards, and why they are not considered to be part of a new category of technologies, called Positional 3D Audio. This new category combines full 3D positioning and interactivity to offer a new kind of audio listening experience. A3D is the industry leading positional 3D audio technology. A comparison chart of different audio playback methods is included to help differentiate the features of each technology.

2.1 Extended Stereo

Extended stereo technologies and products process an existing stereo (two channel) soundtrack to add spaciousness and to make it appear to originate from outside the left/right speaker locations.

These products are particularly useful to restore stereo performance to low-end PC multimedia sound systems that typically contain low-quality speakers that are placed very closely together. Extended stereo effects can be achieved via various, fairly straight-forward methods. Additionally, their performance is often evaluated based on subjective criteria such as listening tests. For those reasons it is somewhat difficult to compare products in this area. Some of the differentiators include:

  • Size of the listening area (area in which the listener has to be placed with respect to speakers to hear the effect, also called sweet spot)
  • Amount of spreading of stereo images (more spreading, or user variable spreading, is better)
  • Amount of coloring (tonal changes) of audio content introduced by processing (no coloring is best)
  • Amount of stereo left/right panning information that is lost during processing (no panning loss is best)
  • Ability to achieve effect on headphones as well as speakers

Although sometimes marketed under the name “3D Sound” or “3D stereo”, extended stereo technologies are not considered to be 3D audio technologies, because they only offer passive spreading of an existing soundtrack, and not interactive 3D positioning of individual sounds.

2.2 Surround Sound

Surround sound technologies and products create a larger-than-stereo sound stage by playing back multi-channel Dolby® or MPEG surround sound soundtracks on multi-speaker setups. Surround sound is based on using audio compression technology (for example Dolby ProLogic® or Digital AC-3®) to encode and deliver a multi-channel soundtrack, and audio decompression technology to decode the soundtrack for delivery on a surround sound 5-speaker setup. Additionally, virtual surround sound systems use 3D audio technology to create the illusion of five speakers from a regular set of stereo speakers, thereby enabling a surround sound listening experience without the need for a five speaker setup. Aureal’s A3D Surround is a Virtual Surround technology.

Because they are pre-recorded, surround sound soundtracks are most suitable for movies. They are non-interactive, and therefore not particularly useful in interactive software such as video games and Web Sites. Because of their limitations when it comes to interactivity, surround sound systems are not considered for the interactive 3D audio category.

Ways to evaluate the performance of a surround sound system:

Physical Speakers
  • Presentation accuracy of individual channels, clarity of spatial imaging (size of sound stage)
Virtual Speakers
  • Listening comparison to a physical 5-speaker setup (accuracy of virtual to physical speaker mapping, as well as accuracy of reproduction of original soundtrack mix-down)
  • Amount of audio coloring (tonal changes) introduced by processing (no coloring is best)
Both Physical and Virtual Setups
  • Size of the listening area (area in which the listener has to be placed with respect to speakers to hear the effect, also called sweet spot)

2.3 Positional 3D Audio (A3D Interactive)

Positional 3D audio (a.k.a. interactive 3D audio) allows for interactive, on-the-fly positioning of sounds anywhere in the three-dimensional space surrounding a listener. Support for such technologies can be incorporated into software titles such as video games to create a natural, immersive, and interactive audio environment that closely approximates a real-life listening experience. This category can be described as the audio equivalent of 3D graphics. Aureal’s A3D Interactive is a positional 3D audio technology.

3D audio technologies create a more life-like listening experience by replicating the 3D audio cues that the ears hear in the real world. The following two sections, “The Basics of Acoustics” and “The Basics of Human Hearing”, explain what those listening cues are and how they can be reproduced. For maximum flexibility and usability, a 3D audio algorithm should support all possible audio playback environments: headphones, stereo speakers and multi-speaker (surround or quad) arrays. In the case of stereo speakers or headphones, more demands are placed on the algorithm and fewer demands on the end-user, because stereo setups are most common and easy to set up. Multi-speaker arrays require less complex 3D audio rendering algorithms, but put more demands on the end-user’s playback setup (cost and setup complexity of extra amplifiers and speakers). In both cases, the desired 3D effects are controlled by software applications which position 3D sound sources and listeners via an API (Application Programming Interface) such as Microsoft’s DirectSound3D API for the Windows® platform, or the VRML 2.0 standard.

Ways to evaluate the performance of a 3D interactive sound system:

  • Listening tests to evaluate how well sounds are projected in all three dimensions (left/right, up/down, front/back), and how much realism they provide
  • Number and quality of software titles that take advantage of 3D technology
  • Number of concurrent 3D sound sources the system provides at a given quality or sample rate
  • Ability to achieve effect on headphones as well as speakers
  • Size of the listening area (area in which the listener has to be placed with respect to speakers to hear the effect, also called sweet spot)
  • Amount of coloring (tonal changes) of audio content introduced by processing (no coloring is best)

Table1.jpeg

Table 1 – Comparison chart of different audio playback methods.

2.4 Headphone Versus Stereo Speaker Playback Devices

In terms of 3D sound processing, these two playback media offer different challenges and advantages. Headphones have the advantage of always being in a known position with respect to the listener’s ears. This means that two separate audio signals (left and right) are guaranteed to go directly into the two ears of a listener. With speakers, this is only the case if the listener is sitting in the ideal listening position, the sweet spot, and processing methods are employed to ensure that the left ear does not receive any audio content from the right speaker, and vice versa (cross-talk cancellation).

3. THE BASICS OF ACOUSTICS

Human beings extract a lot of information about their environment using their ears. In order to understand what information can be retrieved from sound, and how exactly it is done, we need to look at how sounds are perceived in the real world. To do so, it is useful to break the acoustics of a real world environment into three components: the sound source, the acoustic environment, and the listener:

primer1.gif

Figure 1 – Typical soundfield with a source, environment and listener.

  • The sound source: this is an object in the world that emits sound waves. Examples are anything that makes sound – cars, humans, birds, closing doors, and so on. Sound waves get created through a variety of mechanical processes. Once created, the waves usually get radiated in a certain direction. For example, a mouth radiates more sound energy in the direction that the face is pointing than to the side of the face.
  • The acoustic environment: once a sound wave has been emitted, it travels through an environment where several things can happen to it: it gets absorbed by the air (high-frequency waves more so than low-frequency ones, with the amount of absorption depending on factors like wind and air humidity); it can travel directly to a listener (direct path), bounce off an object once before it reaches the listener (first order reflected path), bounce twice (second order reflected path), and so on; each time a sound reflects off an object, the material that the object is made of has an effect on how much each frequency component of the sound wave gets absorbed, and how much gets reflected back into the environment; sounds can also pass through objects such as water or walls; finally, environment geometry like corners, edges, and small openings has complex effects on the physics of sound waves (diffraction, scattering).
  • The listener: this is a sound receiving object, typically a “pair of ears”. The listener uses acoustic cues to interpret the sound waves that arrive at the ears, and to extract information about the sound sources and the environment.

4. THE BASICS OF HUMAN HEARING

As explained above, people can be considered sound receiving objects in an environment. We have an auditory sensing system consisting of two ears and a brain. Additionally, very low frequency sounds can be sensed through the human body. The brain uses a number of cues that are embedded in the two sound signals it receives from the two ears to learn about the sounds and their environment. Most people are unaware that the effects described in the following sections greatly impact our continuous perception of reality, every day of our lives. On the other hand, there are certain people, for example non-sighted people, who are very much aware of these effects, because they rely heavily on their ears for querying and navigating their surroundings.

4.1 Primary Localization Cues – IID and ITD

The two primary localization cues are called interaural intensity difference (IID) and interaural time difference (ITD). IID refers to the fact that a sound is louder at the ear that it is closer to, because the sound’s intensity at that ear will be higher than the intensity at the other ear, which is not only further away, but usually receives a signal that has been shadowed by the listener’s head (see fig. 2). ITD means that a sound will arrive earlier at one ear than the other (unless it is located at exactly the same distance from each ear – for example directly in front). If it arrives at the left ear first, the brain knows that the sound is somewhere to the left (see fig. 3).

primer2.gif

Figure 2 – Illustration of IID.

primer3.gif

Figure 3 – Illustration of ITD.

The combination of these two cues allows the brain to narrow the position of an individual sound source to somewhere on a cone centered on the line drawn between the listener’s ears (see fig. 4).

primer4.gif

Figure 4 – ITD Cone.
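
As a rough illustration of the ITD cue, the arrival-time difference can be estimated from the extra distance a sound travels to reach the far ear. The sketch below is a simplification and not a description of how A3D computes anything: it assumes a distant source, ignores the shape and shadowing of the head, and borrows the approximate ear spacing (15 centimeters) and speed of sound (330 meters per second) quoted in section 4.2. The function name and the azimuth convention (0 degrees straight ahead, positive to the right) are assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 330.0  # meters per second, approximate figure used in this primer
EAR_SPACING = 0.15      # meters, approximate distance between the two ears

def interaural_time_difference(azimuth_deg):
    """Estimate the ITD in seconds for a distant source.

    azimuth_deg: 0 is straight ahead, +90 is directly to the right.
    A positive result means the sound reaches the right ear first.
    """
    # Extra distance the wavefront travels to reach the far ear.
    extra_path = EAR_SPACING * math.sin(math.radians(azimuth_deg))
    return extra_path / SPEED_OF_SOUND

for angle in (0, 30, 90, 150, -90):
    itd_us = interaural_time_difference(angle) * 1e6
    print(f"azimuth {angle:+4d} deg -> ITD {itd_us:+7.1f} microseconds")
```

In this simple model a source at 30 degrees (front right) and one at 150 degrees (rear right) produce the same time difference, which is one way to see why ITD alone only narrows a source down to a cone.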

4.2 The Outer Ear Structure – Pinna

Before a sound wave gets to the ear drum, it passes through the outer ear structure, called the pinna. The pinna accentuates or suppresses mid- and high-frequency energy (see fig. 5) of a sound wave to various degrees, depending on the angle at which the sound wave hits the pinna (see fig. 6). This means that the two pinnae act as variable filters that affect every sound that passes through them. The brain knows how to figure out the exact location of a sound in space by receiving a signal that has been filtered in a way that is unique to the sound source’s position relative to the listener.

primer5.gif

Figure 5 – Spectrum differences between original and pinna.

primer6.gif

Figure 6 – Pinnae frequency modulation sound source and pinna reception at varying elevations.

The pinnae are the key to accurately localizing sounds in space. However, since the outer ear and its folds are on the scale of a few centimeters, only sound waves with wavelengths in the centimeter range or smaller can be affected by the pinna. In addition, the two ears are about 15 centimeters apart, so even IID and ITD cues are greatly reduced for wavelengths bigger than that. For example, a 3.3 kHz sound signal oscillates 3300 times per second, while sound travels at about 330 meters per second. The wavelength is therefore about 330/3300 = 0.1 meters, or 10 centimeters. This means that a sound at 3300 Hz lies in the area where primary cues are still noticeable, but pinna cues start to be diminished. In general, the higher the frequency of a sound, the shorter its wavelength, and the better it can be localized. This phenomenon can be verified by placing two speakers, a sub-woofer and a high-frequency tweeter, in a room and playing music through them. With closed eyes you will be able to immediately tell where the tweeter is located; the sub-woofer, however, will sound like it is “coming from everywhere”.
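
The wavelength arithmetic above can be repeated for other frequencies with a few lines of code. This is only a worked example of the relation wavelength = speed / frequency, using the same approximate speed of sound of 330 meters per second as the text; the function name is made up for illustration.

```python
SPEED_OF_SOUND = 330.0  # meters per second, approximate value used in the text

def wavelength_m(frequency_hz):
    """Return the wavelength in meters of a sound at the given frequency."""
    return SPEED_OF_SOUND / frequency_hz

for freq in (100, 1000, 3300, 10000):
    print(f"{freq:>6} Hz -> {wavelength_m(freq) * 100:7.1f} cm")
```

At 100 Hz the wavelength is several meters, far larger than the head, while at 10 kHz it is a few centimeters, on the scale of the pinna.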

4.3 Propagation Effects, Range Cues, and Reflections

Many things happen to a sound as it travels through an environment before it is received by a listener. All of these effects allow us to learn more about what we are hearing and what kind of environment we are in:

  • A somewhat muffled, quiet sound is likely off in the distance (see fig. 7).
  • If it is heavily muffled, we might be in an enclosed space, listening through glass, or other wall materials.
  • The effect of sound reflections in an environment is very important, because we are able to hear the difference in time of arrival and location between the direct path signal, first-order, and n-th order reflections (see fig. 8). The reflections give us a way to further pin-point a sound source’s location, as well as the size, shape and type of room or environment that we are in (people with very “good ears” are able to exactly locate a wall, or tell the difference between an open or closed door, simply by listening to reflections). While humans are capable of individually perceiving first order reflections, second and higher order reflections usually combine to form what are called late field reflections, or reverb.

primer7.gif

Figure 7 – Source attenuation and absorption due to range (listener-source distance).

primer8.gif

Figure 8 – Direct path, first and second order reflections in a typical room.
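
To make the difference in arrival time between the direct path and a first order reflection concrete, the sketch below uses the textbook mirror-image construction for a single flat wall. It is an illustration under simplifying assumptions (two dimensions, one perfectly flat wall, a speed of sound of about 330 meters per second), not a description of any A3D algorithm; the function name and the example positions are hypothetical.

```python
import math

SPEED_OF_SOUND = 330.0  # meters per second, approximate

def direct_and_reflected_delay(source, listener, wall_x):
    """Return (direct_s, reflected_s) travel times in seconds.

    source, listener: (x, y) positions in meters, both on the same side
    of a flat wall that lies along the vertical line x = wall_x.
    """
    # Mirror the source across the wall; the reflected path has the same
    # length as the straight line from this image source to the listener.
    image = (2.0 * wall_x - source[0], source[1])
    direct = math.dist(source, listener)
    reflected = math.dist(image, listener)
    return direct / SPEED_OF_SOUND, reflected / SPEED_OF_SOUND

direct_s, reflected_s = direct_and_reflected_delay((1.0, 0.0), (4.0, 0.0), wall_x=0.0)
print(f"direct path:      {direct_s * 1000:.2f} ms")
print(f"first reflection: {reflected_s * 1000:.2f} ms")
print(f"reflection lags by {(reflected_s - direct_s) * 1000:.2f} ms")
```

The reflected path is always at least as long as the direct one, so the reflection arrives later and from the direction of the wall, which is the kind of cue the text describes.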

5. HOW A3D WORKS

A 3D audio system aims to digitally reproduce a realistic sound field. To achieve the desired effect a system needs to be able to re-create portions or all of the listening cues discussed in the previous chapter: IID, ITD, outer ear effects, and so on. A typical first step to building such a system is to capture the listening cues by analyzing what happens to a single sound as it arrives at a listener from different angles. Once captured, the cues are synthesized in a computer simulation for verification.

5.1 What is an HRTF?

The majority of 3D audio technologies are at some level based on the concept of HRTFs, or Head-Related Transfer Functions. An HRTF can be thought of as a set of two audio filters (one for each ear) that contains all the listening cues that are applied to a sound as it travels from the sound’s origin (its source, or position in space), through the environment, and arrives at the listener’s ear drums. The filters change depending on the direction from which the sound arrives at the listener. The level of HRTF complexity necessary to create the illusion of 3D realistic hearing is subject to considerable discussion and varies greatly across technologies.

HRTF Analysis

The most common method of measuring the HRTF of an individual is to place tiny probe microphones inside a listener’s left and right ear canals, place a speaker at a known location relative to the listener, play a known signal through that speaker, and record the microphone signals. Comparing the recorded signals with the original signal yields an impulse response, which is a single filter in the HRTF set (see fig. 9). After moving the speaker to a new location, the process is repeated until an entire, spherical map of filter sets has been devised.

primer9.gif

Figure 9 – Combining speaker output and microphone input to compute impulse response.
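
The idea of combining the speaker output with the microphone input can be sketched as a deconvolution: if the recording is the played signal convolved with an unknown impulse response, that response can be solved for sample by sample. The example below is a toy under idealized assumptions (no noise, a played signal whose first sample is non-zero, a short response); real measurement systems use more robust techniques, and nothing here is taken from Aureal's measurement procedure. All names and sample values are hypothetical.

```python
def convolve(signal, impulse_response):
    """Plain discrete convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def estimate_impulse_response(played, recorded, length):
    """Recover `length` samples of h such that recorded = played convolved with h.

    Solves the convolution equation one output sample at a time.
    Requires played[0] != 0 and a noise-free recording.
    """
    h = []
    for n in range(length):
        acc = recorded[n]
        # Subtract the contribution of response samples already found.
        for k in range(1, min(n, len(played) - 1) + 1):
            acc -= played[k] * h[n - k]
        h.append(acc / played[0])
    return h

# Hypothetical values for illustration only.
test_signal = [1.0, 0.5, -0.25, 0.125]
true_response = [0.0, 0.8, 0.3, -0.1]  # stands in for one ear's response
microphone = convolve(test_signal, true_response)

estimate = estimate_impulse_response(test_signal, microphone, len(true_response))
print([round(x, 6) for x in estimate])
```

Repeating such a measurement for each ear and each speaker position would build up the map of filter pairs described above.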

Every individual has a unique set of HRTFs, also called an ear print. However, HRTFs are interchangeable, and the HRTF of a person that can localize well in the real world will let most people localize well in a simulated world. While generic, interchangeable HRTFs are suitable for general applications such as video conferencing or games, individualized HRTFs are useful for performance critical applications of binaural audio, such as jet fighter cockpit threat warning systems, or air traffic control systems.

HRTF Synthesis

Once an HRTF has been devised, real-time DSP (digital signal processing) software and algorithms are designed. This software has to be able to pick out the critical (psycho-acoustically relevant) features of a filter and apply them in real-time to an incoming audio signal to spatialize it. The system works correctly if a listener cannot tell the difference between listening to a sound over the speaker setup from the analysis process above (the speaker is in a specific position), and the same sound played back by a computer and filtered by the HRTF impulse response corresponding to the original speaker location (see fig. 10).

primer10.gif

Figure 10 – Applying impulse response synthetically to create illusion of a virtual speaker.
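
In its simplest form, filtering a sound by an HRTF impulse response amounts to convolving the same mono signal with one impulse response per ear. The sketch below shows only that basic operation; it makes no attempt to pick out psycho-acoustically relevant features or to run in real time, and it is not Aureal's implementation. The two impulse responses are made-up numbers chosen so that the right ear hears the sound sooner and louder, as it would for a source on the listener's right.

```python
def convolve(signal, impulse_response):
    """Plain discrete convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Filter one mono signal into a (left, right) pair of ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Hypothetical impulse responses for a source on the listener's right:
# the left ear's response is delayed by two samples and is quieter.
HRIR_RIGHT = [0.9, 0.2, 0.05]
HRIR_LEFT = [0.0, 0.0, 0.4, 0.15, 0.05]

click = [1.0, 0.0, 0.0, 0.0]
left, right = spatialize(click, HRIR_LEFT, HRIR_RIGHT)
print("left ear: ", left)
print("right ear:", right)
```

Because the input is a single click, each output simply reproduces the corresponding impulse response, which makes the built-in delay and level difference between the ears easy to see.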

Playback Considerations

HRTFs can be used with great effectiveness in all audio playback configurations: headphones, stereo speakers, or multi-speaker arrays. On headphones, HRTF output is sent directly to the user’s ears. On stereo or multi-speaker setups, an additional audio processing step called cross-talk cancellation is employed to ensure proper signal separation between left and right ears.

5.2 Aureal Wavetracing (A3D)

Once HRTFs have been captured and can be rendered, a sound can be made to appear from any 3D location. To compute and render the additional effects that the 3D environment can have on a sound, A3D employs proprietary Wavetracing algorithms. Among other features, the addition of Wavetracing technology distinguishes A3D 2.0 systems from A3D systems. Developed over many years in conjunction with clients such as NASA, Matsushita and Disney, Aureal’s Wavetracing technology parses the geometry description of a 3D space to trace sound waves in real-time as they are reflected and occluded by passive acoustic objects in the 3D environment. With Wavetracing, sounds can not only be heard as emanating from a position in 3D space, but also as they reflect off walls, leak through doors from the next room, get occluded as they disappear around a corner, or suddenly appear overhead as you step into the open from a room. Reflections are rendered as individually imaged early reflections and as reverb late field reflections. Acoustic space geometries and wall surface materials are specified via the A3D 2.0 API (Application Programming Interface). The result is the final step towards true audio rendering realism: the combination of 3D positioning, room and environment acoustics and proper signal presentation to the user’s ears.

5.3 The A3D API

The A3D API (Application Programming Interface) delivers A3D into the hands of the software content developer. It allows games, 3D Internet browsers, and other 3D software applications to harness the full power of A3D. The API allows the application developer to do the following:

  • position sound sources and listeners in 3D space
  • define the 3D environment and its acoustic properties such as wall materials
  • synchronize 3D graphics and A3D audio representations of objects (see section on Audio-Visual Synergy below)
  • synchronize user inputs with A3D rendering (see section on Head Movement below)

Audio-Visual Synergy

The eyes and ears often perceive an event at the same time. Seeing a door close, and hearing a shutting sound, are interpreted as one event if they happen synchronously. If we see a door shut without a sound, or we see a door shut in front of us, and hear a shutting sound to the left, we get alarmed and confused. In another scenario, we might hear a voice in front of us, and see a hallway with a corner; the combination of audio and visual cues allows us to figure out that a person might be standing around the corner. Together, synchronized 3D audio and 3D visual cues provide a very strong immersion experience. Both 3D audio and 3D graphics systems can be greatly enhanced by such synchronization.

Head Movement and Audio

Audio cues change dramatically when a listener tilts or rotates his or her head. For example, quickly turning the head 90 degrees to look to the side is the equivalent of a sound traveling from the listener’s side to the front in a split second. We often use head motion to track sounds or to search for them. The ears alert the brain about an event outside of the area that the eyes are currently focused on, and we automatically turn to redirect our attention. Additionally, we use head motion to resolve ambiguities: a faint, low sound could be either in front or back of us, so we quickly and subconsciously turn our head a small fraction to the left, and we know if the sound is now off to the right, it is in the front, otherwise it is in the back. One of the reasons why interactive audio is more realistic than pre-recorded audio (soundtracks) is the fact that the listener’s head motion can be properly simulated in an interactive system (using inputs from a joystick, mouse, or head-tracking system).
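
The effect of head rotation on the apparent direction of a sound can be sketched with a little geometry. The example below is not the A3D API; it is a hypothetical helper that computes where a source lies relative to the direction a listener is facing, using an assumed convention of 2D positions with +y as "north", a yaw of 0 facing +y, and positive angles to the listener's right.

```python
import math

def relative_azimuth(listener_pos, listener_yaw_deg, source_pos):
    """Angle of the source relative to where the listener is facing.

    Positions are (x, y) in meters with +y as "north". Yaw is the compass
    heading of the listener's nose: 0 faces +y, 90 faces +x.
    Returns degrees in (-180, 180]; positive means to the listener's right.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    bearing = math.degrees(math.atan2(dx, dy))  # compass bearing of the source
    rel = (bearing - listener_yaw_deg + 180.0) % 360.0 - 180.0
    return 180.0 if rel == -180.0 else rel

origin = (0.0, 0.0)

# A source to the listener's side moves to the front after a 90 degree turn.
print(relative_azimuth(origin, 0.0, (1.0, 0.0)))
print(relative_azimuth(origin, 90.0, (1.0, 0.0)))

# Resolving front/back ambiguity by turning the head 10 degrees to the left.
for label, source in (("front", (0.0, 1.0)), ("back", (0.0, -1.0))):
    angle = relative_azimuth(origin, -10.0, source)
    side = "right" if angle > 0 else "left"
    print(f"source in {label}: now {abs(angle):.0f} degrees off center, to the {side}")
```

After the small turn to the left, the source in front ends up on the listener's right and the source in back on the left, matching the rule of thumb in the text.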

5.4 The Vortex A3D Silicon Engines

Aureal has developed a line of PCI-bus based digital audio chips called Vortex. These chips, among many other features, contain silicon implementations of A3D algorithms, including HRTF and Wavetracing rendering engines. Vortex is a no-compromise PCI audio chip architecture. It takes true advantage of the PCI bus by streaming dozens of audio sources to on-board audio processing engines: A3D, DirectSound, Wavetable synthesis, legacy audio, multi-channel mixers, sample rate converters, etc. Vortex delivers highest quality A3D capabilities for sound cards and PC motherboards at maximum price/performance points.

6. ADVANTAGES OF A3D AS ILLUSTRATED BY RESEARCH FINDINGS

Results from decades of psycho-acoustic research on binaural audio offer scientific explanations of why real-time binaural audio technologies such as A3D are highly effective in a range of applications.

6.1 Binaural Gain

Probably the single most important fact about binaural audio is that if an audio signal is played on top of white noise it will appear 6 to 8 dB louder if that signal is a binaural signal versus a non-binaural signal. This means that the exact same audio content is more audible and intelligible in the binaural case, because the brain can localize and therefore “single out” the binaural signal, while the non-binaural signal gets washed into the noise.
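
To put the 6 to 8 dB figure into more familiar terms, decibels can be converted to linear ratios with the standard formulas (20 log10 for amplitude, 10 log10 for power). The snippet below is only that unit conversion; it says nothing about perceived loudness, and the function names are made up for illustration.

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a level change in decibels to a linear amplitude ratio."""
    return 10.0 ** (gain_db / 20.0)

def db_to_power_ratio(gain_db):
    """Convert a level change in decibels to a linear power ratio."""
    return 10.0 ** (gain_db / 10.0)

for gain in (6.0, 8.0):
    print(f"{gain:.0f} dB -> amplitude x{db_to_amplitude_ratio(gain):.2f}, "
          f"power x{db_to_power_ratio(gain):.2f}")
```

A gain of 6 dB corresponds to roughly doubling the signal amplitude, so the quoted range is comparable to turning the binaural signal up by a factor of about 2 to 2.5 relative to the noise.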

6.2 The “cocktail party effect”

At a cocktail party, a listener is capable of focusing on a conversation, while there are hundreds of other conversations going on all around. If that party was recorded and then played back using a regular mono or stereo procedure, all the conversations would be combined into one (mono) or two (stereo) locations. The result would in most cases be unintelligible. With a binaural recording, or recreation of that party, a listener would still be able to tune into and understand individual conversations, because they are still spatially separated, and “amplified by” binaural gain.

6.3 Faster reaction time

In an environment such as a jet cockpit, where a lot of critical information is displayed to a user, reaction time is crucial. Research documents that audio information can be processed and reacted to more quickly if presented in binaural form, because such a signal mirrors the ones received in the real world. In addition, binaural signals can convey positional information: a binaural radar warning sound can warn a user about a specific object that is approaching (with a sound that is unique to that object), and naturally indicate where that object is coming from.

6.4 Less listening fatigue

Phone operators who listen to a mono headphone signal all day long experience listening fatigue. If those same signals are presented as binaural signals, listening fatigue can be reduced substantially. Humans are used to hearing sounds that originate outside of their heads, as is the case with binaural signals. Mono or stereo signals appear to come from inside a listener’s head when using headphones, and produce more strain than a natural sounding, binaural signal.

6.5 Increased perception and immersion

Some of the most interesting research into binaural audio shows that a subject will consistently report a more immersive, and higher quality (“nicer” colors, or “better” graphics) environment when visuals are shown in sync with binaural sound, versus stereo sound, or no sound at all.

7. SUMMARY

For well over ten years, real-time binaural, or “3D”, audio technology has been the subject of intense research and development in the psycho-acoustic research community. The findings of a large number of research studies indicate that interactive 3D audio is an important technology that enables an entirely new level of audio experience: a three-dimensional sound field is created in real-time to continuously envelop a listener. The listener is no longer aware of the audio system that is rendering the sounds – the application communicates directly with the user, creating levels of awareness, realism, and immersion, and improvements in reaction time and communication of audio information, previously only possible in real-life situations.

Besides understanding real world sounds and the hearing process, the biggest challenges associated with building an effective positional 3D audio solution are:

1. The measurement and operation of exact HRTF filters.
2. The development of efficient, high quality algorithms that allow for real-time rendering of a 3D soundfield using minimal computational resources.
3. The deployment and support of a technology-enabling API into the application development communities to ensure proper software content support.
4. The development of feature and cost competitive silicon engines to enable products based on the technology.
5. The definition and launch of a successful consumer brand and consumer products that will get the technology into the hands of the end-users.

A3D has mastered all of the above challenges. A3D is based on the world’s most advanced algorithms and HRTF measurement and compression techniques that have been developed in high-performance, mission-critical application areas such as NASA simulators, jet fighter cockpits, and Virtual Reality systems. Aureal has created free software tools, SDKs, and APIs and evangelized them successfully to over 100 top tier PC software development houses. Aureal’s breakthrough Vortex PCI audio chips render A3D on dozens of new sound cards and PCs. Finally, Aureal has created the A3D brand that is actively promoted with a simple message: if you take a software application with the A3D logo on it, and a sound card or PC with the same logo on it, they will combine to deliver the most amazing, immersive and realistic interactive audio experience.

c. 1998, Aureal Corporation.
From Aureal Corporation Site. (Republished with permission.)
