Depth Perception in Headphones.

(The value of headphones in relation to loudspeakers)

by Ron Soh


In a Stax Omega 1 versus Stax Omega 2 headphone discussion that I was involved in several weeks ago, responses from a number of headphone hobbyists revolved around the issue of the value of expensive high-end headphones. I made an appeal for the value of headphones to be judged not just against other headphones, but also against speakers that cost far more. In this essay, I want to share with you what I mean by headphones having a better value than speakers, and also to delve into the subject of how headphones, like loudspeakers, portray soundstage when playing conventional two-channel recordings.

The psychoacoustics of sound localisation need not be brought into this discussion. This essay is NOT about a magic trick of shifting the soundstage from inside the head to a plane between the speakers! Instead, it outlines a simple experiment to illustrate these two main points: (1) headphones sound better than speakers costing more, and (2) headphones do portray a soundstage. If depth clues are portrayed in speakers, then why not in headphones? After all, it is the same signal we are feeding to both of them. In the last section is a tutorial for training one’s ears to identify depth cues in headphones. The tutorial lists specific recordings and passages which have captured these cues clearly and are good training examples.

EXPERIMENT: play some music on your speakers, and make sure you sit (or stand) at the ‘sweet spot’. The ‘sweet spot’ is the position where you are equidistant from the left and right speakers, neither too far from nor too near them. Make sure that you have switched off any ‘loudness’ buttons and set all bass/treble control knobs to neutral (if your amplifier has these knobs). Listen to the music via the speakers for a few minutes. Then listen to the same piece of music via your headphones, but sit (or stand) at the same ‘sweet spot’, facing and looking at the speakers (turned off, of course). Listen to your headphones, and be surprised.

(This experiment assumes that your speakers do not cost more than five times your headphone, as a guideline. For instance, don’t compare a $100 headphone with $2000 speakers! Also this experiment assumes that you are not using cheap $15 headphones. And finally, don’t expect the headstage of your headphone to suddenly shift towards the speakers just because you are looking at the speakers.)

The differences you will notice can be quite entertaining and educational. Generally, what you will find is that:

  1. the headphone sounds clearer, and it is easier to distinguish between various instruments, as compared to the speaker.
  2. the loudspeaker will likely have a ‘veil’ covering the sound of voices/instruments, and this ‘veil’ is located somewhere around the upper bass/lower midrange region. This upper bass/lower midrange ‘veil’ is typical of loudspeaker ‘box colorations’, the effect of resonances of the speaker cabinet. Your headphone is not perfect either, but it is likely to have fewer ‘box colorations’ than your speakers.
  3. the purpose of facing the speakers while you listen to the headphone is to give you visual clues while you listen to the headphones, so that you can appreciate that headphones DO convey ‘depth clues’. (I will delve into this intricate matter in detail later.)

The purpose of the little experiment above is to demonstrate two key points, which are:

  1. Headphones can be better transducers than speakers
  2. Of course a headphone conveys depth clues

Headphones Can Be Better Transducers Than Speakers

Speakers face more difficulties in painting a sonic picture, because speakers need to move a lot more air compared to headphones. When a speaker has to move more air, its cone (or dome or diaphragm or whatever) has to move back and forth through greater distances, and these greater driver excursions create peculiar problems such as non-linear excursions, cone break-ups, ringing (which is the problem caused when the cone moves forward and then instead of moving back immediately its momentum carries it forward a little bit more), and other problems I might know only if I were a speaker designer.

A headphone has a far easier life. Most headphones do not have woofers, midranges and tweeters; they are usually full-range transducers. Therefore a headphone needs no electronic crossover circuits to split the frequency spectrum into low-frequency, mid-frequency and high-frequency signals to be fed to woofers, midranges and tweeters. Crossover circuits present a longer signal path that tends to degrade signal quality. Also, sometimes the crossover is not handled properly, and you get problems in the crossover frequency region, where the two drivers overlap but do not hand off cleanly.

A headphone also does not have a big cabinet that tends to resonate. A speaker, in having to move more air, has to generate a lot of pistonic movement, and that results in huge backlash forces being transferred to the cabinet. A headphone does not need to generate huge pistonic movements, so less backlash energy is transferred to its chassis.

A headphone does not have to contend with room reflections. Of course, headphones have a cavity environment to deal with (a cavity environment is the space enclosed between the headset and your ears). In fact, most of the time, a headphone’s frequency response is not just due to the frequency response of its cone/diaphragm, but also due to this cavity environment. Headphone makers know how to ‘tune’ this cavity environment so as to compensate and counter-compensate for the frequency (im)balances of the cone/diaphragm. This cavity environment is far easier to predict, compared to the room environment where a pair of loudspeakers are found. Different rooms create different reflection characteristics, not to mention problems such as standing waves, and cause unpredictable colorations in the sound of speakers.

A headphone’s diaphragm is smaller and far lighter than a speaker’s. This factor alone gives headphones a better start towards the goal of superior accuracy in translating electrical impulses into mechanical movement. Due to its lower inertia, a headphone’s diaphragm starts and stops more quickly than a speaker’s drivers can; therefore a well-designed headphone can exhibit greater transient attack speed.
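The inertia argument can be put into rough numbers via Newton’s second law. In the Python sketch below, the 0.2 g and 15 g moving masses are illustrative guesses of mine, not measured figures for any actual driver; only the principle a = F/m is being demonstrated.

```python
# Newton's second law: acceleration a = F / m, so for the same drive
# force the lighter diaphragm accelerates (starts and stops) faster.
headphone_diaphragm_kg = 0.0002   # ~0.2 g moving mass (illustrative guess)
speaker_cone_kg = 0.015           # ~15 g moving mass (illustrative guess)
force_n = 0.1                     # identical drive force applied to both

a_headphone = force_n / headphone_diaphragm_kg
a_speaker = force_n / speaker_cone_kg

# The ratio depends only on the two masses, not on the chosen force.
print(round(a_headphone / a_speaker))  # → 75
```

With these (hypothetical) masses the headphone diaphragm accelerates 75 times harder for the same force, which is the physical basis of the transient-attack-speed claim above.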

A Headphone Can Convey Depth Clues

I hope that little experiment demonstrated for you that headphones give spatial clues, the same way loudspeakers do. In switching to and from your headphone and your speakers, you should be able to hear these depth clues. (If you cannot hear these depth clues, read on – especially the section and tutorial on how to perceive depth clues, then re-conduct the experiment.)

It is such a common misconception that headphones do not have a soundstage. Just because you wear the soundstage on your head does not mean that your headphone has no soundstage. To appreciate the differences between a headphone’s headstage and speakers’ soundstage, we first have to establish how speakers construct their soundstages.

When you position yourself in the ‘sweet spot’ in front of a pair of loudspeakers, a triangle is formed between you and the two speakers. The two speakers form a ‘picture plane’, which is the vertical plane that contains both speakers. This ‘picture plane’ faces you front-on, and depending on whether the speaker’s sonic character is forward or laid-back, images are formed in front of, or within, or behind this ‘picture plane’. Depending on the type of recording, some of the images appear to be positioned further behind this picture plane than other images, and this creates a sense of layering or depth of space. This combination of lateral (left-right) spread and front-to-back depth creates what we call a 3-dimensional ‘soundstage’.

People say that although headphones portray lateral left-right spread, headphones do not have a soundstage because headphones do not portray depth, that missing third dimension. This is the point you can disprove for yourself by conducting that little experiment. Listen to your speakers and notice which images appear to be positioned further back than others. Then listen to your headphones (while still looking at your speakers). Do you hear the depth clues?

You may hear four ‘layers’ of depth:

  • 1st layer: very near (apparently only inches from the recording microphone)
  • 2nd layer: near (apparently 1 or 2 meters away from the recording microphone)
  • 3rd layer: not near, not far (apparently 5 to 10 meters away from the mike)
  • 4th layer: far (apparently 30 meters or more away from the mike).

Do not be too fixated on the actual numerical value of these distances – do not focus on an image and rack your brains out trying to figure out whether it is 2 meters or 5 meters from you. You are enjoying music, not taking a stressful high-school examination. Moreover, you are not a bat. It is difficult to actually assign a numerical value to the distance of an instrument from the microphone just by hearing it. The numerical figures I mention above are simply to help you picture what I am trying to say—don’t take the numerical digits too seriously.

Close-miked, heavily-mixed recordings tend to have almost all their images in the 1st layer, with the odd image sometimes suddenly appearing in the 4th layer — for example, the sound of a synthesizer that has been given an excessive reverberation treatment. Classical symphonic recordings portray images mainly in the 2nd, 3rd and 4th layers (the occasional solo instrument in the 2nd layer, massed strings in the 3rd, the occasional timpani/cymbals in the 4th, for example). However, please note that even if all the images appear to be in the 3rd layer only, that’s depth perception already. Most recordings with a lot of depth clues contain images that reside mainly within a single layer, with occasional images making their guest-star appearances in a different layer.

(The reason why I don’t listen to pop CDs exclusively is that most pop recordings contain images positioned exclusively in the 1st layer. After listening to one pop CD I like to change to a different type of recording, one that portrays images in the 2nd or 3rd layer. It gets rather tiring to keep listening to CD after CD of exclusively 1st-layer images.)

Now we come to the most important point I am making in this essay. How does a headphone give you depth clues? Imagine that a school band is performing a musical number and marching along a road steadily AWAY from you. If you were to close your eyes you could hear them receding slowly away from you. How is that so? When playing conventional stereo recordings, you perceive depth clues in three ways: (1) loudness, (2) texture clarity and (3) reverberation. Take these 3 clues one by one:

  1. LOUDNESS: When an instrument is near to you it tends to sound louder. Conversely, the softer an instrument sounds, the further away you infer it to be.
  2. TEXTURE CLARITY: But what happens when an instrument is very near to you, but is played softly? Would you perceive it as being very far away? No, you would not, because the second way you perceive distance is through texture. When your ears hear that the texture of an instrument is very well defined, you infer that it is near to you. The further an instrument is from you, the less specific its texture appears to be. When a guitar string is plucked near you, you can hear its steely nature, its lower and upper harmonics, even though the guitar is plucked softly. But when the guitar string is plucked far from you, you may no longer appreciate its steely nature nor its lower and upper harmonics—you may only hear the principal harmonic. When the texture of an instrument sounds less rich you infer it to be farther away from you. When a drum-sound image has a specific texture of a stretched dry skin being hit, along with a loud sound of a steely rattle, you perceive it to be near; however when its texture is more ‘washed-out’, i.e. less specific, you perceive it to be far away. When a saxophone-sound image has a raspy texture you infer that it is near; when it doesn’t have this strong raspy texture you infer that it is far away.
  3. REVERBERATION: Reverberation is the sum total of sound reflecting off the wall, floor and ceiling surfaces of the recorded environment. The further an instrument is positioned from the microphone, the more reverberation is captured by the microphone. Consequently, the more an image is surrounded by a reverberative halo, the further away you perceive this image to be. Reverberation causes an image to sound more laid-back, enveloped in a softer, cushier halo. It is always pleasant to listen to recordings with reverberation via headphones, because such images are robbed of any jarring directness, seemingly buffered from you by an air cushion. (This is not to say that close-up images are always irritating—not at all. Close-up images in the 1st layer can be very enjoyable via headphones when they have a sense of ‘liquid-ness’ about them. The three states of matter—air, liquid and solid—are very apt metaphors to describe and appreciate the quality of sound.)
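For readers who think in code, the three cues can be caricatured in a few lines of Python. This is purely an illustrative sketch: the 1/distance amplitude law is real physics, but the low-pass mapping, the noise ‘halo’, and every constant here are made up, and simulate_distance is a hypothetical function of mine, not part of any audio library.

```python
import math
import random

SR = 44100  # sample rate in Hz

def simulate_distance(signal, distance_m, ref_m=1.0):
    """Crudely apply the three depth cues to a mono signal (list of floats).

    (1) LOUDNESS: sound-pressure amplitude falls roughly as 1/distance.
    (2) TEXTURE: a one-pole low-pass whose smoothing grows with distance,
        dulling the upper harmonics ("less specific texture").
    (3) REVERBERATION: a decaying-noise 'halo' mixed in more heavily,
        relative to the direct sound, as the source recedes.
    """
    gain = ref_m / distance_m                # cue 1: inverse-distance amplitude
    alpha = distance_m / (distance_m + 5.0)  # cue 2: ad-hoc smoothing amount
    wet = distance_m / (distance_m + 3.0)    # cue 3: ad-hoc wet/dry balance

    rng = random.Random(0)                   # deterministic 'reverb' noise
    out, acc = [], 0.0
    for n, x in enumerate(signal):
        acc = alpha * acc + (1.0 - alpha) * (x * gain)  # low-passed direct sound
        halo = rng.gauss(0.0, 1.0) * math.exp(-n / (0.5 * SR)) * 0.05 * gain
        out.append((1.0 - wet) * acc + wet * halo)
    return out
```

Run the same tone through at 1 m and at 20 m, and the distant version comes out quieter, duller, and more dominated by its halo: a crude analogue of the Layer 1 versus Layer 4 contrast.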

The above three ways of perceiving depth clues are the reasons why I say headphones portray soundstages. The reason for asking you to look at your speakers whilst listening to your headphone is simply to give you some visual references to latch on to while you mentally assign the images to their respective layers (1st, 2nd, 3rd or 4th layer) through the perception of the above three depth clues (loudness, texture clarity and reverberation).

And the reason why I asked you to listen to your speakers FIRST before listening to your headphones is also to convince you that the way a speaker gives depth clues is exactly the same as the way a headphone gives depth clues. By listening to your speakers first you hear the three depth clues (loudness, texture and reverberation), and then when you switch to your headphones you will be able to hear that these very same depth clues are also present via your headphones. Speakers are not televisions, and if it seems that a speaker positions its images in 3-dimensional space it is only because your ears hear these depth clues, not because your eyes are seeing actual objects. If you can hear depth clues via speakers, then why can’t you hear depth clues via headphones? The purpose of looking at your speakers while listening to your headphones is to ‘compare apples to apples’, i.e., to create a level playing field, where the visual assistance normally given by speakers in image positioning is also extended to your headphones.

Having said that headphones have soundstages, it is quite appropriate for me to qualify here that the soundstage thrown by speakers is ‘easier to visualise’ than that portrayed by headphones. When listening to speakers there is something quite literal about the way an image that is perceived BY THE EARS to have more depth actually seems TO THE EYE to be positioned further behind the ‘picture plane’. This visual assistance in image positioning is why people say speakers create a soundstage whilst headphones don’t. But the truth is that headphones give ample depth clues, arguably even more than speakers do. (Note: listening to speakers with the lights off and in a completely dark room still lends this visual assistance to image positioning, because you have a visual memory of where the speakers are placed.)

I hope that if you initially did not perceive your headphone’s ‘depth clues’ when you first conducted the experiment, you will now re-conduct the experiment. You may need to try it out over a few CDs, before your ability to discern these ‘depth clues’ catches on [1].

The psychoacoustics of sound localisation are EXTRINSIC factors that are not the focus of my essay. These extrinsic factors of sound localisation are caused by the position of the transducers (be it speaker or headphone) in relation to our ears. If the transducers are placed in front of us, we perceive the images to be located in front of us. If the transducers are placed directly on our ears, we perceive the images to be located inside or around our heads.

The focus of my essay is the perception of depth clues that are already INTRINSICALLY present in the recordings that we play. These 3 depth clues (comparative loudness, comparative textural specificity, and comparative reverberation) are already part of the signal that we feed to our speakers and headphone. If these depth clues are portrayed by one transducer, then why not the other? After all, it is the same signal we are feeding to both of them.

If you have one of those audiophile test CDs you can demonstrate for yourself the truth of what I say. Listen to the track where the demonstrater hits a percussive instrument, while he slowly walks further and further from the pick-up microphone. The image that you hear over your headphone will CONSTANTLY RESIDE INSIDE YOUR HEAD, but you can tell that the demonstrator is walking progressively away from the microphone. How so? Via the perception of the 3 depth clues I outlined, i.e., the percussive instrument progressively gets (i) softer in volume, (ii) less specific in texture, and (iii) more and more diffused by a reverberative halo.

The psychoacoustics of sound localisation need not be mixed-up with the perception of depth clues via headphones. How far or how near you perceive an image to be is related to extrinsic factors of where the transducers are located in relation to your ears. Whether that image is located in front of you (in the case of speakers) or inside your head (in the case of headphones) does not change the historical fact that instrument A was placed 5 meters from the pick-up mike and instrument B was placed 30 meters from the pick-up mike. These ‘historical facts’ are INTRINSIC to the recordings we play, and are perceivable via headphones.

The purpose of asking the reader to look at his speakers while listening to his headphone was to extend a visual assistance in perceiving these 3 depth clues. I am already accustomed to perceiving depth clues present in recordings, and therefore do not need visual assistance in perceiving that certain instruments are placed further from the recording microphone than others. But for a reader who is new to this perception, this visual assistance is a helpful, but temporary, crutch.

Ear-Training For Depth Perception in Headphones

As the audiophile-test-CD demonstration described earlier showed, the image you hear over your headphone constantly resides inside your head, yet as the demonstrator walks away from the pick-up microphone, the percussive instrument progressively gets (i) softer in volume, (ii) less specific in texture, and (iii) more and more diffused by a reverberative halo.

Thus, there are three mechanisms for perceiving the distances of voices/instruments from the pick-up mikes: (i) comparative loudness, (ii) comparative textural specificity, and (iii) comparative reverberation. These three mechanisms are perceivable on any CD, and below is just a sampling:

(1) The Three Tenors In Concert (Teldec 4509-96200-2)
Track 5 Granada- Placido Domingo’s voice is Layer 2. He is not a pop singer who needs to stand so close to the mike!!! So it is not a Layer 1 voice. The tambourine is Layer 3. When only a few people clap, it appears that these people are at Layer 4, but the moment all of them start clapping, it appears they are closer to the mike; they appear to be at Layer 3. This must be the comparative loudness mechanism at work: the louder a sound is, the closer it appears to be.

(2) Planet Drum by Mickey Hart (Rykodisc RCD 80206)
All the images on this CD seem to be Layer 2 images. The instruments all appear to be close-miked. However, I suspect my headphone’s laid-back presentation style is what pushes these images back to Layer 2. A more forward-sounding headphone might portray them as Layer 1 images. I do not know; I don’t have a forward-sounding headphone at hand to verify.

(3) The Emissary by Chico Freeman (Clarity Recordings CCD-1015)
Track 2 Mandela- The saxophone is a Layer 1 image (even on my laid-back phones), but the drum kit is Layer 2, in the sense that it is definitely a little further away from the mike than the lead saxophone is. The rest of the accompanying instruments, like tambourine, electric guitar and background voices, are likewise Layer 2. This style of layering is obviously meant to highlight the lead saxophonist, Chico Freeman, the rightful star of this CD.

It is also interesting to note that Clarity Recordings employs a minimally-miked approach here, but the musicians, especially the lead sax, are positioned so close to these mikes that the recording seems a little reverberatively dry, at least as minimally-miked recordings go. I would characterise this CD as a forward-sounding minimally-miked recording. I am used to the idea that minimally-miked recordings are not forward-sounding.

(4) Toolbox by Toolbox (Vacuum Tube Logic Of America VTL 008)
This entire jazz CD is unbelievably reverberatively lush. The lead flute is Layer 2, the piano is Layer 3. The drum kit, oh the drum kit, how do I describe this one? The drum kit itself is Layer 3, but the faint echo/reverberation of that drum kit is Layer 4!!! Heavenly! At the end of each cymbal hit or drum hit the hall is suddenly ‘lit up’ for a very brief moment, and that Layer 4 designation of the drum kit’s echo is also a sonic description of the acoustic within which this recording was made. Recordings by VTL (which employ a complete line-up of Manley equipment) are so incredible for headphone listening because of the way the layering of apparent distances is achieved by means of reverberation.

The other two mechanisms, i.e., comparative loudness and comparative textural specificity, are not so important in VTL recordings. VTL recordings are must-haves for headphone-freakos! Unfortunately, they do not release new recordings anymore. The alternative way to hear the effect of reverberation on the perception of distances is to listen to binaural CDs, which also capture a lot of hall reverb as a by-product of the recording method.

(5) Music For Strings, Percussion & Celesta by Bartok (Decca 430 352-2)
This classical recording definitely employs a lot of accent mikes placed close to the musicians. Most of the images here are Layer 2 images, with that weird piano-like percussive instrument placed somewhere between Layer 1 and Layer 2, in the sense that it is more forward than the rest of the orchestra, but not so forward as the way Madonna likes to eat a mike. Strangely, even the timpani is a Layer 2 image. Timpani are usually placed way, way back at the rear of the orchestra, but this particular timpani does not sound that far away. Must be those accent mikes making the timpani sound nearer than it actually is.

(6) Fireworks by Stravinsky (Delos D/CD 3504)
Delos definitely does not use many accent mikes. The sense of layering in Delos CDs is definitely top-notch. The marvellous bloom of the lush violin section is Layer 3. The brass instruments appear closer, at Layer 2. And that gong-like/drum-like sound (cannot think of the name of that instrument) is way, way at the back of the orchestra, a Layer 4 image. I think the impulse reverberation of the gong-like/drum-like sound contributes to its sense of being a Layer 4 image. There’s a sense of immense majesty when a timpani/gong/drum becomes a Layer 4 image – like distant thunder, growling with authority from afar. No Layer 1 images here.

(7) Stereophile Test CD3 (STPH 006-2) Track 10.


Index 1: 2 omni mikes – John Atkinson talks and hits a cowbell (a percussive thing) in a church interior, and walks from far stage-left towards the mikes placed in the centre, and then away from the mikes towards far stage-right. His movements nearer to and further from the mikes are obvious over headphones. Then he stands at the very back of the church and walks along the centre aisle towards the mikes. This last movement pattern (from back to front of the church) is really obvious: his voice/cowbell is decreasingly diffused by church reverberation as he walks towards the front of the church where the mikes are. This is a smooth transition from Layer 4, through Layers 3 and 2, then finally Layer 1. There is no better proof than this track that the distance of a voice/instrument from the pick-up mike is perceivable via headphones!

Index 2: 3 omni mikes – same pattern of movements, except with a different microphone array: a centre mike was added. The sense of depth or movement to and from the mikes is likewise the same as Index 1, but due to the centre mike, images far-left do not seem as far-left and images far-right do not seem as far-right, compared to Index 1.

Index 3: ORTF cardioid mikes – same pattern of movements, except with a different microphone type. The sense of depth or movement to and from the mikes is likewise the same as Indexes 1 and 2, but due to the ORTFs picking up less hall reverb, the image of Mr. Atkinson’s voice/cowbell is less diffused by reverberation. Strangely, it is still very easy to tell when he is standing at the back of the church. This is due to (i) his voice being softer in volume, and (ii) the textural specificity of his voice being reduced, i.e., vocal pronunciations of vowels/consonants are less clear, and the cowbell is less sharp in transient attack when he stands at the back of the church. This experiment clearly shows that reverberation is not the sole mechanism of distance perception.

Index 4: ORTF cardioid mikes with post-processing – same as Index 3, but with Blumlein processing to add low-frequency bloom. Same observations as Index 3. I really cannot appreciate the so-called increased LF bloom. I do not hear it. But the LF bloom is not relevant to the issue at hand, which is distance perception.

Index 5: Schoeps sphere microphone (binaural) – same pattern of movements, but the illusion of sound localisation is partially realised, unlike with the other four indexes above, owing to the use of the Schoeps microphone. Image size is smaller and more precisely located in relation to the headphone-wearer. The mechanism of distance perception here is different from the above four indexes, because here the psychoacoustics of sound localisation comes into play. However, because my head is different from the plastic head used, the binaural illusion is only partially realised for me. Index 5 is not a good example for demonstrating the three mechanisms listed above, because a fourth mechanism, sound localisation, is involved here. This mechanism of sound localisation is the basis of binaural recordings.

Note: I am aware that recordings, even minimally-miked ones, are usually not just 2-mike affairs. In conjunction with the main microphones, there are accent microphones which pick up the clarity of the instruments’ musical lines, and there are also hall mikes that are placed further from the musicians to pick up hall reverberation. And the gain applied to each of these mikes is different, depending on the recording engineer’s sonic intentions.

But my contention is this: for every microphone array and mixing configuration, there is such a thing as the centre-of-gravity of that array. This centre-of-gravity is the location where our ears appear to be located, when we listen to a recording. This is an unavoidable fact, I guess due to the egocentric nature of perception: we will always perceive the world, the sonic world even, in relation to ourselves. One will always locate oneself as the centre of the perceived world. When one listens to a particular recording with a particular microphone array, there will always be that one spot where one thinks the musicians and the room/hall are located in relation to oneself. That spot is the centre-of-gravity of the microphone array.

Notes: [1] All these notes about depth clues should not mislead anyone into thinking that listening to a headphone is a very demanding effort. It sure is a cerebral effort for me to try to explain the construction of a headphone’s soundstage to you, but it should not be a cerebral affair for you to discern depth clues via headphones. As I mentioned earlier, this isn’t a high-school examination; it is music you are enjoying. Remember: an appreciation of depth clues is only one-third of the appreciation of a headphone. There are three ways to enjoy your headphone: its sense of air (the category depth clues fall into), its sense of liquid-ness, and its sense of solidity.

c. 2000, Ron Soh.

The Elements of Musical Perception.

(HeadWize Technical Series Paper)


Although an understanding of acoustics and psychoacoustics is not mandatory for enjoying headphones, it is useful knowledge, especially when evaluating headphone sound quality or headphone acoustic simulators. Headphones do not sound the same as loudspeakers, and judging headphones against a loudspeaker reference will more often than not result in disappointment. Also, the headphone accessories market now offers several spatial processor products, which operate on different acoustic principles. It is likely that listeners will have different preferences when deciding to purchase these types of signal processors. An understanding of acoustics and psychoacoustics can make the selection process less frustrating by grounding the listener’s expectations in scientific fact.

This article is meant as an introduction – just a few simple mathematical formulas – and includes real-world examples of the operation of acoustic principles that are relevant to audiophiles. The section on spatial hearing is deliberately compact to avoid duplicating the discussions in other articles on the HeadWize site. For more information about 3D hearing, see A 3D Audio Primer and The Psychoacoustics of Headphone Listening. For more information about headphone technologies and headphone accessories, see A Quick Guide To Headphones and A Quick Guide To Headphone Accessories. For more information about evaluating headphones, see Judging Headphones For Accuracy.


Simple Harmonic Motion


Sound waves are longitudinal waves (as opposed to transverse waves such as light waves) in that they oscillate in the same direction as their propagation. They move through the air as a series of compressions and expansions (also called rarefactions) of air molecules. The air molecules merely vibrate back and forth about their rest positions; they do not travel with the wave. A pulse traveling along a stretched-out Slinky toy is an example of a longitudinal wave. Longitudinal waves can be represented using the familiar notation for transverse waves. Both longitudinal and transverse waves follow basic wave principles.


A simple wave is characterized by amplitude (its displacement from the center) and frequency (measured in Hertz, the number of oscillations per second). A simple sine wave completes one full oscillation in period T (secs) = 1/frequency (Hz), traversing 360 degrees of phase in that time. A simple sine wave is also called a pure tone.
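As a concrete check of the T = 1/f relationship, this short Python sketch samples one full cycle of a pure tone and confirms that the wave returns to its starting value after exactly one period (the 440 Hz frequency and 0.8 amplitude are arbitrary example values):

```python
import math

freq_hz = 440.0           # frequency f, in Hz (arbitrary example)
period_s = 1.0 / freq_hz  # period T = 1/f, in seconds
amplitude = 0.8           # peak displacement from center (arbitrary)

# Sample the pure tone x(t) = A * sin(2*pi*f*t) over one period [0, T].
times = [period_s * k / 100 for k in range(101)]
samples = [amplitude * math.sin(2 * math.pi * freq_hz * t) for t in times]

# After one full oscillation (360 degrees) the wave is back where it started.
assert abs(samples[-1] - samples[0]) < 1e-9
```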

Huygens’ Principle


Diffraction effects play an important role in acoustics, and are best understood when sound waves are viewed as Huygens wavelets. A freestanding sound source transmits sound in all directions, so the wavefront is actually a sphere. The physicist Christiaan Huygens saw all waves as being made up of an infinite number of tiny circular (in 2D) or spherical (in 3D) waves. A complete wave pattern is then merely the sum of these wavelets.



A consequence of Huygens’ Principle is that waves can bend around edges. If a wave hits a wall with an opening, a Huygens wavelet emerges on the other side. Sound waves bend more than light waves (so one can hear, but not see, around corners), and low frequencies bend more than high frequencies (so diffracted sounds have a “muffled” quality). For example, loudspeaker drivers are mounted in a closed box (or one with a specially tuned vent) to prevent the out-of-phase rear waves from diffracting to the front and canceling out the main output of the speaker. Tweeters may have an “anti-diffraction” ring so that the dispersion of high frequencies is not affected by edges and seams on the speaker box.



When a sound wave hits a surface at an angle (the angle of incidence), it bounces off that surface at the same angle (the angle of reflection). Reverberation is a result of waves reflecting off walls and objects in an acoustic space and is one of the spatial cues used by the human brain for 3D hearing. When the surface is uneven, reflection analysis deconstructs it into a series of smaller flat surfaces whose reflections are then summed. The diagram above is a simplification of reflection analysis using Huygens wavelets.
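The equal-angles law can be sketched as simple vector reflection off a flat surface; a minimal Python example with illustrative names (d is the direction of travel, n is the unit surface normal):

```python
def reflect(d, n):
    """Reflect 2D direction vector d off a surface with unit normal n:
    r = d - 2(d.n)n, which preserves the angle to the surface."""
    dot = d[0] * n[0] + d[1] * n[1]
    return (d[0] - 2 * dot * n[0], d[1] - 2 * dot * n[1])

# A wave traveling down-right at 45 degrees hits a horizontal floor
# (normal pointing straight up) and leaves up-right at 45 degrees.
incoming = (1.0, -1.0)
outgoing = reflect(incoming, (0.0, 1.0))   # → (1.0, 1.0)
```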

Inverse Square Law and Absorption

Sound waves radiate outward in a spherical shape. The further a listener is from the sound source, the quieter the sound. The surface area of a sphere is A = 4πr², where r is the radius of the sphere, so the intensity of a sound wave decreases in proportion to the inverse square of the distance from the source.
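The inverse square law translates into a handy rule of thumb: doubling the distance quarters the intensity, a drop of about 6 dB. A short Python sketch (function names are illustrative):

```python
import math

def intensity_ratio(r_ref, r):
    """Relative intensity at distance r versus reference distance r_ref
    (inverse square law)."""
    return (r_ref / r) ** 2

def level_drop_db(r_ref, r):
    """Drop in sound level (dB) when moving from r_ref out to r."""
    return 20 * math.log10(r / r_ref)

# Doubling the distance quarters the intensity: a drop of about 6 dB.
quarter = intensity_ratio(1.0, 2.0)
six_db = level_drop_db(1.0, 2.0)
```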

The intensity of sound is also affected by the absorption characteristics of the air and of reflective materials. The degree of absorption is called the absorption coefficient. Some materials have similar absorption across all audio frequencies, but others are better at absorbing a particular band of frequencies. For example, frequencies below 1 kHz will travel much farther in air than those above 1 kHz. Thus, when the sound source is at a distance, a listener will hear a muffled quality to the sound because of the inverse square law and the absorption of high frequencies. Since headphones are worn so close to the eardrums, headphone sound will have more high frequency components than sound from loudspeakers.

Absorption and the inverse square law also affect the reverberation time of an acoustic space. Reverberation time measures how long it takes sound to decay by a factor of one million in intensity (60 dB) and is an important characteristic of concert halls. The best acoustic spaces for listening to music have a smooth rate of decay (as opposed to a rough decay, in which the volume fluctuates as the sound dies away). The better concert halls have reverberation times of around 2 seconds.
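A decay by a factor of one million in intensity is a 60 dB drop, which is how reverberation time (often written RT60) is defined. A minimal sketch, assuming an idealized, perfectly smooth decay:

```python
import math

# A decay by a factor of one million in intensity is a 60 dB drop:
decay_db = 10 * math.log10(1_000_000)

def rt60(decay_rate_db_per_s):
    """Reverberation time, assuming a perfectly smooth decay rate."""
    return 60.0 / decay_rate_db_per_s

# A hall whose sound dies away smoothly at 30 dB per second has the
# 2-second reverberation time typical of the better concert halls.
two_seconds = rt60(30.0)
```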

Doppler Effect


If a sound source is in motion, a stationary listener will hear a change in pitch as the source approaches and leaves. (Light exhibits a Doppler effect that astronomers use to calculate the speed of stars traveling through space.) If the sound source is moving towards the listener, the sound waves tend to bunch together and the perceived pitch is higher than the actual sound. If the source is approaching at the speed of sound, the listener hears a sonic boom because all of the sound waves arrive at the same time. If the source is moving away from the listener, the wavelengths are stretched out, so the perceived pitch is lower than the actual sound.
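For a stationary listener and a moving source, the standard Doppler formula is f' = f·v/(v ∓ vs), with the minus sign for an approaching source. A Python sketch (the 343 m/s speed of sound and all names are illustrative assumptions):

```python
SPEED_OF_SOUND = 343.0   # m/s in air, an assumed typical value

def doppler_pitch(source_freq, source_speed, approaching=True):
    """Perceived frequency for a stationary listener and a moving source.
    As source_speed nears the speed of sound, the approaching case blows
    up: all the wavefronts arrive together (the sonic boom)."""
    v = SPEED_OF_SOUND
    return source_freq * v / (v - source_speed if approaching else v + source_speed)

# A 440 Hz source moving at 30 m/s sounds sharp on approach, flat receding.
approaching = doppler_pitch(440.0, 30.0, approaching=True)
receding = doppler_pitch(440.0, 30.0, approaching=False)
```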



The Superposition Principle


When two or more waves travel in the same direction or cross each other’s paths, they remain distinct. Thus, the instruments in an orchestra or band, or the voices in concurrent conversations, are distinguishable even though they are playing or speaking at the same time. However, at a molecular level, waves that move in the same medium add together: they displace a total amount of air that is equal to the sum of their individual displacements. Music is heard as complex waves, so any analysis of musical perception begins with the Superposition Principle.



Although waves exist independently of one another, in the special case of similar waves combining, the result can be either constructive or destructive interference, depending on whether the waves are in phase or out of phase. Phase is a comparison of how closely two waves are in sync and is measured in degrees. Constructive interference will enhance sound (for example, make it louder). Destructive interference will weaken sound. If two identical waves are 180 degrees out of phase, they will cancel out. Whether the interference is constructive or destructive, the superposition principle requires that the individual waves continue to exist separately. The interference itself is merely the effect of the waves together at one point in space.
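Constructive and destructive interference can be checked numerically. The sketch below (names are illustrative) superposes two identical unit sine waves at a given phase offset and reports the peak of the combined wave:

```python
import math

def superpose(phase_deg, n=1000):
    """Peak amplitude of two identical unit sine waves offset by phase_deg,
    sampled over one full cycle."""
    offset = math.radians(phase_deg)
    return max(abs(math.sin(t) + math.sin(t + offset))
               for t in (2 * math.pi * k / n for k in range(n)))

in_phase = superpose(0)        # constructive: peak near 2.0
out_of_phase = superpose(180)  # destructive: cancels to nearly 0
```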



When two waves of slightly different frequencies combine, they produce a wobbling sound called beats. Beats have two characteristics: the beat frequency (how often the sound changes volume) and the tone frequency, which is the tone that the listener hears. The beat frequency fb = f2 – f1, where f2 > f1. The tone frequency ft = (f1 + f2)/2.
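The two formulas follow from the trigonometric identity sin A + sin B = 2 sin((A+B)/2) cos((A−B)/2): the sum of the tones is a tone at ft whose volume envelope varies at the beat frequency. A Python sketch with two illustrative tones:

```python
import math

# Two tones at f1 = 440 Hz and f2 = 444 Hz: the volume wobbles 4 times
# per second while the listener hears a single 442 Hz pitch.
f1, f2 = 440.0, 444.0
beat_freq = f2 - f1          # fb: how often the volume changes
tone_freq = (f1 + f2) / 2    # ft: the pitch the listener hears

def two_tones(t):
    """The physical sum of the two tones."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def beat_form(t):
    """A tone at ft whose volume envelope varies at the beat frequency."""
    return 2 * math.cos(math.pi * beat_freq * t) * math.sin(2 * math.pi * tone_freq * t)
```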

Standing Waves


When two waves that are identical in frequency and amplitude travel in opposite directions through each other, they can create a standing wave. Unlike traveling waves, standing waves appear to vibrate in place. That is, the wave peaks alternate from positive to negative in place but do not move forwards or backwards, and each peak is bounded by a point of zero displacement on either side. The peaks are called antinodes and the points of zero displacement are called nodes.
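The standing wave is just the superposition of the two opposite-traveling waves: sin(kx − ωt) + sin(kx + ωt) = 2 sin(kx) cos(ωt). A sketch (the wavenumber and angular frequency values are arbitrary):

```python
import math

K, W = 2.0, 5.0   # arbitrary wavenumber and angular frequency

def traveling_sum(x, t):
    """Two identical waves traveling in opposite directions."""
    return math.sin(K * x - W * t) + math.sin(K * x + W * t)

def standing(x, t):
    """The equivalent standing wave: peaks oscillate in place."""
    return 2 * math.sin(K * x) * math.cos(W * t)

node_x = math.pi / K   # a node: zero displacement at every instant
```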


One form of standing wave is resonance. Normally, if an object is excited to vibration, the vibration will fade away due to damping. However, all objects have a preferred vibration frequency, called the resonance frequency, at which vibrations are reinforced as standing waves within the object. If not excited continuously, an object vibrating at resonance will still settle down eventually, but over a longer period of time than it would take at any other frequency. Resonance is a component of the sound of musical instruments, but is the bane of listening environments, which should not emphasize any one frequency or set of frequencies over others. Loudspeakers and headphones are damped to reduce or eliminate the effects of system resonances on sound reproduction.


Harmonics and Overtones

A harmonic or overtone series consists of a fundamental frequency and successive frequencies that are integer multiples of the fundamental. If f is a fundamental, then its harmonic series is f, 2f, 3f, 4f, 5f…. More than a mathematical curiosity, harmonics are central to musical perception. Jean-Baptiste Fourier discovered that ANY periodic waveform can be represented by summing a series of sine waves of appropriate amplitudes and phases. For example, a square wave can be constructed from the sum of a fundamental frequency and its odd harmonic series.
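The square wave example can be sketched directly: its Fourier series is (4/π) Σ sin(2πnft)/n taken over odd n, which flattens toward +1 for the first half cycle and −1 for the second. A minimal Python version (the harmonic count is an arbitrary choice):

```python
import math

def square_wave(t, fundamental_hz, n_harmonics=50):
    """Approximate a square wave by summing its odd harmonic series."""
    total = 0.0
    for n in range(1, 2 * n_harmonics, 2):   # odd harmonics 1, 3, 5, ...
        total += math.sin(2 * math.pi * n * fundamental_hz * t) / n
    return 4 / math.pi * total
```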


The sounds of musical instruments and voices are filled with harmonic content (see the section on Timbre). When audio amplifiers overload or clip, they generate harmonics. Thus, even if clipping takes place at a low frequency, an amplifier can output enough high frequency harmonics to damage tweeters. When bipolar transistor amplifiers overload, they produce more odd-order harmonics. When tube and MOSFET amplifiers overload, they produce more even-order harmonics. In the great Tube vs. Transistor debate, practitioners of the art of glass audio often cite this difference as one of the main reasons for the superiority of tube sound (though by no means is everyone convinced that tubes sound better).

Complex Waves (Timbre)

A complex wave is the sum of two or more harmonics. The human ear hears the timbre of a sound from a musical instrument based on the fundamental note (pitch) and the amplitudes and phase characteristics of the harmonics present in the sound. In addition, the tone quality of an instrument is affected by attack and decay transients. Attack transients occur when an instrument starts playing a note (for example, the striking of a piano note). Decay transients are the sounds of a note fading away. If these transients are removed (say, edited out in a recording), the sound of the remaining steady note loses its distinctiveness.

Loudness Perception


The Fletcher-Munson curves (above) measure loudness perception in human hearing at various sound pressure levels (SPLs, measured in decibels or dB). With 1 kHz as the reference point, hearing tends to be “flat” in the middle frequencies, but requires higher SPLs at the low and high frequencies to sound as loud as the reference. Thus, each curve marks SPLs of equal perceived loudness over frequency. At low listening levels, bass perception suffers dramatically, and the perception of the timbres of vocals and musical instruments changes. Quality tone controls or equalizers can help restore a satisfying tonal balance to music when listening at safe volume levels. For more information about hearing conservation, see Preventing Hearing Damage When Listening With Headphones.

The Missing Fundamental and Fundamental Tracking

If two or more notes played together are successive harmonics in a harmonic series, then the human ear will hear a third note: the fundamental frequency of the series. This effect is called the Missing Fundamental. If pairs of notes played in sequence have a frequency ratio of 3 to 2 and have different fundamental frequencies, the human ear will construct a fundamental frequency for each note. This phenomenon is called Fundamental Tracking. The headphones on most portable stereos exploit both of these principles to simulate an extended low frequency response. For more information on evaluating headphone sound quality, see Judging Headphones For Accuracy.
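A simplified model of both effects (an assumption for illustration, not a claim about how the ear actually computes it) treats the reconstructed fundamental as the greatest common divisor of the harmonic frequencies:

```python
from math import gcd

def missing_fundamental(harmonics_hz):
    """GCD of harmonic frequencies (integer Hz): a simple stand-in for
    the fundamental pitch the ear supplies."""
    result = harmonics_hz[0]
    for f in harmonics_hz[1:]:
        result = gcd(result, f)
    return result

# 400 Hz and 600 Hz are the 2nd and 3rd harmonics of a 200 Hz fundamental.
heard = missing_fundamental([400, 600])
```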


Binaural Beats

If two soft notes that are close in frequency are played separately, one in each ear (with no physical mixing either by aural bleed or bone conduction), the listener will hear Binaural Beats, which result when the brain itself mixes the sounds. Binaural Beats are unlike regular beats, which derive from the mixing of sounds in the air, and are one illustration of why headphones sound different from loudspeakers.

Spatial Cues for 3D Hearing (ILDs, ITDs and HRTFs)


There are three types of spatial hearing cues: interaural time differences (ITDs), interaural level differences (ILDs), and head-related transfer functions (HRTFs). ITDs refer to the difference in the time it takes a sound to reach each ear. ILDs describe the amplitude differences in the frequency spectrum of a sound as heard at each ear. HRTFs are a collection of spatial cues for a particular listener, including ITDs and ILDs, that also take into account the effect of the listener’s head, outer ears and torso on the perceived sound.

Low frequency spatial cues are different from those for high frequencies. A listener’s head, body and ears acoustically contour sounds depending on the location of the source. With high frequencies, differences in the amplitude spectra between the ears (ILDs) aid in placing the source. However, low frequencies tend to diffract around the head. Instead, the human brain factors in the delay time or phase difference (ITDs) between the ears to determine the location of low frequency sound sources. For example, if both ears hear a low frequency sound simultaneously, then the source is either directly in front of or directly behind the listener. If there is a delay, then the source will appear closer to the ear that hears it first. Time delays are also significant in high frequency localization.
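A common simplified estimate of ITDs is Woodworth's spherical-head formula, ITD = (a/c)(θ + sin θ); a sketch in Python, where the head radius is an assumed average:

```python
import math

HEAD_RADIUS = 0.0875     # m, an assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s in air

def itd_seconds(azimuth_deg):
    """Interaural time difference under Woodworth's spherical-head model.
    Azimuth 0 = straight ahead, 90 = directly to one side."""
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + math.sin(theta))

# A source dead ahead arrives at both ears simultaneously (ITD = 0);
# a source at the side arrives at the near ear roughly 0.65 ms sooner.
```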

ILDs and ITDs alone generally are not adequate for the human brain to resolve 3D sound. Head-related transfer functions (HRTFs) include ITDs and ILDs, but study them at a more personalized level. ITD and ILD measurements have generally assumed a spherical, disembodied head model. HRTFs factor in the effects of a listener’s outer ears (pinna), head and torso on perceived sound. In addition to the frequency-contouring of HRTFs, head movement can help the brain to localize sound. HRTFs are different for every listener. Headphones do not image realistically because they isolate the sound reproduction to the ears without the benefit of HRTFs to create spatial cues.

The Precedence Effect

The Precedence Effect localizes sound based on the first wave that reaches the ear, regardless of the loudness of any later arriving waves. Therefore, if several speakers are playing the same music, it will appear to come from the speaker closest to the listener, even if more distant speakers sound louder. If the sound source is a pure tone in a reverberant room, a listener who enters the room and does not hear the start of the sound will find it very difficult to localize (it seems to be coming from all directions).

For more information on 3D hearing, see A 3D Audio Primer and The Psychoacoustics of Headphone Listening.


Bibliography

Benade, Arthur H., Fundamentals of Musical Acoustics (1990).
Berg, Richard and Stork, David, The Physics of Sound (1982).
Campbell, Murray, The Musician’s Guide to Acoustics (1987).
Hall, Donald, Musical Acoustics (1991).
Hartmann, William M., “How We Localize Sound,” Physics Today, November 1999.
MacPherson, Ewan, “A Computer Model of Binaural Localization for Stereo Imaging Measurement,” JAES, September 1991.
Roederer, Juan, Introduction to the Physics and Psychophysics of Music (1975).
Sokol, Mike, “The Great Amplifier Debate: Tube vs. Transistor,” Free Spirit (1993).

c. 1998, 2000 Chu Moy.