Depth Perception in Headphones.

(The value of headphones in relation to loudspeakers)

by Ron Soh


In a Stax Omega 1 versus Stax Omega 2 headphone discussion that I was involved in several weeks ago, responses from a number of headphone hobbyists revolved around the issue of the value of expensive high-end headphones. I made an appeal for the value of headphones to be judged not just against other headphones, but also against speakers that cost far more. In this essay, I want to share with you what I mean by headphones having a better value than speakers, and also to delve into the subject of how headphones, like loudspeakers, portray soundstage when playing conventional two-channel recordings.

The psychoacoustics of sound localisation need not be brought into this discussion. This essay is NOT about a magic trick of shifting the soundstage from inside the head to a plane between the speakers! Instead, it outlines a simple experiment to illustrate two main points: (1) headphones can sound better than speakers costing more, and (2) headphones do portray a soundstage. If depth clues are portrayed in speakers, then why not in headphones? After all, it is the same signal we are feeding to both of them. The last section is a tutorial for training one’s ears to identify depth clues in headphones. The tutorial lists specific recordings and passages which have captured these clues clearly and are good training examples.

EXPERIMENT: play some music on your speakers, and make sure you sit (or stand) at the ‘sweet spot’. The ‘sweet spot’ is the position where you are equidistant from the left and right speakers, neither too far from them nor too near. Make sure that you have switched off any ‘loudness’ buttons and set all bass/treble control knobs to neutral (if your amplifier has these knobs). Listen to the music via the speakers for a few minutes. Then turn off the speakers and listen to the same piece of music via your headphones, staying at the same ‘sweet spot’ and facing and looking at the speakers. Listen to your headphones, and be surprised.

(This experiment assumes that your speakers do not cost more than five times your headphone, as a guideline. For instance, don’t compare a $100 headphone with $2000 speakers! Also this experiment assumes that you are not using cheap $15 headphones. And finally, don’t expect the headstage of your headphone to suddenly shift towards the speakers just because you are looking at the speakers.)

The differences you will notice can be quite entertaining and educational. Generally, what you will find is that:

  1. the headphone sounds clearer, and it is easier to distinguish between various instruments, as compared to the speaker.
  2. the loudspeaker will likely have a ‘veil’ covering the sound of voices/instruments, and this ‘veil’ is located somewhere around the upper bass / lower midrange region. This upper bass / lower midrange ‘veil’ is typical of loudspeaker ‘box colorations’, which are the effect of resonances of the speaker cabinet. Your headphone is not perfect either, but it is more likely that your headphone has fewer ‘box colorations’ than your speakers.
  3. the purpose of facing the speakers while you listen to the headphone is to give you visual clues while you listen to the headphones, so that you can appreciate that headphones DO convey ‘depth clues’. (I will delve into this intricate matter in detail later.)

The purpose of the little experiment above is to demonstrate two key points, which are:

  1. Headphones can be better transducers than speakers
  2. Of course a headphone conveys depth clues

Headphones Can Be Better Transducers Than Speakers

Speakers face more difficulties in painting a sonic picture, because speakers need to move a lot more air compared to headphones. When a speaker has to move more air, its cone (or dome or diaphragm or whatever) has to move back and forth through greater distances, and these greater driver excursions create peculiar problems such as non-linear excursions, cone break-ups, ringing (which is the problem caused when the cone moves forward and then instead of moving back immediately its momentum carries it forward a little bit more), and other problems I might know only if I were a speaker designer.

A headphone has a far easier life. Most headphones do not have woofers, midranges and tweeters; they are usually full-range transducers. Therefore a headphone needs no electronic crossover circuits to split up the frequency spectrum into low-frequency, mid-frequency and high-frequency signals that will be subsequently fed to woofers, midranges and tweeters. Crossover circuits present a longer signal path that tends to degrade signal quality. Also, sometimes the crossover is not handled properly and you have problems in the crossover frequency region, where the two drivers overlap but do not blend successfully.

A headphone also does not have big cabinets that tend to resonate. A speaker, in having to move more air, has to generate a lot of pistonic movement, and that results in huge backlash forces being transferred to the cabinet. A headphone does not need to generate huge pistonic movements, so less backlash energy is transferred to its chassis.

A headphone does not have to contend with room reflections. Of course, headphones have a cavity environment to deal with (a cavity environment is the space enclosed between the headset and your ears). In fact, most of the time, a headphone’s frequency response is not just due to the frequency response of its cone/diaphragm, but also due to this cavity environment. Headphone makers know how to ‘tune’ this cavity environment so as to compensate and counter-compensate for the frequency (im)balances of the cone/diaphragm. This cavity environment is far easier to predict than the room environment in which a pair of loudspeakers is found. Different rooms create different reflection characteristics, not to mention problems such as standing waves, and cause unpredictable colorations to the sound of speakers.
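To give a feel for why rooms are so unpredictable, here is a minimal sketch of where standing waves tend to fall between one pair of parallel walls. It uses the standard axial-mode relation f = n × c / (2 × L); the speed of sound value, the room dimensions and the function name `axial_modes` are assumptions chosen for illustration and are not taken from this essay.

```python
# Sketch: axial standing-wave (room mode) frequencies between two parallel walls.
# Assumptions: rigid parallel walls, speed of sound of roughly 343 m/s.

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at room temperature


def axial_modes(wall_spacing_m, count=3):
    """Return the first `count` axial mode frequencies (Hz) for one room dimension."""
    return [n * SPEED_OF_SOUND_M_S / (2.0 * wall_spacing_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    for spacing in (3.0, 4.0, 5.0):  # hypothetical room dimensions in metres
        modes = ", ".join(f"{f:.1f} Hz" for f in axial_modes(spacing))
        print(f"{spacing:.1f} m between walls: {modes}")
```

Changing the wall spacing by a metre moves every one of these frequencies, which is one reason the same pair of speakers can sound different from room to room. A real room has three dimensions plus furniture and absorbent surfaces, so treat the figures as rough indications only.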

A headphone’s diaphragm is smaller and far lighter than a speaker’s. This single factor gives headphones a head start towards the goal of superior accuracy in the translation of electrical impulses to mechanical movement. Due to its lower inertia, a headphone’s diaphragm starts and stops more quickly than a speaker’s drivers can, so a well designed headphone can exhibit faster transient attack.

A Headphone Can Convey Depth Clues

I hope that little experiment demonstrated for you that headphones give spatial clues, the same way loudspeakers do. In switching to and from your headphone and your speakers, you should be able to hear these depth clues. (If you cannot hear these depth clues, read on – especially the section and tutorial on how to perceive depth clues, then re-conduct the experiment.)

It is such a common misconception that headphones do not have a soundstage. Just because you wear the soundstage on your head does not mean that your headphone has no soundstage. To appreciate the differences between a headphone’s headstage and speakers’ soundstage, we first have to establish how speakers construct their soundstages.

When you position yourself in the ‘sweet spot’ in front of a pair of loudspeakers, a triangle is formed between you and the two speakers. The two speakers form a ‘picture plane’, which is the vertical plane that contains both speakers. This ‘picture plane’ faces you front-on, and depending on whether the speaker’s sonic character is forward or laid-back, images are formed in front of, or within, or behind this ‘picture plane’. Depending on the type of recording, some of the images appear to be positioned further behind this picture plane than other images, and this creates a sense of layering or depth of space. This combination of lateral (left-right) spread and front-to-back depth creates what we call a 3-dimensional ‘soundstage’.

People say that although headphones portray lateral left-right spread, headphones do not have a soundstage because headphones do not portray depth, that missing third dimension. That is a claim you can disprove for yourself by conducting that little experiment. Listen to your speakers and notice which images appear to be positioned further back than others. Then listen to your headphones (while still looking at your speakers). Do you hear the depth clues?

You may hear four ‘layers’ of depth:

  • 1st layer: very near (apparently only inches from the recording microphone)
  • 2nd layer: near (apparently 1 or 2 meters away from the recording microphone)
  • 3rd layer: not near, not far (apparently 5 to 10 meters away from the mike)
  • 4th layer: far (apparently 30 meters or more away from the mike).

Do not be too fixated on the actual numerical value of these distances – do not focus on an image and rack your brains trying to figure out whether it is 2 meters or 5 meters from you. You are enjoying music, not taking a stressful high-school examination. Moreover, you are not a bat. It is difficult to actually assign a numerical value to the distance of an instrument from the microphone just by hearing it. The numerical figures I mention above are simply to help you picture what I am trying to say; don’t take the numbers too seriously.

Close-miked, heavily-mixed recordings tend to have almost all their images in the 1st layer, with the odd image sometimes suddenly appearing in the 4th layer, for example the sound of a synthesizer that has been given an excessive reverberation treatment. Classical symphonic recordings portray images mainly in the 2nd, 3rd and 4th layers (the occasional solo instrument in the 2nd layer, massed strings in the 3rd, the occasional timpani/cymbals in the 4th, for example). However, please note that even if all the images appear to be in the 3rd layer only, that’s depth perception already. Most recordings with a lot of depth clues contain images that reside mainly within a single layer, with occasional images making their guest-star appearances in a different layer.

(The reason why I don’t listen to pop CDs exclusively is that most pop recordings contain images that are exclusively positioned in the 1st layer. After listening to one pop CD I like to change to a different type of recording that portrays images in the 2nd or 3rd layer. It gets rather tiring to keep listening to CD after CD of exclusively 1st layer images.)

Now we come to the most important point I am making in this essay. How does a headphone give you depth clues? Imagine that a school band is performing a musical number and marching along a road steadily AWAY from you. If you were to close your eyes you could hear them receding slowly away from you. How is that so? When playing conventional stereo recordings, you perceive depth clues in three ways: (1) loudness, (2) texture clarity and (3) reverberation. Take these 3 clues one by one:

  1. LOUDNESS: When an instrument is near to you it tends to sound louder. Conversely, the softer an instrument sounds, the further away you infer it to be.
  2. TEXTURE CLARITY: But what happens when an instrument is very near to you, but is played softly? Would you perceive it as being very far away? No, you would not, because the second way you perceive distance is through texture. When your ears hear that the texture of an instrument is very well defined, you infer that it is near to you. The further an instrument is from you, the less specific its texture appears to be. When a guitar string is plucked near you, you can hear its steely nature, its lower and upper harmonics, even though the guitar is plucked softly. But when the guitar string is plucked far from you, you may no longer appreciate its steely nature nor its lower and upper harmonics; you may only hear the fundamental. When the texture of an instrument sounds less rich you infer it to be farther away from you. When a drum-sound image has a specific texture of a stretched dry skin being hit, along with a loud sound of a steely rattle, you perceive it to be near; however when its texture is more ‘washed-out’, i.e. less specific, you perceive it to be far away. When a saxophone-sound image has a raspy texture you infer that it is near; when it doesn’t have this strong raspy texture you infer that it is far away.
  3. REVERBERATION: Reverberation is the sum total effect of sound reflecting off the wall, floor and ceiling surfaces of the recorded environment. The further an instrument is positioned from the microphone, the more reverberation is captured by the microphone relative to the instrument’s direct sound. Consequently, the more an image is surrounded by a reverberative halo, the further away you perceive this image to be. Reverberation causes an image to sound more laid-back, enveloped in a softer, cushier halo. It is always pleasant to listen to recordings with reverberation when listening to headphones, because such images are robbed of any jarring directness, seemingly buffered from you by a cushion of air. (This is not to say that close-up images are always irritating, not at all. Close-up images in the 1st layer can be very enjoyable via headphones when they have a sense of ‘liquid-ness’ about them. The three states of matter (air, liquid and solid) are very apt metaphors to describe and appreciate the quality of sound.)
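The three clues above can be sketched as rough numbers that all move in the same direction as a source recedes from the microphone. This is only an illustrative toy model, not a description of how any recording was made: the inverse-distance law for loudness, the constant reverberant level, and every constant and name below (`depth_cues`, `CRITICAL_DISTANCE_M`, `HF_LOSS_DB_PER_M` and so on) are assumptions introduced for the example.

```python
import math

# Hypothetical constants, chosen only for illustration (not from the essay).
REFERENCE_DISTANCE_M = 1.0   # distance at which the level is called 0 dB
CRITICAL_DISTANCE_M = 5.0    # assumed distance where direct and reverberant sound are equal
HF_LOSS_DB_PER_M = 0.1       # placeholder figure for high-frequency loss per metre


def depth_cues(distance_m):
    """Estimate the three depth clues for a source `distance_m` from the microphone."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return {
        # (1) Loudness: inverse-distance law, about 6 dB quieter per doubling.
        "level_db": -20.0 * math.log10(distance_m / REFERENCE_DISTANCE_M),
        # (2) Texture clarity: a stand-in figure for lost upper harmonics.
        "hf_loss_db": HF_LOSS_DB_PER_M * distance_m,
        # (3) Reverberation: direct-to-reverberant ratio, assuming a constant
        # reverberant level; positive means the direct sound dominates.
        "direct_to_reverb_db": -20.0 * math.log10(distance_m / CRITICAL_DISTANCE_M),
    }


if __name__ == "__main__":
    for d in (0.1, 1.5, 7.5, 30.0):  # illustrative distances, one per layer
        cues = depth_cues(d)
        print(f"{d:5.1f} m  level {cues['level_db']:6.1f} dB  "
              f"HF loss {cues['hf_loss_db']:4.1f} dB  "
              f"direct/reverb {cues['direct_to_reverb_db']:6.1f} dB")
```

The point of the sketch is that all three quantities are properties of the recorded signal itself, so they are the same whether that signal is fed to speakers or to headphones.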

The above three ways of perceiving depth clues are the reasons why I say headphones portray soundstages. The reason for asking you to look at your speakers whilst listening to your headphone is simply to give you some visual references to latch on to while you mentally assign the images to their respective layers (1st, 2nd, 3rd or 4th layer) through the perception of the above three depth clues (loudness, texture clarity and reverberation).

And the reason why I asked you to listen to your speakers FIRST before listening to your headphones is to convince you that the way a speaker gives depth clues is exactly the same as the way a headphone gives depth clues. By listening to your speakers first you hear the three depth clues (loudness, texture and reverberation), and then when you switch to your headphones you will be able to hear that these very same depth clues are also present via your headphones. Speakers are not televisions, and if it seems that a speaker positions its images in 3-dimensional space it is only because your ears hear these depth clues, not because your eyes are seeing actual objects. If you can hear depth clues via speakers, then why can’t you hear depth clues via headphones? The purpose of looking at your speakers while listening to your headphones is to ‘compare apples to apples’, i.e., to create a level playing field, where the visual assistance normally given by speakers in image positioning is also extended to your headphones.

Having said that headphones have soundstages, it is quite appropriate for me to qualify here that the soundstage thrown by speakers is ‘easier to visualise’ than that portrayed by headphones. When listening to speakers there is something quite literal about the way an image that is perceived BY THE EARS to have more depth actually seems TO THE EYE to be positioned further behind the ‘picture plane’. This visual assistance in image positioning is why people say speakers create a soundstage whilst headphones don’t. But the truth is headphones do give ample depth clues, if not more so. (Note: listening to speakers with the lights off and in a completely dark room still lends this visual assistance to image positioning, because you have a visual memory of where the speakers are placed.)

I hope that if you initially did not perceive your headphone’s ‘depth clues’ when you first conducted the experiment, you will now re-conduct the experiment. You may need to try it out over a few CDs before your ability to discern these ‘depth clues’ catches on (see Note 1).

The psychoacoustics of sound localisation are EXTRINSIC factors that are not the focus of my essay. These extrinsic factors of sound localisation are caused by the position of the transducers (be it speaker or headphone) in relation to our ears. If the transducers are placed in front of us, we perceive the images to be located in front of us. If the transducers are placed directly on our ears, we perceive the images to be located inside or around our heads.

The focus of my essay is the perception of depth clues that are already INTRINSICALLY present in the recordings that we play. These 3 depth clues (comparative loudness, comparative textural specificity, and comparative reverberation) are already part of the signal that we feed to our speakers and headphone. If these depth clues are portrayed by one transducer, then why not the other? After all, it is the same signal we are feeding to both of them.

If you have one of those audiophile test CDs you can demonstrate for yourself the truth of what I say. Listen to the track where the demonstrator hits a percussive instrument, while he slowly walks further and further from the pick-up microphone. The image that you hear over your headphone will CONSTANTLY RESIDE INSIDE YOUR HEAD, but you can tell that the demonstrator is walking progressively away from the microphone. How so? Via the perception of the 3 depth clues I outlined, i.e., the percussive instrument progressively gets (i) softer in volume, (ii) less specific in texture, and (iii) more and more diffused by a reverberative halo.

The psychoacoustics of sound localisation need not be mixed-up with the perception of depth clues via headphones. How far or how near you perceive an image to be is related to extrinsic factors of where the transducers are located in relation to your ears. Whether that image is located in front of you (in the case of speakers) or inside your head (in the case of headphones) does not change the historical fact that instrument A was placed 5 meters from the pick-up mike and instrument B was placed 30 meters from the pick-up mike. These ‘historical facts’ are INTRINSIC to the recordings we play, and are perceivable via headphones.

The purpose of asking the reader to look at his speakers while listening to his headphone was to extend a visual assistance in perceiving these 3 depth clues. I am already accustomed to perceiving depth clues present in recordings, and therefore do not need visual assistance in perceiving that certain instruments are placed further from the recording microphone than others. But for a reader who is new to this perception, this visual assistance is a helpful, but temporary, crutch.

Ear-Training For Depth Perception in Headphones

An audiophile test CD can give a simple demonstration of depth perception in headphones. Listen to the track where the demonstrator hits a percussive instrument, while slowly walking further and further from the pick-up microphone. The image that you hear over your headphone will CONSTANTLY RESIDE INSIDE YOUR HEAD, but you can tell that the demonstrator is walking progressively away from the microphone. How so? Via the perception of the 3 depth clues I outlined, i.e., the percussive instrument progressively gets (i) softer in volume, (ii) less specific in texture, and (iii) more and more diffused by a reverberative halo.

Thus, there are three mechanisms for perceiving the distances of voices/instruments from the pick-up mikes: (i) comparative loudness, (ii) comparative textural specificity, and (iii) comparative reverberation. These three mechanisms can be heard on any CD; below is just a sampling:

(1) The Three Tenors In Concert (Teldec 4509-96200-2)
Track 5 Granada- Placido Domingo’s voice is Layer 2. He is not a pop singer who needs to stand so close to the mike!!! So it is not a Layer 1 voice. The tambourine is Layer 3. When only a few people clap, it appears that these people are at Layer 4, but the moment all of them start clapping, it appears they are closer to the mike; they appear to be at Layer 3. This must be the comparative loudness mechanism at work: the louder a sound is, the closer it appears to be.

(2) Planet Drum by Mickey Hart (Rykodisc RCD 80206)
All the images in this CD seem to be Layer 2 images. The instruments all appear to be close-miked. However, because my headphone has a laid-back presentation style, I suspect this causes the images to become Layer 2 images. A more forward-sounding headphone might portray them as Layer 1 images. I do not know; I don’t have a forward-sounding headphone at hand to verify.

(3) The Emissary by Chico Freeman (Clarity Recordings CCD-1015)
Track 2 Mandela- The saxophone is Layer 1 image (even on my laid-back phones), but the drum kit is Layer 2 in the sense that it is definitely a little further away from the mike than the lead saxophone is. The rest of the accompanying instruments like tambourine, electric guitar and background voices are likewise Layer 2. This style of layering is obviously to highlight the lead saxophone, who is Chico Freeman, the rightful star of this CD.

It is also interesting to note that Clarity Recordings employs a minimally-miked approach here, but the musicians, especially the lead sax, are positioned so close to these mikes that the recording seems a little reverberatively dry, at least as far as minimally-miked recordings go. I would characterise this CD as a forward-sounding minimally-miked recording, which surprised me: I am used to the idea that minimally-miked recordings are not forward-sounding.

(4) Toolbox by Toolbox (Vacuum Tube Logic Of America VTL 008)
This entire jazz CD is unbelievably reverberatively lush. The lead flute is Layer 2, the piano is Layer 3. The drum kit, oh the drum kit, how do I describe this one? The drum kit itself is Layer 3, but the faint echo/reverberation of that drum kit is Layer 4!!! Heavenly! At the end of each cymbal hit or drum hit the hall is suddenly ‘lit up’ for a very brief moment, and that Layer 4 designation of the drum kit’s echo is also a sonic description of the acoustic within which this recording was made. Recordings by VTL (which employ a complete line-up of Manley equipment) are so incredible for headphone listening because of the way the layering of apparent distances is achieved by means of reverberation.

The other two mechanisms, i.e., comparative loudness and comparative textural specificity, are not so important in VTL recordings. VTL recordings are must-haves for headphone-freakos! Unfortunately, they do not release new recordings anymore. The alternative way to hear the effect of reverberation on the perception of distances is to listen to binaural CDs, which also capture a lot of hall reverb as a by-product of the recording method.

(5) Music For Strings, Percussion & Celesta by Bartok (Decca 430 352-2)
This classical recording definitely employs a lot of accent mikes placed close to the musicians. Most of the images here are Layer 2 images, with that weird piano-like percussive instrument being placed somewhere between Layer 1 and Layer 2, in the sense that it is more forward than the rest of the orchestra, but not as forward as the way Madonna would like to eat a mike. Strangely, even the timpani is a Layer 2 image. Timpani are usually placed way, way back at the rear of the orchestra, but this particular timpani does not sound that far away. It must be those accent mikes making the timpani sound nearer than it actually is.

(6) Fireworks by Stravinsky (Delos D/CD 3504)
Delos definitely does not use many accent mikes. The sense of layering in Delos CDs is definitely top-notch. The marvellous bloom of the lush violin section is Layer 3. The brass instruments appear closer, at Layer 2. And that gong-like/drum-like sound (I cannot think of the name of that instrument) is way, way at the back of the orchestra, a Layer 4 image. I think the impulse reverberation of the gong-like/drum-like sound contributes to its sense of being a Layer 4 image. There’s a sense of immense majesty when a timpani/gong/drum becomes a Layer 4 image – like the sound of distant thunder, growling with authority from afar. No Layer 1 images here.

(7) Stereophile Test CD3 (STPH 006-2) Track 10.

Index 1: 2nos Omni-mikes – John Atkinson talks and hits a cowbell (a percussive thing) in a church interior, and walks from far-stage-left towards the mikes placed in the centre and then away from the mikes towards far-stage-right. His movement nearer to and further from the mikes is obvious over headphones. Then he stands at the very back of the church, and walks along the centre aisle towards the mikes. This last movement pattern (from back to front of church) is really obvious: his voice/cowbell is decreasingly diffused by church reverberation as he walks towards the front of the church where the mikes are. This is a smooth transition from Layer 4, through Layers 3 and 2, then finally Layer 1. There is no better proof than this track that the distance of a voice/instrument from the pick-up mike is perceivable via headphones!

Index 2: 3nos Omni-mikes – same pattern of movements, except with a different microphone array: a center mike was added. The sense of depth or movement to and from the mikes is likewise the same as Index 1, but due to the center mike, images far-left do not seem as far-left and the images far-right do not seem as far-right, compared to Index 1.

Index 3: ORTF cardioid mikes – same pattern of movements, except with a different microphone type. The sense of depth or movement to and from the mikes is likewise the same as Index 1 and 2, but due to the ORTFs picking up less hall reverb, the image of Mr. Atkinson’s voice/cowbell is less diffused by reverberation. Strangely, it is still very easy to tell when he is standing at the back of the church. This is due to (i) his voice being softer in volume, and (ii) the textural specificity of his voice being reduced, i.e., vocal pronunciations of vowels/consonants are less clear, and the cowbell is less sharp in transient attack when he stands at the back of the church. This experiment clearly shows that reverberation is not the sole mechanism of distance perception.

Index 4: ORTF cardioid mikes with post-processing. Same as Index 3, but with Blumlein processing to add low-frequency bloom. Same observations as Index 3. I really cannot appreciate the so-called increased LF bloom. I do not hear it. But the LF bloom is not relevant to the issue at hand, which is distance perception.

Index 5: Schoeps sphere microphone (binaural) – same pattern of movements, but illusion of sound localisation is partially realised, unlike with the other 4 indexes above, due to the usage of a Schoeps microphone. Image size is smaller and more precisely located in relation to the headphone-wearer. The mechanics of perception of distance here is different from the above 4 indexes, because here the psychoacoustics of sound localisation is called into play. However, because my head is different from the plastic head used, the binaural illusion is only partially realised for me. Index 5 is not a good example to demonstrate the 3 mechanisms listed above, because a fourth mechanism, i.e., the mechanism of sound localisation, is involved here. This mechanism of sound localisation is the basis of binaural recordings.

Note: I am aware that recordings, even minimally-miked ones, are usually not just 2-mike affairs. In conjunction with the main microphones, there are accent microphones which pick up the clarity of the instruments’ musical lines, and there are also hall mikes that are placed further from the musicians to pick up hall reverberation. And the gain applied to each of these mikes is different, depending on the recording engineer’s sonic intentions.

But my contention is this: for every microphone array and mixing configuration, there is such a thing as the centre-of-gravity of that array. This centre-of-gravity is the location where our ears appear to be located, when we listen to a recording. This is an unavoidable fact, I guess due to the egocentric nature of perception: we will always perceive the world, the sonic world even, in relation to ourselves. One will always locate oneself as the centre of the perceived world. When one listens to a particular recording with a particular microphone array, there will always be that one spot where one thinks the musicians and the room/hall are located in relation to oneself. That spot is the centre-of-gravity of the microphone array.

Notes: 1: All these notes about depth clues should not mislead anyone into thinking that listening to a headphone is a very demanding effort. It sure is a cerebral effort for me to try to explain the construction of a headphone’s soundstage to you, but it should not be a cerebral affair for you to discern depth clues via headphones. Like I mentioned earlier, this isn’t a high-school examination; it is music you are enjoying. Remember: an appreciation of depth clues is only one-third of the appreciation of a headphone. There are three ways to enjoy your headphone: its sense of air (this is the category into which depth clues fall), its sense of liquid-ness, and its sense of solidity.

c. 2000, Ron Soh.

