Speech Production

Humans are capable of producing a great variety of distinct speech sounds. The approximately 5,000 languages that humans speak use over 850 different sounds. The capability of humans to produce such a variety of vocal sounds, far more than any other animal, is a result of the unique structure of the human vocal tract, which includes the lungs, trachea, larynx, nasal cavity, and structures in the mouth.

Human Vocal Tract

The structures that produce human speech can interact in a variety of ways to produce a large range of sounds. This video shows some of the interactions of the parts of the human vocal tract during the production of speech.

Describing Sounds

The sounds that are combined into speech can be described by the manner in which they are produced, that is, by how the various structures in the vocal tract perform to produce the sound.

The initial process in speech production is the forcing of air out from the lungs. In a few human languages, some sounds are produced by inhaling air into the lungs, and some sounds – clicks, for example – are produced without involving the lungs. In American English, however, all speech sounds involve exhaling from the lungs.

When we are not speaking, we breathe about 15 times per minute. Breathing occurs in two phases, inhalation (drawing the air in) and exhalation (sending the air out), that are approximately equal in duration. While speaking, however, we inhale quickly, in less than a second, then extend the exhalation to several seconds.
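
The timing described above can be worked out with a little arithmetic. The sketch below is only illustrative: the resting figures come from the text, while the speaking values (0.5 seconds in, 4 seconds out) are assumed numbers chosen to fit "less than a second" and "several seconds". The function names are made up for this example.

```python
# Resting breathing, using the figure from the text: about 15 breaths per minute.
BREATHS_PER_MINUTE = 15

def cycle_seconds(breaths_per_minute):
    """Duration of one full breath (inhalation plus exhalation) in seconds."""
    return 60.0 / breaths_per_minute

def exhale_fraction(inhale_s, exhale_s):
    """Fraction of one breath cycle that is spent exhaling."""
    return exhale_s / (inhale_s + exhale_s)

resting_cycle = cycle_seconds(BREATHS_PER_MINUTE)  # seconds per breath at rest
resting_phase = resting_cycle / 2                  # the two phases are roughly equal

# Assumed speaking breath: 0.5 s in, 4.0 s out (illustrative values, not measurements).
speaking_fraction = exhale_fraction(0.5, 4.0)

print(f"Resting: {resting_cycle:.1f} s per breath, about {resting_phase:.1f} s per phase")
print(f"Resting exhale share:  {exhale_fraction(resting_phase, resting_phase):.0%}")
print(f"Speaking exhale share: {speaking_fraction:.0%}")
```

With these assumed values, most of each breath during speech is spent on the exhalation, which is the part of the cycle that carries the sound.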

At the start of a speech sound, the muscles of the diaphragm push air out of the lungs, through the trachea, and up to the larynx. At the larynx, the air must pass between two small flaps of muscle, which are called the vocal cords.

The air passing between these flaps can cause them to vibrate. The frequency at which they vibrate affects the pitch of the vocal sound that will be produced, and that frequency is, in turn, affected by the size and tension of the vocal cords. Smaller vocal cords vibrate more quickly than larger ones and thus produce higher-pitched sounds. The size depends on the age and sex of the person. Young humans have smaller vocal cords than adults, and females tend to have smaller vocal cords than males, which is why children and women tend to have higher-pitched voices than men. The tension of the vocal cords is controlled by the muscles that compose them. When the muscles are contracted, they become stiffer, and they vibrate more rapidly and produce higher-pitched sounds. When the muscles are relaxed, they become more flexible, and they vibrate more slowly and produce lower-pitched sounds. By controlling the tension of the vocal cords, humans can vary the pitch of their voices.
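
One rough way to see how size and tension both affect pitch is to borrow the formula for an ideal stretched string, f = sqrt(T / mu) / (2 L). This is an analogy only, not a model of the real vocal folds, which are layered tissue rather than strings. The function name and every number below are hypothetical, chosen just to show the direction of each effect.

```python
import math

def string_frequency(length_m, tension_n, mass_per_length_kg_m):
    """Fundamental frequency (Hz) of an ideal stretched string: sqrt(T / mu) / (2 L)."""
    return math.sqrt(tension_n / mass_per_length_kg_m) / (2.0 * length_m)

# Illustrative numbers only; they are not measurements of anyone's voice.
base = string_frequency(0.016, 0.10, 0.002)     # reference case
shorter = string_frequency(0.010, 0.10, 0.002)  # smaller vibrating length
tighter = string_frequency(0.016, 0.40, 0.002)  # four times the tension

print(f"reference: {base:.0f} Hz")
print(f"shorter:   {shorter:.0f} Hz")
print(f"tighter:   {tighter:.0f} Hz")
```

In this simplified picture, shortening the vibrating length or raising the tension both raise the frequency, which matches the qualitative description above.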

The vibrations of the vocal cords are, in a sense, the raw material of speech. The truly remarkable variety in human speech is produced by structures above the larynx and vocal cords. These structures include the mouth and nasal cavity, which are together called the vocal tract. By moving various muscles in the vocal tract, we restrict and direct the exhaled air, adding tremendous variety to the vibrations that started with the vocal cords. These movements in the vocal tract are called articulation. Speech sounds are most often described in terms of articulation, that is, in terms of how the structures in the vocal tract are configured to produce the sounds.

Human Vocal Sounds

The sounds of speech can be classified into two major categories, vowels and consonants. A vowel is a sound that is produced with the vocal cords vibrating and an open vocal tract, so that air pressure does not build up. A consonant is a sound that is produced with or without the vocal cords vibrating, but with a complete or partial obstruction of the vocal tract, so that air pressure does build up. Generally, speech is a sequence of sounds called syllables. Each syllable is composed of a vowel sound that may be preceded, followed, or both preceded and followed by a consonant sound.
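
The syllable pattern just described (a vowel, optionally with a consonant before it, after it, or both) can be sketched as a simple check. This is a simplification that follows the text; real English syllables can also contain consonant clusters. The set of vowel labels and the function names are assumptions made for illustration, using the informal spellings from this article rather than phonetic symbols.

```python
# Informal vowel labels as written in this article (not phonetic symbols).
VOWEL_LABELS = {"a", "e", "i", "o", "u", "ee", "oo", "au", "ai"}

# The shapes allowed by the simplified description: a vowel with an
# optional consonant before it, after it, or both.
SIMPLE_SHAPES = {"V", "CV", "VC", "CVC"}

def shape(sounds):
    """Map a list of sound labels to a string of C (consonant) and V (vowel)."""
    return "".join("V" if s in VOWEL_LABELS else "C" for s in sounds)

def is_simple_syllable(sounds):
    """True if the sounds fit the simplified vowel-centered pattern."""
    return shape(sounds) in SIMPLE_SHAPES

print(shape(["f", "ee", "t"]), is_simple_syllable(["f", "ee", "t"]))
print(shape(["g", "o"]), is_simple_syllable(["g", "o"]))
print(shape(["s", "t"]), is_simple_syllable(["s", "t"]))
```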

Vowels

Humans produce a variety of sounds that are classified as vowels. These vowels are distinguished by how the sounds are produced. The position of the tongue in the mouth and the shape of the lips are significant. Some vowel sounds are produced with the tongue held high near the roof of the mouth (such as the e sound in feet), others with the tongue held low in the mouth (such as the a in father).

TRY THIS:
Alternate saying ee (as in feet) and a (as in father). Notice how your tongue moves up and down, up for ee and down for a.

Some vowel sounds are formed with the tongue held back in the mouth (such as the au in caught), others with the tongue held forward (such as the ai in bait).

TRY THIS:
Alternate saying ai (as in bait) and au (as in caught). Notice how your tongue moves forward and back, forward for ai and back for au. You can also feel vibration toward the back of your mouth when you say au and closer to the front when you say ai.

The position of the lips also affects the sound of a vowel. Some vowels are sounded with the lips held in a rounded form (the oo in boot), others with the lips unrounded (the au in caught).

TRY THIS:
Alternate saying oo (as in boot) and au (as in caught). Notice how your lips pucker to form a round opening then relax to an unrounded shape, round for oo and unrounded for au.

Vowel sounds, then, can be described by how the structures of the vocal tract form them. There are various combinations of these structures, and American English uses several of them. For example, there is a high, back, rounded vowel sound, in which the tongue is high and toward the back in the mouth, and the lips are rounded: the u in true. There is a low, back, rounded vowel sound, in which the tongue is low and toward the back in the mouth, and the lips are rounded: the o in both. There is a high, back, unrounded vowel sound, with the tongue high and toward the back of the mouth, and the lips unrounded: the u in put. There is a high, front, unrounded vowel sound, with the tongue high and toward the front of the mouth, and the lips unrounded: the e in feet. There are others used in American English. Some combinations of lip and tongue positions do not correspond to a vowel in English, but do so in other languages. Even finer distinctions in tongue and lip positions have been used for describing vowel sounds with greater precision.
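
The descriptions above amount to a small lookup keyed on three features: tongue height, tongue position from front to back, and lip rounding. Here is a minimal sketch that uses only the four example vowels named in the text. The names VOWELS and describe_vowel are made up for illustration.

```python
# Keys are (height, backness, rounding); values are the example words from the text.
VOWELS = {
    ("high", "back", "rounded"): "the u in true",
    ("low", "back", "rounded"): "the o in both",
    ("high", "back", "unrounded"): "the u in put",
    ("high", "front", "unrounded"): "the e in feet",
}

def describe_vowel(height, backness, rounding):
    """Return the text's example for a feature combination, if it gives one."""
    example = VOWELS.get((height, backness, rounding))
    if example is None:
        return f"{height}, {backness}, {rounding}: no example given in this text"
    return f"{height}, {backness}, {rounding}: {example}"

print(describe_vowel("high", "front", "unrounded"))
print(describe_vowel("low", "front", "rounded"))
```

A combination that is missing from the lookup is not necessarily impossible; it may simply be unused in English or not covered here.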

Consonants

Consonant sounds are, like vowel sounds, distinguished by how they are produced. Consonant formation involves stopping or impeding the flow of air from the lungs. This stopping or impeding can occur at various places in the vocal tract and involve various parts of its anatomy: the lips, the tongue, the teeth, the glottis, and the palate. The vocal cords can also be involved during the production of a consonant. The assortment of consonant sounds is quite broad, because so many different combinations of vocal tract actions can be involved. Spoken American English involves only a subset of the many possible consonants, and these are described in what follows.

Consonants that involve the vocal cords during production are called voiced consonants. Examples include the m in mat and r in run. Those produced without using the vocal cords are called voiceless consonants. Examples include the s in sat and the p in pun.

TRY THIS:
Begin to say the word run. Notice that your vocal cords are vibrating as you begin to say the word. Put a finger on your larynx as you say run, and you will feel your vocal cords vibrating as you produce the r sound. The r is a voiced consonant. Now begin to say the word sat. Notice that your vocal cords are not vibrating as you begin to say sat. Put a finger on your larynx as you say sat, and you will not feel your vocal cords vibrate until you get to the a sound. The s is an unvoiced consonant.

Consonant sounds are also categorized by the location in the vocal tract where the air stream is stopped or impeded. This can happen in any of these locations: at the lips, at the teeth, at the ridge just behind the upper teeth (called the alveolar ridge), at the roof of the mouth (the hard palate), and at the soft palate at the back of the mouth (also called the velum).

In American English, the g sound in go and the k sound in kin are produced when the back of the tongue is pressed against the velum. They are called velar sounds. These two differ in voicing: the g sound is voiced and the k sound is voiceless.

TRY THIS:
Begin to say the word go. Feel how the back of your tongue rises to touch the back of your palate. Also, notice that your vocal cords are vibrating as you begin to say the word. The g is a voiced velar consonant. Now begin to say the word kin. Feel again how the back of your tongue rises to touch the palate. Also, notice that your vocal cords are not vibrating as you begin to say the word, but they begin to vibrate as you form the i vowel sound. The k is an unvoiced velar consonant.

The alveolar ridge is involved in forming the sounds of d in dog and t in tin. These are alveolar consonants in American English. The d and t differ in voicing: the d sound is voiced and the t sound is voiceless.

TRY THIS:
Begin to say the word dog. Feel how the front of your tongue rises to touch the ridge just behind your upper teeth. Also, notice that your vocal cords are vibrating as you begin to say the word. The d is a voiced alveolar consonant. Now begin to say the word tin. Feel again how the front of your tongue rises to touch the ridge just behind your upper teeth. Also, notice that your vocal cords are not vibrating as you begin to say the word, but they begin to vibrate as you form the i vowel sound. The t is an unvoiced alveolar consonant.

The two lips are involved in forming the sounds of b in bit and p in pit. These are bilabial (two-lipped) consonants. The b and p differ in voicing: the b sound is voiced and the p sound is voiceless.

TRY THIS:
Begin to say the word bit. Notice that as you begin the word, your two lips are pressed together. Also, notice that your vocal cords are vibrating as you begin to say the word. The b is a voiced bilabial consonant. Now begin to say the word pit. Feel again that as you begin the word, your two lips are pressed together. Also, notice that your vocal cords are not vibrating as you begin to say the word, but they begin to vibrate as you form the i vowel sound. The p is an unvoiced bilabial consonant.

The lower lip and the upper teeth are involved in producing the sounds of v in vat and f in fat. These consonants are called labiodental (lip-teeth) consonants. The v is a voiced consonant, while the f is unvoiced.

TRY THIS:
Begin to say the word vat. Notice that as you begin the word, your upper teeth are touching your lower lip. Also, notice that your vocal cords are vibrating as you begin to say the word. The v is a voiced labiodental consonant. Now begin to say the word fat. Feel again that as you begin the word, your upper teeth are touching your lower lip. Also, notice that your vocal cords are not vibrating as you begin to say the word, but they begin to vibrate as you form the a vowel sound. The f is an unvoiced labiodental consonant.

The roof of the mouth is used along with the tongue to produce the consonants represented in American English by the sh in ship and the ch in chip. Neither of these sounds is voiced; both are produced without using the vocal cords. They differ in how they obstruct the air flow from the lungs. In producing the sh sound, the air from the lungs is forced to pass through a narrow opening formed by holding the tongue against the roof of the mouth. The turbulence caused by the air passing through the narrow opening produces the sound. A consonant formed in this way is called a fricative. In producing the ch sound, the air flow from the lungs is briefly blocked completely, then suddenly released. A consonant produced by a complete blocking of the air flow is called a stop. The sh sound is a voiceless palatal fricative, and the ch sound is a voiceless palatal stop.

TRY THIS:
Begin to say the word chip. Notice that as you begin the word, you press your tongue against the roof of your mouth. The ch sound is a palatal consonant. If you hold your mouth in position to begin the word chip and try to exhale, the air will pass through your nose, not your mouth. The air passage in your mouth is completely blocked. The ch sound is a stop.

Now begin to say the word ship. Again, notice how you lift your tongue up to the roof of your mouth. The sh sound is a palatal consonant. If you hold your mouth in position to begin the word ship and try to exhale, air will pass through your mouth, although the flow is somewhat restricted compared to when you breathe normally through your mouth. The sh sound is a fricative.


We have already encountered consonants that are stops. Both the b in bit and the p in pit are stops. Both involve completely stopping the flow of air through the mouth, allowing the pressure to build somewhat, and then suddenly releasing it. In both cases, the stoppage occurs when the lips are pressed together. The difference between them is whether the vocal cords are used in producing the sound. The vocal cords are involved in producing b, so it is a voiced bilabial stop. The vocal cords are not used in producing p, so it is a voiceless bilabial stop.

Stops can be produced in other locations in the mouth as well. A stop can be created with the tip of the tongue pressed against the alveolar ridge, which is at the top front of the mouth, just behind the teeth. The t in tin and the d in din both involve pressing the tongue against the alveolar ridge. The difference between the t and d sounds is the time at which the vocal cords vibrate when the sound is produced.

TRY THIS:
Say the words din and tin and observe that in din the vocal cords vibrate before the tongue is removed from the alveolar ridge, while in tin, the vocal cords vibrate after it is removed.
The initial consonants in the English words din and tin are alveolar stops: the d is voiced and the t is voiceless.

A stop can also be produced by pressing the tongue against the soft palate at the back of the roof of the mouth, which is also called the velum. The sounds of the initial consonants in the English words go and kin are velar stops: the g is voiced and the k is voiceless.

TRY THIS:
Say the words go and kin and observe that in go the vocal cords vibrate before the tongue is separated from the velum, while in kin, the vocal cords vibrate after they are separated.

Many consonants are produced not by a complete stopping of the air flow from the lungs, but by constricting the flow in some way and forcing it through a narrowed opening. Consonants produced in this way are called fricatives. Fricatives can be formed by forcing air between the upper teeth and lower lip when these are pressed together. These are called labiodental fricatives. The consonants at the start of the words van and fan are examples, the former being a voiced labiodental fricative and the latter an unvoiced labiodental fricative.

TRY THIS:
Say the words van and fan and observe that both words begin with the upper teeth pressed against the lower lip. Both v and f are labiodental. In both, the flow of air is partially obstructed; they are both fricatives. With the word van the vocal cords vibrate before the teeth are separated from the lip, and v is a voiced consonant. With the word fan the vocal cords begin to vibrate after the teeth and lip are separated, and f is an unvoiced consonant.

Other fricatives are formed by placing the front of the tongue against the alveolar ridge at the top front of the mouth, behind the teeth, and forcing air between them. The consonants at the beginning of the words zip and sip are formed this way, and are called alveolar fricatives, the first being voiced and the second unvoiced.

TRY THIS:
Say the words zip and sip. Observe that both words begin with the front of the tongue against the alveolar ridge. Air is forced through the restricted space between the tongue and the top of the mouth. Both z and s are alveolar fricatives. With zip, the vocal cords vibrate while air flows between the tongue and the roof of the mouth, and with sip, the vocal cords begin to vibrate after the tongue is removed from the top of the mouth. The z is voiced, and the s is unvoiced.

Fricatives can also be produced by pressing the upper teeth onto the tongue and forcing air between them, as in the initial sound in the word thin. This sound is an unvoiced dental fricative. The voiced dental fricative is the initial sound in the word then.

TRY THIS:
Say the words then and thin. Observe that both words begin with the front of the tongue touching the upper teeth. Air is forced through the restricted space between the tongue and teeth. Both are dental fricatives. With then, the vocal cords vibrate while air flows between the tongue and the teeth, and with thin, the vocal cords begin to vibrate after the tongue is removed from the teeth. The th in then is voiced, and the th in thin is unvoiced.

When the tongue is held against the palate and air is forced between them, palatal fricative consonants can be formed. The consonant sound in the middle of the word fashion is an unvoiced palatal fricative, and the consonant sound in the middle of the word vision is a voiced palatal fricative.

TRY THIS:
Say the words fashion and vision. Observe that in the middle of each word, the tongue is lifted up to the palate at the top of the mouth, and air is forced between the tongue and the palate. Both words contain palatal fricatives. In vision, the vocal cords vibrate during the fricative; it is a voiced fricative. In fashion, the vocal cords do not vibrate, so it is unvoiced.

A fricative can also be formed by constricting the airway by tightening the vocal cords, which narrows the opening between them, called the glottis, and forcing air between them. Because the vocal cords are tightened, they cannot vibrate, so only an unvoiced glottal fricative is possible. This is the breathy consonant sound at the beginning of the word happy.

Some consonants are produced by vibrating the vocal cords and shaping the vocal tract in such a way as to produce particular resonances. Some of these resonances involve the nasal cavity, and the consonants so formed are called nasal. The sound at the beginning of the word mud is a bilabial nasal, produced by pressing the two lips together, vibrating the vocal cords, and slowly exhaling through the nose. This causes the sound to resonate in the nasal cavity.

TRY THIS:
Say the word mud. Observe that at the start of the word, your two lips are pressed together, yet air is flowing out of your lungs. It is flowing out through your nose. To see that you are exhaling through your nose, start making the m sound as at the beginning of the word mud, and pinch your nose shut. The sound will stop. The sound of m is produced with the lips held together while exhaling through the nose. The m sound is a bilabial nasal consonant.

A similar sound is that at the beginning of the word net, which is an alveolar nasal, formed by pressing the tongue to the alveolar ridge, vibrating the vocal cords, and slowly exhaling through the nose. A third nasal consonant is produced by pressing the back of the tongue to the back of the palate, the velum, vibrating the vocal cords, and exhaling slowly through the nose. This velar nasal is the consonant sound at the end of the word ring. English has no words beginning with this sound, nor do any other European languages, but many African and Asian languages do.

The mouth itself can also serve as a resonant cavity in the formation of consonants. The cavity of the mouth can be shaped by the tongue to result in different consonants.

TRY THIS:
Place the tip of your tongue on the alveolar ridge in your mouth, just behind your upper front teeth. Now vibrate your vocal cords. If you were now to start speaking a word, with what letter would that word begin? If you answered l, as in lock or leaf, you would be correct. Say the word lock. Observe when starting the word, how you place the tip of your tongue on your alveolar ridge.

Another consonant that uses resonance of the mouth is formed when the tongue is positioned so the left and right edges of the tongue are placed against the upper teeth at both sides of the mouth. Then, with the lips positioned to form a round mouth opening, the vocal cords are vibrated. The consonant that this produces is the sound at the beginning of the word ring, namely, the r sound. The consonants formed by resonance in the mouth, such as l and r, are called the liquid consonants.

This chart summarizes the classification of some of the consonants used in American English.

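Production of some consonants in American English.

| Manner    | Voicing   | Bilabial | Labiodental | Interdental | Alveolar | Palatal | Velar | Glottal |
|-----------|-----------|----------|-------------|-------------|----------|---------|-------|---------|
| Stop      | Voiceless | p        |             |             | t        | ch      | k     |         |
| Stop      | Voiced    | b        |             |             | d        |         | g     |         |
| Fricative | Voiceless |          | f           | th(in)      | s        | sh      |       | h       |
| Fricative | Voiced    |          | v           | th(en)      | z        |         |       |         |
| Nasal     | Voiced    | m        |             |             | n        |         | ng    |         |
| Liquid    | Voiced    |          |             |             | l        | r       |       |         |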
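
The classification in the chart can also be written as a small lookup, which makes it easy to list the consonants that differ only in voicing. This is a sketch built from the entries in the chart. The h sound is entered as glottal, following the description in the text. The names Consonant, CONSONANTS, describe, and voicing_pairs are hypothetical, and the labels are the informal spellings used in this article rather than phonetic symbols.

```python
from typing import NamedTuple

class Consonant(NamedTuple):
    manner: str
    voicing: str
    place: str

# Entries follow the chart; "h" is glottal as described in the text.
CONSONANTS = {
    "p": Consonant("stop", "voiceless", "bilabial"),
    "b": Consonant("stop", "voiced", "bilabial"),
    "t": Consonant("stop", "voiceless", "alveolar"),
    "d": Consonant("stop", "voiced", "alveolar"),
    "ch": Consonant("stop", "voiceless", "palatal"),
    "k": Consonant("stop", "voiceless", "velar"),
    "g": Consonant("stop", "voiced", "velar"),
    "f": Consonant("fricative", "voiceless", "labiodental"),
    "v": Consonant("fricative", "voiced", "labiodental"),
    "th(in)": Consonant("fricative", "voiceless", "interdental"),
    "th(en)": Consonant("fricative", "voiced", "interdental"),
    "s": Consonant("fricative", "voiceless", "alveolar"),
    "z": Consonant("fricative", "voiced", "alveolar"),
    "sh": Consonant("fricative", "voiceless", "palatal"),
    "h": Consonant("fricative", "voiceless", "glottal"),
    "m": Consonant("nasal", "voiced", "bilabial"),
    "n": Consonant("nasal", "voiced", "alveolar"),
    "ng": Consonant("nasal", "voiced", "velar"),
    "l": Consonant("liquid", "voiced", "alveolar"),
    "r": Consonant("liquid", "voiced", "palatal"),
}

def describe(label):
    """Return a description such as 'voiced bilabial stop'."""
    c = CONSONANTS[label]
    return f"{c.voicing} {c.place} {c.manner}"

def voicing_pairs():
    """Return (voiceless, voiced) pairs that share both manner and place."""
    pairs = []
    for a, ca in CONSONANTS.items():
        for b, cb in CONSONANTS.items():
            same_slot = (ca.manner, ca.place) == (cb.manner, cb.place)
            if same_slot and ca.voicing == "voiceless" and cb.voicing == "voiced":
                pairs.append((a, b))
    return pairs

print(describe("b"))
for voiceless, voiced in voicing_pairs():
    print(voiceless, "/", voiced)
```

Consonants such as ch, sh, and h produce no pair here because the chart lists no voiced counterpart for them.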
