In This Article: Speech Production


  • Historical Studies
  • Animal Studies
  • Evolution and Development
  • Functional Magnetic Resonance and Positron Emission Tomography
  • Electroencephalography and Other Approaches
  • Theoretical Models
  • Speech Apparatus
  • Speech Disorders


Speech Production

by Eryk Walczak

Last reviewed: 22 February 2018. Last modified: 22 February 2018. DOI: 10.1093/obo/9780199772810-0217

Speech production is one of the most complex human activities, involving the coordination of numerous muscles and intricate cognitive processes. It is closely related to Articulatory Phonetics, Acoustic Phonetics, and Speech Perception, which all study different elements of language and belong to the broader field of Linguistics. Because of its interdisciplinary nature, speech production is usually studied on several levels: neurological, acoustic, motor, evolutionary, and developmental. Each level has its own literature, but most work on speech production touches on all of them. Much of the relevant literature is covered in the Speech Perception entry, on which this bibliography builds. This entry covers general speech production mechanisms and speech disorders. Speech production in second language learners and bilinguals, however, has special features, which are described in the separate bibliography on Cross-Language Speech Perception and Production. Speech produces sounds, and sounds are the object of study of Phonology.

As mentioned in the introduction, speech production tends to be described in relation to acoustics, speech perception, neuroscience, and linguistics. Because of this interdisciplinarity, few published textbooks focus exclusively on speech production; Guenther 2016 and Levelt 1993 are the exceptions. The former has a stronger focus on the neuroscientific underpinnings of speech. Auditory neuroscience is also covered extensively by Schnupp, et al. 2011 and in the comprehensive textbook Hickok and Small 2015. Rosen and Howell 2011 is a textbook on signal processing and acoustics, topics any speech scientist needs to understand. Levelt 2013 offers a historical approach to psycholinguistics that also covers speech research.

Guenther, F. H. 2016. Neural control of speech. Cambridge, MA: MIT Press.

This textbook provides an overview of neural processes responsible for speech production. Large sections describe speech motor control, especially the DIVA model (co-authored by Guenther). It includes extensive coverage of behavioral and neuroimaging studies of speech as well as speech disorders and ties them together with a unifying theoretical framework.

Hickok, G., and S. L. Small. 2015. Neurobiology of language . London: Academic Press.

This voluminous textbook edited by Hickok and Small covers a wide range of topics related to the neurobiology of language. It includes a section devoted to speaking, covering the neurobiology of speech production, a motor control perspective, neuroimaging studies, and aphasia.

Levelt, W. J. M. 1993. Speaking: From intention to articulation. Cambridge, MA: MIT Press.

Speaking is a seminal textbook, worth reading particularly for its detailed explanation of the author's speech model, which is part of his broader language model. The book is slightly dated, having been released in 1993, but chapters 8–12 remain especially relevant to readers interested in phonetic plans, articulation, and self-monitoring.

Levelt, W. J. M. 2013. A history of psycholinguistics: The pre-Chomskyan era . Oxford: Oxford University Press.

Levelt published another important book detailing the development of psycholinguistics. As its title suggests, it focuses on the early history of the discipline, so readers interested in historical research on speech will find an abundance of speech-related material in it. It covers a wide range of psycholinguistic specializations.

Rosen, S., and P. Howell. 2011. Signals and systems for speech and hearing. 2d ed. Bingley, UK: Emerald.

Rosen and Howell provide a low-level explanation of speech signals and systems. The book includes informative charts explaining the basic acoustic and signal processing concepts useful for understanding speech science.

Schnupp, J., I. Nelken, and A. King. 2011. Auditory neuroscience: Making sense of sound. Cambridge, MA: MIT Press.

A general introduction to speech concepts with a main focus on neuroscience. The textbook is linked with a website that provides demonstrations of the described phenomena.



2.1 How Humans Produce Speech

Phonetics studies human speech. Speech is produced by bringing air from the lungs to the larynx (respiration), where the vocal folds may be held open to allow the air to pass through or may vibrate to make a sound (phonation). The airflow from the lungs is then shaped by the articulators in the mouth and nose (articulation).


Video script.

The field of phonetics studies the sounds of human speech. When we study speech sounds we can consider them from two angles. Acoustic phonetics, in addition to being part of linguistics, is also a branch of physics. It's concerned with the physical, acoustic properties of the sound waves that we produce. We'll talk some about the acoustics of speech sounds, but we're primarily interested in articulatory phonetics, that is, how we humans use our bodies to produce speech sounds. Producing speech requires three mechanisms.

The first is a source of energy.  Anything that makes a sound needs a source of energy.  For human speech sounds, the air flowing from our lungs provides energy.

The second is a source of the sound:  air flowing from the lungs arrives at the larynx. Put your hand on the front of your throat and gently feel the bony part under your skin.  That’s the front of your larynx . It’s not actually made of bone; it’s cartilage and muscle. This picture shows what the larynx looks like from the front.

[Image: external view of the larynx]

This next picture is a view down a person’s throat.

[Image: the cartilages of the larynx, viewed from above]

What you see here is that the opening of the larynx can be covered by two triangle-shaped pieces of skin.  These are often called “vocal cords” but they’re not really like cords or strings.  A better name for them is vocal folds .

The opening between the vocal folds is called the glottis .

We can control our vocal folds to make a sound.  I want you to try this out so take a moment and close your door or make sure there’s no one around that you might disturb.

First I want you to say the word “uh-oh”. Now say it again, but stop half-way through, “Uh-”. When you do that, you’ve closed your vocal folds by bringing them together. This stops the air flowing through your vocal tract.  That little silence in the middle of “uh-oh” is called a glottal stop because the air is stopped completely when the vocal folds close off the glottis.

Now I want you to open your mouth and breathe out quietly, “haaaaaaah”. When you do this, your vocal folds are open and the air is passing freely through the glottis.

Now breathe out again and say “aaah”, as if the doctor is looking down your throat.  To make that “aaaah” sound, you’re holding your vocal folds close together and vibrating them rapidly.

When we speak, we make some sounds with vocal folds open, and some with vocal folds vibrating.  Put your hand on the front of your larynx again and make a long “SSSSS” sound.  Now switch and make a “ZZZZZ” sound. You can feel your larynx vibrate on “ZZZZZ” but not on “SSSSS”.  That’s because [s] is a voiceless sound, made with the vocal folds held open, and [z] is a voiced sound, where we vibrate the vocal folds.  Do it again and feel the difference between voiced and voiceless.

Now take your hand off your larynx and plug your ears and make the two sounds again with your ears plugged. You can hear the difference between voiceless and voiced sounds inside your head.

I said at the beginning that there are three crucial mechanisms involved in producing speech:

  • Energy comes from the air supplied by the lungs.
  • The vocal folds produce sound at the larynx.
  • The sound is then filtered, or shaped, by the articulators .

The oral cavity is the space in your mouth. The nasal cavity, obviously, is the space inside and behind your nose. And of course, we use our tongues, lips, teeth and jaws to articulate speech as well.  In the next unit, we’ll look in more detail at how we use our articulators.

So to sum up, the three mechanisms that we use to produce speech are:

  • respiration at the lungs,
  • phonation at the larynx, and
  • articulation in the mouth.

Essentials of Linguistics Copyright © 2018 by Catherine Anderson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.


9.2 The Standard Model of Speech Production

Speech production falls into three broad areas: conceptualization, formulation and articulation (Levelt, 1989). In conceptualization , we determine what to say. This is sometimes known as message-level processing. Then we need to formulate the concepts into linguistic forms. Formulation takes conceptual entities as input and connects them with the relevant words associated with them to build a syntactic, morphological, and phonological structure. This structure is phonetically encoded and articulated , resulting in speech.
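As a rough sketch, the three broad stages can be viewed as a pipeline from message to utterance. The function names and the toy lexicon below are illustrative assumptions of mine, not part of Levelt's model:

```python
# Toy sketch of the three broad stages of speech production.
# All names and the mini-lexicon are illustrative assumptions.

def conceptualize(goal):
    """Conceptualization: message-level processing that decides
    which concepts to express."""
    return {"predicate": "GREET", "addressee": "WORLD"}

def formulate(message):
    """Formulation: map concepts to words and build a linearized,
    phonologically encoded plan (flattened here to a word list)."""
    lexicon = {"GREET": "hello", "WORLD": "world"}
    return [lexicon[message["predicate"]], lexicon[message["addressee"]]]

def articulate(plan):
    """Articulation: phonetic encoding and overt production,
    stood in for here by joining the planned words."""
    return " ".join(plan)

utterance = articulate(formulate(conceptualize("greet the world")))
print(utterance)  # hello world
```

Each stage consumes the previous stage's output, mirroring the message-to-structure-to-speech flow described above.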

During conceptualization, we develop an intention and select relevant information from the internal (memory) or external (stimuli) environment to create an utterance. Very little is known about this level, as it is pre-verbal. Levelt (1989) divided this stage into macroplanning and microplanning. Macroplanning is thought to be the elaboration of a communicative goal into subgoals and the connection of these subgoals with the relevant information. Microplanning assigns the correct shape to these pieces of information and decides on the focus of the utterance.

Formulation is divided into lexicalization and syntactic planning. In lexicalization we select the relevant word-forms, and in syntactic planning we put these together into a sentence. In talking about word-forms, we need to consider the idea of lemmas. A lemma is the basic abstract conceptual form of a word, the basis for its other derivations. For example, break can be considered a lemma underlying forms such as break, breaks, broke, broken, and breaking. Lemma retrieval uses a conceptual structure to retrieve a lemma, which makes syntactic properties available for encoding (Kempen & Hoenkamp, 1987); these can specify parameters such as number, tense, and gender. During word-form encoding, the information connected to lemmas is used to access the morphemes and phonemes linked to the word. The evidence that these two processing levels, lemma retrieval and word-form encoding, exist comes from speech errors in which words exchange within the same syntactic categories: nouns exchange with nouns and verbs with verbs across different phrases. Bierwisch (1970), Garrett (1975, 1980), and Nooteboom (1967) provide some examples:

  • “… I left my briefcase in the cigar”
  • “What we want to do is train its tongue to move the cat”
  • “We completely forgot to add the list to the roof”
  • “As you reap, Roger, so shall you sow”

We see here not only that words exchange within syntactic categories, but also that the function words associated with the exchanges appear to be added after the exchange (as in "its" before "tongue" and "the" before "cat"). In contrast to entire words, which exchange across different phrases, segment exchanges usually occur within the same phrase and make no reference to syntactic categories. Garrett (1988) provides an example: "she is a real rack pat" instead of "she is a real pack rat." In such errors, the segments involved often share phonetic similarities or the same syllable position (Dell, 1984). This suggests that these segments operate within some frame, such as syllable structure. Stated more broadly, word exchanges are assumed to occur during lemma retrieval, and segment exchanges during word-form encoding.
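The idea that segment exchanges swap material within a syllable frame can be sketched as follows. The onset-splitting rule and the function names are simplifying assumptions of mine, not a model from the text:

```python
# Sketch of a segment-exchange ("spoonerism") error: the onsets of two
# words swap while the rest of each syllable frame stays in place.
# The vowel-based onset rule is a deliberate simplification.

VOWELS = set("aeiou")

def split_onset(word):
    """Split a word into its onset (initial consonants) and remainder."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i], word[i:]
    return word, ""

def segment_exchange(w1, w2):
    """Exchange the onsets of two words, as in 'pack rat' -> 'rack pat'."""
    on1, rest1 = split_onset(w1)
    on2, rest2 = split_onset(w2)
    return on2 + rest1, on1 + rest2

print(segment_exchange("pack", "rat"))  # ('rack', 'pat')
```

Because only the onset slot is exchanged, the error respects syllable structure, as the errors above do.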

Putting these basic elements together, Meyer (2000) introduced the 'Standard Model of Word-form Encoding' (see Figure 9.2) as a summation of previously proposed speech production models (Dell, 1986; Levelt et al., 1999; Shattuck-Hufnagel, 1979, 1983; Fromkin, 1971, 1973; Garrett, 1975, 1980). The model is not complete in itself but is a way of understanding the various levels assumed by most psycholinguistic models. It represents levels for morphemes, segments, and phonetic representations.

[Figure 9.2: The Standard Model of Word-form Encoding; image description below.]

Morpheme Level

We have already seen (in Chapter 3) that morphemes are the smallest units of meaning. A word can be made up of one or more morphemes. Speech errors involving morphemes affect the lemma level or the word-form level (Dell, 1986), as in:

  • “how many pies does it take to make an apple?” (Garrett, 1988)
  • “so the apple has less trees” (Garrett, 2001)
  • “I’d hear one if I knew it” (Garrett, 1980)
  • “… slice-ly thinn-ed” (Stemberger, 1985)

In the first example, the morpheme indicating plural number has remained in place while the morphemes for 'apple' and 'pie' exchanged; the same pattern appears in the last example. This suggests that the exchange occurred after the parameters for number were set, indicating that lemmas can switch independently of their morphological and phonological representations (which occur further down in speech production).
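The observation that lemmas exchange while the number parameter stays with its slot can be sketched as a toy lookup. The data layout and function names are illustrative assumptions, not part of any cited model:

```python
# Sketch: word-form encoding combines a lemma with the syntactic
# parameters of its slot. When lemmas exchange, each slot keeps its
# own number parameter, reproducing the "pies ... apple" error.

FORMS = {
    ("apple", "sg"): "apple", ("apple", "pl"): "apples",
    ("pie", "sg"): "pie", ("pie", "pl"): "pies",
}

def realize(lemma, number):
    """Word-form encoding: the number parameter belongs to the slot,
    not the lemma, so an exchanged lemma inherits the slot's number."""
    return FORMS[(lemma, number)]

def lemma_exchange(slots):
    """Swap the lemmas of two slots while each slot keeps its own
    syntactic parameters."""
    (l1, n1), (l2, n2) = slots
    return [(l2, n1), (l1, n2)]

intended = [("apple", "pl"), ("pie", "sg")]   # "... apples ... pie"
error = lemma_exchange(intended)              # lemmas swap, number stays
print([realize(l, n) for l, n in error])      # ['pies', 'apple']
```

The output matches Garrett's error: the plural surfaces on the first slot even though the lemma filling it has changed.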

Segment Level

While speech production models differ in their organisation and storage of segments, we will assume that segments have to be retrieved at some level of speech production. Between 60% and 90% of all speech errors involve segments (Boomer & Laver, 1968; Fromkin, 1971; Nooteboom, 1969; Shattuck-Hufnagel, 1983), while 10–30% of all speech errors involve segment sequences (Stemberger, 1983; Shattuck-Hufnagel, 1983). Reaction time experiments have also been employed to justify this level. Roelofs (1999) asked participants to learn a set of word pairs; the first word of each pair was then presented as a prompt for producing the second word. The test blocks were phonologically homogeneous or heterogeneous: in the homogeneous blocks the targets shared their onsets or their initial segments differed only in voicing, while in the heterogeneous blocks the initial segments contrasted in both voicing and place of articulation. He found priming effects in homogeneous blocks when the targets shared an initial segment, but not when all but one feature was shared, suggesting that whole phonological segments, rather than distinctive features, are represented at some level.

Phonetic Level

The segmental level we just discussed is based on phonemes. The standard understanding of speech is that there must also be a phonetic level that represents the actual articulated speech, as opposed to the stored representations of sound. We have already discussed this in Chapter 2 and will expand on it here. For example, in English the voiceless stops /p/, /t/, and /k/ each have two realizations: unaspirated [p], [t], [k] and aspirated [pʰ], [tʰ], [kʰ]. This can be seen in the words pit [pʰɪt] and lip [lɪp], where syllable-initial stops are aspirated as a rule. Pronouncing pit as *[pɪt] does not change the meaning but will sound odd to a native speaker. This shows that /p/ has one phonemic value but two phonetic values: [p] and [pʰ]. Speech production can thus be understood as moving from an abstract level to a concrete one. Having familiarized ourselves with the basic levels of speech production, we can now go on to see how they are realized in actual speech production models.
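A minimal sketch of the aspiration rule just described, treating phonetic encoding as a mapping from phonemes to phones. Representing syllable-initial positions as a set of indexes is my own simplification:

```python
# Sketch: English voiceless stops /p t k/ are aspirated in
# syllable-initial position. Phonetic encoding maps each phoneme
# to a phone, adding the aspiration diacritic where the rule applies.

VOICELESS_STOPS = {"p", "t", "k"}

def phonetic_form(phonemes, syllable_initial_indexes):
    """Map a phoneme string to a phone string, aspirating voiceless
    stops in syllable-initial position (e.g. /pɪt/ -> [pʰɪt])."""
    phones = []
    for i, seg in enumerate(phonemes):
        if seg in VOICELESS_STOPS and i in syllable_initial_indexes:
            phones.append(seg + "ʰ")
        else:
            phones.append(seg)
    return "".join(phones)

print(phonetic_form("pɪt", {0}))  # pʰɪt : syllable-initial /p/ aspirated
print(phonetic_form("lɪp", {0}))  # lɪp  : final /p/ stays unaspirated
```

One phoneme, two phones: the choice between [p] and [pʰ] is made only at this concrete phonetic level.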

Image descriptions

Figure 9.2 The Standard Model of Speech Production

The Standard Model of Word-form Encoding as described by Meyer (2000), illustrating five levels (semantic, lemma, morpheme, phoneme, and phonetic) using the example word “tiger”. From top to bottom, the levels are:

  • Semantic level: the conceptualization of “tiger” with an image of a tiger.
  • Lemma level: select the lemma of the word “tiger”.
  • Morpheme level: morphological encoding of the word tiger, t, i, g, e, r.
  • Phoneme level: phonological encoding of each morpheme in the word “tiger”.
  • Phonetic level: syllabification of the phonemes in the word “tiger”.

[Return to place in text (Figure 9.2)]
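The five levels in Figure 9.2 can be sketched as a toy pipeline for “tiger”. The stored entries, the phoneme transcription, and the syllabification below are illustrative assumptions standing in for the model's stored representations:

```python
# Toy pipeline over the five levels of the Standard Model of
# Word-form Encoding, using "tiger". All entries are assumptions.

LEMMAS = {"TIGER(concept)": "tiger"}                # semantic -> lemma
MORPHEMES = {"tiger": ["tiger"]}                    # lemma -> morphemes
PHONEMES = {"tiger": ["t", "aɪ", "g", "ə", "r"]}    # morpheme -> phonemes
SYLLABLES = {"tiger": ["taɪ", "gər"]}               # phonetic syllabification

def encode(concept):
    """Run a concept through lemma selection, morphological and
    phonological encoding, and phonetic syllabification."""
    lemma = LEMMAS[concept]
    morphemes = MORPHEMES[lemma]
    phonemes = [p for m in morphemes for p in PHONEMES[m]]
    syllables = [s for m in morphemes for s in SYLLABLES[m]]
    return {"lemma": lemma, "morphemes": morphemes,
            "phonemes": phonemes, "syllables": syllables}

result = encode("TIGER(concept)")
print(result["syllables"])  # ['taɪ', 'gər']
```

Each dictionary stands in for one stored level of representation, so the pipeline makes explicit which information becomes available at which step.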

Media Attributions

  • Figure 9.2 The Standard Model of Speech Production by Dinesh Ramoo, the author, is licensed under a  CC BY 4.0 licence .

Glossary

  • Conceptualization: the process of forming a concept or idea.
  • Formulation: the creation of the word form during speech production.
  • Articulation: the formation of speech.
  • Lexicalization: the process of developing a word for production.
  • Syntactic planning: the planning of word order in a sentence.
  • Lemma: the form of a word as it is presented at the head of an entry in a dictionary.

Psychology of Language Copyright © 2021 by Dinesh Ramoo is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.


Speech production

Speech production refers to the process of formulating and expressing spoken words or sounds. It involves coordinating muscles, such as those controlling breathing, vocal cords, tongue, and lips.

Related terms

Articulation disorders : Difficulties with pronouncing specific sounds or words due to issues with muscle coordination.

Phonemes : The smallest units of sound that can change meaning in a language (e.g., "bat" vs. "cat").

Apraxia of speech : A motor speech disorder characterized by difficulty planning and coordinating the movements necessary for clear speech.
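The "bat" vs. "cat" example above is a minimal pair: two words of equal length that differ in exactly one segment, which shows the differing segments are distinct phonemes. A minimal sketch, treating letters as stand-ins for segments:

```python
def is_minimal_pair(w1, w2):
    """True if two words have equal length and differ in exactly
    one segment (letters stand in for phonemes here)."""
    return len(w1) == len(w2) and sum(a != b for a, b in zip(w1, w2)) == 1

print(is_minimal_pair("bat", "cat"))   # True
print(is_minimal_pair("bat", "bats"))  # False (lengths differ)
```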

" Speech production " appears in:

Practice questions ( 2 ).

What counterargument could be made against the theory that damage to Broca's area exclusively affects speech production?

If an individual's Broca’s area - involved in speech production - is damaged, what intervention would likely be most helpful for enhancing the person's communication abilities?


© 2024 Fiveable Inc. All rights reserved.

  • Sports and Outdoor Recreation
  • Technology and Society
  • Travel and Holiday
  • Visual Culture
  • Browse content in Law
  • Arbitration
  • Browse content in Company and Commercial Law
  • Commercial Law
  • Company Law
  • Browse content in Comparative Law
  • Systems of Law
  • Competition Law
  • Browse content in Constitutional and Administrative Law
  • Government Powers
  • Judicial Review
  • Local Government Law
  • Military and Defence Law
  • Parliamentary and Legislative Practice
  • Construction Law
  • Contract Law
  • Browse content in Criminal Law
  • Criminal Procedure
  • Criminal Evidence Law
  • Sentencing and Punishment
  • Employment and Labour Law
  • Environment and Energy Law
  • Browse content in Financial Law
  • Banking Law
  • Insolvency Law
  • History of Law
  • Human Rights and Immigration
  • Intellectual Property Law
  • Browse content in International Law
  • Private International Law and Conflict of Laws
  • Public International Law
  • IT and Communications Law
  • Jurisprudence and Philosophy of Law
  • Law and Politics
  • Law and Society
  • Browse content in Legal System and Practice
  • Courts and Procedure
  • Legal Skills and Practice
  • Primary Sources of Law
  • Regulation of Legal Profession
  • Medical and Healthcare Law
  • Browse content in Policing
  • Criminal Investigation and Detection
  • Police and Security Services
  • Police Procedure and Law
  • Police Regional Planning
  • Browse content in Property Law
  • Personal Property Law
  • Study and Revision
  • Terrorism and National Security Law
  • Browse content in Trusts Law
  • Wills and Probate or Succession
  • Browse content in Medicine and Health
  • Browse content in Allied Health Professions
  • Arts Therapies
  • Clinical Science
  • Dietetics and Nutrition
  • Occupational Therapy
  • Operating Department Practice
  • Physiotherapy
  • Radiography
  • Speech and Language Therapy
  • Browse content in Anaesthetics
  • General Anaesthesia
  • Neuroanaesthesia
  • Clinical Neuroscience
  • Browse content in Clinical Medicine
  • Acute Medicine
  • Cardiovascular Medicine
  • Clinical Genetics
  • Clinical Pharmacology and Therapeutics
  • Dermatology
  • Endocrinology and Diabetes
  • Gastroenterology
  • Genito-urinary Medicine
  • Geriatric Medicine
  • Infectious Diseases
  • Medical Toxicology
  • Medical Oncology
  • Pain Medicine
  • Palliative Medicine
  • Rehabilitation Medicine
  • Respiratory Medicine and Pulmonology
  • Rheumatology
  • Sleep Medicine
  • Sports and Exercise Medicine
  • Community Medical Services
  • Critical Care
  • Emergency Medicine
  • Forensic Medicine
  • Haematology
  • History of Medicine
  • Browse content in Medical Skills
  • Clinical Skills
  • Communication Skills
  • Nursing Skills
  • Surgical Skills
  • Browse content in Medical Dentistry
  • Oral and Maxillofacial Surgery
  • Paediatric Dentistry
  • Restorative Dentistry and Orthodontics
  • Surgical Dentistry
  • Medical Ethics
  • Medical Statistics and Methodology
  • Browse content in Neurology
  • Clinical Neurophysiology
  • Neuropathology
  • Nursing Studies
  • Browse content in Obstetrics and Gynaecology
  • Gynaecology
  • Occupational Medicine
  • Ophthalmology
  • Otolaryngology (ENT)
  • Browse content in Paediatrics
  • Neonatology
  • Browse content in Pathology
  • Chemical Pathology
  • Clinical Cytogenetics and Molecular Genetics
  • Histopathology
  • Medical Microbiology and Virology
  • Patient Education and Information
  • Browse content in Pharmacology
  • Psychopharmacology
  • Browse content in Popular Health
  • Caring for Others
  • Complementary and Alternative Medicine
  • Self-help and Personal Development
  • Browse content in Preclinical Medicine
  • Cell Biology
  • Molecular Biology and Genetics
  • Reproduction, Growth and Development
  • Primary Care
  • Professional Development in Medicine
  • Browse content in Psychiatry
  • Addiction Medicine
  • Child and Adolescent Psychiatry
  • Forensic Psychiatry
  • Learning Disabilities
  • Old Age Psychiatry
  • Psychotherapy
  • Browse content in Public Health and Epidemiology
  • Epidemiology
  • Public Health
  • Browse content in Radiology
  • Clinical Radiology
  • Interventional Radiology
  • Nuclear Medicine
  • Radiation Oncology
  • Reproductive Medicine
  • Browse content in Surgery
  • Cardiothoracic Surgery
  • Gastro-intestinal and Colorectal Surgery
  • General Surgery
  • Neurosurgery
  • Paediatric Surgery
  • Peri-operative Care
  • Plastic and Reconstructive Surgery
  • Surgical Oncology
  • Transplant Surgery
  • Trauma and Orthopaedic Surgery
  • Vascular Surgery
  • Browse content in Science and Mathematics
  • Browse content in Biological Sciences
  • Aquatic Biology
  • Biochemistry
  • Bioinformatics and Computational Biology
  • Developmental Biology
  • Ecology and Conservation
  • Evolutionary Biology
  • Genetics and Genomics
  • Microbiology
  • Molecular and Cell Biology
  • Natural History
  • Plant Sciences and Forestry
  • Research Methods in Life Sciences
  • Structural Biology
  • Systems Biology
  • Zoology and Animal Sciences
  • Browse content in Chemistry
  • Analytical Chemistry
  • Computational Chemistry
  • Crystallography
  • Environmental Chemistry
  • Industrial Chemistry
  • Inorganic Chemistry
  • Materials Chemistry
  • Medicinal Chemistry
  • Mineralogy and Gems
  • Organic Chemistry
  • Physical Chemistry
  • Polymer Chemistry
  • Study and Communication Skills in Chemistry
  • Theoretical Chemistry
  • Browse content in Computer Science
  • Artificial Intelligence
  • Computer Architecture and Logic Design
  • Game Studies
  • Human-Computer Interaction
  • Mathematical Theory of Computation
  • Programming Languages
  • Software Engineering
  • Systems Analysis and Design
  • Virtual Reality
  • Browse content in Computing
  • Business Applications
  • Computer Security
  • Computer Games
  • Computer Networking and Communications
  • Digital Lifestyle
  • Graphical and Digital Media Applications
  • Operating Systems
  • Browse content in Earth Sciences and Geography
  • Atmospheric Sciences
  • Environmental Geography
  • Geology and the Lithosphere
  • Maps and Map-making
  • Meteorology and Climatology
  • Oceanography and Hydrology
  • Palaeontology
  • Physical Geography and Topography
  • Regional Geography
  • Soil Science
  • Urban Geography
  • Browse content in Engineering and Technology
  • Agriculture and Farming
  • Biological Engineering
  • Civil Engineering, Surveying, and Building
  • Electronics and Communications Engineering
  • Energy Technology
  • Engineering (General)
  • Environmental Science, Engineering, and Technology
  • History of Engineering and Technology
  • Mechanical Engineering and Materials
  • Technology of Industrial Chemistry
  • Transport Technology and Trades
  • Browse content in Environmental Science
  • Applied Ecology (Environmental Science)
  • Conservation of the Environment (Environmental Science)
  • Environmental Sustainability
  • Environmentalist Thought and Ideology (Environmental Science)
  • Management of Land and Natural Resources (Environmental Science)
  • Natural Disasters (Environmental Science)
  • Nuclear Issues (Environmental Science)
  • Pollution and Threats to the Environment (Environmental Science)
  • Social Impact of Environmental Issues (Environmental Science)
  • History of Science and Technology
  • Browse content in Materials Science
  • Ceramics and Glasses
  • Composite Materials
  • Metals, Alloying, and Corrosion
  • Nanotechnology
  • Browse content in Mathematics
  • Applied Mathematics
  • Biomathematics and Statistics
  • History of Mathematics
  • Mathematical Education
  • Mathematical Finance
  • Mathematical Analysis
  • Numerical and Computational Mathematics
  • Probability and Statistics
  • Pure Mathematics
  • Browse content in Neuroscience
  • Cognition and Behavioural Neuroscience
  • Development of the Nervous System
  • Disorders of the Nervous System
  • History of Neuroscience
  • Invertebrate Neurobiology
  • Molecular and Cellular Systems
  • Neuroendocrinology and Autonomic Nervous System
  • Neuroscientific Techniques
  • Sensory and Motor Systems
  • Browse content in Physics
  • Astronomy and Astrophysics
  • Atomic, Molecular, and Optical Physics
  • Biological and Medical Physics
  • Classical Mechanics
  • Computational Physics
  • Condensed Matter Physics
  • Electromagnetism, Optics, and Acoustics
  • History of Physics
  • Mathematical and Statistical Physics
  • Measurement Science
  • Nuclear Physics
  • Particles and Fields
  • Plasma Physics
  • Quantum Physics
  • Relativity and Gravitation
  • Semiconductor and Mesoscopic Physics
  • Browse content in Psychology
  • Affective Sciences
  • Clinical Psychology
  • Cognitive Psychology
  • Cognitive Neuroscience
  • Criminal and Forensic Psychology
  • Developmental Psychology
  • Educational Psychology
  • Evolutionary Psychology
  • Health Psychology
  • History and Systems in Psychology
  • Music Psychology
  • Neuropsychology
  • Organizational Psychology
  • Psychological Assessment and Testing
  • Psychology of Human-Technology Interaction
  • Psychology Professional Development and Training
  • Research Methods in Psychology
  • Social Psychology
  • Browse content in Social Sciences
  • Browse content in Anthropology
  • Anthropology of Religion
  • Human Evolution
  • Medical Anthropology
  • Physical Anthropology
  • Regional Anthropology
  • Social and Cultural Anthropology
  • Theory and Practice of Anthropology
  • Browse content in Business and Management
  • Business Ethics
  • Business Strategy
  • Business History
  • Business and Technology
  • Business and Government
  • Business and the Environment
  • Comparative Management
  • Corporate Governance
  • Corporate Social Responsibility
  • Entrepreneurship
  • Health Management
  • Human Resource Management
  • Industrial and Employment Relations
  • Industry Studies
  • Information and Communication Technologies
  • International Business
  • Knowledge Management
  • Management and Management Techniques
  • Operations Management
  • Organizational Theory and Behaviour
  • Pensions and Pension Management
  • Public and Nonprofit Management
  • Strategic Management
  • Supply Chain Management
  • Browse content in Criminology and Criminal Justice
  • Criminal Justice
  • Criminology
  • Forms of Crime
  • International and Comparative Criminology
  • Youth Violence and Juvenile Justice
  • Development Studies
  • Browse content in Economics
  • Agricultural, Environmental, and Natural Resource Economics
  • Asian Economics
  • Behavioural Finance
  • Behavioural Economics and Neuroeconomics
  • Econometrics and Mathematical Economics
  • Economic History
  • Economic Systems
  • Economic Methodology
  • Economic Development and Growth
  • Financial Markets
  • Financial Institutions and Services
  • General Economics and Teaching
  • Health, Education, and Welfare
  • History of Economic Thought
  • International Economics
  • Labour and Demographic Economics
  • Law and Economics
  • Macroeconomics and Monetary Economics
  • Microeconomics
  • Public Economics
  • Urban, Rural, and Regional Economics
  • Welfare Economics
  • Browse content in Education
  • Adult Education and Continuous Learning
  • Care and Counselling of Students
  • Early Childhood and Elementary Education
  • Educational Equipment and Technology
  • Educational Strategies and Policy
  • Higher and Further Education
  • Organization and Management of Education
  • Philosophy and Theory of Education
  • Schools Studies
  • Secondary Education
  • Teaching of a Specific Subject
  • Teaching of Specific Groups and Special Educational Needs
  • Teaching Skills and Techniques
  • Browse content in Environment
  • Applied Ecology (Social Science)
  • Climate Change
  • Conservation of the Environment (Social Science)
  • Environmentalist Thought and Ideology (Social Science)
  • Natural Disasters (Environment)
  • Social Impact of Environmental Issues (Social Science)
  • Browse content in Human Geography
  • Cultural Geography
  • Economic Geography
  • Political Geography
  • Browse content in Interdisciplinary Studies
  • Communication Studies
  • Museums, Libraries, and Information Sciences
  • Browse content in Politics
  • African Politics
  • Asian Politics
  • Chinese Politics
  • Comparative Politics
  • Conflict Politics
  • Elections and Electoral Studies
  • Environmental Politics
  • European Union
  • Foreign Policy
  • Gender and Politics
  • Human Rights and Politics
  • Indian Politics
  • International Relations
  • International Organization (Politics)
  • International Political Economy
  • Irish Politics
  • Latin American Politics
  • Middle Eastern Politics
  • Political Behaviour
  • Political Economy
  • Political Institutions
  • Political Methodology
  • Political Communication
  • Political Philosophy
  • Political Sociology
  • Political Theory
  • Politics and Law
  • Public Policy
  • Public Administration
  • Quantitative Political Methodology
  • Regional Political Studies
  • Russian Politics
  • Security Studies
  • State and Local Government
  • UK Politics
  • US Politics
  • Browse content in Regional and Area Studies
  • African Studies
  • Asian Studies
  • East Asian Studies
  • Japanese Studies
  • Latin American Studies
  • Middle Eastern Studies
  • Native American Studies
  • Scottish Studies
  • Browse content in Research and Information
  • Research Methods
  • Browse content in Social Work
  • Addictions and Substance Misuse
  • Adoption and Fostering
  • Care of the Elderly
  • Child and Adolescent Social Work
  • Couple and Family Social Work
  • Developmental and Physical Disabilities Social Work
  • Direct Practice and Clinical Social Work
  • Emergency Services
  • Human Behaviour and the Social Environment
  • International and Global Issues in Social Work
  • Mental and Behavioural Health
  • Social Justice and Human Rights
  • Social Policy and Advocacy
  • Social Work and Crime and Justice
  • Social Work Macro Practice
  • Social Work Practice Settings
  • Social Work Research and Evidence-based Practice
  • Welfare and Benefit Systems
  • Browse content in Sociology
  • Childhood Studies
  • Community Development
  • Comparative and Historical Sociology
  • Economic Sociology
  • Gender and Sexuality
  • Gerontology and Ageing
  • Health, Illness, and Medicine
  • Marriage and the Family
  • Migration Studies
  • Occupations, Professions, and Work
  • Organizations
  • Population and Demography
  • Race and Ethnicity
  • Social Theory
  • Social Movements and Social Change
  • Social Research and Statistics
  • Social Stratification, Inequality, and Mobility
  • Sociology of Religion
  • Sociology of Education
  • Sport and Leisure
  • Urban and Rural Studies
  • Browse content in Warfare and Defence
  • Defence Strategy, Planning, and Research
  • Land Forces and Warfare
  • Military Administration
  • Military Life and Institutions
  • Naval Forces and Warfare
  • Other Warfare and Defence Issues
  • Peace Studies and Conflict Resolution
  • Weapons and Equipment

The Oxford Handbook of Language Evolution


22 The anatomical and physiological basis of human speech production: adaptations and exaptations

Ann MacLarnon is Director of the Centre for Research in Evolutionary Anthropology at Roehampton University. She has worked on a wide variety of areas in primatology and palaeoanthropology, with an emphasis on comparative approaches. Research topics include reproductive life histories and physiology, stress endocrinology and behaviour, and aspects of comparative morphology including the brain and spinal cord. Work on this last area led to the unexpected discovery that humans evolved increased breathing control for speech.

  • Published: 18 September 2012

This article provides details on human speech production involving a range of physical features, which may have evolved as specific adaptations for this purpose. All mammalian vocalizations are produced similarly, involving features that primarily evolved for respiration or ingestion. Sounds are produced using the flow of air inhaled through the nose or mouth, or expelled from the lungs. Unvoiced sounds are produced without the involvement of the vocal folds of the larynx. Mammalian vocalizations require coordination of the articulation of the supralaryngeal vocal tract with the flow of air, in or out. For phonated sounds, an extensive series of harmonics above the fundamental frequency (F0) is produced by resonance. These series are filtered by the shape and size of the vocal tract, resulting in the retention of some parts of the series, and diminution or deletion of others, in the emitted vocalization. Human sound sequences are also much more rapid than those of non-human primates, except for very simple sequences such as repetitive trills or quavers. Human vocal tract articulation is much faster, and humans are able to produce multiple sounds on a single breath movement, inhalation or exhalation. The unique form of the tongue within the vocal tract in humans is considered to be a key factor in the speech-related flexibility of the supralaryngeal vocal tract.

The major medium for the transmission of human language is vocalization, or speech. Humans use rapid, highly variable, extended sound sequences to transmit the complex information content of language. Speech is a very efficient communication medium: it costs little energetically, it does not require visual contact with the intended receiver(s), and it can be carried out simultaneously with separate manual and other tasks. Although the vocal communication systems of some birds and other mammals, such as cetaceans, may resemble important aspects of human speech, none is as complex, nor as capable of transmitting information, as human speech‐propelled language. Certainly, our closest relatives, the apes and other primates, demonstrate nothing close to this unique human form of communication. Human speech production involves a range of physical features which may have evolved as specific adaptations for this purpose; alternatively, they may have evolved as exaptations, commandeering existing features for a new function. Combining knowledge of the anatomical and physiological basis of human speech production, comparisons with other primate species, and information from the human fossil record, it is possible to form an outline framework for the evolution of human speech capabilities, the features concerned, the likely timing and sequence in which they arose, and the possible combination of adaptations and exaptations involved—the what, when, and why of speech evolution.

All mammalian vocalizations are produced similarly, involving features that primarily evolved for respiration or ingestion. Sounds are produced using the flow of air inhaled through the nose or mouth, or expelled from the lungs. Unvoiced sounds are produced without the involvement of the vocal folds of the larynx. They entail pressurizing the airflow by temporary restriction of the vocal tract at some point(s) along its length. The turbulence of the released air produces either an aperiodic noise, such as a burst or hiss, or, under special conditions, it may produce a periodic sound such as a whistle. For voiced or phonated sounds, the vocal folds at the glottis of the larynx (a structure which first evolved at the top of the trachea to prevent water entering the lungs in aquatic creatures) are held taut, and the air flow needs to be powerful enough to cause the vocal folds to vibrate. This cuts the air flow into a chain of ‘air puffs’, or a periodic sound wave, perceived by the ear as sound at a pitch equivalent to the air puff frequency; this is known as the fundamental frequency, or F0, and it varies with the length and tension of the vocal folds. Voiced sounds may be modified further by so‐called gestural articulations of the supralaryngeal vocal tract produced by positions or movements of articulatory structures such as the tongue and lips, both primarily involved in ingestion. Mammalian vocalizations therefore require coordination of the articulation of the supralaryngeal vocal tract with the flow of air, in or out. For phonated sounds, an extensive series of harmonics above F0 is produced by resonance. These series are filtered by the shape and size of the vocal tract, resulting in the retention of some parts of the series, and diminution or deletion of others, in the emitted vocalization. Unvoiced vocalizations generally have less structured acoustic features and broad bands of emitted frequencies.
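The filtering stage described above is often illustrated with a simple quarter‐wavelength resonator: an idealized uniform vocal tract, closed at the glottis and open at the lips, resonates at odd multiples of c/4L. A minimal sketch of that textbook approximation (the 17 cm tract length and 350 m/s speed of sound are illustrative assumptions, not figures from this chapter):

```python
# Resonances of an idealized uniform vocal tract, modelled as a tube
# closed at the glottis and open at the lips. Such a tube resonates
# at odd multiples of c / 4L (the quarter-wavelength approximation).

def uniform_tube_formants(length_m=0.17, c=350.0, n=3):
    """Return the first n resonant frequencies (Hz) of a uniform tube.

    length_m: vocal tract length in metres (0.17 m is a common
              illustrative value for an adult male -- an assumption here).
    c: approximate speed of sound in warm, moist air (m/s).
    """
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

for i, f in enumerate(uniform_tube_formants(), start=1):
    print(f"F{i} = {f:.0f} Hz")
```

With these assumed values the first three resonances fall near 515, 1544, and 2574 Hz; real vowels differ because the tract is not uniform, which is exactly the filtering the text describes.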
What distinguishes human speech from the vocalizations of other species is the extraordinary range of acoustic variation involved, produced by an enormous variety of gestural articulations of the vocal tract, together with intricate manipulations of the larynx and other respiratory structures. Rather than utilizing the air flow of both inspirations and expirations, human speech is also produced almost entirely on expired air, released in extended, highly controlled expirations.

More than 100 different sound units or phonemes found in human languages are recognized in the International Phonetic Alphabet, together with a further array of major variant types. Each sound unit is acoustically distinctive (Fant 1960), as depicted in spectrograms, in which emitted sound frequencies and their amplitudes are plotted against time. Phonemes vary with different relative timing of the start of phonation and of vocal tract constriction, different speeds of movement and combinations of vocal tract articulators, different intonation changes produced in the larynx or by the lungs; sounds may be breathy, creaky, nasal, or aspirated, and so the list goes on. Different languages use different subsets of phonemes.

Phonemes comprise consonants and vowels, which form the building blocks of syllables. Consonants, voiced or unvoiced, involve the complete or near complete obstruction and release of airflow through the vocal tract, which produces characteristic spectrum profiles or envelopes of sound frequencies emitted over time (Fant 1960). Vowels always involve phonation, and filtering through different vocal tract constrictions produced by gestures of the tongue, without complete obstruction. They are distinguished by their combinations of formants (Fant 1960), which are sharp peaks in the frequency ranges above F0 emitted following filtration, known as F1, F2, etc.; typically, different vowels within a language can be characterized by the first two formants. The perception of vowels is not dependent on their absolute formant frequencies, but rather their relative values, normalized by the listener according to the typical frequency levels of a particular individual speaker, be they generally higher or lower pitched, the differences resulting from a shorter or longer vocal tract.
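The idea that vowels are identified by relative rather than absolute formant values can be sketched as nearest‐prototype classification with a per‐speaker scale factor. The prototype (F1, F2) pairs below are rough illustrative ballpark figures, not values from this chapter:

```python
# Nearest-prototype vowel classification from the first two formants,
# with a crude per-speaker normalization. A shorter vocal tract raises
# all formants by roughly a common factor, so dividing by that factor
# approximates what a listener does when normalizing.

PROTOTYPES = {            # (F1, F2) in Hz -- illustrative adult values
    "i": (270, 2290),
    "a": (730, 1090),
    "u": (300, 870),
}

def classify(f1, f2, scale=1.0):
    """Return the prototype vowel closest to the normalized formants."""
    nf1, nf2 = f1 / scale, f2 / scale
    return min(PROTOTYPES,
               key=lambda v: (nf1 - PROTOTYPES[v][0]) ** 2
                           + (nf2 - PROTOTYPES[v][1]) ** 2)

# A smaller speaker's [i], with formants ~20% higher than the adult
# prototypes, is still recognized once the scale factor is known.
print(classify(324, 2748, scale=1.2))   # -> i
```

The scale factor stands in for the normalization the text attributes to listeners; in practice it would be estimated from reliable anchor vowels such as [i].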

The range and variation of human speech sounds, the different subsets utilized in hundreds of languages, and how they are produced anatomically and physiologically, have been superbly documented in an extraordinary compendium by Ladefoged and Maddieson (1996). For consonants, they describe how nine independent, moveable, soft tissue articulators can be distinguished: lips; tongue—tip, blade, underblade, front, back, root; epiglottis; and glottis. These move to constrict or block the vocal tract at 11 main articulation points, or more accurately zones: lips, incisor teeth, different points along the palate, the velum or soft palate, and the uvula (the skin flap hanging from the velum), the pharynx or throat, the epiglottis, and the glottis. Together these produce 17 different categories of articulatory gestures, whose precise formation varies in different languages and dialects. Consonants are further differentiated into stops, nasals, fricatives, laterals, rhotics, and clicks, according to whether they involve, respectively, momentary complete stoppage of airflow by vocal tract obstruction, mouth closure and nasal‐only airflow, a turbulent airstream, partial midline closure of the tract with lateral airflow around the obstruction, tongue trills and related movements, or two points of vocal tract closure trapping air with subsequent articulator movement increasing the trapped air volume and hence decreasing pressure prior to its sudden release. Vowel production involves subtle tongue‐shaping in the oral or pharyngeal cavities, resulting in different points of vocal tract constriction, and hence different formant combinations.

It became evident early in attempts to teach apes to speak that our closest living relatives are not capable of the intricate articulatory manoeuvres of the upper respiratory tract which underlie the enormous range of human speech sounds. Recent evidence from Diana monkeys suggests that vocal tract articulation in non‐human primates may not be as severely limited as previously thought (Riede et al. 2005). However, it seems improbable that capabilities so useful to human communication would not have been exploited more fully if they existed in other species, and it is therefore likely that the human capacity for the production of highly varied speech sounds is unique among primates.

Human sound sequences are also much more rapid than those of non‐human primates, except for very simple sequences such as repetitive trills or quavers. Human vocal tract articulation is much faster, and humans are able to produce multiple sounds on a single breath movement, inhalation or exhalation. Most non‐human sound sequences, such as chimpanzee pant‐hoots and other vocalizations (Marler and Tenaza 1977), are produced on successive inspirations and expirations. Commonly each component sound of such sequences (e.g. the pant, or the hoot of the chimpanzee call) can only be produced on either an inhalation or an exhalation, which also restricts sound sequence combinations.

The laryngeal air sacs present in some non‐human primate species enable them to produce slightly more complex sound sequences on single breath movements, either through additional breath movements in and out of the sacs, or by vibration of the vocal lip at the opening of the sacs into the larynx (e.g. the bitonal scream of siamangs; Haimoff 1983). Humans do not possess air sacs, and instead produce complex sound sequences by the intricate manipulation of airflow within individual exhalations, freed much more than any non‐human primate from the restrictions of vocalizations tied to breath movements (Hewitt et al. 2002). Overall, humans are able to produce sound sequences of up to about 30 sound units per second (P. Lieberman et al. 1992). Maximum sound production rates for non‐human primates are typically only 2–3 per second, extending to 5 per second with the involvement of air sacs (MacLarnon and Hewitt 1999).

Human speech also demonstrates further flexibility through an enhanced ability to control breathing, the airflow itself, compared with non‐human primates (MacLarnon and Hewitt 1999, 2004). First, humans speak on very extended exhalations, interspersed with quick inhalations, compared with much more even breathing cycles during quiet breathing; non‐human primates appear not to be able to distort their breathing cycles so markedly. During normal speech, humans typically utilize exhalations of 4–5 seconds (Hoit et al. 1994), extending up to more than 12 seconds (Winkworth et al. 1995), whereas the longest calls given on single breath movements in non‐human primates are only about 5 seconds (MacLarnon and Hewitt 1999). Calibrating these measures, taking into account the faster quiet breathing rates of smaller animals, the maximum duration of human speech exhalations is more than 7 times that during quiet breathing. In non‐human primates, the normal maximum duration of exhalations during vocalization is only 2–3 times that during quiet breathing. The exceptions to this are species with air sacs, such as howler monkeys and gibbons, which can extend exhalations to 4–5‐fold their duration during quiet breathing. Again, humans do not possess air sacs, an apparent alternative to control of pulmonary air release for extending call exhalation length, though one that does not enable the very subtle control of respiratory airflow of human speech (Hewitt et al. 2002).
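The calibration described above can be reconstructed with simple arithmetic. In this sketch the speech figure comes from the text, while the quiet‐breathing exhalation of roughly 1.6 seconds is an illustrative assumption chosen only to show how the ratio is formed:

```python
# Rough reconstruction of the exhalation-ratio calibration.
# speech_exhale_s is the chapter's figure for extended speech
# exhalations; quiet_exhale_s is an assumed, illustrative value
# for a quiet-breathing exhalation, not a figure from the chapter.

speech_exhale_s = 12.0   # extended speech exhalation (chapter figure)
quiet_exhale_s = 1.6     # assumed quiet-breathing exhalation

ratio = speech_exhale_s / quiet_exhale_s
print(f"speech exhalation is {ratio:.1f}x the quiet-breathing one")
```

With these numbers the ratio comes out at 7.5, consistent with the "more than 7 times" reported; the corresponding calculation for non‐human primates, using their faster quiet breathing, yields the smaller 2–3‐fold ratios cited.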

22.1 Sound articulation

The unique form of the tongue within the vocal tract in humans is considered to be a key factor in the speech‐related flexibility of our supralaryngeal vocal tract (P. Lieberman 1984). In mammals, the tongue is typically a flat muscular structure lying largely within the oral cavity, anchored posteriorly by its attachment to the hyoid bone, which lies just below oral level in the pharynx, immediately above the larynx. The primary function of the tongue is to move food around the mouth for mastication, and posteriorly for swallowing. In humans, however, the tongue is a curved structure, lying part horizontally in the oral cavity and part vertically down an extended pharynx, where it attaches to a much lower hyoid, just above a descended larynx. The horizontal (oral) and vertical (pharyngeal) portions of the human supralaryngeal tract (SVTh and SVTv) are equal in length, compared with other species in which SVTh is substantially longer. Largely because of its curvature, movement of the human tongue, together with jaw movements, can vary the cross‐sectional area of each of the two tubes of our vocal tract independently by a factor of approximately ten, providing a very broad range of articulatory gestures, and very variable resultant formants of emitted sound. The 1:1 ratio of SVTh:SVTv, with a sharp bend between the two, is notably important for the production of three vowels, designated phonetically [i], [u], and [a]. These vowels are particularly easily distinguished, with very low perceptual error rates, by their F1, F2 combinations, which lie at the outer limits of the acoustic vowel space, and [i], followed by [u], is the most reliable and commonly used sound unit for vocal tract normalization. The tongue positions for production of the three vowels utilize the angle at the midpoint of the human vocal tract to produce abrupt discontinuities in the cross‐sectional areas of the tube.
Because the angle is sharp, the articulatory gestures involved do not have to be performed with particular accuracy for consistent, distinctive acoustic results, making these vowels marked examples of the quantal nature of human speech sounds (Stevens 1972 ). Perhaps consequently, they are the most common vowels in the world's languages (Ladefoged and Maddieson 1996 ).
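The acoustic consequences of tract length can be roughly illustrated with the standard uniform-tube ("quarter-wavelength") approximation from acoustic phonetics. This is an illustrative aside, not part of the original argument, and the tract lengths used are assumed round figures: a tube closed at the glottis and open at the lips resonates at odd multiples of c/4L.

```python
# Uniform-tube ("quarter-wavelength") approximation of vocal tract resonances:
# a tube closed at one end (glottis) and open at the other (lips) resonates
# at odd multiples of c / 4L. A deliberate idealization; it omits the
# two-tube configurations that produce the quantal vowels discussed above.

SPEED_OF_SOUND = 35_000  # cm/s in warm, moist air

def tube_formants(length_cm, n_formants=3):
    """Resonant frequencies (Hz) of a uniform tube of the given length."""
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * length_cm)
            for n in range(1, n_formants + 1)]

# A ~17 cm adult male vocal tract in a neutral (schwa-like) configuration:
print(tube_formants(17.0))   # roughly 515, 1544, 2574 Hz
# A shorter tract (e.g. a reduced SVT-H) shifts every resonance upward:
print(tube_formants(14.5))
```

Shortening the tube raises all the resonances, which is one reason vocal tract length and proportions matter for formant patterns; the quantal [i], [u], [a] effects arise from abrupt cross-sectional discontinuities that this uniform model deliberately leaves out.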

Humans are not completely unique in having a descended larynx; species including dogs, goats, pigs, and tamarins lower the larynx during loud calls (Fitch 2001b). Several deer species have a permanently lowered larynx, which may temporarily be lowered further during male roars (Fitch and Reby 2001); large cats are apparently similar (Weissengruber et al. 2002). However, laryngeal descent is rarely accompanied by descent of the hyoid; hence the tongue remains horizontal in the oral cavity, and cannot act as a pharyngeal articulator (P. Lieberman 2007). Temporary laryngeal descent is also much less disruptive of other functions. In humans, because of marked, permanent laryngeal descent, simple contact between the epiglottis and velum is no longer possible, disrupting the normal mammalian separation of the respiratory and digestive tracts during swallowing, and increasing the risk of choking. Permanent laryngeal descent is thus a very different evolutionary development. Nishimura et al. (2006) have demonstrated that the larynx does descend to some extent during development in chimpanzees, followed by hyoidal descent. However, only humans have evolved permanent, major laryngeal descent, with associated hyoidal descent, resulting in a curved tongue and a two-tube vocal tract with 1:1 proportions. It is not laryngeal descent per se that is crucial to human speech capabilities, but rather a suite of factors in the shape and proportions of the supralaryngeal vocal tract and tongue (P. Lieberman 2007).

Considerable efforts have been made to determine when the two-tube vocal tract evolved in our ancestors, using indirect means, as its soft-tissue structures do not fossilize. Reconstruction of the fossil hominin tract was first attempted by Philip Lieberman and Crelin (1971), using basicranial and mandibular characteristics, followed by Laitman and colleagues (e.g. 1979), who used the basicranial angle, or flexion of the skull base. However, Daniel Lieberman and McCarthy (1999) demonstrated, using radiographic series, that human laryngeal descent is not linked ontogenetically to the development of basicranial flexion. Reconstruction of the supralaryngeal tract is therefore not possible from basicranial form, and much previous work on the speech articulation capabilities of fossil hominins was consequently flawed, as P. Lieberman (2007) has fully accepted. In addition, D. Lieberman et al. (2001) showed that during postnatal descent of the hyoid and larynx in humans, the relative vertical positions of the hyoid, mandible, hard palate, and larynx are held more or less constant. However, the ratio SVT-H:SVT-V changes during development, as a result of differential growth patterns of the total oral and pharyngeal lengths, and only reaches 1:1 at about 6–8 years of age. Together these results indicate that the descent of the hyolaryngeal structures is primarily constrained to maintain muscular function in relation to mandibular movement for swallowing; speech-related factors are not maximized until well into childhood, matching the gradual ontogenetic development of acoustically accurate speech production (P. Lieberman 1980). Various possible exaptive explanations for why humans evolved their unique vocal tract configuration have been proposed.
For example, obligate bipedalism required a more forward position of the spine under the skull, possibly reducing the space available in the upper throat, so squeezing the hyoid and larynx down the pharynx; increased carnivory in early Homo was associated with reduced jaw size and reduced oral cavity length, possibly requiring a compensatory increase in pharyngeal length (Negus 1949 ; Aiello 1996 ).

Recently, D. Lieberman and colleagues (e.g. 2002) have produced substantial new evidence on the integrated evolution of many modern human cranial features, providing a more comprehensive basis for exploring the evolution of the human vocal tract. They showed that a small number of developmental shifts distinguish modern human crania from those of our predecessors, including two (a more flexed basicranium and a reduction in face size) which result in a shortening of SVT-H, contributing to the attainment of an SVT-H:SVT-V ratio of 1:1. D. Lieberman (2008) suggested possible adaptational bases for these shifts: temporal lobe expansion for enhanced cognitive processing, including language, increasing basicranial flexion; increased meat consumption and technologically enhanced food processing, including cooking, resulting in facial reduction; endurance running, building on obligate bipedalism, involving facial reduction for improved head stabilization; and direct selection for speech capabilities, driving a decrease in oral cavity length, involving facial reduction and/or basicranial flexion, to produce a 1:1 SVT-H:SVT-V ratio. In other words, a suite of factors may have affected SVT-H, and hence played a part in the evolution of the modern human capability for quantal speech. The other component in the evolution of a 1:1 ratio, an increase in SVT-V, may have been directly selected for enhanced speech capabilities, counterbalancing the negative impact of increased choking risk. However, this would not have been advantageous prior to a substantial decrease in SVT-H, because a long SVT-V would require laryngeal descent into the thorax, producing muscular orientations that would compromise functional swallowing.
Rather than major, coordinated shifts in both vocal tract parameters occurring with the evolution of modern humans, I think it more probable that other factors, earlier in human evolution, produced descent of the hyolaryngeal complex and an increase in SVT-V. From this exaptive basis, the final reduction in SVT-H, with the evolution of modern human cranial shape, could be adaptive for quantal speech. As outlined above, maintenance of functional swallowing is central to human developmental hyolaryngeal descent, which only becomes advantageous for speech articulation later in childhood. This, too, is congruent with the suggestion that hyolaryngeal descent resulted from earlier evolutionary change. The most likely candidate is the evolution of bipedalism, involving reconfiguration of neck structures, in Homo erectus. Jaw length also reduced in this species, associated with changing diet and food processing. The use of more complex vocalizations for communication may have begun to increase at the same time, alongside brain size and presumed social complexity (Aiello 1996).

As well as its curved shape, other features of the tongue have also been explored for their potential contribution to human speech articulation. Duchin ( 1990 ) drew attention to the greater manoeuvrability of the human tongue compared with apes. Jaw reduction produces a shorter, more controllable tongue, and hyoidal descent angles the tongue, increasing mechanical advantage. Takemoto ( 2008 ) showed that chimpanzee and human tongues have the same detailed internal topology, a muscular hydrostat formation (Kier and Smith 1985 ), which enables elongation, shortening, thinning, fattening, and twisting of the tongue for moving food around the mouth and for swallowing. However, the overall curved shape of the human tongue, compared with the flat chimpanzee form, means the same internal structures are arranged radially in humans, compared with linearly in apes, which increases the degrees of freedom for tongue deformation (Takemoto 2008 ). Hence, the dietary and other changes from early Homo through to modern humans provided the potential for enhanced control of speech articulation gestures through exaptive realignment of both external and internal tongue features.

The lips are second only to the tongue in their importance as human speech articulators. They are particularly important for the production of two major consonant groups, stops and fricatives (the former being the only consonant type to occur in all languages), and also in vowel production (Ladefoged and Maddieson 1996 ). In typical mammals, the face is dominated by a prominent snout housing major structures of the highly developed olfactory sense, which extend onto the face, in the form of the rhinarium, or wet nose. Within primates, the evolution of the haplorhines (tarsiers, monkeys, and apes) involved a shift to diurnal activity from the typical mammalian nocturnal pattern retained by strepsirhines (lemurs and lorises). With this came increased specialization of the visual sense, and an associated reduction in olfaction. The snout reduced, and the rhinarium was lost. As a result, the facial and lip muscles became less constrained and were co‐opted for facial expressions. Haplorhines evolved thicker lips (Schön Ybarra 1995 ), presumably to enhance this function. Hence, the evolution of mobile, muscular lips, so important to human speech, was the exaptive result of the evolution of diurnality and visual communication in the common ancestor of haplorhines. There is a lack of evidence as to whether there have been further adaptational developments in the lips during human evolution, or whether there have been changes in some other articulators, such as the velum or the epiglottis.

To date, there has been one attempt to investigate the comparative innervation of human vocal tract articulators. Kay et al. (1998) used the size of the hypoglossal canal in the base of the skull to estimate the relative number of nerve fibres in the hypoglossal nerve, which is a major innervator of the tongue. Their results suggested that Middle Pleistocene hominins and Neanderthals had modern human levels of tongue innervation, substantially greater than those found in australopithecines and apes, and hence, they suggested, that human-like speech-related tongue control had evolved by this time. However, DeGusta et al. (1999) demonstrated that hypoglossal canal and nerve sizes are not correlated, and Jungers et al. (2003) accepted that canal size therefore offers no evidence about the timing of human speech evolution. Split-second coordination between the highly flexible movements of the human speech articulators is required for human speech, as well as coordination with laryngeal movements affecting phonation. Different sounds result, for example, if the vocal cords start vibrating slightly before, at the same time as, or slightly after an articulatory gesture. It seems likely that at least some increase in neural control has evolved in humans for speech articulation, even if empirical evidence is presently lacking.

22.2 Respiratory control

Humans have enhanced control of breathing compared with non‐human primates, which they use to extend exhalations and shorten inhalations during speech, as well as to modulate loudness. Humans are not constrained to produce vocalizations that fade as the lungs deflate. They can also vary the volume of air released through a phrase to emphasize particular words or syllables. In addition, variation in subglottal air pressure can affect intonation patterns. Enhanced breathing control therefore contributes to the human ability to produce fast sound sequences, and to generate a whole variety of language‐specific patterns and meanings, communicated through the intonation and emphasis of phrases or specific syllables. Much of this needs to be tied to cognitive intention, involving complex neural communication and feedback (MacLarnon and Hewitt 1999 ).

Control of subglottal pressure is key to human speech breathing control. During speech breathing, intercostal and anterior abdominal muscles are recruited to expand the thorax and draw air into the lungs, and to control elastic recoil and hence the release of air as the lungs deflate. This is similar to quiet breathing, except that the diaphragm has a very limited role in speech breathing. It also differs from muscle recruitment during non-human primate vocalizations, which does involve the diaphragm, and has only a limited role for the intercostal muscles (e.g. Jürgens and Schriever 1991). The specific muscle movements required vary according to the volume of the lungs and other actions undertaken simultaneously (MacLarnon and Hewitt 1999). Overall, the fineness of control required of the intercostal muscles during human speech has been likened to that of the small muscles of the hand (Campbell 1968).

There is evidence, from an increase in spinal cord grey matter in the thoracic region, that humans have markedly greater innervation of the intercostal and anterior abdominal muscles compared with non‐human primates (MacLarnon 1993 ). Spinal cord dimensions are well correlated with those of its bony encasement, the vertebral canal. Evidence from fossil hominins demonstrates that enlargement of the canal, and therefore the cord, was not present in australopithecines and Homo erectus , but was present in Neanderthals and early modern humans (MacLarnon and Hewitt 1999 ). The function requiring enhanced neurological control therefore evolved in later human evolution. Of all the functions of the intercostal muscles, including maintenance of body posture for bipedal locomotion, vomiting, coughing, defecation, and breathing control, only enhanced breathing control for speech both requires substantial neurological control and fits the evolutionary timing constraints. It appears, therefore, that enhanced breathing control for speech was absent in Homo erectus , and present in the common ancestor of Neanderthals and modern humans, in the later Middle Pleistocene (MacLarnon and Hewitt 1999 , 2004 ).

As outlined above, human breathing control is not aided by the presence of air sacs, which can provide additional re-breathed air for the extension of exhalations, without the risk of hyperventilation from excess oxygen intake (Hewitt et al. 2002). The larger ape species all possess laryngeal air sacs, so these were presumably lost at some point during human evolution. Air sacs abut the hyoid bone, where they produce characteristic indentations. The australopithecine hyoid from Dikika demonstrates the presence of air sacs (Alemseged et al. 2006), whereas hyoids from Homo heidelbergensis at Atapuerca, a specimen from Castel di Guido dated to 400,000 years ago, and Neanderthals from El Sidrón and Kebara (Arensburg et al. 1990; Capasso et al. 2008; Martínez et al. 2008) show that air sacs had been lost by the Middle Pleistocene. One possibility is that this occurred when the human thorax altered from the funnel shape of australopithecines to the barrel shape of Homo erectus, as in apes air sacs extend into the thorax. The loss therefore quite probably occurred prior to the evolution of human speech-breathing control, and it may also have been a necessary prerequisite stage.

The mammalian larynx, which protects the entrance to the lungs during swallowing, comprises a series of three sets of articulating cartilages connected by ligaments and membranes. Some mammal species retain a non-valvular larynx, in which occlusion involves a simple muscular sphincter; other species have a valvular larynx, in which a mechanical valve provides closure at the glottis. Based on the distribution of the valvular form, including its greatest development in primates, Negus (1949) proposed that the valvular larynx is a locomotor adaptation, enabling greater stabilization of the thorax in species with independent use of the forelimbs, through the build-up of air pressure below a closed glottis. Humans share with gibbons an extreme ability to close the glottis; other primates cannot completely close it off, as the inner edges of the vocal processes of their arytenoid cartilages are curved, and when brought together a small hiatus intervocalis always remains (Schön Ybarra 1995). Most likely humans lost the hiatus intervocalis independently of gibbons, as it is retained in living great apes. Gibbons may have evolved complete closure as an adaptation to brachiation. Bipedal humans use this capability to build up high subglottal pressure while lifting heavy objects with their arms, and in forceful coughing, which is particularly important with upright posture (Aiello and Dean 1990). In addition, for human speech, substantial subglottal air pressure is required to fuel very long exhalations. Complete glottal closure enhances the ability to control pitch or intonation (Kelemen 1969), something which gibbons use in their songs and humans use in speech, although it is unclear whether subglottal air pressure or movements of the laryngeal cricothyroid muscle are more important in human control of intonation (Borden et al. 2003). Overall, humans probably lost the hiatus intervocalis as an adaptation to bipedalism, providing an exaptation for speech.
Further to this, the membranous part of the vocal folds of humans is less sharp‐edged than in other primates (Negus 1929 ). This may be a direct adaptation for the production of more melodious sounds, selected for at some point after the locomotor‐associated function of the larynx altered in humans, with the evolution of exclusive bipedality in Homo erectus (Aiello 1996 ).

22.3 Evolutionary framework

Diet- and technology-related changes through human evolution, from the time of early Homo, have produced decreases in jaw and tongue length exaptive for the evolution of human speech capabilities. In addition to these, a three-stage framework for the major features of human speech evolution can tentatively be proposed: first, the evolution of obligate bipedalism in Homo erectus produced the exaptations of laryngeal descent and the loss of air sacs and the hiatus intervocalis; secondly, during the Middle Pleistocene, human speech breathing control evolved as a specific speech adaptation; thirdly, with the evolution of modern humans, the optimal (1:1) vocal tract proportions evolved adaptively. Further details are summarized in Table 22.1, together with suggested speech capabilities for each stage of the evolutionary framework.


I would like to thank Kathleen Gibson and Maggie Tallerman for the invitation to contribute to this volume, and for their very helpful editing. My interest in the evolution of human speech was first stimulated by stumbling on evidence for the evolution of human breathing control working with Gwen Hewitt. This paper builds on a lecture prepared for the Language Origins Society, thanks to an invitation from Bernard Bichakjian.



Speech perception and production

Elizabeth D. Casserly

1 Department of Linguistics, Speech Research Laboratory, Indiana University, Bloomington, IN 47405, USA

David B. Pisoni

2 Department of Psychological and Brain Sciences, Speech Research Laboratory, Cognitive Science Program, Indiana University, Bloomington, IN 47405, USA

Until recently, research in speech perception and speech production has largely focused on the search for psychological and phonetic evidence of discrete, abstract, context-free symbolic units corresponding to phonological segments or phonemes. Despite this common conceptual goal and intimately related objects of study, however, research in these two domains of speech communication has progressed more or less independently for more than 60 years. In this article, we present an overview of the foundational works and current trends in the two fields, specifically discussing the progress made in both lines of inquiry as well as the fundamental issues that neither has been able to resolve satisfactorily so far. We then discuss theoretical models and recent experimental evidence that point to the deep, pervasive connections between speech perception and production. We conclude that although research focusing on each domain individually has been vital in increasing our basic understanding of spoken language processing, the human capacity for speech communication is so complex that gaining a full understanding will not be possible until speech perception and production are conceptually reunited in a joint approach to problems shared by both modes.

Historically, language research focusing on the spoken (as opposed to written) word has been split into two distinct fields: speech perception and speech production. Psychologists and psycholinguists worked on problems of phoneme perception, whereas phoneticians examined and modeled articulation and speech acoustics. Despite their common goal of discovering the nature of the human capacity for spoken language communication, the two broad lines of inquiry have experienced limited mutual influence. The division has been partially practical, because methodologies and analysis are necessarily quite different when aimed at direct observation of overt behavior, as in speech production, or examination of hidden cognitive and neurological function, as in speech perception. Academic specialization has also played a part, since there is an overwhelming volume of knowledge available, but single researchers can only learn and use a small portion. In keeping with the goal of this series, however, we argue that the greatest prospects for progress in speech research over the next few years lie at the intersection of insights from research on speech perception and production, and in investigation of the inherent links between these two processes.

In this article, therefore, we will discuss the major theoretical and conceptual issues in research dedicated first to speech perception and then to speech production, as well as the successes and lingering problems in these domains. Then we will turn to several exciting new directions in experimental evidence and theoretical models which begin to close the gap between the two research areas by suggesting ways in which they may work together in everyday speech communication and by highlighting the inherent links between speaking and listening.


Before the advent of modern signal processing technology, linguists and psychologists believed that speech perception was a fairly uncomplicated, straightforward process. Theoretical linguistics’ description of spoken language relied on the use of sequential strings of abstract, context-invariant segments, or phonemes, which provided the mechanism of contrast between lexical items (e.g., distinguishing pat from bat ). 1 , 2 The immense analytic success and relative ease of approaches using such symbolic structures led language researchers to believe that the physical implementation of speech would adhere to the segmental ‘linearity condition,’ so that the acoustics corresponding to consecutive phonemes would concatenate like an acoustic alphabet or a string of beads stretched out in time. If that were the case, perception of the linguistic message in spoken utterances would be a trivial matching process of acoustics to contrastive phonemes. 3

Understanding the true nature of the physical speech signal, however, has turned out to be far from easy. Early signal processing technologies, prior to the 1940s, could detect and display time-varying acoustic amplitudes in speech, resulting in the familiar waveform seen in Figure 1 . Phoneticians have long known that it is the component frequencies encoded within speech acoustics, and how they vary over time, that serve to distinguish one speech percept from another, but waveforms do not readily provide access to this key information. A major breakthrough came in 1946, when Ralph Potter and his colleagues at Bell Laboratories developed the speech spectrogram, a representation which uses the mathematical Fourier transform to uncover the strength of the speech signal hidden in the waveform amplitudes (as shown in Figure 1 ) at a wide range of possible component frequencies. 4 Each calculation finds the signal strength through the frequency spectrum of a small time window of the speech waveform; stringing the results of these time-window analyses together yields a speech spectrogram or voiceprint, representing the dynamic frequency characteristics of the spoken signal as it changes over time ( Figure 2 ).
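The windowed Fourier analysis described here is, in modern terms, a short-time Fourier transform (STFT). A minimal sketch in Python/NumPy, with a synthetic pure tone standing in for recorded speech; the frame length and hop size are arbitrary illustrative choices, not parameters of the original Bell Laboratories device:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via short-time Fourier analysis: window the
    waveform, take the FFT of each frame, and stack spectra over time."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # One column of |FFT| magnitudes per time frame (rows = frequency bins).
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A synthetic 1 kHz tone stands in for an utterance like those in Figure 1.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
freqs = np.fft.rfftfreq(256, d=1 / sr)
print(freqs[spec[:, 0].argmax()])  # peak energy lands at 1000.0 Hz
```

Each column of the result corresponds to one time window, and darkness in a printed spectrogram corresponds to the magnitude values here; narrower windows trade frequency resolution for time resolution, which is the wide-band versus narrow-band distinction in classical spectrography.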

Figure 1.

Speech waveform of the words typical and yesteryear as produced by an adult male speaker, representing variations in amplitude over time. Vowels are generally the most resonant speech component, corresponding to the most extreme amplitude levels seen here. The identifying formant frequency information in the acoustics is not readily accessible from visual inspection of waveforms such as these.

Figure 2.

A wide-band speech spectrogram of the same utterance as in Figure 1 , showing the change in component frequencies over time. Frequency is represented along the y -axis and time on the x -axis. Darkness corresponds to greater signal strength at the corresponding frequency and time.

Phonemes—An Unexpected Lack of Evidence

As can be seen in Figure 2 , the content of a speech spectrogram does not visually correspond to the discrete segmental units listeners perceive in a straightforward manner. Although vowels stand out due to their relatively high amplitudes (darkness) and clear internal frequency structure, reflecting harmonic resonances or ‘formant frequencies’ in the vocal tract, their exact beginning and ending points are not immediately obvious to the eye. Even the seemingly clear-cut amplitude rises after stop consonant closures, such as for the [p] in typical , do not directly correlate with the beginning of a discrete vowel segment, since these acoustics simultaneously provide critical information about both the identity of the consonant and the following vowel. Demarcating consonant/vowel separation is even more difficult in the case of highly sonorant (or resonant) consonants such as [w] or [r].

The simple ‘acoustic alphabet’ view of speech received another setback in the 1950s, when Franklin Cooper of Haskins Laboratories reported his research group’s conclusion that acoustic signals composed of strictly serial, discrete units designed to correspond to phonemes or segments are actually impossible for listeners to process at speeds near those of normal speech perception. 5 No degree of signal simplicity, contrast between units, or user training with the context-free concatenation system could produce natural rates of speech perception for listeners. Therefore, the Haskins group concluded that speech must transmit information in parallel, through use of the contextual overlap observed in spectrograms of the physical signal. Speech does not look like a string of discrete, context-invariant acoustic segments, and in order for listeners to process its message as quickly as they do, it cannot be such a system. Instead, as Alvin Liberman proposed, speech is a ‘code,’ taking advantage of parallel transmission of phonetic content on a massive scale through co-articulation 3 (see section ‘Variation in Invariants,’ below).

As these discoveries came to light, the ‘speech perception problem’ began to appear increasingly insurmountable. On the one hand, phonological evidence (covered in more depth in the ‘Variation in Invariants’ section) implies that phonemes are a genuine property of linguistic systems. On the other hand, it has been shown that the acoustic speech signal does not directly correspond to phonological segments. How could a listener use such seemingly unhelpful acoustics to recover a speaker’s linguistic message? Hockett encapsulated early speech scientists’ bewilderment when he famously likened the speech perception task to that of the inspector in the following scenario:

Imagine a row of Easter eggs carried along a moving belt; the eggs are of various sizes, and variously colored, but not boiled. At a certain point, the belt carries the row of eggs between two rollers of a wringer, which quite effectively smash them and rub more or less into each other. The flow of eggs before the wringer represents the series of impulses from the phoneme source; the mess that emerges from the wringer represents the output of the speech transmitter. At a subsequent point, we have an inspector whose task it is to examine the passing mess and decide, on the basis of the broken and unbroken yolks, the variously spread out albumen, and the variously colored bits of shell, the nature of the flow of eggs which previously arrived at the wringer. (Ref 1 , p. 210)

For many years, researchers in the field of speech perception focused their efforts on trying to solve this enigma, believing that the heart of the speech perception problem lay in the seemingly impossible task of phoneme recognition—putting the Easter eggs back together.

Synthetic Speech and the Haskins Pattern Playback

Soon after the speech spectrogram enabled researchers to visualize the spectral content of speech acoustics and its changes over time, that knowledge was put to use in the development of technology able to generate speech synthetically. One of the early research synthesizers was the Pattern Playback ( Figure 3 , top panel), developed by scientists and engineers, including Cooper and Liberman, at Haskins Laboratories. 6 This device could take simplified sound spectrograms like those shown in Figure 3 and use the component frequency information to produce highly intelligible corresponding speech acoustics. Hand-painted spectrographic patterns ( Figure 3 , lower panel) allowed researchers tight experimental control over the content of this synthetic, very simplified Pattern Playback speech. By varying its frequency content and transitional changes over time, investigators were able to determine many of the specific aspects in spoken language which are essential to particular speech percepts, and many which are not. 3 , 6

Figure 3.

Top panel: A diagram of the principles and components at work in the Haskins Pattern Playback speech synthesizer. (Reprinted with permission from Ref 68. Copyright 1951 National Academy of Sciences.) Lower panel: A series of hand-painted schematic spectrographic patterns used as input to the Haskins Pattern Playback in early research on perceptual ‘speech cues.’ (Reprinted with permission from Ref 69. Copyright 1957 American Institute of Physics.)

Perceptual experiments with the Haskins Pattern Playback and other speech synthesizers revealed, for example, the pattern of complex acoustics that signals the place of articulation of English stop consonants such as [b], [t], and [k]. 3 For voiced stops ([b], [d], [g]), the transitions of the formant frequencies from silence to the vowel following the consonant largely determine the resulting percept. For voiceless stops ([p], [t], [k]), however, the acoustic frequency of the burst of air following the release of the consonant plays the largest role in identification. The experimental control gained from the Pattern Playback allowed researchers to alter and eliminate many aspects of naturally produced speech signals, discovering the identities of many such sufficient or necessary acoustic cues for a given speech percept. This early work attempted to pare speech down to its bare essentials, hoping to reveal the mechanisms of speech perception. Although largely successful in identifying perceptually crucial aspects of speech acoustics and greatly increasing our fundamental understanding of speech perception, these pioneering research efforts did not yield invariant, context-independent acoustic features corresponding to segments or phonemes. If anything, this research program suggested alternative bases for the communication of linguistic content. 7 , 8

Phoneme Perception—Positive Evidence

Some of the research conducted with the aim of understanding phoneme perception, however, did lead to results suggesting the reality of psychological particulate units such as phonemes. For instance, in some cases listeners show evidence of ‘perceptual constancy,’ or abstraction from signal variation to more generalized representations—possibly phonemes. Various types of such abstraction have been found in speech perception, but we will address two of the most influential here.

Categorical Perception Effects

Phoneme representations split potential acoustic continua into discrete categories. The duration of aspiration occurring after the release of a stop consonant, for example, constitutes a potential continuum ranging from 0 ms, where vocalic resonance begins simultaneously with release of the stop, to an indefinitely long period between the stop release and the start of the following vowel. Yet stops in English falling along this continuum are split by native listeners into two functional groups—voiced [b], [d], [g] or voiceless [p], [t], [k]—based on the length of this ‘voice onset time.’ In general, this phenomenon is not so strange: perceptual categories often serve to break continuous variation into manageable chunks.
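The categorical split over the VOT continuum can be sketched as a simple threshold rule. The ~25 ms boundary and the function below are illustrative assumptions for English bilabial stops, not measured values:

```python
# Illustrative sketch: mapping a continuous voice-onset-time (VOT) value
# onto the discrete voiced/voiceless categories described above.
# The ~25 ms boundary is an assumed value for English bilabial stops.

def classify_stop(vot_ms: float, boundary_ms: float = 25.0) -> str:
    """Split the continuous VOT dimension into two functional categories."""
    return "voiced (e.g., [b])" if vot_ms < boundary_ms else "voiceless (e.g., [p])"

# A physical continuum of stimuli in 10 ms steps is heard as two discrete groups:
continuum = [0, 10, 20, 30, 40, 50]
labels = [classify_stop(v) for v in continuum]
```

The point of the sketch is only that a graded physical dimension yields a discrete perceptual outcome; the boundary location in real listeners varies with place of articulation and speaking rate.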

Speech categories appear to be unique in one respect, however: listeners are unable to reliably discriminate between two members of the same category. That is, although we may assign two different shades both to the category ‘red,’ we can easily distinguish between them in most cases. When speech scientists give listeners stimuli varying along an acoustic continuum, however, their discrimination between different tokens of the same category (analogous to two shades of red) is very close to chance. 9 They are highly accurate, on the other hand, at discriminating tokens spanning category boundaries. The combination of sharp category boundaries in listeners’ labeling of stimuli and their within-category insensitivity in discrimination, as shown in Figure 4 , appears to be unique to human speech perception, and constitutes some of the strongest evidence in favor of robust segmental categories underlying speech perception.
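The ‘predicted discrimination’ curve in experiments of this kind is commonly derived from the labeling data under the assumption that listeners discriminate only via covert category labels. A hedged sketch of that classic Haskins-style prediction for ABX trials:

```python
# Hedged sketch of the classic prediction of ABX discrimination from
# labeling data: if listeners compare stimuli only through covert category
# labels, predicted accuracy for a stimulus pair is
#   P(correct) = 0.5 + 0.5 * (p1 - p2)**2
# where p1 and p2 are the probabilities of labeling each stimulus as,
# say, [b]. (Formula as commonly attributed to the Haskins group.)

def predicted_abx(p1: float, p2: float) -> float:
    return 0.5 + 0.5 * (p1 - p2) ** 2

# Within a category (both stimuli labeled [b] ~90-95% of the time),
# discrimination is predicted to be near chance:
within = predicted_abx(0.95, 0.90)
# Across the boundary, it is predicted to be high:
across = predicted_abx(0.90, 0.10)
```

Identical labeling probabilities yield exactly chance (0.5), which is why within-category pairs sit near the floor of the lower panel in Figure 4.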

Figure 4. Data for a single subject from a categorical perception experiment. The upper panel gives labeling or identification data for each step on a [b]/[g] place-of-articulation continuum. The lower graph gives this subject’s ABX discrimination data (filled circles) for the same stimuli with one step difference between pairs, as well as the predicted discrimination performance (open circles). Discrimination accuracy is high at category boundaries and low within categories, as predicted. (Reprinted with permission from Ref 9. Copyright 1957 American Psychological Association.)

According to this evidence, listeners sacrifice sensitivity to acoustic detail in order to make speech category distinctions more automatic and perhaps also less subject to the influence of variability. This type of category robustness is observed more strongly in the perception of consonants than vowels. Not coincidentally, as discussed briefly above and in more detail in the ‘Acoustic Phonetics’ section, below, the stop consonants which listeners have the most difficulty discriminating also prove to be the greatest challenge to define in terms of invariant acoustic cues. 10

Perceptual Constancy

Categorical perception effects are not the only case of such abstraction or perceptual constancy in speech perception; listeners also appear to ‘translate’ the speech they hear into more symbolic or idealized forms, encoding based on expectations of gender and accent. Niedzielski, for example, found that listeners identified recorded vowel stimuli differently when they were told that the original speaker was from their own versus another dialect group. 11 For these listeners, therefore, the mapping from physical speech characteristics to linguistic categories was not absolute, but mediated by some abstract conceptual unit. Johnson summarizes the results of a variety of studies showing similar behavior, 12 which corroborates the observation that, although indexical or ‘extra-linguistic’ information such as speaker gender, dialect, and speaking style are not inert in speech perception, more abstract linguistic units play a role in the process as well.

Far from being exotic, this type of ‘perceptual equivalence’ corresponds very well with language users’ intuitions about speech. Although listeners are aware that individuals often sound drastically different, the feeling remains that something holds constant across talkers and speech tokens. After all, cat is still cat no matter who says it. Given the signal variability and complexity observed in speech acoustics, such consistency certainly seems to imply the influence of some abstract unit in speech perception, possibly contrastive phonemes or segments.

Phoneme Perception—Shortcomings and Roadblocks

From the discussion above, it should be evident that speech perception research with the traditional focus on phoneme identification and discrimination has been unable either to confirm or deny the psychological reality of context-free symbolic units such as phonemes. Listeners’ insensitivity to stimulus differences within a linguistic category and their reference to an abstract ideal in identification support the cognitive role of such units, whereas synthetic speech manipulation has simultaneously demonstrated that linguistic percepts simply do not depend on invariant, context-free acoustic cues corresponding to segments. This paradoxical relationship between signal variance and perceptual invariance constitutes one of the fundamental issues in speech perception research.

Crucially, however, the research discussed until now focused exclusively on the phoneme as the locus of language users’ perceptual invariance. This approach stemmed from the assumption that speech perception can essentially be reduced to phoneme identification, relating yet again back to theoretical linguistics’ analysis of language as sequences of discrete, context-invariant units. Especially given the roadblocks and contradictions emerging in the field, however, speech scientists began to question the validity of those foundational assumptions. By attempting to control variability and isolate perceptual effects on the level of the phoneme, experimenters were asking listeners to perform tasks that bore little resemblance to typical speech communication. Interest in the field began to shift toward the influence of larger linguistic units such as words, phrases, and sentences and how speech perception processes are affected by them, if at all.

Beyond the Phoneme—Spoken Word Recognition Processes

Both new and revisited experimental evidence readily confirmed that the characteristics of word-level units do exert massive influence in speech perception. The lexical status (word vs non-word) of experimental stimuli, for example, biases listeners’ phoneme identification such that they hear more tokens as [d] in a dish / tish continuum, where the [d] percept creates a real word, than a da / ta continuum where both perceptual options are non-words. 13 Moreover, research into listeners’ perception of spoken words has shown that there are many factors that play a major role in word recognition but almost never influence phoneme perception.

Perhaps the most fundamental of these factors is word frequency: how often a lexical item tends to be used. The more frequently listeners encounter a word over the course of their daily lives, the more quickly and accurately they are able to recognize it, and the better they are at remembering it in a recall task (e.g., Refs 14 , 15 ). High-frequency words are more robust in noisy listening conditions, and whenever listeners are unsure of what they have heard through such interference, they are more likely to report hearing a high-frequency lexical item than a low-frequency one. 16 In fact, the effects of lexical status mentioned above are actually only extreme cases of frequency effects; phonotactically legal non-words (i.e., non-words which seem as though they could be real words) are treated psychologically like real words with a frequency of zero. Like cockroaches, these so-called ‘frequency effects’ pop up everywhere in speech research.

The nature of a word’s ‘lexical neighborhood’ also plays a pervasive role in its recognition. If a word is highly similar to many other words, as cat is in English, then listeners will be slower and less accurate in identifying it, whereas a comparably high-frequency word with fewer ‘neighbors’ to compete with it will be recognized more easily. ‘Lexical hermits’ such as Episcopalian and chrysanthemum , therefore, are particularly easy to recognize despite their low frequencies (and long durations). As further evidence of frequency effects’ ubiquitous presence, however, the frequencies of a word’s neighbors also influence perception: a word with a dense neighborhood of high-frequency items is more difficult to recognize than a word with a dense neighborhood of relatively low-frequency items, which offers weaker competition. 17 , 18
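The joint effect of frequency and neighborhood density can be sketched as frequency-weighted competition, in the spirit of (but much simpler than) the Neighborhood Activation Model. The word lists and frequency counts below are invented purely for illustration:

```python
# Simplified, hypothetical sketch of frequency-weighted lexical
# competition: a word's recognition probability falls as its neighbors'
# summed frequency rises. All numbers are invented for illustration.

def recognition_odds(target_freq: float, neighbor_freqs: list[float]) -> float:
    """Share of frequency-weighted activation captured by the target word."""
    return target_freq / (target_freq + sum(neighbor_freqs))

# 'cat': high frequency, but a dense neighborhood of frequent competitors
cat = recognition_odds(1000, [900, 800, 750, 700, 600])
# 'chrysanthemum': low frequency, but a lexical hermit with no close neighbors
hermit = recognition_odds(5, [])
```

Even with a fifth of the competitors' individual frequencies, the hermit captures all of the activation in its (empty) neighborhood, matching the observation that low-frequency hermits are recognized easily.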

Particularly troublesome for abstractionist phoneme-based views of speech perception, however, was the discovery that the indexical properties of speech (see ‘Perceptual Constancy,’ above) also influence word recognition. Goldinger, for example, has shown that listeners are more accurate at word recall when they hear stimuli repeated by the same versus different talkers. 19 If speech perception were mediated only by linguistic abstractions, such ‘extra-linguistic’ detail should not be able to exert this influence. In fact, this and an increasing number of similar results (e.g., Ref 20 ) have caused many speech scientists to abandon traditional theories of phoneme-based linguistic representation altogether, instead positing that lexical items are composed of maximally detailed ‘episodic’ memory traces. 19 , 21

Conclusion—Speech Perception

Regardless of the success or failure of episodic representational theories, a host of new research questions remain open in speech perception. The variable signal/common percept paradox remains a fundamental issue: what accounts for the perceptual constancy across highly diverse contexts, speech styles and speakers? From a job interview in a quiet room to a reunion with an old friend at a cocktail party, from a southern belle to a Detroit body builder, what makes communication possible? Answers to these questions may lie in discovering the extent to which the speech perception processes tapped by experiments in word recognition and phoneme perception are related, and uncovering the nature of the neural substrates of language that allow adaptation to such diverse situations. Deeply connected to these issues, Goldinger, Johnson and others’ results have prompted us to wonder: what is the representational specificity of speech knowledge and how does it relate to perceptual constancy?

Although speech perception research over the last 60 years has made substantial progress in increasing our understanding of perceptual challenges and particularly the ways in which they are not solved by human listeners, it is clear that a great deal of work remains to be done before even this one aspect of speech communication is truly understood.


Speech production research serves as the complement to the work on speech perception described above. Where investigations of speech perception are necessarily indirect, using listener response time latencies or recall accuracies to draw conclusions about underlying linguistic processing, research on speech production can be refreshingly direct. In typical production studies, speech scientists observe articulation or acoustics as they occur, then analyze this concrete evidence of the physical speech production process. Conversely, where speech perception studies give researchers exact experimental control over the nature of their stimuli and the inputs to a subject’s perceptual system, research on speech production affords severely limited experimental control: investigators must observe more or less passively while speakers respond to their prompts as they will.

Such fundamentally different experimental conditions, along with a focus on the opposite side of the perceptual coin, allow speech production research to ask different questions and draw different conclusions about spoken language use and speech communication. As we discuss below, in some ways this ‘divide and conquer’ approach has been very successful in expanding our understanding of speech as a whole. In other ways, however, it has met with many of the same roadblocks as its perceptual complement and similarly leaves many critical questions unanswered in the end.

A Different Approach

When the advent of the speech spectrogram made it obvious that the speech signal does not straightforwardly mirror phonemic units, researchers responded in different ways. Some, as discussed above, chose to question the perceptual source of phoneme intuitions, trying to define the acoustics necessary and sufficient for an identifiable speech percept. Others, however, began separate lines of work aiming to observe the behavior of speakers more directly. They wanted to know what made the speech signal as fluid and seamless as it appeared, whether the observed overlap and contextual dependence followed regular patterns or rules, and what evidence speakers might show in support of the reality of the phonemic units. In short, these speech scientists wanted to demystify the puzzling acoustics seen on spectrograms by investigating them in the context of their source.

The Continuing Search

It may seem odd, perhaps, that psychologists, speech scientists, engineers, phoneticians, and linguists were not ready to abandon the idea of phonemes as soon as it became apparent that the physical speech signal did not straightforwardly support their psychological reality. Dating back to Pāṇini’s grammatical study of Sanskrit, however, the use of abstract units such as phonemes has provided enormous gains to linguistic and phonological analysis. Phonemic units appear to capture the domain of many phonological processes, for example, and their use enables linguists to make sense of the multitude of patterns and distributions of speech sounds across the world’s languages. It has even been argued 22 that their discrete, particulate nature underlies humanity’s immense potential for linguistic innovation, allowing us to make ‘infinite use of finite means.’ 23

Beyond these theoretical gains, phonemes were argued to be empirically supported by research on speech errors or ‘slips of the tongue,’ which appeared to operate over phonemic units. That is, the kinds of errors observed during speech production, such as anticipations (‘a leading list’), perseverations (‘pulled a pantrum’), reversals (‘heft lemisphere’), additions (‘moptimal number’), and deletions (‘chrysanthemum ants,’ with the [p] of pants omitted), appear to involve errors in the ordering and selection of whole segmental units, and always result in phonologically legal combinations, whose domain is typically described as the segment. 24 Without evidence to the contrary, these errors seemed to provide evidence for speakers’ use of discrete phonological units.

Although there have been dissenters 25 and shifts in the conception of the units thought to underlie speech, abstract features or phoneme-like units of some type have remained prevalent in the literature. In light of the particulate nature of linguistic systems, the enhanced understanding gained under the assumption of segmental analysis, and the empirical evidence observed in speech planning errors, researchers were and are reluctant to give up the search for the basis of phonemic intuitions in physically observable speech production.

Acoustic Phonetics

One of the most fruitful lines of research into speech production focused on the acoustics of speech. This body of work, part of ‘Acoustic Phonetics,’ examines the speech signals speakers produce in great detail, searching for regularities, invariant properties, and simply a better understanding of the human speech capacity. Although the speech spectrograph did not immediately show the invariants researchers anticipated, they reasoned that such technology would also allow them to investigate the speech signal at an unprecedented level of scientific detail. Because speech acoustics are so complex, invariant cues corresponding to phonemes may be present, but difficult to pinpoint. 10 , 26

While psychologists and phoneticians in speech perception were generating and manipulating synthesized speech in an effort to discover the acoustic ‘speech cues,’ therefore, researchers in speech production refined signal processing techniques enabling them to analyze the content of naturally produced speech acoustics. Many phoneticians and engineers took on this problem, but perhaps none has been as tenacious and successful as Kenneth Stevens of MIT.

An electrical engineer by training, Stevens took the problem of phoneme-level invariant classification and downsized it, capitalizing on the phonological theories of Jakobson et al. 27 and Chomsky and Halle’s The Sound Pattern of English 28 , which postulated linguistic units below the level of the phoneme called distinctive features. Binary values of universal features such as [sonorant], [continuant], and [high], these linguists argued, constituted the basis of phonemes. Stevens and his colleagues reasoned that invariant acoustic signals might correspond to distinctive features rather than phonemes. 10 , 26 Since phonemes often share features (e.g., /s/ and /z/ share specification for all distinctive features except [voice]), it would make sense that their acoustics are not as unique as might be expected from their contrastive linguistic function alone.

Stevens, therefore, began a thorough search for invariant feature correlates that continued until his retirement in 2007. He enjoyed several notable successes: many phonological features, it turns out, can be reliably specified by one or two signal characteristics or ‘acoustic landmarks.’ Phonological descriptors of vowel quality, such as [high] and [front], were found to correspond closely to the relative spacing between the first and second resonances of the vocal tract (or ‘formants’) during the production of sonorant vowel segments. 10
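A landmark-style mapping from formant measurements to binary vowel features can be sketched as below. The thresholds are invented for illustration and are not Stevens' actual values; real classification uses relative, talker-normalized measures:

```python
# Illustrative sketch of 'acoustic landmark' style feature classification:
# vowel height correlates inversely with F1, and frontness with the F2-F1
# spacing. Thresholds are assumed values for illustration only.

def vowel_features(f1_hz: float, f2_hz: float) -> dict[str, bool]:
    return {
        "high":  f1_hz < 400,             # low F1 -> high tongue body
        "front": (f2_hz - f1_hz) > 1200,  # wide F1-F2 spacing -> front vowel
    }

# Rough canonical formant values for [i] (high front) vs [a] (low back):
feats_i = vowel_features(280, 2250)
feats_a = vowel_features(700, 1100)
```

The binary outputs, not the raw frequencies, are what a feature-based account treats as linguistically contrastive.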

Some features, however, remained more difficult to define acoustically. Specifically, the acoustics corresponding to consonant place of articulation seemed to depend heavily on context—the exact same burst of noise transitioning from a voiceless stop to a steady vowel might result from the lip closure of a [p] or the tongue-dorsum occlusion of a [k], depending on the vowel following the consonant. Equally problematic, the acoustics signaling the alveolar ridge closure of the coronal stop [t] are completely different before different vowels. 29 This articulation/acoustics mismatch, and the tendency for linguistic percepts to mirror articulation rather than acoustics, is represented in Figure 5 .

Figure 5. Observations from early perceptual speech cue studies. In the first case, two different acoustic signals (consonant/vowel formant frequency transitions) result in the same percept. In the latter case, identical acoustics (release burst at 1440 Hz) result in two different percepts, depending on the vocalic context. In both cases, however, perception reflects articulatory, rather than acoustic, contrast. (Adapted and reprinted with permission from Ref 29. Copyright 1996 American Institute of Physics.)

Variation in Invariants

Why do listeners’ speech percepts show this dissociation from raw acoustic patterns? Perhaps the answer becomes more intuitive when we consider that even the most reliable acoustic invariants described by Stevens and his colleagues tend to be somewhat broad, dealing in relative distances between formant frequencies in vowels, relative abruptness of shifts in amplitude, and so on. This dependence on relative measures comes from two major sources: individual differences among talkers and contextual variation due to co-articulation. Individual speakers’ vocal tracts are shaped and sized differently, and therefore they resonate differently (just as different sounds are produced by blowing over the necks of differently sized and shaped bottles), making the absolute formant frequencies corresponding to different vowels, for instance, impossible to generalize across individuals.
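One standard way to recover cross-talker comparability is Lobanov (z-score) vowel normalization: express each formant as standard deviations from that talker's own mean. The technique is real; the formant values below are invented to show the idea:

```python
# Talker differences in vocal-tract size shift absolute formant values,
# so cross-speaker comparison typically relies on relative measures.
# Lobanov normalization re-expresses each measurement in units of the
# talker's own formant mean and standard deviation.

from statistics import mean, stdev

def lobanov(formants_hz: list[float]) -> list[float]:
    m, s = mean(formants_hz), stdev(formants_hz)
    return [(f - m) / s for f in formants_hz]

# Hypothetical F1 values for the same three vowels from a smaller-tract
# and a larger-tract talker: absolute Hz differ, normalized values align.
talker_a = lobanov([300.0, 500.0, 700.0])
talker_b = lobanov([390.0, 650.0, 910.0])   # same pattern, scaled up
```

After normalization both talkers yield the same relative vowel positions, which is exactly the kind of talker-independent structure an invariance account needs.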

Perhaps more obviously problematic, though, is the second source: speech acoustics’ sensitivity to phonetic context. Not only do the acoustic cues for [p], [t], or [k] depend on the vowel following the stop closure, for example, but because the consonant and vowel are produced nearly simultaneously, the identity of the consonant reciprocally affects the acoustics of the vowel. Such co-articulatory effects are extremely robust, even operating across syllable and word boundaries. This extensive interdependence makes the possibility of identifying reliable invariance in the acoustic speech signal highly remote.

Although some researchers, such as Stevens, attempted to factor out or ‘normalize’ these co-articulatory effects, others believed that they are central to the functionality of speech communication. Liberman et al. at Haskins Laboratories pointed out that co-articulation of consonants and vowels allows the speech signal to transfer information in parallel, transmitting messages more quickly than it could if spoken language consisted of concatenated strings of context-free discrete units. 3 Co-articulation therefore enhances the efficiency of the system, rather than being a destructive or communication-hampering force. Partially as a result of this view, some speech scientists focused on articulation as a potential key to understanding the reliability of phonemic intuitions, rather than on its acoustic consequences. They developed the research program called ‘articulatory phonetics,’ aimed at the study of the visible and hidden movements of the speech organs.

Articulatory Phonetics

In many ways articulatory phonetics constitutes as much of an engineering challenge as a linguistic one. Because the majority of the vocal tract ‘machinery’ lies hidden from view (see Figure 6 ), direct observation of the mechanics of speech production requires technology, creativity, or both. And any potential solution to the problem of observation cannot disrupt natural articulation too extensively if its results are to be useful in understanding natural production of speech. The challenge, therefore, is to investigate aspects of speech articulation accurately and to a high level of detail, while keeping interference with the speaker’s normal production as minor as possible.

Figure 6. A sagittal view of the human vocal tract showing the main speech articulators as labeled. (Reprinted with permission from Ref 70. Copyright 2001 Blackwell Publishers Inc.)

Various techniques have been developed that manage to satisfy these requirements, spanning from the broadly applicable to the highly specialized. Electromyography (EMG), for instance, allows researchers to measure directly the activity of muscles within the vocal tract during articulation via surface or inserted pin electrodes. 30 These recordings have broad applications in articulatory phonetics, from determining the relative timing of tongue movements during syllable production to measures of pulmonary function from activity in speakers’ diaphragms to examining tone production strategies via raising and lowering of speakers’ larynxes. EMG electrode placement can significantly impact articulation, however, which does impose limits on its use. More specialized techniques are typically still more disruptive of typical speech production, but interfere minimally with their particular investigational target. In transillumination of the glottis, for example, a bundle of fiberoptic lights is fed through a speaker’s nose until the light source is positioned just above their larynx. 31 A light-sensitive photocell is then placed on the neck just below the glottis to detect the amount of light passing through the vocal folds at any given moment, which correlates directly with the size of glottal opening over time. Although transillumination is clearly not an ideal method to study the majority of speech articulation, it nevertheless provides a highly accurate measure of various glottal states during speech production.

Perhaps the most currently celebrated articulatory phonetics methods are also the least disruptive to speakers’ natural articulation. Simply filming speech production in real-time via X-ray provided an excellent, complete view of unobstructed articulation, but for health and safety reasons can no longer be used to collect new data. 32 Methods such as X-ray microbeam and Electromagnetic Mid-Sagittal Articulometer (EMMA) tracking attempt to approximate that ‘X-ray vision’ by recording the movements of speech articulators in real-time through other means. The former uses a tiny stream of X-ray energy aimed at radio-opaque pellets attached to a speaker’s lips, teeth, and tongue to monitor the movements of the shadows created by the pellets as the speaker talks. The latter, EMMA, generates similar positional data for the speech organs by focusing alternating magnetic fields on a speaker and monitoring the voltage induced in small electromagnetic coils attached to the speaker’s articulators as they move through the fields during speech. Both methods track the movements of speech articulators despite their inaccessibility to visible light, providing reliable position-over-time data while minimally disrupting natural production. 33 , 34 However, comparison across subjects can be difficult due to inconsistent placement of tracking points from one subject to another and simple anatomical differences between subjects.
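The position-over-time records these methods produce are typically analyzed kinematically: differentiate a pellet's or coil's position to get velocity, then mark gesture onset where velocity first exceeds some fraction of its peak (20% is a common convention in this literature). A hedged sketch with an invented trajectory:

```python
# Sketch of a standard kinematic analysis for EMMA- or X-ray-microbeam-
# style tracking data: numerically differentiate position to velocity,
# then find where the movement begins. The 20% peak-velocity onset
# criterion is a common convention; the trajectory below is invented.

def gesture_onset(positions_mm: list[float], dt_s: float,
                  threshold: float = 0.2) -> int:
    velocity = [abs(b - a) / dt_s for a, b in zip(positions_mm, positions_mm[1:])]
    peak = max(velocity)
    for i, v in enumerate(velocity):
        if v >= threshold * peak:
            return i  # sample index where the movement begins
    return -1

# Hypothetical lip-aperture trajectory sampled at 200 Hz (dt = 5 ms):
trajectory = [10.0, 10.0, 9.9, 9.5, 8.6, 7.2, 6.0, 5.5, 5.4]
onset_index = gesture_onset(trajectory, dt_s=0.005)
```

Onset and offset times extracted this way are what allow researchers to compare the relative timing of gestures across articulators and speakers.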

Ultrasound provides another, even less disruptive, articulatory phonetics technique that has been gaining popularity in recent years (e.g., Refs 35 , 36 ). Using portable machinery that does nothing more invasive than send sound waves through a speaker’s tissue and monitor their reflections, speech scientists can track movements of the tongue body, tongue root, and pharynx that even X-ray microbeam and EMMA cannot capture, as these articulators are almost completely inaccessible to direct observation or manual measurement. By placing an ultrasound wand at the juncture of the head and neck below the jaw, however, images of the tongue from its root in the larynx to its tip can be viewed in real-time during speech production, with virtually no interference to the speech act itself. The tracking cannot extend beyond cavities of open air, making this method inappropriate for studies of precise place of articulation against the hard palate or of velum movements, for example, but these are areas in which X-ray microbeam and EMMA excel. The data recently captured using these techniques are beginning to give speech scientists a more complete picture of speech articulation than ever before.

Impact on the Search for Phonemes

Unfortunately for phoneme-based theories of speech production and planning, the results of recent articulatory studies of speech errors do not seem to paint a compatible picture. As discussed above, the categorical nature of speech errors has served as important support for the use of phonemic units in speech production. Goldstein, Pouplier, and their colleagues, however, used EMMA to track speakers’ production of errors in a repetition task similar to a tongue twister. Confirming earlier suspicions (e.g., Ref 25 ), they found that while speakers’ articulation sometimes followed a categorically ‘correct’ or ‘errorful’ gestural pattern, it was more frequently somewhere between two opposing articulations. In these cases, small ‘incorrect’ movements of the articulators would intrude upon the target speech gesture, both gestures would be executed simultaneously, or the errorful gesture would completely overshadow the target articulation. Only the latter reliably resulted in the acoustic percept of a speech error. 37 As Goldstein and Pouplier point out, such non-categorical, gradient speech errors cannot constitute support for discrete phonemic units in speech planning.

Importantly, this finding was not isolated in the articulatory phonetics literature: speakers frequently appear to execute articulatory movements that do not result in any acoustic consequences. Specifically, X-ray microbeam tracking of speakers’ tongue tip, tongue dorsum, and lip closures during casual pronunciation of phrases such as perfect memory reveals that speakers raise their tongue tips for [t]-closure, despite the fact that the preceding [k] and following [m] typically obscure the acoustic realization of the [t] completely. 38 Although they could minimize their articulatory effort by not articulating the [t] where it will not be heard, speakers faithfully proceed with their complete articulation, even in casual speech.

Beyond the Phoneme

So far we have seen that, while technological, methodological and theoretical advances have enabled speech scientists to understand the speech signal and its physical production better than ever before, the underlying source of spoken language’s systematic nature remains largely mysterious. New research questions continue to be formulated, however, using results that were problematic under old hypotheses to motivate new theories and new approaches to the study of speech production.

The theory of ‘Articulatory Phonology’ stands as a prominent example; its proponents took the combination of gradient speech error data, speakers’ faithfulness to articulation despite varying means of acoustic transmission, and the lack of invariant acoustic speech cues as converging evidence that speech is composed of articulatory, rather than acoustic, fundamental units that contain explicit and detailed temporal structure. 8 , 38 Under this theory, linguistic invariants are underlyingly motor-based articulatory gestures which specify the degree and location of constrictions in the vocal tract and delineate a certain amount of time relative to other gestures for their execution. Constellations of these gestures, related in time, constitute syllables and words without reference to strictly sequential segmental or phonemic units. Speech perception, then, consists of determining the speech gestures and timing responsible for creating a received acoustic signal, possibly through extension of experiential mapping between the perceiver’s own gestures and their acoustic consequences, as in Liberman and Mattingly’s Motor Theory of Speech Perception 7 or Fowler’s Direct Realist approach. 39 Recent evidence from neuroscience may provide a biological mechanism for this process 40 (see ‘Neurobiological Evidence—Mirror Neurons,’ below).
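A gestural score of the kind Articulatory Phonology proposes can be sketched as a small data structure: each gesture names a constriction (location and degree) and an activation interval, and a word is a constellation of temporally overlapping gestures. The field names and timing values below are illustrative, not the theory's formal notation:

```python
# Minimal data-structure sketch of an Articulatory Phonology 'gestural
# score.' Field names, gesture inventory, and millisecond values are
# invented for illustration.

from dataclasses import dataclass

@dataclass
class Gesture:
    articulator: str   # e.g., 'lips', 'tongue tip', 'glottis'
    location: str      # constriction location
    degree: str        # constriction degree, e.g., 'closed', 'narrow'
    onset_ms: float
    offset_ms: float

# A rough gestural score for the word 'pan': overlapping (co-articulated)
# gestures rather than a sequence of discrete phonemes.
pan = [
    Gesture("lips", "labial", "closed", 0, 80),                # [p] closure
    Gesture("glottis", "glottal", "wide", 0, 120),             # voicelessness
    Gesture("tongue body", "pharyngeal", "narrow", 40, 220),   # vowel [a]
    Gesture("velum", "velic", "wide", 150, 260),               # nasality for [n]
    Gesture("tongue tip", "alveolar", "closed", 180, 260),     # [n] closure
]

# Gestures active during the initial closure (0-80 ms) overlap in time:
overlapping = [g for g in pan if g.onset_ms < 80 and g.offset_ms > 40]
```

Because the vowel gesture is already active during the consonant closure, co-articulation falls out of the representation itself rather than being noise imposed on a phoneme string.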

And although researchers like Stevens continued to focus on speech acoustics as opposed to articulation, the separate lines of inquiry actually appear to be converging on the same fundamental solution to the invariance problem. The most recent instantiation of Stevens’ theory posits that some distinctive phonological features are represented by sets of redundant, invariant acoustic cues, only a subset of which is necessary for recognition in any single token. As Stevens recently wrote, however, the distinction between this feature-based account and theories of speech based on gestures may no longer be clear:

The acoustic cues that are used to identify the underlying distinctive features are cues that provide evidence for the gestures that produced the acoustic pattern. This view that a listener focuses on acoustic cues that provide evidence for articulatory gestures suggests a close link between the perceptually relevant aspects of the acoustic pattern for a distinctive feature in speech and the articulatory gestures that give rise to this pattern.
(Ref 10, p. 142)

Just as in speech perception research, however, some speech production scientists are beginning to wonder whether the invariance question was the right question in the first place. In the spirit of Lindblom’s hyper-articulation and hypo-articulation theory 41 (see ‘Perception-Driven Adaptation in Speech Production,’ below), these researchers have begun investigating control and variability in production as a means of pinning down the nature of the underlying system. Speakers are asked to produce the same sentence in various contextual scenarios, such that a target word is elicited as the main element of focus, as a carrier of stress, as a largely unstressed element, and as a correction to a misheard component (e.g., in an exchange such as ‘Boy?’ ‘No, toy’), while their articulation and speech acoustics are recorded. The data are then examined for regularities. If, for example, the relationship between onset-consonant and following-vowel durations remains constant across emphasized, focused, stressed, and unstressed conditions, that relationship may be specified in the representation of syllables, whereas if the absolute closure and vocalic durations vary freely, they must not be subject to linguistic constraint. Research of this type seeks to determine which articulatory variables are under active, regular control and which (if any) are mere derivatives or side effects of deliberate actions. 42 – 44
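The durational analysis described above can be illustrated with a small sketch. Every number below is invented for illustration; the point is only the comparison of variability measures, not any real articulatory data.

```python
# Hypothetical sketch of the duration-control analysis: is the *ratio* of
# onset-consonant to vowel duration more stable across prosodic conditions
# than either absolute duration? All measurements below are invented.
import statistics

# Invented durations (ms) for one target syllable under four prosodic
# conditions: focused, stressed, unstressed, and corrective emphasis.
onsets = [85.0, 78.0, 55.0, 92.0]
vowels = [170.0, 155.0, 110.0, 180.0]

def coeff_variation(xs):
    """Coefficient of variation: sample stdev scaled by the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

ratios = [c / v for c, v in zip(onsets, vowels)]

print(f"CV of onset durations:   {coeff_variation(onsets):.3f}")
print(f"CV of vowel durations:   {coeff_variation(vowels):.3f}")
print(f"CV of onset/vowel ratio: {coeff_variation(ratios):.3f}")
# A markedly lower CV for the ratio would suggest that relative timing,
# not absolute duration, is the variable under active linguistic control.
```

In this toy data set the ratio is far more stable than either absolute duration, which on the logic above would point to relative timing as the linguistically controlled variable.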

Conclusion—Speech Production

Despite targeting a directly observable, overt linguistic behavior, speech production research has had no more success than its complement in speech perception at discovering decisive answers to the foundational questions of linguistic representational structure or the processes governing spoken language use. Due to the joint endeavors of acoustic and articulatory phonetics, our understanding of the nature of the acoustic speech signal and how it is produced has increased tremendously, and each new discovery points to new questions. If the basic units of speech are gesture-based, what methods and strategies do listeners use in order to perceive them from acoustics? Are there testable differences between acoustic and articulatory theories of representation? What aspects of speech production are under demonstrable active control, and how do the many components of the linguistic and biological systems work together across speakers and social and linguistic contexts? Although new lines of inquiry are promising, speech production research seems to have only begun to scratch the surface of the complexities of speech communication.


As Roger Moore recently pointed out, the nature of the standard scientific method is such that ‘it leads inevitably to greater and greater knowledge about smaller and smaller aspects of a problem’ (Ref. 45, p. 419). Speech scientists followed good scientific practice when they effectively split the speech communication problem, one of the most complex behaviors of a highly complex species, into more manageable chunks. And the perceptual and productive aspects of speech each provided enough of a challenge, as we have seen, that researchers had plenty to work on without adding anything. Yet we have also seen that neither discipline on its own has been able to answer fundamental questions regarding linguistic knowledge, representation, and processing.

Although the scientific method serves to separate aspects of a phenomenon, the ultimate goal of any scientific enterprise is to unify individual discoveries, uncovering connections and regularities that were previously hidden. 46 One of the great scientific breakthroughs of the 19th century, for example, brought together the physics of electricity and magnetism, previously separate fields, and revealed them to be variations of the same basic underlying principles. Similarly, where research isolated to either speech perception or production has failed to find success, progress may lie in the unification of the disciplines. And unlike electricity and magnetism, the a priori connection between speech perception and speech production is clear: they are two sides of the same process, two links in Denes and Pinson’s famous ‘speech chain’. 47 Moreover, information theory demands that the signals generated in speech production match those received in perception, a criterion known as ‘signal parity’ that must be met for successful communication to take place; the two processes must therefore, at some point, deal in the same linguistic currency. 48

In this final section, we will discuss theories and experimental evidence that highlight the deep, inherent links between speech perception and production. Perhaps by bringing together the insights achieved within each separate line of inquiry, the recent evidence pointing to the nature of the connection between them, and several theories of how they may work together in speech communication, we can point to where the most exciting new research questions lie in the future.

Early Evidence—Audiovisual Speech Perception

Lurking behind the idea that speech perception and production may be viewed as parts of a unified speech communication process is the assumption that speech consists of more than just acoustic patterns and motor plans that happen to coincide. Rather, the currency of speech must somehow combine the domains of perception and production, satisfying the criterion of signal parity discussed above. Researchers such as Kluender, Diehl, and colleagues reject this assumption, taking a more ‘separatist’ stance: they hold that speech is processed by listeners like any other acoustic signal, without input from or reference to complementary production systems. 49 , 50 Much of the research described in this section runs counter to such a ‘general auditory’ view, but none so directly challenges its fundamental assumptions as the phenomenon of audiovisual speech perception.

The typical view of speech, fully embraced thus far here, puts audition and acoustics at the fore. However, visual and other sensory cues also play important roles in the perception of a speech signal, augmenting or occasionally even overriding a listener’s auditory input. Very early in speech perception research, Sumby and Pollack showed that simply seeing a speaker’s face during communication in background noise can provide listeners with massive gains in speech intelligibility, with no change in the acoustic signal. 51 Similarly, it has been well documented that access to a speaker’s facial dynamics improves the accuracy and ease of speech perception for listeners with mild to severe hearing loss 52 and even deaf listeners with cochlear implants. 53 , 54 Perhaps no phenomenon demonstrates this multimodal integration as clearly or has attracted more attention in the field than the effect reported by McGurk and MacDonald in 1976. 55 When listeners receive simultaneous, mismatching visual and auditory speech input, such as a face articulating the syllable ba paired with the acoustics for ga, they typically experience a unified percept da that appears to combine features of both signals while matching neither. In cases of a closer match—between visual va and auditory ba, for example—listeners tend to perceive va, adhering to the visual rather than the auditory signal. The effect is robust even when listeners are aware of the mismatch, and it has been observed with conflicting tactile rather than visual input 56 and with pre-lingual infants. 57 As these last cases show, the effect cannot be due to extensive experience linking visual and auditory speech information. Instead, the McGurk effect and the intelligibility benefits of audiovisual speech perception provide strong evidence for the inherently multimodal nature of speech processing, contrary to a ‘general auditory’ view. Taken as a whole, this audiovisual evidence supports the assumptions that make possible the discussion below of links between speech perception and production.
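One way to make multimodal integration concrete is a multiplicative combination of per-category support from each modality, in the spirit of multiplicative models such as Massaro’s fuzzy logical model of perception. This is a hedged sketch, not the mechanism proposed in the studies cited here, and the support values are invented purely for illustration.

```python
# Hedged sketch of audiovisual cue integration: per-category support from
# each modality is combined multiplicatively and normalized, so a category
# moderately supported by BOTH modalities can beat one strongly supported
# by only one of them. All support values below are invented.

def integrate(auditory, visual):
    """Multiply per-category support across modalities, then normalize
    to a probability distribution over candidate percepts."""
    raw = {cat: auditory[cat] * visual[cat] for cat in auditory}
    total = sum(raw.values())
    return {cat: s / total for cat, s in raw.items()}

# Invented support: the audio is /ba/, the face articulates /ga/; /da/
# receives moderate support from both modalities.
auditory = {"ba": 0.80, "da": 0.55, "ga": 0.05}
visual   = {"ba": 0.05, "da": 0.55, "ga": 0.80}

percept = integrate(auditory, visual)
best = max(percept, key=percept.get)
print(best)  # the blended percept dominates, as in the McGurk effect
```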

Phonetic Convergence

Recent work by Pardo builds on the literature of linguistic ‘alignment’ to find further evidence of an active link between speech perception and production in ‘real-time,’ typical communicative tasks. She had pairs of speakers play a communication game called the ‘map task,’ where they must cooperate to copy a path marked on one speaker’s map to the other’s blank map without seeing one another. The speakers refer repeatedly to certain landmarks on the map, and Pardo examined their productions of these target words over time. She asked naive listeners to compare a word from one speaker at both the beginning and end of the game with a single recording of the same word said by the other speaker. Consistently across pairs, she found that the recordings from the end of the task were judged to be more similar than those from the beginning. Previous studies have shown that speakers may align in their patterns of intonation, 58 for example, but Pardo’s are the first results demonstrating such alignment at the phonetic level in an ecologically valid speech setting.

This ‘phonetic convergence’ phenomenon defies explanation unless the processes of speech perception and subsequent production are somehow linked within an individual. Otherwise, what a speaker hears his or her partner say could not affect subsequent productions. Further implications of the convergence phenomenon become apparent in light of the categorical perception literature described in ‘Categorical Perception Effects’ above. In those robust speech perception experiments, listeners appear unable to reliably detect differences in the acoustic realization of particular segments. 9 Yet the convergence observed in Pardo’s work seems to operate at the sub-phonemic level, effecting subtle changes within linguistic categories (i.e., convergence results do not depend on whole-segment substitutions, but on much more fine-grained effects).
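One hedged way to picture sub-phonemic convergence acoustically (Pardo’s own study used perceptual AXB judgments, not this measure) is to ask whether the distance between two talkers’ tokens in formant space shrinks over the task while every token stays within the same phonemic category. The formant values below are invented.

```python
# Invented illustration of sub-phonemic convergence: the acoustic distance
# between two talkers' tokens of the same vowel shrinks from early to late
# in the task, without any whole-segment substitution.
import math

def formant_distance(a, b):
    """Euclidean distance between two (F1, F2) measurements in Hz."""
    return math.dist(a, b)

# Hypothetical (F1, F2) values for the vowel in a repeated landmark word.
talker_a_early = (620.0, 1750.0)
talker_a_late  = (600.0, 1800.0)
talker_b       = (590.0, 1820.0)

early = formant_distance(talker_a_early, talker_b)
late = formant_distance(talker_a_late, talker_b)
print(early > late)  # True: the late token sits closer to the partner's
```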

As Pardo’s results show, the examination of links between speech perception and production has already pointed toward new answers to some old questions. Perhaps we do not understand categorical perception effects as well as we thought—if the speech listeners hear can have these gradient within-category effects on their own speech production, then why is it that they cannot access these details in the discrimination tasks of classic categorical perception experiments? And what are the impacts of the answer for various representational theories of speech?

Perception-Driven Adaptation in Speech Production

Despite the typical separation between speech perception and production, the idea that the two processes interact or are coupled within individual speakers is not new. In 1990, Björn Lindblom introduced his ‘hyper-articulation and hypo-articulation’ (H&H) theory, which postulated that speakers’ production of speech is subject to two conflicting forces: economy of effort and communicative contrast. 41 The first pressures speech to be ‘hypo-articulated,’ with maximally reduced articulatory movements and maximal overlap between movements. In keeping with the theory’s roots in speech production research, this force stems from a speaker’s motor system. The contrasting pressure for communicative distinctiveness pushes speakers toward ‘hyper-articulated’ speech, executed so as to be maximally clear and intelligible, with minimal co-articulatory overlap. Crucially, this force stems from listener-oriented motivation. Circumstances that make listeners less likely to correctly perceive a speaker’s intended message—ranging from physical factors like the presence of background noise, to psychological factors such as the lexical neighborhood density of a target word, to social factors such as a lack of shared background between the speaker and listener—cause speakers to hyper-articulate, expending greater articulatory effort to ensure transmission of their linguistic message.
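The tug-of-war H&H posits can be caricatured as a constrained minimization: choose the least articulatory effort whose predicted intelligibility still clears a threshold. The linear intelligibility model and all constants below are invented for illustration and are not part of Lindblom’s theory.

```python
# Toy sketch of H&H's two opposing pressures. Effort runs from 0.0
# (fully hypo-articulated) to 1.0 (fully hyper-articulated); the speaker
# picks the cheapest setting that keeps the message intelligible.
# The intelligibility model and its coefficients are invented.

def required_hyperarticulation(noise_level, threshold=0.8):
    """Return the least effortful articulation setting whose predicted
    intelligibility meets the threshold under the given noise level."""
    for effort in (i / 100 for i in range(101)):
        intelligibility = max(0.0, min(1.0,
                              0.9 + 0.4 * effort - 0.5 * noise_level))
        if intelligibility >= threshold:
            return effort
    return 1.0  # even maximal hyper-articulation may fall short

print(required_hyperarticulation(0.1))  # quiet: hypo-articulation suffices
print(required_hyperarticulation(0.9))  # noisy: speaker hyper-articulates
```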

For nearly a hundred years, speech scientists have known that physical conditions such as background noise affect speakers’ production. As Lane and Tranel neatly summarized, a series of experiments stemming from the work of Etienne Lombard in 1911 unequivocally showed that the presence of background noise causes speakers not only to raise the level of their speech relative to the amplitude of the noise, but also to alter their articulation style in ways similar to those predicted by H&H theory. 59 No matter the eventual status of H&H theory in all its facets, this ‘Lombard Speech’ effect empirically demonstrates a real and immediate link between what speakers are hearing and the speech they produce. As even this very early work demonstrates, speech production does not operate in a vacuum, free from the influences of its perceptual counterpart; the two processes are coupled and closely linked.

Much more recent experimental work has demonstrated that speakers’ perception of their own speech can be subject to direct manipulation, as opposed to the more passive introduction of noise used in inducing Lombard speech, and that the resulting changes in production are immediate and extremely powerful. In one experiment conducted by Houde and Jordan, for example, speakers repeatedly produced a target vowel [ε], as in bed, while hearing their speech only through headphones. The researchers ran the speech through a signal processing program which calculated the formant frequencies of the vowel and shifted them incrementally toward the frequencies characteristic of [æ], raising the first formant and lowering the second. Speakers were completely unaware of the real-time alteration of the acoustics corresponding to their speech production, but they incrementally shifted their articulation of [ε] to compensate for the researchers’ manipulation: they began producing lower first formants and higher second formants. This compensation was so dramatic that speakers who began by producing [ε] ended the experiment by saying vowels much closer to [i] (when heard outside the formant-shifting influence of the manipulation). 60
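The compensation dynamic in this paradigm can be simulated with a simple error-correcting feedback loop; the target, shift, and gain values below are invented, not Houde and Jordan’s actual parameters.

```python
# Minimal simulation of a Houde & Jordan-style adaptation loop for a
# single formant. All values are invented for illustration.

TARGET_F1 = 580.0   # speaker's auditory target for [ε] first formant, Hz
SHIFT = 120.0       # experimenter's covert upward shift of perceived F1
GAIN = 0.3          # fraction of the perceived error corrected per trial

produced = TARGET_F1
for _ in range(50):
    heard = produced + SHIFT     # altered feedback through the headphones
    error = heard - TARGET_F1    # mismatch with the auditory target
    produced -= GAIN * error     # compensatory articulatory update

# The speaker ends up *producing* a lowered F1 so that the shifted
# feedback once again matches the target: compensation opposes the shift.
print(round(produced))  # ≈ TARGET_F1 - SHIFT
```

The fixed point of the loop is the production value at which the shifted feedback exactly matches the target, which is why the compensation mirrors the manipulation rather than merely damping it.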

Houde, Jordan, and other researchers working in this paradigm point out that such ‘sensorimotor adaptation’ phenomena demonstrate an extremely powerful and constantly active feedback system in operation during speech production. 61 , 62 Apparently, a speaker’s perception of his or her own speech plays a significant role in the planning and execution of future speech production.

The Role of Feedback—Modeling Spoken Language Use

In his influential theory of speech production planning and execution, Levelt makes explicit use of such perceptual feedback systems in production. 63 In contrast to Lindblom’s H&H theory, Levelt’s model (WEAVER++) was designed primarily to provide an account of how lexical items are selected from memory and translated into articulation, along with how failures in the system might result in typical speech errors. In Levelt’s model, speakers’ perception of their own speech allows them to monitor for errors and execute repairs. The model goes a step further, however, to posit another feedback loop entirely internal to the speaker, based on their experience with mappings between articulation and acoustics.

According to Levelt’s model, then, for any given utterance a speaker has several levels of verification and feedback. If, for example, a speaker decides to say the word day, the underlying representation of the lexical item is selected and prepared for articulation, presumably following the various steps of the model not directly relevant here. Once the articulation has been planned, the same ‘orders’ are sent to both the real speech organs and a mental emulator or ‘synthesizer’ of the speaker’s vocal tract. This emulator generates the acoustics that would be expected from the articulatory instructions it received, based on the speaker’s past experience with the mapping. The expected acoustics feed back to the underlying representation of day to check for a match with remembered instances of the word. While this internal process runs, the articulators are actually executing their movements and generating acoustics. That signal enters the speaker’s auditory pathway, where the resulting speech percept feeds back to the same underlying representation, once again checking for a match.
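The dual monitoring loops described here can be sketched schematically. Every function below is a stand-in invented for illustration, not WEAVER++’s actual machinery; the point is only that the internal loop checks predicted acoustics while the external loop checks what was actually heard.

```python
# Schematic sketch of dual feedback monitoring: an internal loop compares
# the forward model's *predicted* acoustics with the word's remembered
# form, while an external loop compares the *perceived* acoustics.
# All representations here are toy strings invented for illustration.

def plan_articulation(word):
    """Stand-in for lexical selection and motor planning."""
    return f"motor-plan({word})"

def emulate_acoustics(motor_plan):
    """Internal forward model: acoustics predicted from the plan alone."""
    return f"acoustics-of-{motor_plan}"

def execute(motor_plan, distortion=""):
    """Actual articulation; the environment may distort the signal."""
    return f"acoustics-of-{motor_plan}" + distortion

def auditory_target(word):
    """Remembered acoustic form of the word, used by both checks."""
    return f"acoustics-of-motor-plan({word})"

def monitor(word, distortion=""):
    plan = plan_articulation(word)
    predicted = emulate_acoustics(plan)     # fast internal loop
    perceived = execute(plan, distortion)   # slower external loop
    return (predicted == auditory_target(word),
            perceived == auditory_target(word))

print(monitor("day"))             # both loops report a match
print(monitor("day", "+noise"))   # only the external loop detects noise
```

Only the external loop can catch a distortion imposed by the world, which is the functional reason for keeping both loops despite their apparent redundancy.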

Such a system may seem redundant, but each component has important properties. As Moore points out for his own model (see below), internal feedback loops of the type described in Levelt’s work allow speakers to repair errors much more quickly than reliance on external feedback would permit, which translates to significant evolutionary advantages. 45 Without external loops backing up the internal systems, however, speakers might miss changes to their speech imposed by physical conditions (e.g., noise). Certainly the adaptation observed in Houde and Jordan’s work demonstrates active external feedback control over speech production: only an external loop could catch the disparity between the acoustics a speaker actually perceives and his or her underlying representation. And indeed, similar feedback-reliant models have been proposed as the underpinnings of non-speech movements such as reaching. 64

As suggested above, Moore has recently proposed a model of speech communication that also incorporates multiple feedback loops, both internal and external. 45 His Predictive Sensorimotor Control and Emulation (PRESENCE) model goes far beyond the specifications of Levelt’s production model, however, to incorporate additional feedback loops that allow the speaker to emulate the listener’s emulation of the speaker, and active roles for traditionally ‘extra-linguistic’ systems such as the speaker’s affective or emotional state. In designing his model, Moore attempts to take the first step in what he argues is the necessary unification of not just research on speech perception and production, but the work related to speech in many other fields as well, such as neuroscience, automated speech recognition, text-to-speech synthesis, and biology, to name just a few. 45

Perhaps most fundamental to our discussion here, however, is the role of productive emulation or feedback during speech perception in the model. Where Levelt’s model deals primarily with speech production, Moore’s PRESENCE incorporates both speech perception and production, deliberately emphasizing their interdependence and mutually constraining relationship. According to his model, speech perception takes place with the aid of listener-internal emulation of the acoustic-to-articulatory mapping potentially responsible for the received signal. As Moore puts it, speech perception in his model is essentially a revisiting of the idea of ‘recognition-by-synthesis’ (e.g., Ref 65), whereas speech production is (as in Levelt) ‘synthesis by recognition.’

Neurobiological Evidence—Mirror Neurons

The experimental evidence we considered above suggests pervasive links between what listeners hear and the speech they produce. Conversational partners converge in their production of within-category phonetic detail, speakers alter their speech styles in adverse listening conditions, and manipulation of speakers’ acoustic feedback from their own speech can dramatically change the speech they produce in response. As we also considered, various theoretical models of spoken language use have been proposed to account for these phenomena and the observed perceptual and productive links. Until recently, however, very little neurobiological evidence supported these proposals. The idea of a speaker-internal vocal tract emulator, for instance, seemed highly implausible to many speech scientists; how would the brain possibly implement such a structure?

Cortical populations of newly discovered ‘mirror neurons,’ however, seem to provide a plausible neural substrate for proposals of direct, automatic, and pervasive links between speech perception and production. These neurons ‘mirror’ in the sense that they fire both when a person performs an action and when that person perceives someone else performing the same action, whether visually or through some other (auditory, tactile) perceptual mode. Human mirror neuron populations appear to be clustered in several cortical areas, including the pre-frontal cortex, which is often implicated in behavioral inhibition and other executive functions, and areas typically recognized as centers of speech processing, such as Broca’s area (for an in-depth review of the literature and implications, see Ref 66).

Neurons which physically equate (or at least directly link) an actor’s production and perception of a specific action have definite implications for theories linking speech perception and production: they provide a potential biological mechanism. The internal feedback emulators hypothesized most recently by Levelt and Moore could potentially be realized in mirror neuron populations, which would emulate articulatory-to-acoustic mappings (and vice versa) via their mutual sensitivity to both processes and their connectivity to both sensory and motor areas. Regardless of their specific applicability to Levelt’s and Moore’s models, however, these neurons do appear to be active during speech perception, as one study using Transcranial Magnetic Stimulation (TMS) demonstrates elegantly. TMS allows researchers to temporarily attenuate or elevate the background activation of a specific brain area: attenuation induces a state similar to the brain damage caused by a stroke or lesion, while elevation makes any slight increase in the area’s activity produce overt behavior whose consequences would not normally be observable. The latter excitation technique was used by Fadiga and colleagues, who raised the background activity of specific motor areas controlling the tongue tip. When the ‘excited’ subjects then listened to speech containing consonants whose articulation curls the tongue upward, their tongues twitched correspondingly. 67 Perceiving the speech caused activation of the motor plans that would be used in producing the same speech—direct evidence of the link between speech perception and production.

Perception/Production Links—Conclusion

Clearly, the links between speech perception and production are inherent in our use of spoken language. They are active during typical speech perception (TMS mirror neuron study), are extremely powerful, automatic, and rapid (sensorimotor adaptation), and influence even highly ecologically valid communication tasks (phonetic convergence). Spoken language processing, therefore, seems to represent a linking of sensory and motor control systems, as the pervasive effects of visual input on speech perception suggest. Indeed, speech perception cannot be just sensory interpretation, and speech production cannot be just motor execution. Rather, both processes draw on common resources, using them in tandem to accomplish remarkable tasks such as generalizing from talker to talker and acquiring new lexical items. As new information regarding these links comes to light, models such as Lindblom’s H&H, Levelt’s WEAVER++, and Moore’s PRESENCE will come to reflect more fully the actual capabilities of language users (simultaneous speakers and listeners) and will face greater constraints on their hypotheses and mechanisms. Hopefully, theory and experimental evidence will then converge to reveal how speech perception and production interact in the highly complex act of vocal communication.


Despite the strong intuitions and theoretical traditions of linguists, psychologists, and speech scientists, spoken language does not appear to consist straightforwardly of linear sequences of discrete, idealized, abstract, context-free symbols such as phonemes or segments. This discovery raises the question, however: how does speech convey equivalent information across talkers, dialects, and contexts? And how do language users mentally represent both the variability and the constancy in the speech they hear?

New directions in research on speech perception include theories of exemplar-based representation of speech and experiments designed to discover the specificity, generalized application, and flexibility of listeners’ perceptual representations. Efforts to focus on more ecologically valid tasks such as spoken word recognition also promise fruitful progress in coming years, particularly those which provide tests of theoretical and computational models. In speech production, meanwhile, the apparent convergence of acoustic and articulatory theories of representation points to the emerging potential for exciting new lines of research combining their individual successes. At the same time, more and more speech scientists are turning their research efforts toward variability in speech, and what patterns of variation can reveal about speakers’ language-motivated control and linguistic knowledge.

Perhaps the greatest potential for progress and discovery, however, lies in continuing to explore the behavioral and neurobiological links between speech perception and production. Although made separate by practical and conventional scientific considerations, these two processes are inherently and intimately coupled, and it seems that we will never truly be able to understand the human capacity for spoken communication until they have been conceptually reunited.

Nature and Perception of Speech Sounds


  • Jean-Claude Junqua &
  • Jean-Paul Haton

Part of the book series: The Kluwer International Series in Engineering and Computer Science (SECS, volume 341)


This chapter reviews the fundamentals of speech production, acoustics and phonetics of speech sounds as well as their time-frequency representation. Then, the basic structure of the auditory system and the main mechanisms influencing speech perception are briefly described. Throughout this chapter, we also emphasize the influence of noise on speech production and perception. By introducing basic characteristics of speech sounds and how they are produced and perceived, we intend to provide the essential knowledge needed to understand the following chapters.



O’Shaughnessy, D. (1987). Speech Communication: Human and Machine. Addison-Wesley.

Peterson, G. and Barney, H. (1952). Control methods used in a study of vowels. J. Acoust. Soc. Am. , 24(2): 175–184.

Picheny, M., Durlach, N., and Braida, L. (1985). Speaking clearly for the hearing impaired I: Intelligibility differences between clear and conversational speech. Journal of Speech and Hearing Research , 28:96–103.

Picheny, M., Durlach, N., and Braida, L. (1986). Speaking clearly for the hard of hearing TL: Acoustic characteristics of clear and conversational speech. Journal of Speech and Hearing Research , 29:434–446.

Pick, H., Siegel J., Fox, P., Garber, S., and Kearney, J. (1989). Inhibiting the Lombard effect. J. Acoust. Soc. Am. , 85(2):894–900.

Pickett, J. (1956). Effects of vocal force on the intelligibility of speech sounds. J. Acoust. Soc. Am. , 28(5):902–905.

Pickett, J. (1980). The Sounds of Speech Communication. University Park Press.

Pisoni, D., Bernacki, R., Nusbaum, H., and Yuchtman, M. (1985). Some acoustic-phonetic correlates of speech produced in noise. In ICASSP , pages 1581–1584.

Rose, J., Brugge, J., Anderson, D., and Hind, J. (1967). Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. J. Neu-rophysiol. , 30:769–793.

Rose, J., Hind, J., Anderson, D., and Brugge, J. (1971). Some effects of stimulus intensity on response of auditory nerve fibers in the squirrel monkey. Neurophysiol , 34:685–699.

Rostolland, D. (1982a). Acoustic features of shouted voice. Acustica , 50(2): 118–125.

Rostolland, D. (1982b). Phonetic structure of shouted voice. Acustica , 51(2):80–89.

Rostolland, D. (1985). Intelligibility of shouted voice. Acustica , 57(3): 104–121.

Schulman, R. (1985). Articulatory targeting and perceptual constancy of loud speech. Technical report, PERILUS IV, Stockholm University.

Schulman, R. (1989). Articulatory dynamics of loud and normal speech. J. Acoust. Soc. Am. , 85(1):295–312.

Stanton, B., Jamieson, L., and Allen, G. (1988). Acoustic-phonetic analysis of loud and Lombard speech in simulated cockpit conditions. In ICASSP , pages 331–334.

Stevens, K. (1956). Stop consonants. Technical report, Acoustic Lab., Massachusetts Institute of Technology.

Stevens, K. (1971). Airflow and turbulent noise for fricative and stop consonants: Statistic considerations. J. Acoust. Soc. Am. , 50:1180–1192.

Stevens, S. and Volkmann, J. (1940). The relation of pitch to frequency. Am. J. Psychol. , 53(4, part 2):329.

Strevens, P. (1960). Spectra of fricative noise in human speech. Language & Speech , 3:32–49.

Summers, W., Pisoni, D., Bernacki, R., Pedlow, R., and Stokes, M. (1988). Effects of noise on speech production: Acoustic and perceptual analyses. J. Acoust. Soc. Am. , 84(3):917–928.

Traunmüller, H. (1985). The role of the fundamental and the higher formants in the perception of speaker size, vocal effort, and vowel openess. Technical report, Stockholm University.

Ungeheuer, G. (1962). Elemente Einer Akustischen Theorie der Vokalarticulation. Springer-Verlag.

von Békésy, G. (1960). Experiments in Hearing. McGraw-Hill.

Whitehead, R., Metz, D., and Whitehead, B. (1984). Vibration patterns of the vocal folds during pulse register phonation. J. Acoust. Soc. Am. , 75(4): 1293–1996.

Wickelgren, W. A. (1966). Distinctive features and errors in short-term memory for English consonants. J. Acoust. Soc. Am. , 39:388–398.

Zahorian, S. and Rothenberg, M. (1981). Principal-component analysis for low-redundancy encoding of speech spectra. J. Acoust. Soc. Am. , 69(3):832–845.

Zwicker, E. and Feldtkeller, R. (1981). Psychoacoustique: L’oreille Récepteur d’Informations. Masson.

Zwicker, E. and Terhardt, E. (1980). Analytical expressions for critical band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. , 68(5): 1523–1525.

Zwislocki, J. (1959). Electrical model of the middle ear. J. Acoust. Soc. Am. , 31:841

Download references

Author information

Authors and affiliations:

  • Jean-Claude Junqua, Speech Technology Laboratory, USA
  • Jean-Paul Haton, CRIN - INRIA, France

Copyright information

© 1996 Kluwer Academic Publishers

About this chapter

Junqua, JC., Haton, JP. (1996). Nature and Perception of Speech Sounds. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_1

DOI: https://doi.org/10.1007/978-1-4613-1297-0_1

Publisher Name: Springer, Boston, MA

Print ISBN: 978-1-4612-8555-7

Online ISBN: 978-1-4613-1297-0

eBook Packages: Springer Book Archive




What does speech production mean?

Definitions for speech production (speech pro·duc·tion). This page includes the possible meanings, example usage, and translations of the word speech production.

Princeton's WordNet

speaking, speech production noun

the utterance of intelligible speech

Wikipedia

Speech production

Speech production is the process by which thoughts are translated into speech. This includes the selection of words, the organization of relevant grammatical forms, and then the articulation of the resulting sounds by the motor system using the vocal apparatus. Speech production can be spontaneous, such as when a person creates the words of a conversation; reactive, such as when they name a picture or read aloud a written word; or imitative, such as in speech repetition. Speech production is not the same as language production, since language can also be produced manually by signs.

In ordinary fluent conversation people pronounce roughly four syllables, ten or twelve phonemes, and two to three words out of their vocabulary (which can contain 10,000 to 100,000 words) each second. Errors in speech production are relatively rare, occurring at a rate of about once in every 900 words in spontaneous speech. Words that are commonly spoken, learned early in life, or easily imagined are quicker to say than ones that are rarely said, learned later in life, or abstract.

Normally speech is created with pulmonary pressure provided by the lungs, which generates sound by phonation through the glottis in the larynx; that sound is then modified by the vocal tract into different vowels and consonants. However, speech production can occur without the lungs and glottis, as in alaryngeal speech, which uses the upper parts of the vocal tract; an example is Donald Duck talk. The vocal production of speech may be accompanied by hand gestures that enhance the comprehensibility of what is being said.

The development of speech production over an individual's life starts with an infant's first babble and is transformed into fully developed speech by the age of five. The first stage of speech does not occur until around age one (the holophrastic phase). Between the ages of one and a half and two and a half the infant can produce short sentences (the telegraphic phase). After two and a half years the infant develops systems of lemmas used in speech production. Around four or five the child's stock of lemmas increases greatly, which enhances the production of correct speech; the child can now produce speech like an adult. Adult speech production proceeds in four stages: activating lexical concepts, selecting the needed lemmas, morphologically and phonologically encoding the word, and phonetically encoding it.

ChatGPT

Speech production

Speech production refers to the process by which thoughts are translated into spoken language. This involves the selection of appropriate words, organizing them in the correct grammatical structure, and then physically producing the necessary sounds through the coordinated action of the lungs, vocal cords, tongue, and lips.


  • Princeton's WordNet: http://wordnetweb.princeton.edu/perl/webwn?s=speech production
  • Wikipedia: https://en.wikipedia.org/wiki/Speech_Production
  • ChatGPT: https://chat.openai.com

"speech production." Definitions.net. STANDS4 LLC, 2024. Web. 20 May 2024. <https://www.definitions.net/definition/speech+production>.





  1. Speech production

    Speech production is the process by which thoughts are translated into speech. This includes the selection of words, the organization of relevant grammatical forms, and then the articulation of the resulting sounds by the motor system using the vocal apparatus. Speech production can be spontaneous such as when a person creates the words of a ...

  2. Speech Production

    Definition. Speech production is the process of uttering articulated sounds or words, i.e., how humans generate meaningful speech. It is a complex feedback process in which hearing, perception, and information processing in the nervous system and the brain are also involved. Speaking is in essence the by-product of a necessary bodily process ...

  3. Articulating: The Neural Mechanisms of Speech Production

    Abstract. Speech production is a highly complex sensorimotor task involving tightly coordinated processing across large expanses of the cerebral cortex. Historically, the study of the neural underpinnings of speech suffered from the lack of an animal model. The development of non-invasive structural and functional neuroimaging techniques in the ...

  4. Speech Production

    Speech production is a complex process that includes the articulation of sounds and words, relying on the intricate interplay of hearing, perception, and information processing by the brain and ...

  5. Speech Production

    Speech production is one of the most complex human activities. It involves coordinating numerous muscles and complex cognitive processes. The area of speech production is related to Articulatory Phonetics, Acoustic Phonetics and Speech Perception, which are all studying various elements of language and are part of a broader field of Linguistics.

  6. 2.1 How Humans Produce Speech

    Speech is produced by bringing air from the lungs to the larynx (respiration), where the vocal folds may be held open to allow the air to pass through or may vibrate to make a sound (phonation). The airflow from the lungs is then shaped by the articulators in the mouth and nose (articulation). The field of phonetics studies the sounds of human ...

  7. 9.2 The Standard Model of Speech Production

    Figure 9.2 The Standard Model of Speech Production. The Standard Model of Word-form Encoding as described by Meyer (2000), illustrating five levels of processing (conceptualization, lemma, morphemes, phonemes, and the phonetic level), using the example word "tiger". From top to bottom, the levels are:
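The five levels named in the snippet can be laid out as a simple data structure. This is only an illustrative sketch of the levels' order and of the kind of representation each holds for "tiger"; the actual model specifies activation dynamics that are not captured here, and the transcriptions are informal placeholders, not taken from the source.

```python
# Illustrative sketch (not the model itself): the five levels of the
# Standard Model of word-form encoding, top (conceptual) to bottom
# (phonetic), for the example word "tiger".
standard_model_levels = [
    ("conceptualization", "the lexical concept TIGER"),
    ("lemma", "the syntactic word 'tiger' (noun)"),
    ("morphemes", ["<tiger>"]),                 # monomorphemic word
    ("phonemes", ["t", "aI", "g", "@", "r"]),   # informal broad transcription
    ("phonetic", "syllabified, articulable form ['taI.g@r']"),
]

# Walk the levels in processing order, from concept to articulable form.
for level, content in standard_model_levels:
    print(f"{level}: {content}")
```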

  8. Speech

    Speech is the faculty of producing articulated sounds, which, when blended together, form language. Human speech is served by a bellows-like respiratory activator, which furnishes the driving energy in the form of an airstream; a phonating sound generator in the larynx (low in the throat) to transform the energy; a sound-molding resonator in ...

  9. Speech production

    Definition. Speech production refers to the process of formulating and expressing spoken words or sounds. It involves coordinating muscles, such as those controlling breathing, vocal cords, tongue, and lips. Analogy. Imagine speech production as a symphony orchestra where each musician represents different muscles involved in speaking. They ...

  10. Speech Production

    A theory of speech production provides an account of the means by which a planned sequence of language forms is implemented as vocal tract activity that gives rise to an audible, intelligible acoustic speech signal. Such an account must address several issues. Two central issues are considered in this article.

  11. The anatomical and physiological basis of human speech production

    Human speech production involves a range of physical features which may have evolved as specific adaptations for this purpose; alternatively, they evolved as exaptations, commandeering existing features. Combining knowledge of the anatomical and physiological basis of human speech production, comparisons with other primate species, and ...

  12. Phonetics and Speech Processing

    Phonetics. The process of human speech production relies foremost on breathing out. The lungs expel air during speech at a controlled rate (called speech breathing). The air passes through the larynx, which contains the vocal folds (often called "vocal cords"), whose positioning can be finely tuned by a panoply of laryngeal muscles. The term glottis refers to the space between the vocal folds.

  13. Speech Production

    Speech Production. J. Harrington, C. Mooshammer, in Encyclopedia of Language & Linguistics (Second Edition), 2006 Exemplar Theory. Weaver and many other speech production models based on performance errors adopt the idea from generative phonology that there is a phonological grammar and a component for phonetic implementation that is independent of the words in the lexicon.

  14. 1

    The production of a speech sound may be divided into four separate but interrelated processes: the initiation of the air stream, normally in the lungs; its phonation in the larynx through the operation of the vocal folds; its direction by the velum into either the oral cavity or the nasal cavity (the oro-nasal process); and finally its ...

  15. The Source-Filter Theory of Speech

    To systematically understand the mechanism of speech production, the source-filter theory divides the process into two stages (Chiba & Kajiyama, 1941; Fant, 1960) (see figure 1): (a) the air flow coming from the lungs induces tissue vibration of the vocal folds, which generates the "source" sound. Turbulent noise sources are also created at constricted parts of the glottis or the vocal tract.
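The two-stage decomposition sketched above is commonly summarized, under the simplifying assumption that source and filter are linear and independent, as a product of spectra (the symbols below are the conventional textbook ones, not taken from the snippet):

```latex
% Source-filter model: the speech spectrum S(f) factors into a glottal
% source spectrum U(f), a vocal-tract transfer function H(f), and a
% lip-radiation characteristic R(f).
S(f) = U(f)\, H(f)\, R(f)
```

For voiced sounds U(f) comes from the quasi-periodic vibration of the vocal folds, while for fricatives it is replaced or supplemented by turbulent noise generated at a constriction.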

  16. The Handbook of Speech Production

    The Handbook of Speech Production is the first reference work to provide an overview of this burgeoning area of study. Twenty-four chapters written by an international team of authors examine issues in speech planning, motor control, the physical aspects of speech production, and external factors that impact speech production. Contributions bring together behavioral, clinical, computational ...

  17. Speech perception and production

    Speech production research serves as the complement to the work on speech perception described above. Where investigations of speech perception are necessarily indirect, using listener response time latencies or recall accuracies to draw conclusions about underlying linguistic processing, research on speech production can be refreshingly direct


  18. Robustness in Automatic Speech Recognition

    … noise on speech production and perception. By introducing basic characteristics of speech sounds and how they are produced and perceived, we intend to provide the essential knowledge needed to understand the following chapters. (J.-C. Junqua et al., Robustness in Automatic Speech Recognition, Part A: Speech Communication by Humans and Machines)

  19. Speech production

    speech production: n. the utterance of intelligible speech. Synonyms: speaking. Types: speech (the exchange of spoken words); susurration, voicelessness, whisper, whispering (speaking softly without vibration of the vocal cords); stage whisper (a loud whisper that can be overheard; on the stage it is heard by the audience but is supposed to be ...)

  20. Speech Sound Disorders-Articulation and Phonology

    Speech Sound Disorders. Speech sound disorders is an umbrella term referring to any difficulty or combination of difficulties with perception, motor production, or phonological representation of speech sounds and speech segments, including the phonotactic rules governing permissible speech sound sequences in a language. Speech sound disorders can be organic or functional in nature.

  21. Phonetics

    phonetics, the study of speech sounds and their physiological production and acoustic qualities. It deals with the configurations of the vocal tract used to produce speech sounds (articulatory phonetics), the acoustic properties of speech sounds (acoustic phonetics), and the manner of combining sounds so as to make syllables, words, and ...

  22. Phonation

    The term phonation has slightly different meanings depending on the subfield of phonetics. Among some phoneticians, phonation is the process by which the vocal folds produce certain sounds through quasi-periodic vibration. This is the definition used among those who study laryngeal anatomy and physiology and speech production in general. Phoneticians in other subfields, such as linguistic ...

  23. What does speech production mean?

    Definition of speech production in the Definitions.net dictionary. Meaning of speech production. Information and translations of speech production in the most comprehensive dictionary definitions resource on the web.