Phonetic Feature Encoding in Cortex

Scientific Understanding of Consciousness
Consciousness as an Emergent Property of Thalamocortical Activity

Science 28 February 2014: Vol. 343 no. 6174 pp. 1006-1010

Phonetic Feature Encoding in Human Superior Temporal Gyrus

Nima Mesgarani, Connie Cheung, Keith Johnson, Edward F. Chang

¹Department of Neurological Surgery, Department of Physiology, and Center for Integrative Neuroscience, University of California, San Francisco, CA 94143, USA.

²Department of Linguistics, University of California, Berkeley, CA 94720, USA.

[paraphrase]

During speech perception, linguistic elements such as consonants and vowels are extracted from a complex acoustic speech signal. The superior temporal gyrus (STG) participates in high-order auditory processing of speech, but how it encodes phonetic information is poorly understood. We used high-density direct cortical surface recordings in humans while they listened to natural, continuous speech to reveal the STG representation of the entire English phonetic inventory. At single electrodes, we found response selectivity to distinct phonetic features. Encoding of acoustic properties was mediated by a distributed population response. Phonetic features could be directly related to tuning for spectrotemporal acoustic cues, some of which were encoded in a nonlinear fashion or by integration of multiple cues. These findings demonstrate the acoustic-phonetic representation of speech in human STG.

Phonemes—and the distinctive features composing them—are hypothesized to be the smallest contrastive units that change a word’s meaning (e.g., /b/ and /d/ as in bad versus dad). The superior temporal gyrus (Brodmann area 22, STG) has a key role in acoustic-phonetic processing because it responds to speech over other sounds and focal electrical stimulation there selectively interrupts speech discrimination. These findings raise fundamental questions about the representation of speech sounds, such as whether local neural encoding is specific for phonemes, acoustic-phonetic features, or low-level spectrotemporal parameters. A major challenge in addressing this in natural speech is that cortical processing of individual speech sounds is extraordinarily spatially discrete and rapid.

We recorded direct cortical activity from six human participants implanted with high-density multielectrode arrays as part of their clinical evaluation for epilepsy surgery. These recordings provide simultaneous high spatial and temporal resolution while sampling population neural activity from temporal lobe auditory speech cortex. We analyzed high gamma (75 to 150 Hz) cortical surface field potentials, which correlate with neuronal spiking.

Participants listened to natural speech samples featuring a wide range of American English speakers (500 sentences spoken by 400 people). Most speech-responsive sites were found in posterior and middle STG (37 to 102 sites per participant, comparing speech versus silence, P < 0.01, t test). Neural responses demonstrated a distributed spatiotemporal pattern evoked during listening.

We segmented the sentences into time-aligned sequences of phonemes to investigate whether STG sites show preferential responses. We estimated the mean neural response at each electrode to every phoneme and found distinct selectivity. For example, electrode e1 showed large evoked responses to plosive phonemes /p/, /t/, /k/, /b/, /d/, and /g/. Electrode e2 showed selective responses to sibilant fricatives: /s/, /ʃ/, and /z/. The next two electrodes showed selective responses to subsets of vowels: low-back (electrode e3, e.g., /a/ and /aʊ/), high-front vowels and glides (electrode e4, e.g., /i/ and /j/). Last, neural activity recorded at electrode e5 was selective for nasals (/n/, /m/, and /ŋ/).

To quantify selectivity at single electrodes, we derived a metric indicating the number of phonemes with cortical responses statistically distinguishable from the response to a particular phoneme. This phoneme selectivity index (PSI) is computed over the 33 English phonemes, so for a given phoneme it ranges from PSI = 0 (nonselective) to PSI = 32 (extremely selective; Wilcoxon rank-sum test, P < 0.01). We determined an optimal analysis time window of 50 ms, centered 150 ms after the phoneme onset, by using a phoneme separability analysis (F-statistic). The average PSI over all phonemes summarizes an electrode’s overall selectivity. The average PSI was highly correlated with a site’s response magnitude to speech over silence (r = 0.77, P < 0.001, t test) and with the degree to which the response could be predicted with a linear spectrotemporal receptive field (STRF, r = 0.88, P < 0.001, t test). Therefore, the majority of speech-responsive sites in STG are selective to specific phoneme groups.
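The PSI described above can be illustrated with a minimal Python sketch. It assumes that single-trial response amplitudes (already averaged over the analysis window) are available per phoneme for one electrode. The function names `ranksum_p` and `psi_vector` and the toy data layout are hypothetical, and the rank-sum test uses the normal approximation without tie correction. It is not the authors' code.

```python
import math

def ranksum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def psi_vector(responses, alpha=0.01):
    """responses: dict phoneme -> list of response amplitudes at one electrode.
    Returns dict phoneme -> count of other phonemes with distinguishable responses."""
    phonemes = list(responses)
    return {
        p: sum(
            1 for q in phonemes
            if q != p and ranksum_p(responses[p], responses[q]) < alpha
        )
        for p in phonemes
    }

def average_psi(psi):
    """Summarize an electrode's overall selectivity as the mean PSI over phonemes."""
    return sum(psi.values()) / len(psi)
```

With 33 phonemes, each entry of the returned vector lies between 0 and 32, matching the range given in the text.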

To investigate the organization of selectivity across the neural population, we constructed an array containing PSI vectors for electrodes across all participants. In this array, each column corresponds to a single electrode, and each row corresponds to a single phoneme. Most STG electrodes are selective not to individual phonemes but to specific groups of phonemes. To determine selectivity patterns across electrodes and phonemes, we used unsupervised hierarchical clustering analyses. Clustering across rows revealed groupings of phonemes on the basis of similarity of PSI values in the population response. Clustering across columns revealed single electrodes with similar PSI patterns. These two analyses revealed complementary local- and global-level organizational selectivity patterns. We also replotted the array by using 14 phonetic features defined in linguistics to contrast distinctive articulatory and acoustic properties.
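As a rough illustration of the clustering step, the sketch below groups rows of a PSI array by agglomerative clustering. The text does not state the distance metric or linkage rule, so Euclidean distance and average linkage are assumptions here. The function names and toy values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering.
    vectors: dict label -> list of floats. Returns a list of sets of labels."""
    clusters = [{label} for label in vectors]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] | clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

def transpose(vectors, column_labels):
    """Turn a row-labelled array into a column-labelled one (for electrode clustering)."""
    return {
        col: [row[k] for row in vectors.values()]
        for k, col in enumerate(column_labels)
    }
```

Calling `agglomerate` on the phoneme rows groups phonemes by population response. Calling it on `transpose(rows, electrode_labels)` groups electrodes by PSI pattern. These correspond to the two complementary analyses in the text.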

The first tier of the single-electrode hierarchy analysis divides STG sites into two distinct groups: obstruent- and sonorant-selective electrodes. The obstruent-selective group is divided into two subgroups: plosive and fricative electrodes. Among plosive electrodes, some were responsive to all plosives, whereas others were selective to place of articulation (dorsal /g/ and /k/ versus coronal /d/ and /t/ versus labial /p/ and /b/) and voicing (separating voiced /b/, /d/, and /g/ from unvoiced /p/, /t/, and /k/). Fricative-selective electrodes showed weak, overlapping selectivity to coronal plosives (/d/ and /t/). Sonorant-selective cortical sites, in contrast, were partitioned into four partially overlapping groups: low-back vowels, low-front vowels, high-front vowels, and nasals.

Both clustering schemes revealed similar phoneme grouping based on shared phonetic features, suggesting that a substantial portion of the population-based organization can be accounted for by local tuning to features at single electrodes (similarity of average PSI values for the local and population subgroups of both clustering analyses; overall r = 0.73, P < 0.001). Furthermore, selectivity is organized primarily by manner of articulation distinctions and secondarily by place of articulation, corresponding to the degree and the location of constriction in the vocal tract, respectively. This systematic organization of speech sounds is consistent with auditory perceptual models positing that distinctions are most affected by manner contrasts compared with other feature hierarchies (articulatory or gestural theories).

We next determined what spectrotemporal tuning properties accounted for phonetic feature selectivity. We first determined the weighted average STRFs of the six main electrode clusters, weighting them proportionally by their degree of selectivity (average PSI). These STRFs show well-defined spectrotemporal tuning, highly similar to average acoustic spectrograms of phonemes in corresponding population clusters (average correlation = 0.67, P < 0.01, t test). For example, the first STRF shows tuning for broadband excitation followed by inhibition, similar to the acoustic spectrogram of plosives. The second STRF is tuned to a high frequency, which is a defining feature of sibilant fricatives. STRFs of vowel electrodes show tuning for characteristic formants that define low-back, low-front, and high-front vowels. Last, the STRF of nasal-selective electrodes is tuned primarily to low acoustic frequencies generated from heavy voicing and damping of higher frequencies. The average spectrogram analysis requires a priori phonemic segmentation of speech but is model-independent. The STRF analysis assumes a linear relationship between spectrograms and neural responses but is estimated without segmentation. Despite these differing assumptions, the strong match between the two analyses confirms that phonetic feature selectivity results from tuning to signature spectrotemporal cues.
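The weighting and comparison described above can be sketched as follows. The sketch assumes each STRF and each average spectrogram is stored as a frequency-by-time-lag grid of the same shape. The function names and the nested-list representation are illustrative choices, not details from the paper.

```python
import math

def weighted_average_strf(strfs, weights):
    """Average several STRFs (frequency x lag grids), weighting each by a
    selectivity score such as the electrode's average PSI."""
    total = sum(weights)
    n_freq, n_lag = len(strfs[0]), len(strfs[0][0])
    return [
        [
            sum(w * s[f][t] for s, w in zip(strfs, weights)) / total
            for t in range(n_lag)
        ]
        for f in range(n_freq)
    ]

def correlation_2d(a, b):
    """Pearson correlation between two grids of identical shape, flattened."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)
```

Under these assumptions, `correlation_2d(cluster_strf, cluster_spectrogram)` gives the kind of similarity value that the text reports as an average across clusters.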

We have thus far focused on local feature selectivity to discrete phonetic feature categories. We next wanted to address the encoding of continuous acoustic parameters that specify phonemes within vowel, plosive, and fricative groups. For vowels, we measured fundamental (F0) and formant (F1 to F4) frequencies. The first two formants (F1 and F2) play a major perceptual role in distinguishing different English vowels, despite tremendous variability within and across vowels. The optimal projection of vowels in formant space was the difference of F2 and F1 (first principal component), which is consistent with vowel perceptual studies. By using partial correlation analysis, we quantified the relationship between electrode response amplitudes and F0 to F4. On average, we observed no correlation between the sensitivity of an electrode to F0 with its sensitivity to F1 or F2. However, sensitivity to F1 and F2 was negatively correlated across all vowel-selective sites (r = –0.49, P < 0.01, t test), meaning that single STG sites show an integrated response to both F1 and F2. Furthermore, electrodes selective to low-back and high-front vowels showed an opposite differential tuning to formants, thereby maximizing vowel discriminability in the neural domain. This complex sound encoding suggests a specialized higher-order encoding of acoustic formant parameters and contrasts with studies of speech sounds in nonhuman species.
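Partial correlation measures how strongly a response tracks one acoustic parameter once the influence of another is removed. The sketch below uses the standard first-order formula with a single control variable. The analysis in the text controls for several parameters (F0 to F4) at once, so this is a simplified illustration with hypothetical function names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z
    from both (first-order partial correlation)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

For example, `partial_corr(response, f1, f2)` would estimate an electrode's sensitivity to F1 with F2 held fixed. The raw and partial correlations can differ in sign when the control variable drives both quantities.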

To examine population representation of vowel parameters, we used linear regression to decode F0 to F4 from neural responses. To ensure unbiased estimation, we first removed correlations between F0 to F4 by using linear prediction and decoded the residuals. Relatively high decoding accuracies are shown (P < 0.001, t test), suggesting fundamental and formant variability is well represented in population STG responses (interaction between decoder weights with electrode STRFs). By using multidimensional scaling, we found that the relational organization between vowel centroids in the acoustic domain is well preserved in neural space.

For plosives, we measured three perceptually important acoustic cues: voice-onset time (VOT), which distinguishes voiced (/b/, /d/, and /g/) from unvoiced plosives (/p/, /t/, and /k/); spectral peak (differentiating labials /p/ and /b/ versus coronal /t/ and /d/ versus dorsal /k/ and /g/); and F2 of the following vowel. These acoustic parameters could be decoded from population STG responses (P < 0.001, t test). VOTs are temporal cues that are perceived categorically, which suggests a nonlinear encoding.

To examine the nonlinear relationship between VOT and response amplitude for voiced-plosive electrodes compared with plosive electrodes with no sensitivity to the voicing feature, we fitted a linear and an exponential function to VOT-response pairs. The difference between these two fits specifies the nonlinearity of this transformation. Voiced-plosive electrodes all show strong nonlinear bias for short VOTs.
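One way to picture the comparison of linear and exponential fits is the sketch below. It rests on several assumptions that the text does not specify: the exponential is fitted by least squares on log-transformed responses (which requires positive amplitudes), and the "difference between fits" is taken as the mean absolute gap between the two fitted curves at the observed VOTs. All function names are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - slope * mx, slope

def fit_exponential(x, y):
    """Fit y = amp * exp(rate * x) by linear regression on log(y) (y must be > 0)."""
    log_amp, rate = fit_line(x, [math.log(v) for v in y])
    return math.exp(log_amp), rate

def nonlinearity(vot, response):
    """Mean absolute difference between the linear and exponential fitted curves
    (an assumed definition of the 'difference between fits')."""
    intercept, slope = fit_line(vot, response)
    amp, rate = fit_exponential(vot, response)
    return sum(
        abs((intercept + slope * v) - amp * math.exp(rate * v)) for v in vot
    ) / len(vot)
```

An electrode whose response falls off steeply at short VOTs yields a large value, while an electrode with flat or nearly linear dependence on VOT yields a value near zero.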

We performed a similar analysis for fricatives, measuring duration, which aids the distinction between voiced (/z/ and /v/) and unvoiced fricatives (/s/, /ʃ/, /θ/, /f/); spectral peak, which differentiates /f/ and /v/ versus coronal /s/ and /z/ versus dorsal /ʃ/; and F2 of the following vowel. These parameters can be decoded reliably from population responses (P < 0.001, t test).

Because plosives and fricatives can be subspecified by using similar acoustic parameters, we determined whether the response of electrodes to these parameters depends on their phonetic category (i.e., fricative or plosive). We compared the partial correlation values of neural responses with spectral peak, duration, and F2 onset of fricative and plosive phonemes, where each point corresponds to an electrode color-coded by its cluster grouping. High correlation values (r = 0.70, 0.87, and 0.79; P < 0.001; t test) suggest that electrodes respond to these acoustic parameters independent of their phonetic context. The similarity of responses to these isolated acoustic parameters suggests that electrode selectivity to specific phonetic features emerges from combined tuning to multiple acoustic parameters that define phonetic contrasts.

We have characterized the STG representation of the entire American English phonetic inventory. We used direct cortical recordings with high spatial and temporal resolution to determine how selectivity for phonetic features is correlated to acoustic spectrotemporal receptive field properties in STG. We found evidence for both spatially local and distributed selectivity to perceptually relevant aspects of speech sounds, which together appear to give rise to our internal representation of a phoneme.

We found selectivity for some higher-order acoustic parameters, such as examples of nonlinear, spatial encoding of VOT, which could have important implications for the categorical representation of this temporal cue. Furthermore, we observed a joint differential encoding of F1 and F2 at single cortical sites, suggesting evidence of spectral integration previously speculated in theories of combination-sensitive neurons for vowels.

Our results suggest a multidimensional feature space for encoding the acoustic parameters of speech sounds. Phonetic features defined by distinct acoustic cues for manner of articulation were the strongest determinants of selectivity, whereas place-of-articulation cues were less discriminable. This might explain some patterns of perceptual confusability between phonemes and is consistent with feature hierarchies organized around acoustic cues, where phoneme similarity space in STG is driven more by auditory-acoustic properties than articulatory ones. A featural representation has greater universality across languages, minimizes the need for precise unit boundaries, and can account for coarticulation and temporal overlap over phoneme-based models for speech perception.

[end of paraphrase]

Science 28 February 2014: Vol. 343 no. 6174 pp. 978-979

The Neural Code That Makes Us Human

Yosef Grodzinsky, Israel Nelken

Department of Linguistics, McGill University, 1085 Dr. Penfield Avenue Montréal, Québec H3A1A7, Canada.

Department of Neurobiology, The Alexander Silberman Institute of Life Sciences, Hebrew University, Jerusalem, 91904 Israel.

The Edmond and Lily Safra Center for Brain Sciences, The Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel.

[paraphrase]

Speech provides a fascinating window into brain processes. It is understood effortlessly, and despite huge variability both within and across speakers, it is a stable and reliable carrier of linguistic meaning, complex and intricate as that meaning may be. How speech is encoded and decoded has puzzled those seeking to understand how the brain extracts sense from an ambiguous, noisy environment. In the above research paper, Mesgarani et al. demonstrate the neural basis of speech perception by combining linguistic, electrophysiological, clinical, and computational approaches.

How do brains use the pattern of pressure waves in the air that is speech (“speech-as-sound”) and extract meaning (“speech-as-speech”) from it reliably, despite huge variability between speakers and background noise? Studies dating as far back as the 1950s showed that natural speech is highly redundant—speech sounds convey their identity by a large number of disparate acoustic cues. However, to ensure stable cue-to-speech translation by brains, an invariant code—something like a dictionary of speech units—seems necessary. What, then, is the nature of the representation of speech units in the brain, and how do they combine into larger, meaning-bearing pieces?

In the 1930s, linguists Roman Jakobson and Nikolai Trubetzkoy classified consonants and vowels along articulatory dimensions: Their description of the basic units of speech recognition referred to elements such as the place in the oral cavity where air is compressed on its way out (“labial,” “dental,” “velar,” etc.), the manner of air release (“plosive,” “sonorant,” etc.), and whether the vocal cords vibrate or not (“voiced,” “unvoiced”). For example, the sound /p/ is a composite of features—[+labial, −voiced, +plosive]—distinguishable from /b/ [+labial, +voiced, +plosive] and from /t/ [+alveolar, −voiced, +plosive]. Distinctive features, then, help to characterize the nature of invariance, while systematically grouping speech units in clusters. These features have therefore played a central role in speech recognition research.
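The feature bundles in the example above can be written as a small lookup table. The sketch below encodes only the three phonemes mentioned in the text. The dictionary layout and the function name `contrast` are illustrative choices.

```python
# Distinctive-feature bundles for the three phonemes used in the example.
FEATURES = {
    "p": {"place": "labial", "voiced": False, "manner": "plosive"},
    "b": {"place": "labial", "voiced": True, "manner": "plosive"},
    "t": {"place": "alveolar", "voiced": False, "manner": "plosive"},
}

def contrast(a, b):
    """Return the set of feature dimensions on which two phonemes differ."""
    return {dim for dim in FEATURES[a] if FEATURES[a][dim] != FEATURES[b][dim]}
```

Here `contrast("p", "b")` isolates voicing and `contrast("p", "t")` isolates place, showing how a few feature dimensions separate otherwise identical speech units.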

But what actually happens in human brains during speech perception, and where? It may be that invariance is expressed in terms of articulation-related distinctive features (as proposed by linguists). Invariance may also be reflected already in sensory areas; alternatively, brain processes may achieve invariant representations of speech sounds only outside the auditory system proper. One extreme possibility is that distinctive features correlate with acoustic ones, in which case the invariant coding of sounds may already occur in sensory areas. At the other extreme, as suggested by the influential motor theory of speech perception, speech sounds may well be represented by the articulatory gestures used to produce them. A recent form of this view actually posits mirror neurons in the brain that do precisely that—map sounds onto motor actions. In that case, the invariant representation of speech would by necessity occur in motor areas, outside of the auditory system.

In the above research paper, Mesgarani et al. recorded responses to speech sounds in the brains of human patients who were about to undergo brain surgery for clinical reasons. These recordings give a more detailed view of the electrical activity in the human brain than noninvasive methods such as electroencephalograms or functional magnetic resonance imaging, although they still reflect the average responses of large neuronal populations. Using these electrical signals, the authors demonstrate a high degree of invariance of speech representation as early as in the human auditory cortex by showing that speech sounds of different speakers and in a multitude of contexts nonetheless activate the same brain regions. Moreover, invariance seems to be governed by articulatory distinctive features, thereby supporting the 80-year-old theory of Jakobson and Trubetzkoy. Interestingly, features do not have equal neural representation, and those that induce strong neural invariance have strong acoustic correlates. Speech representation in the auditory cortex, in other words, is governed by acoustic features, but not by just any acoustic feature—the features that dominate speech representation are precisely those that are associated with abstract, linguistically defined distinctive features. Mesgarani et al., who base their investigation on linguistic distinctions, further demonstrate that features are distinguishable by the degree of the neural invariance they evoke, forming an order that is remarkably in keeping with old linguistic observations: Manner of articulation (manifesting early in developing children) produces a neural invariance that is more prominent than that related to place of articulation (manifesting late in children). A hierarchy noted in 1941 for language acquisition is now resurfacing as part of the neural sensitivity to speech sounds.

But linguistic communication is based on larger pieces than the basic building blocks of speech. It also requires rules that create complex combinations from basic units. Linguistic combinatorics is therefore an essential part of verbal communication, allowing it to be flexible and efficient. Here, too, Mesgarani et al. offer some clues. They show that sequencing processes, particularly those that determine voice onset time, tend to be more distributed in neural tissue than the rather localized distinctive features. This suggests that combinatorial rules that concatenate basic elements into bigger units might depend on larger, perhaps somewhat more widely distributed, neural chunks, than the stored representations of basic building blocks. How distributed (and speech-specific) such processes are is not revealed by the Mesgarani et al. study, but evidence about the neural specificity of language combinatorics at other levels of analysis does exist: Operations involved in building complex expressions—sentences with rich syntax and semantics—are relatively localized in parts of the left cerebral hemisphere (and distinct from other combinatorial processes such as arithmetic), even if the neural chunks that support them may be as large as several cubic centimeters.

Although the study of Mesgarani et al. was carried out in English, the findings have universal implications. Cross-linguistic evidence for universal neural representation of higher aspects of linguistic communication also exists, at least to some extent. These results may suggest a shift in view on brain-language relations: From earlier modality-based models, we have moved to attempts to identify the neural code for specific linguistic units and concatenating operations. This move carries the hope that someday, the complete neural code for language will be identified, thereby making good on the promise that linguistics be “part of psychology, ultimately biology.”

[end of paraphrase]
