Consonant effects on F0 of following vowels

Language Log 2014-06-05

I spent the past couple of days at a workshop on lexical tone, organized by Kristine Yu at UMass. A topic that came up several times was the question of whether "segmental" influences on pitch — for instance, the fact that voiceless consonants are typically associated with a higher pitch in the first part of a following vowel — are diminished or eliminated in languages with lexical tone. Several participants observed that the evidence for this is not very strong: the classical paper on the subject studied a small number of utterances from one speaker in Thai, for example.

So for this morning's Breakfast Experiment™, I wrote a little script that calculates and displays (one way of looking at) these effects in the TIMIT dataset, which includes 10 English sentences spoken by each of 630 speakers. (Specifically, there are two sentences spoken by all 630 speakers; 450 sentences spoken by 7 speakers each; and 1890 sentences spoken by a single speaker.)
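As a sanity check, the three parts of that breakdown do add up to 10 sentences per speaker. A quick bit of arithmetic in Python (the variable names are just for illustration):

```python
# TIMIT sentence counts as described above:
# label -> (number of distinct sentences, speakers per sentence)
parts = {
    "spoken by all speakers": (2, 630),
    "spoken by 7 speakers each": (450, 7),
    "spoken by a single speaker": (1890, 1),
}

total_utterances = sum(n_sent * n_spk for n_sent, n_spk in parts.values())
distinct_sentences = sum(n_sent for n_sent, _ in parts.values())

print(total_utterances)        # 6300 utterances in all
print(total_utterances / 630)  # 10.0 utterances per speaker
print(distinct_sentences)      # 2342 distinct sentences
```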

I had to go to a meeting before I had a chance to write up the results, but the meeting ended early enough for me to find 15 minutes before lunch, so:

My script pitch-tracked all the sentences, and located all the places where one of the consonants "b", "d", "g", "k", "m", "n", "p", "t" was followed by one of the vowels "aa", "ae", "ah", "ay", "eh", "er", "ey", "ih", "ix", "iy" (in ARPABET). I pulled out the first 50 msec. of estimated F0 values from the designated vowels — 10 estimates at a 5-msec. frame advance. I expressed the F0 estimates in each 10-element vector as ratios to the mean value of that vector.
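For concreteness, here is a minimal sketch of the extraction-and-normalization step. It is not the actual script, and it makes several assumptions that are not in the description above: the phone segmentation is a time-ordered list of (label, start, end) tuples in seconds, the pitch track is a list with one F0 value per 5-msec. frame, unvoiced frames are marked 0.0, and vowels with fewer than 10 frames or with any unvoiced frame are skipped. The function name normalized_onsets is made up for this illustration.

```python
CONSONANTS = {"b", "d", "g", "k", "m", "n", "p", "t"}
VOWELS = {"aa", "ae", "ah", "ay", "eh", "er", "ey", "ih", "ix", "iy"}
FRAME_SEC = 0.005  # 5-msec. frame advance
N_FRAMES = 10      # 10 frames = first 50 msec. of the vowel


def normalized_onsets(phones, f0):
    """Yield (consonant, ratios) for each consonant-vowel sequence.

    phones: list of (label, start_sec, end_sec) in time order (assumed format).
    f0: list of F0 estimates in Hz, one per frame; 0.0 = unvoiced (assumed).
    """
    for (cons, _, _), (vowel, v_start, v_end) in zip(phones, phones[1:]):
        if cons not in CONSONANTS or vowel not in VOWELS:
            continue
        first = round(v_start / FRAME_SEC)
        last = round(v_end / FRAME_SEC)
        # Assumption: skip vowels that are shorter than 10 frames.
        if last - first < N_FRAMES:
            continue
        chunk = f0[first:first + N_FRAMES]
        # Assumption: skip vowels whose onset has missing or unvoiced frames.
        if len(chunk) < N_FRAMES or min(chunk) <= 0:
            continue
        mean = sum(chunk) / N_FRAMES
        # Express each estimate as a ratio to the mean of its own vector.
        yield cons, [value / mean for value in chunk]


# Made-up toy input, not TIMIT data: a /t/ followed by "iy" starting at 100 msec.
phones = [("t", 0.05, 0.10), ("iy", 0.10, 0.20)]
f0 = [0.0] * 20 + [132, 130, 128, 126, 124, 122, 120, 118, 116, 114] + [112] * 10
for cons, ratios in normalized_onsets(phones, f0):
    print(cons, [round(r, 3) for r in ratios])
```

Dividing by the per-vector mean removes differences in overall pitch level between speakers and contexts, so that only the shape of the first 50 msec. is compared.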

A plot of the results:

This clearly shows the expected effects, with /p/ /t/ /k/ showing an average fall of about 10%, whereas /b/ /d/ /g/ show about a 3% fall, and /m/ /n/ show even less.
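One way to put numbers on such contours is to average the normalized vectors for each consonant and take the drop from the first frame to the last. The sketch below does that under an assumed definition of "fall" (first ratio minus last ratio of the mean contour, times 100); the function names and the toy data are invented for illustration, and the figures quoted above were read off the plot rather than computed this way.

```python
from collections import defaultdict


def mean_contours(examples):
    """Average normalized F0 vectors per consonant.

    examples: iterable of (consonant, ratios), where ratios is a list of
    F0 values already divided by the mean of their own vector.
    """
    sums = {}
    counts = defaultdict(int)
    for cons, ratios in examples:
        if cons not in sums:
            sums[cons] = [0.0] * len(ratios)
        sums[cons] = [s + r for s, r in zip(sums[cons], ratios)]
        counts[cons] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def percent_fall(contour):
    """Start-to-end drop of a mean contour, in percent of the vector mean.

    Assumed definition: the ratios are relative to a mean of 1.0, so the
    difference between first and last frame times 100 is a percentage.
    """
    return 100.0 * (contour[0] - contour[-1])


# Made-up toy vectors (3 frames each), not TIMIT measurements.
toy = [
    ("p", [1.05, 1.00, 0.95]),
    ("p", [1.03, 1.00, 0.97]),
    ("m", [1.01, 1.00, 0.99]),
]
for cons, contour in mean_contours(toy).items():
    print(cons, round(percent_fall(contour), 1))
```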

It's nice to see that such a crude technique produces such clean results. This is presumably due to the size of the dataset (small by today's speech-technology standards, but enormous by the standards of most phonetics research), and perhaps the dataset's balanced character (though I suspect that conversational or broadcast-news datasets will show similar effects, if they're large enough). The counts involved in this case (after automatically removing examples with period-doubling or other pitch-tracking errors):

CONSONANT  COUNT
b          1786
d          2314
g          649
p          1361
t          2615
k          2347
m          2837
n          2055

Now if only we had TIMIT-like datasets for French, German, Chinese, Thai, Yoruba, Chinantec, Pashto, etc.!

Someday, I hope, we will…