A simple way to model prosody in reading

Language Log 2025-09-27

In "Reading Instruction in the mid 19th century" (8/15/2025), I noted a suggestion, due to Ran Liu of Amira Learning, that a computational analysis of prosodic features could be an effective way to evaluate how well grade-school students understand what they're reading. Beyond that, Maryellen MacDonald has suggested that phrasal prosody can be seen as the phrase-level analog of phonemic blending (i.e. putting the sounds of 'c' 'a' 't' together into "cat"), which might help to explain the benefits of McGuffey-style elocution lessons.

Both ideas raise the question of how to evaluate the prosody of a given student's reading. And there's a simple and obvious way to do this, described and exemplified below.

We might rely on a model that predicts duration, vocal effort, pitch, and pausing from the phonology, syntax, semantics, and pragmatics of a phrase — there's an enormous literature aiming to do this analytically — or we could rely on a modern-style end-to-end deep learning system that simply maps character sequences onto predicted acoustics.

But that's going to be complicated, either way, and there's a simpler way to start.

For decades, we've had technology that does a good job of "forced alignment", i.e. aligning speech signals with various levels of symbol-sequences representing them (see e.g. Talkin and Wightman 1994; Fox 2006; Yuan and Liberman 2008). So from a sample of model readings for a given passage, we can derive a distribution of relevant acoustic measures, and compare it with the same measures derived from the performance to be evaluated.
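As a minimal sketch of that comparison, restricted to silent gaps between words: the code below assumes each reading has already been force-aligned into (word, start, end) tuples in milliseconds, with the same word sequence in every reading. The function names, the three-standard-deviation threshold, the 50 ms floor, and all the timings are illustrative assumptions, not anything taken from a particular aligner or from the actual analysis.

```python
from statistics import mean, pstdev

# A reading is a list of (word, start_ms, end_ms) tuples, one per word.
# All timings below are invented for illustration.

def gaps(reading):
    """Silent gap in ms between each word and the next one."""
    return [nxt[1] - cur[2] for cur, nxt in zip(reading, reading[1:])]

def gap_profile(model_readings):
    """Per-boundary (mean, std dev) of gap durations across model readings."""
    per_boundary = zip(*(gaps(r) for r in model_readings))
    return [(mean(g), pstdev(g)) for g in per_boundary]

def unusual_pauses(reading, profile, n_sd=3, sd_floor_ms=50):
    """Boundaries where the gap exceeds the model mean by n_sd std devs.

    The std dev is floored at sd_floor_ms so that a handful of nearly
    identical model readings does not flag every tiny gap.
    """
    flagged = []
    for i, (gap, (mu, sd)) in enumerate(zip(gaps(reading), profile)):
        if gap > mu + n_sd * max(sd, sd_floor_ms):
            flagged.append((i, reading[i][0], gap))
    return flagged

model_a = [("to", 0, 100), ("bring", 100, 400), ("these", 410, 600), ("things", 600, 1000)]
model_b = [("to", 0, 90), ("bring", 100, 380), ("these", 380, 560), ("things", 560, 950)]
learner = [("to", 0, 150), ("bring", 150, 550), ("these", 790, 1050), ("things", 1570, 2000)]

profile = gap_profile([model_a, model_b])
print(unusual_pauses(learner, profile))
```

Each flagged item gives the boundary index, the word before the pause, and the pause duration. The same pattern would extend to other per-word measures such as duration or pitch, given an alignment that supplies them.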

I'll illustrate this with a simple example from the Speech Accent Archive at George Mason University, in which a large number of speakers read an "elicitation paragraph":

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Many of the readers are native speakers of various varieties of English, e.g.

[Audio: three sample readings by native speakers of English]

And many others are speakers of other languages — a large fraction of whom are not entirely fluent as readers of English, e.g. this native Russian speaker:

[Audio: reading by a native speaker of Russian]

There are many issues with that last reading, but let's start with something simple, which also applies to U.S. learners of whatever language background — the location and duration of silent pauses.

The second phrase of the elicitation paragraph is a good example. The durations of the Russian speaker's inter-word pauses in milliseconds, as measured by forced alignment, are given in curly braces in the transcript below:

[Audio: the Russian speaker's reading of the second phrase]

Ask her to bring {240} these {520} things with her from the {290} store
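A transcript of that kind can be produced directly from word-level alignment output. The sketch below is one hedged way to do it: the function name and the 100 ms minimum-pause threshold are assumptions, and the word start and end times are invented so that the three gaps match the pause durations reported above.

```python
def annotate_pauses(words, min_pause_ms=100):
    """Join words, inserting {duration} wherever the silent gap between
    consecutive words is at least min_pause_ms (an assumed threshold)."""
    parts = []
    for i, (word, start, end) in enumerate(words):
        parts.append(word)
        if i + 1 < len(words):
            gap = words[i + 1][1] - end
            if gap >= min_pause_ms:
                parts.append("{%d}" % gap)
    return " ".join(parts)

# (word, start_ms, end_ms); word timings are hypothetical, and only the
# three pause durations (240, 520, 290) come from the text.
aligned = [
    ("Ask", 0, 300), ("her", 300, 450), ("to", 450, 600),
    ("bring", 600, 1000), ("these", 1240, 1500), ("things", 2020, 2500),
    ("with", 2500, 2700), ("her", 2700, 2850), ("from", 2850, 3100),
    ("the", 3100, 3250), ("store", 3540, 4100),
]

print(annotate_pauses(aligned))
```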

In contrast, none of the three sample native-English readers above has any within-phrase silent pauses. And their speech rate is also obviously faster:

[Audio: the three native-English speakers' readings of the same phrase]
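Both observations, pause count and speech rate, can be put in numbers from the same alignment. Here is a rough sketch, assuming a phrase is given as (word, start, end) tuples in milliseconds; the function name, the 100 ms pause threshold, and the choice of words per second as the rate unit (rather than, say, syllables per second) are all assumptions made for illustration, and the timings are invented.

```python
def timing_summary(words, min_pause_ms=100):
    """Pause count, total pause time, and two speech-rate figures for a phrase."""
    total_ms = words[-1][2] - words[0][1]
    pauses = [
        nxt[1] - cur[2]
        for cur, nxt in zip(words, words[1:])
        if nxt[1] - cur[2] >= min_pause_ms
    ]
    pause_ms = sum(pauses)
    return {
        "n_pauses": len(pauses),
        "pause_ms": pause_ms,
        # Rate over the whole phrase, silent pauses included.
        "words_per_sec": len(words) / (total_ms / 1000),
        # Rate over the time actually spent speaking.
        "articulation_words_per_sec": len(words) / ((total_ms - pause_ms) / 1000),
    }

# Hypothetical four-word phrase with one 500 ms pause after the second word.
phrase = [("a", 0, 500), ("b", 500, 1000), ("c", 1500, 1750), ("d", 1750, 2000)]
summary = timing_summary(phrase)
print(summary)
```

Separating the two rates helps show whether a slow reading is slow because of pausing, slow articulation, or both.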

There's plenty more to say about the pronunciation variation involved — and the Speech Accent Archive (at least at the time that I downloaded it) has 659 native-English readings to compare, along with even more non-native readings.

The four examples above were literally chosen at random. But I've made a systematic comparison of timing and pausing in all the native-English readers, and a large sample of non-native readers, and the pattern holds pretty well, except for a subset of highly fluent non-native readers. In a sample of learner-data from Amira Learning (one of Penn's partners in the U-GAIN project), the effects seem even stronger. (The process of getting consent to share (some of) that data is still underway…)

Accent comparison is an issue for U-GAIN as well. But this morning's goal is just to indicate an obvious and easy road towards evaluation of student reading fluency, on which the first step is simply a comparison of silent pause locations and durations.

My current guess is that 5 or 10 model readings of each passage will be plenty for that task, but time will tell.