The right boot of the warner of the baron

Language Log 2019-12-06

Here at the UNESCO LT4All conference, I've noticed that many participants assert or imply that the problems of human language technology have been solved for a few major languages, especially English, so that the problem on the table is how to extend that success to thousands of other languages and varieties.

This is not totally wrong — HLT is a practical reality in many applications, and is rapidly spreading to others. And the problem of digitally underserved speech communities is real and acute.

But it's important to understand that the problems are not all solved, even for English, and that the remaining issues also represent barriers for extensions of the technology to other communities, in that the existing approximate solutions are far too hungry for data and far too short on practical understanding and common sense. There are many ways to make this point. We could look at the Winograd Schema Challenge among many other text-understanding problems. On the speech side, we could look at the current state of algorithms for diarization and speaker change detection.

But the attitude that caught my attention at this conference was epitomized in a presentation by Kelly Davis, who introduced Mozilla's Common Voice and DeepSpeech projects. These are great projects, well designed to make it possible for (certain kinds of) new languages to be added with minimal new engineering, since they rely on internet-based collection and validation of read speech from recruited volunteers, and sequence-to-sequence training of a system based on the results of that collection. But Kelly's presentation suggested that these projects have solved the speech-to-text problem for English, so that all we need for each additional language is to recruit enough readers and validators to create an open dataset of 10,000 hours of read speech.

I'm strongly in favor of the Common Voice project, though it would be nice to have a way to add conversational or other forms of spontaneous speech, both because spontaneous speech is different, and because some language communities are primarily oral. Today, though, I want to make the point that this admirable approach is not the end of the story.

Here's the start of a Librivox reading of Jane Austen's Pride and Prejudice:

[audio clip]

Original text:

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.

Mozilla DeepSpeech transcript:

it is a true universally acknowledged that a single man in possession of a good fortune must be in want of a while however little known the feelings are views of such a man may be on his first entering a neighbourhood this tree is so well sick in the minds of the surrounding families that he is considered the rightful property of some one or other of their daughters

5 substitutions in 70 words = Word Error Rate (WER) of 7.1%

That's pretty good! Though sequences like "it is a true universally acknowledged" and "in want of a while" are not outputs that we should accept from an entity that knows English, especially given that the phonetics are pretty clear. Those errors are all too typical of the behavior of sequence-to-sequence algorithms like DeepSpeech, and represent a type of error that would probably not be made by a system with an architecture that knows about the secret entities called "words".


If we add a little reverb and clipping, the passage remains intelligible:

[audio clip]

But DeepSpeech now gives us:

it is a true universal is now but a in the man in possession of a good fortune must be in one of the whale however little known the feelings are views in such a man may be on his first entering a neighbourhood this treatise well sick in the minds of the sempalys that he is considered the rightful property of some one or other their daughters

NIST sclite WER 27.1%
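Degradations of this sort are easy to reproduce. Here's a minimal numpy sketch of a crude "reverb" (convolution with an exponentially decaying impulse response) followed by hard clipping — the decay and clip parameters are illustrative stand-ins, not the settings used for the clip above:

```python
import numpy as np

def degrade(x, reverb_decay=0.3, clip_level=0.5, sr=16000):
    """Crude reverb-plus-clipping: convolve with a 100 ms exponentially
    decaying impulse response, then hard-clip the result.
    Parameter values here are illustrative, not those used in the post."""
    t = np.arange(int(0.1 * sr))
    ir = np.exp(-t / (reverb_decay * sr))   # decaying impulse response
    ir /= ir.sum()                          # keep overall level roughly constant
    y = np.convolve(x, ir)[: len(x)]        # "reverb" via convolution
    return np.clip(y, -clip_level, clip_level)  # hard clipping

# a one-second 440 Hz test tone at full scale
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
out = degrade(tone)
print(out.max())  # never exceeds the clip level
```

A real experiment would convolve with a measured room impulse response rather than this synthetic decay, but the effect on a recognizer is qualitatively similar.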

[Note: "Word Error Rate" is defined as the sum of word substitutions, deletions and insertions, divided by the number of words in the reference (true) transcript. But it's sometimes not obvious how to classify a given discrepancy as one kind of error or another, so I've followed normal practice and relied on the choices made by the cited NIST software.]
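The arithmetic behind those numbers can be sketched in a few lines of Python — a minimal Levenshtein alignment over words, not the actual sclite implementation, which adds normalization and alignment-reporting options that this sketch ignores:

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / len(ref),
    computed with a standard Levenshtein alignment over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit operations to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("it is a truth universally acknowledged",
          "it is a true universally acknowledged"))  # 1 substitution in 6 words
```

Note that WER can exceed 100% when the hypothesis contains enough insertions.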


And if we add some white noise, it still sounds the same, just with a hiss in the background — but DeepSpeech gets even more confused, to the point where I won't bother trying to put the errors in bold face:

[audio clip]

it is universal in now that is in the land of the session of a good portion must in want o a while however little more deadhouse at such men may be on his recent neighbourhood sisters i sowed the mind of the transmitted the right boot of the warner of the baron

NIST sclite WER is 68.6%
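Adding white noise at a controlled level is equally simple. Here's a sketch that mixes in Gaussian noise at a target signal-to-noise ratio — the 10 dB figure is an arbitrary illustrative value, not the level used for the clip above:

```python
import numpy as np

def add_white_noise(x, snr_db=10.0, rng=None):
    """Add Gaussian white noise scaled to a target SNR in dB.
    snr_db=10.0 is an illustrative value, not the one used in the post."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

# one second of a 440 Hz tone at 16 kHz
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = add_white_noise(x, snr_db=10.0)

# the measured SNR should come out close to the 10 dB target
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((noisy - x) ** 2))
print(round(measured, 1))
```

Babble noise works the same way, except that the noise signal is a mix of overlapping speech recordings rather than Gaussian samples — which is exactly why it is so much harder for a recognizer to ignore.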


I should note that we could retrain the system while adding lots of such noise and distortion to the original training set, and the performance would improve — on similar inputs with similar kinds of noise and distortion. But the world is full of different kinds of input, in particular conversational (and otherwise spontaneous) speech, and the world is also full of lots of different kinds of noise and distortion. For example, we could add a very little bit of babble noise, which hardly changes the human perception at all:

[audio clip]

But DeepSpeech gets even more confused, and starts leaving out whole stretches:

it is universal in now that is in the man of possession of a good portion must in want o a i however what a mongoose is such men may be on his breunner hood

NIST sclite WER is 75.7%


As a random example of the kind of spontaneous speech that's Out There, here's a "Cookie Theft" description (see "Shelties On Alki Story Forest" for discussion of the genre), recently recorded, of relatively high audio quality:

[audio clip]

Human Transcription:

A: Alright
B: Go ahead.
A: Okay, a girl is getting a cake out of the cupboard
A: and she's almost going to fall on the floor doing it
A: Her-
A: The girl has a cookie jar in her hand.
A: She has-
A: She's grabbing an other girl's hand, but looks like she might fall on the floor.
A: And this girl over here
A: is washing the dishes
A: and drying them.
A: And actually the sink looks like it's overflowing.
A: This girl puts- has an apron on. She looks like she's uh –x

We don't need to add any noise or distortion to cause problems here:

Mozilla DeepSpeech:

ari go head and girls getting it kakakew and she's almost gone on for doing it or the girl had a cooky jar in her hand she has she grabbing another girl's hand but looked like she might fall on the floor and this girl or here is washing the dishes and drying them actually the sink looks i could see her folly is girl but as a burnished like she is

NIST sclite WER 47.7%

"Getting it kakakew" indeed.