"One big Donald Trump AIDS"

Language Log 2017-06-25

As I've observed several times over the years, automatic speech recognition is getting better and better, to the point where some experts can plausibly advance claims of "achieving human parity". It's not hard to create material where humans still win, but in a lot of ordinary-life recordings, the machines do an excellent job.

Just like human listeners, computer ASR algorithms combine "bottom-up" information about the audio with "top-down" information about the context — both the local word-sequence context and various layers of broader context. In general, the machines are more dependent than humans are on the top-down information, in the sense that their performance on (even carefully-pronounced) jabberwocky or word salad is generally rather poor.
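A minimal sketch of that combination, assuming the common textbook formulation in which a recognizer adds a weighted language-model log-probability to an acoustic log-probability and picks the best-scoring candidate. Everything here is hypothetical: the two candidate transcriptions (a classic near-homophone pair), the scores attached to them, and the names `combined_score`, `best_hypothesis`, and `lm_weight`. It is not a description of how any particular system, YouTube's included, actually works.

```python
# Toy illustration of combining "bottom-up" and "top-down" evidence.
# All numbers are invented for illustration; they come from no real system.

def combined_score(acoustic_logprob, lm_logprob, lm_weight=1.0):
    """Log-linear combination: acoustic evidence plus weighted LM evidence."""
    return acoustic_logprob + lm_weight * lm_logprob

# Hypothetical candidates for the same stretch of audio.
# "acoustic": how well the candidate matches the sound (bottom-up).
# "lm": how plausible the word sequence is on its own (top-down).
hypotheses = {
    "recognize speech": {"acoustic": -12.0, "lm": -8.0},
    "wreck a nice beach": {"acoustic": -11.5, "lm": -15.0},
}

def best_hypothesis(hyps, lm_weight):
    """Return the candidate with the highest combined score."""
    return max(
        hyps,
        key=lambda h: combined_score(hyps[h]["acoustic"], hyps[h]["lm"], lm_weight),
    )

# With the language model switched off, the slightly better acoustic match wins.
print(best_hypothesis(hypotheses, lm_weight=0.0))
# With the language model weighted in, the more plausible word sequence wins.
print(best_hypothesis(hypotheses, lm_weight=1.0))
```

In this toy setup the top-down term only has to overcome a small acoustic margin. The errors discussed in this post are puzzling because the word-sequence evidence should have been lopsided in the same way.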

But recently I've been noting some cases where an ASR system unexpectedly fails to take account of what seem like some obvious local word-sequence likelihoods. To check my impression that such events are fairly common, I picked a random video from YouTube's welcome page (Bill Maher's 6/23/2017 monologue) and fetched the "auto-generated" closed captions.

Here's an example that combines impressive overall performance with one weird mistake:


5:07 Mitch McConnell says he wants a vote
5:10 before the 4th of July when Trump voters
5:13 traditionally blow their hands off
5:19 oh the fourth of July hey summers here
5:24 boy it was real Beach weather in Phoenix
5:26 the other day did you see that it was
5:28 122 122 plains could not take off hey
5:34 climate deniers
5:36 if melting IceCaps and rising oceans and
5:40 pandemics aren't enough to scare you not
5:42 being able to leave Phoenix that should
5:50 work

I'll give the machine a pass on "summers" instead of "summer's", and we can ignore the issue of "oh" vs. "ah", and forgive the hallucinated "work" at the end — but "plains could not take off"? In Psalm 114:4 the mountains skipped like rams, but not even then did the plains take off.

A bit later:


6:32 but speaking of solar Donald Trump broke
6:36 some news at the rally that the wall you
6:39 know the wall between us and Mexico it's
6:41 going to have solar panels on he said it
6:43 was his idea solar battles okay so the
6:47 wall which is never going to be built
6:49 which Mexico is never going to be paying
6:52 for which now has imaginary so propels
6:56 on because if it's one big Donald Trump
6:59 AIDS it's fake news

So the system got "solar panels" right the first time, but then heard "solar battles" and "so propels". In fairness, Maher kind of garbles the last one into something like "solar pels":

[Audio clip]

But still, I don't think anyone in the audience heard "so propels".

And then at the end, "if it's one thing Donald Trump hates it's fake news" gets turned into "if it's one big Donald Trump AIDS it's fake news":

[Audio clip]

In that case, I don't hear any acoustic phonetic excuses. And surely "one thing Donald Trump hates" is a priori a more probable word string than "one big Donald Trump AIDS"…

I don't know which generation of ASR Google is using to generate YouTube captions. But it's possible that this sort of thing is an example of the sometimes-peculiar behavior of RNN language models.