Machine Translation Without the Translation

Lingua Franca 2013-05-31

I have been ruminating this month on why natural language processing (NLP) still hasn’t arrived, and I have pointed to three developments that seem to be discouraging it. First, keyword search enhanced by Google’s influentiality-ranking of results. Second, the dramatic widening of speech recognition’s applicability that dialog design makes possible. I now turn to a third, which has to do with the sheer power of number-crunching.

Machine translation is the unclimbed Everest of computational linguistics. It calls for syntactic and semantic analysis of the source language, mapping of source-language meanings to target-language meanings, and generation of acceptable output from the latter. If computational linguists could do all those things, they could hang up the “mission accomplished” banner.

What has emerged instead, courtesy of Google Translate, is something utterly different: pseudotranslation without analysis of grammar or meaning, developed by people who do not know (or need to know) either source or target language.

The trick: huge quantities of parallel texts combined with massive amounts of rapid statistical computation. The catch: low quality, and output inevitably peppered with howlers.

Of course, if I may purloin Dr Johnson’s remark about a dog walking on his hind legs, although it is not done well, you are surprised to find it done at all. For Google Translate’s pseudotranslation is based on zero linguistic understanding. Not even word meanings are looked up: The program couldn’t care less about the meaning of anything. Here, roughly, is how it works.

Imagine you have a huge quantity of English text paired with French text that happens to say the same thing. The records of everything ever said in the Canadian Parliament, say, or the complete Harry Potter novels in the original and the French translation. Assume that you can automatically align them, at least approximately, by reference to sentence boundaries, and perhaps some phrase and word boundaries. Get a computer program to construct a gigantic database containing facts of roughly this sort:

  • French vin n’est tends to align with English wine isn’t, probability = p1
  • French vin blanc tends to align with English white wine, probability = p2
  • French bon marché tends to align with English cheap, probability = p3
  • French avait marché tends to align with English had walked, probability = p4
  • French fait rien tends to align either with English doesn’t matter (probability = p5) or with does nothing (probability = p6)

And so on. Imagine literally billions of facts like this at your disposal, in a huge database that you can search at lightning speed.
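To make the idea concrete, here is a minimal sketch (in Python, and purely illustrative) of how such a table might be assembled. Everything in it is an assumption for exposition’s sake: the phrase extraction is deliberately crude, simply counting co-occurrences of short word sequences across aligned sentence pairs, where a real system would first run statistical word-alignment models over the corpus.

    from collections import defaultdict
    from itertools import product

    def phrases(words, max_n=2):
        """Yield every 1- and 2-word sequence in a sentence (crude 'phrases')."""
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n])

    def build_alignment_table(aligned_pairs):
        """Estimate p(French phrase | English phrase) by relative frequency."""
        counts = defaultdict(lambda: defaultdict(int))
        for english, french in aligned_pairs:
            for e, f in product(phrases(english.split()),
                                phrases(french.split())):
                counts[e][f] += 1
        return {e: {f: c / sum(fs.values()) for f, c in fs.items()}
                for e, fs in counts.items()}

    # Two toy sentence pairs standing in for billions of aligned sentences.
    pairs = [
        ("the white wine isn't cheap", "le vin blanc n'est pas bon marché"),
        ("he had walked to the market", "il avait marché jusqu'au marché"),
    ]
    table = build_alignment_table(pairs)
    # table["white wine"] now maps candidate French phrases to values
    # playing the role of p1, p2, ... in the list above.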

Now write a computer program that takes English text as input and gives as output whatever French word sequence scores highest when the relevant pi values (the alignment probabilities) are combined. Forget all about the meanings and structures of words, phrases, and sentences; that’s oldthink. Just find the best guess at a suitable French word sequence, based on the alignment odds.
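In the same hedged spirit, here is what that best-guess program might look like in miniature, reusing the toy table from the sketch above. A real decoder searches over many competing segmentations and weighs in a target-language model; this one simply gambles, greedily, on the best local odds.

    def pseudo_translate(sentence, table, max_n=2):
        """Greedy phrase-by-phrase lookup: no grammar, no meaning, just odds."""
        words = sentence.split()
        output, i = [], 0
        while i < len(words):
            for n in range(max_n, 0, -1):  # prefer the longest phrase we know
                phrase = " ".join(words[i:i + n])
                if phrase in table:
                    # Gamble on the French phrase with the highest p value.
                    output.append(max(table[phrase], key=table[phrase].get))
                    i += n
                    break
            else:
                output.append(words[i])  # no data at all: pass the word through
                i += 1
        return " ".join(output)

    print(pseudo_translate("the white wine", table))
    # With this tiny corpus every alignment ties at the same count, so the
    # output is essentially arbitrary: the program has no idea what any of
    # it means, which is precisely the point.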

The errors attendant on this kind of pseudotranslation via gambling can be hilarious. A single Language Log post reports on a BBC TV show with a gravestone stating in Hebrew that the deceased was “pickled at great expense”; a Malaysian government edict banning “clothes that poke eye”; and an Israeli drink with a label promising that it contains no foreskin.

Unsupervised Google translation into English from Chinese yields some really spectacular howlers (browse some of Victor Mair’s posts on Language Log to see hundreds of examples, from the baffling lie fallow small and pave to the unpalatable-sounding fungus gnat turnovers).

And to those who think it is distasteful to mock Chinglish, I can only say that it is by no means just the Chinese whose troubles with statistical translation have made me giggle. Spanish too yields much unintentional hilarity: a restaurant offering a dish called attacked of elvers; a condominium development constructed with armed structure and crystals; etc.

People print such howlers every day. And the program itself can hardly help it: It ignores meaning, grammar, words, content, logic, plausibility, facts, geography, and common sense, and focuses obsessively on working out the odds of co-alignment.

Although we are aware that Google Translate has let us down before and shouldn’t be trusted, we use it anyway, having nowhere else to turn. And it does prove useful sometimes (I use it too).

My conjecture is that it is useful enough to constitute one more reason for not investing much in trying to get real NLP industrially developed and deployed.

NLP will come, I think; but when you take into account the ready availability of (1) Google search, (2) speech-driven applications aided by dialog design, and (3) the statistical pseudotranslation briefly discussed above, the cumulative effect is enough to reduce the pressure to develop NLP, and will probably delay its arrival for another decade or so.