The state of the machine translation art

Language Log 2014-07-31

I don't know any Hebrew. So when I recently saw a tweet in Hebrew on a Google Plus page discussing Gaza tunnel-building, I clicked (with some foreboding) on the "Translate" link to see what it meant. What I got was this:

Some grazing has hurt they Stands citizens Susan Hammer year

This does not even offer enough of an inkling to permit me to guess at what the writer of the original Hebrew might have been saying. It might as well have said "Grill tree ecumenical the fox Shove sample Quentin Garage plastic."

Linguists tend to argue that machine translation based entirely on statistical properties of large parallel corpora, without any guidance from lexical or syntactic information, is not going to work. And you can see why we say that. Hitting the "Translate" button may get you mediocre literal translations with limited errors for perhaps 80% of simple sentences in languages like French: those closely related to English and for which huge amounts of parallel text are available. But a lot of the time, especially for minor languages less well represented on the web, it will get you just about nothing.