Why Are We Still Waiting for Natural Language Processing?
Lingua Franca 2013-05-09
Try typing this, or any question with roughly the same meaning, into the Google search box:
Which UK papers are not part of the Murdoch empire?

Your results (and you could get identical ones by typing the same words in the reverse order) will contain an estimated two million or more pages about Rupert Murdoch and the newspapers owned by his News Corporation. Exactly what you did not ask for.
Putting quotes round the search string freezes the word order, but makes things worse: It calls not for the answer (which would be a list including The Daily Telegraph, the Daily Mail, the Daily Mirror, etc.) but for pages where the exact wording of the question can be found, and there probably aren’t any (except this post).
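Notice that the answer itself would be trivial to compute if the question's meaning were available in structured form: it is just the complement of one ownership set. A minimal sketch in Python makes the point; the ownership table is an illustrative toy, not a complete record of 2013 ownership, and the query function is hypothetical:

```python
# A sketch of the database half of the problem only. The step that
# turns "not part of the Murdoch empire" into a structured query is
# assumed to have happened already.

UK_PAPERS = {
    "The Daily Telegraph": "Telegraph Media Group",
    "Daily Mail": "DMG Media",
    "Daily Mirror": "Trinity Mirror",
    "The Times": "News Corporation",
    "The Sun": "News Corporation",
}

def papers_not_owned_by(owner: str) -> list[str]:
    """UK papers whose recorded owner is not the one given."""
    return sorted(p for p, o in UK_PAPERS.items() if o != owner)

print(papers_not_owned_by("News Corporation"))
# ['Daily Mail', 'Daily Mirror', 'The Daily Telegraph']
```

Everything hard about the original question happens before this code runs: recognizing that "not part of the Murdoch empire" denotes that complement.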
Machine answering of such a question calls for not just a database of information about newspapers but also natural language processing (NLP). I’ve been waiting for NLP to arrive for 30 years. Whatever happened?
In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.
To do this, computers would have to master three things. First, enough syntax to uniquely identify the sentence; second, enough semantics to extract its literal meaning; and third, enough pragmatics to infer the intent behind the utterance, and thus discern what should be done or assumed given that it was uttered.
Take Flying planes is dangerous as an example. The syntactic step includes identifying it as having a singular subject (a gerund-participial clause), accounting for the singular agreement form is. (The distinct sentence Flying planes are dangerous has a plural noun phrase as subject, hence plural agreement on are.)
The semantic step involves (at the very least) noting that Flying planes is dangerous talks about a risky activity. (Again, contrast this with Flying planes are dangerous, which is about aircraft being a danger.)
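To see the syntactic step in executable form, here is a minimal sketch using NLTK's chart parser with a deliberately tiny grammar. The grammar is an illustrative toy, not a serious fragment of English: the two subject types cover the same two words, and the agreement requirement on the verb is what selects between them.

```python
import nltk

# Toy grammar: GerundClause and NP_pl both span "flying planes",
# but only one can combine with each verb form. Purely illustrative.
grammar = nltk.CFG.fromstring("""
    S -> GerundClause VP_sg
    S -> NP_pl VP_pl
    GerundClause -> Ving Npl
    NP_pl -> Ving Npl
    VP_sg -> 'is' AdjP
    VP_pl -> 'are' AdjP
    AdjP -> 'dangerous'
    Ving -> 'flying'
    Npl -> 'planes'
""")

parser = nltk.ChartParser(grammar)

for verb in ("is", "are"):
    sentence = ["flying", "planes", verb, "dangerous"]
    for tree in parser.parse(sentence):
        print(" ".join(sentence))
        tree.pretty_print()
```

Run as written, each sentence receives exactly one parse: "is" forces the gerund-clause subject, "are" forces the plural noun phrase.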
Pragmatically, however, an utterance often conveys far more than its literal meaning. If Bob has just said “She gets so panicky when her husband is at work,” and Jane responds, “Well, flying planes is dangerous,” you will probably conclude that Bob and Jane know a nervous married woman whose spouse is a pilot, possibly a test pilot, and Jane sees that hazardous job as justifying some level of anxiety. Yet none of this was explicit in the two utterances. Such inferences rely on a complex process of common-sense pragmatic reasoning that we have no idea how to model computationally.
So set the pragmatic step aside; useful work could still be accomplished by a system limited to syntax and semantics. Why do we not find computer programs in general use that can analyze simple questions and respond with answers? Imagine sending off queries like these, by text or e-mail, to an NLP server:
- “Can your site be accessed without using Internet Explorer?”
- “What is the validity period for multiple-entry business visas?”
- “Do you offer telephone support from the USA?”
Imagine that the machine—not some underpaid message-taker in a Mumbai call center whom you only reach after 10 minutes listening to “The Girl from Ipanema”—could process the questions, look up the answers, and return appropriate responses in a few hundredths of a second. Even if occasionally it replied “Sorry, I didn’t understand the question,” it could still be a boon in the average case.
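The scaffolding of such a service is trivial to sketch. In the Python below, everything is hypothetical and the FAQ data is illustrative; the stand-in parse() uses exactly the crude keyword matching this post complains about, because the genuine syntactic-and-semantic version is the part that never arrived:

```python
# A toy question-answering service. In a real system, parse() would
# map a free-form question to a structured query via syntactic and
# semantic analysis; here it is a crude keyword reduction.

FAQ = {
    ("support", "telephone", "usa"): "Yes, 9am-5pm Eastern.",
    ("visa", "multiple-entry", "validity"): "Twelve months from issue.",
}

def parse(question: str):
    """Stand-in for real NLP: reduce the question to known keywords."""
    keywords = {w.strip("?.,").lower() for w in question.split()}
    for key in FAQ:
        if set(key) <= keywords:
            return key
    return None

def answer(question: str) -> str:
    key = parse(question)
    if key is None:
        return "Sorry, I didn't understand the question."
    return FAQ[key]

print(answer("Do you offer telephone support from the USA?"))
# Yes, 9am-5pm Eastern.
```

All the difficulty lives inside parse(); the lookup and the reply are a few lines each.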
One company was headed in that direction a few years ago, and had a nascent Wikipedia-based NLP question-answering service up and running in 2008: Powerset. But within a few weeks of its release, Microsoft made them an offer they did not refuse. Now, it seems, the science sleeps with the fishes: www.powerset.com simply redirects to Bing (a plain old keyword-based search engine with which Microsoft is trying to rival Google). Once again there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis. (Clever ad hoc tweaks make some search engines answer selected simple question forms—try Google on “What is the square root of 1531?”—but don’t be fooled by imitations.)
How could we have drifted into the second decade of the 21st century with absolutely no commercial NLP products? I believe the answer lies in three initially unexpected developments having little to do with NLP. Each enabled a previously near-useless technology to become a useful substitute, reducing the need for the real thing.
Next Monday (May 13, 2013) I’ll discuss the first of those three breakthroughs.