Keyword Search, Plus a Little Magic

Lingua Franca 2013-05-13

I promised last week that I would discuss three developments that turned almost-useless language-connected technological capabilities into something seriously useful. The one I want to take up first was introduced by Google toward the end of the 1990s, and it changed our whole lives, largely eliminating the need to have full sentences parsed and translated into a database query language.

The hunch that the founders of Google bet on was that simple keyword search could be made vastly more useful by taking the entire set of pages containing all of the search words and, rather than just returning it as the result, ranking its members by influence and showing the most influential first. What a page contains is not the only relevant thing about it: As with any academic publication, who values it and refers to it is also important. And that is (at least to some extent) revealed in the link structure of the Web.
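
The idea is easy to sketch. Below is a minimal, illustrative power-iteration version of link-based ranking in Python, in the spirit of the PageRank algorithm that Google's founders published; the four made-up pages, their links, and the damping factor are assumptions for the example, not anything from Google's actual system.

    # A minimal PageRank-style ranking over a made-up four-page "web".
    # Pages with more (and more influential) inbound links rank higher.
    links = {
        "pandas.example/breeding": ["zoo.example/faq"],
        "zoo.example/faq": ["pandas.example/breeding", "news.example/panda-cub"],
        "news.example/panda-cub": ["pandas.example/breeding"],
        "blog.example/my-trip": ["news.example/panda-cub"],
    }

    damping = 0.85  # the damping factor used in the published PageRank paper
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):  # power iteration: repeat until the ranks settle
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:  # each page passes rank to its link targets
                new_rank[target] += share
        rank = new_rank

    for page in sorted(rank, key=rank.get, reverse=True):
        print(f"{rank[page]:.3f}  {page}")

The page that everyone links to ends up at the top, and the page nobody links to ends up at the bottom, regardless of what either one says about itself.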

This idea led to the development of an astonishingly powerful technique for finding information and answering questions. Page ranking tends to melt away many of the problems that might have led one to think Natural Language Processing would sooner or later be a necessity.

You hardly need a system that can understand the sentence “What problems make breeding pandas in captivity difficult?” when the search string breeding captivity panda calls up a list of sites in which the top-ranking ones (those most referred to by others) contain exactly what you’re looking for.

There is scant need for a system that can parse “Are there lizards that do not have legs but are not snakes?” given that putting legless lizard in the Google search box gets you to various Web pages that answer the question immediately.

It is possible to find questions that are close to unanswerable using nothing but Google’s keyword search (and it is fun to try; I gave a simple example here), but it is difficult.

Google relies on at least four facts, all of them crucial, but especially the fourth one.

  1. Computer memory chips have become so cheap and so tiny that in an office-sized space you can pack enough random-access-memory units to store an utterly gigantic, automatically maintained concordance to the whole Web, augmented with copies of huge portions of what is on those sites.
  2. Networks and processors have become so fast that your search command can be delivered to a server far away and checked against the gigantic index in just hundredths of a second.
  3. The number of sites containing all of the words on a list (rather than just some of them) goes down rapidly with the length of the list, and much more rapidly when the words have low probabilities of occurrence (see the sketch after this list).
  4. Humans looking for a certain piece of information can on the whole be trusted to be smart enough to supply a list of words with the crucial property of having low probability in most texts but being guaranteed to occur in texts containing the desired information.
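
Facts 1 and 3 can be made concrete with a toy version of that concordance: an inverted index mapping each word to the set of pages containing it. The five one-line "documents" below are invented for illustration; intersecting the sets for each successive query word shows how quickly the candidate list shrinks, especially when a word is rare.

    # Toy inverted index (the "concordance" of fact 1) over invented documents.
    docs = {
        1: "pandas are native to china",
        2: "breeding pandas in captivity is difficult",
        3: "zoo announces panda breeding program",
        4: "captivity can stress many animals",
        5: "breeding programs for endangered species",
    }

    index = {}  # word -> set of ids of documents containing it
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)

    def search(*words):
        """Return ids of documents containing every query word (fact 3)."""
        result = None
        for word in words:
            postings = index.get(word, set())
            result = postings if result is None else result & postings
            print(f"after '{word}': {sorted(result)}")
        return result

    search("breeding", "captivity")
    # after 'breeding': [2, 3, 5]
    # after 'captivity': [2]

Each added word can only narrow the set, and a low-probability word narrows it drastically: if three words each occurred independently on, say, one page in a thousand, all three together would be expected on roughly one page in a billion.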

The combination is like magic. You can get very close to finding just the site you want simply by selecting a few words that won’t appear in most Web pages but will be in the ones you want to see. That is a nontrivial accomplishment (as the linguist Ron Kaplan, a founding staff member at Powerset, pointed out to me, it is akin to translating English into a very basic structureless language), and one that not everyone will excel at. But it works.

Where is that giant tokamak machine being built to harness nuclear fusion in confined plasma for power generation, by some huge international collaborative project whose name you can’t remember? Forget build, generation, giant, harness, in, machine, power, project. … Those words are much too common; they’re almost useless to you. But try tokamak international: The top hit is the home page of ITER, which is the project name, and with a click or two you can find the location (near Saint-Paul-lez-Durance, France).
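
That intuition about which words to forget is, in effect, the standard information-retrieval notion of inverse document frequency: the rarer a word is across the Web, the more it narrows the search. The page counts below are invented for illustration, not real Google statistics.

    import math

    # Invented document frequencies: how many of a billion pages contain each word.
    total_pages = 1_000_000_000
    pages_containing = {
        "power": 120_000_000,
        "project": 95_000_000,
        "international": 60_000_000,
        "tokamak": 40_000,
    }

    for word, df in pages_containing.items():
        idf = math.log(total_pages / df)  # higher = rarer = more useful as a query word
        print(f"{word:13s} idf = {idf:.1f}")

On these made-up numbers, tokamak scores far higher than power or project, which is exactly why it is the word worth typing.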

This doesn’t rely on artificial intelligence; it relies on your intelligence. It works so well that it has largely obviated question-answering by means of NLP. Devising computer programs that can understand the grammar and meaning of sentences remains an academic research challenge, and it could still be very useful, but the pressure to provide it in a hurry has receded because of Google’s innovation.

In my next post I’ll describe a second development that has had a similar effect on the need for NLP.