Recommended For You

Language Log 2015-08-18

Alexander Spangher, "Building the Next New York Times Recommendation Engine", NYT 8/11/2015:

The New York Times publishes over 300 articles, blog posts and interactive stories a day.

Refining the path our readers take through this content — personalizing the placement of articles on our apps and website — can help readers find information relevant to them, such as the right news at the right times, personalized supplements to major events and stories in their preferred multimedia format.

Spangher describes "Content-Based Filtering", which depends on the distribution of words and word-sequences in the articles you've previously read; and "Collaborative Filtering", which looks at the articles read by other readers who have read some of the same articles that you have. He notes problems with each approach, leading to their new algorithm,

. . . inspired by a technique, Collaborative Topic Modeling (CTM), that (1) models content, (2) adjusts this model by viewing signals from readers, (3) models reader preference and (4) makes recommendations by similarity between preference and content.
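The contrast between the two baseline approaches can be sketched with toy data (the reader-article matrix below is invented for illustration, not drawn from the NYT system). Collaborative filtering treats an article as nothing more than the set of readers who clicked on it, and judges two articles similar when their readerships overlap, here measured by cosine similarity:

```python
import numpy as np

# Collaborative filtering: articles are described only by who read them.
# Rows = articles, columns = readers; 1 = "this reader read this article".
# (Toy data, purely for illustration.)
reads = np.array([
    [1, 1, 0, 0],   # article 0
    [1, 1, 1, 0],   # article 1
    [0, 0, 1, 1],   # article 2
])

def cosine_sim(a, b):
    """Cosine similarity between two binary read-vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Articles 0 and 1 share two readers, so they look similar;
# articles 0 and 2 share no readers, so they look unrelated.
sim_01 = cosine_sim(reads[0], reads[1])
sim_02 = cosine_sim(reads[0], reads[2])
```

Content-based filtering would instead build the vectors from the words in each article; the problem Spangher notes is that each approach alone misses what the other captures, which is the gap CTM is meant to close.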

He links to the paper that inspired them (Chong Wang and David Blei, "Collaborative Topic Modeling for Recommending Scientific Articles", KDD 2011), and discusses how they've met a "three-part challenge":

Part 1: How to model an article based on its text. Part 2: How to update the model based on audience reading patterns. Part 3: How to describe readers based on their reading history.

The solution, in brief, is to use Latent Dirichlet Allocation to place articles in a low-dimensional topic space; to use the Collaborative Topic Modeling method to iteratively adjust article placement based on the apparent topic interests of each article's readers; and to use a weighted average of the topic-space position of articles read as "a quick way to calculate reader preferences".
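A toy version of the first and third steps might look like the following (the corpus is invented, and the CTM adjustment step, which nudges article vectors toward their readers' apparent interests, is left out): fit LDA to place each article in topic space, then take a weighted average of the topic vectors of the articles a reader has read as that reader's preference vector.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus: two "politics" articles, two "cooking" articles.
docs = [
    "election senate vote campaign ballot",
    "campaign vote election governor ballot",
    "recipe butter flour oven bake",
    "bake oven flour sugar recipe",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)   # one topic-proportion vector per article

# Reader preference as a weighted average of the topic vectors of the
# articles read (the weights might encode recency or reading time).
read_ids = [0, 1]
weights = np.array([0.7, 0.3])
preference = weights @ theta[read_ids]

# Recommend by similarity between preference and content: score each
# unread article by the cosine of its topic vector with the preference.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = {i: cosine(preference, theta[i]) for i in (2, 3)}
```

Since each LDA row is a probability distribution over topics, the weighted average is one too, which is what makes it usable as "a quick way to calculate reader preferences".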

If you're interested in this sort of thing, read all of Spangher's piece, and the 2011 CTM article, and perhaps some of the 272 articles that Google Scholar lists as citing the CTM article.

But whether you're interested in the details or not, you should take note of an increasingly important kind of technology that doesn't have a name, as far as I know. It's emerged from 50 years of research, and 20 years of increasingly broad application.

These techniques apply to collections of texts that are associated with a number of other features — in the current example it's articles and readers (and maybe dates and places and authors?); it might be web pages with their domain information and link graph; it might be a bibliometric network of authors, affiliations, journals, publishers, articles; it might be a network of twitter authors, times, places, hashtags; or product reviews along with star ratings, author IDs and product descriptions; or the text of open-ended survey responses along with multiple-choice outcomes and subject demographics; or Facebook posts with authors' demographic information and personality-test results; or collections of real-estate listings with locations and prices and sales information; or job listings with information about applicants and outcomes; or . . .

Recommendation systems are just one of many applications. The problems to be solved range from easy to impossible, and the algorithms used range from simple to complex, and from obvious to subtle and surprising. (Sometimes the most subtle and surprising methods are also the simplest…)

There are obvious (and existing) applications in commerce, in medicine, in sociology, in law, in education, in literary studies — given the increasing digitization of communication, it's hard to think of any domain where this kind of technology is not already applied or soon to be applied. Most large companies have at least dipped their toes into this area, and some of them have plunged in enthusiastically. And new companies are springing up like the proverbial mushrooms after a rain.

There are obviously close connections to non-textual problems. The "collaborative filtering" method is content-neutral, so that a music recommendation system using this technique is basically identical to an article recommendation system — but as Spangher observes, there are good reasons to add content-based information to systems based purely on preference networks. For many other applications, combining content analysis with other dimensions of information is essential. And in a large range of cases, the most accessible and useful source of content is text.

Given all of this, it's odd that the technology we're talking about doesn't have a name.