Netflix's stoked-up algorithms

Numbers Rule Your World 2014-03-03

At the start of the year, The Atlantic published a very nice, long article about Netflix's movie recommendation algorithm. You may remember that this algorithm (internally known as Cinematch) received a $1 million makeover several years ago (the Netflix Prize), except that the prize-winning entry was deemed too complex, and its incremental value too small, to be put into production.

The reporter, Alexis Madrigal, noticed that Netflix has shifted attention from the queue of recommended movies to providing (micro-)genres of movies you might be interested in. His article is a great example of powerful data journalism: he reverse-engineered the internal structure of Netflix's new algorithm by extracting all of the keywords ("About Horses", "Critically Acclaimed", "Visually Striking", to name a few), and then creating all sensible combinations of these keywords (e.g. "Critically Acclaimed, Visually Striking Movies About Horses"), producing the roughly 80,000 possible microgenres used by Netflix. (It's clear that Netflix management endorsed this exercise and article but it's not clear how much proactive support they provided.)

One of my favorite columnists, Felix Salmon, reacted negatively to the change in algorithms, titling his post "Netflix's Dumbed-Down Algorithm". He interpreted the change as foreshadowing the day when Netflix can no longer offer every movie a user places in his or her queue because third-party content providers have ratcheted the costs too high. This is a longstanding weakness in Netflix's streaming business model.

Felix lamented that the genre-driven recommendations would be far inferior to the original recommendations:

The original Netflix prediction algorithm — the one which guessed how much you’d like a movie based on your ratings of other movies — was an amazing piece of computer technology, precisely because it managed to find things you didn’t know that you’d love. More than once I would order a movie based on a high predicted rating...

The next generation of Netflix personalization, by contrast, ratchets the sophistication down a few dozen notches: at this point, it’s just saying “well, you watched one of these Period Pieces About Royalty Based on Real Life, here’s a bunch more”.

***

Felix is right on the business model but misses the mark on the analytics. As someone who builds predictive models, I had the opposite reaction when reading The Atlantic's piece. I thought Netflix's data engineers learned something from the Netflix Prize "fiasco".

The major change to the analytical approach is the shift from predicting whether you'd like a movie to predicting whether you'd watch a movie. This shift makes a lot of sense to Netflix as a business. It is sensible even from the user's perspective: since when do we never watch a bad movie? (Even the movies we place in the queue ourselves could turn out to be bad.)

One big problem with the Netflix Prize was its singular focus on the RMSE (root mean squared error) metric, which, roughly speaking, measures the average error of the predicted ratings against the actual ratings. The ratings data, though, is extremely skewed, making an average error criterion worse than misleading. By skew, I mean (a) a very small number of popular movies receive the majority of the ratings and (b) a small number of highly active users contribute the majority of movie ratings. Put differently, missing data is far and away the most important feature of the data.
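For readers who have not met RMSE before, here is a minimal sketch of the calculation in Python. The four ratings and predictions are made up for illustration and are not Netflix data; the function name `rmse` is my own.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between paired predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on a 1-5 star scale (illustrative only).
actual = [4, 5, 3, 1]
predicted = [3.5, 4.5, 3.0, 2.0]

print(round(rmse(predicted, actual), 3))
```

Note that the metric only averages over user-movie pairs that have an actual rating. Pairs with no rating simply never enter the sum, which is why the skew matters so much.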

Because of missing data, it is next to impossible to get good predictions for niche movies (with few ratings) or for users who do not actively feed signals into the algorithm. Improving RMSE by 10 percent does not mean every user's prediction improved by 10 percent. The improvement is likely concentrated in user-movie pairings for which there is sufficient data to work with. It would be enlightening if someone analyzed the performance of the winning algorithms by segments of users (based on the amount of prior data to work with).
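To make the arithmetic concrete, here is a toy sketch of the kind of segmented analysis I have in mind. Everything in it is an assumption for illustration: the two segments ("heavy" and "light" raters), the 90/10 split of rated pairs, and the error sizes are invented, not taken from the Netflix Prize data.

```python
import math
from collections import defaultdict

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def rmse_by_segment(records):
    """Group (segment, error) records and compute RMSE within each segment."""
    grouped = defaultdict(list)
    for segment, error in records:
        grouped[segment].append(error)
    return {segment: rmse(errs) for segment, errs in grouped.items()}

# Invented errors: heavy raters supply 90 of every 100 rated pairs.
baseline = [("heavy", 1.0)] * 90 + [("light", 1.0)] * 10
# Hypothetical new model: only the heavy-rater segment improves.
improved = [("heavy", 0.888)] * 90 + [("light", 1.0)] * 10

overall_before = rmse([e for _, e in baseline])
overall_after = rmse([e for _, e in improved])
print(f"overall RMSE: {overall_before:.3f} -> {overall_after:.3f}")

for segment, value in rmse_by_segment(improved).items():
    print(f"{segment} RMSE after: {value:.3f}")
```

In this made-up setup, the overall RMSE falls by roughly 10 percent even though the light raters see no improvement at all. The headline number is driven by the segment that dominates the data.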

Now, consider predicting what you'd watch next based on your viewing behavior (and that of other users). For every user and movie combination, the user either has or has not watched the movie. Just like that, the missing-data issue vanishes. The result of what Felix sees as "dumbing down" may be a stoking up.
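A small sketch shows the contrast between the two data layouts. The users, movies, ratings and viewing records below are all hypothetical, and this says nothing about how Netflix actually stores or models its data.

```python
# Hypothetical users and movies (illustrative only).
users = ["user_1", "user_2", "user_3"]
movies = ["movie_a", "movie_b", "movie_c"]

# Ratings exist only where a user chose to rate a movie.
ratings = {
    ("user_1", "movie_a"): 4,
    ("user_1", "movie_b"): 5,
    ("user_2", "movie_a"): 2,
}

# Viewing records: every (user, movie) pair is either in this set or not.
watched = {
    ("user_1", "movie_a"),
    ("user_1", "movie_b"),
    ("user_2", "movie_a"),
    ("user_3", "movie_b"),
}

# Rating matrix: None marks a missing rating.
rating_matrix = [[ratings.get((u, m)) for m in movies] for u in users]
# Watch matrix: 1 if watched, 0 if not; no cell is left undefined.
watch_matrix = [[int((u, m) in watched) for m in movies] for u in users]

def missing_share(matrix):
    """Fraction of cells in the matrix that hold no value (None)."""
    cells = [cell for row in matrix for cell in row]
    return sum(cell is None for cell in cells) / len(cells)

print(f"missing in rating matrix: {missing_share(rating_matrix):.0%}")
print(f"missing in watch matrix:  {missing_share(watch_matrix):.0%}")
```

Even in this tiny example, most of the rating matrix is empty while every cell of the watch matrix has a defined value.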

***

As I pointed out in Chapter 5 of Numbersense (in talking about Groupon's bid to personalize offers), every business faces a set of conflicting objectives when trying to "personalize" marketing to customers. I believe this Netflix shift shows they have found a good, balanced solution.