The gap between data mining and predictive models
Win-Vector Blog 2014-02-21
The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that one is immediately ready to implement something like what is claimed in: “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” once an association is plotted.
Producing a revenue improving predictive model is much harder than mining an interesting association. And this is what we will discuss here.
The following very interesting graph is shared in “The Formation of Love”:
The data preparation is not explained in great detail (that is okay, this is popular blog post- not a formal paper requiring a material and methods section). But it looks like the y-axis of the graph is something like directed timeline posts per day, the x-axis is day number (relative to when a relationship is declared), and each dot represents a person or a group of people (more on this last point in a bit). The point of the graph is: there is an increase in timeline posts between people who end up declaring a relationship on Facebook and then a drop off after the declaration of relationship. This is interesting behavioral data. The unwarranted expectation is with such a strong relation you could use immediately use daily posting rates to identify who is likely to declare a relationship and when this might happen (this would be valuable as these are unobserved until “day 0″).
However, as many other commenters have noticed: the range of posts per day is very small ranging from 1.62 well before declaration, to 1.65 right before declaration and settling to 1.54 well after declaration. And the variation away from the smoothing line seems unbelievable low (the fit is way too good). Finally the dots are not grouped at integer rates or rates with small denominators (suggesting each dot is in fact already aggregated from many users, and it is likely a jitter has been added to make the plot more legible). The possibility of aggregation is consistent with the start of the article which says: “This research has been conducted on anonymized, aggregated data.” The question is: how much aggregation is going on? Trends that can be seen in large aggregates are often very hard to apply to individuals. Or alternately very weak individual effects can be made to stand out if you aggregate enough data. There is no question in my mind we have a valid statistically significant result in front of us- it is just is there enough practical or “clinical significance” to immediately drive usable individual predictions?
To start our investigation let’s get an approximate copy of the data from the graph. This can be done by running the graph through an online digitizing tool such as: WebPlotDigitizer. A little work yields the data file NMsg.csv. And a small bit of R code lets us reproduce the graph.
# load librariesrequire(RCurl)require(ggplot2)require(mgcv)require(reshape2)# Load data, and make sure x is increasingurlBase <- 'https://raw2.github.com/WinVector/Examples/master/MiningGap/'mkCon <- function(nm) { textConnection(getURL(paste(urlBase,nm,sep='')))}d <- read.table(mkCon('NMsg.csv'),header=T,sep=',')d <- d[order(d$DaysBeforeAfterRelationship),]# plot something similar to original graphggplot(data=d,aes(x=DaysBeforeAfterRelationship, y=NumberOfTimelinePosts)) + geom_point() + stat_smooth(method = "gam", formula = y ~ s(x))
Which yields a graph fairly similar to the original:
We generated our intensity estimates as follows:
earlyIntensity <- mean( d[d$DaysBeforeAfterRelationship >= -100 & d$DaysBeforeAfterRelationship < -80, 'NumberOfTimelinePosts'])print(earlyIntensity)## [1] 1.622637nearIntensity <- mean( d[d$DaysBeforeAfterRelationship >= -20 & d$DaysBeforeAfterRelationship < 0, 'NumberOfTimelinePosts'])print(nearIntensity)## [1] 1.653082afterIntensity <- mean( d[d$DaysBeforeAfterRelationship >= 80 & d$DaysBeforeAfterRelationship < 100, 'NumberOfTimelinePosts'])print(afterIntensity)## [1] 1.539452
And here is the problem for a predictive model. If we assume (without justification) that users are generating these messages with from a Poisson model with the above given intensities we find it would be difficult to tell from the number of posts if a user is 100 days from declaring a relationship, about to declare one, or long after declaring one. Now the Poisson assumption is unjustified (and whole families of methods, such as over-dispersed models, have been developed to work around its limits) but it is a reasonable first guess for any sort of event counts per interval problem.
Below is a bar-chart of the probability distribution (under our Poisson model with the estimated intensities) of the observed number of posts expected from a person right before they declare a relationship compared to a person long after they declared a relationship (the pair of situations most easy to tell apart by posting rates):
The issue is: the two distributions are almost indistinguishable. We can’t tell how many posts a person would make given their group (long time from declaration, about to declare, or long after declaration), therefore we can’t reliably tell what group a person is in given their post count (a sort of contrapositive version of Bayes’ law).
The code to make the bar chart is as follows:
# What would individuals generating posts# as a Poisson process with a given intensity# look like around 100 days before forming# a relationship, right before forming a relationship,# and long after a relationship is formed?density <- data.frame(NumberOfTimelinePosts=0:10)density$nearCounts <- dpois( density$NumberOfTimelinePosts, lambda=nearIntensity)density$afterCounts <- dpois( density$NumberOfTimelinePosts, lambda=afterIntensity)dm <- melt(density, id.vars=c('NumberOfTimelinePosts'), variable.name='group', value.name='density')ggplot(data=dm) + geom_bar(stat='identity',position='dodge', aes(x=NumberOfTimelinePosts,y=density,fill=group)) + scale_x_continuous(breaks=density$NumberOfTimelinePosts)
But our bar chart is ignoring even larger blockers to making a usable predictive model. 100 days before a person declares a relationship we would not know how many days they are away from declaring (we know after they declare how many days ago they declared, but we don’t know the future) and we don’t know a priori who they are thinking of declaring with. A business doesn’t want a system that uses things that are hard to know (who will get into a relationship with whom) to predict easy measurements (post counts) after the fact. A business wants a real time system that uses things that are easy to measure (post counts) to predict future events that are valuable (for example: who will declare a relationship with whom and when). The original graph shows a relation between things, but a business needs a directional inference (from what is easy and cheap to things that are valuable).
And this is the jump: a plotted relation is not always immediately the basis for a usable predictive model. The popular press and management do not always seem to remember this distinction.
We started the article guessing that the dots must be aggregations of users (otherwise such tight error bars would be beyond what could be expected). Let’s use our (unjustified) Poisson process assumption to estimate how many users are likely aggregated in each dot. To attempt this we need to appeal to three facts:
- The mean equals the variance for a Poisson distribution.
- For independent identically distributed data the expected square distance between pairs of points is twice the variance.
- When you average or aggregate k independently identically distributed items this average has an expected variance of 1/k of the original items.
For individuals emitting posts in a Poisson distribution of intensity 1.5 posts per day we would expect a variance of 1.5 yielding error bars of around +-1.2, much larger than on the graphs we have seen. So we expect each dot is the aggregation of k individuals for some large k.
Suppose x is a random variable that is the average of k independent identically distributed Poisson distributed random variables (each with intensity lambda). Then we expect: E[x]/Var[x] to be independent of lambda and approximately equal to k (we are a little disturbed that our estimate is not dimensionless, but we must remember that Poisson distributions of different intensities have different shapes). Under this reasoning we estimate each dot on the graph represents almost 70,000 individuals and the approximately 193 dots on the graph together represent a study of about 13,000,000 individuals. Under our (unjustified) Poisson assumption somebody with less data than this could not get an aggregate graph as tight as the one presented (at such low activity rates, however it is possible that aggregating data across many days could in fact give usable predictive models with less data).
This illustrates the large gap between a good data mining result and a profitable predictive model.
The R-code to estimate the de-trended variance and k (the degree of aggregation) is a as follows:
# If the original data was generated from# independent individuals with slowly varying# Poisson intensity, then we could estimate the# degree of aggregation to see the mean/variance# ration in the plot.odd <- 2*(1:(dim(d)[[1]]/2))-1varEst <- sum((d[odd,'NumberOfTimelinePosts'] -d[odd+1,'NumberOfTimelinePosts'])^2)/(2*length(odd))print(varEst)## [1] 2.289848e-05meanEst <- mean(d$NumberOfTimelinePosts)print(meanEst)## [1] 1.599492kEst <- meanEst/varEstprint(kEst)## [1] 69851.47print(dim(d)[[1]])## [1] 193print(kEst*dim(d)[[1]])## [1] 13481334
More Win-Vector LLC work on post-hoc inspection of results include: