Spurious Correlations Everywhere: the Tragedy of Big Data

Lingua Franca 2013-03-15

I promised (last Thursday) to say a little more about Keith Chen’s claim that obligatory future-tense marking in your language makes you less prudent in safeguarding your health and wealth.

Chen’s data on languages comes from the World Atlas of Language Structures (WALS), and his evidence on prudence from the World Values Survey (WVS). Both are fully Web-accessible. Sean Roberts, who studies language evolution at the Max Planck Institute for Psycholinguistics in Nijmegen, decided to investigate the other linguistic factors treated in WALS to see how they related to prudence. He compared the goodness of fit for linear regressions on each of a long list of properties of languages (the independent variables), using as the dependent variable the answers that speakers gave to the WVS question “Did you save money last year?”

The results (see this blog post for an informal account) were jaw-dropping. He found that dozens of linguistic variables were better predictors of prudence than future marking: whether the language has uvular consonants; verbal agreement of particular types; relative clauses following nouns; double-accusative constructions; preposed interrogative phrases; and so on—a motley collection of factors that no one could plausibly connect to 401(k) contributions or junk-food consumption.

The implication is that Chen may have underestimated the myriads of meaningless correlations that can be found in large volumes of data about human affairs.

Roberts and a colleague recently published a paper on this topic (“Social Structure and Language Structure: the New Nomothetic Approach” by Sean Roberts and James Winters, Psychology of Language and Communication 16.2 [2012], 89-112). They noted several zany positive correlations of language with behavior; for example, people who speak a subject-object-verb language (like Japanese, Turkish, or Hindi) have more children on average than do people who speak a subject-verb-object language (like English, Indonesian, or Swahili).

Nassim Taleb’s Antifragile (2012, p. 417, quoted by James Winters in a blog comment) contains a relevant remark about why such things might happen: “In large data sets, large deviations are vastly more attributable to noise (or variance) than to information (or signal). … The more variables, the more correlations that can show significance. … Falsity grows faster than information.”

We should expect correlations that are statistically significant but ultimately meaningless to pop up all over the place once large quantities of data are available—especially with regard to something like language, given the difficulty of controlling adequately for cultural diffusion, geographical proximity, shared origins, and intervariable linkage.
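A small simulation makes the point concrete. The sketch below is an illustration only, not Roberts's analysis and not based on WALS or WVS data: it generates a random "outcome" and 1,000 random "predictors" that by construction have nothing to do with it, then counts how many clear a conventional 5 percent significance cutoff. The function names, the sample size of 100, and the large-sample cutoff of 1.96/sqrt(n) are all assumptions made for the example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_obs=100, n_vars=1000, seed=0):
    """Count pure-noise predictors that look 'significant' at roughly p < 0.05."""
    rng = random.Random(seed)
    # Hypothetical outcome: random numbers standing in for a savings measure.
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    # Large-sample approximation to the two-sided 5% cutoff for |r|.
    threshold = 1.96 / math.sqrt(n_obs)
    hits = 0
    for _ in range(n_vars):
        # Each predictor is independent noise, unrelated to the outcome.
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson_r(noise, outcome)) > threshold:
            hits += 1
    return hits


spurious = count_spurious()
print(f"{spurious} of 1000 noise variables pass the 5% cutoff")
```

Roughly one noise variable in twenty should be expected to pass, so a list of a thousand candidates yields dozens of "findings" with no signal behind any of them. Real linguistic variables are worse than this toy case, because they are not independent of one another.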

I suspect that Chen’s correlations mean nothing at all: There is no causal link, and we do not need an explanatory story. In the kind of world we live in, you wrestle every day with a swirling mass of inexplicable correlations, and then you die.

But of course I could be wrong. The issues under consideration are at root empirical. Chen and Roberts have been in contact via the blogosphere to exchange ideas, and a few days ago Roberts explored a new idea that Chen suggested: Investigate how much extra variance a particular linguistic variable accounts for after nonlinguistic variables such as age, sex, and number of children are controlled for.

Starting from a model of the propensity to save built on a whole series of nonlinguistic control variables (age, sex, employment status, marital status, education, religion, number of children, WVS survey year), Roberts looked at how much the fit improved when a WALS linguistic variable was added to the model, measuring the improvement with an F-statistic on the reduction in residual error.
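The comparison described above is, in general form, a nested-model F-test: fit the model with the controls alone, fit it again with one extra variable, and ask whether the drop in residual error is bigger than chance would produce. The sketch below shows that calculation on invented data. It is not Roberts's code; the two controls, the "relevant" and "irrelevant" predictors, and the helper names `ols_rss` and `nested_f` are all hypothetical, and ordinary least squares is assumed for simplicity.

```python
import random


def ols_rss(rows, y):
    """Residual sum of squares from least-squares regression of y on rows.

    Each row is a list of predictor values; an intercept is added here.
    Solves the normal equations by Gauss-Jordan elimination with pivoting.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, yi in zip(X, y))


def nested_f(controls, extra, y):
    """F-statistic for adding one extra predictor to a controls-only model."""
    n = len(y)
    rss_reduced = ols_rss(controls, y)
    full_rows = [c + [e] for c, e in zip(controls, extra)]
    rss_full = ols_rss(full_rows, y)
    n_params_full = len(full_rows[0]) + 1  # predictors plus intercept
    # One added parameter, so the numerator has one degree of freedom.
    return (rss_reduced - rss_full) / (rss_full / (n - n_params_full))


# Invented data: two controls, one predictor that matters, one that does not.
rng = random.Random(1)
n = 200
controls = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
relevant = [rng.gauss(0, 1) for _ in range(n)]
irrelevant = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * c[0] - 0.3 * c[1] + 0.8 * r + rng.gauss(0, 1)
     for c, r in zip(controls, relevant)]

f_relevant = nested_f(controls, relevant, y)
f_irrelevant = nested_f(controls, irrelevant, y)
print(f"F (relevant) = {f_relevant:.1f}, F (irrelevant) = {f_irrelevant:.1f}")
```

Ranking candidate variables by this statistic asks what each one adds beyond the controls, which is a stricter question than asking how well each predicts the outcome on its own. It still does not protect against the problem of testing many candidates at once.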

His preliminary results are presented here. This time Chen’s future time-reference marking variable ranks near the top in predictive power. The one variable that outranks it is whether the language has a grammatical perfective/imperfective aspect distinction, which also relates to temporal properties. That is much more supportive of Chen.

Of course, my remarks about dicey causal intuitions still apply; and it is still quite problematic for Chen that the third-ranking linguistic factor is whether the language has a velar nasal consonant: Why should having an ng-sound, as in singer, predict the savings behavior of the speakers of a language almost as well as the allegedly relevant future tense and perfective aspect marking? But still, this story may not be over.

Initial reports of Chen’s work proved to be an invitation to silliness, which the world’s media accepted enthusiastically (“Why Speaking English Can Make You Poor When You Retire” was the BBC’s headline). But asinine press reports about scientific results don’t nullify them. And in this case the results are still coming in.