The failure of null hypothesis significance testing when studying incremental changes, and what to do about it

Statistical Modeling, Causal Inference, and Social Science 2017-12-26

A few months ago I wrote a post, “Cage match: Null-hypothesis-significance-testing meets incrementalism. Nobody comes out alive.” I soon after turned it into an article with the title given above and the following abstract:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. However, in a world in which measurements are noisy and effects are small, this will not work: selection on statistical significance leads to effect sizes which are overestimated and often in the wrong direction. After a brief discussion of two examples, one in economics and one in social psychology, we consider the procedural solution of open post-publication review, the design solution of devoting more effort to accurate measurements and within-person comparisons, and the statistical analysis solution of multilevel modeling and reporting all results rather than selection on significance. We argue that the current replication crisis in science arises in part from the ill effects of null hypothesis significance testing being used to study small effects with noisy data. In such settings, apparent success comes easy but truly replicable results require a more serious connection between theory, measurement, and data.

The body of the article begins:

A standard mode of inference in social and behavioral science is to establish stylized facts using statistical significance in quantitative studies. A “stylized fact”—the term is not intended to be pejorative—is a statement, presumed to be generally true, about some aspect of the world. For example, the experiments of Stroop and of Kahneman and Tversky established stylized facts about color perception and judgment and decision making. A stylized fact is assumed to be replicable, and indeed those aforementioned classic experiments have been replicated many times. At the same time, social science cannot be as exact as physics or chemistry, and we recognize that even the most general social and behavioral rules will occasionally fail. Indeed, one way we learn is by exploring the scenarios in which the usual laws of psychology, politics, economics, etc., fail.

The recent much-discussed replication crisis in science is associated with many prominent stylized facts that have turned out not to be facts at all (Open Science Collaboration, 2015, Jarrett, 2016, Gelman, 2016b). Prominent examples in social psychology include embodied cognition, mindfulness, ego depletion, and power pose, as well as sillier examples such as the claim that beautiful parents are more likely to have daughters, or that women are three times more likely to wear red at a certain time of the month.

These external validity problems reflect internal problems with research methods and the larger system of scientific communication. . . .

At this point it is tempting to recommend that researchers just stop their p-hacking. But unfortunately this would not make the replication crisis go away! . . . eliminating p-hacking is not much of a solution if this is still happening in the context of noisy studies.

Null hypothesis significance testing (NHST) only works when you have enough accuracy that you can confidently reject the null hypothesis. You get this accuracy from a large sample of measurements with low bias and low variance. But you also need a large effect size, or at least an effect size that is large compared to the accuracy of your experiment.

But we’ve grabbed all the low-hanging fruit. In medicine, public health, social science, and policy analysis we are studying smaller and smaller effects. These effects can still be important in aggregate, but each individual effect is small. . . .
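To see that mechanism concretely, here is a minimal simulation sketch, using made-up numbers (the true effect and standard error below are purely illustrative, not taken from any study discussed here). When the true effect is small relative to the noise, the estimates that happen to clear p < 0.05 are necessarily exaggerated and are sometimes in the wrong direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) numbers: a small true effect, measured noisily.
true_effect = 0.1    # assumed underlying effect, in arbitrary units
std_error = 0.5      # standard error of each study's estimate
n_studies = 100_000  # number of simulated studies

# Each simulated study yields one noisy, unbiased estimate of the true effect.
estimates = rng.normal(true_effect, std_error, size=n_studies)

# Selection on statistical significance: keep only estimates with
# |z| > 1.96, i.e. two-sided p < 0.05 against a null of zero.
significant = np.abs(estimates / std_error) > 1.96
sig_estimates = estimates[significant]

print(f"share reaching significance:        {significant.mean():.3f}")
print(f"mean significant estimate:          {sig_estimates.mean():.2f}")
print(f"exaggeration of |estimate|:         {np.abs(sig_estimates).mean() / true_effect:.1f}x")
print(f"share of significant w/ wrong sign: {(sig_estimates < 0).mean():.2f}")
```

With these numbers, any estimate that reaches significance is many times larger in magnitude than the true effect, and a nontrivial fraction of the significant estimates point the wrong way.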

I then discuss two examples: the early-childhood intervention study of Gertler et al., which we’ve discussed many times, and a recent social-psychology paper by Burum, Gilbert, and Wilson that happened to come up on the blog around the time I decided to write this paper.

The article discusses various potential ways that science can do better, concluding:

These solutions are technical as much as they are moral: if data and analysis are not well suited for the questions being asked, then honesty and transparency will not translate into useful scientific results. In this sense, a focus on procedural innovations or the avoidance of p-hacking can be counterproductive in that it will lead to disappointment if not accompanied by improvements in data collection and data analysis that, in turn, require real investments in time and effort.

To me, the key point in the article is that certain classical statistical methods designed to study big effects will crash and burn when used to identify incremental changes of the sort that predominate in much of modern empirical science.

I think this point is important; in some sense it’s a key missing step in understanding why the statistical methods that worked so well for Fisher/Yates/Neyman etc. are giving us so many problems today.

P.S. There’s nothing explicitly Bayesian in my article at all, but arguably the whole thing is Bayesian in that my discussion is conditional on a distribution of underlying effect sizes: I’m arguing that we have to proceed differently given our current understanding of this distribution. In that way, this new article is similar to my 2014 article with Carlin where we made recommendations conditional on prior knowledge of effect sizes without getting formally Bayesian. I do think it would make sense to continue all this work in a more fully Bayesian framework.
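For readers who want to see what this kind of reasoning looks like in practice, here is a rough sketch of a design analysis conditional on an assumed effect size and standard error, in that same not-formally-Bayesian spirit; the function and the numbers plugged in at the end are illustrative assumptions, not taken from the article:

```python
import numpy as np
from scipy import stats

def design_analysis(true_effect, std_error, alpha=0.05, n_sims=1_000_000, seed=1):
    """Simulation-based design analysis under a normal approximation.

    Conditional on an assumed true effect and standard error, report:
    - power: probability of a statistically significant result,
    - wrong-sign rate among significant results,
    - average exaggeration of significant estimates relative to the truth.
    """
    rng = np.random.default_rng(seed)
    threshold = stats.norm.ppf(1 - alpha / 2) * std_error

    estimates = rng.normal(true_effect, std_error, size=n_sims)
    significant = np.abs(estimates) > threshold

    power = significant.mean()
    wrong_sign = (np.sign(estimates[significant]) != np.sign(true_effect)).mean()
    exaggeration = np.abs(estimates[significant]).mean() / abs(true_effect)
    return power, wrong_sign, exaggeration

# Illustrative numbers: a plausibly small true effect and a noisy study.
power, wrong_sign, exaggeration = design_analysis(true_effect=0.5, std_error=2.0)
print(f"power ~ {power:.2f}, wrong-sign rate ~ {wrong_sign:.2f}, "
      f"exaggeration ~ {exaggeration:.1f}x")
```

The answers here are driven entirely by what you are willing to assume about the underlying effect size, which is exactly the sense in which the reasoning is implicitly Bayesian.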
