What’s wrong with null hypothesis significance testing

Statistical Modeling, Causal Inference, and Social Science 2019-12-04

Following up on yesterday’s post, “What’s wrong with Bayes”:

My problem is not just with the method (although I do have problems with the method) but also with the ideology.

My problem with the method

You’ve heard this a few zillion times before, and not just from me. Null hypothesis significance testing collapses the wavefunction too soon, leading to noisy decisions: bad decisions. My problem is not with “false positives” or “false negatives”—in my world, there are no true zeroes—but rather that a layer of noise is being added to whatever we might be able to learn from data and models.
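To see how that layer of noise arises, here’s a minimal simulation sketch (mine, not from the post; the effect size and standard error are made up for illustration): a small true effect is measured with error, and we keep only the replications that clear p < 0.05.

```python
# Sketch: what the significance filter does to estimates of a small effect.
# All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.2   # hypothetical small true effect
se = 0.5            # standard error of each replication's estimate
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
z = estimates / se
significant = np.abs(z) > 1.96          # two-sided p < 0.05

print(f"proportion significant:            {significant.mean():.2f}")
print(f"mean estimate, all replications:   {estimates.mean():.2f}")
print(f"mean |estimate|, significant only: {np.abs(estimates[significant]).mean():.2f}")
```

The filter guarantees exaggeration here: to be significant at all, an estimate must exceed 1.96 × 0.5 ≈ 0.98 in magnitude, about five times the true effect of 0.2. Selecting on the threshold doesn’t remove noise; it amplifies it.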

Don’t get me wrong. There are times when null hypothesis significance testing can make sense. And, speaking more generally, if a tool is available, people will use it as well as they can. Null hypothesis significance testing is the standard approach in much of science, and, as such, it’s been very useful. But I also think it’s useful to understand the problems with the approach.

My problem with the ideology

My problem with null hypothesis significance testing is not just that some statisticians recommend it, but that they think of it as necessary or fundamental.

Again, the analogy to Bayes might be helpful.

Bayesian statisticians will not only recommend and use Bayesian inference, but also will try their best, when seeing any non-Bayesian method, to interpret it Bayesianly. This can be helpful in revealing the statistical models that can be said to implicitly underlie certain statistical procedures, but ultimately a non-Bayesian method has to be evaluated on its own terms. The fact that a given estimate can be interpreted as, say, a posterior mode under a given probability model should not be taken to imply that that model needs to be true, or even close to being true, for the method to work.
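One familiar instance of that posterior-mode reading, in a sketch of my own (not from the post; the data are simulated for illustration): the ridge regression estimate coincides with the posterior mode under a normal prior, yet ridge regression can succeed, judged by its predictive performance, whether or not anyone believes that prior.

```python
# Sketch: the penalized least-squares (ridge) estimate equals the posterior
# mode under a normal prior. Same numbers, two interpretations.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

lam = 2.0  # penalty; reads as noise_var / prior_var in the Bayesian story

# Penalized least squares: argmin_b ||y - X b||^2 + lam ||b||^2, closed form
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode under y ~ N(X b, I), b ~ N(0, (1/lam) I), found numerically
def neg_log_posterior(b):
    resid = y - X @ b
    return 0.5 * resid @ resid + 0.5 * lam * b @ b

map_estimate = minimize(neg_log_posterior, np.zeros(p)).x
print(np.allclose(ridge, map_estimate, atol=1e-4))  # True
```

The correspondence is exact, but nothing about ridge’s performance on real data hinges on the normal prior being true.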

Similarly, any statistical method, even one that was not developed under a null hypothesis significance testing framework, can be evaluated in terms of type 1 and type 2 errors, coverage of interval estimates, etc. These evaluations can be helpful in understanding the method under certain theoretical, if unrealistic, conditions; see for example here.
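To make that kind of evaluation concrete, here is a sketch of my own (not from the linked example; the normal-normal model and all settings are invented for illustration): compute a 95% posterior interval under a shrinkage prior, a method not built from the NHST framework, and check its frequentist coverage at various fixed true values of the parameter.

```python
# Sketch: frequentist coverage of a Bayesian posterior interval, evaluated
# under the theoretical condition of a fixed true theta and known std. error.
import numpy as np

rng = np.random.default_rng(2)

def posterior_interval(y_bar, se, prior_sd=1.0):
    # Conjugate normal-normal model: theta ~ N(0, prior_sd^2), y_bar ~ N(theta, se^2)
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / se**2)
    post_mean = post_var * y_bar / se**2
    half_width = 1.96 * np.sqrt(post_var)
    return post_mean - half_width, post_mean + half_width

def coverage(theta, se=0.5, n_sims=50_000):
    y_bar = rng.normal(theta, se, n_sims)   # repeated draws at a fixed truth
    lo, hi = posterior_interval(y_bar, se)
    return np.mean((lo <= theta) & (theta <= hi))

for theta in [0.0, 0.5, 1.0, 2.0]:
    print(f"theta = {theta:.1f}: coverage = {coverage(theta):.3f}")
```

Coverage comes out near nominal where the prior concentrates and degrades in the prior’s tails, which shows both what such evaluations can reveal about a method and the artificiality of the fixed-truth condition they assume.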

The mistake is seeing such theoretical evaluations as fundamental. It can be hard for people to shake off this habit. But, remember: type 1 and type 2 errors are theoretical constructs based on false models. Keep your eye on the ball and remember your larger goals. When it comes to statistical methods, the house is stronger than the foundations.