Any P-Value Distinguishable from Zero is Insufficiently Informative
Three-Toed Sloth 2015-07-01
Summary:
Attention conservation notice: 4900+ words, plus two (ugly) pictures and many equations, on a common mis-understanding in statistics. Veers wildly between baby stats. and advanced probability theory, without explaining either. Its efficacy at remedying the confusion it attacks has not been evaluated by a randomized controlled trial.
After ten years of teaching statistics, I feel pretty confident in saying that one of the hardest points to get through to undergrads is what "statistically significant" actually means. (The word doesn't help; "statistically detectable" or "statistically discernible" might've been better.) They have a persistent tendency to think that parameters which are significantly different from 0 matter, that ones which are insignificantly different from 0 don't matter, and that the smaller the p-value, the more important the parameter. Similarly, if one parameter is "significantly" larger than another, then they'll say the difference between them matters, but if not, not. If this were just about undergrads, I'd grumble over a beer with my colleagues and otherwise suck it up, but reading and refereeing for non-statistics journals shows me that many scientists in many fields are subject to exactly the same confusions as The Kids, and talking with friends in industry makes it plain that the same thing happens outside academia, even to "data scientists". (For example: an A/B test is just testing the difference in average response between condition A and condition B; this is a difference in parameters, usually a difference in means, and so it's subject to all the issues of hypothesis testing.) To be fair, one meets some statisticians who succumb to these confusions.
One reason for this, I think, is that we fail to teach well how, with enough data, any non-zero parameter or difference becomes statistically significant at arbitrarily small levels. The proverbial expression of this, due I believe to Andy Gelman, is that "the p-value is a measure of sample size". More exactly, a p-value generally runs together the size of the parameter, how well we can estimate the parameter, and the sample size. The p-value reflects how much information the data has about the parameter, and we can think of "information" as the product of sample size and precision (in the sense of inverse variance) of estimation, say $n/\sigma^2$. In some cases, this heuristic is actually exactly right, and what I just called "information" really is the Fisher information.
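To see the heuristic in action, here is a minimal simulation sketch (my illustration, not from the notes themselves): draw samples with a tiny but non-zero true mean, and watch the p-value for testing $\mu = 0$ collapse as $n$ grows. It uses only the Python standard library, with a normal approximation to the $t$ statistic's null distribution, which is harmless at these sample sizes.

```python
import math
import random
import statistics

def p_value_for_mean(xs):
    """Two-sided p-value for H0: mu = 0, using the normal approximation
    to the null distribution of the t statistic (fine for large n)."""
    n = len(xs)
    t = statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))
    return math.erfc(abs(t) / math.sqrt(2))  # = 2 * P(Z > |t|)

random.seed(42)
mu = 0.01  # tiny, substantively negligible, but non-zero
ps = {}
for n in [100, 10_000, 1_000_000]:
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    ps[n] = p_value_for_mean(xs)
    print(f"n = {n:>9,}   p = {ps[n]:.3g}")
```

Nothing about the parameter changes between rows; only the sample size does, and with it the "information" $n/\sigma^2$, so the p-value slides from utterly insignificant to arbitrarily significant.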
As a public service (or, really, egged on by a friend, and rather than working on grant proposals), I've written up some notes on this. Throughout, I'm assuming that we're testing the hypothesis that a parameter, or vector of parameters, $\theta$ is exactly zero, since that's overwhelmingly what people calculate p-values for — sometimes, I think, by a spinal reflex not involving the frontal lobes. Testing $\theta=\theta_0$ for any other fixed $\theta_0$ would work much the same way. Also, $\langle x, y \rangle$ will mean the inner product between the two vectors.
1. Any Non-Zero Mean Will Become Arbitrarily Significant
Let's start with a very simple example. Suppose we're testing whether some mean parameter $\mu$ is equal to zero or not. Being straightforward folk, who follow the lessons we were taught in our one room log-cabin schoolhouse research methods class, we'll use the sample mean $\hat{\mu}$ as our estimator, and take as our test statistic $\frac{\hat{\mu}}{\hat{\sigma}/\sqrt{n}}$; that denominator is the standard error of the mean. If we're really into old-fashioned recipes, we'll calculate our p-value by comparing this to a table of the $t$ distribution with $n-1$ degrees of freedom, remembering that it's $n-1$ because we're using up one degree of freedom to get the mean estimate ($\hat{\mu}$), around which the standard deviation estimate ($\hat{\sigma}$) is then computed. (If we're a bit more open to new-fangled notions, we bootstrap.) Now what happens as $n$ grows?
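For the new-fangled route, here is one standard way to bootstrap this test (a sketch of mine, under the usual assumption that we impose the null by recentering the data at mean zero before resampling): the p-value is just the fraction of resampled test statistics at least as extreme as the one we observed.

```python
import math
import random
import statistics

def t_stat(xs):
    """The test statistic from the text: sample mean over its standard error."""
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))

def bootstrap_p_value(xs, reps=2000, seed=7):
    """Two-sided bootstrap p-value for H0: mu = 0.

    Recenter the data so the null hypothesis holds exactly, resample with
    replacement, and count how often the resampled statistic is at least
    as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(t_stat(xs))
    m = statistics.mean(xs)
    centered = [x - m for x in xs]
    n = len(xs)
    extreme = sum(
        1
        for _ in range(reps)
        if abs(t_stat([rng.choice(centered) for _ in range(n)])) >= observed
    )
    return extreme / reps

random.seed(1)
xs = [random.gauss(0.5, 1.0) for _ in range(50)]  # true mean is not zero
p = bootstrap_p_value(xs)
print(f"bootstrap p-value: {p:.4f}")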
Well, we remember the central limit theorem: $\sqrt{n}(\hat{\mu} - \mu) \rightarrow \mathcal{N}(0,\sigma^2)$. With a little manipulation, and some abuse of notation, this becomes \[ \hat{\mu} \rightarrow \mu + \frac{\sigma}{\sqrt{n}}\mathcal{N}(0,1) \]
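That scaling is easy to check by simulation (again my sketch, not the post's): across many replications, $\sqrt{n}(\hat{\mu} - \mu)$ should have a standard deviation close to $\sigma$, whatever $n$ happens to be.

```python
import math
import random
import statistics

# Check the CLT scaling: sqrt(n) * (sample mean - mu) should have
# standard deviation close to sigma, regardless of n.
random.seed(3)
mu, sigma, n, reps = 2.0, 1.5, 400, 2000
scaled = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    scaled.append(math.sqrt(n) * (statistics.mean(xs) - mu))
sd = statistics.stdev(scaled)
print(f"sd of sqrt(n)*(mean - mu): {sd:.3f}  (sigma = {sigma})")
```

The $1/\sqrt{n}$ factor in the displayed equation is exactly what makes the fluctuations of $\hat{\mu}$ around $\mu$ shrink, and hence what drives the test statistic off to infinity whenever $\mu \neq 0$.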