A quick simulation to demonstrate the wild variability of p-values

Statistical Modeling, Causal Inference, and Social Science 2024-10-08

Just did this one in class today. Suppose you run 10 independent experiments, each with the same true effect size and the same standard error. Further assume that the experiments are pretty good, in the sense that the true effect is 2 standard errors away from zero.

What will the p-values look like?

Here’s a simulation:

set.seed(123)
J <- 10                          # number of experiments
theta <- rep(0.2, J)             # true effect, the same in every experiment
se <- rep(0.1, J)                # standard error, so the true effect is 2 se away from zero
theta_hat <- rnorm(J, theta, se) # simulated estimates
p_value <- 2*(1 - pnorm(abs(theta_hat), 0, se))  # two-sided p-values against the null of zero effect
print(round(p_value, 3))

And here's what we get:

0.150 0.077 0.000 0.038 0.033 0.000 0.014 0.462 0.189 0.120

Now imagine this was real life. A couple of experiments are consistent with pure noise, a few of them seem to show weak evidence against the null hypothesis, and a few are majorly statistically significant. It would be natural to try to categorize them in some way. Yeah, the difference between "significant" and "not significant" is not itself statistically significant, but a p-value of 0.46 in one case and 0.0002 in another . . . surely this must be notable, right? No.
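One thing you can check with the numbers above: take the experiment with p = 0.014 (the seventh in the list) and the one with p = 0.46 (the eighth), and test the difference between them. A minimal sketch, continuing from the simulation code above:

diff <- theta_hat[7] - theta_hat[8]              # "significant" estimate minus "not significant" estimate
se_diff <- sqrt(se[7]^2 + se[8]^2)               # standard error of the difference
p_diff <- 2*(1 - pnorm(abs(diff), 0, se_diff))   # two-sided p-value for the difference
print(round(c(diff/se_diff, p_diff), 2))         # roughly z = 1.2, p = 0.22: nowhere near significant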

OK, this is an extreme case, in that there is no underlying variation, and indeed if you fit a multilevel model you could see the lack of evidence for underlying variation.
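Here's a minimal sketch of that check, using a random-effects fit from the metafor package (just one convenient way to do it; any multilevel modeling tool would tell the same story):

library(metafor)                      # assumes the metafor package is installed
fit <- rma(yi = theta_hat, sei = se)  # random-effects model for the estimates and their standard errors
print(fit)                            # the estimated between-experiment sd (tau) comes out at or near zero here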

The larger points are:

1. The p-value is a statement relative to the null hypothesis of no effect. It doesn't have much of an interpretation relative to a real, nonzero effect.

2. The p-value is super-noisy. It's a weird nonlinear transformation of the z-score (which does have a pretty clear interpretation) with all kinds of nonintuitive behavior. (See the follow-up simulation at the end of this post.)

And:

3. You can learn a lot from a simulation experiment. Even something really simple like this one!
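To put some numbers on point 2, and in the spirit of point 3, here's one more quick sketch with the same setup: replicate a single experiment whose true effect is 2 standard errors away from zero, many times over, and look at the spread of the p-values.

set.seed(456)                                   # arbitrary seed for this follow-up
n_sims <- 10000                                 # replications of one experiment
theta <- 0.2                                    # same true effect as before
se <- 0.1                                       # same standard error, so the true effect is 2 se from zero
theta_hat <- rnorm(n_sims, theta, se)           # simulated estimates
p_value <- 2*(1 - pnorm(abs(theta_hat), 0, se)) # two-sided p-values against the null of zero effect
print(round(quantile(p_value, c(0.1, 0.25, 0.5, 0.75, 0.9)), 3))
# the spread of these p-values is huge, even though every replication has
# exactly the same true effect and the same standard error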