4 different meanings of p-value (and how my thinking has changed)

Statistical Modeling, Causal Inference, and Social Science 2024-12-03

Given the discussion of yesterday’s post on p-values, I thought it could help to re-run a related post from a year ago, 4 different meanings of p-value (and how my thinking has changed), which begins:

The p-value is one of the most common, and one of the most confusing, tools in applied statistics. Seasoned educators are well aware of all the things the p-value is not. Most notably, it’s not “the probability that the null hypothesis is true.” McShane and Gal find that even top researchers routinely misinterpret p-values.

But let’s forget for a moment about what p-values are not and instead ask what they are. It turns out that there are different meanings of the term. . . .

Definition 1. p-value(y) = Pr(T(y_rep) >= T(y) | H), where H is a “hypothesis,” a generative probability model, y is the observed data, y_rep are future data under the model, and T is a “test statistic,” some pre-specified function of the data. I find it clearest to define this sort of p-value relative to potential future data; it can also be done mathematically and conceptually without any reference to repeated or future sampling, as in this 2019 paper by Vos and Holbert.
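To make Definition 1 concrete, here is a minimal Monte Carlo sketch in Python. The setup is a toy example of my own choosing, not from the post: H says the y_i are independent Normal(0, 1) with n = 20, and T is the sample mean. The p-value is estimated as the fraction of replicated datasets under H whose test statistic is at least as large as the observed one.

```python
import math
import random

random.seed(0)

# Hypothetical toy setup: H says y_i ~ Normal(0, 1), n = 20,
# and the test statistic T is the sample mean.
n = 20
y = [random.gauss(0.3, 1.0) for _ in range(n)]  # observed data (drawn here from a shifted model)

def p_value(y, n_rep=50_000):
    """Monte Carlo estimate of Pr(T(y_rep) >= T(y) | H)."""
    t_obs = sum(y) / len(y)
    count = 0
    for _ in range(n_rep):
        # Simulate future data under the hypothesized generative model H
        y_rep = [random.gauss(0.0, 1.0) for _ in range(len(y))]
        count += (sum(y_rep) / len(y_rep) >= t_obs)
    return count / n_rep

p = p_value(y)
```

In this simple normal example the simulation is unnecessary (the tail area is available in closed form), but the same recipe applies to any generative model H and test statistic T you can simulate from.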

Definition 2. Start with a set of hypothesis tests of level alpha, for all values alpha between 0 and 1. p-value(y) is the smallest alpha of all the tests that reject y. This definition starts with a family of hypothesis tests rather than a test statistic, and it does not necessarily have a Bayesian interpretation, although in particular cases, it can also satisfy Definition 1.
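Definition 2 can also be sketched in code. Assume (my example, not the post’s) a family of one-sided level-alpha z-tests that reject when the observed z-statistic exceeds the normal 1 − alpha quantile; scanning a grid of alpha values from small to large, the first test that rejects gives the smallest rejecting alpha. The quantile function here is a plain bisection on the normal CDF to keep the sketch dependency-free.

```python
import math

def normal_quantile(p):
    # Inverse standard normal CDF by bisection (Phi(x) = 0.5 * erfc(-x / sqrt(2)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(-mid / math.sqrt(2)) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def rejects(z_obs, alpha):
    # Level-alpha one-sided z-test: reject when z_obs exceeds the 1 - alpha quantile
    return z_obs > normal_quantile(1 - alpha)

def p_value_def2(z_obs, grid=2000):
    # Smallest alpha (on a grid) among all the level-alpha tests that reject y
    for i in range(1, grid):
        alpha = i / grid
        if rejects(z_obs, alpha):
            return alpha
    return 1.0
```

For this particular nested family of tests the answer coincides with the Definition 1 tail area, illustrating the “in particular cases, it can also satisfy Definition 1” remark; the definition itself never mentions a test statistic, only the family of tests.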

Property 3. p-value(y) is some function of y that is uniformly distributed under H. I’m not saying that the term “p-value” is taken as a synonym for “uniform variate” but rather that this conditional uniform distribution is sometimes taken to be a required property of a p-value. It’s not a definition because in practice no one would define a p-value without some reference to a tail-area probability (Definition 1) or a rejection region (Definition 2)—but it is sometimes taken as a property that is required for something to be a true p-value. The relevant point here is that a p-value can satisfy Property 3 without satisfying Definition 1 (there are methods of constructing uniformly-distributed p-values that are not themselves tail-area probabilities), and a p-value can satisfy Definition 1 without satisfying Property 3 (when there is a composite null hypothesis and the distribution of the test statistic is not invariant to parameter values; see Xiao-Li Meng’s paper from 1994).
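Property 3 can be checked by simulation. In this sketch (again a toy setup of my own: Normal(0, 1) data with a continuous test statistic, the standardized sample mean), we draw many datasets under H, compute the Definition 1 tail-area p-value for each, and confirm the resulting p-values look Uniform(0, 1).

```python
import math
import random

random.seed(1)

def upper_tail(z):
    # Pr(Z >= z) for standard normal Z
    return 0.5 * math.erfc(z / math.sqrt(2))

# Simulate many datasets under H (y_i ~ Normal(0, 1)) and collect the
# tail-area p-values; with a continuous test statistic these should be
# uniformly distributed when H is true.
n, n_sims = 30, 4000
pvals = []
for _ in range(n_sims):
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(y) / n) * math.sqrt(n)
    pvals.append(upper_tail(z))

mean_p = sum(pvals) / n_sims
frac_below_half = sum(p < 0.5 for p in pvals) / n_sims
```

With a discrete test statistic, or a composite null as in Meng’s 1994 setting, the same check would show departures from uniformity even though each p-value is still a legitimate tail-area probability under Definition 1.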

Description 4. p-value(y) is the result of some calculations applied to data that are conventionally labeled as a p-value. Typically, this will be a p-value under Definition 1 or 2 above, but perhaps defined under a hypothesis H that is not actually the model being fit to the data at hand, or a hypothesis H(y) that itself is a function of data, for example from p-hacking or forking paths. I’m labeling this as a “description” rather than a “definition” to clarify that this sort of p-value is used all the time without any clear definition of the hypothesis, for example if you have a regression coefficient with estimate beta_hat and standard error s, and you compute 2 times the tail-area probability of |beta_hat|/s under the normal or t distribution, without ever defining a null hypothesis relative to all the parameters in your model. Sander Greenland calls this sort of thing a “descriptive” p-value, capturing the idea that the p-value can be understood as a summary of the discrepancy or divergence of the data from H according to some measure, ranging from 0 = completely incompatible to 1 = completely compatible. For example, the p-value from a linear regression z-score can be understood as a data summary without reference to a full model for all the coefficients.
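The conventional calculation described here — 2 times the tail-area probability of |beta_hat|/s under the normal distribution — is short enough to write out directly. The numbers below are a made-up regression summary, just for illustration.

```python
import math

def descriptive_p(beta_hat, se):
    """Two-sided normal-theory p-value: 2 * Pr(Z >= |beta_hat| / se)."""
    z = abs(beta_hat) / se
    return math.erfc(z / math.sqrt(2))  # erfc(z/sqrt(2)) equals twice the upper tail area

# Hypothetical regression summary: estimate 0.8, standard error 0.4, so z = 2
p = descriptive_p(0.8, 0.4)
```

Note that nothing in this calculation refers to the other parameters in the model, which is exactly why it is a description rather than a definition: it summarizes how far the coefficient estimate is from zero on the standard-error scale, without a fully specified null hypothesis.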

These are not four definitions/properties/descriptions of the same thing. They are four different things. Not completely different, as they coincide in certain simple examples, but different, and they serve different purposes. They have different practical uses and implications, and you can make mistakes when you use one sort to answer a different question. . . .

The great thing about this post is that it’s purely descriptive (see this comment for elaboration on this particular point). I’m not telling anyone what to do. So this is a rare article about p-values where there’s nothing to argue about!