Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging.
Statistical Modeling, Causal Inference, and Social Science 2024-12-02
The usual way we think about p-values is that they’re part of null-hypothesis testing inference, and if that’s not your bag, you don’t have use for p-values.
That summary is pretty much the case. I once did write a paper for the journal Epidemiology called “P-values and statistical practice,” and in that paper I gave an example of a p-value that worked (further background is here), but at this point my main interest in p-values is that other people use them, so it behooves me to understand what they’re doing.
Theoretical statistics is the theory of applied statistics, and part of applied statistics is what other people do.
What this means is that, just as non-Bayesians should understand enough about Bayesian methods to be able to assess the frequency properties of said methods, so should I, as a Bayesian, understand the properties of p-values. Bayesians are frequentists.
The point is, a p-value is a data summary, and it should be interpretable under various assumptions. As we like to say, it’s all about the averaging.
Below are two different ways of understanding p-values. You could think of these as the classical interpretation or the Bayesian interpretation, but I prefer to think of them as conditioning-on-the-null-hypothesis or averaging-over-an-assumed-population-distribution.
So here goes:
1. One interpretation of the p-value is as the probability of seeing a test statistic as extreme as, or more extreme than, the data, conditional on a null hypothesis of zero effects. This is the classical interpretation.
2. Another interpretation of the p-value is conditional on some empirically estimated distribution of effect sizes. This is what we did in our recent article by Zwet et al., “A new look at p-values for randomized clinical trials,” using the Cochrane database of medical trials.
Both interpretations 1 and 2 are valid! No need to think of interpretation 2 as a threat to interpretation 1, or vice versa. It’s the same p-value, we’re just understanding it by averaging over different predictive distributions.
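To make the "same p-value, different averaging" point concrete, here is a minimal simulation sketch. Everything specific in it is an assumption for illustration: effects are measured in standard-error units, the estimate is the true effect plus unit-variance normal noise, and the "population" of effects is a made-up Normal(0, 1.5) distribution, not the distribution estimated from the Cochrane database. The function names (two_sided_p, simulate, fraction) are likewise hypothetical. Probability of the correct sign is just one quantity you could average under the second interpretation.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the classical p-value


def two_sided_p(z):
    """Two-sided p-value of a z-statistic, computed under the null of zero effect."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate(n_studies, effect_sd, seed):
    """Return (true_effect, z) pairs, with effects in standard-error units.

    effect_sd = 0 puts every study exactly at the null (interpretation 1).
    effect_sd > 0 draws effects from an ASSUMED Normal(0, effect_sd) population,
    a stand-in for interpretation 2 and not an empirical estimate.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_studies):
        effect = rng.gauss(0.0, effect_sd) if effect_sd > 0 else 0.0
        z = effect + rng.gauss(0.0, 1.0)  # estimate = truth + unit-variance noise
        pairs.append((effect, z))
    return pairs


def fraction(flags):
    """Share of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


N = 200_000

# Interpretation 1: condition on the null and ask how often p < 0.05.
null_studies = simulate(N, effect_sd=0.0, seed=1)
frac_below_05_null = fraction(two_sided_p(z) < 0.05 for _, z in null_studies)

# Interpretation 2: average over the assumed population, keep the studies whose
# p-value landed near 0.05, and ask how often the estimate has the right sign.
population = simulate(N, effect_sd=1.5, seed=2)
near_05 = [(e, z) for e, z in population if 0.04 < two_sided_p(z) < 0.06]
sign_correct_near_05 = fraction((e > 0) == (z > 0) for e, z in near_05)

print(f"Under the null, share of studies with p < 0.05: {frac_below_05_null:.3f}")
print(f"Under the assumed population, P(correct sign | p near 0.05): "
      f"{sign_correct_near_05:.3f}")
```

Note that two_sided_p is identical in both halves of the script. The only thing that changes is the distribution of true effects being averaged over, and the second number will move if you change the assumed effect_sd.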
What to do with all this theory and empiricism is another question, and there is a legitimate case to be made that following procedures derived from interpretation 2 could lead to worse scientific outcomes, just as of course there is a strong case to be made that procedures derived from interpretation 1 have already led to bad scientific outcomes.
Following that logic, one could argue that interpretation 1, or interpretation 2, or both, are themselves pernicious in leading, inexorably or with high probability, toward these bad outcomes. One can continue with the statement that interpretation 1, or interpretation 2, or both, have intellectual or institutional support that props them up and allows the related bad procedures to continue; various people benefit from these theories, procedures, and outcomes.
To the extent there are, or should be, disputes about p-values, I think such disputes should focus on the bad outcomes for which there is concern, not on the p-values themselves or on interpretations 1 and 2, both of which are mathematically valid and empirically supported within their zones of interpretation.