Remember: p-values Are Not Effect Sizes

Win-Vector Blog 2017-09-09

Authors: John Mount and Nina Zumel.

The p-value is a valid frequentist statistical concept that is much abused and misused in practice. In this article I would like to call out a few features of p-values that can cause problems when evaluating summaries.

Keep in mind: p-values are useful and routinely taught correctly in statistics, but they are very often misremembered or abused in practice.

From Hamilton’s Lectures on metaphysics and logic (1871). Internet Archive Book Images.

What is a statistic?

Roughly, a statistic is any sort of summary or measure about an attribute of a population or sample from a population. For example, for people an obvious statistic is “average height” and we can talk about the mean height of 20 year old male Californians, the mean height of a sample of 20 year old male Californians, or the mean height of a few individuals.

In predictive analytics or data science the most popular summary statistics are often how well a model is doing at prediction, or what the difference in prediction quality between two models is over a representative data set. These statistics may be an “agreement metric” (for example R-squared or pseudo R-squared, accuracy, cosine similarity, or AUC), or a “disagreement metric” or loss (such as squared error, RMSE, or MAD).

In medical or treatment contexts a statistic might be the probability of surviving the next year, the number of years of life added, or the number of pounds of weight change. These statistics are generally what we mean by “effect sizes;” notice they all have units. There are a lot of possible summary statistics, and picking the appropriate one is important.

In any case we have a summary statistic. We should have some notion as to what “large” and “small” values of such a statistic might be (the too-often ignored clinical significance) and we also want an estimate of the reliability of our estimate (the so-called statistical significance of the estimated statistic).

What is a significance or “p-value”?

The most commonly reported statistical significance is the frequentist significance of a null hypothesis. To calculate such a significance one must:

  1. Propose a “null hypothesis”: the condition that we are trying to out-compete. This can be something like “the value is a constant,” “two populations are identical,” “the two models have identical RMSE,” or “two variables are independent.”
  2. Declare what one is going to test. This is mostly a matter of picking a one-sided or two-sided test. Are we testing “A is better than B” or “A and B are different”?
  3. Model the probability distribution of the statistic subject to some marginal facts about the data (simple stuff such as the population size) under the null hypothesis.
  4. Use the above distribution to estimate how often a statistic at least as extreme as the one you observed would occur, for example: P[score(X) ≥ score(observed) | X a statistic distributed under the above null hypothesis]. This is called the significance, or p. You hope that p is small.

Assume the null hypothesis A=B and a test statistic t that is approximately normally distributed around t=0 when the null hypothesis is true. Then the p-value is the probability of t being as large or larger than what you observe, under the null hypothesis.
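
As a concrete illustration of the recipe above, here is a minimal sketch in Python (using numpy and scipy on entirely simulated data): the null hypothesis is “the two models have identical squared error on the same evaluation rows,” the test is two-sided, and the distributional modeling is delegated to a paired t-test. Every name and number below is made up for illustration.

```python
# Minimal sketch of steps 1-4: test the null hypothesis that two models
# have the same squared error on the same evaluation rows.
# All data here is simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2017)
n = 200  # hypothetical number of evaluation rows

# Hypothetical per-row squared errors for models A and B.
err_a = rng.normal(loc=1.00, scale=0.5, size=n) ** 2
err_b = rng.normal(loc=1.05, scale=0.5, size=n) ** 2

# Steps 1-2: null hypothesis "mean error difference is zero", two-sided test.
# Steps 3-4: model the statistic's null distribution with a paired t-test
# and read off the significance p.
t_stat, p_value = stats.ttest_rel(err_a, err_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```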

The idea is that small p is heuristic evidence that the null hypothesis does not hold, as your observed statistic is considered unlikely under the null hypothesis and your distributional assumptions. Unfortunately such tests are at best asymmetric in what they tell you: it is usually fairly damning if your outcome doesn’t look rare under the null hypothesis, but only mildly encouraging when your outcome does look rare under the null hypothesis. “Failing to fail” isn’t always the same as succeeding.

Moving from this heuristic indication to saying you have a good result (i.e., your model is “good” or “better”) requires at least priors on model quality (not performance) and often involves erroneous excluded-middle fallacies. Saying one given null hypothesis is unlikely to have generated your observed performance statistic in no way says your model is likely good. It would only say so if, in addition to making the significance calculations, you had also done the work to actually exclude the middle and show that there are no other remotely plausible alternative explanations.

One of my favorite authors on p-values and their abuse is Professor Andrew Gelman. Here is one of his blog posts.

Some general complaints

The things I take issue with in the common misuse of p-values include:

  1. p-hacking. This includes censored data bias, repeated measurement bias, and even outright fraud.
  2. “Statsmanship” (the deliberate use of statistical terminology for obscurity, not for clarity). For example: saying p instead of saying what you are testing such as “significance of a null hypothesis”.
  3. Logical fallacies. This is the (false) claim that p being low implies that the probability that your model is good is high. At best a low p eliminates a null hypothesis (or even a family of them). But saying such disproof “proves something” is just saying “the butler did it” because you found the cook innocent (a simple case of the fallacy of the excluded middle).
  4. Confusion of population and individual statistics. This is the use of the deviation of sample means (which typically decreases as sample size goes up) when the deviation of individual differences (which typically does not decrease as sample size goes up) is what is appropriate (see the sketch just after this list). This is one of the biggest scams in data science and marketing science: showing that you are good at predicting an aggregate (say, the mean number of traffic deaths in the next week in a large city) and claiming this means your model is good at predicting per-individual risk. Some of this comes from the usual statistical word games: saying “standard error” (instead of “standard error of the mean”) and “standard deviation” (instead of “standard deviation of individual cases”); with some luck somebody won’t remember which is which and will be too afraid to ask.
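
To make the fourth complaint concrete, here is a small simulation sketch (simulated data only, assuming numpy is available): the standard error of the mean shrinks roughly like 1/sqrt(n) as the sample grows, while the standard deviation of individual outcomes does not shrink at all.

```python
# Sketch: the standard error of the mean shrinks with sample size,
# the standard deviation of individuals does not. Simulated data only.
import numpy as np

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=0.0, scale=1.0, size=n)  # individual outcomes
    sd_individuals = x.std(ddof=1)               # stays near 1.0
    sem = sd_individuals / np.sqrt(n)            # shrinks like 1/sqrt(n)
    print(f"n={n:>9,}  sd(individuals)={sd_individuals:.3f}  sem={sem:.5f}")
```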

Narrowing down to one complaint

My main complaint is the abuse of p-values as colloquially representing the reciprocal of an effect size (or the reciprocal of a clinical significance).

In practice nobody should directly care about a p-value. They should care about the effect size being claimed (often not even reported) and whether the claim is correct. The p-value is at best a proxy related to only one particular form of incorrectness.

Once you notice people are using p-values as stand-ins for effect sizes you really see the problem.

p-values are not effect sizes when there is no effect

When there “is no effect” (i.e., when something like a null hypothesis actually holds) p-values are not consistent estimators! That is, if there is no effect, two different experimenters will likely see two different p-values regardless of how large an experiment either of them runs!

Under the null hypothesis a p-value is (by construction, as experiment size goes to infinity) uniformly distributed in the interval [0,1]. All the fancy statistical methods are designed to ensure exactly that.

This has horrible consequences. Two experimenters studying an effect that does not exist cannot confirm each other’s results from p-values alone. Suppose one got p=0.01 (not too unlikely: it happens 1 in 100 times, and with the professionalization of research we have a lot of experiments being run every day) and the other got p=0.64. The two experimenters have no clue whether the difference is likely due to chance or to differences in populations and procedures. With an asymptotically consistent summary (such as Cohen’s d) they would eventually know (as they add more data) whether they are seeing the same results.
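
Here is a small simulation sketch of that situation (simulated data, assuming numpy and scipy are available): two very large experiments on identical populations still hand their experimenters unrelated p-values.

```python
# Sketch: when there is no effect, p-values stay uniform on [0,1]
# no matter how large the experiment gets. Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000  # a very large experiment, per arm
for experimenter in ("first", "second"):
    treated = rng.normal(size=n)  # "treatment" arm, no true effect
    control = rng.normal(size=n)  # control arm, identical population
    _, p = stats.ttest_ind(treated, control)
    print(f"{experimenter} experimenter: p = {p:.3f}")
# Re-running this gives unrelated p-values each time; collecting many
# of them gives a flat histogram on [0,1], not a converging estimate.
```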

In fact, under the usual “Z, p” style formulations of significance (such as t-testing), Z becomes normally distributed (with variance 1) as experiment size goes to infinity, so reporting Z in addition to p buys you nothing.

p-values are not effect sizes when there is an effect

If there is an effect (i.e., your model makes a useful prediction, or your drug helps, no matter how tenuously) then, conditioned on the effect size and population characteristics, the p-value is uninformative in that it converges to zero. It does not carry any information other than weak facts about the size of the test population (relative to the actual effect size).

Now I know that in the real world the effect size and the total characterization of the population are in fact unknown (they are part of what we are trying to estimate). But the above still has an undesirable consequence. One can, if one can afford it, purchase an arbitrarily small p-value just by running a sufficiently large trial. Always remember: a low p doesn’t indicate a “big effect”; it could easily come from a large test population (which means better-funded institutions can in fact “buy better ps” on weak effects).

In fact, under the usual “Z, p” style formulations of significance (such as t-testing), Z goes to infinity as experiment size goes to infinity, so reporting Z in addition to p again buys you nothing.
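
The following sketch illustrates the point with entirely simulated data and a made-up true effect of 0.02 standard deviations: the p-value can be driven as low as you like simply by paying for a larger experiment.

```python
# Sketch: with a tiny but real effect you can "buy" an arbitrarily small
# p just by running a larger experiment. Entirely simulated; the
# 0.02-standard-deviation "effect" is a made-up example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.02  # hypothetical effect, in standard deviations

for n in (1_000, 100_000, 10_000_000):
    treated = rng.normal(loc=true_effect, scale=1.0, size=n)
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p = stats.ttest_ind(treated, control)
    print(f"n={n:>10,}  p = {p:.3g}")
```

The claimed effect never gets any bigger in this sketch; only the experiment does.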

An under-used alternative

Cohen’s d (under fairly mild assumptions) converges to an informative value as experiment size increases. Different experiments can increase their probability of reporting d’s within a given tolerance of each other by increasing experiment size. And not all valid experiments converge to zero (so Cohen’s d carries some information about effect size). If experimenters don’t see Cohen’s d converging they should start to wonder if they have matching populations and procedures. One can worry about technical issues with Cohen’s d (such as whether one should use partial eta-squared instead), but in any case Cohen’s d is no worse than the usual Z, p (in fact it is much better).
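
For reference, here is a minimal sketch of the textbook pooled-standard-deviation form of Cohen’s d (again on simulated data, with a made-up standardized effect of 0.2): as the samples grow, the reported d settles near the true standardized effect rather than degenerating to zero or infinity.

```python
# Sketch: Cohen's d (difference of means in pooled-standard-deviation
# units) converges to the true standardized effect as data accumulates.
# Simulated data; the 0.2 "effect" is a made-up example.
import numpy as np

def cohens_d(a, b):
    """Difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
for n in (100, 10_000, 1_000_000):
    treated = rng.normal(loc=0.2, scale=1.0, size=n)
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    print(f"n={n:>9,}  Cohen's d = {cohens_d(treated, control):.3f}")
```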

Conclusion

Rely more on effect measures. I think experimenters should emphasize many things before attempting to state a significance: they should still report a significance, but before that emphasize at the very least a units-based effect size and a dimensionless effect size. Let’s take an anti-cholesterol drug as an example.

We should insist on at least three summaries (a combined reporting sketch follows the list):

  • Units effect size. The units effect size is critical. It tells people whether they should even care if the result is true or not. An anti-cholesterol drug is only interesting if it decreases bad cholesterol by a clinically significant quantity. That is, it needs to cut LDL cholesterol by a big number such as 10%, 20%, or 50%. And we only care about that because of research linking such reductions to clinically significant decreases in stroke, heart attack, and death rates. Nobody is going to care whether the study and statistics are correct if the claimed decrease is 0.5% LDL. We need to know if the drug helps individuals, or if it is just some effect only visible across large populations. Reviewers deserve this number first, to know if they should read on.
  • Dimensionless effect size. Dimensionless effect size is critically important, and so neglected that it keeps getting re-invented. Take your units effect size and divide it by the expected variation between individuals. Essentially this ratio is monotone related (modulo squaring, square-rooting, taking reciprocals, and adding to or subtracting from 1) to Cohen’s d, partial eta-squared, the Sharpe ratio, the coefficient of variation, correlation, pseudo R-squared, R-squared, cosine similarity, or signal-to-noise ratio. If this number is small (and that has a concrete definition) then we are talking about a treatment that can at best be noticed in aggregate. For a drug this might mean the drug is useful (if cheap and without side-effects) as a matter of public health, but not indicative that it will work on a specific individual.
  • Reliability of the experiment. This is where you report the p-value, and hopefully not just the p-value. Personally I use p-values, but I insist they be called “significances” so we have some chance of knowing what we are talking about (versus dealing with alphabet soup). Roughly, a “low p” is considered “highly significant,” which only means the observed outcome is considered implausible under one specific null hypothesis (or family of them). One should always re-state what the null hypothesis in fact was.
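
Putting the three summaries together, here is a minimal reporting sketch for a hypothetical anti-cholesterol trial; every number in it is simulated and purely for illustration.

```python
# Sketch: report all three summaries for a hypothetical cholesterol
# trial. All numbers here are simulated / made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500  # hypothetical patients per arm

# Simulated change in LDL cholesterol (mg/dL); negative means a reduction.
treated = rng.normal(loc=-20.0, scale=30.0, size=n)
control = rng.normal(loc=0.0, scale=30.0, size=n)

# 1. Units effect size: mean LDL change versus control, in mg/dL.
units_effect = treated.mean() - control.mean()

# 2. Dimensionless effect size: Cohen's d (pooled-standard-deviation units).
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
cohens_d = units_effect / pooled_sd

# 3. Reliability: significance against the null "no difference in mean change".
_, p = stats.ttest_ind(treated, control)

print(f"units effect:  {units_effect:+.1f} mg/dL LDL change vs. control")
print(f"Cohen's d:     {cohens_d:+.2f}")
print(f"significance:  p = {p:.2g} (null: identical mean change)")
```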

As a consumer of data science, machine learning, or statistics, always insist on a units (or clinical) effect size, a dimensionless effect size (Cohen’s d is good enough), and a discussion of the reliability of the experiment (which is where a p-value goes, but it must include a lot more context to be meaningful).