The unfortunate one-sided logic of empirical hypothesis testing

Win-Vector Blog 2016-10-26

I’ve been thinking a bit on statistical tests, their absence, abuse, and limits. I think much of the current “scientific replication crisis” stems from the fallacy that “failing to fail” is the same as success (in addition to the forces of bad luck, limited research budgets, statistical naiveté, sloppiness, pride, greed and other human qualities found even in researchers). Please read on for my current thinking.

The nasty problem

Like many other empirical sciences, data science plays fast and loose with important statistical questions (confusing exploration with confirmation, confusing significances with posteriors, hiding negative results, ignoring selection bias, and so on).

Some really neat writing on such pitfalls includes:

The last reference in particular makes two specific claims:

The driving issue is: academic science is a profession measured by publication. This gets perverted (under the theory “the inevitable becomes acceptable”) into: scientists have a right to publish as they need to do so to survive. This is why the simple act of critically reading published papers (presumably why they are published) for statistical typos has been called “methodological terrorism” (presumably under the rubric that the tenured shit on the graduate students, and not the other way around; please see here and here for more commentary).

Some cartoon examples

The above was fairly general. I am going to propose a few cartoon examples to be more specific.

  • When I was working in biological science, my impression was that about 90% of the published papers were of the form: “we purchased reagent 342234 from the Merck catalogue and used our purchased instrument to measure index of refraction under a grant to cure cancer.” The waste and ridiculousness of this is that usually reagent 342234 could not exist in living tissue (it would kill the tissue or get destroyed or bound) and the index of refraction has very little to do with cancer.
  • From the outside (and this is in fact unfair) a lot of psychology studies look like: “we show a simple exercise, with no plausible mechanism linking it to our claimed outcome, possibly showed an effect (below statistical significance) on a deliberately small test population.”
  • From nutrition science: “we designed an experiment so small it can’t show unhealthy food is bad for you, therefore unhealthy food isn’t bad for you” (related ranting can be found here).
  • From neuroscience: “we show an expensive fMRI lets us draw a pretty picture, from which we will then draw unrelated conclusions” (wonderfully lampooned in Bennett et al. “Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction” Journal of Serendipitous and Unexpected Results, 2010, see here for some discussion, and here for the meaningless circularity of such logic).
  • From the quack physics world (reaction-less drives, free energy, and so on): “we studied something that cannot happen (such as a reaction-less drive, which is impossible due to Noether’s theorem) in conditions (our garage) utterly different from where we claim the effect will be used (deep space).” Examples include the good old Dean drive and the more modern EmDrive.
  • And in our own data science land: “we combined thousands of advanced models to claim a marginal improvement on classifier performance that will never replicate on new data.”

The realistic alternatives

We start with non-statistical people (such as computer scientists) running one experiment and claiming victory. We try to steer them into at least some repetition to get a crude look at the distribution. But repetition by instruction is mere ritual; how do we get to valid empirical science?

What we want from experiments is to know the truth (“is this food good for you or bad for you?”). While that is unachievable in a merely empirical world, knowing the truth should always remain the goal.

Statistically we will settle for solid posterior odds on the statement in question being true. This is a strict formulation of empirical science. It is distinctly Bayesian, and unfortunately unachievable (as it depends on having good objective prior odds on the effect). Usable subjective or usable uninformative priors are easy to get, but the true “prior odds” of a substantial empirical statement are usually inaccessible (though there is good practical use of so-called “empirical Bayes” methodology).
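
To see why those prior odds matter so much, consider a small back-of-the-envelope sketch (made-up numbers, purely for illustration): suppose only 1 in 100 hypotheses of the kind being tested is actually true, the test has 80% power, and we declare significance at alpha = 0.05. A “significant” result then still leaves the effect more likely false than true.

```python
# Toy posterior-odds calculation (illustrative numbers only).
# prior: fraction of tested hypotheses that are actually true
# power: P(significant | effect real)
# alpha: P(significant | no effect)
prior, power, alpha = 0.01, 0.80, 0.05

prior_odds = prior / (1.0 - prior)
likelihood_ratio = power / alpha            # Bayes factor carried by "significant"
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1.0 + posterior_odds)

print(f"P(effect real | significant) = {posterior_prob:.2f}")  # about 0.14
```

The arithmetic is trivial; the inaccessible part is the 1-in-100 prior itself, which is exactly the number we usually cannot obtain objectively.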

We move on to ideas of positivism, Popperism, falsifiability, and frequentism. Maybe we can’t work out the odds of a statement being true, but we may be able to eliminate some obviously false statements. Under frequentism we can at least complete a calculation (though it may not mean what we hope to claim). This is roughly where science is, and (despite its limitations and not giving proper posterior odds) I think it is about where we need to be, if done correctly.

The problems with frequentism

Without access to objective priors, I think frequentism is about as good as empirical science is commonly going to get. However, even to apply frequentist methods correctly you need to think deeply about at least the following two issues:

  • Frequentist statistics depend on both the sequence of experiments performed and the sequence of experiments even contemplated (counter-factuals)! For example, experimenter intent can matter.

    Consider an honest experimenter who says they are going to pick an integer k from the geometric distribution with p=1/2, flip a fair coin that many times, and report only the last flip. Also consider a dishonest experimenter who flips a fair coin until it comes up heads and also reports only the last flip. The first reports “heads”/“tails” at a 50/50 rate, while the second always reports “heads”. Very different outcomes, and the only distinction is procedure, so if we are lied to about that we are at sea (a small simulation after this list makes the contrast concrete).

  • Frequentist results are one-sided. You can never succeed. You can only “fail to fail.” This is where frequentism is most abused. A researcher can show that the observed data is very unlikely under the assumption of a (hoped-to-be-falsified) null hypothesis. This lowers the plausibility of the null hypothesis. The hope is that some of this lost plausibility is captured by the researcher’s pet hypothesis (which is not always the case, as it can be captured by other competing possibilities).

    There is an analogy to this in constructive mathematics. One of the tenets of classical logic is “not not P is equivalent to P.” This is the plan hinted at in null-hypothesis testing: if we read “not P” as “no effect,” then it reads as “falsifying no-effect is equivalent to showing an effect.” This is routinely abused into “falsifying no-effect is equivalent to proving my hypothesis” (i.e. claiming to have supported a particular reason or mechanism for an effect).

    However in constructive logic “not not P is equivalent to P” is not true in general. Though we do have “not not not P is equivalent to not P” in intuitionist logic (see the short Lean check after this list). In terms of hypothesis testing this reads as: “failing to falsify the null-hypothesis is equivalent to maintaining the null-hypothesis.” This statement has more limited content, which is why it holds even in intuitionist logic.
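
To make the procedure-dependence point concrete, here is a minimal simulation sketch (mine, not part of the original argument) of the two experimenters described above. Both flip the same fair coin and both report only a single flip; only the stopping rule differs.

```python
# Sketch: the same fair coin, two different (reporting) procedures.
import random

def honest_report(p=0.5):
    # Pick k from a geometric distribution (p = 1/2), flip a fair coin
    # k times, and report only the last flip.
    k = 1
    while random.random() >= p:  # count trials until the first "success"
        k += 1
    last = None
    for _ in range(k):
        last = random.choice(["heads", "tails"])
    return last

def dishonest_report():
    # Flip a fair coin until it comes up heads and report that last flip,
    # which is therefore always "heads".
    while True:
        flip = random.choice(["heads", "tails"])
        if flip == "heads":
            return flip

random.seed(2016)
n = 10000
honest = sum(honest_report() == "heads" for _ in range(n)) / n
dishonest = sum(dishonest_report() == "heads" for _ in range(n)) / n
print(f"honest experimenter reports heads    {honest:.3f} of the time")   # about 0.5
print(f"dishonest experimenter reports heads {dishonest:.3f} of the time")  # exactly 1.0
```

The honest experimenter reports heads about half the time; the dishonest one reports heads every time. Nothing in the reported flip itself reveals the difference, only the procedure does.

And for the logic digression: the weaker equivalence really is constructive. A few lines of Lean (again, just an illustrative check) prove “not not not P is equivalent to not P” without any classical axioms, whereas “not not P implies P” needs the law of the excluded middle.

```lean
-- "not not not P ↔ not P" is provable constructively (no classical axioms),
-- while "not not P → P" is not.
theorem triple_neg_iff (P : Prop) : ¬¬¬P ↔ ¬P :=
  ⟨fun hnnn hp => hnnn (fun hnp => hnp hp),  -- from ¬¬¬P, refute any proof of P
   fun hnp hnnp => hnnp hnp⟩                 -- from ¬P, refute any proof of ¬¬P
```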

Conclusion

What are we to do? Accept results that are only run once (with absolutely no statistics)? Teach basic frequentism, which will be badly abused? Treat subjective Bayesianism (Bayes’ method with subjective priors) as universal (for example, I have a near-zero prior on reaction-less drives, but presumably reaction-less drive advocates have a much larger prior, so we will never agree on the interpretation of any reasonable number of experiments)? Wait for the impossible city of objective Bayesian priors?

I’d say the usable lesson comes from our digression into logic, which emphasizes that “failing to fail” is not always the same as success. However, in teaching I try to move researchers a bit further up the above sequence and ask them to keep their eyes even further up.