Guttman points out another problem with null hypothesis significance testing: It falls apart when considering replications.
Michael Nelson writes:
Re-reading a classic from Louis Guttman, “What is not what in statistics,” I saw his “Problem 2” with new eyes given the modern replication debate:
Both estimation and the testing of hypotheses have usually been restricted as if to one-time experiments, both in theory and in practice. But the essence of science is replication: a scientist should always be concerned about what will happen when he or another scientist repeats his experiment. For example, suppose a confidence interval for the population mean is established on the basis of a single experiment: what is the probability that the sample mean of the next experiment will fall in this interval? The level of confidence of the first experiment does not tell this. … The same kind of issue, with a different twist, holds for the testing of hypotheses. Suppose a scientist rejects a null hypothesis in favour of a given alternative: what is the probability that the next scientist’s experiment will do the same? Merely knowing probabilities for type I and type II errors of the first experiment is not sufficient for answering this question. … Here are some of the most realistic problems of inference, awaiting an answer. The matter is not purely mathematical, for the actual behaviour of scientists must be taken into account. (p.84)
This statement, literally as old as me [Nelson], both having been “issued” in 1977, is more succinct and more authoritative than most summaries of the current controversy. Guttman is also remarkably prescient in his intro as to the community’s reaction to this and other problems he highlights with conventional approaches:
An initial reaction of some readers may be that this paper is intended to be contentious. That is not at all the purpose. Pointing out that the emperor is not wearing any clothes is in the nature of the case somewhat upsetting. … Practitioners…would like to continue to believe that “since everybody is doing it, it can’t be wrong”. Experience has shown that contentiousness may come more from the opposite direction, from firm believers in unfounded practices. Such devotees often serve as scientific referees and judges, and do not refrain from heaping irrelevant criticisms and negative decisions on new developments which are free of their favourite misconceptions. (p. 84)
Guttman also makes a point I hadn’t really considered, nor seen made (or refuted) in contemporary arguments:
Furthermore, the next scientist’s experiment will generally not be independent of the first’s since the repetition would not ordinarily have been undertaken had the first retained the null hypothesis. Logically, should not the original alternative hypothesis become the null hypothesis for the second experiment?
He also makes the following, almost parenthetical statement, cryptic to me perhaps because of my own unfamiliarity with the historical arguments against Bayes:
Facing such real problems of replication may lead to doubts about the so-called Bayesian approach to statistical inference.
No one is perfect!
My reaction: Before receiving this email, I’d never known anything about Guttman; I’d just heard of Guttman scaling, that’s all. The above-linked article is interesting, and I guess I should read more by him.
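To make Guttman’s two questions concrete, here’s a quick simulation sketch (mine, not Guttman’s). It assumes normal data with known variance, equal sample sizes in both experiments, and, for the second question, an arbitrary normal distribution of true effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 25, 1.0            # per-study sample size and known sd (assumed)
se = sigma / np.sqrt(n)       # standard error of each sample mean
n_sims = 200_000

# Question 1: how often does the next experiment's sample mean land
# inside the first experiment's 95% confidence interval?
mu = 0.0                      # true mean (arbitrary)
xbar1 = rng.normal(mu, se, n_sims)
xbar2 = rng.normal(mu, se, n_sims)
capture = np.abs(xbar2 - xbar1) <= 1.96 * se
print("capture rate:", capture.mean())        # about 0.83, not 0.95

# Question 2: among first experiments that reject H0: theta = 0, how often
# does an otherwise-identical replication also reject? The answer depends
# on the assumed distribution of true effects, which is Guttman's point:
# the first study's type I and type II error rates don't determine it.
theta = rng.normal(0.0, 0.2, n_sims)          # assumed effect distribution
reject1 = np.abs(rng.normal(theta, se)) > 1.96 * se
reject2 = np.abs(rng.normal(theta[reject1], se)) > 1.96 * se
print("first-study rejection rate:", reject1.mean())
print("replication rate given rejection:", reject2.mean())
```

The capture rate for the first question works out to 2Φ(1.96/√2) − 1 ≈ 0.83, because the difference of two independent sample means is more variable than either one alone. And the replication rate in the second question moves around as you vary the assumed effect distribution, confirming that “merely knowing probabilities for type I and type II errors of the first experiment is not sufficient.”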
Regarding the Bayes stuff: yes, there’s a tradition of anti-Bayesianism (see my discussions with X here and here), and I don’t know where Guttman fits into that. The specific issue he raises may have to do with problems with the coherence of Bayesian inference in practice. If science works forward from prior_1 to posterior_1, which becomes prior_2, which is combined with new data to yield posterior_2, which becomes prior_3, and so forth, then this could create problems for the analysis of an individual study, as we’d have to be very careful about what we’re including in the prior. I think these problems can be resolved directly using hierarchical models for meta-analysis, but perhaps Guttman wasn’t aware of then-recent work in that area by Lindley, Novick, and others.
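To spell out that chain, here’s a minimal conjugate normal-normal sketch (made-up numbers, known sampling variances) in which each study’s posterior becomes the next study’s prior. Done cleanly, the sequential answer matches a single analysis of all the data at once; the practical danger is that a published “prior” may already contain some of the new data, so the same evidence gets counted twice:

```python
def update(prior_mean, prior_var, ybar, se2):
    """Conjugate normal-normal update: prior N(m, v), estimate ybar with variance se2."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / se2)
    post_mean = post_var * (prior_mean / prior_var + ybar / se2)
    return post_mean, post_var

# Three studies, each summarized by an estimate and its squared standard error
studies = [(0.40, 0.04), (0.15, 0.09), (0.28, 0.02)]   # made-up numbers

m, v = 0.0, 1.0                       # prior_1
for ybar, se2 in studies:
    m, v = update(m, v, ybar, se2)    # posterior_k becomes prior_{k+1}
print("sequential:", m, v)

# Coherence check: one combined update over all three studies gives the same answer
prec = 1.0 / 1.0 + sum(1.0 / se2 for _, se2 in studies)
mean = (0.0 / 1.0 + sum(ybar / se2 for ybar, se2 in studies)) / prec
print("all at once:", mean, 1.0 / prec)
```

A hierarchical model handles the same bookkeeping automatically by keeping each study’s data distinct from the shared population-level parameters, which is why I’d lean on it rather than on hand-me-down priors.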
Regarding the problems with significance testing: I think Guttman got it right, but he didn’t go far enough. In particular, he wrote, “Logically, should not the original alternative hypothesis become the null hypothesis for the second experiment?”, but this wouldn’t really work, as null hypotheses tend to be specific and alternatives tend to be general: you can compute a reference distribution under a point null such as theta = 0, but a diffuse alternative such as theta ≠ 0 gives no single distribution to test against, so the two roles can’t simply be swapped. I think the whole hypothesis-testing framework is pointless, and the practical problems where it’s used can be addressed using other methods based on estimation and decision analysis.
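As a sketch of the estimation alternative, here’s a bare-bones random-effects meta-analysis (made-up estimates; the between-study variance uses the DerSimonian-Laird moment formula) that partially pools the studies toward each other instead of issuing reject/retain verdicts:

```python
import numpy as np

# Made-up study estimates and standard errors
y = np.array([0.50, -0.10, 0.30, 0.05])
se = np.array([0.12, 0.15, 0.10, 0.20])

# DerSimonian-Laird moment estimate of the between-study variance tau^2
w = 1.0 / se**2
ybar_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - ybar_fixed) ** 2)
k = len(y)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# Random-effects pooled mean and partially pooled per-study estimates
w_re = 1.0 / (se**2 + tau2)
mu_hat = np.sum(w_re * y) / np.sum(w_re)
shrink = tau2 / (tau2 + se**2)            # weight on each study's own estimate
theta_hat = shrink * y + (1 - shrink) * mu_hat

print("pooled mean:", mu_hat, "  tau^2:", tau2)
print("partially pooled study effects:", theta_hat)
```

The output is a set of estimates with a measure of between-study variation: answers to “how big is the effect, and how consistent is it across studies?” rather than “reject or retain?” A full hierarchical model would also propagate the uncertainty in tau^2 instead of plugging in a point estimate.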