The Cult of Statistical Significance [book review]

untitled 2015-07-01

(Post contributed by Christian Robert)

“Statistical significance is not a scientific test. It is a philosophical, qualitative test. It asks “whether”. Existence, the question of whether, is interesting. But it is not scientific.” S. Ziliak and D. McCloskey, p.5

The book, written by economists Stephen Ziliak and Deirdre McCloskey, has a theme bound to attract Bayesians and all those puzzled by the absolute and automatised faith in significance tests. The main argument of the authors is indeed that an overwhelming majority of papers stop at rejecting variables (“coefficients”) on the sole and unsupported basis of non-significance at the 5% level. Hence the subtitle “How the standard error costs us jobs, justice, and lives”… This is an argument I completely agree with; however, the aggressive style of the book truly put me off! As with Error and Inference, which also addresses a non-Bayesian issue, I could have let the matter go. However, I feel the book may in the end be counter-productive and thus endeavour to explain why through this review. (I wrote the following review in batches, before and during my trip to Dublin, so the going is rather broken, I am afraid…)

“Advanced empirical economics, which we’ve endured, taught, and written about for years, has become an exercise in hypothesis testing, and is broken. We’re saying the brokenness extends to many other quantitative sciences.” S. Ziliak and D. McCloskey, p. xviii

The first chapters contain hardly any scientific argument, but rather imprecations against those blindly using significance tests. Rather than explaining in simple terms and with a few mathematical symbols [carefully avoided throughout the book] what the issue with significance tests is, Ziliak and McCloskey start with the assumption that the reader knows what tests are or, worse, that the reader does not need to know. While the insistence on thinking about the impact of a significant or insignificant coefficient/parameter in terms of the problem at hand is more than commendable, the alternative put forward by the authors remains quite vague, like “size matters”, “how big is big?”, and so on. They mention Bayesian statistics a few times, along with quotes from Jeffreys and Zellner, but never get into the details of their perspective on model assessment. (In fact, their repeated call for determining how important the effect is seems to lead to some sort of prior on the alternative to the null.) It would have been so easy to pick one of the terrible examples mocked by Ziliak and McCloskey and to show what a decent statistical analysis could produce without more statistical sophistication than the one required by t-tests. Instead, the authors have conducted a massive and rather subjective study of the American Economic Review for the 1980’s with regard to the worth of all [statistical] significance studies used in all papers published in the journal, then repeated the analysis for the 1990’s, and those studies constitute the core of their argument. (Following chapters reproduce the same type of analysis in other fields like epidemiology and psychometrics.)

“Fisher realized that acknowledging power and loss function would kill the unadorned significance testing he advocated and fought to the end, and successfully, against them.” S. Ziliak and D. McCloskey, p.144

Ziliak and McCloskey somewhat surprisingly seem to focus on the arch-villain Ronald Fisher while leaving Neyman and Pearson safe from their attacks. (And turning Gosset into the good fellow, supposed to be “hardly remembered nowadays” [p.3], while being dubbed a “lifelong Bayesian” [p.152].) I write “surprisingly” because Fisher did not advise the use of a fixed significance level (even though he indeed considered 5% as a convenient bound) as much as the use of the p-value per se, while Neyman and Pearson introduced fixed 5% significance levels as an essential part of their testing apparatus. (See the previous posts on Error and Inference for more discussions on that. And of course Jim Berger’s “Could Fisher, Jeffreys, and Neyman have agreed on testing?”) Not a surprising choice when considering the unpleasant personality of Fisher, of course! (Another over-the-top attack: “Fisherians do not literally conduct experiments. The brewer did.” [p.27] What was Fisher doing in Rothamsted then? Playing with his calculator?!) The twin fathers of significance testing seem to escape the wrath of Ziliak and McCloskey due to their use of a loss function… Or maybe due to their defining a precise alternative. While I completely agree that loss functions should be used to decide about models (or predictives, to keep Andrew happy!), the loss function imagined by Neyman and Pearson is simply too mechanistic to make any sense to a decision analyst. Or even to a statistician. We discussed earlier the València 9 paper of Guido Consonni, in connection with more realistic loss functions. Also, the authors seem to think power is an acceptable way to salvage significance tests, while I never understood the point of arguing in favour of power since, like other risk functions, power depends on the unknown parameter(s) and it is hence improbable that two procedures will get uniformly ordered for all values of the parameter(s), except in textbook situations. For instance, they think that classical sign tests are good guys!
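To make the remark about power concrete, here is a minimal sketch that is not taken from the book. It assumes a normal sample of size n = 25 with known standard deviation 1 (both arbitrary choices of mine) and compares the power functions of the one-sided and two-sided z-tests of θ = 0 at the 5% level. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for quantiles and cdf values


def power_one_sided(theta, n=25, sigma=1.0, alpha=0.05):
    """Power of the z-test rejecting H0: theta = 0 when the sample mean is large."""
    shift = theta * n ** 0.5 / sigma  # mean of the z-statistic under theta
    return 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - shift)


def power_two_sided(theta, n=25, sigma=1.0, alpha=0.05):
    """Power of the z-test rejecting H0: theta = 0 when |sample mean| is large."""
    shift = theta * n ** 0.5 / sigma
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(-z - shift) + 1.0 - STD_NORMAL.cdf(z - shift)


# n and sigma are illustrative assumptions, not values from the book.
for theta in (-0.4, -0.2, 0.0, 0.2, 0.4):
    print(f"theta={theta:+.1f}  one-sided={power_one_sided(theta):.3f}  "
          f"two-sided={power_two_sided(theta):.3f}")
```

Both tests have the same level at θ = 0, but the one-sided test is more powerful for positive θ and the two-sided test for negative θ, so neither dominates the other over the whole parameter space.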

“Significance unfortunately is a useful mean towards personal ends in the advance of science – status and widely distributed publications, a big laboratory, a staff of research assistants, a reduction in teaching load, a better salary, the finer wines of Bordeaux. (…) In a narrow and cynical sense statistical significance is the way to achieve these.” S. Ziliak and D. McCloskey, p.32

In a possibly unnecessary fashion, let me repeat that I find it quite sad that a book addressing such an important issue lets aggressiveness, arrogance, and below-the-belt rhetoric ruin its purpose. It sounds too much like a crusade against an establishment to be convincing to neophytes and to be taken as a serious warning. (I wonder in fact what the intended readership of this book is, given that it requires some statistical numeracy, but not “too much” to be open-minded about statistical tests!) Bullying certainly does not help in making one’s case more clearly understood: even though letting mere significance tests at standard levels rule the analysis of a statistical model is a sign of intellectual laziness, or of innumeracy, accusing its perpetrators of intentional harm and cynicism does not feel adequate. Once again, I fully agree that users of statistical methods should not let SAS (or any other commercial software) write their research paper for them but, instead, think about the indications provided by such outputs in terms of the theory and concepts behind their model(s). Interestingly, Ziliak and McCloskey mention for instance the use of simulation and pseudo-data to reproduce the performance of those tests under the assumed model and to calibrate the meaning of tools like p-values. A worthwhile and positive recommendation in an otherwise radically negative and counter-productive book.
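As a rough illustration of what such a calibration by pseudo-data could look like (my own sketch, not the authors’ procedure), the code below simulates normal samples under an assumed model and records how often a two-sided z-test of θ = 0 rejects at the 5% level, first under the null and then under an alternative. The sample size, standard deviation, alternative value θ = 0.5, seed, and function names are all arbitrary assumptions.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def simulate_p_values(theta, n=20, sigma=1.0, n_rep=4000, seed=1):
    """Two-sided z-test p-values for H0: theta = 0 on simulated normal samples."""
    rng = random.Random(seed)  # fixed seed so the pseudo-data are reproducible
    p_values = []
    for _ in range(n_rep):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        z = fmean(sample) * n ** 0.5 / sigma  # z-statistic with known sigma
        p_values.append(2.0 * (1.0 - STD_NORMAL.cdf(abs(z))))
    return p_values


def rejection_rate(p_values, level=0.05):
    """Fraction of simulated p-values falling below the chosen level."""
    return sum(p < level for p in p_values) / len(p_values)


# theta = 0.5 is an illustrative alternative, not a value from the book.
null_rate = rejection_rate(simulate_p_values(theta=0.0))
alt_rate = rejection_rate(simulate_p_values(theta=0.5))
print(f"rejection rate under the null: {null_rate:.3f}")
print(f"rejection rate under theta=0.5: {alt_rate:.3f}")
```

Under the null the rejection rate should stay close to the nominal 5%, while under the alternative it estimates the power at that particular value of θ, which gives a concrete scale against which an observed p-value can be read.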

“Adam Smith, who is much more than an economist, noted in 1759 that hatred, resentment, and indignation against bad behavior serve, of course, a social purpose (…) “Yet there is still something disagreeable in the passions themselves.”.” S. Ziliak and D. McCloskey, p.55

The first example Ziliak and McCloskey use to make their point falls quite far from the mark: in Chapter 1, discussing the impact of two diet pills A and B with means 20 and 5 and standard deviations 5 and 1/4, respectively, they conclude that B gives a smaller p-value for the test of whether or not the pill has an effect. Because 20/10=2 and 5/(1/2)=10, the denominators being twice the standard deviations. There are two misleading issues there: first, the diets are compared in terms of mean effect, so outside statistics. Second, running a t-test of nullity of the mean is not meaningful in this case. What matters is whether or not one diet is more effective than the other. Assuming independent normal distributions, and writing X for a standard normal variable, we have here

$P(A>B) = P(X>-15/\sqrt{25+1/16}) = \Phi(2.996) = 0.999$

which is a pretty good argument in favour of diet pill A. (Of course, this is under the normal assumption and all that, which can be criticised and assessed.) The surprising thing is that Ziliak and McCloskey correctly criticise a similar error about the New Jersey vs. Pennsylvania minimum wage study (Chapter 9, pp.101-103).
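For readers who want to check the arithmetic, the probability above can be recomputed in a few lines. This is only a sketch under the same assumptions as the formula (independent normal effects with the stated means and standard deviations); the variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

# Means and standard deviations of the two diet pills, as quoted in the review.
mean_a, sd_a = 20.0, 5.0
mean_b, sd_b = 5.0, 0.25

# Under independence (an assumption), A - B is normal with mean 15
# and variance 25 + 1/16.
difference = NormalDist(mu=mean_a - mean_b, sigma=sqrt(sd_a ** 2 + sd_b ** 2))

z_value = difference.mean / difference.stdev  # argument of Phi in the formula
prob_a_beats_b = 1.0 - difference.cdf(0.0)    # P(A - B > 0)

print(f"z = {z_value:.3f}, P(A > B) = {prob_a_beats_b:.3f}")
```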

“Around the time that significance testing was sinking deeply into the life and human sciences, Jean-Paul Sartre noted a personality type. “There are people who are attracted by the durability of a stone (…)” Sartre could have been talking about the psychological makeup of the most rigid of the significance testers.” S. Ziliak and D. McCloskey, p. 32.

The above quote shows the authors are ready to call on an “authority” as un-scientific as Jean-Sol Partre, which would be enough for me to close the case! Esp. because I am attracted by stones… Except that I came upon the quote

“Fisher-significance is a manly sounding answer, though false. And one can see in the dichotomy of hard and soft a gendered worry, too. The worry may induce some men to cling to Significance Only. (…) Around 1950, at the peak of gender anxiety among middle-class men in the United States, nothing could be worse than to call a man soft.” S. Ziliak and D. McCloskey, pp. 140-141.

which is so completely inappropriate and unrelated as to be laughable… It also shows how far from rational academic arguments Ziliak and McCloskey are ready to stray in order to make their point. (They also blame the massacre of whales and the torturing of lambs, p. 39, on t-tests!) Just as laughable is the characterisation of statistics as the “bourgeois cousin” of probability theory (p.195) at a time when both fields did not truly exist and were clearly mixed in most researchers’ minds (as shown by the titles of Keynes’ and Jeffreys’ books).

(Note: the book was published in 2008 and hence has already received a lot of reviews. However, it was not much publicised in statistical circles, and even less in mine, so I only became aware of it this summer. Here are some reviews on The Endeavour and kwams, who also blogs about the review by Aris Spanos, who interestingly complains about the authors “using a variety of well-known rhetorical strategies and devices”, and the reply from the authors. David Aldous also wrote a convincing and balanced review of the book on Amazon. Now, most ironically, as I was completing this book review, I received the latest issue of Significance, which contained an article by Stephen Ziliak on Matrixx v. Siracusano, about the Supreme Court ruling that statistical significance does not imply causation or association. He and Deirdre McCloskey were experts in this ruling; however, as an academic, I fail to see how a Supreme Court ruling brings any scientific support to the case… Actually, several articles in this issue are linked to the damages caused by the blind use of significance tests. In particular, the xkcd comic about p-values, which in my opinion has more impact than the cult of significance!)