Reading the referee reports of that retracted paper by the science reformers: A peek behind the curtain

Statistical Modeling, Causal Inference, and Social Science 2025-01-26

One interesting sidelight of the story of a much-criticized and finally retracted article on replication in psychology is that we get to read some of the referee reports.

These aren’t the reports on the original article—I guess those reviews were positive, otherwise it wouldn’t have been published in the first place—but on a proposed replacement paper; Berna Devezer explains the background.

The reviews are in this document, which came from this public repository.

The first two reviews (by Tal Yarkoni and Daniël Lakens) are really long, and they are both very thoughtful, really the equivalent of publishable articles (or blog posts) on their own. I guess they knew this was a high-stakes case, so they put in extra effort.

The third review was more cursory, and its summary was, “This is an important and novel set of findings” and “The approach is valid, and the quality of data and of its presentation are good.” After reading reviews 1 and 2, I don’t think reviewer 3’s assessment is correct—but I’m not gonna get all upset at reviewer 3. I’ve reviewed hundreds, maybe thousands of papers. I do a quick read, and sometimes I miss the point! That’s why a journal will obtain three reviews—different reviewers have different styles, and any given reviewer can make a mistake.

Anyway, for those of you who have never been involved in editing a scholarly journal, it might be interesting to read these reviews, just to get a peek inside a system that is usually hidden from outsiders.

Some of the reviewers’ comments are just so on the mark:

The paper’s central claim—i.e., that a high rate of replication failures is not inevitable if one uses optimal practices—is obviously true, and requires no empirical support. The only way it could be false is if there were no reliable effects at all in social science—which is on its face an absurd proposition. The authors write that they “report the results of a prospective replication examining whether low replicability and declining effects are inevitable when using current optimal practices.” But how could low replicability and declining effects possibly be “inevitable”, either with or without optimal practices?

Good point!

A charitable reading of the authors’ central claim might be that what they are really trying to do is quantitatively estimate the impact of using best practices on the replicability of previous findings. That is, the question is not really whether or not replicability is *achievable* (of course it is), but *under what conditions a certain degree of replicability is achievable*, and *how much of an impact certain procedures seem to make*. If the paper were explicitly framed this way, I think it would be posing an important question. A serious effort to quantify the impact of implementing specific rigorous methodologies on replicability would be a valuable service to the field. However, framing the paper this way would also make it clear that, at least in its current presentation, the study has a number of major design limitations that in my view preclude it from providing an informative answer to the question.

Bang! The reviewer is not trying to be mean; he’s just stating some truths.

The generalization target is unclear. It is never made clear in the manuscript what population of effects the authors take their conclusions to apply to. . . . absent the authors’ disclosure of the processes that led them to select these particular effects for inclusion, it is impossible to determine—even in a ballpark sense—what population of studies the present results are meant to apply to.

That sort of thing is often important. It’s the kind of thing that people can be sloppy about when writing a paper, and that journals will often just let through because nobody cares enough to go through and get everything right.

Inadequate description of effects.

That one’s huge. Here’s an example:

For example, for the “Ads” study, Table 1 describes the central result as “Watching a short ad within a soap-opera episode increases one’s likelihood to recommend and promote the company in the ad”. This is incorrect in at least two ways. First, the authors did not measure likelihood of recommendation or promotion of companies; they measured *self-reported* likelihoods of these behaviors. That there is at best a very weak relationship between these things should be obvious, or else McDonald’s would experience a significant boost in revenue every time it aired a single ad on TV, which is obviously not the case (indeed, there is a cottage industry within marketing research questioning whether TV ad campaigns have *any* meaningful effect on sales). Second, the current wording implies that the effect applies to companies in general, when actually the authors only asked about McDonald’s, and used only a single ad comparison (McDonald’s vs. Prudential). This design does not license any general conclusions about “the company in the ad”; it licenses conclusions only about McDonald’s.

This reminds me of our interrogation of a psychology paper that kept describing things inaccurately. It can be so hard for authors to just say exactly what they did!

Of the “Cookies” study, the authors write: “People will be seen as greedier when they take three of the same kind of (free) cookie than when they take three different (free) cookies”. This is an inaccurate description, as no cookie-takers were actually observed; participants were asked to *imagine* how they would feel if they observed people taking cookies. A more accurate description would be: “Participants directed to imagine a specific hypothetical norm-violation scenario rate it as greedier to take three of the same kind of (free) cookie than to take three different (free) cookies.”

You might see this as picky, but I don’t. We’re supposed to be doing science here! You say what you actually did. Even though, I know, it’s absolutely standard practice not to do this (notoriously, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” describing a study that had no measures of power, let alone “instantly becoming more powerful”).

At several points in the paper the authors remark that the performed studies had ‘high statistical power’ (e.g., abstract). But this is not true. . . . If this confirmatory study happens to be a study where, as a fluke, an effect size was observed that was rather extreme (as should happen, by chance, in 16 studies), and the true effect was smaller (as it seems to be, in the replication studies), then this was not a sufficiently powered study. Instead, it *seemed to be* a sufficiently powered study, based on the *observed* effect size. But if we take the four replications as an estimate of the true effect size, the study had low power. Of course, all of this requires some speculation, as we never know the true effect size, but the point is, the authors cannot argue the studies all had high power, and the absence of a power analysis (let alone a conservative power analysis, such as a safeguard power analysis) should make the authors even more careful about claiming they had ‘high power’. Power is a curve. . . .

That’s a lot of words, but it’s understandable. It can take a lot of words to explain what’s going on to people who have made a mistake. And it’s just a referee report; there’s no reason the reviewer should put in lots of time to write it crisply. The point is, yeah, the statement in the article was wrong. And, again, the reviewer isn’t being mean, he’s just telling it like it is.
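
To make that last point concrete, here’s a little power calculation in Python (the effect sizes and sample size are invented for illustration, not taken from the paper): plan a study around an inflated observed effect size and it looks well powered; evaluate the same design against the smaller effect suggested by the replications and the power collapses. And, as the reviewer says, power is a curve, so the last few lines sweep the assumed effect size.

```python
# A minimal sketch of the reviewer's point (illustrative numbers only; the
# effect sizes and sample size below are assumptions, not taken from the paper).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

n_per_group = 50         # hypothetical sample size per group
observed_d = 0.6         # inflated effect size observed in the original study
plausible_true_d = 0.25  # smaller effect suggested by the replications

# Power computed from the observed effect size looks comfortable (around 0.8)...
power_observed = analysis.power(effect_size=observed_d, nobs1=n_per_group, alpha=0.05)

# ...but against the plausible true effect, the same design is underpowered (around 0.2).
power_true = analysis.power(effect_size=plausible_true_d, nobs1=n_per_group, alpha=0.05)

print(f"power assuming d = {observed_d}: {power_observed:.2f}")
print(f"power assuming d = {plausible_true_d}: {power_true:.2f}")

# "Power is a curve": sweep the assumed effect size to see how quickly power falls off.
for d in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(f"d = {d:.1f} -> power = {analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05):.2f}")
```

Nothing fancy here: the power you compute depends entirely on the effect size you assume, which is exactly why “we had high power” is not something you can claim on the basis of the observed estimate alone.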