What’s the story behind that paper by the Center for Open Science team that just got retracted?

Statistical Modeling, Causal Inference, and Social Science 2024-09-26

Nov 2023: A paper was published with the inspiring title, “High replicability of newly discovered social-behavioural findings is achievable.” Its authors included well-known psychologists who were active in the science-reform movement, along with faculty at Berkeley, Stanford, McGill, and other renowned research universities. The final paragraph of the abstract reported:

When one lab attempted to replicate an effect discovered by another lab, the effect size in the replications was 97% that in the original study. This high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries.

This was a stunning result, apparently providing empirical confirmation of the much-mocked Harvard claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

Indeed, upon publication of that article, the renowned journal Nature released a news article titled “What reproducibility crisis? New research protocol yields ultra-high replication rate”:

In a bid to restore its reputation, experimental psychology has now brought its A game to the laboratory. A group of heavy-hitters in the field spent five years working on new research projects under the most rigorous and careful experimental conditions possible and getting each other’s labs to try to reproduce the findings. . . . The study, the authors say, shows that research in the field can indeed be top quality if all of the right steps are taken. . . .

Since then, though, the paper has been retracted. (The news article is still up in its original form, though.)

What happened?

Sep 2024: Here is the retraction notice, in its entirety:

The Editors are retracting this article following concerns initially raised by Bak-Coleman and Devezer.

The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.

Post-publication peer review and editorial examination of materials made available by the authors upheld these concerns. As a result, the Editors no longer have confidence in the reliability of the findings and conclusions reported in this article. The authors have been invited to submit a new manuscript for peer review.

All authors agree to this retraction due to incorrect statements of preregistration for the meta-study as a whole but disagree with other concerns listed in this note.

As many people have noticed, the irony here is that the “most rigorous and careful experimental conditions possible” and “all the right steps” mentioned in that news article refer to procedural steps such as . . . preregistration!

June 2023: Back before the original paper was published, let alone retracted, I expressed my opinion that their recommended “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” were not the best ways of making a study replicable. I argued that those sorts of procedural steps are less important than clarity in scientific methods (“What exactly did you do in the lab or the field, where did you get your participants, where and when did you work with them, etc.?”), implementing treatments that have large effects (this would rule out many studies of ESP, subliminal suggestion, etc.), focusing on scenarios where effects could be large, and improving measurements. Brian Nosek, one of the authors of the original article, responded to me and we had some discussion here. I later published a version of my recommendations in an article, Before Data Analysis: Additional Recommendations for Designing Experiments to Learn about the World, in the Journal of Consumer Psychology.

Sep 2024: As noted above, the retraction of the controversial paper was spurred by concerns of Joe Bak-Coleman and Berna Devezer, which the journal published here. These are their key points:

The authors report a high estimate of replicability, which, in their appraisal, “justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries” . . . However, replicability was not the original outcome of interest in the project, and analyses associated with replicability were not preregistered as claimed.

Again, let me emphasize that preregistration and methodological transparency were two of the “rigour-enhancing measures” endorsed in that original paper. So for them to have claimed preregistration and then not to have done it, that’s not some sort of technicality. It’s huge; it’s at the very core of their claims.

Bak-Coleman and Devezer continue:

Instead of replicability, the originally planned study set out to examine whether the mere act of scientifically investigating a phenomenon (data collection or analysis) could cause effect sizes to decline on subsequent investigation . . . The project did not yield support for this preregistered hypothesis; the preregistered analyses on the decline effect and the resulting null findings were largely relegated to the supplement, and the published article instead focused on replicability, with a set of non-preregistered measures and analyses, despite claims to the contrary.

Interesting. So the result that got all the attention (“When one lab attempted to replicate an effect discovered by another lab, the effect size in the replications was 97% that in the original study”) and which was presented as so positive for science (“This high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries”) was actually a negative result which went against their preregistered hypothesis! Had the estimated effects declined, that would’ve been a win for the original hypothesis (which apparently involved supernatural effects, but that’s another story); a lack of decline becomes a win in the new framing.

The other claim in the originally published paper, beyond the positive results in the replications, was that the “rigour-enhancing practices” had a positive causal effect. But, as Bak-Coleman and Devezer note, the study was neither designed nor carried out in a way that could estimate the effect of such practices. There were also lots of other issues, for example changes of outcome measures; you can read Bak-Coleman and Devezer’s article for the details.

Nov 2023 – Mar 2024: After the article appeared but before it was retracted, Jessica Hullman expressed concerns here and here about the placement of this research within the world of open science:

On some level, the findings the paper presents – that if you use large studies and attempt to eliminate QRPs, you can get a high rate of statistical significance – are very unsurprising. So why care if the analyses weren’t exactly decided in advance? Can’t we just call it sloppy labeling and move on?

I care because if deception is occurring openly in papers published in a respected journal for behavioral research by authors who are perceived as champions of rigor, then we still have a very long way to go. Interpreting this paper as a win for open science, as if it cleanly estimated the causal effect of rigor-enhancing practices, is not, in my view, a win for open science. . . .

It’s frustrating because my own methodological stance has been positively impacted by some of these authors. I value what the authors call rigor-enhancing practices. In our experimental work, my students and I routinely use preregistration, we do design calculations via simulations to choose sample sizes, we attempt to be transparent about how we arrive at conclusions. . . .

When someone says all the analyses are preregistered, don’t just accept them at their word, regardless of their reputation.

The first comment on that latter post was by Anonymous, who wrote, “Data Colada should look into this.”

The joke here is that Data Colada is a blog run by three psychologists who have done, and continue to do, excellent investigatory work in the science reform movement—they’re on our blogroll! and they recently had to withstand a specious multi-million dollar lawsuit, so they’ve been through a lot—so looking into a prominent psychology paper that misrepresented its preregistration would be right up their alley—except that one of the authors of that paper is . . . a Data Colada author! I continue to have huge respect for these people. Everyone makes mistakes, and it gets tricky when you are involved in a collaborative project that goes wrong.

Sep 2024: After the retraction came out, Jessica published a followup post reviewing the story:

From the retraction notice:

The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.

This is obviously not a good look for open science. The paper’s authors include the Executive Director of the Center for Open Science, who has consistently advocated for preregistration because authors pass off exploratory hypotheses as confirmatory.

Jessica continues:

As a full disclosure, late in the investigation I was asked to be a reviewer, probably because I’d shown interest by blogging about it. Initially it was the extreme irony of this situation that made me take notice, but after I started looking through the files myself I’d felt compelled to post about all that was not adding up. . . .

What I still don’t get is how the authors felt okay about the final product. I encourage you to try reading the paper yourself. Figuring out how to pull an open science win out of the evidence they had required someone to put some real effort into massaging the mess of details into a story. . . . The term bait-and-switch came to mind multiple times as I tried to trace the claims back to the data. Reading an academic paper (especially one advocating for the importance of rigor) shouldn’t remind one of witnessing a con, but the more time I spent with the paper, the more I was left with that impression. . . . The authors were made aware of these issues, and made a choice not to be up front about what happened there.

On the other hand:

It is true that everyone makes mistakes, and I would bet that most professors or researchers can relate to having been involved in a paper where the story just doesn’t come together the way it needs to, e.g., because you realized things along the way about the limitations of how you set up the problem for saying much about anything. Sometimes these papers do get published, because some subset of the authors convinces themselves the problems aren’t that big. And sometimes even when one sees the problems, it’s hard to back out for peer pressure reasons.

Sep 2024: Bak-Coleman posted his own summary of the case. It’s worth reading the whole thing. Here I want to point out one issue that didn’t come up in most of the earlier discussions, a concern not just about procedures (preregistration, etc.) but about what was being studied:

Stephanie Lee’s story covers the supernatural hypothesis that motivated the research and earned the funding from a parapsychology-friendly funder. Author Jonathan Schooler had long ago proposed that merely observing a phenomenon could change its effect size. Perhaps the other authors thought this was stupid, but that’s a fantastic reason to either a) not be part of the project or b) write a separate preregistration for what you predict. We can see how the manuscript evolved to obscure this motivation for the study. The authors were somewhat transparent about their unconventional supernatural explanation in the early drafts of the paper from 2020:

According to one theory of the decline effect, the decline is caused by a study being repeatedly run (i.e., an exposure effect). According to this account, the more studies run between the confirmation study and the self-replication, the greater the decline should be.

This is nearly verbatim from the preregistration:

According to one theory of the decline effect, the decline is caused by a study being repeatedly run (i.e., an exposure effect). Thus, we predict that the more studies run between the confirmation study and the self-replication, the greater will be the decline effect.

It is also found in responses to reviewers at Nature, who sensed the authors were testing a supernatural idea even though they had reframed things towards replication by this point:

The short answer to the purpose of many of these features was to design the study a priori to address exotic possibilities for the decline effect that are at the fringes of scientific discourse….

As an aside, it’s wild to call your co-authors and funder the fringes of scientific discourse. Why take money from and work with cranks? Have some dignity. . . .

This utterly batshit supernatural framing erodes en route to the published manuscript. Instead, the authors refer to these primary hypotheses that date back to the origin of the project as phenomena of secondary interest and do not describe the hypotheses and mechanisms explicitly. They refer only to this original motivation in the supplement of “test of unusual possible explanations.” . . .

It’s fine to realize your idea was bad, but something else to try to bury it in the supplement and write up a whole different paper you describe in multiple places as being preregistered and what you set out to study. Peer review is no excuse for misleading readers just to get your study published because the original idea you were funded to study was absurd.

Nevertheless, when you read the paper, you’d have no idea this is what they got funding to study. Their omitted variables and undisclosed deviations in their main-text statistical models make it even harder to discern they were after the decline effect. They were only found in the pre-registered analysis code which was made public during the investigation.

In distancing themselves from two of the three reasons they got funding, they mislead the reader about what they set out to study and why. This isn’t a preregistration issue. This is outcome switching, and lying. It’s almost not even by omission because they say it’s the fringes of scientific discourse but it’s the senior author on the paper!

In 2019 I spoke at a conference at Stanford (sorry!) that was funded by those people, and I agree with Bak-Coleman that science reform and supernatural research are strange bedfellows. Indeed, the conference itself was kinda split, with most of the speakers being into science reform but with a prominent subgroup who were pushing traditional science-as-hero crap—I guess they saw themselves as heroic Galileo types. I remember one talk that started going on about how brilliant Albert Einstein or Elon Musk was at the age of 11, another talk all about Nobel prize winners . . . that stuff got me so annoyed I just quietly slipped out of the auditorium and walked outside the building into the warm California sun . . . and there I met some other conference participants who were equally disgusted by that bogus hero-worship thing . . . I’d found my science-reform soulmates! I also remember the talk by Jonathan Schooler (one of the authors of the recently-retracted article), not in detail but I do remember being stunned that he was actually talking about ESP. Really going there, huh? It gave off a 70s vibe, kinda like when I took a psychology class in college and the professor recommended that we try mind-altering drugs. (That college course was in the 80s, but it was the early 80s, and the professor definitely seemed like a refugee from the 60s and 70s; indeed, here he is at some sort of woo-woo-looking website.)

Responses from the authors of the original article

I wanted to supplement the above readings with any recent statements by the prominent authors of the now-retracted article, but I can’t find anything online at the sites of Brian Nosek, Data Colada, or elsewhere. If someone can point me to something, I’ll add the link. Given all the details provided by Bak-Coleman and others, it’s hard for me to imagine that a response from the authors would change my general view of the situation, but I could be missing something, and it’s always good to hear more perspectives.

Putting it all together

The interesting thing about this story—its “man bites dog” aspect—is that the people involved in the replication failure are not the usual suspects. This is not a case of TED-talking edgelords getting caught in an exaggeration. This time we’re talking about prominent science reformers. Indeed, Brian Nosek, leader of the Center for Open Science, coauthored a wonderful paper a few years ago detailing how they’d fooled themselves with forking paths and how they were saved from embarrassment by running their own replication study.

One thing that concerned me when I first heard this story—and I think Jessica had a similar reaction—was, are these people being targeted? Whistleblowers attract reaction, and lots of people are there waiting to see you fall. I personally get annoyed when people misrepresent my writings and claim that I’ve said things that I never said—this happens a lot, and when it happens I’m never sure where to draw the line between correcting the misrepresentation and just letting it go, because the sort of people who will misrepresent are also the sort of people who won’t correct themselves or admit they were wrong; basically, you don’t want to get in a mudwrestling match with someone who doesn’t mind getting dirty—so I was sensitive to the possibility that Nosek and the other authors of that paper were being mobbed. But when I read what was written by Hullman, Bak-Coleman, Devezer, and others, I was persuaded that they were treating the authors of that paper fairly.

In that case, what happened? This was not a Wansink or Ariely situation where, in retrospect, they’d been violating principles of good science for years and finally got caught. Rather, the authors of that recently retracted paper included several serious researchers in psychology, along with people who had made solid contributions to science reform, as well as a couple of psychology researchers who were more fringey.

So it’s not a simple case of “Yeah, yeah, we could’ve expected that all along.” It’s more like, “What went wrong?”

I think three things are going on.

1. I think there’s a problem with trying to fix the replication crisis using procedural reforms, by which I mean things like preregistration, p-value or Bayes-factor thresholds, and changes in the processes of scientific publication. There’s room for improvement in all these areas, no doubt, and I’m glad that people are working on them—indeed, I’ve written constructively on many of these topics myself—but they don’t turn bad science into good science; all they do is offer an indirect benefit by changing the incentive structure and, ideally, motivating better work in the future. That’s all fine, but when they’re presented as “rigour-enhancing methods to increase the replicability of new discoveries” . . . No, I don’t think so.

I think that most good science (and engineering) is all about theory and measurement, not procedure.

Indeed, over-focus on procedure is a problem not just in the science-reform movement but also in statistics textbooks. We go on and on about random sampling, random assignment, linear models, normal distributions, etc etc etc. . . . All these tools can be very useful—when applied to measurements that address questions of interest in the context of some theoretical understanding. If you’re studying ESP or the effects of subliminal smiley faces on attitude or the effects of day of the month on vote intention, all the randomization in the world won’t help you. And statistical modeling—Bayesian or otherwise—won’t help you either, except in the indirect sense of making it more clear that whatever you’re trying to study is overwhelmed by noise. This sort of negative benefit is a real thing; it’s just not quite what reformers are talking about when they talk about methods to “increase the replicability of new discoveries.”
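To make that “overwhelmed by noise” point concrete, here is a toy simulation in Python. The numbers (a tiny true effect, noisy outcome measurements, a large sample analyzed with a single pre-specified test) are invented for illustration and are not taken from any of the studies discussed above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented numbers for illustration only.
    true_effect = 0.02   # tiny true treatment effect on the measured outcome
    noise_sd = 1.0       # person-to-person noise in the measurement
    n_per_arm = 500      # a "large" sample, analyzed exactly as planned
    n_sims = 10_000

    # Simulate many honest two-arm experiments with no forking paths.
    est = np.empty(n_sims)
    for i in range(n_sims):
        treated = true_effect + noise_sd * rng.standard_normal(n_per_arm)
        control = noise_sd * rng.standard_normal(n_per_arm)
        est[i] = treated.mean() - control.mean()

    se = noise_sd * np.sqrt(2 / n_per_arm)   # standard error of the difference in means
    signif = np.abs(est) > 1.96 * se         # the usual p < 0.05 cutoff

    print(f"standard error {se:.3f} vs. true effect {true_effect}")
    print(f"proportion 'significant': {signif.mean():.1%}")
    print(f"wrong sign among significant results: {(est[signif] < 0).mean():.1%}")
    print(f"exaggeration among significant results: {np.abs(est[signif]).mean() / true_effect:.1f}x")

Nothing in this design is procedurally wrong: the analysis is fixed in advance, the sample is large, there are no forking paths. Yet the rare “significant” estimates are exaggerated several-fold and sometimes have the wrong sign. Preregistering the design would not change any of those numbers; the fix has to come from better measurements or larger underlying effects.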

2. Related is the personalization of discourse in meta-science. Terms such as “p-hacking” and “questionable research practices” are in a literal sense morally neutral, but I feel that they are typically used in an accusatory way. That’s the case even with “harking” (hypothesizing after results are known), which I think is actually a good thing! One reason I started using the term “forking paths” is that it doesn’t imply intentionality, but there’s only so much you can do with words.

I think the bigger issue here is that scientific practice is taken as a moral thing, which leads to two problems. First, if “questionable research practices” is something done by bad people, then it’s hard to even talk about it, because then it seems that you’re going around accusing people. I know almost none of the people whose work I discuss—that’s how it should be, because publications are public, they’re meant to be read by strangers—and it would be very rare that I’d have enough information to try to judge their actions morally, even if I were in a position to judge, which I’m not. Second, if “questionable research practices” is something done by bad people, then if you know your motives are pure, logically that implies that you can’t have done questionable research practices. I think that’s kinda what happened with the recently retracted paper. The authors are science reformers! They’ve been through a lot together! They know they’re good people, so when they feel “accused” of questionable research practices (I put “accused” in quotes because what Bak-Coleman and Devezer were doing in their article was not to accuse the authors of anything but rather to describe certain things that were in the article and the metadata), it’s natural for them (the authors) to feel that, no, they can’t be doing these bad things, which puts them in a defensive posture.

The point about intentionality is relevant to practice, in that sins such as “p-hacking,” “questionable research practices,” “harking,” etc., are all things that are easy to fix—just stop doing these bad things!—and many of the proposed solutions, such as preregistration and increased sample size, require some effort but no thought. Some of this came up in our discussion of the 2019 paper, “Arrested Theory Development: The Misguided Distinction Between Exploratory and Confirmatory Research,” by Aba Szollosi and Chris Donkin.

3. The other thing is specific to this particular project but perhaps has larger implications in the world of science reform. A recent comment pointed to a discussion in an Open Science Framework group from 2013, in which someone suggested a multi-lab replication project and received an encouraging reply.

As I wrote here, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment in order to keep them on board. Within academic psychology, the science reform movement arose from a coalition between radical reformers (who viewed replications as a way to definitively debunk prominent work in social psychology they believed to be fatally flawed) and conservatives (who viewed replications as a way to definitively confirm findings that they considered to have been unfairly questioned on methodological grounds). As often in politics, this alliance was unstable and has in turn led to “science reform reform” movements from the “left” (viewing current reform proposals as too focused on method and procedure rather than scientific substance) and from the “right” (arguing that the balance has tipped too far in favor of skepticism).

To say it another way, the science reform movement promises different things to different people. At some level this sort of thing is inevitable in a world where different people have different goals but still want to work together. For some people such as me, the science reform movement is a plus because it opens up a space for criticism in science, not just in theory but actual criticism of published claims, including those made by prominent people and supported by powerful institutions. For others, I think the science reform movement has been viewed as a way to make science more replicable.

Defenders of preregistration have responded to the above points by saying something like, “Sure, preregistration will not alone fix science. It’s not intended to. It’s a specific tool that solves specific problems.” Fair enough. I just think that a lot of confusion remains on this point; indeed, my reasons for preregistering my own work are not quite the reasons that science reformers talk about.

Summary

The 2023 paper that claimed, “this high replication rate justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries,” was a disaster. The 2024 retraction of the paper makes it less of a disaster. As is often the case, what appears to be bad news is actually the revelation of earlier bad news; it’s good news that it got reported.

Confusion remains regarding the different purposes of replication, along with the role of procedural interventions such as preregistration that are designed to improve science.

We should all be thankful to Bak-Coleman and Devezer for the work they put into this project. I can see how this can feel frustrating for them: in an ideal world, none of this effort would have been necessary, because the original paper would never have been published!

The tensions within the science reform movement—as evidenced by the prominent publication of a research article that was originally designed to study a supernatural phenomenon, then was retooled to represent evidence in favor of certain procedural reforms, and finally was shot down by science reformers from the outside—can be seen as symbolic of, or representative of, a more general tension that is inherent in science. I’m speaking here of the tension between hypothesizing and criticism, between modeling and model checking, between normal science and scientific revolutions (here’s a Bayesian take on that). I think scientific theories and scientific measurement need to be added to this mix.