Well, today we find our heroes flying along smoothly…
Statistical Modeling, Causal Inference, and Social Science 2024-09-24
This is Jessica. I hadn’t planned to be down on open science research again so soon, but I seem to keep finding myself presented with messes associated with it. After a 7+ month investigation instigated by a Matters Arising critique by Bak-Coleman and Devezer, Nature Human Behaviour retracted the “feel-good open science story” paper “High replicability of newly discovered social-behavioural findings is achievable” by Protzko et al. From the retraction notice:
The concerns relate to lack of transparency and misstatement of the hypotheses and predictions the reported meta-study was designed to test; lack of preregistration for measures and analyses supporting the titular claim (against statements asserting preregistration in the published article); selection of outcome measures and analyses with knowledge of the data; and incomplete reporting of data and analyses.
This is obviously not a good look for open science. The paper’s authors include the Executive Director of the Center for Open Science, who has consistently advocated for preregistration on the grounds that, without it, authors pass off exploratory hypotheses as confirmatory. Another author is a member of the Data Colada team that has outed others’ questionable research transgressions and helped popularize the idea that selective reporting and HARKing threaten the validity of claimed results in psych.
I once thought I did know all about it
If seeing this paper retracted makes you uncomfortable, I don’t blame you. It makes me uncomfortable too. My views on mainstream open science research and advocacy were much more positive a year ago before I encountered all this.
As full disclosure, late in the investigation I was asked to be a reviewer, probably because I’d shown interest by blogging about it. Initially it was the extreme irony of the situation that made me take notice, but after I started looking through the files myself I felt compelled to post about all that was not adding up. When asked to officially participate in the investigation, I agreed, but with some major hesitation. I knew that to be comfortable weighing in on the question of retraction, I’d want to think through many possible defenses for how the paper presents its points. That would mean spending more time, beyond what I’d already spent going through the OSF to write one of my blog posts, sorting through the paper’s arguments and considering whether they could possibly hold up. None of this is at all connected to my main gig in computer science.
But ultimately I said yes out of a sense of duty, figuring that as an outsider to this community with no real alliances with the open science movement or any of the authors involved, it would be relatively easy for me to be honest.
The final version of the Matters Arising, now published by the journal, summarizes a number of core issues: the lack of justification, given the study design and missing preregistration, for implying a causal relationship or even discussing an association between rigor-enhancing practices and the replicability rate the authors observe; the inconsistencies between the replicability definition and those in the literature; the over-interpretation of the statistical power estimate, etc. It’s hard to get beyond this barrage of points.
Since the rain falls, the wind it blows, and the sun shines
What’s funny, though, is that I somehow still sort of expected this to be a difficult call. Maybe I was susceptible to the tendency to want to give such esteemed authors, several of whom have done work I really respect, the benefit of the doubt. I was obviously aware, going into the investigation, of the lack of preregistration for the main analyses that they claimed to have preregistered. But I tried to keep an open enough mind that I wouldn’t miss any possible value the paper could still have for readers despite that flaw.
Unfortunately, as I re-read the Protzko et al. paper to consider what, if anything, one could learn from their results about the role of rigor-enhancing practices, I quickly found myself unable to resolve a fundamental issue related to how they establish that the replication rate they observe is high in the first place. The reference set of effects they mean when they use terms like “original discoveries” is not consistent throughout the paper, including in their calculations of expected power and replicability, which they use to establish their claim of “high replicability.” Sometimes these terms refer to effects from the pilots and sometimes to effects from the confirmatory studies. Given the way the authors set up their claims, with rigor-enhancing practices characterizing the whole process, they would need those practices to apply to both the confirmatory studies and the pilot studies.
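To make concrete why the reference set matters when judging whether a replication rate is “high,” here’s a minimal sketch of the kind of baseline comparison involved. The numbers (16 effects, 90% per-study power) are made up for illustration, and this is not Protzko et al.’s actual analysis: the point is only that if every targeted effect were real and each replication ran at the assumed power, the observed replication rate would be expected to land near that power more or less by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, hypothetical numbers only (not the paper's design):
k = 16        # number of effects being replicated
power = 0.9   # assumed statistical power of each replication

# If every targeted effect is real and each replication has this power,
# the number of "successful" replications is Binomial(k, power), so the
# expected replication rate equals the assumed power.
sims = rng.binomial(k, power, size=100_000) / k
print(f"expected replication rate: {power:.2f}")
print(f"95% interval across simulated meta-studies: "
      f"({np.quantile(sims, 0.025):.2f}, {np.quantile(sims, 0.975):.2f})")
```

Comparing an observed rate to this kind of power-based expectation only makes sense if the effects used to set the power and the effects being replicated are the same set, which is exactly where the paper’s shifting reference set becomes a problem.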
But the paper text and other materials contradict themselves about how the practices apply across these two sets of studies. For example, I spent some time looking for the pilot preregistrations (which the paper also claims exist), but found only a handful, suggesting that the paper can’t back up its claims about preregistration there either. Given this contradiction between what they say about their design (and the lack of info on the pilots) and the logic they set up to make one of their central points, I didn’t see how the paper could redeem itself, even if we decide to be optimistic about the other issues. Retraction was clearly the right decision. You can read some comments related to what I wrote in my review here.
What I still don’t get is how the authors felt okay about the final product. I encourage you to try reading the paper yourself. Figuring out how to pull an open science win out of the evidence they had required someone to put real effort into massaging the mess of details into a story. It was frustrating, as a reader, to try to match the reported values to the set of effects or processes they came from. The term bait-and-switch came to mind multiple times as I tried to trace the claims back to the data. Reading an academic paper (especially one advocating for the importance of rigor) shouldn’t remind one of witnessing a con, but the more time I spent with the paper, the more I was left with that impression. It’s worth noting that the lack of sufficient detail about the pilots was brought up at length in Tal Yarkoni’s review of the original submission, as well as in Malte Elson’s review for NHB. The authors were made aware of these issues and chose not to be up front about what happened there.
It is true that everyone makes mistakes, and I would bet that most professors or researchers can relate to having been involved in a paper where the story just doesn’t come together the way it needs to, e.g., because along the way you realized the limitations of how you set up the problem for saying much about anything. Sometimes these papers get published anyway, because some subset of the authors convinces themselves the problems aren’t that big. And sometimes, even when one sees the problems, it’s hard to back out for peer-pressure reasons.
But even then, there’s still a difference between finding oneself in such a situation and crowing all over the place about the paper as if it is a piece of work that delivers some valuable truth. What’s puzzled me from the start is that this paper was not only published, it was widely shared by the authors as a kind of victory lap for open science.
Don’t you know that your creator is running out of ideas
So while I came into this whole experience relatively open-minded about open science, my views have been colored less positively by learning about this paper and seeing certain other open science advocates defend it. I personally stopped seeing the value of most behavioral experiments a few years ago, because I could no longer get beyond the chasm between the inferences we want to draw and the processes we are limited to when we design them. But I guess I interpreted this as more of a personal tic. Preregistration, open data and methods, better power analysis, and similar practices might not be enough to make me feel excited about behavioral experiments, but I assumed that the work open science advocates were doing to encourage these practices was doing some good. I hadn’t really considered that open science could be doing harm, beyond maybe encouraging a different set of rigor-signalling games.
This experience has changed my view from “live and let live if people find it helpful” to “this is not helpful,” given that producing evidence to change policy (or logical justifications presented as sufficient for policy without empirical evidence) appears to be a goal of open science research like this. Preregister if you find it helpful. Make your materials open because you should. But don’t expect these practices to transform your results into solid science, and don’t trust people who try to tell you it’s as easy as adopting a few simple rituals. I’m now doubtful that the flurry of research on fixing the so-called replication crisis is truly interested in engaging deeply with concepts like statistical power or replicability. I’m left wondering how many other empirical pro-open science papers are rhetorical feats to “keep up the momentum” regardless of what can actually be concluded from the data.
P.S. On a lighter note related to the title of this post (or not so light if you remember how the quote ends), remember Rocky and Bullwinkle? My dad used to always try to get us to watch re-runs when they came on TV. The other references in the post (also from my dad’s era) are from a Bert Jansch song.