The 2019 project: How false beliefs in statistical differences still live in social science and journalism today
Statistical Modeling, Causal Inference, and Social Science 2021-04-05
It’s the usual story. PNAS, New York Times, researcher degrees of freedom, story time. Weakliem reports:
[The NYT article] said that a 2016 survey found that “when asked to imagine how much pain white or black patients experienced in hypothetical situations, the medical students and residents insisted that black people felt less pain.” I [Weakliem] was curious about how big the differences were, so I read the paper.
Clicking through to read the research article, which was published in PNAS, I did not find any claim that the medical students and residents insisted that black people felt less pain.
I did see this sentence from the PNAS article: “Specifically, we test whether people—including people with some medical training—believe that black people feel less pain than do white people.” But, as Weakliem discovered, there were no average differences:
Medical students who had a high number of false beliefs rated the white cases as experiencing more pain; medical students who had a low number of false beliefs rated the black cases as experiencing more. High and low were defined relative to the mean, so that implied that medical students with average numbers of false beliefs rated the black and white cases about the same.
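To unpack that: opposite effects in two subgroups can cancel exactly in the overall averages. Here’s a toy calculation in Python with made-up numbers (mine, not the study’s) showing the pattern:

```python
# Hypothetical numbers (not the study's data), just to show how a
# crossover interaction can leave the overall means identical.
ratings = {
    "many false beliefs": {"black": 7.2, "white": 8.0},  # rates white pain higher
    "few false beliefs":  {"black": 8.0, "white": 7.2},  # rates black pain higher
}

# Average over the two (equally sized) groups: the difference cancels.
mean_black = sum(g["black"] for g in ratings.values()) / len(ratings)
mean_white = sum(g["white"] for g in ratings.values()) / len(ratings)
print(mean_black, mean_white)  # 7.6 7.6 -- opposite subgroup effects, no average gap
```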
Weakliem continues:

The authors included their data as a supplement to the article, so I [Weakliem] downloaded it and calculated the means. The average rating for the black cases was 7.622, on a scale of 1-10, while the average rating for the white cases was 7.626—that is, almost identical. The study also asked how the different cases should be treated—135 gave the same recommendation for both of their cases, 40 recommended stronger medication for their white case, and 28 for their black case. Since the total distribution of conditions was the same for the black and white cases, this means that in this sample, treatment recommendations were different for blacks and whites. However, the difference was not statistically significant at conventional levels (p is about .14)—that is, the sample difference could easily have come up by chance.
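Weakliem doesn’t say which test he used, but the numbers are easy to check. A McNemar-style chi-squared on the 68 pairs with different recommendations (a standard choice for paired yes/no outcomes; my choice of test, not necessarily his) lands at a p-value of about .14:

```python
from scipy import stats

# Of the 203 participants, 135 gave the same recommendation for both cases;
# 40 recommended stronger medication for the white case, 28 for the black case.
# Weakliem doesn't name his test; a McNemar-style chi-squared on the 68
# non-tied pairs is one calculation that reproduces his "about .14".
n_white, n_black = 40, 28
chi2 = (n_white - n_black) ** 2 / (n_white + n_black)  # 144/68 ≈ 2.12
p = stats.chi2.sf(chi2, df=1)  # upper tail of chi-squared with 1 df
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # chi2 = 2.12, p = 0.146
```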
He draws the sensible conclusion:

So you could conclude that, in this sample, there is no evidence that medical students rate the pain of blacks and whites differently, but perhaps some evidence that they treat white pain more aggressively. (If you just went by statistical significance, you would accept the hypothesis that they treat hypothetical black and white cases the same, but a more sensible conclusion would be that you should collect more data). The paper, however, didn’t do this. . . .
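To see why “collect more data” beats “accept the null,” it helps to look at the uncertainty rather than just the significance test. A rough 95% interval for the split among the discordant pairs (a back-of-the-envelope calculation of mine, not Weakliem’s) is wide enough to include both no difference and a sizable one:

```python
from math import sqrt

# Back-of-the-envelope interval (my calculation, not Weakliem's): among the
# 68 pairs with different recommendations, what share favored the white case?
n = 40 + 28
p_hat = 40 / n                      # ≈ 0.588
se = sqrt(p_hat * (1 - p_hat) / n)  # ≈ 0.060
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{p_hat:.2f}  (95% interval: {lo:.2f}, {hi:.2f})")  # 0.59  (0.47, 0.71)
# The interval covers 0.5 (no difference) but also values near 0.7, so the
# data can't distinguish "no effect" from a substantively large one.
```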
Hey, this is a big fat researcher degree of freedom! The authors of this paper easily could’ve summarized their results as, “White people no longer believe black people feel less pain than do white people.” That could’ve been the title of the PNAS article. And then the New York Times article could’ve been, “Remember that finding that white people believe that black people feel less pain? It’s no longer the case.”
OK, I guess not, as PNAS would never have published such a paper. The interaction between beliefs about physical differences, beliefs about pain, and attitudes toward pain treatment—that’s what made the paper publishable. Unfortunately, the patterns that were found could be explained by noise—but, no problem, there were enough statistical knobs to turn that the researchers could find statistical significance and declare a win. At that point, maybe they felt that going back and reporting, “No average difference,” would ruin their story.
Weakliem summarizes:
The statement that “the medical students and residents insisted that black people felt less pain” is false: they rated black and white pain as virtually equal. I [Weakliem] don’t blame Villarosa [the author of the NYT article] for that—the way it was written, I could see how someone would interpret the results that way. I don’t really blame the authors either—interaction effects can be confusing. I would blame the journal (PNAS) for (1) not asking the authors to show means for the black and white examples as standard procedure and (2) not getting reviewers who understand interaction effects.
I don’t know if I agree with Weakliem in letting the authors of these articles off the hook. The NYT article did misrepresent the claims in the PNAS article; the PNAS article did set out to test a hypothesis and then never reported the result of that test; so both these articles failed their readers, at least regarding this particular claim. Indeed, the title of the NYT article is, “Myths about physical racial differences were used to justify slavery — and are still believed by doctors today”—a message that is completely changed by reporting that the PNAS study found no average belief in pain differences.
As for PNAS, I think it’s too much to expect they can find reviewers who understand interaction effects—that’s really complicated—and I guess it’s too much to expect that they would turn down an article that fits their political preconceptions. But, jeez, can’t they at least be concerned about data quality? Study 1 was based on 121 participants on Mechanical Turk. Study 2 was based on 418 medical students at a single university. I can see the rationale for Study 2—medical students grow up and become doctors, so we should be concerned about their views regarding medical treatment. But I can’t see how it can be considered scientifically legitimate to take data from 121 Mechanical Turk participants and report them in the abstract of the paper as telling us something about “a substantial number of white laypeople.” You don’t need to understand interaction effects to see the problem here; you just need to stop drinking the causal-identification Kool-Aid (the attitude by which any statistically significant difference is considered to represent some true population effect, as long as it is associated with a randomized treatment assignment, instrumental variable analysis, or regression discontinuity).