No, this paper on strip clubs and sex crimes was never gonna get retracted. Also, a reminder of the importance of data quality, and a reflection on why researchers often think it’s just fine to publish papers using bad data under the mistaken belief that these analyses are “conservative” or “attenuated” or something like that.
Statistical Modeling, Causal Inference, and Social Science 2023-10-05
Brandon Del Pozo writes:
Born in Bensonhurst, Brooklyn in the 1970s, I came to public health research by way of 23 years as a police officer, including 19 years in the NYPD and four as a chief of police in Vermont. Even more tortuously, my doctoral training was in philosophy at the CUNY Graduate Center.
I am writing at the advice of colleagues because I remain extraordinarily vexed by a paper that came out in 2021. It purports to measure the effects of opening strip clubs on sex crimes in NYC at the precinct level, and finds substantial reductions within a week of opening each club. The problem is that the paper is implausible from the outset because it uses completely inappropriate data that anyone familiar with the phenomena would find preposterous. My colleagues and I, who were custodians of the data and participants in the processes under study when we were police officers, wrote a very detailed critique of the paper and called for its retraction. Beyond our own assertions, we contacted state agencies who went on the record about the problems with the data as well.
For their part, the authors and editors have been remarkably dismissive of our concerns. They said, principally, that we are making too big a deal out of the measures being imprecise and a little noisy. But we are saying something different: the study has no construct validity because it is impossible to measure the actual phenomena under study using its data.
Here is our critique, which will soon be out in Police Practice and Research. Here is the letter from the journal editors, and here is a link to some coverage in Retraction Watch. I guess my main problem is the extent to which this type of problem was missed or ignored in the peer review process, and why it is being so casually dismissed now. Is it a matter of economists circling their wagons?
My reply:
1. Your criticisms seem sensible to me. I also have further concerns with the data (or maybe you pointed these out in your article and I did not notice), in particular the distribution of data in Figure 1 of the original article. Most weeks there seem to be approximately 20 sex crime stops (which they misleadingly label as “sex crimes”), but then there’s one week with nearly 200? This makes me wonder what is going on with these data.
2. I see from the Retraction Watch article that one of the authors responded, “As far as I am concerned, a serious (scientifically sound) confutation of the original thesis has not been given yet.” This raises the interesting question of burden of proof. Before the article is accepted for publication, it is the authors’ job to convincingly justify their claim. After publication, the author is saying that the burden is on the critic (i.e., you). To put it another way: had your comment been in a pre-publication referee report, it should’ve been enough to make the editors reject the paper or at least require more from the authors. But post-publication is another story, at least according to current scientific conventions.
3. From a methodological standpoint, the authors follow the very standard approach of doing an analysis, finding something, then performing a bunch of auxiliary analyses–robustness checks–to rule out alternative explanations. I am skeptical of robustness checks; see also here. In some way, the situation is kind of hopeless, in that, as researchers, we are trained to respond to questions and criticism by trying our hardest to preserve our original conclusions.
4. One thing I’ve noticed in a lot of social science research is a casual attitude toward measurement. See here for the general point, and over the years we’ve discussed lots of examples, such as arm circumference being used as a proxy for upper-body strength (we call that the “fat arms” study) and a series of papers characterizing days 6-14 of the menstrual cycle as the days of peak fertility, even though the days of peak fertility vary a lot from woman to woman with a consensus summary being days 10-17. The short version of the problem here, especially in econometrics, is that there’s a general understanding that if you use bad measurements, it should attenuate (that is, pull toward zero) your estimated effect sizes; hence, if someone points out a measurement problem, a common reaction is to think that it’s no big deal because if the measurements are off, that just led to “conservative” estimates. Eric Loken and I wrote this article once to explain the point, but the message has mostly not been received.
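To make that concrete, here's a little simulation sketch of the point Eric Loken and I made, with toy numbers I just made up (nothing here comes from the strip-club paper or from Del Pozo's critique): a small true effect, a predictor measured with a lot of error, and a filter that keeps only the statistically significant estimates.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 50               # observations per simulated study
    true_effect = 0.15   # small true effect of x on y
    noise_sd = 0.8       # sd of measurement error added to x
    n_sims = 10_000

    all_estimates = []   # every estimate based on the noisy predictor
    sig_estimates = []   # only the estimates that reach p < 0.05

    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = true_effect * x + rng.normal(size=n)
        x_noisy = x + rng.normal(scale=noise_sd, size=n)  # badly measured predictor

        # least-squares slope of y on the noisy predictor, and its standard error
        sxx = np.sum((x_noisy - x_noisy.mean()) ** 2)
        slope = np.sum((x_noisy - x_noisy.mean()) * (y - y.mean())) / sxx
        resid = y - y.mean() - slope * (x_noisy - x_noisy.mean())
        se = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)

        all_estimates.append(slope)
        if abs(slope / se) > 1.96:       # crude statistical-significance filter
            sig_estimates.append(slope)

    print("true effect:                    ", true_effect)
    print("mean estimate, all studies:     ", round(np.mean(all_estimates), 3))  # attenuated toward zero
    print("mean estimate, significant only:", round(np.mean(sig_estimates), 3))  # typically larger than the true effect

Averaged over everything, the estimate is indeed attenuated toward zero, which is where the "conservative" intuition comes from. But conditional on statistical significance, the estimates that survive tend to be larger than the true effect, so pointing to attenuation does not rescue any particular headline result.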
5. Given all the above, I can see how the authors of the original paper would be annoyed. They’re following standard practice, their paper got accepted, and now all of a sudden they’re appearing in Retraction Watch!
6. Separate from all the above, there’s no way that paper was ever going to be retracted. The problem is that journals and scholars treat retraction as a punishment of the authors, not as a correction of the scholarly literature. It’s pretty much impossible to get an involuntary retraction without there being some belief that there has been wrongdoing. See discussion here. In practice, a fatal error in a paper is not enough to force retraction.
7. In summary, no, I don’t think it’s “economists circling their wagons.” I think this is a mix of several factors: a high bar for post-publication review, a general unconcern with measurement validity and reliability, a trust in robustness checks, and the fact that retraction was never a serious option. Given that the authors of the original paper were not going to issue a correction on their own, the best outcome for you was to either publish a response in the original journal (which would’ve been accompanied by a rebuttal from the original authors) or to publish in a different journal, which is what happened. Beyond all this, the discussion quickly gets technical. I’ve done some work on stop-and-frisk data myself and I have decades of experience reading social science papers, but even I was getting confused by all the moving parts, and indeed I could well imagine being convinced by someone on the other side that your critiques were irrelevant. The point is that the journal editors are not going to feel comfortable making that judgment, any more than I would be.
Del Pozo responded by clarifying some points:
Regarding the data with outliers in my point 1 above, Del Pozo writes, “My guess is that this was a week when there was an intense search for a wanted pattern rape suspect. Many more people than the average of 20 per week were stopped by police, and at least 179 of them were innocent. We discuss this in our reply; not only do these reports not record crimes in nearly all cases, but several reports may reflect police stops of innocent people in the search for one wanted suspect. It is impossible to measure crime with stop reports.”
Regarding the issue of pre-publication and post-publication review in my point 2 above, Del Pozo writes, “We asked the journal to release the anonymized peer reviews to see if anyone had at least taken up this problem during review. We offered to retract all of our own work and issue a written apology if someone had done basic due diligence on the matter of measurement during peer review. They never acknowledged or responded to our request. We also wrote that it is not good science when reviewers miss glaring problems and then other researchers have to upend their own research agenda to spend time correcting the scholarly record in the face of stubborn resistance that seems more about pride than science. None of this will get us a good publication, a grant, or tenure, after all. I promise we were much more tactful and diplomatic than that, but that was the gist. We are police researchers, not the research police.”
To paraphrase Thomas Basbøll, they are not the research police because there is no such thing as the research police.
Regarding my point 3 on the lure of robustness checks and their problems, Del Pozo writes, “The first author of the publication was defensive and dismissive when we were all on a Zoom together. It was nothing personal, but an Italian living in Spain was telling four US police officers, three of whom were in the NYPD, that he, not us, better understood the use and limits of NYPD and NYC administrative data and the process of gaining the approvals to open a strip club. The robustness checks all still used opening dates based on registration dates, which do not correspond to actual opening dates in any way remotely precise enough to allow for a study of effects within a week of registration. Any analysis with integrity would have to exclude all of the data for the independent variable.”
Regarding my point 4 on researchers’ seemingly strong statistical justifications for going with bad measurements, Del Pozo writes, “Yes, the authors literally said that their measurement errors at T=0 weren’t a problem because the possibility of attenuation made it more likely that their rejection of the null was actually based on a conservative estimate. But this is the point: the data cannot possibly measure what they need it to, in seeking to reject the null. It measures changes in encounters with innocent people after someone has let New York State know that they plan to open a business in a few months, and purports to say that this shows sex crimes go down the week after a person opens a sex club. I would feel fraudulent if I knew this about my research and allowed people to cite it as knowledge.”
Regarding my point 6 that just about nothing ever gets involuntarily retracted without a finding of research misconduct, Del Pozo points to an “exception that proves the rule: a retraction for the inadvertent pooling of heterogeneous results in a meta-analysis that was missed during peer review, and nothing more.”
Regarding my conclusions in point 7 above, Del Pozo writes, “I was thinking of submitting a formal replication to the journal that began with examining the model, determining there were fatal measurement errors, then excluding all inappropriate data, i.e., all the data for the independent variable and 96% of the data for the dependent variable, thereby yielding no results, and preventing rejection of the null. Voilà, a replication. I would be so curious to see a reviewer in the position of having to defend the inclusion of inappropriate data in a replication. The problem, of course, is that replications are normatively structured to assume the measurements are sound, and if anything you keep them all and introduce a previously omitted variable or something. I would be transgressing norms with such a replication. I presume it would be desk rejected.”
Yup, I think such a replication would be rejected for two reasons. First, journals want to publish new stuff, not replications. Second, they’d see it as a criticism of a paper they’d published, and journals usually don’t like that either.