Stop Cheating—Again

Gödel’s Lost Letter and P=NP 2024-03-26

More than just stopping carelessness

Sholto David is featured in this past Tuesday’s New York Times Science section for his work detecting cheating in medical research papers.

The article is titled “Catching Too Many Errors by Sloppy Scientists” but goes beyond that. It follows a similar exposé in the Guardian the previous week. The Times author, Matt Richtel, has written a flurry of columns on recent medical and social science findings, while the Guardian’s Ian Sample is their science editor.

David, who has a PhD in molecular biology from Newcastle University, has been fascinated by errors in science for many years. What caught my attention was that this is a kind of cheating. We have discussed Ken’s long-term research on cheating in chess. I found that David’s work was like Ken’s, except that the arena shifts from chess to writing science papers. Since these papers often report medical results, the cheating could now have safety implications. Very interesting.

Twixt Carelessness and Cheating

The two newspaper articles seem to use kid gloves compared to the original work by David that they reference. The Guardian one begins in terms of “flaws in scientific papers” being “rectified.” It then recounts that Harvard’s Dana Farber Cancer Institute is “seeking to retract six papers and correct 31 more.” In our view, the eyes should stop at “retract six papers.”

The Times article has the same six-and-31 sentence after linking an earlier Times story under “flawed and manipulated data,” and later speaks of his finding “mistakes, or malfeasance.” The interview portion of the article gets to the crux:

More generally, do the mistakes you find seem innocent to you?

I think we can all understand that sometimes an image may have been copied and pasted by mistake. But there are more complicated examples where images are being rotated, transformed or stretched. Those kinds of examples are less savory. There are other examples […where…] the image has been assembled in a way that you couldn’t reconstruct a sensible experiment from. Sometimes there are cases where people are using Photoshop to more extensively edit images.

David’s original Dana Farber article has a lot of examples. He doesn’t mince words, saying at the outset, “Far worse skeletons than plagiarism lurk in the archives…” It’s snarky but on-point and, frankly, scary. Here is one example, showing large identical regions in scans from two different patients out of six in a study:

This reminds Ken of space photos with multiple images of the same galaxy owing to relativistic gravitational lensing—but here the duplication is much more precise. Elisabeth Bik, a Dutch microbiologist and scientific integrity consultant, has identified thousands of such cases. She co-won the 2021 John Maddox prize for her work.

Telling When Images Are The Same

David used a simple criterion: if two images are essentially the same, whether they appear in two different papers with no cross-reference or as two different experiments in the same paper, then this is cheating.

The trouble of course is: how hard is it to tell if the images are essentially the same? There are a ton of papers and thus a lot of images. What David did was clever—he found a cheap way to tell quickly if images are the same on a mass scale. This is a kind of hashing result that I found cool.

For how he does this, see here. Roughly, he mapped each image to a list of distances between objects in the image. These distances formed a type of hash that he used to encode the image. The hashes were much easier to keep track of than the whole images, so testing whether two images were the same became quick. This was a cheap test and clearly made it easy for him to discover if a paper cheated.

See this for more examples of hashing of images.
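To give the flavor, here is a minimal sketch in Python of one standard perceptual-hashing scheme, the difference hash (dHash). It is not David’s actual procedure, just an illustration of how an image can be reduced to a short fingerprint so that near-duplicates become cheap to find at scale (the function names and the threshold are ours):

from PIL import Image

def dhash(path, hash_size=8):
    # Shrink to (hash_size+1) x hash_size grayscale pixels, then record
    # whether each pixel is brighter than its right-hand neighbor.
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (left > right)
    return bits  # a 64-bit fingerprint for the default hash_size

def hamming(h1, h2):
    # Number of differing bits between two fingerprints.
    return bin(h1 ^ h2).count("1")

# Images whose fingerprints differ in only a few bits (say, at most 5 of 64)
# get flagged as likely duplicates and pulled up for human inspection.

Comparing fingerprints by Hamming distance rather than demanding exact equality is what lets such a scheme catch images that have been slightly cropped, rescaled, or recompressed.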

The real question is, what could cause people to do such copy-and-pasting? When is it not inadvertence? Ken is no more a biologist than I, but he can contribute a temptation he faced that might lend some insight. Over to Ken now.

Oh, Scrubbitt!!—?

In the second half of 2019, I completed an update to my chess model that took four agonizing years. I then turned to measuring in greater detail the quality of my model’s predictions of chess moves, in particular what I had gained from the update. This post from late 2019 explored error models and prediction metrics. Per an update in January 2020, the best model for the true probability {p} of a move in terms of my model’s probability {q} and a fittable error term {\epsilon} turned out to be

\displaystyle  \log(p) = \log(q)(1 \pm \epsilon).

This makes projections of low-probability moves—such as blunders—have smaller absolute error but higher relative error than probable moves. Fitting {\epsilon} against the Elo chess ratings of thousands of players would tell how sharp the projections are and also how sharpness tails off when estimating the world’s best players. For reasons described earlier here, I’ve felt compelled to build my model separately for every chess program (and major “engine” version) employed to test players. My update used the then-current versions 11 of the open-source Stockfish engine and 13.3 of the commercial Komodo engine, plus the earlier versions 7 of Stockfish and 10 of Komodo.
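To make the absolute-versus-relative point concrete, take an illustrative value {\epsilon = 0.1} (our number for exposition, not a fitted one) and compare a likely move with an unlikely one:

\displaystyle  q = 0.5:\quad p \in [e^{-0.693\cdot 1.1},\, e^{-0.693\cdot 0.9}] \approx [0.47,\, 0.54]

\displaystyle  q = 0.01:\quad p \in [e^{-4.605\cdot 1.1},\, e^{-4.605\cdot 0.9}] \approx [0.006,\, 0.016]

The absolute spread shrinks from about 0.07 to under 0.01, while the relative spread grows from about plus-or-minus 7% to roughly -37% to +59%.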

Here are the results I obtained for {\epsilon} against rating, graphed using Andrew Que’s online regression applet:

Cue the ditty from the TV show Sesame Street:

“One of these things is not like the others. One of these things just doesn’t belong.”

The first three diagrams gave a highly consistent picture: my model is uniformly sharp except as ratings go above the 2600 “super-grandmaster” level. Stockfish 11, however, went off the rails—as if to say its rendition of my model became infinitely accurate by that level.

Mixing with another metric in a 90:10 split at least removed the infinities, though it still left the curve turning in the wrong direction:

The pandemic hit, driving me into the world of online chess with a 100–200x higher cheating rate, before I resolved this issue. Because the computer code for using all four engines was the same, and because this plot told me the issue with Stockfish 11 should be fixable, I filled in conservative coefficients along the lines of the other programs. I had an all-important fail-safe: verifying in myriad randomized resampling trials that the test’s outputs over (presumably non-cheating) players conform to the normal distribution. The new test using Stockfish 11 passed this check much the same as the tests developed with the other three chess programs.
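As a rough illustration of what such a resampling fail-safe can look like (a sketch under our own assumptions, not Ken’s actual code), suppose we have a pool of games by presumably clean players and a hypothetical function z_score(sample) that scores one sample under the model. The check is simply that the scores over many random samples behave like a standard normal:

import random
import statistics

def resampling_check(games, z_score, trials=10000, sample_size=200):
    # Score many random samples of presumably clean games and summarize
    # the empirical distribution of the resulting z-scores.
    scores = []
    for _ in range(trials):
        sample = random.sample(games, sample_size)
        scores.append(z_score(sample))  # hypothetical scoring function
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    upper_tail = sum(s > 2.0 for s in scores) / trials
    # For a well-calibrated test these should come out near 0, 1, and 0.023.
    return mean, stdev, upper_tail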

Later in 2020 I adjusted a key hyperparameter to be even more conservative than what I’d used in the first diagram here, and the expected kind of curve emerged in numerically stable fashion at last:

Thenceforth I calibrated the test according to Hoyle. It all ultimately mattered little, because the new cheating test—using whichever engine—turns out to be no more or less sensitive than my regular one. I still report its results as “provisional.”

My point is, I temporarily behaved as though I had mentally copied and pasted in one of the other graphs. It was “OK” because I had other guardrails. I am still largely a one-man band. It would be enlightening to learn why this sort of thing happens in large labs with teams of helpers.

Open Problems

What are the best image hashing methods? Should journals test all images in submitted papers to see if they have ever been used before?