What do I think of this Bayesian analysis of the origins of covid?

Statistical Modeling, Causal Inference, and Social Science 2025-02-04

A colleague links to this recent article, “A Bayesian Assessment of the Origins of COVID-19 using Spatiotemporal and Zoonotic Data,” written by economist Andrew Levin, and asks, “What do you think of this analysis? Is it sound? Does it mean a sensible person should be very convinced it was a lab leak?”

Here’s the abstract of the paper in question:

This paper uses Bayesian methods in conjunction with spatiotemporal and zoonotic data to evaluate the odds ratio for two hypotheses regarding the origin of the COVID-19 pandemic, namely, an accidental laboratory leak of a chimera virus or the transmission of a natural virus from an infected wildlife mammal. The overall Bayes factor is decomposed into 4 components: (1) the odds that the outbreak would occur in the People’s Republic of China (PRC); (2) the odds that the outbreak would occur in Wuhan, conditional on its location in PRC; (3) the odds of observing the spatiotemporal pattern of confirmed COVID-19 cases with no known link to the specific wholesale market where wildlife mammals were being sold, conditional on the outbreak taking place in Wuhan; and (4) the odds of observing the spatiotemporal pattern of confirmed vendor cases at that market, conditional on the outbreak taking place in Wuhan. These four conditional Bayes factors are estimated as 2.3:1, 20:1, 27:1, and 12:1, respectively, and hence the overall odds ratio is 14,900:1, indicating overwhelming evidence in favor of the hypothesis that the pandemic resulted from an accidental lab leak. This conclusion is robust to alternative specifications of the detailed statistical analysis.
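
As a quick check on the arithmetic in the abstract, here is a minimal Python sketch that multiplies the four reported Bayes factors and converts the result into a posterior probability. The function name `posterior_probability` and the prior odds in the loop are hypothetical values chosen for illustration; they do not come from the paper. The sketch takes the four factors at face value and assumes, as the abstract does, that they can simply be multiplied.

```python
from math import prod

# Component Bayes factors (lab leak : natural origin), as reported in the abstract.
bayes_factors = [2.3, 20, 27, 12]

# Overall Bayes factor, assuming the components combine by multiplication.
overall = prod(bayes_factors)


def posterior_probability(prior_odds, bayes_factor):
    """Convert prior odds and a Bayes factor into a posterior probability."""
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


print(round(overall))  # 14904, which the abstract rounds to 14,900

# Hypothetical prior odds, for illustration only (not taken from the paper).
for prior_odds in (1.0, 0.01, 0.0001):
    print(prior_odds, posterior_probability(prior_odds, overall))
```

The product only reproduces the headline number. It says nothing about whether each component likelihood is well specified, which is the harder question.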

My answer to my colleague’s question is: I don’t know. I corresponded with the author of that paper as well as the author of the paper linked here. It’s hard to compute Bayesian probabilities for this problem, not so much because of the priors but because of the likelihood, which is the probability of the data given the model. One problem is the selection of what is considered to be data; the other is that the model (“lab leak” or “wet-market leak” or whatever) is not clearly specified. In statistics jargon, these are “composite hypotheses.” This is not a criticism of this particular paper per se; it’s just a general difficulty with this sort of analysis. It’s not clear to me that Bayesian inference is the right way to attack this sort of problem. But I’ve been intimidated by the technical biological details in all these analyses, so I haven’t looked at them personally.
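
To spell out the "composite hypotheses" point in standard notation (the symbols here are introduced for illustration and are not taken from either paper): write $y$ for whatever is counted as data, $H$ for a hypothesis such as "lab leak" or "natural origin," and $\theta$ for the unspecified details of the scenario under $H$. The likelihood that enters the Bayes factor is then an average over those details:

```latex
\mathrm{BF}
  = \frac{p(y \mid H_{\text{lab}})}{p(y \mid H_{\text{nat}})},
\qquad
p(y \mid H) = \int p(y \mid \theta, H)\, p(\theta \mid H)\, d\theta .
```

Under this sketch, the Bayes factor depends both on the choice of $y$ and on the within-hypothesis distribution $p(\theta \mid H)$. These are the two difficulties described above.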