Reverse-engineering priors in coronavirus discourse
Normal Deviate 2020-04-28
Last week we discussed the Santa Clara county study, in which 1.5% of the people tested positive for coronavirus.
The authors of the study performed some statistical adjustments and summarized with a range of 2.5% to 4.2% for the infection rate in the county as a whole, leading to an estimated infection fatality rate of 0.12% to 0.20%. That’s a strong conclusion, because it might be taken to imply that coronavirus is not much more deadly than the flu. As discussed on this blog and elsewhere, there were some problems with the statistical analysis, and these conclusions were not supported by these data alone.
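To see where numbers like these come from, here’s a minimal back-of-the-envelope sketch. The population and death counts below are hypothetical round numbers I’m plugging in for illustration, not the study’s actual inputs; the point is just that the implied infection fatality rate is deaths divided by estimated infections, where estimated infections equal the estimated prevalence times the population.

```python
# Back-of-the-envelope IFR sketch. The population and death counts are
# hypothetical round numbers for illustration, not the study's inputs.
population = 2_000_000   # assumed county population
deaths = 100             # assumed cumulative deaths at the time of the survey

for prevalence in (0.025, 0.042):
    infections = prevalence * population   # estimated infections in the county
    ifr = deaths / infections              # implied infection fatality rate
    print(f"prevalence {prevalence:.1%} -> implied IFR {ifr:.2%}")
```

With made-up inputs like these you land in the same general ballpark as the published range, but the more important point is that the fatality rate estimate is a simple ratio, so it inherits all the uncertainty in the prevalence estimate sitting in the denominator.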
Here’s what I said to reporter Michael Schulson:
It’s not like I’m saying they’re wrong, and someone else is saying they’re right. . . . I think their conclusions were a bit strong . . . they have reasons for believing what they believe beyond what’s in this paper. Like they have their own scientific understanding of the world, and so basically they’re coming into it saying, ‘Hey, we believe this. We think that this disease, this virus has a very low infection fatality rate,’ and then they gather data and the data are consistent with that. If you have reason to believe that story, these data support it. If you don’t have such a good reason to believe that story, you can see that the data are kind of ambiguous.
A commenter asked how I could have said such a thing, as it seemed in contradiction to my earlier statement that the authors of that paper had “screwed up” on the statistics.
But it’s not a contradiction at all.
Let me explain. The data presented in that Santa Clara study were consistent with underlying infection rates in the target population of anywhere between 0% and about 5%. The 0% would correspond to a 1.5% false positive rate of the tests (which is consistent with the data presented in that paper), and the 5% would correspond to a 0% false positive rate plus some random luck plus some adjustments. I can’t be sure about the 5% because I don’t have all the details on the adjustments; also, the study has other data, such as reported symptoms, that could be informative here, but those data have not been released, even in summary form, so I can’t really do anything with them.
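To make that 0%-to-5% point concrete, here’s a minimal sketch of the arithmetic, assuming a simple mixture model for the raw positive rate. The sensitivity and specificity values are assumptions I’m making for illustration, not the test kit’s validated performance.

```python
# Observed positive rate as a mixture of true positives and false positives:
#   observed = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
# The sensitivity and specificity values used below are illustrative assumptions.
def observed_positive_rate(prevalence, sensitivity, specificity):
    return prevalence * sensitivity + (1 - prevalence) * (1 - specificity)

# With a 1.5% false positive rate (specificity 98.5%), zero true prevalence
# already produces a 1.5% raw positive rate:
print(observed_positive_rate(0.0, sensitivity=0.80, specificity=0.985))   # 0.015

# With perfect specificity, the same 1.5% raw rate corresponds to a true
# prevalence of roughly 1.9% (at an assumed 80% sensitivity):
print(observed_positive_rate(0.019, sensitivity=0.80, specificity=1.0))   # ~0.015
```

Getting from that roughly-2% figure up to 5% is where the random luck and the reweighting adjustments come in.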
A range of 0% to 5% . . . OK, we know it’s not 0%, as some people in the county had already had the disease. So I stand by my statement that the study did not offer strong support that the rate was between 2.5% and 4.2%; on the other hand, the data in the study were consistent with those claims.
As I wrote in the comment thread:
It seems clear to me that the authors of that study had reasons for believing their claims, even before the data came in. They viewed their study as confirmation of their existing beliefs. They had good reasons, from their perspective. Their reasons are based on their larger understanding of what’s happening with the coronavirus. They have priors, and what I’m saying is that the data from their recent surveys are consistent with their priors. I think that’s why they came on so strong.
But, as I said, if you don’t have such a good reason to believe that story, you can see that the data are kind of ambiguous. I don’t know enough about the epidemic to have strong priors. So, to me, these surveys are consistent with the authors’ priors, but they’re also consistent with other priors.
It’s a Bayesian thing. Part of Bayesian reasoning is to think like a Bayesian; another part is to assess other people’s conclusions as if they are Bayesians and use this to deduce their priors. I’m not saying that other researchers are Bayesian—indeed I’m not always so Bayesian myself—rather, I’m arguing that looking at inferences from this implicit Bayesian perspective can be helpful, in the same way that economists can look at people’s decisions and deduce their implicit utilities. It’s a Neumann thing: again, you won’t learn people’s “true priors” any more than you’ll learn their “true utilities”—or, for that matter, any more than a test will reveal students’ “true abilities”—but it’s a baseline.
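Here’s a toy illustration of what I mean by reverse-engineering priors; it’s only a sketch, and the flat likelihood is an artificial stand-in for “ambiguous data,” not the model from any of these studies. When the likelihood barely distinguishes the plausible prevalences, the posterior is essentially proportional to the prior, so a confident conclusion is telling you mostly about the prior that went in.

```python
# Toy sketch: with an ambiguous (here, artificially flat) likelihood,
# the posterior is proportional to the prior, so confident conclusions
# reflect confident priors. All numbers are made up for illustration.
import numpy as np

prev_grid = np.linspace(0.0, 0.05, 501)   # candidate prevalences, 0% to 5%
likelihood = np.ones_like(prev_grid)      # "the data can't tell these apart"

agnostic_prior = np.ones_like(prev_grid)                            # flat prior
confident_prior = np.exp(-0.5 * ((prev_grid - 0.03) / 0.005) ** 2)  # centered near 3%

for name, prior in [("agnostic prior", agnostic_prior),
                    ("confident prior", confident_prior)]:
    posterior = prior * likelihood
    posterior /= posterior.sum()
    cdf = np.cumsum(posterior)
    lo, hi = np.interp([0.025, 0.975], cdf, prev_grid)
    print(f"{name}: central 95% of posterior runs roughly {lo:.1%} to {hi:.1%}")
```

Same data, different priors, different conclusions. That’s all I’m claiming when I say these surveys are consistent with the authors’ priors but also with other priors.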
One problem here is that people are expecting too much from one study. Ultimately, we will learn by replication.
There was a replication of sorts in Los Angeles county, where a team including some of the same researchers reported an estimate that 2.8% to 5.6% of the population had antibodies to the virus. There was also a study in Miami reporting 6%.
On the other hand, if there are a lot of false positives, then we’d expect these to be overestimates. We can’t really know right now. For the Los Angeles and Miami studies, all we have are press releases. This doesn’t mean the results are wrong; it’s just hard to know.
There was a study reporting 20% with antibodies in New York City. Nobody thinks that is all or even mostly false positives; obviously a lot of people in the city have been exposed to the virus. There are sampling issues, so maybe the underlying rate was really only 10% or 15% at the time the study was done . . . we can’t really know!
We need better data, followups, open science, coordination, etc. I’m glad that people are gathering data and doing studies, and I’m glad that people are pointing out mistakes in studies, which can allow us all to do better.
The discussion can be frustrating because it becomes politicized and people take sides; see discussion here, for example. I regret saying that the authors of the Santa Clara study “owe us all an apology.” I stand by my reactions to that paper, but I don’t think that politicization is good. In an ideal world, I would not have made that inflammatory statement and the authors would’ve updated their paper to include more information and recognize the ambiguity of their results.