“The Null Hypothesis Screening Fallacy”?
Statistical Modeling, Causal Inference, and Social Science 2017-07-03
Rick Gerkin writes:
A few months ago you posted your list of blog posts in draft stage and I noticed that “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.” was still on that list. It was about some concerns I had about a paper in Science (http://science.sciencemag.org/content/343/6177/1370). After we talked it through, the authors of that paper eventually added a correction to the article. I think the issues with that paper are a bit deeper (as I published elsewhere: https://elifesciences.org/content/4/e08127), but still, it takes courage to acknowledge the merit of the concerns and write a correction.
Meanwhile, two of the principal investigators from that paper produced a new, exciting data set which was used for a Kaggle-like competition. I won that competition and became a co-first author on a *new* paper in Science (http://science.sciencemag.org/content/355/6327/820).
And this is great! I totally respect them as scientists and think their research is really cool. They made an important mistake in their paper, and since the research question was something I care a lot about, I had to call attention to it. But I always looked forward to moving on from that and working on the other paper with them, and it all worked out.
That is such a great attitude.
Gerkin continues:
Yet another lesson that most scientific disputes are pretty minor, and working together with the people you disagreed with can produce huge returns. The second paper would have been less interesting and important if we hadn’t been working on it together.
What a wonderful story!
Here’s the background. I received the following email from Gerkin a bit over a year ago:
About 3 months ago there was a paper in Science entitled “Humans Can Discriminate More than 1 Trillion Olfactory Stimuli” (http://www.sciencemag.org/content/343/6177/1370). You may have heard about it through normal science channels, or NPR, or the news. The press release was everywhere. It was a big deal because the conclusion that humans can discriminate a trillion odors was unexpected, previous estimates having been in the ~10000 range. Our central concern is the analysis of the data.
The short version: They use a hypothesis-testing framework, not to reject a null hypothesis with type 1 error rate alpha, but essentially to convert raw data (fraction of subjects discriminating correctly) into a more favorable form (fraction of subjects discriminating significantly above chance), which is subsequently used to estimate an intermediate hypothetical variable, which, when plugged into another equation, produces the final point estimate of “number of odors humans can discriminate”. However, small changes in the choice of alpha during this data-conversion step (or, equivalently, small changes in the number of subjects, the number of trials, etc.), by virtue of their highly nonlinear impact on that point estimate, undermine any confidence in that estimate. I’m pretty sure this is a misuse of hypothesis testing. Does this have a name? Gelman’s fallacy?
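To see how sensitive that conversion step is, here is a small illustrative sketch (mine, not taken from the paper or from Gerkin’s code) of the “significantly above chance” cutoff implied by a binomial inverse CDF with chance probability 1/3; the subject counts and alpha values below are invented:

```python
# Illustrative only: how the "significantly above chance" cutoff moves with
# alpha and the number of subjects. Chance performance is p = 1/3, as in the
# design described in the emails; the N and alpha values are invented.
from scipy.stats import binom

def threshold_fraction(n_subjects, alpha, p_chance=1/3):
    """Fraction of correct subjects a pair must exceed to count as
    'significantly above chance' at this alpha (two-sided, as described)."""
    return binom.ppf(1 - alpha / 2, n_subjects, p_chance) / n_subjects

for n_subjects in (20, 26, 35):
    for alpha in (0.01, 0.05, 0.25):
        print(n_subjects, alpha, round(threshold_fraction(n_subjects, alpha), 2))
```

Everything downstream of this cutoff inherits its dependence on alpha and N.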
I replied:
People do use hyp testing as a screen. When this is done, it should be evaluated as such. The p-values themselves are not so important, you just have to consider the screening as a data-based rule and evaluate its statistical properties. Personally, I do not like hyp-test-based screening rules: I think it makes more sense to consider screening as a goal and go from there. As you note, the p-value is a highly nonlinear transformation of the data, with the sharp nonlinearity occurring at a somewhat arbitrary place in the scale. So, in general, I think it can lead to inferences that throw away information. I did not go to the trouble of following your link and reading the original paper, but my usual view is that it would be better to just analyze the raw data (taking the proportions for each person as continuous data and going from there, or maybe fitting a logistic regression or some similar model to the individual responses).
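For concreteness, here is a minimal sketch of the kind of direct analysis described above: a plain logistic regression of individual trial outcomes on the number of differing components. The data frame and column names are invented, and a fuller model would account for the 1/3 chance floor and for subject and mixture effects.

```python
# A toy sketch of the direct approach: model each trial outcome (correct or
# not) as a function of D, the number of differing components, instead of
# thresholding per-pair p-values. The data below are invented.
import pandas as pd
import statsmodels.formula.api as smf

trials = pd.DataFrame({
    "correct": [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "D":       [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10],
})

fit = smf.logit("correct ~ D", data=trials).fit(disp=False)
print(fit.params)  # intercept and slope on D, on the log-odds scale
```

The point of the sketch is only that the raw responses enter the model directly, with no p-value screening step in between.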
Gerkin continued:
The long version:

1) Olfactory stimuli (basically vials of molecular mixtures) differed from each other according to the number of molecules they each had in common (e.g. 7 in common out of 10 total, i.e. 3 differences). All pairs of mixtures for which the stimuli in the pair had D differences were assigned to stimulus group D.

2) For each stimulus pair in a group D, the authors computed the fraction of subjects who could successfully discriminate that pair using smell.

3) For each group D, they then computed the fraction of pairs in D for which that fraction of subjects was “significantly above chance”. By design, chance success had p=1/3, so a pair was “significantly above chance” if the fraction of subjects discriminating it correctly exceeded that given by the binomial inverse CDF with x=(1-alpha/2), p=1/3, N=# of subjects. The choice of alpha (an analysis choice) and N (an experimental design choice) clearly drive the results so far. Let’s denote by F that fraction of pairs exceeding the threshold determined by the inverse CDF.

4) They did a linear regression of F vs D. They defined something called a “limen” (basically a fancy term for a discrimination threshold) and set it equal to the value X solving 0.5 = beta_0 + beta_1*X, where the betas are the regression coefficients.

5) They then plugged X into yet another equation with more parameters, and the result was their estimate of the number of discriminable olfactory stimuli.
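To make steps 1 through 4 concrete, here is a rough simulation-based sketch (mine, not the paper’s analysis or the code in Gerkin’s repository) of how the choice of alpha propagates through the screening step and the regression into the limen. The subject count, number of pairs, and success probabilities are all invented, and step 5’s final formula is not reproduced.

```python
# Sketch of steps 1-4 as described above, run on simulated data to show how
# alpha alone can move the "limen". All numbers below are invented.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_subjects, n_pairs = 26, 20
groups = np.arange(1, 11)  # D = number of differing components, 1..10

# Simulate, once, the number of correct subjects for each pair in each group,
# with per-pair success probabilities that rise with D (purely made up).
correct = {
    D: rng.binomial(n_subjects,
                    np.clip(1/3 + (2/3) * (D / groups.max())
                            * rng.uniform(0.4, 1.0, n_pairs), 0, 1))
    for D in groups
}

def limen(alpha):
    # Screening step: fraction of pairs in each group whose correct count
    # exceeds the binomial cutoff at this alpha (chance p = 1/3).
    cutoff = binom.ppf(1 - alpha / 2, n_subjects, 1/3)
    F = [np.mean(correct[D] > cutoff) for D in groups]
    b1, b0 = np.polyfit(groups, F, 1)  # linear regression of F on D
    return (0.5 - b0) / b1             # D at which the fitted line hits 0.5

for alpha in (0.01, 0.05, 0.25):
    print(f"alpha={alpha}: limen = {limen(alpha):.2f}")
```

Changing alpha (or N) moves the cutoff in step 3, which moves F, which moves the fitted line and hence the limen; step 5 then pushes that shift through a further, highly nonlinear formula.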
My reply: I’ve seen a lot of this sort of thing, over the years. My impression is that people are often doing these convoluted steps, not so much out of a desire to cheat but rather because they have not ever stepped back and tried to consider their larger goals. Or perhaps they don’t have the training to set up a model from scratch.
Here’s Gerkin again:
I think it was one of those cases where an experimentalist talked to a mathematician, and the mathematician had some experience with a vaguely similar problem and suggested a corresponding framework that unfortunately didn’t really apply to the current problem. The kinds of stress tests one would apply to the resulting model to make sure it makes sense of the data never got applied.
And then he continued with his main thread:
If you followed this, you’ve already concluded that their method is unsound even before we get to steps 4 and 5 (which I believe are unsound for unrelated reasons). I also generated figures showing that reasonable alternative choices of all of these variables yield estimates of the number of olfactory stimuli ranging from 10^3 to 10^80. I have Python code implementing this reanalysis and figures available at http://github.com/rgerkin/trillion. But what I am wondering most is, is there a name for what is wrong with that screening procedure? Is there some adage that can be rolled out, or work cited, to illustrate this to the author?
To which I replied:
I don’t have any name for this one, but perhaps one way to frame your point is that the term “discriminate” in this context is not precisely determined. Ultimately the question of whether two odors can be “discriminated” should have some testable definition: that is, not just a data-based procedure that produces an estimate, but some definition of what “discrimination” really means. My guess is that your response is strong enough, but it does seem that if someone estimates “X” as 10^9 or whatever, it would be good to have a definition of what X is.
Gerkin concludes with a plea:
The one thing I would really, really like is for the fallacy I described to have a name, and even better, for it to be listed on your lexicon page. Maybe “The Null Hypothesis Screening Fallacy” or something. Then I could just refer to that link instead of to some 10,000-word explanation of it, every time this comes up in biology (which is all the time).
P.S. Here’s my earlier post on smell statistics.