More on the role of hypotheses in science

Statistical Modeling, Causal Inference, and Social Science 2021-07-17

Just to be clear before going on: when I say “hypotheses,” I’m talking about scientific hypotheses, which can at times be very specific (as in physics, with Maxwell’s equations, relativity theory) but typically have some looseness to them (a biological model of how a particular drug works, a political science model of changes in public opinion, etc.). I’m not talking about the “null hypothesis” discussed in classical statistics textbooks.

Last year we posted a discussion of the article, “A hypothesis is a liability,” by Itai Yanai and Martin Lercher. Yanai and Lercher had written:

There is a hidden cost to having a hypothesis. It arises from the relationship between night science and day science, the two very distinct modes of activity in which scientific ideas are generated and tested, respectively. . . .

My reaction was that I understand that a lot of scientists think of science as being like this, an alternation between inspiration and criticism, exploratory data analysis and confirmatory data analysis, creative “night science” and rigorous “day science.” Indeed, in Bayesian Data Analysis we talk about the separate steps of model building, model fitting, and model checking.

But . . . I didn’t think we should enthrone this separation of modes.

Yanai and Lercher contrast “the expressed goal of testing a specific hypothesis” with the mindset of “exploration, where we look at the data from as many angles as possible.” They continue:

In this mode, we take on a sort of playfulness with the data, comparing everything to everything else. We become explorers, building a map of the data as we start out in one direction, switching directions at crossroads and stumbling into unanticipated regions. Essentially, night science is an attitude that encourages us to explore and speculate. . . .

What’s missing here is a respect for the ways in which hypotheses, models, and theories can help us be more effective explorers.

My point here is not to slam Yanai and Lercher; as noted, my colleagues and I have expressed similar views in our books. It’s just that, the more I think about it, the more I am moving away from a linear or even a cyclical view of scientific or statistical practice. Rather than say, “First night science, then day science,” or even “Alternate night, day, night, day, etc. to refine our science,” I’d prefer to integrate the day and night approaches, with the key link being experimentation.

But my perspective is just one way of looking at things. Another angle comes from Teppo Felin, who writes:

A small, interdisciplinary group of us wrote a response to Yanai & Lercher’s Genome Biology piece. It looks like their original piece has become a bit of a social media hit, with 50k+ downloads and lots of attention (according to Altmetric).

You can find our response [by Teppo Felin, Jan Koenderink, Joachim Krueger, Denis Noble, and George Ellis] here, titled “the data-hypothesis relationship.” We’re extremely surprised at their use (and interpretation) of the gorilla example, as well as the argument more generally. Yanai & Lercher in turn wrote a response to that, here. And then we in turn wrote another response, titled “data bias.”

We’re definitely fighting an uphill battle with our argument. Data is “hot” these days and theory passé. And the audience of Genome Biology is largely computer scientists and geneticists. They absolutely loved the “hidden gorilla” setup of the original Yanai-Lercher article last year.

I’m just shocked that this type of gimmicky, magic-like experimental approach continues to somehow be seen as valid and insightful. It’s just a form of scientific entrapment, to prove human folly and bias. Now we’ve had gorillas hidden in CT scans, in health data, and even in pictures of space. The supposed arguments and conclusions drawn from these studies are just plain wrong.

Here’s what Felin et al. have to say:

Data is critical to science. But data itself is passive and inert. Data is not meaningful until it encounters an active, problem-solving observer. And in science, data gains relevance and becomes data in response to human questions, hypotheses, and theories. . . . Y&L’s arguments suffer from a common bias where data is somehow seen as independent of hypothesis and theory. . . .