Workflow and the role of hypothesis-free data analysis
Statistical Modeling, Causal Inference, and Social Science 2021-07-19
In our discussion a couple days ago on the role of hypotheses in science, Lakeland wrote:
Even “this data is relevant to the question we’re studying” is already a hypothesis. There’s no such thing as hypothesis free data analysis.
I’ve sometimes said similar things, in that I like to interpret exploratory graphics as model checks, where the model being checked might be implicit; see for example this recent paper with Jessica Hullman.
But, thinking about this more, I wouldn’t quite go so far as Lakeland. I’m thinking there’s a connection between his point and the idea of workflow, or performing multiple analyses on data. For example: I just went on Baby Name Voyager and started typing in names. This was as close to hypothesis-free data analysis as you can get. But after I saw a few patterns, I started to form hypotheses. For example, I typed in Stephanie and saw how the name frequency has dropped so fast during the past twenty years. Then I had a hypothesis: could it be alternative spellings? So I tried Stefany etc. Then I got to wondering about Stephen. That wasn’t a hypothesis, exactly, more of a direction to look. I had a meta-hypothesis that I might learn something by looking at the time trend for Stephen. I saw a big drop since the 1950s. Also for Steven (recall that earlier hypothesis about alternative spellings). And so on.
My point is that a single static data analysis (for example, looking up Stephanie in the Baby Name Voyager) can be motivated by curiosity or a meta-hypothesis that I might learn something interesting, but as I start going through a workflow, performing one analysis after another, hypothesizing is inevitably involved.
I’m thinking now that this is a big deal, connecting some of our statistical thoughts about modeling and model checking and hypotheses with scientific practice and the philosophy of science. Statistical theory and textbooks and computation tend to focus on one model at a time, or one statistical procedure at a time; in the workflow perspective we recognize that we are performing a series of statistical analyses.
It’s hard for me to imagine doing a series of analyses without forming some hypotheses and without thinking of how to refine these hypotheses or adjudicate among alternative theories of the world. One quick data analysis, though, that’s different. I sincerely think I looked at that Stephanie graph out of pure curiosity. As noted above, deciding to look at some data out of curiosity could be said to reflect a meta-hypothesis that something interesting may turn up, but I would not classify that as much of a hypothesis at all. After looking at the graph, though, the decision of what to look at next is definitely hypothesis-informed.
Similarly, I can conduct a survey and ask a bunch of questions without having any hypothesis about how people will respond; I can just think it’s a good idea to gather these data. But I think it would be hard to conduct a follow-up survey without making some hypotheses. (Again, I’m speaking here of scientific or engineering hypotheses, not “hypotheses” in the sense of that horrible statistical theory of “hypothesis testing.”)
So . . . hypothesizing plays a crucial role in statistical workflow, even though I don’t think a hypothesis is necessary to get started.