The easygoing relationship between computer scientists and null hypothesis significance testing

Statistical Modeling, Causal Inference, and Social Science 2022-12-13

This is Jessica. As you might expect, as a professor in a computer science department I spend a lot of time around computer scientists. As someone who is probably more outward looking than the average faculty member, there are things I like about CS, like the emphasis on getting the abstractions right and on creating new things rather than just studying the old, but also some dimensions where I know the things I like to think about or find interesting are “out of distribution.” One thing that continues to surprise me is the unquestioning way in which many faculty and students assume significance testing is the standard for scientific data analysis.

Some examples, by area:

ML, systems: We rely too heavily on comparing point estimates to assess performance across different [models/methods/systems]. Let’s fix this with significance testing!

Privacy: Let’s noise up this data, but better make sure they can still do t-tests!

Big data/databases: Let’s do zillions of t-tests simultaneously!

Theory: Let’s design a mechanism to allow for optimal data-driven science, by which we mean NHST!

Visualization: Let’s turn graphs into devices for NHST!

HCI: Let’s make a GUI so people can do NHST without any prior exposure to statistics!

On some level, this is not that surprising. CS majors often take a probability class, but when it comes to stats for data analysis, many don’t go beyond a basic intro stats course. And early non-major stats courses often devote a lot of time to statistical testing. Estimation, exploratory analysis, and anything else that might precede NHST are treated as mostly instrumental. So classical stats becomes synonymous with NHST for many. Of course in CS, prediction gets a lot of attention, but it’s sort of its own beast, treated like an engineering tool that powers everything everywhere.

I expect the average computer scientist sees little reason to care, for example, about what happened when a bunch of psychologists doing small N studies overrelied on NHST. There’s a fallback attitude that issues caused by humans will never be very relevant objects of study because the primary artifacts are code, and that that kind of squishy social science stuff doesn’t belong in CS (though as the joke beloved by people who do deal with the human-computer interface goes, “The three hardest problems in computer science are people and off by one errors”).

And so, I seem to find myself somewhat regularly in a position where I am more or less performing some angst over the problems with NHST in an effort to get students or faculty colleagues to reconsider their unquestioning assumption that significance testing is how scientists analyze data. I can think of many situations where I’ve tried to explain why NHST, as practiced and philosophically, is not so rational. I bring up Andrew-isms like it doesn’t make sense to treat effects as present or absent because there’s always some effect, the question is how big, or what to do about the fact that the difference between significant and not significant is often not significant, etc. Sometimes I can tell I capture their attention for a moment, but rarely do I feel like I’ve really convinced someone there’s a problem that might affect their research. For instance, I get responses that start with phrases like ‘If this is true …’ and I’m pretty sure it isn’t just me getting blown off for being female, because I’ve seen similar reactions when like-minded colleagues point out the issues.
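To make the "difference between significant and not significant" point concrete, here is a minimal sketch in Python. The numbers are illustrative assumptions, not results from any study: two independent effect estimates, 25 and 10, each with a standard error of 10. The helper names `two_sided_p` and `compare` are hypothetical, and the calculation assumes normal z-tests against zero.

```python
import math


def two_sided_p(estimate, se):
    """Two-sided p-value for a normal z-test of `estimate` against zero."""
    z = abs(estimate) / se
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def compare(est_a, se_a, est_b, se_b):
    """Test each estimate separately, then test their difference."""
    diff = est_a - est_b
    # Assumes the two estimates are independent, so their variances add.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return {
        "p_a": two_sided_p(est_a, se_a),
        "p_b": two_sided_p(est_b, se_b),
        "diff": diff,
        "se_diff": se_diff,
        "p_diff": two_sided_p(diff, se_diff),
    }


# Illustrative (made-up) inputs: estimate A = 25, estimate B = 10, both with SE = 10.
result = compare(25.0, 10.0, 10.0, 10.0)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed inputs, A clears the conventional 0.05 threshold and B does not, yet the test of A minus B does not clear it either. Sorting the two into "real" and "not real" therefore claims more than the data can support.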

Repeatedly encountering all this resistance can almost make one feel a little bit guilty, like here your colleagues are obviously having a fulfilling relationship with their chosen interpretation of statistics and yet you’re insisting for some reason on dredging up weird anomalies with seemingly weak links to what they do, like some sort of witch determined to sow doubts in the healthy partnership between computer science and stats. But of course I don’t actually feel guilty because I think they need to hear it, even if I derail a few conversations.

I guess one question is how a computer scientist’s orientation to NHST is qualitatively different than that of someone in another field that uses stats. For example, how does a psychology researcher’s perspective on NHST differ from that of a computer science researcher? I would expect that one thing computer scientists are probably worse at than psychologists is anticipating misuse, again because understanding human behavior has never been perceived as being critical to doing great CS research. I think there’s a genuine belief that NHST is the answer, based on believing that if it’s used properly (which can’t be that hard, right? just don’t fake the data and make sure there’s enough of it), it provides the most direct answer to the question people care about: is this thing real? On the surface, it can seem like a concise solution to a large class of problems, which doesn’t deserve to be conflated with the flaws of some humans who used it for very different seeming purposes.

I also think there’s a genuine confusion about what the alternative would be if one doesn’t use NHST. Sometimes researchers make it explicit that they can’t imagine alternatives (e.g., here), in which case at least the value that someone like me can provide is clearer (giving them examples of alternative ways of expressing findings from an analysis). But, for that to work, I first have to convince them there’s a problem. Maybe the resistance is also partly a function of discrete thinking being built into CS. Advocating against NHST to some computer scientists can certainly feel like trying to convince them that we should replace binary.
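As one possible example of an alternative way of expressing a finding, here is a minimal sketch that reports how big a difference is, with an uncertainty interval, rather than a significant/not significant verdict. Everything in it is an assumption for illustration: the accuracy scores are made up, the function name `bootstrap_diff_interval` is hypothetical, and a percentile bootstrap is only one of several reasonable ways to get an interval (and a rough one with so few runs).

```python
import random
import statistics


def bootstrap_diff_interval(a, b, n_boot=5000, seed=0):
    """Point estimate and 95% percentile-bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    # With n=40, the first and last cut points are the 2.5% and 97.5% quantiles.
    cuts = statistics.quantiles(diffs, n=40)
    point = statistics.fmean(a) - statistics.fmean(b)
    return point, cuts[0], cuts[-1]


# Made-up accuracy scores for two hypothetical models over five training runs each.
model_a = [0.812, 0.805, 0.821, 0.799, 0.816]
model_b = [0.801, 0.808, 0.795, 0.803, 0.790]

point, lower, upper = bootstrap_diff_interval(model_a, model_b)
print(f"estimated accuracy difference: {point:.4f} (95% interval {lower:.4f} to {upper:.4f})")
```

The output is a magnitude plus a range, which leaves the reader to judge whether a difference of that size matters for their application, instead of handing them a yes/no answer.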

On a more positive note, when I realized that much of the stat/science reform discussion hasn’t reached many computer scientists, I started including some background in a CS research class I teach to first year PhDs. I’ve taught it a few times and they seem interested when I present some of the core issues and draw connections to CS research (like we do here). I’m also teaching a graduate seminar course next quarter on explanation and reproducibility in data-driven science where we’ll discuss papers from stats, social science, and ML related to what it means for an explanation of model behavior to be valid and reproducible. Maybe all this will help me figure out how to better target my anti-NHST spiel to CS assumptions.