“Quality control” (rather than “hypothesis testing” or “inference” or “discovery”) as a better metaphor for the statistical processes of science

Statistical Modeling, Causal Inference, and Social Science 2017-10-29

I’ve been thinking for a while that the default ways in which statisticians think about science—and in which scientists think about statistics—are seriously flawed, sometimes even crippling scientific inquiry in some subfields, in the way that bad philosophy can do.

Here’s what I think are some of the default modes of thought:

Hypothesis testing, in which the purpose of data collection and analysis is to rule out a null hypothesis (typically, zero effect and zero systematic error) that nobody believes in the first place;

Inference, which can work in the context of some well-defined problems (for example, studying trends in public opinion or estimating parameters within an agreed-upon model in pharmacology), but which doesn’t capture the idea of learning from the unexpected;

Discovery, which sounds great but which runs aground when thinking about science as a routine process: can every subfield of science really be having thousands of “discoveries” a year? Even to ask this question seems to cheapen the idea of discovery.

A more appropriate framework, I think, is quality control, an old idea in statistics (dating at least to the 1920s; maybe Steve Stigler can trace the idea back further), but a framework that, for whatever reason, doesn’t appear much in academic statistical writing or in textbooks outside the subfields of industrial statistics and quality engineering. (For example, I don’t know that quality control has come up even once in my own articles and books on statistical methods and applications.)

Why does quality control have such a small place at the statistical table? That’s a topic for another day. Right now I want to draw the connections between quality control and scientific inquiry.

Consider some thread or sub-subfield of science, for example the incumbency advantage (to take a political science example) or embodied cognition (to take a much-discussed example from psychology). Different research groups will publish papers in an area, and each paper is presented as some mix of hypothesis testing, inference, and discovery, with the mix among the three having to do with some combination of researchers’ tastes, journal publication policies, and conventions within the field.

The “replication crisis” (which has been severe with embodied cognition, not so much with incumbency advantage, in part because to replicate an election study you have to wait a few years until sufficient new data have accumulated) can be summarized as:

– Hypotheses that seemed soundly rejected in published papers cannot be rejected in new, preregistered, and purportedly high-powered studies;

– Inferences from different published papers appear to be inconsistent with each other, casting doubt on the entire enterprise;

– Seeming discoveries do not appear in new data, and different published discoveries can even contradict each other.

In a “quality control” framework, we’d think of different studies in a sub-subfield as having many sources of variation. One of the key principles of quality control is to avoid getting faked out by variation—to avoid naive rules such as reward the winner and discard the loser—and instead to analyze and then work to reduce uncontrollable variation.

Applying the ideas of quality control to threads of scientific research, the goal would be to get better measurement, and stronger links between measurement and theory—rather than to give prominence to surprising results and to chase noise. From a quality control perspective, our current system of scientific publication and publicity is perverse: it yields misleading claims, is inefficient, and rewards sloppy work.

The “rewards sloppy work” thing is clear from a simple decision analysis. Suppose you do a study of some effect theta, and your study’s estimate will be centered around theta but with some variance. A good study will have low variance, of course; a bad study will have high variance. But what are the rewards? What gets published is not theta but the estimate. The higher the estimate (or, more generally, the more dramatic the finding), the higher the reward! Of course, if you have a noisy study with high variance, your estimate of theta can also come out low or even negative—but you don’t need to publish those results; instead you can look in your data for something else. The result is an incentive to have noise.
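To make that incentive concrete, here is a minimal simulation sketch. The normal measurement model, the particular numbers, and the “publish only if the estimate clears a threshold” rule are all my assumptions, standing in for the informal argument above:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.1          # true effect, assumed small
n_studies = 100_000  # studies attempted by each lab
threshold = 0.4      # "dramatic enough to publish" cutoff (assumed)

def published_estimates(se):
    """Simulate studies with standard error `se`; keep only the
    estimates that clear the publication threshold."""
    estimates = rng.normal(loc=theta, scale=se, size=n_studies)
    return estimates[estimates > threshold]

for se in (0.1, 1.0):  # careful lab vs. noisy lab
    pub = published_estimates(se)
    print(f"se={se}: {len(pub)} papers published, "
          f"mean published estimate = {pub.mean():.2f}")

# The noisy lab (se=1.0) publishes far more papers and reports much
# larger average effects, even though both labs study the same
# theta = 0.1. That is the incentive to have noise.
```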

The above decision analysis is unrealistically crude—for one thing, your measurements can’t be obviously bad or your paper probably won’t get published, and you are required to present some token such as a p-value to demonstrate that your findings are stable. Unfortunately those tokens can be too cheap to be informative, so a lot of effort goes into making research projects merely look scientific.

But all this is operating under the paradigms of hypothesis testing, inference, and discovery, which, as I’ve argued above, do not provide a good model for the scientific process.

Move now to quality control, where each paper is part of a process, and the existence of too much variation is a sign of trouble. In a quality-control framework, we’re not looking for so-called failed or successful replications; we’re looking at a sequence of published results—or, better still, a sequence of data—in context.
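To give a sense of what looking at a sequence of published results “in context” could mean in practice, here is a minimal sketch. The numbers are hypothetical, and the particular check—Cochran’s Q, a standard heterogeneity statistic—is just one reasonable stand-in for the quality-control question of whether a process shows more variation than its stated uncertainties can explain:

```python
import numpy as np

# Hypothetical sequence of published results from one sub-subfield:
# each study reports an estimate and a standard error.
estimates = np.array([0.42, 0.05, 0.61, -0.10, 0.33, 0.55, 0.02, 0.48])
std_errors = np.array([0.10, 0.12, 0.15, 0.11, 0.10, 0.14, 0.12, 0.13])

# Precision-weighted common mean, as if all studies measured one theta.
weights = 1.0 / std_errors**2
pooled = np.sum(weights * estimates) / np.sum(weights)

# Cochran's Q: how much the studies disagree, relative to their own
# reported standard errors. If the process were "in control" (one
# common theta, honest standard errors), Q would be roughly
# chi-squared with k - 1 degrees of freedom.
Q = np.sum(weights * (estimates - pooled)**2)
df = len(estimates) - 1

print(f"pooled estimate = {pooled:.2f}")
print(f"Q = {Q:.1f} on {df} degrees of freedom")
# Q far above df signals excess, unexplained variation across studies:
# in a quality-control framing, the cue to investigate the process
# (measurement, design, selection) rather than to crown winners.
```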

I was discussing some of this with Ron Kenett and he sent me two papers on quality control:

“Joseph M. Juran, a Perspective on Past Contributions and Future Impact,” by A. Blanton Godfrey and Ron Kenett

“The Quality Trilogy: A Universal Approach to Managing for Quality,” by Joseph Juran

I’ve not read these papers in detail but I suspect that a better understanding of these ideas could help us in all sorts of areas of statistics.
