Two spans of the bridge of inference
Statistical Modeling, Causal Inference, and Social Science 2024-11-08
This is Jessica. Larry Hedges relayed a quote to me recently that I thought others here might appreciate. It appears in an old Annals of Mathematical Statistics paper by Cornfield and Tukey:
In almost any practical situation where analytical statistics is applied, the inference from the observations to the conclusion has two parts, only the first of which is statistical. A genetic experiment on Drosophila will usually involve flies of a certain race of a certain species. The statistically based conclusions cannot extend beyond this race, yet the geneticist will usually, and often wisely, extend the conclusion to (a) the whole species, (b) all Drosophila, or (c) a larger group of insects. This wider extension may be implicit or explicit, but it is almost always present. If we take the simile of the bridge crossing a river by way of an island, there is a statistical span from the near bank to the island, and a subject-matter span from the island to the far bank. Both are important. By modifying the observation program and the corresponding analysis of the data, the island may be moved nearer to or farther from the distant bank, and the statistical span may be made stronger or weaker. In doing this it is easy to forget the second span, which usually can only be strengthened by improving the science or art on which it depends. Yet a balanced understanding of, and choice among, the statistical possibilities requires constant attention to the second span. It may often be worth while to move the island nearer to the distant bank, at the cost of weakening the statistical span, particularly when the subject-matter span is weak.
The example is about generalization from experimental evidence, where we often fixate on removing threats to the statistical inference while leaving the additional “work” required to apply the results to policy informal and outside the bounds of the research. But more broadly the quote is a nice metaphor for the inevitable limitations of statistical methods for getting us all the way there when solving real-world problems. The most interesting part of statistics for me has often been trying to make sense of what is happening at the edges of the technical solutions, where some as-yet-unformalized judgment has to come in and make it work.
Sometimes it’s about understanding how people are using estimates or predictions from statistical models, where, for example, the interfaces we provide to model outputs can shape what researchers conclude or what policies are enacted. Sometimes it’s about how we conceive of our goals going into modeling, like when we’re designing studies and have to pull effect size estimates from some foggy internal model of plausible effects. Often it’s about assessing the extent to which assumptions hold in the real-world settings where models are applied. My interest in theoretical work on calibration for decision-making, for example, is partly kept up by my hesitance to accept certain assumptions as realistic descriptions of how predictions are used in practice.
Sometimes it’s about what happens in between the deployments of some algorithmic solution. I’m reminded of a talk Susan Murphy gave at a workshop on individualized prediction that Ben Recht organized last summer. She presented a number of reflections on her work in reinforcement learning for health care, where they are deploying online learning algorithms in apps to learn personalized policies for nudging people at risk of coronary disease to exercise, or people at risk of dental disease to brush their teeth. The comments that stuck with me were about the iterative learning that happens between experiments, where there’s an inevitable “discovery” phase that involves pooling data across individuals and trying to identify themes that can help them tweak the algorithm or better initialize it for the start of the next round. In other words, a lot of the important learning remains outside of the formalized algorithmic loop.
When researchers identify the stuff happening at the edges, and start taking it seriously, the results can be big. Where would research or development on topics like data visualization and interactive data analysis be without Tukey’s vision of exploratory data analysis to signal their significance? Andrew’s philosophy of model checking and his and others’ work on workflow also come to mind, as well as work by Beth Tipton and others on generalizability and taking heterogeneity seriously in behavioral research. There are many great examples. Sometimes entire new fields spring up in acknowledgement of the overlooked second span. A more recent example is research on algorithmic fairness and bias, which arose largely from the observation that the traditional machine learning pipeline, where performance is optimized in aggregate over a population, leaves a big gap when it comes to applying models confidently in practice in fields like medicine or law, where we care about doing right by individuals.
Still, it seems that there’s often hesitance to “break the frame” implied by conventions in highly technical fields. Maybe researchers sense that the same tools won’t necessarily solve the problems of the second span. Or that they won’t garner the same respect. Or they’re so busy following the technical train that they don’t get around to looking beyond its trajectory. But that’s ok. If nothing else, less competition for researchers like me!