Causality and Crime: In science as in genre storytelling, the thrill of the unexpected can only come with reference to (and in confounding) some preexisting norm.
Statistical Modeling, Causal Inference, and Social Science 2025-04-07
In his final book, Perplexing Plots, the late David Bordwell wrote:
It’s not accidental that mystery stories are drawn to tricky shufflings of viewpoint or chronology. A plotline built on a detective’s present-time inquiry into past events helps us understand when the order of events is rearranged or character perspective changes.
This reminds me of two things related to statistics:
1. From my article with Jessica, “Is Your Chart a Detective Story? Or a Police Report?”:
Every data visualization is a story, a plot to be unraveled—but some are more approachable than others. Modern statistical displays of data—grids of scatterplots for inspecting correlations, for example—succeed by being transparent and allowing trends in the data to stand out. In contrast, classic data visualizations often succeed, paradoxically, by being a bit opaque: a puzzle that a reader figures out. . . .
In science we are delighted by unexpected brilliance, which we immediately try to systematize. The same goes for visualization: When we see a new and revelatory graph, we want to take it apart and see how it works. . . .
We can liken this experience to narrative, a lens through which many great (and lesser) works of art have been interpreted. Narrative involves some interplay between plot and perspective, events and interpretation, storyline and characters. Similarly, the practice of science can be viewed as the interplay between data and models. Data are the facts. Models are the characters whose perspectives and assumptions shape what we take away from the story. At the simplest level, the choice of how to visualize data structures the viewer’s experience of those data by promoting certain comparisons over others. It’s a character choice, a choice of model. . . .
Much has been written about how different forms of narrative involve the reader in different ways, from the relatively passive engagement of viewers of a film, to the more active involvement of those following a serial television drama, to the experience of people reading novels who must in a sense create entire movies in their heads. Data visualizations can fall in different places along this continuum. The stories told by some are so strong and clear that they require little from the viewer. Others are far more demanding. One could draw an analogy to works of art that are more or less accessible to the audience—but with the difference that hard-to-follow art is often intentionally ambiguous, whereas challenging visualizations are meant to be understood. In that sense, visualizations are more like video games than art or music. They invoke a trial-and-error experience reminiscent of the “active learning” approaches studied by educational psychologists.
As with video games, it is often the more unconventional visualizations that are the most appealing ones, even to broad audiences. That which is not familiar is more challenging; and aesthetic choices, like the use of pleasing shapes and symmetry, can help entice the viewer to try and solve the puzzle. . . .
What is exciting and unconventional is also a function of our expectations. Music is said to be compelling to the extent that it balances expectation and surprise: A note is interesting when it catches us off-guard, but then it should also make sense within the larger pattern of the piece as it develops. The same is true for storytelling: The thrill of the unexpected can only come with reference to (and in confounding) some preexisting norm.
Beyond addressing the pleasures of difficulty in narrative, this seems closely related to another of Bordwell’s points: genre fiction can be highly experimental in form, and it makes those innovations accessible by placing them in a stylized context that is comfortable to readers or viewers.
2. The forward logic of the data generation process and the reverse logic of inference. In a statistical “generative model” or “directed acyclic graph,” there is a logical order: decisions and outcomes happen in time and can influence what comes in the future. In statistical learning, we start with the data and go backward to make inference about parameters that have already been generated and forward to make inference about predictive quantities. When we fit a model and apply it to the future, we’re going back and forth in time.
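Here is a minimal sketch of that back-and-forth, not taken from any of the articles mentioned here. It assumes the simplest possible setup: a single parameter theta drawn from a normal prior, and data drawn from a normal distribution centered at theta with known noise scale. The function names (simulate_forward, infer_backward, predict_forward) and all default values are made up for illustration.

```python
import random


def simulate_forward(rng, n, prior_mean=0.0, prior_sd=1.0, noise_sd=1.0):
    """Forward logic: the parameter is generated first, then the data."""
    theta = rng.gauss(prior_mean, prior_sd)
    y = [rng.gauss(theta, noise_sd) for _ in range(n)]
    return theta, y


def infer_backward(y, prior_mean=0.0, prior_sd=1.0, noise_sd=1.0):
    """Reverse logic: start from the data and reason back to theta.

    Conjugate normal-normal update with known noise_sd (an assumption
    of this sketch); returns the posterior mean and sd of theta.
    """
    prior_prec = 1.0 / prior_sd**2
    data_prec = len(y) / noise_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_mean * prior_prec + sum(y) / noise_sd**2)
    return post_mean, post_var**0.5


def predict_forward(post_mean, post_sd, noise_sd=1.0):
    """Forward again: predictive mean and sd for one new observation."""
    return post_mean, (post_sd**2 + noise_sd**2) ** 0.5


if __name__ == "__main__":
    rng = random.Random(1)
    theta, y = simulate_forward(rng, n=50)            # past: theta, then y
    post_mean, post_sd = infer_backward(y)            # present: y back to theta
    pred_mean, pred_sd = predict_forward(post_mean, post_sd)  # future: new y
    print(f"true theta {theta:.2f}, posterior {post_mean:.2f} +/- {post_sd:.2f}")
    print(f"prediction for a new observation {pred_mean:.2f} +/- {pred_sd:.2f}")
```

The order of the three calls is the point: the simulation runs in the direction of time, the inference runs against it, and the prediction runs with it again. The predictive sd is larger than the posterior sd because it combines uncertainty about theta with the noise in a new observation.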
I think I’ve written something on the logic of data generation and the logic of inference in statistics, but no amount of searching turns anything up. The closest is my article with Guido, “Why ask why? Forward causal inference and reverse causal questions,” which also appears in Regression and Other Stories as section 21.5, “Causes of effects and effects of causes.”