Individual probability, model multiplicity, and multicalibration

Statistical Modeling, Causal Inference, and Social Science 2025-03-26

This is Jessica. I’ve been posting recently on questions related to individual probability, i.e., assigning probabilities to individual events, a theme of a course I just wrapped up. For example, previously we talked about how statistical exchangeability–where the joint distribution of a sequence of random variables is unaffected by the order in which they’re observed–is from one perspective all you need to reconcile differences between “groupist” notions of probability (e.g., probability as long-run frequency over a set of similar events) and “individualist” notions of the probability of a single event (e.g., a probability based on some expert’s beliefs about a specific event, conditioned on their prior experience). A fundamental problem with individual probabilities is the reference class problem: to assign a probability to an event that will only happen once, we have to identify some group of events that we believe captures the essential characteristics of the event in question and estimate the probability over that group. But often there will be several equally-appropriate-seeming reference groups that we must choose between. For example, if we have a criminal defendant with a combination of prior arrests and convictions that we have never seen together before, which subset of these features do we use in estimating their probability of committing another crime if released, given that different reference groups result in different conditional probabilities?

The reference class problem is associated with the idea that there is often no unique way to assign a conditional probability to a particular event. In the machine learning literature, predictive multiplicity–the fact that learning problems often admit multiple competing models that perform more or less equally well–has accordingly been accepted as a consequence of the underspecification of individual probabilities. Multiplicity is sometimes described as the Rashomon effect, or through the idea of a Rashomon set, i.e., a set of models that predict equally accurately but make conflicting predictions on some subset of the data space. For example, in Breiman’s well-known paper on the two cultures of statistical modeling, he uses the Rashomon effect to argue that we have to be careful not to draw conclusions about the process that generated some data from a single explanatory model unless we can somehow rule out all the competing models. Multiplicity has been discussed more recently in conjunction with concerns like algorithmic fairness, where the existence of an equally accurate model that assigns some specific person the opposite prediction is considered unsettling.

But is multiplicity due to the underspecification of individual probabilities really a fundamental property of predictive models learned from data? Imagine you have two models that are equally supported by the data (and make predictions close to the true conditional probabilities for the various possible reference groups), but that disagree non-trivially in their predictions. Should we accept that such situations exist and cannot necessarily be resolved?

A new paper by Roth and Tolbert reframes this as “the reference class problem at scale.” More concretely, say you have some distribution over a collection of elements representing different combinations of features; the elements might represent, e.g., different people’s records. Assume some true function F that maps from the features of these records to outcomes. We can then define a set of reference classes as subsets of the elements; for example, if each element records a combination of an age, race, and income variable, we can define a subset for each possible combination of values of these variables.
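To put the setup in symbols (this is my notation, which may differ from the paper’s):

```latex
% Informal notation for the setup above (mine, not necessarily the paper's):
% a distribution over a universe of feature combinations, a true outcome
% function, and a collection of reference classes given by subsets.
\[
x \sim \mathcal{D} \ \text{over a universe } \mathcal{X}, \qquad
F : \mathcal{X} \to [0,1], \qquad
\mathcal{G} = \{G_1, \dots, G_k\}, \quad G_j \subseteq \mathcal{X}.
\]
```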

A model can be considered consistent with a reference class if, when you average the model’s predictions over elements in that reference class, you are within some error epsilon of the true rate over that reference class.
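In the same informal notation, a model p is consistent with a reference class G up to error epsilon when:

```latex
% Consistency of a model p with a reference class G, up to error epsilon
% (a restatement of the verbal definition above, in my notation):
\[
\Bigl|\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, p(x) \mid x \in G \,\right]
\;-\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, F(x) \mid x \in G \,\right] \Bigr|
\;\le\; \epsilon .
\]
```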

One way to think about predictive multiplicity, then, is as a case where we have at least two models that are consistent on all of the reference classes for some error bound epsilon, but that frequently make different predictions. In Roth and Tolbert’s characterization, this means that the probability that their predictions differ by more than some small amount epsilon is greater than epsilon.
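So, sticking with the same informal notation, two models p_1 and p_2 that are each consistent with every reference class exhibit this kind of multiplicity when:

```latex
% Predictive multiplicity between two models p_1 and p_2, as characterized above:
\[
\Pr_{x \sim \mathcal{D}}\bigl[\, \lvert p_1(x) - p_2(x) \rvert > \epsilon \,\bigr] \;>\; \epsilon .
\]
```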

Now, returning to the previous question, should we accept such multiplicity as a fact of life in machine learning, an inevitable symptom of an underspecified learning problem? 

The recent work by Roth and Tolbert, and a more technical version by Roth et al., argues that this kind of multiplicity is always resolvable, at least in theory. The resolution lies in multicalibration, which I’ve discussed previously on the blog. A multicalibrated model is one whose predictions are approximately calibrated on every group in some collection of efficiently identifiable, possibly intersecting groups that are supported by the data.
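One common way to write the approximate multicalibration condition (roughly; formal definitions vary in how they discretize the predictions and handle very small groups) is that for every group G in the collection and every value v the model predicts:

```latex
% Approximate multicalibration: calibration within each group, conditional on
% the model's own prediction. alpha is the calibration tolerance; groups with
% negligible probability mass are typically exempted or weighted by their mass.
\[
\Bigl|\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, F(x) - p(x) \;\middle|\; x \in G,\ p(x) = v \,\right] \Bigr|
\;\le\; \alpha .
\]
```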

Multicalibration can be achieved by algorithms that work via a boosting-like process: whenever we find a reference class with sufficient probability mass on which the model is not consistent, we can use data sampled from the same distribution to produce a new model with lower squared error, and we keep doing this until we arrive at a model that is consistent with all of the reference classes.
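Here’s a minimal sketch of what such a patching loop can look like on a fixed sample with known group memberships. This is my own toy illustration of the idea, not the authors’ algorithm; in the actual algorithms the search for an inconsistent class is an auditing/learning step, and the updates are done carefully (e.g., on fresh samples) to avoid overfitting.

```python
import numpy as np

def patch_until_consistent(preds, y, groups, eps=0.05, max_rounds=1000):
    """Toy boosting-style loop: repeatedly find a reference class where the
    average prediction is off from the observed rate by more than eps, and
    shift predictions on that class toward the observed rate.

    preds:  initial model predictions in [0, 1] (numpy array)
    y:      observed outcomes in [0, 1] (numpy array)
    groups: list of boolean masks over the sample, one per reference class
    """
    preds = np.clip(preds.astype(float), 0.0, 1.0)
    for _ in range(max_rounds):
        worst_gap, worst_mask = 0.0, None
        for mask in groups:
            if not mask.any():
                continue
            gap = y[mask].mean() - preds[mask].mean()  # signed inconsistency
            if abs(gap) > max(abs(worst_gap), eps):
                worst_gap, worst_mask = gap, mask
        if worst_mask is None:
            return preds  # consistent (within eps) with every reference class
        # Shifting by the gap reduces squared error on this sample by roughly
        # the class's mass times gap**2, which is why the loop must terminate.
        preds[worst_mask] = np.clip(preds[worst_mask] + worst_gap, 0.0, 1.0)
    return preds
```

Each patch fixes the class it targets but can knock other, overlapping classes slightly out of consistency; what bounds the number of rounds is that every patch lowers squared error by a non-trivial amount, and squared error can’t go below zero.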

This leads Roth and colleagues to argue that: 

although individual probabilities are unknowable, they are contestable via a computationally and data efficient process that must lead to agreement. Thus we cannot find ourselves in a situation in which we have two equally accurate and unimprovable models that disagree substantially in their predictions—providing an answer to what is sometimes called the predictive or model multiplicity problem. 

In other words, they show that with sufficient data we can improve one or both of the models until we have a multicalibrated model. Put this way, this is not so surprising. The contributions of these papers are of course more nuanced, and include showing how the amount of data needed scales with parameters like the error bound and the probability mass of the considered subsets.

I like how framing predictive multiplicity in terms of models being inconsistent with reference groups nicely connects the underspecification of individual probabilities with model multiplicity, a link that has previously been left vague. My only real complaint (which applies to lots of theory papers related to calibration) is the downplaying of the distinction between theoretically possible and practically possible in some of the statements. E.g., in the quote above, saying “we cannot find ourselves in a situation in which we have two equally accurate and unimprovable models that disagree substantially in their predictions” requires some qualification, because we can absolutely find ourselves with multiple models that disagree and not enough data to reconcile them (e.g., because we’re dealing with rare classes or very large label spaces).

One question this work has me now thinking about is when observing model multiplicity is still useful, even if you could reconcile the models to some extent through cross-calibration. It’s directly relevant to some work that Abhraneel Sarma, Dawei Xie and I have been doing related to Cynthia Rudin’s vision that predictive multiplicity is a good thing, e.g., because it provides room for identifying models that align with human preferences like fairness or monotonicity constraints on the relationship between features and outcome. Our part of this has been to develop an interactive interface (and more generally think through a workflow for incorporating domain expertise in model selection) for Rashomon sets of Generalized Additive Models. Here, the individual models in the set can differ in terms of what features they include, how they are coded, and what their shape functions look like. 

It seems that when a model’s predictions are intended as a decision aid for a human expert, rather than trying to reconcile the multiplicity entirely by cross-calibrating, we may do better to deploy a model that aligns with the domain experts’ preferences. One reason that I’ve brought up before is that trusting the calibration data can seem to contradict the reasons for including humans in a decision process in the first place: often we want them there because we are wary of distributions shifting or of the model predictions unknowingly reflecting artifacts in the training data. Involving the expert in model selection to ensure the deployed model captures key aspects of how human experts believe features relate to the outcome (e.g., having asthma should not result in lower predicted risk) may produce better decisions in practice by providing some robustness against shifts, even if we could have improved calibration according to our prior data. There are also problems that arise when experts have access to some highly performant model that conflicts with their expectations about how features should be used. I’ve heard stories about this happening in medical settings, where it’s discovered after a fancy new model is deployed that it’s basically being ignored.