Calibration “resolves” epistemic uncertainty by giving predictions that are indistinguishable from the true probabilities. Why is this still unsatisfying?

Statistical Modeling, Causal Inference, and Social Science 2024-12-31

This is Jessica. The last day of the year seems like a good time for finishing things up, so I figured it’s time for one last post wrapping up some thoughts on calibration.

As my previous posts got into, calibrated prediction uncertainty is the goal of various posthoc calibration algorithms discussed in machine learning research, which use held-out data to learn transformations of a model’s predicted probabilities such that the transformed probabilities are calibrated on that held-out data. I’ve reflected a bit on what calibration can and can’t give us in terms of assurances for decision-making. Namely, it makes predictions trustworthy for decisions in the restricted sense that a decision-maker who chooses their action purely based on the prediction can’t do better than treating the calibrated predictions as the true probabilities.
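For concreteness, here’s a minimal sketch of one common posthoc method, temperature scaling, which learns a single scalar rescaling of the model’s logits on held-out data. The data and variable names here are made up purely for illustration, and this is just one of the algorithms in that family:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    """Learn a single temperature T > 0 minimizing negative log likelihood on
    held-out data; calibrated predictions are then softmax(logits / T)."""
    def nll(T):
        p = softmax(val_logits, T)
        return -np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# made-up held-out logits (n examples x k classes) and integer labels
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(500, 3)) * 3
val_labels = rng.integers(0, 3, size=500)

T = fit_temperature(val_logits, val_labels)
calibrated_probs = softmax(val_logits, T)
```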

But something I’ve had trouble articulating as clearly as I’d like involves what’s missing (and why) when it comes to what calibration gives us versus a more complete representation of the limits of our knowledge in making some predictions. 

The distinction involves how we express higher order uncertainty. Let’s say we are doing multiclass classification, and fit a model fhat to some labeled data. Our “level 0” prediction fhat(x) contains no uncertainty representation at all; we check it against the ground truth y. Our “level 1” prediction phat(.|x) predicts the conditional distribution over classes; we check it against the empirical distribution that gives a probability p(y|x) for each possible y. Our “level 2” prediction tries to predict the distribution of the conditional distribution over classes, p(p(.|x)), e.g. a Dirichlet distribution that assigns probability to each candidate first order distribution p(.|x), where the candidates can be distinguished by some parameters theta.
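To make the three levels concrete, here’s a toy sketch of what each level’s object looks like for a single input x; the class probabilities and Dirichlet parameters are invented for illustration:

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(1)

# Level 1: a predicted conditional distribution over 3 classes for some input x
p_hat = np.array([0.7, 0.2, 0.1])

# Level 0: collapse the level 1 prediction to a single predicted label
y_hat = int(np.argmax(p_hat))

# Level 2: a distribution over level 1 distributions, e.g. a Dirichlet whose
# (made-up) parameters encode how concentrated our beliefs about p(.|x) are
alpha = np.array([7.0, 2.0, 1.0])
q = dirichlet(alpha)

samples = q.rvs(size=5, random_state=rng)  # plausible first order distributions under q
print(y_hat, p_hat, q.mean(), samples.round(2))
```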

From a Bayesian modeling perspective, it’s natural to think about distributions of distributions. A prior distribution over model parameters implies a distribution over possible data-generating distributions. Upon fitting a model, the posterior predictive distribution summarizes both “aleatoric” uncertainty due to inherent randomness in the generating process and “epistemic” uncertainty stemming from our lack of knowledge of the true parameter values. 
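As a toy illustration of that summary (with simulated draws standing in for an actual posterior), the predictive distribution for a single input is the average of p(y|x, theta) over posterior draws of theta, and its entropy splits into an average conditional entropy (the aleatoric part) plus what’s left over (the epistemic part):

```python
import numpy as np

def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

# hypothetical posterior draws of the class probabilities p(.|x, theta) for one input x
rng = np.random.default_rng(2)
draws = rng.dirichlet([4.0, 2.0, 1.0], size=2000)   # stand-in for p(y|x, theta_s)

posterior_predictive = draws.mean(axis=0)           # average over posterior draws of theta
total = entropy(posterior_predictive)               # total predictive uncertainty
aleatoric = entropy(draws).mean()                   # average entropy of p(y|x, theta)
epistemic = total - aleatoric                       # uncertainty attributable to theta
print(total, aleatoric, epistemic)
```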

In some sense calibration “resolves” epistemic uncertainty by providing point predictions that are indistinguishable from the true probabilities. But if you’re hoping to get a faithful summary of the current state of knowledge, it can seem like something is still missing. In the Bayesian framework, we can collapse our posterior prediction of the outcome y for any particular input x to a point estimate, but we don’t have to. 

Part of the difficulty is that whenever we evaluate performance as loss over some data-generating distribution, having more than a point estimate is unnecessary. This is true even without considering second order uncertainty. If we train a level 0 prediction of the outcome y using the standard loss minimization framework with 0/1 loss, it will learn to predict the mode. And so to the extent that it’s hard to argue one’s way out of loss minimization as a standard for evaluating decisions, it’s hard to motivate faithful expression of epistemic uncertainty.
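A tiny illustration of that point (with an invented true distribution): under 0/1 loss, expected loss is minimized by always predicting the most probable class, so nothing about the rest of the distribution matters for the evaluation.

```python
import numpy as np

p_true = np.array([0.5, 0.3, 0.2])   # hypothetical true p(y|x)

# expected 0/1 loss of deterministically predicting each class
expected_loss = 1.0 - p_true
print(expected_loss)   # [0.5, 0.7, 0.8] -> always predicting the mode (class 0) is optimal
```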

For second order uncertainty, the added complication is that there is no ground truth. We might believe there is some intrinsic value in being able to model uncertainty about the best predictor, but how do we formalize this given that there’s no ground truth against which to check our second order predictions? We can’t learn by drawing samples from the distribution that assigns probability to different first order distributions p(.|x), because technically there is no such distribution beyond our conception of it.

Daniel Lakeland previously provided an example I found helpful of what it means to put Bayesian probability on a predicted frequency, where there’s no sense in which we can check the calibration of the second order prediction.

Related to this, I recently came across a few papers by Viktor Bengs et al. that formalize some of this in an ML context. Essentially, they show that there is no well-defined loss function that can be used in the typical ML learning pipeline to incentivize the learner to make correct predictions that are also faithful expressions of its epistemic uncertainty. This can be expressed in terms of trying to find a proper scoring rule. In the case of first order predictions, as long as we use a proper scoring rule as the loss function, we can expect accurate predictions, because a proper scoring rule is one under which we cannot do better in expectation than by reporting our true beliefs. But there is no loss function that incentivizes a second order learner to faithfully represent its epistemic uncertainty the way a proper scoring rule does for a first order learner.
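As a quick numerical check of the first order case (with made-up belief and report vectors): under log loss, the expected loss when y is drawn from what the forecaster believes is minimized by reporting that belief itself, which is what makes log loss a proper scoring rule.

```python
import numpy as np

def expected_log_loss(report, belief):
    """Expected -log report[y] when y is drawn from the believed distribution."""
    return -(belief * np.log(np.clip(report, 1e-12, 1.0))).sum()

belief = np.array([0.6, 0.3, 0.1])   # what the forecaster actually believes

honest = expected_log_loss(belief, belief)
hedged = expected_log_loss(np.array([0.4, 0.4, 0.2]), belief)
overconfident = expected_log_loss(np.array([0.9, 0.05, 0.05]), belief)
print(honest, hedged, overconfident)   # the honest report has the lowest expected loss
```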

This may seem obvious, especially if you’re coming from a Bayesian tradition, considering that there is no ground truth against which to score second order predictions. And yet, various loss functions have been proposed for estimating level 2 predictors in the ML literature, such as minimizing the empirical loss of the level 1 prediction averaged over possible parameter values. These results make clear that one needs to be careful in interpreting the predictors such losses produce, because, e.g., they can actually incentivize predictors that appear certain about the first order distribution.
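Here’s a small numerical illustration of that failure mode (all numbers invented): if the level 2 objective is just the level 1 log loss averaged over first order distributions drawn from the second order prediction, then a Dirichlet collapsed near a point mass scores better than a diffuse Dirichlet with the same mean, so the learner is pushed toward reporting certainty it doesn’t have.

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = np.array([0.6, 0.3, 0.1])   # hypothetical true p(y|x)

def averaged_level1_loss(alpha, n=200_000):
    """Level 1 log loss averaged over first order distributions drawn from
    Dirichlet(alpha), then averaged over outcomes y ~ p_true."""
    draws = rng.dirichlet(alpha, size=n)
    return -(p_true * np.log(np.clip(draws, 1e-12, 1.0))).sum(axis=1).mean()

diffuse = averaged_level1_loss(np.array([6.0, 3.0, 1.0]))              # same mean, wide spread
concentrated = averaged_level1_loss(np.array([600.0, 300.0, 100.0]))   # nearly a point mass
print(diffuse, concentrated)   # the near-certain second order prediction gets the lower loss
```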

I guess a question that remains is how to talk about incentives for second order uncertainty at all in a context where minimizing loss from predictions is the primary goal. I don’t think the right conclusion is that it doesn’t matter because we can’t integrate it into a loss minimization framework. Having the ability to decompose predictions by different sources of uncertainty and to be explicit about what our higher order uncertainty looks like going in (e.g., by defining a prior) has scientific value in less direct ways, like communicating beliefs and debugging when things go wrong.