Calibration is sometimes sufficient for trusting predictions. What does this tell us when human experts use model predictions?

Statistical Modeling, Causal Inference, and Social Science 2024-11-01

This is Jessica. I got through a long string of deadlines and invited talks and now I’m back to thinking about calibration and decision-making.

In a previous post I was wondering about the relationship between calibration and Bayesian use of information – Bayesian theory implies calibration, but when does calibration imply Bayesian use of information? I am still interested in this question, and may follow up with a post on this topic later.

But more broadly I have been considering what to do with some recent observations in theoretical computer science involving calibration and decision-making. For example, what's the significance of results on the sufficiency (or lack thereof, depending on how you look at it) of calibration for good decisions? Or the possibility of identifying predictors that we expect to be calibrated for many downstream decision tasks, or over many possibly intersecting groups represented in the data? How realistic is it to expect to achieve these guarantees in practice?

I think these questions are important in light of renewed interest in calibration. Rather than stuff all of this in a single blog post, today I’ll discuss a few issues with embracing calibration as the goal when predictions are inputs to human experts’ decisions. 

Calibration is enough for good decisions in a restricted sense

Consider the following:

“Calibrated predictions have the property that it is optimal for a decision maker to optimize assuming that the prediction is correct.”

“Calibration promises that, simultaneously for all downstream problems, the policy that treats the predictions of a calibrated model as correct and acts accordingly is uniformly best amongst all policies.”

It would be easy to conclude from these statements that "Calibration is all you ever need for good decisions." But the above statements assume a decision-maker who has no access to contextual information beyond the prediction. In that case, if the prediction is calibrated, the decision-maker can't do better than treating it as the true probability when making their decisions.

Stated this way, the sufficiency of calibration for decision-making can seem to border on tautological: If you evaluate decisions using expected value and assume the decision-maker will only decide based on the prediction from your model, then calibration is enough to trust the predictions. The result is partly a function of how we are evaluating things.
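To make the restricted setting concrete, here's a minimal simulation (all numbers made up: a toy risk curve, decile-binned predictions, and hypothetical misclassification costs). The decision-maker sees only a calibrated prediction q, and thresholding q as if it were the true probability does as well as any other threshold policy based on q:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all numbers made up): true risk rises with a feature x,
# but the model reports only a coarse, calibrated summary of that risk.
n = 200_000
x = rng.uniform(0, 1, n)
p_true = 0.1 + 0.8 * x                     # true P(y = 1 | x)
y = rng.binomial(1, p_true)

# Calibrated-but-coarse prediction: average risk within each decile of x.
bins = np.digitize(x, np.linspace(0, 1, 11)[1:-1])
bin_means = np.array([p_true[bins == b].mean() for b in range(10)])
q = bin_means[bins]

# Binary decision with hypothetical asymmetric costs:
# c_fp if we act and y = 0, c_fn if we don't act and y = 1.
c_fp, c_fn = 1.0, 3.0

def expected_loss(act):
    return np.mean(np.where(act, (1 - y) * c_fp, y * c_fn))

# Treat the calibrated prediction as correct: act iff q >= c_fp / (c_fp + c_fn).
t_star = c_fp / (c_fp + c_fn)
loss_trusting = expected_loss(q >= t_star)

# Compare against every other threshold policy based on q alone.
grid = np.linspace(0, 1, 201)
losses = [expected_loss(q >= t) for t in grid]
print(f"treat-as-correct threshold {t_star:.2f}: expected loss {loss_trusting:.4f}")
print(f"best threshold in the grid {grid[int(np.argmin(losses))]:.2f}: expected loss {min(losses):.4f}")
```

The point isn't that this policy is good in an absolute sense, only that no other rule based on q alone beats it.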

If, on the other hand, you think a decision-maker may have access to additional information about a decision instance that is not available to the model making the predictions, then predictions that are calibrated with respect to the information the model sees, but not to this additional information, are not necessarily enough. See, e.g., this demonstration by Corvelo Benz and Gomez Rodriguez of how predictions calibrated over the model's information are not always sufficient for a decision-maker with access to additional features to discover the optimal policy.
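Here's a toy version of that gap (not their construction; just made-up probabilities for illustration): the model's prediction is perfectly calibrated over the one feature it sees, but an expert who also observes a second signal can do strictly better than best-responding to the prediction alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Made-up setup: the model sees a binary feature x1; the expert additionally
# observes a binary signal x2 that the model never had access to.
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
risk_table = np.array([[0.05, 0.45],      # rows indexed by x1, columns by x2
                       [0.35, 0.75]])
p_true = risk_table[x1, x2]
y = rng.binomial(1, p_true)

# Model prediction: calibrated over the model's own information, q = E[y | x1].
q = np.where(x1 == 1, y[x1 == 1].mean(), y[x1 == 0].mean())

c_fp = c_fn = 1.0                          # symmetric costs for simplicity
t = c_fp / (c_fp + c_fn)

def expected_loss(act):
    return np.mean(np.where(act, (1 - y) * c_fp, y * c_fn))

# Policy 1: treat the calibrated prediction as correct.
loss_trusting = expected_loss(q >= t)

# Policy 2: the expert also uses x2, estimating E[y | x1, x2] from experience.
p_expert = np.zeros(n)
for a in (0, 1):
    for b in (0, 1):
        idx = (x1 == a) & (x2 == b)
        p_expert[idx] = y[idx].mean()
loss_expert = expected_loss(p_expert >= t)

print(f"best-responding to the calibrated prediction alone: {loss_trusting:.3f}")
print(f"combining it with the expert's extra signal:        {loss_expert:.3f}")
```

The prediction here really is calibrated; it just doesn't encode everything the expert knows.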

It’s interesting how casual statements that “calibration is all we need for good decisions” can also seem to contradict aspirations for integrating predictive models in many cases where human experts currently make decisions. In practice, it seems that there’s often an expectation that a human expert might have additional information over a statistical model. And so we don’t just want the human expert to blindly trust model predictions. If that were the case, we could just automate things. 

Part of this relates to limitations on how we know that a model is calibrated in practice. I expect it's still fairly common, when seeking to deploy AI models to help decision-makers in fields like medicine, to evaluate model calibration before the human expert starts using the model, and at most sporadically after that. For example, we might check calibration on data from the training/test distribution. But often we maintain some skepticism about whether we got the problem definition right. So part of the assurance provided by having a human on hand is that they might be able to recognize when the model's assumptions no longer hold, even when the model "thinks" it's perfectly calibrated.

There’s also often an expectation that a doctor, for example–through interacting with a patient to gain more details, or observing hard to formalize signals like their mood–has the potential to make a better decision than the model could have made alone by combining their own information with that contained in model predictions.** If we allow for the possibility that an expert consulting model predictions might have access to other relevant information, the value of calibration gets more complicated than suggested by the statements above.

Calibration says nothing about how good individual decisions are

Another human-facing issue that doesn't often get acknowledged in the theoretical literature on calibration is that decision-makers often need to care about being accountable or legally liable for individual decisions. For example, doctors, and the organizations that employ them, need to be able to demonstrate a lack of negligence for specific decisions if pressed. This might entail showing that, given the available information, one could not have made a better or less harmful decision.

One way to define perfect calibration uses "outcome indistinguishability." This says that the true distribution of individual-outcome pairs (x, y*) is indistinguishable from the distribution of pairs (x, ỹ) generated by simulating each outcome from the model's predicted probability, ỹ ∼ Ber(p̃(x)). This helps illustrate that calibration is subject to multiplicity: there may be multiple predictors that are outcome indistinguishable from the true distribution but which assign different predictions to the same individuals. We can achieve calibration while still producing predictions that are biased, for example. So knowing predictions are calibrated doesn't tell us about the limitations of the available information for making a decision on a particular instance. What would be better is something Berk Ustun and I have been thinking a lot about recently.
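To see the multiplicity point in a quick simulation (again with made-up numbers): conditioning on two different features of the same population gives two predictors that are each perfectly calibrated, yet disagree on roughly half of the individuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Made-up population: outcomes depend on two binary features.
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
p_true = 0.2 + 0.3 * x1 + 0.3 * x2
y = rng.binomial(1, p_true)

# Predictor A reports E[y | x1]; predictor B reports E[y | x2].
# (These are the population conditional probabilities: 0.35 and 0.65.)
pred_a = np.where(x1 == 1, 0.65, 0.35)
pred_b = np.where(x2 == 1, 0.65, 0.35)

def calibration_table(pred):
    # realized outcome rate among instances receiving each forecast value
    return {float(v): round(float(y[pred == v].mean()), 3) for v in np.unique(pred)}

print("A (forecast -> realized rate):", calibration_table(pred_a))
print("B (forecast -> realized rate):", calibration_table(pred_b))
print("share of individuals where A and B disagree:",
      round(float(np.mean(pred_a != pred_b)), 3))
```

Both pass a calibration check, but they tell you to treat half the individuals differently.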

That calibration isn’t sufficient for good individual decisions is also apparent if you consider the existence of algorithms that achieve calibration in adversarial, online (i.e., sequential) settings, where calibration can be achieved, for example, by injecting noise into predictions. There’s a difference between making good individual decisions and achieving calibration via clever posthoc error accounting. 

I would hope these points are clear enough from reading the emerging work on calibration. But I'm not sure they always are. Overall, I'm excited that calibration for decision-making is a popular topic, and I think there's still a lot to characterize. I'm just not sure we've developed the processes or theory to know how valuable these results are in practice, particularly in cases where predictions inform human experts. So there's a risk that framing calibration as sufficient for good decisions fixates attention on the wrong thing.

**Whether this folk belief is justified by evidence or philosophy is a separate question worth discussing in its own post.