When is calibration enough?

Statistical Modeling, Causal Inference, and Social Science 2024-08-14

This is Jessica. Calibration has been a topic of recent work in theoretical computer science, such as work on multicalibration (calibration of model predictions in light of additional contextual information, such as groups represented in the data) and on post hoc calibration of predictions from a black-box model. Calibration is typically understood as the property of a prediction model or algorithm where, among all predictions that are approximately b, the event realizes at a rate of b, and this holds for all values of b (with b binned at some granularity). It can be defined over a distribution or over a sequence. The online (sequential) setting is more challenging, because we can imagine Nature picking outcomes adversarially to throw us off. If the model/algorithm is calibrated in the sequential setting, then expected calibration error (the weighted average gap between predicted probabilities and realized frequencies) goes to zero as the number of periods increases.
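To make the definition concrete, here is a minimal sketch of the binned version of expected calibration error for a sequence of probability forecasts and binary outcomes; the function name and binning scheme are my own illustration, not from any of the papers discussed.

```python
# Hypothetical sketch: empirical expected calibration error (ECE) for
# probability forecasts preds and 0/1 outcomes, with predictions binned.
def expected_calibration_error(preds, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # bin index: predictions "approximately b"
        bins[i].append((p, y))
    n = len(preds)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_pred = sum(p for p, _ in members) / len(members)   # average prediction in bin
        freq = sum(y for _, y in members) / len(members)       # realized rate in bin
        ece += (len(members) / n) * abs(avg_pred - freq)       # weighted average gap
    return ece

# Predictions of 0.5 with outcomes realizing half the time are perfectly calibrated:
print(expected_calibration_error([0.5] * 4, [1, 0, 1, 0]))  # 0.0
```

A sequence that predicts 0.9 while the event never occurs would instead give an ECE of 0.9, the gap the definition is tracking.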

Some of this recent work explores the value of calibration for decision-making. Calibrated predictions are said to be valuable because payoff-maximizing decision makers can trust them. For example, Noarov and Roth argue in this blog post that, simultaneously for all downstream problems, the policy that treats the predictions of a calibrated model as correct and acts accordingly is uniformly best. Similar points are made elsewhere. Sometimes this is stated in terms of no swap regret: any decision maker (regardless of their utility function) who best responds to a sequence of calibrated predictions cannot improve their total payoff in retrospect by changing every occurrence of some action i to a different action j. These points assume that the decision maker uses only the predicted probability to select the action that minimizes their expected loss; i.e., no additional information about the true state is provided.
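The best-response and swap-regret notions above can be sketched in a few lines; this is an illustrative toy with made-up names and a binary state, not any specific algorithm from the literature.

```python
# Hypothetical sketch: a decision maker who best-responds to a forecast,
# and the swap-regret check: could total loss have been reduced in
# retrospect by replacing every play of action i with some fixed action j?
def best_response(p, loss):
    # loss[action] = [loss if state 0, loss if state 1]; p = predicted P(state = 1)
    return min(loss, key=lambda a: (1 - p) * loss[a][0] + p * loss[a][1])

def swap_regret(preds, outcomes, loss):
    actions = [best_response(p, loss) for p in preds]
    realized = sum(loss[a][y] for a, y in zip(actions, outcomes))
    worst = 0.0
    for i in loss:
        for j in loss:
            # swap every occurrence of action i to action j, holding the rest fixed
            swapped = sum(loss[j if a == i else a][y]
                          for a, y in zip(actions, outcomes))
            worst = max(worst, realized - swapped)
    return worst

# Toy loss: carrying an umbrella costs 1 if dry; getting rained on costs 5.
loss = {"umbrella": [1, 0], "none": [0, 5]}
# Forecasts of 0.9 followed by rain both days: no swap improves things.
print(swap_regret([0.9, 0.9], [1, 1], loss))  # 0.0
```

If the same 0.9 forecasts were followed by two dry days, swapping every "umbrella" to "none" would have saved 2 units of loss, so the swap regret would be 2; calibration is what rules out such systematically exploitable forecasts in the long run.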

Given that calibration is defined by what realizes over the long term, it seems obvious that it alone can’t ensure that we make good decisions over a shorter time horizon. For example, as the well-known Foster and Vohra paper shows, while no deterministic sequence of predictions can be calibrated for all possible realizations of the sequence, a randomized algorithm can produce forecasts that are approximately calibrated regardless of what realizes; we don’t need any knowledge of the true data-generating process to achieve this. If all my decisions carry similar (low) stakes I might not care that I’m losing utility in the short term, but if some decisions are much more important than others, I might want to know more about the forecast I’m getting than just that it’s calibrated. So how we achieve calibration seems hard to ignore as soon as we think about decision-making in finite time. Theory papers tend to talk about how regret or error scale with the number of periods, but not so much about what kind of information we get from any single prediction.

This “how” question has me wondering about the connection between certain algorithms for achieving calibration and Bayesian learning. On the one hand, it seems very unsatisfying from a Bayesian perspective to say that calibration is sufficient for good decisions, because of the preoccupation with long-run frequencies. On the other hand, arguments that calibration is valuable for decision-making adopt a Bayesian perspective by assuming that the decision maker will choose the action that minimizes expected loss over a posterior distribution. Wald shows that with a discrete space of possible states of the world, once we refrain from considering states of nature that are known to be impossible, the only admissible decision strategies (those not dominated by some other strategy that does at least as well in every state and strictly better in some) are Bayesian ones.
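The Bayesian decision strategies that Wald’s result points to are simple to state in code: given a posterior over a discrete set of states, act to minimize posterior expected loss. A minimal sketch, with names and the toy loss table being my own illustration:

```python
# Hypothetical sketch: the Bayes decision rule over a discrete state space.
def bayes_action(posterior, loss):
    # posterior: {state: probability}; loss: {action: {state: loss}}
    # Choose the action minimizing posterior expected loss.
    return min(loss, key=lambda a: sum(posterior[s] * loss[a][s] for s in posterior))

loss = {"umbrella": {"rain": 0, "dry": 1},
        "none":     {"rain": 5, "dry": 0}}
print(bayes_action({"rain": 0.3, "dry": 0.7}, loss))  # umbrella
print(bayes_action({"rain": 0.1, "dry": 0.9}, loss))  # none
```

The decision rule itself is the easy part; the substance of the Bayesian program is in how the posterior is arrived at, which is exactly the “how” question about calibrated forecasts.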

I’m familiar with Dawid’s 1982 summary of the Bayesian perspective on calibration. In this view, calibration is expected given coherence, i.e., given that the forecaster’s conditional probability estimates abide by the rules of probability theory. In theory, then, calibration is not even useful for diagnosing model misfit: a coherent forecaster already expects to be calibrated. However, Dawid himself problematizes this by pointing out that in practice, expecting calibration every time one uses a Bayesian model is unrealistic. Andrew has talked about how applied Bayesian analysis often does involve checking calibration, which matches my experience. Bob describes how the theory offers only “cold comfort” given that we don’t usually expect to have gotten the model right. So calibration is not so important in Bayesian learning theory, but in practice it can be an important signal to some Bayesians of how to improve the model.

I am specifically interested in the decision-making perspective, though. In Wald’s framework, inference is just a special case of decision-making. Jaynes argues that even if you don’t buy Wald’s formulation of the admissibility requirement for decision strategies (because it doesn’t account for prior information), there are other inevitabilities of the Bayesian approach to estimation that even a sampling theorist must admit. If we care about making the best decision under uncertainty about the state, we need to make our best prediction (i.e., conditional on all the available evidence) of the probability of the state.

I’m used to thinking of calibration as a sort of minimal quality we expect of good predictions, not a property sufficient for good decisions. But now I’m wondering what, if anything, we can say about how information was used in arriving at a prediction when we know only that the predictions are calibrated in the sense described above. Bayesian theory implies calibration, but does calibration imply anything in the other direction? Can we say something about the degree to which forecasts from online algorithms aggregate information like a Bayesian? I’ve seen some of the online algorithms described as non-Bayesian, but this isn’t really satisfying.

Comments, critiques, pointers to related discussions all welcome.