How far can exchangeability get us toward agreeing on individual probability?
Statistical Modeling, Causal Inference, and Social Science 2025-01-17
This is Jessica. What’s the common assumption behind the following?
- Partial pooling of information over groups in hierarchical Bayesian models
- In causal inference of treatment effects, saying that the outcome you would get if you were treated (Y^a) shouldn’t change depending on whether you are assigned the treatment (A) or not
- Acting as if we believe a probability is the “objective chance” of an event even if we prefer to see probability as an assignment of betting odds or degrees of belief to an event
The question is rhetorical, because the answer is in the post title. These are all examples where statistical exchangeability is important. Exchangeability says the joint distribution of a set of random variables is unaffected by the order in which they are observed.
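To make the definition concrete, here is a minimal sketch (in Python, with a made-up two-red, two-green urn) of a sequence that is exchangeable but not IID: draws without replacement. Every ordering of a given set of colors has the same probability, yet the draws are clearly dependent.

```python
import itertools
from fractions import Fraction

def seq_prob(seq):
    # Probability of a particular ordered sequence of draws (1 = red, 0 = green)
    # when drawing without replacement from an urn with 2 red and 2 green balls.
    red, green = 2, 2
    p = Fraction(1)
    for x in seq:
        total = red + green
        if x == 1:
            p *= Fraction(red, total)
            red -= 1
        else:
            p *= Fraction(green, total)
            green -= 1
    return p

# Exchangeability: every ordering of two reds and two greens has probability 1/6.
for seq in sorted(set(itertools.permutations([1, 1, 0, 0]))):
    print(seq, seq_prob(seq))

# But the draws are not independent: P(X2 = red | X1 = red) = 1/3, while P(X2 = red) = 1/2.
```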
Exchangeability has broad implications. Lately I’ve been thinking about it as it comes up at the ML/stats intersection, where it’s critical to various methods: achieving coverage in conformal prediction, using counterfactuals in analyzing algorithmic fairness, identifying independent causal mechanisms in observational data, etc.
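To make the first of those concrete, here is a minimal split-conformal sketch (Python with numpy; the toy data-generating process and the 90% level are made-up choices for illustration). The marginal coverage guarantee needs only that the calibration and test points be exchangeable, not that they be IID.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = 2x + noise (an arbitrary choice just for illustration).
n = 2000
x = rng.uniform(-1, 1, n)
y = 2 * x + rng.normal(0, 0.3, n)

# Split into a proper training set and a calibration set.
x_tr, y_tr = x[:1000], y[:1000]
x_cal, y_cal = x[1000:], y[1000:]

# Fit any predictive model on the training split (here, least squares).
slope, intercept = np.polyfit(x_tr, y_tr, 1)
predict = lambda x_new: slope * x_new + intercept

# Conformity scores on the calibration split: absolute residuals.
scores = np.abs(y_cal - predict(x_cal))

# Finite-sample-adjusted quantile; coverage >= 1 - alpha holds as long as the
# calibration and test points are exchangeable.
alpha = 0.1
n_cal = len(scores)
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, method="higher")

x_new = 0.25
print(f"90% prediction interval at x={x_new}: "
      f"[{predict(x_new) - q:.2f}, {predict(x_new) + q:.2f}]")
```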
This week it came up in the course I’m teaching on prediction for decision-making. A student asked whether exchangeability was of interest because often people aren’t comfortable assuming data is IID. I could see how this might seem like the case given how application-oriented papers (like on conformal prediction) sometimes talk about the exchangeability requirement as an advantage over the usual assumption of IID data. But this misses the deeper significance, which is that exchangeability provides a kind of practical consensus between different statistical philosophies. This consensus, and the ways in which it’s ultimately limited, is the topic of this post.
Interpreting the probability of an individual event
One of the papers I’d assigned was Dawid’s “On Individual Risk,” which, as you might expect, talks about what it means to assign probability to a single event. Dawid distinguishes “groupist” interpretations of probability, which depend on identifying some set of events (like the frequentist definition of probability as the limiting frequency over hypothetical replications of the event), from individualist interpretations, like a “personal probability” reflecting the beliefs of some expert about some specific event, conditional on some specific prior experience. For the purposes of this discussion, we can put Bayesians (subjective, objective, and pragmatic, as Bob describes them here) in the latter personalist-individualist category.
On the surface, the frequentist treatment of probability as an “objective” quantity appears incompatible with the individualist notion of probability as a descriptor of a particular event from the perspective of the particular observer (or expert) ascribing beliefs. If you have a frequentist and a personalist thinking about the next toss of a coin, for example, you would expect the probability the personalist assigns to depend on their joint distribution over possible sequences of outcomes, while the frequentist would be content to know the limiting frequency. But de Finetti’s theorem shows that if you believe a sequence of events to be exchangeable, then your beliefs about those random variables are indistinguishable from beliefs about independent events with some underlying probability. Given a sequence of exchangeable Bernoulli random variables X1, X2, X3, …, you can think of a draw from their joint distribution as sampling p ~ mu, then drawing X1, X2, X3, … independently from Bernoulli(p). So the frequentist and personalist can both agree, under exchangeability, that p is meaningful for decision making. David Spiegelhalter recently published an essay on interpreting probability that ends by noting how remarkable this pragmatic consensus is.
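Here is a small numerical sketch of that representation (Python; the choice of mu as a Beta(2, 5) distribution is arbitrary). Each run draws its own p from mu and then a conditionally IID Bernoulli(p) sequence; within each run the empirical frequency settles down to that run’s p, which is the quantity the frequentist and the personalist can both treat as meaningful.

```python
import numpy as np

rng = np.random.default_rng(2)

def exchangeable_sequence(n):
    # de Finetti-style representation: draw p ~ mu (here mu = Beta(2, 5),
    # an arbitrary choice), then draw the sequence conditionally IID given p.
    p = rng.beta(2, 5)
    x = rng.random(n) < p
    return p, x

for _ in range(3):
    p, x = exchangeable_sequence(100_000)
    print(f"latent p = {p:.3f},  empirical frequency = {x.mean():.3f}")
```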
But Dawid’s goal is to point out ways in which the apparent alignment is not as satisfactory as it may seem in resolving the philosophical chasm. It’s more like we’ve thrown a (somewhat flimsy) plank over it. Exchangeability may sometimes get us across by allowing the frequentist and personalist to coordinate in terms of actions, but we have to be careful how much weight we put on this.
The reference set depends on the state of information
One complication is that the personalist’s willingness to assume exchangeability to conceive of the probability of some individual event depends on the information they have. Dawid uses the example of trying to predict the exam score of some particular student. If they have no information distinguishing the target student from the rest of some set of students, the personalist might be content to adopt the overall limiting relative frequency p of passing across that set. But as soon as they learn something that makes the individual student unique, p is no longer the appropriate reference for that student’s probability of passing the exam.
As an aside, this doesn’t mean that exchangeability is only useful if we think of members of some exchangeable set as identical. There may still be practical benefits of learning from the other students in the context of a statistical model, for example. See, e.g., Andrew’s previous post on exchangeability as an assumption in hierarchical models, where he points out that assuming exchangeability doesn’t necessarily mean that you believe everything is indistinguishable, and if you have additional information distinguishing groups, you can incorporate that in your model as group-level predictors.
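As a minimal sketch of that kind of partial pooling (a normal-normal shrinkage estimate in Python; the group sizes, the within-group sd sigma, and the between-group sd tau are all made-up values, not anything from Dawid’s paper or Andrew’s post), each group’s mean gets pulled toward the overall mean in proportion to how little data that group has:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up exam scores for a few groups of students (different sizes).
groups = [rng.normal(70, 10, size=n) for n in (5, 20, 80)]

sigma = 10.0   # assumed within-group sd (treated as known, for simplicity)
tau = 5.0      # assumed between-group sd
grand_mean = np.mean(np.concatenate(groups))

for g in groups:
    n = len(g)
    # Precision-weighted compromise between the group mean and the grand mean:
    # small/noisy groups get pulled strongly toward the overall mean.
    w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
    pooled = w * g.mean() + (1 - w) * grand_mean
    print(f"n={n:3d}  raw mean={g.mean():5.1f}  partially pooled={pooled:5.1f}")
```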
But for the purposes of personalists and frequentists agreeing on a reference for the probability of a specific event, the dependence on information is troubling. Can we avoid this by making the reference set more specific? What if we’re trying to predict a particular student’s score on a particular exam in a world where that particular student is allowed to attempt the same exam as many times as they’d like? Now that the reference group refers to the particular student and particular exam, would the personalist be content to accept the limiting frequency as the probability of passing the next attempt?
The answer is, not necessarily. This imaginary world still can’t get us to the generality we’d need for exchangeability to truly reconcile a personalist and frequentist assessment of the probability.
Example where the limiting frequency is constructed over time
Dawid illustrates this by introducing a complicating (but not at all unrealistic) assumption: that the student’s performance on their next try on the exam will be affected by their performance on the previous tries. Now we have a situation where the limiting frequency of passing on repeated attempts is constructed over time.
As an analogy, consider drawing balls from an urn that starts with 1 red ball and 1 green ball. Each time we draw a ball, we immediately return it along with an additional ball of the same color. At each draw, every ball currently in the urn is equally likely to be drawn, and the resulting sequence of colors is exchangeable.
Given that p is not known, which do you think the personalist would prefer to consider as the probability of a red ball on the first draw: the proportion of red balls currently in the urn, or the limiting frequency of drawing a red ball over the entire sequence?
It turns out that in this example the distinction doesn’t actually matter: the proportion of red balls currently in the urn is 1/2, and the limiting frequency, while random, has expectation 1/2, so either way the personalist should bet 0.5. So why is there still a problem in reconciling the personalist assessment with the limiting frequency?
The answer is that knowledge of the dynamic aspect of the process makes it seem contradictory for the personalist to trust the limiting frequency. If they know the limiting frequency is constructed over time, on what grounds are they supposed to take it as the right reference for the probability on the first draw? This gets at the awkwardness of using behavior in the limit to think about individual predictions we might make.
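To see the issue concretely, here is a quick simulation of the urn (a Python sketch; the number of draws and runs are arbitrary). Within any single run the fraction of red draws settles toward a limit, but that limit is itself random, different from run to run, even though the probability of red on the very first draw is 1/2 every time.

```python
import numpy as np

rng = np.random.default_rng(4)

def polya_urn(n_draws):
    # Start with 1 red and 1 green; after each draw, return the ball and add
    # another of the same color. Return the fraction of red draws.
    red, green = 1, 1
    reds_drawn = 0
    for _ in range(n_draws):
        if rng.random() < red / (red + green):
            red += 1
            reds_drawn += 1
        else:
            green += 1
    return reds_drawn / n_draws

# The long-run fraction of reds differs in every run (for this urn it is
# distributed Uniform(0, 1)): the "limiting frequency" is constructed by the
# process itself rather than fixed in advance. Yet the probability of red on
# the first draw is 1/2 in every run, which is all the first bet needs.
print([round(polya_urn(50_000), 3) for _ in range(5)])
```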
Why this matters in the context of algorithmic decision-making
This example is related to some of my prior posts on why calibration does not satisfy everyone as a means of ensuring good decisions. The broader point in the context of the course I’m teaching is that when we’re making risk predictions (and subsequent decisions) about people, such as deciding whether to grant someone a loan or provide some medical treatment, there is inherent ambiguity in the target quantity. Often there is an expectation that the decision-maker will do their best to consider the information about that particular person and make the best decision they can. What becomes important is not so much that we can guarantee our predictions behave well as a group (e.g., calibration) but that we understand how we’re limited by the information we have and what assumptions we’re making about the reference group in an individual case.