Survey Statistics: 2 flavors of calibration

Statistical Modeling, Causal Inference, and Social Science 2025-06-03

I get confused when the same word is used for 2 different things. (I get confused a lot.)

In Survey Statistics, the word “calibration” is used in 2 different ways. Both are attempts to align estimates from our survey to an external source of data (e.g. census tables):

  1. Poststratification: Calibrate our estimates of means E[Y] to population data about another variable X.
  2. Intercept Correction: Calibrate our estimates of regressions E[Y|X] to aggregate data about E[Y].

Using more sources of data makes a statistical method good (a principle Andrew learned from Hal Stern). Kuriwaki et al. 2024 use both flavors of calibration to estimate Republican vote share by race and congressional district; see their Figure 4.

Suppose we want to estimate E[Y], the population mean. But we only have Y in the survey sample. For example, suppose Y is voting Republican. We can use the sample mean, Ehat[Y | sample], but what if survey-takers are more or less Republican than the population?

If we have population data on X, e.g. racial group, we can estimate Republican vote share by racial group E[Y|X] and aggregate according to the known distribution of racial groups, invoking the law of total expectation: E[Y] = E[E[Y|X]]. So if our sample has the wrong distribution of racial groups, at least we fix that with some calibration. Replacing “E” with estimates “Ehat”, poststratification calibrates our estimate of population mean E[Y] to the known distribution of X, using E[Ehat[Y | X, sample]].

For example, suppose 60% of white voters vote Republican, E[Y | X = white] = 60%, and E[Y | X = non-white] = 25%. Suppose P(X = white) = 70%. But our sample has the wrong distribution of racial groups, e.g. P(X = white | sample) = 50%. Without correcting this, our estimate would be: 60% * 50% + 25% * 50% = 42.5%, too low due to our sample containing too few white voters. But with the correct population distribution of X: 60% * 70% + 25% * 30% = 49.5%, roughly the 2016 election results. So instead of assuming E[Y] = E[Y | sample], our new calculation assumes E[Y | white] = E[Y | white, sample] and E[Y | non-white] = E[Y | non-white, sample].
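In R, the arithmetic looks like this (a minimal sketch using the made-up numbers above):

```r
# Ehat[Y | X] estimated from the sample (made-up values from the example)
p_rep_given_race <- c(white = 0.60, nonwhite = 0.25)

p_race_sample <- c(white = 0.50, nonwhite = 0.50)  # sample's (wrong) distribution of X
p_race_pop    <- c(white = 0.70, nonwhite = 0.30)  # known population distribution of X

# naive estimate: average Ehat[Y | X] over the sample's distribution of X
sum(p_rep_given_race * p_race_sample)  # 0.425

# poststratified estimate: average over the population's distribution of X
sum(p_rep_given_race * p_race_pop)     # 0.495
```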

For more, see Lumley 2010, Chapter 7: Poststratification, Raking, and Calibration. For implementation, try:

survey::calibrate()
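For instance, here is a minimal sketch on simulated data. postStratify() handles the simple case directly; calibrate() generalizes it, taking population totals for the columns of the model matrix. The data frame and population counts below are made up:

```r
library(survey)

set.seed(1)
samp <- data.frame(
  y    = rbinom(1000, 1, 0.45),  # 1 = votes Republican (simulated)
  race = factor(sample(c("white", "non-white"), 1000, replace = TRUE)),
  w    = 1                       # start from equal weights
)
des <- svydesign(ids = ~1, weights = ~w, data = samp)

# poststratify to made-up population counts: 700 white, 300 non-white
pop <- data.frame(race = c("white", "non-white"), Freq = c(700, 300))
svymean(~y, postStratify(des, strata = ~race, population = pop))

# equivalent via calibrate(): totals for the model-matrix columns
# (1000 people for the intercept, 700 of them white)
svymean(~y, calibrate(des, formula = ~race, population = c(1000, 700)))
```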
To summarize poststratification:
  • want: population mean E[Y]
  • have: Y,X in sample, X in population

But let’s flip this around. From election results we know E[Y]. We want our estimates of vote by racial group E[Y|X] to be calibrated to those known totals.

  • want: E[Y|X]
  • have: Y,X in sample, X in population, AND population mean E[Y]

This is a different flavor of calibration! With poststratification, it was the totals themselves we wanted to estimate E[Y], and the regression E[Y|X] was a step along the way. Now the regression E[Y|X] is what we want to estimate, but the known total E[Y] helps us. So when we estimate E[Y|X], we constrain it to aggregate to the known E[Y]. One way this is done is called the “Logit Shift” in Rosenman et al. 2023 and “Intercept Correction” in Ghitza and Gelman 2020. This intercept correction isn’t needed if we do a regression of Y on X in the population. As described in regression textbooks, the fitted values Yhat (i.e. Ehat[Y|X]) will aggregate to the mean Y if your model includes an intercept. What breaks that in survey statistics is that we do the regression in the sample, not the population.

For example, suppose our sample includes too many Democratic voters overall because they are more likely to take surveys. So our Yhats for all racial groups are incorrectly shifted toward the Democrats. We can correct the intercept of our model by subtracting enough so that the calibrated Yhats aggregate to the correct lower proportion of Democrats.
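Here is a minimal sketch of that idea: choose a single shift on the logit scale so the shifted predictions aggregate to the known total. (This is the spirit of the logit shift; the function and numbers below are made up for illustration, not Rosenman et al.’s exact estimator.)

```r
# find one shift delta on the logit scale so that the shifted predictions
# average (with population weights w) to the known total, e.g. an election result
logit_shift <- function(yhat, known_mean, w = rep(1, length(yhat))) {
  gap   <- function(delta) weighted.mean(plogis(qlogis(yhat) + delta), w) - known_mean
  delta <- uniroot(gap, interval = c(-10, 10))$root
  plogis(qlogis(yhat) + delta)
}

# made-up Ehat[Y | X] that aggregates to 42.4%, too Democratic;
# shift it so it aggregates to the known 49.5% Republican share
yhat <- c(white = 0.52, nonwhite = 0.20)
w    <- c(0.70, 0.30)  # population shares of X
yhat_cal <- logit_shift(yhat, known_mean = 0.495, w = w)
weighted.mean(yhat_cal, w)  # 0.495
```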

Simplifying Kuriwaki et al. 2024:

  • They first calibrate (2nd flavor) E[Z|X] to add Z (education) to the auxiliary data X (racial group), using known education aggregates from census tables.
  • They then calibrate (1st flavor) estimates of Republican vote share among racial groups E[Y|X] using auxiliary data X,Z. This is done by estimating E[Y|X,Z] and then averaging over the distribution of Z | X (sketched in code after this list).
  • They then calibrate (2nd flavor again) E[Y|X] to known aggregate E[Y] from election results.
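To make step 2 concrete, averaging E[Y|X,Z] over the distribution of Z given X looks like this (all numbers made up; this is the logic of the step, not their actual estimator):

```r
# Ehat[Y | X, Z]: rows are racial groups, columns are education levels (made up)
ey_xz <- matrix(c(0.65, 0.50,
                  0.20, 0.30),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("white", "non-white"), c("college", "non-college")))

# Phat(Z | X): education distribution within each racial group (made up;
# this is what step 1 calibrated to census education totals)
pz_x <- matrix(c(0.40, 0.60,
                 0.35, 0.65),
               nrow = 2, byrow = TRUE, dimnames = dimnames(ey_xz))

rowSums(ey_xz * pz_x)  # Ehat[Y | X]: white 0.56, non-white 0.265
```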

They use “calibration” to refer to both flavors (bolding my own):

Second, we improve upon existing survey modeling methods by developing two new **calibration** techniques… multilevel regression and poststratification (MRP) for small area estimation. MRP uses hierarchical modeling and **calibration** weights… Furthermore, we develop a two way survey **calibration**, which simultaneously **calibrates** estimates to both election results by geography and an external survey, instead of only to geography.

Calibration in machine learning

I get even more confused when other fields (e.g. machine learning) also use “calibration”.

Our 2nd flavor of calibration, the “intercept correction”, ensures that E[Y] = E[Yhat]. This is sometimes called “mean calibration.” As noted above, we get this for free if the sample and population are the same, and our regression of Y on X includes an intercept. It doesn’t matter if our model is correct at all!
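A quick demonstration: fit a deliberately wrong model and check that the fitted values still average to the sample mean of Y:

```r
set.seed(1)
x <- runif(200)
y <- (x - 0.5)^2 + rnorm(200, sd = 0.05)  # true relationship is quadratic

fit <- lm(y ~ x)  # misspecified linear model, but it has an intercept
c(mean(y), mean(fitted(fit)))  # identical: mean calibration for free
```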

A stricter form of calibration requires E[Y | Yhat] = Yhat. See for example p. 30 of Patterns, Predictions, and Actions: A story about machine learning, by Moritz Hardt and Benjamin Recht. Or Jessica’s posts, e.g. here. This holds if our model for E[Y|X] is correct, which is much harder than just throwing in an intercept term to get mean calibration.
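A common empirical version of this stricter check is a reliability table: bin the predictions and compare each bin’s average prediction to its average outcome. A minimal sketch (my illustration, not from the book):

```r
# rough check of E[Y | Yhat] = Yhat: within bins of yhat, mean_y should
# track mean_yhat; systematic gaps indicate miscalibration
calibration_table <- function(y, yhat, bins = 10) {
  brk <- quantile(yhat, probs = seq(0, 1, length.out = bins + 1))
  bin <- cut(yhat, breaks = brk, include.lowest = TRUE)
  data.frame(mean_yhat = tapply(yhat, bin, mean),
             mean_y    = tapply(y, bin, mean))
}
```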