Survey Statistics: improving with structure

Statistical Modeling, Causal Inference, and Social Science 2026-04-07

We’ve met Mr. P (Multilevel Regression and Poststratification). We’ve met Mrs. P (Multilevel Regression with Synthetic Poststratification). Now let’s meet Ms. P (Multilevel Structured regression with Poststratification) from Gao et al. 2021.

Let’s first review Poststratification:

  • We want the population mean E(Y).
  • We have Y, X in the sample and X in the population.
  • So we calibrate our estimates of E(Y) to the population distribution of X.

Gao et al. 2021’s example:

  • Y = support for gay marriage
  • X = sex, race, income, state, age, education
  • sample data on Y, X from National Annenberg Election Survey 2008
  • population data on X from the American Community Survey (ACS)

By the law of total expectation: E(Y) = E(E(Y|X)). When our estimate of E(Y|X) is the sample mean of Y for folks with that X, the aggregate estimate is classical Poststratification (no honorific). When our estimate of E(Y|X) is based on a model that regularizes across X, the aggregate estimate is Mr. P.
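As a minimal sketch of the classical (no honorific) version, the estimate is a weighted average of per-cell sample means, with weights from the population counts of each X cell. The function name `poststratify` and the toy numbers are made up for illustration and are not from Gao et al. 2021.

```python
def poststratify(cell_estimates, population_counts):
    """Average per-cell estimates of E(Y|X), weighted by population cell counts."""
    total = sum(population_counts)
    return sum(est * n / total for est, n in zip(cell_estimates, population_counts))

# Hypothetical toy data: three cells of X, with made-up outcomes and counts.
sample_y = {0: [1, 0, 1, 1], 1: [0, 0, 1], 2: [1]}  # sampled Y values per cell
population_counts = [500, 300, 200]                 # population size of each cell

# Classical poststratification: estimate E(Y|X) by the sample mean in each cell.
cell_means = [sum(sample_y[c]) / len(sample_y[c]) for c in range(3)]
estimate = poststratify(cell_means, population_counts)

# Unadjusted sample mean, for comparison.
all_y = [y for ys in sample_y.values() for y in ys]
raw_mean = sum(all_y) / len(all_y)

print(round(raw_mean, 3), round(estimate, 3))
```

In this toy example the raw sample mean is 0.625 and the poststratified estimate is 0.675, because the sample's mix of cells differs from the population's. Mr. P keeps the same weighting step but replaces `cell_means` with model-based estimates.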

How to regularize across X? Suppose age is one of the X variables. Gao et al. 2021 consider 3 priors for the coefficients a_j of age groups j = 1,…,J:

  1. Independent Normal: a_j ~ N(0, sigma)
  2. Autoregressive: a_j | a_{j-1} ~ N(rho a_{j-1}, sigma) with rho in (-1,1)
  3. Random Walk: a_j | a_{j-1} ~ N(a_{j-1}, sigma)

The first is often used in Mr. P. The next two belong to Ms. P, as they use the ordinal structure of age, where age group j is closer to age group j + 1 than to age group j + 5. Using this structure can help Ms. P regularize more, with smaller sigma and more borrowing of information across ages.
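To get a feel for the three priors, here is a rough sketch that draws one set of age coefficients from each. It makes two assumptions that are not stated above: sigma is treated as a standard deviation, and the first coefficient is drawn as a_1 ~ N(0, sigma). The function names are made up for illustration.

```python
import random

def draw_independent(J, sigma, rng):
    """a_j ~ N(0, sigma), independently for each age group."""
    return [rng.gauss(0.0, sigma) for _ in range(J)]

def draw_autoregressive(J, sigma, rho, rng):
    """a_j | a_{j-1} ~ N(rho * a_{j-1}, sigma); assumes a_1 ~ N(0, sigma)."""
    a = [rng.gauss(0.0, sigma)]
    for _ in range(1, J):
        a.append(rng.gauss(rho * a[-1], sigma))
    return a

def draw_random_walk(J, sigma, rng):
    """a_j | a_{j-1} ~ N(a_{j-1}, sigma): the same recursion with rho fixed at 1."""
    return draw_autoregressive(J, sigma, 1.0, rng)

J, sigma = 12, 0.5
print(draw_independent(J, sigma, random.Random(1)))
print(draw_autoregressive(J, sigma, 0.8, random.Random(1)))
print(draw_random_walk(J, sigma, random.Random(1)))
```

Under the independent prior each coefficient is centered at 0 regardless of its neighbors, while under the other two each coefficient is centered at (a multiple of) the previous one. Setting rho = 0 in the autoregressive recursion recovers the independent prior.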

Gao et al. 2021 simulate data from E(Y|X) = logit^-1(… f(X_age[j])…) where the function f(X_age[j]) is how support varies by age. They consider 3 smooth functions f(x). For data simulated from each, they fit models with the 3 priors above. Ms. P outperforms Mr. P, with the Random Walk structure doing best.
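The data-generating step of such a simulation might look roughly like the sketch below. It is only an illustration: it keeps just the age term (the elided terms are left out), and `f_hypothetical` is a made-up smooth function, not one of the three used by Gao et al. 2021.

```python
import math
import random

def inv_logit(z):
    """Inverse logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def f_hypothetical(x):
    """Made-up smooth effect of age, with x rescaled to [0, 1]."""
    return math.sin(math.pi * x)

def simulate_responses(age_groups, J, rng):
    """Draw binary Y with P(Y = 1) = inv_logit(f(x_j)) for each respondent's age group j."""
    ys = []
    for j in age_groups:
        x = (j - 1) / (J - 1)          # rescale age group 1..J to [0, 1]
        p = inv_logit(f_hypothetical(x))
        ys.append(1 if rng.random() < p else 0)
    return ys

rng = random.Random(2021)
J = 12
age_groups = [rng.randint(1, J) for _ in range(200)]  # hypothetical sample of 200 respondents
y = simulate_responses(age_groups, J, rng)
print(sum(y) / len(y))
```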

This footnote in the Stan documentation might explain why Random Walk outperforms Autoregressive:

“In practice, it can be useful to remove the constraint to test whether a non-stationary set of coefficients provides a better fit to the data. It can also be useful to add a trend term to the model, because an unfitted trend will manifest as non-stationarity.”
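One way to see the stationary versus non-stationary distinction is to track the prior variance of a_j implied by each recursion. The sketch below assumes sigma is a standard deviation and that the first coefficient has variance sigma^2 (neither is stated above); `prior_variances` is a made-up helper name.

```python
def prior_variances(J, sigma, rho):
    """Var(a_j) for j = 1..J under a_j | a_{j-1} ~ N(rho * a_{j-1}, sigma).

    Assumes Var(a_1) = sigma^2. Since the new noise is independent of a_{j-1},
    Var(a_j) = rho^2 * Var(a_{j-1}) + sigma^2.
    """
    variances = [sigma ** 2]
    for _ in range(1, J):
        variances.append(rho ** 2 * variances[-1] + sigma ** 2)
    return variances

ar_var = prior_variances(J=12, sigma=1.0, rho=0.5)  # Autoregressive with |rho| < 1
rw_var = prior_variances(J=12, sigma=1.0, rho=1.0)  # Random Walk (rho = 1)

print([round(v, 3) for v in ar_var])
print(rw_var)
```

With |rho| < 1 the variance levels off near sigma^2 / (1 - rho^2), here about 1.333, while with rho = 1 it grows linearly in j (1, 2, …, 12). So the Random Walk prior can follow a trend across age groups that the stationary Autoregressive prior pulls back toward 0.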

Gao et al. 2021 also consider spatial priors, where neighboring PUMAs (Public Use Microdata Areas) are more correlated. They simulate data both with spatial smoothness and from independent Normals, to confirm that the prior “does not force spatial structure when it’s not present”. Did they skip this for the age model?

Along with the simulation studies, Gao et al. 2021 apply Ms. P to the National Annenberg Election Survey 2008 and ACS example above.