Iterative imputation and incoherent Gibbs sampling

Statistical Modeling, Causal Inference, and Social Science 2024-12-17

Seeing this post by Tim Morris on the difference between iterative imputation and Gibbs sampling reminded me of some articles that my colleagues and I have written on the topic:

[2001] Using conditional distributions for missing-data imputation. Discussion of “Conditionally specified distributions” by B. Arnold et al. Statistical Science 16, 268-269. (Andrew Gelman and T. E. Raghunathan).

[2014] Multiple imputation for continuous and categorical data: Comparing joint and conditional approaches. Political Analysis 22, 497-519. (Jonathan Kropko, Ben Goodrich, Andrew Gelman, and Jennifer Hill).

[2014] On the stationary distribution of iterative imputations. Biometrika 101, 155-173. (Jingchen Liu, Andrew Gelman, Jennifer Hill, Yu-Sung Su, and Jonathan Kropko).

It’s something that’s been discussed in the Bayesian statistics community for a long time—I remember talking about it with Jun Liu around 1992 or so.

For a very simple example, consider these two incoherent conditional specifications:

x|y ~ normal(0, 1)
y|x ~ normal(x, 1).

These are obviously incoherent: in the specification for x|y, x and y are independent; in the specification for y|x, they are dependent.

What happens if you do Gibbs updating (following Morris, I won’t call it “Gibbs sampling,” as there is no underlying joint distribution)?

Iteration 1: Draw x from normal(0, 1)
Iteration 2: Draw y from normal(x, 1)
Iteration 3: Draw x from normal(0, 1)
Iteration 4: Draw y from normal(x, 1)
Etc.
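Here’s a minimal sketch of this updating scheme in Python (the function name and setup are just my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_chain(n_iter):
    """Gibbs-style updating with the two incoherent conditionals:
    odd iterations draw x | y ~ normal(0, 1) (which ignores y);
    even iterations draw y | x ~ normal(x, 1)."""
    x, y = 0.0, 0.0
    for t in range(1, n_iter + 1):
        if t % 2 == 1:
            x = rng.normal(0.0, 1.0)
        else:
            y = rng.normal(x, 1.0)
    return x, y
```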

What is the joint distribution of (x,y)? It depends on when you stop. If you stop at an odd iteration, the joint distribution is (x,y) ~ MVN((0,0), ((1,0),(0,2))). If you stop at an even iteration, the joint distribution is (x,y) ~ MVN((0,0), ((1,1),(1,2))). If you want a single joint distribution, you could stop at a random iteration, in which case the joint distribution will depend on your random process, but for any reasonable choice of random stopping point will be approximately (x,y) ~ 0.5*MVN((0,0), ((1,0),(0,2))) + 0.5*MVN((0,0), ((1,1),(1,2))).
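Continuing the sketch above, you can check those two covariance matrices by simulation, running many independent chains and stopping each at an odd or an even iteration:

```python
# Empirical check of the two stopping distributions, using run_chain from above.
draws_odd  = np.array([run_chain(101) for _ in range(20000)])  # stop after an x-update
draws_even = np.array([run_chain(102) for _ in range(20000)])  # stop after a y-update

print(np.cov(draws_odd,  rowvar=False))   # roughly ((1,0),(0,2))
print(np.cov(draws_even, rowvar=False))   # roughly ((1,1),(1,2))
```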

Things get more complicated when you have more than two variables, because then there are other possible updating schedules. You could update x,y,z,x,y,z,…, or x,y,x,z,y,z,…, or do a random update—and these could lead to different joint distributions, even after averaging over a random stopping point. There have to be some conditions under which you could bound the variation among these possible distributions—in the extreme case where the conditional specifications are coherent and the process is ergodic, there’s no variation at all—but I don’t know what they are.
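To see the schedule-dependence concretely, here is a hypothetical three-variable version of the same sort of incoherent specification (my own made-up example, not anything from the articles above), building on the numpy setup from the sketch earlier. Each chain is stopped at a random point so the result averages over where in the update cycle you happen to land, and the two schedules can then be compared directly:

```python
def run_chain_3(schedule, n_updates):
    """Hypothetical incoherent three-variable specification:
    x | y,z ~ normal(0, 1), y | x,z ~ normal(x, 1), z | x,y ~ normal(y, 1).
    'schedule' is a string over {'x','y','z'} that is cycled through;
    the chain stops after n_updates single-variable updates."""
    x = y = z = 0.0
    for t in range(n_updates):
        v = schedule[t % len(schedule)]
        if v == 'x':
            x = rng.normal(0.0, 1.0)
        elif v == 'y':
            y = rng.normal(x, 1.0)
        else:
            z = rng.normal(y, 1.0)
    return x, y, z

# Stop each chain at a random point late in the run, so the empirical covariance
# averages over the within-cycle stopping position; the two schedules need not agree.
for sched in ["xyz", "xyxzyz"]:
    draws = np.array([run_chain_3(sched, rng.integers(300, 330)) for _ in range(20000)])
    print(sched, np.cov(draws, rowvar=False).round(2))
```

In this made-up example the two schedules give different covariances among the variables, which is the point of the exercise.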

Iterative imputation for missing data is kinda like the above, except the conditional distributions are themselves estimated from available data. In some ways this makes the problem harder to analyze (which is why the math in the third article linked above gets so complicated), but in other ways it helps that the distributions are estimated from a common dataset, as this should enforce some approximate coherence among the conditional specifications.
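For concreteness, here’s a minimal, self-contained sketch of that idea for two continuous variables, with normal linear conditional models refit at each sweep. It’s an illustration of the general scheme, not the algorithm from any particular imputation package:

```python
import numpy as np

rng = np.random.default_rng(1)

def chained_imputation(x, y, n_sweeps=20):
    """Iterative ("chained equations") imputation for two continuous variables,
    assuming normal linear conditional models. x and y are 1-d arrays with
    np.nan marking missing entries."""
    x, y = x.copy(), y.copy()
    miss_x, miss_y = np.isnan(x), np.isnan(y)
    # Crude starting values: fill missing entries with the observed means.
    x[miss_x] = np.nanmean(x)
    y[miss_y] = np.nanmean(y)
    for _ in range(n_sweeps):
        # Re-estimate x | y from rows where x was observed, then redraw the missing x's.
        b, a = np.polyfit(y[~miss_x], x[~miss_x], 1)
        sigma = np.std(x[~miss_x] - (a + b * y[~miss_x]))
        x[miss_x] = rng.normal(a + b * y[miss_x], sigma)
        # Re-estimate y | x from rows where y was observed, then redraw the missing y's.
        b, a = np.polyfit(x[~miss_y], y[~miss_y], 1)
        sigma = np.std(y[~miss_y] - (a + b * x[~miss_y]))
        y[miss_y] = rng.normal(a + b * x[miss_y], sigma)
    return x, y

# Example use: bivariate normal data with roughly 30% of each variable made missing.
n = 500
x_full = rng.normal(0, 1, n)
y_full = x_full + rng.normal(0, 1, n)
x_obs, y_obs = x_full.copy(), y_full.copy()
x_obs[rng.random(n) < 0.3] = np.nan
y_obs[rng.random(n) < 0.3] = np.nan
x_imp, y_imp = chained_imputation(x_obs, y_obs)
```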

Lots to chew on, from practical, theoretical, and computational perspectives.