Wacky priors can work well?

Statistical Modeling, Causal Inference, and Social Science 2013-03-15

Dave Judkins writes:

I would love to see a blog entry on this article, Bayesian Model Selection in High-Dimensional Settings, by Valen Johnson and David Rossell. The simulation results are very encouraging although the choice of colors for some of the graphics is unfortunate. Unless I am colorblind in some way that I am unaware of, they have two thin charcoal lines that are indistinguishable.

When Dave Judkins puts in a request, I’ll respond. Also, I’m always happy to see a new Val Johnson paper. Val and I are contemporaries—he and I got our PhD’s at around the same time, with both of us working on Bayesian image reconstruction, then in the early 1990s Val was part of the legendary group at Duke’s Institute of Statistics and Decision Sciences—a veritable ’27 Yankees featuring Mike West, Merlise Clyde, Michael Lavine, Dave Higdon, Peter Mueller, Val, and a bunch of others. I always thought it was too bad they all had to go their separate ways.

Val also wrote two classic papers featuring multilevel modeling, one on adjustment of college grades (leading to a proposal that Duke University famously shot down), and one on primate intelligence.

Anyway, to get to the paper at hand . . . Johnson and Rossell write:

We demonstrate that model selection procedures based on nonlocal prior densities assign a posterior probability of 1 to the true model as the sample size n increases when the number of possible covariates p is bounded by n and certain regularity conditions on the design matrix pertain.

This doesn’t bother me, but it doesn’t seem particularly relevant to anything I would study. The true model is never in the set of models I’m fitting. Rather, the true model is always out of reach, a bit more complicated than I ever have the data and technology to fit.

They also write:

In practice, it is usually important to identify not only the most probable model for a given set of data, but also the probability that the identified model is correct.

I take Johnson and Rossell’s word that this describes their practice but it doesn’t describe mine. I know ahead of time that the probability is zero that the identified model is correct.

I’m not trying to be glib here. This is really how I operate. Models, fitting, regularization, prediction, inference: for me, it’s all approximate.

On the practical side, though, the method proposed in the paper might be great. The proposal is for Bayesian regression where each coefficient has a prior distribution that is a mixture of a spike at zero and a funny-shaped distribution for the nonzero values. I’d be interested in comparing it to a direct Bayesian approach that keeps all the coefficients in the model and just uses a hierarchical prior that partially pools everything toward zero.
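
To give a sense of what “funny-shaped” means here, a minimal base-R sketch (my own illustration, not the authors’ code; the scale tau = 1 is arbitrary) comparing a first-order moment (“pMOM”) nonlocal prior, which is exactly zero at the origin, with the ordinary normal prior that a simple hierarchical model would partially pool toward zero:

# My sketch, not Johnson and Rossell's code: the pMOM density is
# proportional to x^2 times a normal density, so it vanishes at x = 0.
tau <- 1
x <- seq(-4, 4, length.out = 400)
pmom   <- x^2 / tau * dnorm(x, 0, sqrt(tau))   # integrates to 1
normal <- dnorm(x, 0, sqrt(tau))
plot(x, normal, type = "l", bty = "l", ylim = c(0, 0.4),
     xlab = "coefficient", ylab = "prior density")
lines(x, pmom, lty = 2)
text(0, 0.2, "normal (mass piled up at 0)")    # label lines directly
text(2.8, 0.15, "pMOM (zero at 0)")

The spike-at-zero component of their mixture isn’t drawn; the point is just the contrast between a nonzero-coefficient density that vanishes at zero and one that peaks there.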

P.S. To answer Dave’s implicit question: I think Figure 1 would’ve worked better as three small graphs on a common scale. It would be more readable and actually take up less space. Also, have the y-axis go all the way down to zero, and remove the box (in R talk, use plot(..., bty="l")). Figures 2 through 4 would be better as denser grids of plots; that is, use more graphs and fewer lines per graph. Also, label the lines directly rather than with that legend, and for chrissake don’t have a probability scale that goes below 0 and above 1. Actually, what’s with those y-axes? 0, .002, .039, .5, .961, .998, 1. Huh?
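
To put that advice in code, here’s a generic base-R sketch with invented curves (the “method A”/“method B” labels and the data are placeholders of my own, not anything from Johnson and Rossell’s figures): probability axis pinned to [0, 1], no box, and direct labels instead of a legend.

# Invented data, just to show the recommended plotting style.
n <- seq(100, 1000, by = 100)
p_a <- 1 - exp(-n / 400)    # placeholder "probability of selecting the true model" curves
p_b <- 1 - exp(-n / 800)
plot(n, p_a, type = "l", bty = "l", ylim = c(0, 1),
     xlab = "sample size n", ylab = "posterior probability")
lines(n, p_b, lty = 2)
text(600, p_a[6] + 0.06, "method A")    # label lines directly, no legend
text(900, p_b[9] - 0.06, "method B")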