Answering two questions, one about Bayesian post-selection inference and one about prior and posterior predictive checks
Statistical Modeling, Causal Inference, and Social Science 2024-12-11
Richard Artner writes:
I have two questions. The first one is, of course, related to Bayesian post-selection inference (which I first asked you about five years ago). Back then, you admitted that Bayesians are not immune to overfitting when using the same data to modify/extend a model and to predict with it. Recently, you even mentioned this issue in your new article on Bayesian workflow. However, I have not yet seen research from you (or others) that investigates the effect of a data- (and thought-) driven workflow that replaces models with better ones (better likelihoods and better priors) on overfitting and poor out-of-sample prediction. After all, invalid p-values, confidence/credible intervals, and biased point estimates are just one side of the coin, with overfitting and poor predictions representing the other side. I am currently considering starting a research line that focuses on post-selection inference in Bayesian workflows, and any guidance (literature?) would be greatly appreciated.
The other question concerns the prior predictive distribution, whose meaning I struggle with a lot. Below is a quick summary of my issues/thoughts.
A classical, parametric statistical model for measurements is a family of probability distributions P indexed by a finite number of unknown parameters (e.g., the Gaussian family for a single measurement, which is determined by a location and a scale parameter). In the M-closed case, the probability distribution that corresponds to the true parameter values is the true data-generating process (TDGP). A Bayesian model consists of the likelihood in conjunction with a prior distribution. Because of this prior distribution, we can make predictions about the measurements via the prior predictive distribution, but when does it make sense to do so? The prior predictive is the marginal distribution of the measurements (i.e., the result of integrating out all model parameters in the joint distribution). Given its definition, the meaning of the prior predictive depends on the choice of prior distribution for the parameter vector. Here are three distinct interpretations of the prior distribution, which result in different interpretations of the prior predictive distribution:
View 1: The prior distribution is chosen such that the prior predictive distribution corresponds to (our best guess of) the TDGP.
View 2 (subjective Bayesianism): The prior describes our beliefs about the relative plausibilities of the probability distributions in the model family P.
View 3 (your preferred view?): The prior works as a regularization device that helps mitigate the impact of sampling error on model fit. As in View 1, the prior serves a pragmatic goal under this view, but the prior predictive can be very different from the TDGP.
Under View 1, the Bayesian model is akin to a fitted classical model. This view requires very strong priors as well as hierarchies (see the toy example below). Furthermore, it can be sensible to compare Bayesian models via Bayes factors under View 1. Under Views 2 and 3, Bayes-factor tests don’t make any sense, as they depend solely on the prior predictive distributions of the two models (each evaluated at the observed data).
Personally, I have drawn the following conclusions. The role of the prior predictive (and the posterior predictive) distribution is unclear under Views 2 and 3. The prior predictive can only be close to the TDGP if there is a hierarchy (e.g., a random-effects model in which the random effects (the parameters) are first drawn from a certain probability distribution, then data are generated using the drawn random effect; rinse and repeat). We can illustrate this with a toy example: three six-sided dice of equal size are in a bag and one of them is drawn to generate data. One die is fair, one has a 1 on two of its sides, and the third has a 1 on three of its sides.
Situation 1: We draw a die from the bag, roll it, and report whether it showed a 1. Afterwards, we return it to the bag and repeat the process.
Situation 2: We draw one die from the bag and repeatedly roll it (and always report whether it showed a 1).
For both situations, an obvious choice for a Bayesian model is a binomial likelihood with a parameter prior of P(theta=1/6)=P(theta=1/3)=P(theta=1/2)=1/3, since we know that each die has the same chance of being drawn. In Situation 1, the prior predictive is the TDGP. In Situation 2, it is not; there, the prior predictive is very different from the TDGP regardless of which die was drawn. Nevertheless, our Bayesian model is arguably the best possible model for this toy example (at least from View 2).
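To make this concrete, here is a minimal numerical sketch in Python (the check of two consecutive rolls is just an arbitrary way of showing that the joint distributions differ as well):

import numpy as np

thetas = np.array([1/6, 1/3, 1/2])  # P(rolling a 1) for the fair die and the two loaded dice
prior = np.array([1/3, 1/3, 1/3])   # each die is equally likely to be drawn from the bag

# Prior predictive probability that a single reported roll shows a 1:
# integrate (here: sum) the data model over the prior.
prior_pred_one = np.sum(prior * thetas)  # = 1/3

# Situation 1: the die is redrawn before every roll, so the true per-roll
# probability of a 1 is the same mixture, 1/3; the prior predictive matches the TDGP.
tdgp_sit1_one = np.sum(prior * thetas)   # = 1/3

# Situation 2: one die is drawn once and kept; the TDGP depends on that die.
tdgp_sit2_one = thetas                   # 1/6, 1/3, or 1/2

print("prior predictive P(1):", prior_pred_one)
print("Situation 1 TDGP P(1):", tdgp_sit1_one)
print("Situation 2 TDGP P(1), by die:", tdgp_sit2_one)

# Even for the die with theta = 1/3, the joint distributions differ in Situation 2:
# under the prior predictive, two rolls are dependent (P(1,1) = E[theta^2]),
# while under any fixed die they are independent (P(1,1) = theta^2).
print("prior predictive P(1,1):", np.sum(prior * thetas**2))  # about 0.13, not (1/3)^2
print("fixed-die P(1,1), by die:", thetas**2)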
Given the above observations/conclusions, I find it difficult to agree with conclusions such as “Bayes factors measure prior predictive performance” and “Bayes factors evaluate priors, cross validations evaluate posteriors.” To me, the conclusions drawn by Bob Carpenter in those two blog posts only hold under View 1 (which is really only possible in special cases that involve hierarchies) but not under Views 2 and 3.
How do you interpret the prior predictive? What is its relation to the TDGP? When are Views 1, 2, and 3 reasonable? Are there other relevant views on the prior predictive that are not considered here?
Given that I am unsure about the meaning of the prior predictive, I also struggle with the meaning of the posterior predictive. In particular, it is unclear to me why one would want to make predictions via the posterior predictive under Views 2 and 3. If there is little available data, it will be similar to the prior predictive and very different from the TDGP. If there is plenty of data, we could use the posterior predictive and we would be fine (due to the Bernstein-von Mises theorem), but we could also just work with posterior quantiles in that case. Isn’t it always better to predict using posterior quantiles (e.g., the median for our best guess and the .1 and .9 quantiles to express uncertainty) instead of the posterior predictive under View 3?
My reply:
In answer to the first question about workflow and post-selection inference, I’d recommend that you look at model-building workflow (building models, checking models, expanding models) as part of a larger scientific workflow, which also includes substantive theory, measurement, and data collection. In statistics and machine learning—theory and application alike—we focus so much on the analysis of some particular dataset in the context of some existing theory. But real science and engineering almost always involves designing new experiments, incorporating new data into our analyses, and trying out new substantive models. From that perspective, looking at post-selection inference is fine—it represents a sort of minimal adjustment of an analysis, in the same way that the sampling standard error from a survey is a minimal statement of uncertainty, representing uncertainty under ideal conditions, not real conditions.
In my own applied work, the sorts of adjustments that would be needed to adjust for model selection would be pretty minor, so it’s not where I focus my theoretical or methodological research efforts.
For bad analyses, it’s another story. For example, if Daryl Bem of those notorious ESP papers were to have used Bayesian methods, he still could’ve selected the hell out of things and found big fat posterior probabilities. But then I think the appropriate way to address this problem would not be to take these (hypothetical) apparently strong results and adjust them for model selection. Rather, I would recommend modeling all the data together using some sort of hierarchical model, which would have the effect of partially pooling estimates toward zero.
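To show what I mean by partial pooling, here is a tiny sketch with made-up numbers (a normal-normal hierarchical model with the population of effects centered at zero; in a real analysis the between-study scale tau would itself be estimated rather than fixed):

import numpy as np

# Hypothetical study-level estimates and standard errors (made up for illustration).
y = np.array([0.25, 0.10, -0.05, 0.30, 0.15])
sigma = np.array([0.10, 0.12, 0.10, 0.15, 0.11])

# Model: theta_j ~ normal(0, tau^2), y_j | theta_j ~ normal(theta_j, sigma_j^2).
# For a given tau, the posterior mean of each theta_j is the raw estimate shrunk toward zero.
tau = 0.05  # made-up value for illustration
shrinkage = tau**2 / (tau**2 + sigma**2)
theta_post_mean = shrinkage * y

print("raw estimates:   ", y)
print("partially pooled:", np.round(theta_post_mean, 3))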
To me, doing poorly-motivated model selection and then trying to clean it up statistically is kinda like making a big mess and then trying to clean it up, or blowing something up and then trying to put it back together. I’d rather try to do something reasonable in the first place. And then, yes, there are still selection issues—there’s not one single reasonable hierarchical model, or only one single reasonable regularized machine learning algorithm, or whatever—but the selection becomes a much smaller part of the problem, which in practice gets subsumed by multiple starting points, cross validation, new data and theories, external validation, etc.
In answer to the second question about prior and predictive distributions, let me start by correcting this statement of yours: “A Bayesian model consists of the likelihood in conjunction with a prior distribution.” The more accurate way to put this is: A Bayesian model consists of a data model in conjunction with a prior distribution. The data model is the family of probability distributions p(y|theta). The likelihood is p(y|theta), considered as a function of theta for the observed data, y. As discussed in chapters 6, 7, and 8 of BDA3, the data model and the likelihood are not the same thing. There can be many data models that correspond to the same likelihood function. For Bayesian inference conditional on the data and model, you don’t need the data model + prior, you only need the likelihood + prior. But for model checking—prior predictive checking, posterior predictive checking, and everything in between (really, all of this can be considered as different forms of posterior predictive checking, conditioning on different things)—the likelihood isn’t enough; you need the data model too. Again, we have some examples of this in BDA, the simplest of which is a switch from a binomial sampling model to a negative-binomial sampling model with the same likelihood but different predictive distributions.
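Here is a quick numerical illustration of that last point (the particular counts, the uniform prior, and the use of scipy are my arbitrary choices): the two data models give likelihoods that are proportional in theta, hence the same posterior, but their predictive replications are different random variables.

import numpy as np
from scipy.stats import beta, binom, nbinom

rng = np.random.default_rng(1)
y, n = 3, 20  # say, 3 successes observed in 20 trials

# Binomial data model (n fixed in advance) vs. negative-binomial data model
# (keep sampling until y successes): the likelihoods differ only by a constant factor.
theta_grid = np.linspace(0.05, 0.95, 5)
lik_binomial = binom.pmf(y, n, theta_grid)
lik_negbinom = nbinom.pmf(n - y, y, theta_grid)  # n - y failures before the y-th success
print("likelihood ratio (constant in theta):", lik_binomial / lik_negbinom)

# Same posterior under a uniform Beta(1,1) prior: Beta(1 + y, 1 + n - y).
theta_draws = beta(1 + y, 1 + n - y).rvs(10000, random_state=rng)

# Different predictive distributions: replicated successes out of 20 trials
# vs. replicated number of trials needed to reach 3 successes.
yrep_binomial = rng.binomial(n, theta_draws)
nrep_negbinom = y + rng.negative_binomial(y, theta_draws)
print("binomial replications:          mean =", yrep_binomial.mean())
print("negative-binomial replications: mean =", nrep_negbinom.mean())

The likelihood ratio is the same number at every value of theta, so any prior gives the same posterior under the two data models, but the replicated datasets you would look at in a predictive check are different.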
You ask, “when does it make sense” to make prior or posterior predictions. My answer is that you can interpret such predictions directly, in a literal sense as predictions of hypothetical future or alternative data based on the model. Suppose you have a hierarchical model with modeled data y, local parameters alpha, hyperparameters phi, and unmodeled data x, and your posterior distribution is p(alpha,phi|x,y) proportional to p(phi|x)p(alpha|phi,x)p(y|alpha,phi,x). Just to fix ideas, think of alpha as corresponding to an “urn” from which the data y are drawn, and think of phi as a “room” that is full of urns, each of which corresponds to a different value of alpha. Finally, think of the prior distribution of phi as a “building” full of rooms. The building is your model.
So the generative model is: Go to the building, sample a room at random from that building, then sample an urn at random from the urns in that room, then sample data y from your urn.
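In code, the generative story looks like this (the normal distributions below are just placeholders standing in for p(phi|x), p(alpha|phi,x), and p(y|alpha,phi,x)):

import numpy as np

rng = np.random.default_rng(0)

def draw_room(rng):               # phi ~ p(phi|x): pick a room in the building
    return rng.normal(0.0, 1.0)

def draw_urn(phi, rng):           # alpha ~ p(alpha|phi,x): pick an urn in that room
    return rng.normal(phi, 0.5)

def draw_data(alpha, rng, n=10):  # y ~ p(y|alpha,phi,x): draw data from that urn
    return rng.normal(alpha, 1.0, size=n)

phi = draw_room(rng)
alpha = draw_urn(phi, rng)
y = draw_data(alpha, rng)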
Suppose you fit the model and obtain posterior simulations (alpha,phi)^s, s=1,…,S.
– You can simulate new data y* from the prior predictive distribution, which would correspond to picking a new room, a new urn, and new data. For each simulation s, you can do this by drawing phi* from p(phi|x), then drawing alpha* from p(alpha|phi*,x), then drawing y* from p(y|alpha*,phi*,x).
– Or you can simulate new data y* from the posterior predictive distribution for new data from new groups, which would correspond to staying in the same room but then drawing a new urn and new data from that urn. For each simulation s, you can do this by keeping phi^s, then drawing alpha* from p(alpha|phi^s,x), then drawing y* from p(y|alpha*,phi^s,x).
– Or you can simulate new data y* from the posterior predictive distribution for new data from existing groups, which would correspond to staying in the same room and keeping the same urn and then drawing new data from that urn. For each simulation s, you can do this by keeping phi^s, keeping alpha^s, then drawing y* from p(y|alpha^s,phi^s,x).
These three different distributions correspond to different scenarios. They can all be interpreted directly. This has nothing to do with “relative plausibility” or whatever; they’re just predictive distributions of what you might see in alternative rooms and urns, if the model were true.
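Here is a sketch of those three kinds of replications, using the same placeholder normal distributions as above; the arrays phi_s and alpha_s are stand-ins for wherever your actual posterior simulations (alpha,phi)^s come from:

import numpy as np

rng = np.random.default_rng(0)
S, n = 4000, 10  # number of posterior simulations and observations per group (placeholders)

# Stand-ins for the posterior simulations (phi^s, alpha^s); in practice these come
# from fitting the model (e.g., in Stan).
phi_s = rng.normal(0.0, 0.3, size=S)
alpha_s = rng.normal(phi_s, 0.5)

# (1) Prior predictive: new room, new urn, new data.
phi_star = rng.normal(0.0, 1.0, size=S)       # phi* ~ p(phi|x)
alpha_star = rng.normal(phi_star, 0.5)        # alpha* ~ p(alpha|phi*,x)
y_prior_pred = rng.normal(alpha_star[:, None], 1.0, size=(S, n))

# (2) Posterior predictive for new groups: keep phi^s, draw a new urn and new data.
alpha_new = rng.normal(phi_s, 0.5)            # alpha* ~ p(alpha|phi^s,x)
y_new_group = rng.normal(alpha_new[:, None], 1.0, size=(S, n))

# (3) Posterior predictive for existing groups: keep phi^s and alpha^s, draw new data only.
y_same_group = rng.normal(alpha_s[:, None], 1.0, size=(S, n))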
It’s not necessary to describe this all using multilevel models—you can distinguish between prior and posterior predictive checks with a simple model with just y and theta—but I find the multilevel modeling framework to be helpful in that it allows me to better visualize the different sorts of replications being considered. You could also consider other distributions in a non-nested model, for example predictive data on new patients, new time periods, new hospitals, etc.
Regarding Bob’s posts about Bayes factors and prior predictive checks: I take his point to be not philosophical but mathematical. His point is that the Bayes factor is mathematically an integration over the prior predictive distribution. This is obvious—you can just look at the integral—but it seems that people get confused about the Bayes factor because they look at it in terms of what it is supposed to do (give the posterior probability that a model is true, something that I think is typically the wrong question to ask, for reasons discussed in chapter 7 of BDA3 and also this article with Shalizi) rather than what it does. In that sense, Bob’s post fits into a long tradition of statistical research and exposition including Neyman, Tukey, and others who work to understand methods in terms of what they actually do. This does not address your questions about prior and posterior predictive distributions; for that I refer you to the above paragraph about rooms and urns, which is largely drawn from my 1996 paper with Meng and Stern. You just have to read Bob’s posts literally, not as Bayesian positions but as agnostic “machine learning” descriptions of what prior and posterior predictive checks do.
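Just to spell out the “look at the integral” point numerically: the marginal likelihood p(y|M) is the prior predictive density evaluated at the observed data, and the Bayes factor is a ratio of two of them. Here is a sketch using the dice example from above, comparing the three-point prior to a uniform prior on theta (that second model is something I made up just for the comparison, and the data are arbitrary):

import numpy as np
from scipy.stats import binom
from scipy.integrate import quad

y_obs, n = 4, 12  # arbitrary observed data: 4 ones in 12 reported rolls

# Model 1: binomial data model with the discrete prior over the three dice.
thetas = np.array([1/6, 1/3, 1/2])
prior1 = np.array([1/3, 1/3, 1/3])
marginal1 = np.sum(prior1 * binom.pmf(y_obs, n, thetas))  # prior predictive at the observed data

# Model 2: same data model with a uniform prior on theta.
marginal2, _ = quad(lambda t: binom.pmf(y_obs, n, t), 0.0, 1.0)  # equals 1/(n+1)

print("p(y|M1) =", marginal1)
print("p(y|M2) =", marginal2)
print("Bayes factor, M1 vs. M2 =", marginal1 / marginal2)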