LOST CAUSES IN STATISTICS II: Noninformative Priors

Normal Deviate 2013-07-16

I thought I would post at a higher frequency in the summer. But I have been working hard to finish some papers, which has kept me quite busy. So, apologies for the paucity of posts.

Today I’ll discuss another lost cause: noninformative priors.

I like to say that noninformative priors are the perpetual motion machines of statistics. Everyone wants one but they don’t exist.

By definition, a prior represents information. So it should come as no surprise that a prior cannot represent lack of information.

The first “noninformative prior” was of course the flat prior. The major flaw with this prior is lack of invariance: if it is flat in one parameterization it will not be flat in most other parameterizations. Flat priors have lots of other problems too. See my earlier post here.
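To make the lack of invariance concrete: if {\pi(\theta) \propto 1} and we reparameterize as {\phi = e^\theta}, then the change-of-variables formula gives the induced prior {\pi(\phi) \propto 1/\phi}, which is not flat at all; it piles up mass near 0. “Flat” is a statement about one particular parameterization, not about the absence of information.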

The most famous noninformative prior (I’ll stop putting quotes around the phrase from now on) is Jeffreys prior, which is proportional to the square root of the determinant of the Fisher information matrix. While this prior is invariant, it can still have undesirable properties. In particular, while it may seem noninformative for a parameter {\theta}, it can end up being highly informative for functions of {\theta}. For example, suppose that {Y} is multivariate Normal with mean vector {\theta} and identity covariance. The Jeffreys prior is the flat prior {\pi(\theta) \propto 1}. Now suppose that we want to infer {\psi = \sum_j \theta_j^2}. The resulting posterior for {\psi} is a disaster. The coverage of the Bayesian {1-\alpha} posterior interval can be close to 0.
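You can see how badly this fails in a quick simulation. Here is a minimal sketch (my own illustration; the dimension and the replication counts are arbitrary choices): draw {Y \sim N(\theta, I_d)}, use the flat-prior posterior {\theta | Y \sim N(Y, I_d)}, and record how often the central 95% posterior interval for {\psi = \sum_j \theta_j^2} contains the true value.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 100        # dimension of theta (arbitrary)
    n_reps = 500   # number of simulated data sets
    n_post = 2000  # posterior draws per data set
    theta = np.zeros(d)    # true parameter, so the true psi is 0
    psi_true = float(np.sum(theta**2))

    covered = 0
    for _ in range(n_reps):
        y = theta + rng.standard_normal(d)           # Y ~ N(theta, I_d)
        post = y + rng.standard_normal((n_post, d))  # flat prior: theta | Y ~ N(Y, I_d)
        psi = np.sum(post**2, axis=1)                # induced posterior draws of psi
        lo, hi = np.quantile(psi, [0.025, 0.975])    # central 95% posterior interval
        covered += (lo <= psi_true <= hi)

    print(f"coverage of the 95% posterior interval: {covered / n_reps:.3f}")

With {\theta = 0} the truth is {\psi = 0}, but the posterior for {\psi} concentrates near {2d} (each coordinate contributes about 1 from the noise in {Y} and 1 from the posterior draw), so the printed coverage is essentially zero.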

This is a general problem with noninformative priors. If {\pi(\theta)} is somehow noninformative for {\theta}, it may still be highly informative for sub-parameters, that is, for functions {\psi = g(\theta)} where {\theta\in \mathbb{R}^d} and {g: \mathbb{R}^d \rightarrow \mathbb{R}}.

Jim Berger and Jose Bernardo wrote a series of interesting papers about priors that were targeted to be noninformative for particular functions of {\theta}. These are often called reference priors. But what if you are interested in many functions of {\theta}? Should you use a different prior for each function of interest?

A more fundamental question is: what does it mean for a prior to be noninformative? Of course, people have argued about this for many, many years. One definition, which has the virtue of being somewhat precise, is that a prior is noninformative if the {1-\alpha} posterior regions have frequentist coverage equal (approximately) to {1-\alpha}. These are sometimes called “matching priors.”
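In the simplest cases, matching is exact. If {X_1, \ldots, X_n \sim N(\theta, 1)} and the prior is flat, the posterior is {N(\bar{X}, 1/n)} and the 95% posterior interval {\bar{X} \pm 1.96/\sqrt{n}} is exactly the usual confidence interval, so its frequentist coverage is exactly 95%. A minimal sketch of the coverage check (again my own illustration; the constants are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_reps, theta = 20, 5000, 1.7   # sample size, replications, true mean (all arbitrary)
    z = 1.96                           # 97.5% standard normal quantile

    covered = 0
    for _ in range(n_reps):
        xbar = (theta + rng.standard_normal(n)).mean()
        # Flat prior: posterior is N(xbar, 1/n); central 95% posterior interval:
        lo, hi = xbar - z / np.sqrt(n), xbar + z / np.sqrt(n)
        covered += (lo <= theta <= hi)

    print(f"coverage: {covered / n_reps:.3f}")   # about 0.95, as matching predicts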

In general, it is hard to construct matching priors, especially in high-dimensional, complex models. But matching priors raise a fundamental question: if your goal is to match frequentist coverage, why bother with Bayes at all? Just use a frequentist confidence interval.

These days I think that most people agree that the virtue of Bayesian methods is that they give you a systematic way to include prior information. There is no reason to formulate a “noninformative prior.”

On the other hand, in practice, we often deal with very complex, high-dimensional models. Can we really formulate a meaningful informative prior in such problems? And if we do, will anyone care about our inferences?

In 1996, I wrote a review paper with Rob Kass on noninformative priors (Kass and Wasserman 1996). We emphasized that a better term might be “default prior” since that seems more honest and promises less. One of our conclusions was:

“We conclude that the problems raised by the research on priors chosen by formal rules are serious and may not be dismissed lightly: When sample sizes are small (relative to the number of parameters being estimated), it is dangerous to put faith in any default solution; but when asymptotics take over, Jeffreys’s rules and their variants remain reasonable choices.”

Looking at this almost twenty years later, the one thing that has changed is “the number of parameters being estimated,” which these days is often very, very large.

My conclusion: noninformative priors are a lost cause.

Reference

Kass, Robert E. and Wasserman, Larry (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343–1370.