TO CONDITION, OR NOT TO CONDITION, THAT IS THE QUESTION

Normal Deviate 2013-03-15

Between the completely conditional world of Bayesian inference and the completely unconditional world of frequentist inference lies the hazy world of conditional inference.

The extremes are easy. In Bayesian-land you condition on all of the data. In Frequentist-land, you condition on nothing. If your feet are firmly planted in either of these idyllic places, read no further! Because, conditional inference is:

The undiscovered Country, from whose bourn No Traveller returns, Puzzles the will, And makes us rather bear those ills we have, Than fly to others that we know not of.

1. The Extremes

As I said above, the extremes are easy. Let’s start with a concrete example. Let {Y_1,\ldots, Y_n} be a sample from {P\in {\cal P}}. Suppose we want to estimate {\theta = T(P)}; for example, {T(P)} could be the mean of {P}.

Bayesian Approach: Put a prior {\pi} on {P}. After observing the data {Y_1,\ldots, Y_n} compute the posterior for {P}. This induces a posterior for {\theta} given {Y_1,\ldots, Y_n}. We can then make statements like

\displaystyle  \pi( \theta\in A|Y_1,\ldots, Y_n) = 1-\alpha.

The statements are conditional on {Y_1,\ldots, Y_n}. There is no question about what to condition on; we condition on all the data.

Frequentist Approach: Construct a set {C_n = C(Y_1,\ldots, Y_n)}. We require that

\displaystyle  \inf_{P\in {\cal P}} P^n \Bigl( T(P)\in C_n \Bigr) \geq 1-\alpha

where {P^n = P\times \cdots \times P} is the distribution corresponding to taking {n} samples from {P}. We then call {C_n} a {1-\alpha} confidence set. No conditioning takes place. (Of course, we might want more than just the guarantee in the above equation, like some sort of optimality; but let’s not worry about that here.)
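For concreteness, here is a tiny numerical sketch of the two extremes for a normal mean. Everything in it (the conjugate {N(0,100)} prior, the sample size, the true mean) is my own illustrative choice, nothing canonical:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, true_mean = 0.05, 20, 2.0

# Bayesian: condition on one observed data set.
# Conjugate model: Y_i ~ N(theta, 1) with prior theta ~ N(0, 100).
y = rng.normal(true_mean, 1.0, size=n)
post_var = 1.0 / (1.0 / 100.0 + n)                 # posterior variance
post_mean = post_var * y.sum()                     # posterior mean
lo, hi = stats.norm.interval(1 - alpha, loc=post_mean, scale=np.sqrt(post_var))
print("posterior interval given THIS data:", (lo, hi))

# Frequentist: no conditioning; the guarantee is over repetitions.
z = stats.norm.ppf(1 - alpha / 2)
m = rng.normal(true_mean, 1.0, size=(100_000, n)).mean(axis=1)
print("coverage over repetitions:", np.mean(np.abs(m - true_mean) <= z / np.sqrt(n)))

The first statement is about this one data set; the second is about the long run of repetitions, with no conditioning anywhere.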

(I notice that Andrew often says that frequentists “condition on {\theta}”. I think he means that they do calculations for each fixed {P}. At the risk of being pedantic, this is not conditioning. To condition on {P} requires that {P} be a random variable, which it is in the Bayesian framework but not in the frequentist framework. But I am probably just nitpicking here.)

2. So Why Condition?

Suppose you are taking the frequentist route. Why would you be enticed to condition? Consider the following example from Berger and Wolpert (1988).

I write down a real number {\theta}. I then generate two random variables {Y_1, Y_2} as follows:

\displaystyle  Y_1 = \theta + \epsilon_1,\ \ \ Y_2 = \theta + \epsilon_2

where {\epsilon_1} and {\epsilon_2} are iid and

\displaystyle  P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}.

Let {P_\theta} denote the distribution of the data {(Y_1,Y_2)}. The set of distributions is {{\cal P} = \{ P_\theta:\ \theta\in\mathbb{R}\}}.

I show Fred the frequentist {Y_1} and {Y_2} and he has to infer {\theta}. Fred comes up with the following confidence set:

\displaystyle  C(Y_1,Y_2) = \begin{cases} \left\{ \frac{Y_1+Y_2}{2} \right\} & \mbox{if}\ Y_1 \neq Y_2\\ \left\{ Y_1-1 \right\} & \mbox{if}\ Y_1 = Y_2. \end{cases}

Now, it is easy to check that, no matter what value {\theta} takes, we have that

\displaystyle  P_\theta\Bigl(\theta\in C(Y_1,Y_2)\Bigr) = \frac{3}{4}\ \ \ \mbox{for every}\ \theta\in \mathbb{R}.

Fred is happy. {C(Y_1,Y_2)} is a 75 percent confidence interval.

To be clear: if I play this game with Fred every day, and I use a different value of {\theta} every day, we will find that Fred traps the true value 75 percent of the time.
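If you want to check the 75 percent claim, here is a quick simulation sketch of the game. The scheme for picking a fresh {\theta} each day is arbitrary, and I draw integer values only so that the equality test below is exact in floating point:

import numpy as np

rng = np.random.default_rng(1)
trials = 200_000
theta = rng.integers(-1000, 1000, size=trials)  # a different theta every day
eps = rng.choice([-1, 1], size=(trials, 2))     # iid +/-1, probability 1/2 each
y1, y2 = theta + eps[:, 0], theta + eps[:, 1]
c = np.where(y1 != y2, (y1 + y2) / 2, y1 - 1)   # Fred's procedure
print("coverage:", np.mean(c == theta))         # prints roughly 0.75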

Now suppose the data are {(Y_1,Y_2) = (17,19)}. Fred reports that his 75 percent confidence interval is {\{18\}}. Fred is correct that his procedure has 75 percent coverage. But in this case, many people are troubled by reporting that {\{18\}} is a 75 percent confidence interval, because with these data we know that {\theta} must be 18. Indeed, if we did a Bayesian analysis with a prior that puts positive density on each {\theta}, we would find that {\pi(\theta=18|Y_1=17,Y_2=19) = 1}.

So, we are 100 percent certain that {\theta = 18} and yet we are reporting {\{18\}} as a 75 percent confidence interval.

There is nothing wrong with the confidence interval. It is a procedure, and the procedure comes with a frequency guarantee: it will trap the truth 75 percent of the time. It does not agree with our degrees of belief but no one said it should.

And yet Fred thinks he can retain his frequentist credentials and still do something which intuitively feels better. This is where conditioning comes in.

Let

\displaystyle  A = \begin{cases} 1 & \mbox{if}\ Y_1 \neq Y_2\\ 0 & \mbox{if}\ Y_1 = Y_2. \end{cases}

The statistic {A} is an ancillary: it has a distribution that does not depend on {\theta}. In particular, {P_\theta(A=1) =P_\theta(A=0) =1/2} for every {\theta}. The idea now is to report confidence, conditional on {A}. Our new procedure is:

If {A=1}, report {C=\{ (Y_1 + Y_2)/2 \}} with confidence level 1. If {A=0}, report {C=\{ Y_1-1 \}} with confidence level 1/2.

This is indeed a valid conditional confidence interval. Again, imagine we play the game over a long sequence of trials. On the subsequence for which {A=1}, our interval contains the true value 100 percent of the time. On the subsequence for which {A=0}, our interval contains the true value 50 percent of the time.
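Splitting the tally of the earlier simulation sketch by the ancillary shows both numbers at once:

import numpy as np

rng = np.random.default_rng(2)
trials = 200_000
theta = rng.integers(-1000, 1000, size=trials)  # integer thetas keep equality exact
eps = rng.choice([-1, 1], size=(trials, 2))
y1, y2 = theta + eps[:, 0], theta + eps[:, 1]
a = (y1 != y2)                                  # the ancillary A
hit = (np.where(a, (y1 + y2) / 2, y1 - 1) == theta)
print("coverage given A=1:", hit[a].mean())     # roughly 1.00
print("coverage given A=0:", hit[~a].mean())    # roughly 0.50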

We still have valid coverage and a more intuitive confidence interval. Our result is identical to the Bayesian answer if the Bayesian uses a flat prior. It is nearly equal to the Bayesian answer if the Bayesian uses a proper but very flat prior.

(This is an example where the Bayesian has the upper hand. I’ve had other examples on this blog where the frequentist does better than the Bayesian. To readers who attach themselves to either camp: remember, there is plenty of ammunition in terms of counterexamples on BOTH sides.)

Another famous example is from Cox (1958). Here is a modified version of that example. I flip a coin. If the coin is HEADS I give Fred {Y \sim N(\theta,\sigma_1^2)}. If the coin is TAILS I give Fred {Y \sim N(\theta,\sigma_2^2)}, where {\sigma_1^2 > \sigma_2^2}. What should Fred’s confidence interval for {\theta} be?

We can condition on the coin and report the usual confidence interval corresponding to the appropriate Normal distribution. But if we look unconditionally, over replications of the whole experiment, and minimize the expected length of the interval, we get an interval that has coverage less than {1-\alpha} for HEADS and greater than {1-\alpha} for TAILS. So optimizing unconditionally pulls us away from what seems to be the correct conditional answer.
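Here is a sketch of that phenomenon. Instead of deriving the exact length-optimal unconditional procedure, I calibrate a single fixed half-width to have 95 percent coverage on average over the coin flip; this is a simplification, but it errs in the same direction, and the values of {\sigma_1} and {\sigma_2} are my own choices:

from scipy import stats
from scipy.optimize import brentq

s1, s2, alpha = 3.0, 1.0, 0.05  # sigma_1 > sigma_2

def arm_coverage(c, s):         # P(|Y - theta| <= c) when Y ~ N(theta, s^2)
    return 2 * stats.norm.cdf(c / s) - 1

# One fixed half-width c, calibrated to 1 - alpha coverage averaged over the coin.
c = brentq(lambda x: 0.5 * arm_coverage(x, s1)
                   + 0.5 * arm_coverage(x, s2) - (1 - alpha), 1e-3, 50.0)
print("HEADS coverage:", arm_coverage(c, s1))   # below 0.95
print("TAILS coverage:", arm_coverage(c, s2))   # above 0.95
# The conditional interval Y +/- z * sigma_coin covers exactly 95 percent
# on each subsequence, by construction.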

3. The Problem With Conditioning

There are lots of simple examples like the ones above where, psychologically, it just feels right to condition on something. But simple intuition is misleading. We would still be using Newtonian physics if we went by our gut feelings.

In complex situations, it is far from obvious if we should condition or what we should condition on. Let me review a simplified version of Larry Brown’s (1990) example that I discussed here. You observe {(X_1,Y_1), \ldots, (X_n,Y_n)} where

\displaystyle  Y_i = \beta^T X_i + \epsilon_i,

{\epsilon_i \sim N(0,1)}, {n=100} and each {X_i = (X_{i1},\ldots, X_{id})} is a vector of length {d=100,000}. Suppose further that the {d} covariates are independent. We want to estimate {\beta_1}.

The “best” estimator (the maximum likelihood estimator) is obtained by conditioning on all the data. This means we should estimate the vector {\beta} by least squares. But, the least squares estimator is useless when {d> n}.

From the Bayesian point of view, we compute the posterior

\displaystyle  \pi\Bigl(\beta_1 \Bigm| (X_1,Y_1),\ldots, (X_n,Y_n)\Bigr)

which, for such a large {d}, will be useless (completely dominated by the prior).

These estimators have terrible behavior compared to the following “anti-conditioning” estimator. Throw away all the covariates except the first one. Now do linear regression using only {Y} and the first covariate. The resulting estimator {\hat\beta_1} is then tightly concentrated around {\beta_1} with high probability. In this example, throwing away data is much better than conditioning on the data. There are some papers on “forgetful Bayesian inference” where one conditions on only part of the data. This is fine, but then we are back to the original question: what do we condition on?
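A scaled-down simulation sketch makes the comparison concrete. I shrink {d} from 100,000 to 2,000 to keep it fast, and I take {\beta_1 = 2} with all the other coordinates of {\beta} equal to zero; that sparse {\beta} is my own illustrative choice, not Brown's setup:

import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 2000
beta = np.zeros(d)
beta[0] = 2.0                               # assumed: only beta_1 is nonzero
X = rng.normal(size=(n, d))                 # independent covariates
y = X @ beta + rng.normal(size=n)

# "Condition on all the data": least squares is not unique when d > n;
# the minimum-norm solution is one standard choice, and it is useless here.
bhat_full = X.T @ np.linalg.solve(X @ X.T, y)
print("full least squares:", bhat_full[0])  # badly shrunk toward zero

# "Anti-conditioning": throw away every covariate except the first.
x1 = X[:, 0]
print("X_1-only regression:", (x1 @ y) / (x1 @ x1))  # close to 2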

There are many other examples like this one.

4. The Answer

It would be nice if there were a clear answer such as “you should always condition” or “you should never condition.” But there isn’t. Do a Google Scholar search on conditional inference and you will find an enormous literature. What started as a simple, compelling idea has evolved into a complex research area. Many of these conditional methods are very sophisticated and rely on second order asymptotics. But it is rare to see anyone use conditional inference in complex problems, with the exception of Bayesian inference, which, some will argue, goes for a definite, psychologically satisfying answer at the expense of thinking hard about the properties of the resulting procedures.

Unconditional inference is simple and avoids disasters. The cost is that we can sometimes get psychologically unsatisfying answers. Conditional inference yields more psychologically satisfying answers but can lead to procedures with disastrous behavior.

There is no substitute for thinking. Be skeptical of easy answers.

Thus Conscience does make Cowards of us all, And thus the Native hue of Resolution Is sicklied o’er, with the pale cast of Thought,

References

Berger, J.O. and Wolpert, R.L. (1988). The Likelihood Principle. Institute of Mathematical Statistics.

Brown, L.D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, 18, 471-493.

Cox, D.R. (1958). Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29, 357-372.