STEIN’S PARADOX

Normal Deviate 2013-05-19

Something that is well known in the statistics world but perhaps less well known in the machine learning world is Stein’s paradox.

When I was growing up, people used to say: do you remember where you were when you heard that JFK died? (I was three, so I don’t remember. My first memory is watching the Beatles on Ed Sullivan.)

Similarly, statisticians used to say: do you remember where you were when you heard about Stein’s paradox? That’s how surprising it was. (I don’t remember since I wasn’t born yet.)

Here is the paradox. Let {X \sim N(\theta,1)}. Define the risk of an estimator {\hat\theta} to be

\displaystyle  R_{\hat\theta}(\theta) = \mathbb{E}_\theta (\hat\theta-\theta)^2 = \int (\hat\theta(x) - \theta)^2 p(x;\theta) dx.
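
For example (a quick calculation from the definition, not spelled out in the post), the estimator {\hat\theta(X) = X} has constant risk:

\displaystyle  R_{\hat\theta}(\theta) = \mathbb{E}_\theta (X-\theta)^2 = {\rm Var}_\theta(X) = 1 \ \ {\rm for\ all\ }\theta.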

An estimator {\hat\theta} is inadmissible if there is another estimator {\theta^*} with smaller risk. In other words, if

\displaystyle  R_{\theta^*}(\theta) \leq R_{\hat\theta}(\theta) \ \ {\rm for\ all\ }\theta

with strict inequality for at least one {\theta}.

Question: Is {\hat \theta \equiv X} admissible? Answer: Yes.

Now suppose that {X \sim N(\theta,I)} where now {X=(X_1,X_2)^T}, {\theta = (\theta_1,\theta_2)^T} and

\displaystyle  R_{\hat\theta}(\theta) = \mathbb{E}_\theta ||\hat\theta - \theta||^2.

Question: Is {\hat \theta \equiv X} admissible? Answer: Yes.

Now suppose that {X \sim N(\theta,I)} where now {X=(X_1,X_2,X_3)^T}, {\theta = (\theta_1,\theta_2,\theta_3)^T} and

\displaystyle  R_{\hat\theta}(\theta) = \mathbb{E}_\theta ||\hat\theta - \theta||^2.

Question: Is {\hat \theta \equiv X} admissible? Answer: No!

If you don’t find this surprising then either you’ve heard this before or you’re not thinking hard enough. Keep in mind that the coordinates of the vector {X} are independent. And the {\theta_j}’s could have nothing to do with each other. For example, {\theta_1 = } mass of the moon, {\theta_2 = } price of coffee and {\theta_3 = } temperature in Rome.

In general, {\hat\theta \equiv X} is inadmissible if the dimension {k} of {\theta} satisfies {k \geq 3}.

The proof that {X} is inadmissible is based on defining an explicit estimator {\theta^*} that has smaller risk than {X}. For example, the James-Stein estimator is

\displaystyle  \theta^* = \left( 1 - \frac{k-2}{||X||^2}\right) X.

It can be shown that the risk of this estimator is strictly smaller than the risk of {X}, for all {\theta}. This implies that {X} is inadmissible. If you want to see the detailed calculations, have a look at Iain Johnstone’s book, which he makes freely available on his website.
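
If you just want to see the effect numerically, here is a minimal Monte Carlo sketch in Python/numpy (the dimension {k=10}, the particular {\theta}, and the simulation size are arbitrary illustrative choices, not from the post) comparing the estimated risks of {X} and the James-Stein estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_sim = 10, 100_000           # dimension and number of Monte Carlo draws (arbitrary choices)
theta = np.full(k, 2.0)          # an arbitrary true mean vector; any theta works

X = rng.normal(loc=theta, scale=1.0, size=(n_sim, k))

# James-Stein estimator: shrink each draw of X towards the origin.
norm_sq = np.sum(X**2, axis=1, keepdims=True)
js = (1 - (k - 2) / norm_sq) * X

risk_mle = np.mean(np.sum((X - theta)**2, axis=1))   # risk of X, should be close to k
risk_js = np.mean(np.sum((js - theta)**2, axis=1))   # risk of James-Stein, strictly smaller

print(f"estimated risk of X:           {risk_mle:.3f}")
print(f"estimated risk of James-Stein: {risk_js:.3f}")
```

With these settings the James-Stein risk comes out noticeably below {k}, in line with the claim above.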

Note that the James-Stein estimator shrinks {X} towards the origin. (In fact, you can shrink towards any point; there is nothing special about the origin.) This can be viewed as an empirical Bayes estimator where {\theta} has a prior of the form {\theta \sim N(0,\tau^2)} and {\tau} is estimated from the data. The Bayes explanation gives some nice intuition. But it’s also a bit misleading. The Bayes explanation suggests we are shrinking the means together because we expect them a priori to be similar. But the paradox holds even when the means are not related in any way.
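
To make the empirical Bayes connection concrete (this is a standard calculation, not spelled out in the post): if {\theta \sim N(0,\tau^2 I)} and {X|\theta \sim N(\theta,I)}, then the posterior mean is

\displaystyle  \mathbb{E}(\theta|X) = \left(1 - \frac{1}{1+\tau^2}\right) X

and, marginally, {||X||^2/(1+\tau^2) \sim \chi^2_k}, so that {\mathbb{E}\left[(k-2)/||X||^2\right] = 1/(1+\tau^2)} when {k \geq 3}. Plugging in {(k-2)/||X||^2} as an unbiased estimate of the shrinkage factor {1/(1+\tau^2)} gives exactly the James-Stein estimator.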

Some intuition can be gained by thinking about function estimation. Consider a smooth function {f(x)}. Suppose we have data

\displaystyle  Y_i = f(x_i) + \epsilon_i

where {x_i = i/n} and {\epsilon_i \sim N(0,1)}. Let us expand {f} in an orthonormal basis: {f(x) = \sum_j \theta_j \psi_j(x)}. To estimate {f} we need only estimate the coefficients {\theta_1,\theta_2,\ldots}. Note that {\theta_j = \int f(x) \psi_j(x) dx}. This suggests the estimator

\displaystyle  \hat\theta_j = \frac{1}{n}\sum_{i=1}^n Y_i \psi_j(x_i).

But the resulting function estimator {\hat f(x) = \sum_j \hat\theta_j \psi_j(x)} is useless because it is too wiggly. The solution is to smooth the estimator; this corresponds to shrinking the raw estimates {\hat\theta_j} towards 0. This adds bias but reduces variance. In other words, the familiar process of smoothing, which we use all the time for function estimation, can be seen as “shrinking estimates towards 0” as with the James-Stein estimator.
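
Here is a small sketch of that idea in Python/numpy (the cosine basis, the particular {f}, and the shrinkage weights are my own illustrative choices, not from the post): the raw coefficient estimates give a wiggly {\hat f}, and shrinking them towards 0 trades a little bias for a large drop in variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.arange(1, n + 1) / n                               # design points x_i = i/n
f = np.cos(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)   # a hypothetical smooth f
Y = f + rng.normal(size=n)                                # Y_i = f(x_i) + eps_i

# Cosine basis: psi_0(x) = 1, psi_j(x) = sqrt(2) cos(pi j x) for j >= 1.
J = 50
psi = np.ones((J, n))
for j in range(1, J):
    psi[j] = np.sqrt(2) * np.cos(np.pi * j * x)

# Raw coefficient estimates: theta_hat_j = (1/n) sum_i Y_i psi_j(x_i).
theta_hat = psi @ Y / n

# Unsmoothed estimate: unbiased but too wiggly (high variance).
f_hat_raw = theta_hat @ psi

# Shrink the coefficients towards 0 (an arbitrary shrinkage schedule, decaying with j).
weights = np.maximum(1 - np.arange(J) / 10, 0)
f_hat_smooth = (weights * theta_hat) @ psi

print("mean squared error, raw:    ", np.mean((f_hat_raw - f)**2))
print("mean squared error, shrunk: ", np.mean((f_hat_smooth - f)**2))
```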

If you are familiar with minimax theory, you might find the Stein paradox a bit confusing. The estimator {\hat\theta = X} is minimax, that is, its risk achieves the minimax bound

\displaystyle  \inf_{\hat\theta}\sup_\theta R_{\hat\theta}(\theta).

This suggests that {X} is a good estimator. But Stein’s paradox tells us that {\hat\theta = X} is inadmissible which suggests that it is a bad estimator.

Is there a contradiction here?

No. The risk {R_{\hat\theta}(\theta)} of {\hat\theta=X} is a constant. In fact, {R_{\hat\theta}(\theta)=k} for all {\theta}, where {k} is the dimension of {\theta}. The risk {R_{\theta^*}(\theta)} of the James-Stein estimator is less than the risk of {X}, but {R_{\theta^*}(\theta)\rightarrow k} as {||\theta||\rightarrow \infty}. So they have the same maximum risk.
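
A quick simulation in Python/numpy (the dimension {k=5} and the grid of {||\theta||} values are arbitrary choices of mine) makes the point: the James-Stein risk is well below {k} near the origin and climbs back towards {k} as {||\theta||} grows.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n_sim = 5, 200_000

# Sweep the size of theta and watch the James-Stein risk climb back up towards k.
for r in [0.0, 1.0, 3.0, 10.0, 30.0]:
    theta = np.zeros(k)
    theta[0] = r                                  # a vector with ||theta|| = r
    X = rng.normal(theta, 1.0, size=(n_sim, k))
    js = (1 - (k - 2) / np.sum(X**2, axis=1, keepdims=True)) * X
    risk = np.mean(np.sum((js - theta)**2, axis=1))
    print(f"||theta|| = {r:5.1f}   James-Stein risk = {risk:.3f}   (risk of X = {k})")
```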

On the one hand, this tells us that a minimax estimator can be inadmissible. On the other hand, in some sense it can’t be “too far” from admissible since they have the same maximum risk.

Stein first reported the paradox in 1956. I suspect that fewer and fewer people include the Stein paradox in their teaching. (I’m guilty.) This is a shame. Paradoxes really grab students’ attention. And, in this case, the paradox is really fundamental to many things including shrinkage estimators, hierarchical Bayes, and function estimation.