STEIN’S PARADOX
Normal Deviate 2013-05-19
Something that is well known in the statistics world but perhaps less well known in the machine learning world is Stein’s paradox.
When I was growing up, people used to say: do you remember where you were when you heard that JFK died? (I was three, so I don’t remember. My first memory is watching the Beatles on Ed Sullivan.)
Similarly, statisticians used to say: do you remember where you were when you heard about Stein’s paradox? That’s how surprising it was. (I don’t remember since I wasn’t born yet.)
Here is the paradox. Let $X \sim N(\theta, 1)$. Define the risk of an estimator $\hat\theta = \hat\theta(X)$ to be

$$R(\theta, \hat\theta) = \mathbb{E}_\theta\big[(\hat\theta - \theta)^2\big].$$

An estimator $\hat\theta$ is inadmissible if there is another estimator $\tilde\theta$ with smaller risk. In other words, if

$$R(\theta, \tilde\theta) \le R(\theta, \hat\theta) \quad \text{for all } \theta,$$

with strict inequality at at least one $\theta$.

Question: Is $\hat\theta = X$ admissible? Answer: Yes.

Now suppose that $X = (X_1, X_2) \sim N(\theta, I)$ where now $\theta = (\theta_1, \theta_2)$, and

$$R(\theta, \hat\theta) = \mathbb{E}_\theta\big[\|\hat\theta - \theta\|^2\big].$$

Question: Is $\hat\theta = X$ admissible? Answer: Yes.

Now suppose that $X = (X_1, X_2, X_3) \sim N(\theta, I)$ where now $\theta = (\theta_1, \theta_2, \theta_3)$, and

$$R(\theta, \hat\theta) = \mathbb{E}_\theta\big[\|\hat\theta - \theta\|^2\big].$$

Question: Is $\hat\theta = X$ admissible? Answer: No!
If you don’t find this surprising then either you’ve heard this before or you’re not thinking hard enough. Keep in mind that the coordinates of the vector $X$ are independent. And the $\theta_i$’s could have nothing to do with each other. For example: the mass of the moon, the price of coffee, and the temperature in Rome.
In general, $\hat\theta = X$ is inadmissible if the dimension $d$ of $\theta$ satisfies $d \ge 3$.
The proof that $\hat\theta = X$ is inadmissible is based on defining an explicit estimator that has smaller risk than $X$. For example, the James-Stein estimator is

$$\hat\theta_{JS} = \left(1 - \frac{d-2}{\|X\|^2}\right) X.$$

It can be shown that the risk of this estimator is strictly smaller than the risk of $X$, for all $\theta$. This implies that $X$ is inadmissible. If you want to see the detailed calculations, have a look at Iain Johnstone’s book, which he makes freely available on his website.
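If you want a quick numerical sanity check, here is a minimal Monte Carlo sketch (the dimension, the particular $\theta$, and the number of simulations are arbitrary illustrative choices; it assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(theta, n_sims=100_000):
    """Monte Carlo estimate of the risk E||estimator - theta||^2
    for the raw estimator X and for the James-Stein estimator."""
    d = len(theta)
    X = rng.normal(loc=theta, scale=1.0, size=(n_sims, d))   # X ~ N(theta, I)
    risk_x = np.mean(np.sum((X - theta) ** 2, axis=1))       # should come out near d
    shrink = 1.0 - (d - 2) / np.sum(X ** 2, axis=1)          # James-Stein shrinkage factor
    js = shrink[:, None] * X
    risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
    return risk_x, risk_js

# Five deliberately unrelated means, in the spirit of the moon/coffee/Rome example.
theta = np.array([2.0, 0.3, -1.0, 0.1, 0.5])
print(empirical_risk(theta))   # the James-Stein risk is strictly smaller than d = 5
```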
Note that the James-Stein estimator shrinks $X$ towards the origin. (In fact, you can shrink towards any point; there is nothing special about the origin.) This can be viewed as an empirical Bayes estimator, where $\theta$ has a prior of the form $\theta \sim N(0, \tau^2 I)$ and $\tau^2$ is estimated from the data. The Bayes explanation gives some nice intuition. But it’s also a bit misleading. The Bayes explanation suggests we are shrinking the means together because we expect them a priori to be similar. But the paradox holds even when the means are not related in any way.
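To see where the shrinkage factor comes from under the empirical Bayes reading, here is a sketch of the standard calculation (writing the prior as $\theta \sim N(0, \tau^2 I)$ and assuming $d \ge 3$):

```latex
% Prior \theta \sim N(0, \tau^2 I); sampling model X \mid \theta \sim N(\theta, I).
\begin{align*}
\mathbb{E}[\theta \mid X] &= \frac{\tau^2}{1+\tau^2}\, X
  = \Bigl(1 - \frac{1}{1+\tau^2}\Bigr) X,
  && \text{(posterior mean = Bayes estimator)} \\
\mathbb{E}\!\left[\frac{d-2}{\|X\|^2}\right] &= \frac{1}{1+\tau^2},
  && \text{(marginally } X \sim N(0,(1+\tau^2) I),\ d \ge 3\text{)}
\end{align*}
```

Replacing the unknown $1/(1+\tau^2)$ by its unbiased estimate $(d-2)/\|X\|^2$ recovers exactly the James-Stein estimator.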
Some intuition can be gained by thinking about function estimation. Consider a smooth function $f(x)$. Suppose we have data

$$Y_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, n,$$

where $x_i = i/n$ and $\epsilon_i \sim N(0, \sigma^2)$. Let us expand $f$ in an orthonormal basis: $f(x) = \sum_j \theta_j \psi_j(x)$. To estimate $f$ we need only estimate the coefficients $\theta_j = \int f(x)\,\psi_j(x)\, dx$. Note that $Z_j \equiv \frac{1}{n}\sum_i Y_i \psi_j(x_i) \approx N(\theta_j, \sigma^2/n)$. This suggests the estimator

$$\hat f(x) = \sum_j Z_j \psi_j(x).$$
But the resulting function estimator is useless because it is too wiggly. The solution is to smooth the estimator; this corresponds to shrinking the raw estimates towards 0. This adds bias but reduces variance. In other words, the familiar process of smoothing, which we use all the time for function estimation, can be seen as “shrinking estimates towards 0” as with the James-Stein estimator.
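Here is a small numerical illustration of that correspondence (the test function, noise level, cosine basis, and shrinkage weights are purely illustrative choices; it assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
x = np.arange(1, n + 1) / n                 # design points x_i = i/n
f = np.sin(2 * np.pi * x)                   # a smooth function to recover
sigma = 1.0
Y = f + rng.normal(0.0, sigma, n)           # Y_i = f(x_i) + eps_i

# Cosine basis, approximately orthonormal at these design points.
J = n
psi = np.ones((J, n))
for j in range(1, J):
    psi[j] = np.sqrt(2) * np.cos(np.pi * j * x)

Z = psi @ Y / n                             # Z_j ~ approx N(theta_j, sigma^2 / n)

f_raw = Z @ psi                             # keep every Z_j: far too wiggly
weights = 1.0 / (1.0 + (np.arange(J) / 10.0) ** 4)   # shrink high frequencies toward 0
f_smooth = (weights * Z) @ psi              # smoothing = shrinkage: some bias, much less variance

print("raw MSE:   ", np.mean((f_raw - f) ** 2))
print("smooth MSE:", np.mean((f_smooth - f) ** 2))
```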
If you are familiar with minimax theory, you might find the Stein paradox a bit confusing. The estimator $\hat\theta = X$ is minimax, that is, its risk achieves the minimax bound

$$\inf_{\hat\theta} \sup_\theta R(\theta, \hat\theta).$$

This suggests that $X$ is a good estimator. But Stein’s paradox tells us that $X$ is inadmissible, which suggests that it is a bad estimator.
Is there a contradiction here?
No. The risk of $X$ is a constant. In fact, $R(\theta, X) = d$ for all $\theta$, where $d$ is the dimension of $\theta$. The risk of the James-Stein estimator is less than the risk of $X$, but $R(\theta, \hat\theta_{JS}) \to d$ as $\|\theta\| \to \infty$. So they have the same maximum risk.
On the one hand, this tells us that a minimax estimator can be inadmissible. On the other hand, in some sense it can’t be “too far” from admissible since they have the same maximum risk.
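A short simulation makes this concrete: the risk of the James-Stein estimator climbs back toward $d$ as $\|\theta\|$ grows (the dimension, the grid of $\|\theta\|$ values, and the simulation size are again arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def js_risk(theta, n_sims=200_000):
    """Monte Carlo estimate of E||theta_JS - theta||^2 for the James-Stein estimator."""
    d = len(theta)
    X = rng.normal(theta, 1.0, size=(n_sims, d))
    shrink = 1.0 - (d - 2) / np.sum(X ** 2, axis=1)
    return np.mean(np.sum((shrink[:, None] * X - theta) ** 2, axis=1))

d = 5
for r in [0.0, 1.0, 5.0, 20.0, 100.0]:
    theta = np.full(d, r / np.sqrt(d))                            # ||theta|| = r
    print(f"||theta|| = {r:6.1f}   JS risk ~ {js_risk(theta):.3f}")  # approaches d = 5
```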
Stein first reported the paradox in 1956. I suspect that fewer and fewer people include the Stein paradox in their teaching. (I’m guilty.) This is a shame. Paradoxes really grab students’ attention. And, in this case, the paradox is really fundamental to many things including shrinkage estimators, hierarchical Bayes, and function estimation.