The Steep Price of Sparsity

Normal Deviate 2013-07-27

We all love sparse estimators these days. I am referring to things like the lasso and related variable selection methods.

But there is a dark side to sparsity. It’s what Hannes Leeb and Benedikt Potscher call the “return of Hodges’ estimator.” Basically, any estimator that is capable of producing sparse estimates has unbounded (scaled) maximum risk.

1. Hodges Estimator

Let’s start by recalling Hodges’ famous example.

Suppose that {X_1,\ldots,X_n \sim N(\theta,1)}. Define {\hat\theta} as follows:

\displaystyle  \hat\theta = \begin{cases} \overline{X} & \text{if } |\overline{X}| \geq n^{-1/4}\\ 0 & \text{if } |\overline{X}| < n^{-1/4}. \end{cases}

If {\theta\neq 0}, then {\hat\theta} will equal {\overline{X}} for all large {n}. But if {\theta=0}, then eventually {\hat\theta =0}. The estimator discovers that {\theta} is 0.
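
Here is a tiny simulation of this two-regime behavior (a sketch in Python with NumPy; the function name hodges and the sample sizes are my own illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def hodges(x):
        # Hodges estimator: the sample mean, zeroed out below n^{-1/4}.
        n, xbar = len(x), x.mean()
        return xbar if abs(xbar) >= n ** (-0.25) else 0.0

    # For theta = 0.5 the estimate tracks the sample mean;
    # for theta = 0 it is eventually exactly 0.
    for theta in (0.5, 0.0):
        for n in (100, 10_000):
            x = rng.normal(theta, 1.0, size=n)
            print(theta, n, hodges(x))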

This seems like a good thing. This is what we want whenever we do model selection. We want to discover that some coefficients are 0. That’s the nature of using sparse methods.

But there is a price to pay for sparsity. The Hodges estimator has the unfortunate property that the maximum risk is terrible. Indeed,

\displaystyle  \sup_\theta \mathbb{E}_\theta [n (\hat\theta-\theta)^2] \rightarrow \infty.

Contrast this with the sample mean:

\displaystyle  \sup_\theta \mathbb{E}_\theta [n (\overline{X}-\theta)^2] =1.

The reason for the poor behavior of the Hodges estimator (or any sparse estimator) is that there is a region near 0 where the estimator is very unstable: it has to decide when to switch from {\overline{X}} to 0, and that decision creates a zone of instability.

I plotted the risk function of the Hodges estimator here. The risk of the mle (the sample mean) is flat. The large peaks in the risk function of the Hodges estimator are very clear (and very disturbing).
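
If you want to recompute that risk function, here is a Monte Carlo sketch (Python/NumPy; the grid, sample sizes, and replication count are arbitrary illustrative choices). The peaks sit near {\pm n^{-1/4}} and grow with {n}:

    import numpy as np

    rng = np.random.default_rng(1)

    def scaled_risk(theta, n, reps=20_000):
        # Monte Carlo estimate of n * E_theta[(thetahat - theta)^2] for
        # the Hodges estimator, using X-bar ~ N(theta, 1/n) directly.
        xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
        hat = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
        return n * np.mean((hat - theta) ** 2)

    thetas = np.linspace(-1.0, 1.0, 81)
    for n in (100, 10_000):
        risks = [scaled_risk(t, n) for t in thetas]
        # The maximum grows with n; for the sample mean the same
        # quantity is identically 1 at every theta.
        print(n, round(max(risks), 1))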

2. The Steep Price of Sparsity

Leeb and Potscher (2008) proved that this poor behavior holds for all sparse estimators and all loss functions. Suppose that

\displaystyle  Y_i = x_i^T \beta + \epsilon_i,\ \ \ i=1,\ldots, n.

Here, {\beta\in \mathbb{R}^k}. Let {s(\beta)} be the support of {\beta}: {s_j(\beta)=1} if {\beta_j \neq 0} and {s_j(\beta)=0} if {\beta_j = 0}. Say that {\hat\beta} is sparsistent (a term invented by Pradeep Ravikumar) if, for each {\beta},

\displaystyle  P^n_\beta(s(\hat\beta)=s(\beta)) \rightarrow 1

as {n\rightarrow \infty}.
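
To make the definition concrete, here is a toy sparsistent estimator: ordinary least squares followed by hard thresholding at {n^{-1/4}}. (A Python/NumPy sketch; the estimator and all settings are my own illustration, not from Leeb and Potscher.)

    import numpy as np

    rng = np.random.default_rng(2)

    def support(beta):
        # s(beta): the vector of indicators of nonzero coordinates.
        return beta != 0

    def thresholded_ols(X, y):
        # Least squares, then hard thresholding at n^{-1/4}; in the
        # fixed-k setting this estimator is sparsistent.
        n = len(y)
        bhat = np.linalg.lstsq(X, y, rcond=None)[0]
        bhat[np.abs(bhat) < n ** (-0.25)] = 0.0
        return bhat

    beta = np.array([1.0, 0.0, -0.5, 0.0])
    for n in (50, 500, 5_000):
        hits = 0
        for _ in range(200):
            X = rng.normal(size=(n, 4))
            y = X @ beta + rng.normal(size=n)
            hits += np.array_equal(support(thresholded_ols(X, y)),
                                   support(beta))
        print(n, hits / 200)  # empirical P(s(betahat) = s(beta)) -> 1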

Leeb and Potscher (2008) showed that if {\hat\beta} is sparsistent, then

\displaystyle  \sup_\beta \mathbb{E}_\beta[ n\, ||\hat\beta-\beta||^2]\rightarrow \infty

as {n\rightarrow \infty}. More generally, for any non-negative loss function {\ell(\hat\beta-\beta)}, we have

\displaystyle  \sup_\beta \mathbb{E}_\beta[ \ell(n^{1/2}(\hat\beta -\beta))]\rightarrow \sup_\beta \ell(\beta).
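
To watch the theorem bite in the squared-error case, evaluate the thresholded estimator from above at a least favorable {\beta}, one whose nonzero entry sits exactly at the threshold (again an illustrative Python/NumPy sketch with settings of my own choosing):

    import numpy as np

    rng = np.random.default_rng(3)

    def scaled_risk(n, reps=300):
        # n * E[ ||betahat - beta||^2 ] for thresholded OLS, at a beta
        # whose nonzero coordinate equals the threshold n^{-1/4}.
        beta = np.array([n ** (-0.25), 0.0])
        total = 0.0
        for _ in range(reps):
            X = rng.normal(size=(n, 2))
            y = X @ beta + rng.normal(size=n)
            bhat = np.linalg.lstsq(X, y, rcond=None)[0]
            bhat[np.abs(bhat) < n ** (-0.25)] = 0.0
            total += n * np.sum((bhat - beta) ** 2)
        return total / reps

    for n in (100, 1_000, 10_000):
        print(n, round(scaled_risk(n), 1))  # grows roughly like sqrt(n)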

3. How Should We Interpret This Result?

One might object that the maximum risk is too conservative because it is driven by extreme cases. But here that objection does not apply: the high values of the risk occur in a small neighborhood of 0. (Recall the picture of the risk of the Hodges estimator.) This is far from pathological.

Another objection is that the proof assumes {n} grows while {k} stays fixed. But in many applications that we are interested in, {k} grows and is possibly even larger than {n}. This is a valid objection. On the other hand, if there is unfavorable behavior in the ideal case of fixed {k}, we should not be sanguine about the high-dimensional case.

I am not suggesting we should give up variable selection. I use variable selection all the time. But we should keep in mind that there is a steep price to pay for sparsistency.

References

Leeb, Hannes and Potscher, Benedikt M. (2008). Sparse estimators and the oracle property, or the return of Hodges’ estimator. Journal of Econometrics, 142, 201-211.

Leeb, Hannes and Potscher, Benedikt M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21, 21-59.