Consistency, Sparsistency and Presistency

Normal Deviate 2013-09-15

There are many ways to discuss the quality of estimators in statistics. Today I want to review three common notions: presistency, consistency and sparsistency. I will discuss them in the context of linear regression. (Yes, that’s presistency, not persistency.)

Suppose the data are {(X_1,Y_1),\ldots, (X_n,Y_n)} where

\displaystyle  Y_i = \beta^T X_i + \epsilon_i,

{Y_i\in\mathbb{R}}, {X_i\in\mathbb{R}^d} and {\beta\in\mathbb{R}^d}. Let {\hat\beta=(\hat\beta_1,\ldots,\hat\beta_d)} be an estimator of {\beta=(\beta_1,\ldots,\beta_d)}.

Probably the most familiar notion is consistency. We say that {\hat\beta} is consistent if

\displaystyle  ||\hat\beta - \beta|| \stackrel{P}{\rightarrow} 0

as {n \rightarrow \infty}.
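For concreteness, here is a small simulation sketch (mine, not from the post) showing consistency of ordinary least squares in a correctly specified linear model: the distance {||\hat\beta - \beta||} shrinks as {n} grows. The sample sizes and dimension are just illustrative choices.

```python
# Minimal sketch: ||beta_hat - beta|| shrinks as n grows
# for least squares in a well-specified linear model.
import numpy as np

rng = np.random.default_rng(0)
d = 5
beta = rng.normal(size=d)                      # true coefficient vector

for n in [100, 1_000, 10_000]:
    X = rng.normal(size=(n, d))
    Y = X @ beta + rng.normal(size=n)          # Y_i = beta^T X_i + eps_i
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(n, np.linalg.norm(beta_hat - beta))  # tends to 0 in probability
```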

In recent years, people have become interested in sparsistency (a term invented by Pradeep Ravikumar). Define the support of {\beta} to be the location of the nonzero elements:

\displaystyle  {\rm supp}(\beta) = \{j:\ \beta_j \neq 0\}.

Then {\hat\beta} is sparsistent if

\displaystyle  \mathbb{P}({\rm supp}(\hat\beta) = {\rm supp}(\beta) ) \rightarrow 1

as {n\rightarrow\infty}.
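A quick sketch of what sparsistency asks for (again mine, assuming scikit-learn's Lasso): check whether the estimated support matches the true support at a few sample sizes. The fixed penalty {\alpha = 0.1} is only an illustrative choice; sparsistency results require the tuning parameter to shrink at an appropriate rate, and they need further assumptions on the design.

```python
# Sketch (assumes scikit-learn): does supp(beta_hat) = supp(beta)?
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
d = 20
beta = np.zeros(d)
beta[:3] = [2.0, -1.5, 1.0]                    # supp(beta) = {0, 1, 2}

for n in [50, 500, 5_000]:
    X = rng.normal(size=(n, d))
    Y = X @ beta + rng.normal(size=n)
    beta_hat = Lasso(alpha=0.1).fit(X, Y).coef_
    recovered = set(np.flatnonzero(beta_hat)) == set(np.flatnonzero(beta))
    print(n, recovered)                        # probability of True should tend to 1
```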

The last one is what I like to call presistence. I just invented this word. Some people call it risk consistency or predictive consistency. Greenshtein and Ritov (2004) call it persistency but this creates confusion for those of us who work with persistent homology. Of course, presistence comes from shortening “predictive consistency.”

Let {(X,Y)} be a new pair. The predictive risk of {\beta} is

\displaystyle  R(\beta) = \mathbb{E}(Y-X^T \beta)^2.

Let {{\cal B}_n} be some set of {\beta}'s and let {\beta_n^*} be the best {\beta} in {{\cal B}_n}. That is, {\beta_n^*} minimizes {R(\beta)} subject to {\beta \in {\cal B}_n}. Then {\hat\beta} is presistent if

\displaystyle  R(\hat\beta) - R(\beta_n^*) \stackrel{P}{\rightarrow} 0.

This means that {\hat\beta} predicts nearly as well as the best choice of {\beta} in {{\cal B}_n}. As an example, consider the set of sparse vectors

\displaystyle  {\cal B}_n = \Bigl\{ \beta:\ \sum_{j=1}^d |\beta_j| \leq L\Bigr\}.

(The dimension {d} is allowed to depend on {n} which is why we have a subscript on {{\cal B}_n}.) In this case, {\beta_n^*} can be interpreted as the best sparse linear predictor. The corresponding sample estimator {\hat\beta}, which minimizes the sum of squares subject to lying in {{\cal B}_n}, is the lasso estimator. Greenshtein and Ritov (2004) proved that the lasso is presistent under essentially no conditions.
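Here is a rough simulation sketch of that message (mine, not from the paper; it assumes scikit-learn, uses the penalized rather than the constrained form of the lasso, and crudely approximates {\beta_n^*} by a fit on a very large sample). The true regression function below is not linear, yet the lasso's predictive risk approaches that of the best sparse linear predictor.

```python
# Sketch of presistence (assumes scikit-learn): the true regression function
# is NOT linear, yet R(beta_hat) - R(beta_n^*) tends to 0. beta_n^* is
# approximated by a lasso fit on a very large sample; alpha = 0.05 is an
# arbitrary illustrative choice standing in for the l1 constraint level L.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
d = 10

def draw(n):
    X = rng.normal(size=(n, d))
    Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)  # nonlinear truth
    return X, Y

def risk(model, n_test=200_000):
    # Monte Carlo estimate of the predictive risk R(beta) = E(Y - X^T beta)^2
    X, Y = draw(n_test)
    return np.mean((Y - model.predict(X)) ** 2)

proxy = Lasso(alpha=0.05, fit_intercept=False).fit(*draw(500_000))  # stand-in for beta_n^*
for n in [100, 1_000, 10_000]:
    fit = Lasso(alpha=0.05, fit_intercept=False).fit(*draw(n))
    print(n, risk(fit) - risk(proxy))          # excess risk tends to 0 (up to Monte Carlo noise)
```

Nothing in this sketch uses the assumption that the linear model is correct; that is exactly the point of presistence.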

This is the main message of this post: To establish consistency or sparsistency, we have to make lots of assumptions. In particular, we need to assume that the linear model is correct. But we can prove presistence with virtually no assumptions. In particular, we do not have to assume that the linear model is correct.

Presistence seems to get less attention than consistency or sparsistency but I think it is the most important of the three.

Bottom line: presistence deserves more attention. And, if you have never read Greenshtein and Ritov (2004), I highly recommend that you read it.

Reference:

Greenshtein, Eitan and Ritov, Ya’Acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.