Bootstrapping and Subsampling: Part II

Normal Deviate 2013-03-18

In Part I we discussed the bootstrap and saw that it sometimes does not work. By “doesn’t work” I mean that the coverage of the nominal {1-\alpha} level confidence intervals does not approach {1-\alpha} as the sample size increases. In this post I’ll discuss subsampling, which works under broader conditions than the bootstrap.

1. Subsampling

Suppose that {X_1,\ldots, X_n \sim P}. Let {\hat\theta_n = g(X_1,\ldots,X_n)} be some estimator of {\theta = T(P)}.

We need to make one main assumption:

Assume that {n^\beta (\hat\theta_n - \theta)} converges in distribution to some continuous, non-degenerate distribution {J}.

We do not need to know {J}; we only need to know that such a {J} exists. We don’t even need to know {\beta}, but for simplicity I’ll assume that {\beta=1/2}. So our assumption is that {\sqrt{n} (\hat\theta_n - \theta)} converges to some {J}. For example, if {\hat\theta_n} is the sample mean and the {X_i} have finite variance {\sigma^2}, the central limit theorem gives exactly this with {J = N(0,\sigma^2)}. Here is the subsampling algorithm.

  1. Choose a number {b = b_n} such that

    \displaystyle  b_n \rightarrow \infty,\ \ \ \ \frac{b_n}{n}\rightarrow 0

    as {n\rightarrow\infty}.

  2. Draw {b} observations (without replacement) from the data {\{X_1,\ldots, X_n\}}. This is one subsample. Repeat this {N} times. So we have {N} subsamples, each of size {b}. Denote these subsamples by {S_1,\ldots, S_N}.

  3. From each subsample {S_j} we compute the estimator. This yields {N} values

    \displaystyle  \hat\theta_b^1,\ldots, \hat\theta_b^N.

    Note that each estimator is based on one subsample of size {b}.

  4. Now define

    \displaystyle  L_n(t) = \frac{1}{N}\sum_{j=1}^N I\Bigl( \sqrt{b} (\hat\theta_b^j - \hat\theta_n) \leq t\Bigr)

    where {I} denotes the indicator function.

  5. Let {\hat t_{\alpha/2}} and {\hat t_{1-\alpha/2}} be the {\alpha/2} and {1-\alpha/2} quantiles of {L_n}:

    \displaystyle  \hat t_{\alpha/2} = L_n^{-1}(\alpha/2),\ \ \ \hat t_{1-\alpha/2} = L_n^{-1}(1-\alpha/2).

  6. Define the confidence interval

    \displaystyle  C_n = \left[\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right].
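
To make the steps concrete, here is a minimal sketch of the algorithm in Python (the function subsample_ci and the default choices below are just for illustration, not taken from any particular package):

    import numpy as np

    def subsample_ci(x, estimator, b, N=1000, alpha=0.05, rng=None):
        """Subsampling confidence interval for theta = T(P), assuming a sqrt(n) rate."""
        rng = np.random.default_rng(rng)
        x = np.asarray(x)
        n = len(x)
        theta_hat = estimator(x)  # \hat\theta_n from the full sample

        # Steps 2-3: draw N subsamples of size b (without replacement) and
        # recompute the estimator on each one.
        theta_b = np.array([estimator(rng.choice(x, size=b, replace=False))
                            for _ in range(N)])

        # Step 4: the empirical distribution L_n of sqrt(b)(\hat\theta_b^j - \hat\theta_n).
        roots = np.sqrt(b) * (theta_b - theta_hat)

        # Step 5: its alpha/2 and 1 - alpha/2 quantiles.
        t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])

        # Step 6: C_n = [theta_hat - t_hi/sqrt(n), theta_hat - t_lo/sqrt(n)].
        return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)

For instance, for the median of a heavy-tailed sample x one might call subsample_ci(x, np.median, b=int(len(x)**0.5)); the choice {b=\sqrt{n}} here is purely for illustration.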

Theorem:

\displaystyle  \mathop{\mathbb P}(\theta\in C_n)\rightarrow 1-\alpha

as {n\rightarrow\infty}.

Note: There are {N = \binom{n}{b}} possible subsamples of size {b}. In practice it is not necessary to use all of these subsamples. Usually, one selects, say, {N=1,000} subsamples at random.

2. Why It Works

Here is an outline of the proof of the Theorem. (See Chapter 2 of Politis, Romano and Wolf 1999 for details.) Define

\displaystyle  J_n(x) = \mathbb{P}( \sqrt{n}(\hat\theta_n - \theta) \leq x).

By assumption {J_n(x) = J(x) + o(1)}. Also, {J_b(x) = J(x) + o(1)} since {b\rightarrow\infty} as {n\rightarrow\infty}. The key fact is that {L_n(x)-J_b(x)\stackrel{P}{\rightarrow}0} as {n\rightarrow \infty}. To see this, note that

\displaystyle  \sqrt{b} (\hat\theta_b - \hat\theta_n) = \sqrt{b} (\hat\theta_b - \theta) + \sqrt{\frac{b}{n}}\ \sqrt{n}(\theta-\hat\theta_n)= \sqrt{b} (\hat\theta_b - \theta) + R_n

where {R_n= o_P(1)}. This follows since {b/n \rightarrow 0} and since {\sqrt{n}(\theta-\hat\theta_n)} is converging in distribution. So,

\displaystyle  L_n(x) \approx U(x)

where

\displaystyle  U(x) = \frac{1}{N}\sum_{j=1}^N I( \sqrt{b} (\hat\theta_b^j - \theta) \leq x).

By Hoeffding’s inequality for U-statistics, we have that, for every {\epsilon>0},

\displaystyle  \mathbb{P} ( |U(x) - J_b(x)| > \epsilon) \leq 2 \exp\left( - \frac{2 n \epsilon^2}{b}\right)

which goes to 0 since {n/b\rightarrow \infty} as {n\rightarrow\infty}. So

\displaystyle  L_n(x) \approx U(x) \approx J_b(x) \approx J_n(x) \approx J(x).

The coverage of {C_n} is

\displaystyle  P\left(\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}} \leq \theta \leq \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right) = P(\hat{t}_{\alpha/2} \leq \sqrt{n}(\hat\theta_n-\theta) \leq \hat{t}_{1-\alpha/2})= J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}).

But

\displaystyle  J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}) \approx L_n(\hat t_{1-\alpha/2}) - L_n(\hat t_{\alpha/2}) = \left(1-\frac{\alpha}{2}\right) - \frac{\alpha}{2} = 1-\alpha.
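
The theorem can also be checked by simulation: generate many datasets from a known {P}, form {C_n} on each, and record how often {\theta \in C_n}. Here is a rough sketch, reusing subsample_ci from the sketch in Section 1 (the Cauchy/median setup is just one arbitrary example with a {\sqrt{n}} rate):

    def estimate_coverage(n, b, theta=0.0, reps=500, alpha=0.05, seed=0):
        """Fraction of simulated datasets whose subsampling interval covers theta."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(reps):
            x = theta + rng.standard_cauchy(n)  # true median is theta
            lo, hi = subsample_ci(x, np.median, b=b, N=500, alpha=alpha, rng=rng)
            hits += (lo <= theta <= hi)
        return hits / reps

    # The estimated coverage should be close to 1 - alpha = 0.95 and should
    # improve as n grows (with b/n -> 0), e.g. estimate_coverage(n=2000, b=45).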

The amazing thing about subsampling is that it requires very few regularity conditions; essentially nothing beyond the convergence assumption stated above. This is in contrast to the bootstrap, which does require fairly strong conditions to yield valid confidence intervals.

3. What’s the Catch?

There are a few catches. First, you need to choose {b}. The only requirement is that {b/n\rightarrow 0} and {b\rightarrow\infty} as {n\rightarrow\infty}. It’s great that it works for such a large range of values but we need a way to pick {b}.

Fortunately, there are methods for choosing {b}. I won’t go into detail; rather, I’ll point you to Chapter 9 of Politis, Romano and Wolf (1999) and to Bickel and Sakov (2008).
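
To give the flavour of what such adaptive rules do, here is a crude heuristic: compute the subsampling distribution over a grid of candidate values of {b} and keep the value at which it is most stable. This is a simplified caricature for illustration only, not the actual procedures in those references:

    def choose_b(x, estimator, candidates, N=500, rng=None):
        """Crude heuristic: among an increasing list of candidate subsample sizes,
        pick the b at which the distribution of sqrt(b)(theta_b - theta_n)
        changes least relative to its neighbour."""
        rng = np.random.default_rng(rng)
        x = np.asarray(x)
        theta_hat = estimator(x)
        probs = np.linspace(0.05, 0.95, 19)

        # Quantiles of the subsampling distribution for each candidate b.
        q = []
        for b in candidates:
            roots = [np.sqrt(b) * (estimator(rng.choice(x, size=b, replace=False)) - theta_hat)
                     for _ in range(N)]
            q.append(np.quantile(roots, probs))

        # Distance between consecutive candidates; return the most stable one.
        dists = [np.max(np.abs(q[i + 1] - q[i])) for i in range(len(candidates) - 1)]
        return candidates[int(np.argmin(dists))]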

A more serious problem is that the confidence guarantee is only an asymptotic guarantee. However, this is true of many statistical methods. I prefer to use methods with finite sample guarantees whenever possible but sometimes there is no choice but to resort to large sample approximations. Come to think of it, a good topic for a future blog post would be: for which problems are there accurate methods with finite sample guarantees?

Another concern about subsampling and bootstrapping is that they are computationally intensive. For a very interesting recent approach to dealing with the computational burden, see this paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Mike Jordan.

By the way, see this paper by Joe Romano and Azeem Shaikh for some recent theoretical advances.

4. Conclusion

The bootstrap and subsampling are powerful techniques for constructing confidence intervals. Subsampling works under weaker conditions and relies on simpler theory. However, subsampling requires choosing a tuning parameter (the size {b} of the subsamples) which probably explains why it is less popular. I think that if we had a fast method for automatically choosing {b}, then subsampling would replace bootstrapping as the method of choice.

References

Bickel, P.J. and Sakov, A. (2008). On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18, 967-985.

Politis, D.N., Romano, J.P. and Wolf, M. (1999). Subsampling. Springer-Verlag.