Bootstrapping and Subsampling: Part I
Normal Deviate 2013-03-18
Bootstrapping and subsampling are in the “amazing” category in statistics. They seem much more popular in statistics than machine learning for some reason.
1. The Bootstrap
The bootstrap (a.k.a. the shotgun) was invented by Brad Efron. Here is how it works. We have data $X_1, \ldots, X_n \sim P$ and we want a confidence interval for $\theta = T(P)$. For example, $\theta$ could be the median of $P$, or the mean of $P$, or something more complicated like the largest eigenvalue of the covariance matrix of $P$.
The bootstrap confidence interval is

$$
C_n = \left[ \hat\theta - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta - \frac{\hat t_{\alpha/2}}{\sqrt{n}} \right]
$$

where $\hat\theta$ is an estimator of $\theta$ and $\hat t_{\alpha/2}$ and $\hat t_{1-\alpha/2}$ are sample bootstrap quantiles that I will describe below. Before I explain this in more detail, notice two things. First, there is a minus sign in both the lower and upper endpoint. Second, the $1-\alpha/2$ and $\alpha/2$ quantiles appear in the upper and lower endpoints, respectively, the reverse of what you might expect. The reason for the strange looking interval will be clear when we derive it.
Now for some details. Think of the parameter of interest as a function of the unknown distribution, which is why we write it as $\theta = T(P)$. Let $P_n$ denote the empirical distribution:

$$
P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A).
$$

In other words, $P_n$ is the distribution that puts mass $1/n$ at each data point $X_i$.

The estimator is just the functional $T$ applied to $P_n$, that is, $\hat\theta = T(P_n)$. For example, if $\theta$ is the median of $P$ then $\hat\theta$ is the median of $P_n$, which is just the sample median.
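To make the plug-in idea concrete, here is a minimal sketch in Python (NumPy only). The data, the sample size, and the two statistics (the median and the largest eigenvalue of the covariance matrix, the examples mentioned above) are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # illustrative data: n = 200 observations in R^3

# The plug-in principle: apply the functional T to the empirical distribution P_n,
# i.e., compute from the sample the same quantity that defines the parameter under P.
median_hat = np.median(X[:, 0])  # sample median (plug-in estimate of the median)
lambda_hat = np.linalg.eigvalsh(np.cov(X, rowvar=False)).max()  # largest eigenvalue of the sample covariance
print(median_hat, lambda_hat)
```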
Now let

$$
R_n = \sqrt{n}\,(\hat\theta - \theta).
$$

We use $\sqrt{n}\,(\hat\theta - \theta)$ because typically it converges in distribution to some well-defined distribution (such as a Normal). Now let $H_n$ denote the (unknown) distribution of $R_n$:

$$
H_n(t) = \mathbb{P}(R_n \le t).
$$

Suppose, for a moment, that we did know $H_n$. We could then find the $\alpha/2$ quantile $t_{\alpha/2}$ and the $1-\alpha/2$ quantile $t_{1-\alpha/2}$, namely,

$$
t_{\alpha/2} = H_n^{-1}(\alpha/2) \quad \text{and} \quad t_{1-\alpha/2} = H_n^{-1}(1-\alpha/2).
$$

It follows that

$$
\mathbb{P}\Bigl( t_{\alpha/2} \le \sqrt{n}\,(\hat\theta - \theta) \le t_{1-\alpha/2} \Bigr) = 1-\alpha.
$$
Continuing with the fantasy that we know $H_n$, define

$$
\tilde C_n = \left[ \hat\theta - \frac{t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta - \frac{t_{\alpha/2}}{\sqrt{n}} \right].
$$

Now I will show you that $\tilde C_n$ is an exact $1-\alpha$ confidence interval. This follows since

$$
\begin{aligned}
\mathbb{P}(\theta \in \tilde C_n)
&= \mathbb{P}\left( \hat\theta - \frac{t_{1-\alpha/2}}{\sqrt{n}} \le \theta \le \hat\theta - \frac{t_{\alpha/2}}{\sqrt{n}} \right) \\
&= \mathbb{P}\Bigl( t_{\alpha/2} \le \sqrt{n}\,(\hat\theta - \theta) \le t_{1-\alpha/2} \Bigr) \\
&= \mathbb{P}\bigl( t_{\alpha/2} \le R_n \le t_{1-\alpha/2} \bigr) = 1-\alpha.
\end{aligned}
$$

We engineered $\tilde C_n$ so that the last line would be exactly $1-\alpha$. The strange form of $\tilde C_n$ is explained by the fact that we really have a probability statement for $\sqrt{n}\,(\hat\theta - \theta)$, which we then manipulate into the form of an interval for $\theta$. (You can check that if $R_n$ were standard Gaussian then, using the symmetry of the Gaussian, the interval could be re-written in the more familiar looking form $\hat\theta \pm z_{\alpha/2}/\sqrt{n}$, where $z_{\alpha/2}$ is the upper-tail quantile of a standard Normal.)
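To spell out that parenthetical check (this is just the symmetric special case, nothing beyond what the sentence above asserts): if $H_n$ were the standard Gaussian CDF $\Phi$, then

$$
t_{\alpha/2} = \Phi^{-1}(\alpha/2) = -z_{\alpha/2}, \qquad
t_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2) = z_{\alpha/2},
$$

so the interval becomes

$$
\tilde C_n = \left[ \hat\theta - \frac{z_{\alpha/2}}{\sqrt{n}},\ \hat\theta + \frac{z_{\alpha/2}}{\sqrt{n}} \right] = \hat\theta \pm \frac{z_{\alpha/2}}{\sqrt{n}}.
$$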
The problem is that we don't know $H_n$ and hence we don't know $t_{\alpha/2}$ or $t_{1-\alpha/2}$. The bootstrap is a method for approximating $H_n$. Let $B$ be a large number (for example, $B = 100{,}000$). Now do this:
- Draw $n$ observations $X_1^*, \ldots, X_n^*$ from $P_n$ and compute $\hat\theta^* = T(P_n^*)$ from these new data, where $P_n^*$ is the empirical distribution of $X_1^*, \ldots, X_n^*$.
- Repeat step 1 $B$ times, yielding values $\hat\theta_1^*, \ldots, \hat\theta_B^*$.
- Approximate $H_n$ with
  $$
  \hat H_n(t) = \frac{1}{B}\sum_{j=1}^B I\Bigl( \sqrt{n}\,(\hat\theta_j^* - \hat\theta) \le t \Bigr),
  $$
  where $I$ denotes the indicator function.
- Find the quantiles $\hat t_{\alpha/2}$ and $\hat t_{1-\alpha/2}$ of $\hat H_n$ and construct $C_n$ as defined earlier.
The interval $C_n$ is the same as $\tilde C_n$ except we use the estimated quantiles of $\hat H_n$ in place of the quantiles of $H_n$. What we are doing here is estimating $H_n$, the distribution of $\sqrt{n}\,(\hat\theta - \theta)$, by using $P_n$ as an estimate of $P$. (That's why we draw $X_1^*, \ldots, X_n^*$ from $P_n$.) If $P_n$ is close to $P$, then $\hat H_n$ is close to $H_n$, so $\hat t_{\alpha/2} \approx t_{\alpha/2}$ and $\hat t_{1-\alpha/2} \approx t_{1-\alpha/2}$, and then $\mathbb{P}(\theta \in C_n) \approx 1-\alpha$.
There are two sources of error. First, we approximate

$$
H_n(t) = \mathbb{P}\Bigl( \sqrt{n}\,(\hat\theta - \theta) \le t \Bigr)
$$

with

$$
\tilde H_n(t) = \mathbb{P}\Bigl( \sqrt{n}\,(\hat\theta^* - \hat\theta) \le t \,\bigm|\, X_1, \ldots, X_n \Bigr).
$$

Essentially, we are replacing $P$ with $P_n$. Second, we are approximating $\tilde H_n$ with

$$
\hat H_n(t) = \frac{1}{B}\sum_{j=1}^B I\Bigl( \sqrt{n}\,(\hat\theta_j^* - \hat\theta) \le t \Bigr).
$$

This second source of error is negligible because we can make $B$ as large as we want.
Remark: A moment's reflection should convince you that drawing a sample of size $n$ from $P_n$ is the same as drawing $n$ points with replacement from the original data. This is how the bootstrap is often described, but I think it is clearer to describe it as drawing $n$ observations from $P_n$.
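Here is a minimal sketch of the whole recipe in Python (NumPy only), using the sample median as the statistic. The data-generating distribution, sample size, $\alpha$, and $B$ are illustrative choices, and `bootstrap_ci` is just a name for this sketch, not a function from any library.

```python
import numpy as np

def bootstrap_ci(x, statistic, alpha=0.05, B=10_000, rng=None):
    """Bootstrap confidence interval based on the root sqrt(n) * (theta_hat - theta).

    Returns [theta_hat - t_hi / sqrt(n), theta_hat - t_lo / sqrt(n)], where t_lo and
    t_hi are the alpha/2 and 1 - alpha/2 quantiles of the bootstrap distribution of
    sqrt(n) * (theta_star - theta_hat).
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    theta_hat = statistic(x)  # theta_hat = T(P_n), the plug-in estimate

    # Steps 1-2: draw n observations from P_n (i.e., resample with replacement)
    # B times and compute the root for each bootstrap sample.
    roots = np.empty(B)
    for j in range(B):
        x_star = rng.choice(x, size=n, replace=True)
        roots[j] = np.sqrt(n) * (statistic(x_star) - theta_hat)

    # Steps 3-4: quantiles of the estimated distribution of the root, then the
    # "reversed" interval derived above.
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)

# Illustrative use: a 95% bootstrap interval for the median of simulated data.
rng = np.random.default_rng(1)
x = rng.exponential(size=100)
print(bootstrap_ci(x, np.median, alpha=0.05, B=10_000, rng=2))
```

Note that the returned endpoints use the $1-\alpha/2$ quantile on the left and the $\alpha/2$ quantile on the right, matching the "reversed" interval derived earlier.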
2. Why Does It Work?
If $\hat H_n$ is close to $H_n$ then the bootstrap confidence interval will have coverage close to $1-\alpha$. Formally, one has to show that

$$
\sup_t \bigl| \hat H_n(t) - H_n(t) \bigr| \stackrel{P}{\to} 0,
$$

in which case

$$
\mathbb{P}(\theta \in C_n) \to 1-\alpha
$$

as $n \to \infty$.

It is non-trivial to show that $\sup_t | \hat H_n(t) - H_n(t) | \stackrel{P}{\to} 0$, but it has been shown in some generality. See Chapter 23 of van der Vaart (1998), for example.
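To make "coverage close to $1-\alpha$" concrete, here is a small, self-contained Monte Carlo check, again only a sketch with arbitrary choices of distribution, $n$, $B$, and number of simulations: draw many datasets from a standard Normal, build the bootstrap interval for the median each time, and count how often the true median (zero) is covered.

```python
import numpy as np

# Monte Carlo check of coverage for the bootstrap median interval.
# Data are standard Normal, so the true median is 0; the empirical coverage
# should come out roughly near the nominal 1 - alpha = 0.95.
rng = np.random.default_rng(3)
alpha, n, B, n_sims = 0.05, 50, 1_000, 200
true_median = 0.0
hits = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    theta_hat = np.median(x)
    # bootstrap distribution of the root sqrt(n) * (theta_star - theta_hat)
    roots = np.array([
        np.sqrt(n) * (np.median(rng.choice(x, size=n, replace=True)) - theta_hat)
        for _ in range(B)
    ])
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    lo, hi = theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)
    hits += (lo <= true_median <= hi)
print(hits / n_sims)  # empirical coverage
```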
3. Why Does It Fail?
The bootstrap does not always work. It can fail for a variety of reasons, such as when the dimension is high or when the functional $T$ is poorly behaved.
An example of a bootstrap failure is the problem of estimating phylogenetic trees (see Holmes 2003). The problem here is that the parameter $T(P)$ is an extremely complex object and the regularity conditions needed to make the bootstrap work are unlikely to hold.
In fact, this is a general problem with the bootstrap: it is most useful in complex situations, but these are often the situations where the theory breaks down.
4. What Do We Do?
So what do we do when the bootstrap fails? One answer is: subsampling. This is a variant of the bootstrap that works under much weaker conditions than the bootstrap. Interestingly, the theory behind subsampling is much simpler than the theory behind the bootstrap. The former involves little more than a simple concentration inequality while the latter uses high-powered techniques from empirical process theory.
So what is subsampling?
Stay tuned. I will describe it in my next post. In the meantime, I’d be interested to hear about your experiences with the bootstrap. Also, why do you think the bootstrap is not more popular in machine learning?
References
Efron, Bradley. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1-26.
Efron, Bradley, and Tibshirani, R. (1994). An Introduction to the Bootstrap. Chapman and Hall.
Holmes, Susan. (2003). Bootstrapping phylogenetic trees: Theory and methods. Statistical Science, 18(2), 241-255.
van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.
P.S. See also Cosma's post.