Statistics and Dr. Strangelove
Normal Deviate 2013-08-22
One of the biggest embarrassments in statistics is that we don’t really have confidence bands for nonparametric function estimation. This is a fact that we tend to sweep under the rug.
Consider, for example, estimating a density $p$ from a sample $X_1, \ldots, X_n \sim p$. The kernel estimator with kernel $K$ and bandwidth $h$ is
$$\hat p_h(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\!\left( \frac{x - X_i}{h} \right).$$
Let’s start with getting a confidence interval at a single point $x$.
Let $p_h(x) = \mathbb{E}[\hat p_h(x)]$ be the mean of the estimator and let $s_n(x)$ be its standard deviation. (A consistent estimator of $s_n(x)$ is $\hat s_n(x) = \hat\sigma(x)/\sqrt{n}$, where $\hat\sigma(x)$ is the sample standard deviation of $U_1, \ldots, U_n$ and $U_i = \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right)$.)
Now,
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} = \frac{\hat p_h(x) - p_h(x)}{s_n(x)} + \frac{p_h(x) - p(x)}{s_n(x)} \equiv \frac{\hat p_h(x) - p_h(x)}{s_n(x)} + \frac{b_n(x)}{s_n(x)},$$
where $b_n(x) = p_h(x) - p(x)$ is the bias. By the central limit theorem, the first term is approximately standard Normal:
$$\frac{\hat p_h(x) - p_h(x)}{s_n(x)} \rightsquigarrow N(0,1).$$
This suggests the confidence interval $\hat p_h(x) \pm z_{\alpha/2}\, \hat s_n(x)$, where $z_{\alpha/2}$ is the upper $\alpha/2$ Normal quantile.
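To make the pointwise interval concrete, here is a minimal Python sketch (not from the original post). It assumes a Gaussian kernel and a hand-picked bandwidth, and the function name `kde_pointwise_ci` is purely illustrative:

```python
import numpy as np
from scipy import stats

def kde_pointwise_ci(data, x, h, alpha=0.05):
    """Normal-theory interval for the kernel density estimate at a point x.

    Note: as discussed in the text, this interval is really centered on the
    smoothed density p_h(x), not on p(x), because the bias is ignored.
    """
    # U_i = (1/h) K((x - X_i)/h), here with a Gaussian kernel K
    u = stats.norm.pdf((x - data) / h) / h
    p_hat = u.mean()                             # kernel estimator at x
    s_hat = u.std(ddof=1) / np.sqrt(len(data))   # estimated standard error
    z = stats.norm.ppf(1 - alpha / 2)            # Normal quantile z_{alpha/2}
    return p_hat, (p_hat - z * s_hat, p_hat + z * s_hat)

# Illustrative usage on simulated data (the bandwidth h = 0.3 is an arbitrary choice)
rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(kde_pointwise_ci(X, x=0.0, h=0.3))
```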
The problem is the second term $b_n(x)/s_n(x)$. Recall that the optimal bandwidth balances the bias and the variance. This means that if we use the best bandwidth, then $b_n(x)/s_n(x)$ behaves like a constant, that is, $b_n(x)/s_n(x) \to c$ for some (unknown) constant $c$. The result is that, even as $n \to \infty$, the confidence interval is centered at $p_h(x)$ instead of $p(x)$. As a result, the coverage will be less than the intended coverage $1 - \alpha$.
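To see why the ratio settles at a constant, here is the standard back-of-the-envelope calculation for a second-order kernel (this step is not spelled out in the post; the constants $c_1$ and $c_2$ depend on $K$ and on the unknown $p$):
$$b_n(x) \asymp c_1 h^2, \qquad s_n(x) \asymp \frac{c_2}{\sqrt{nh}}, \qquad \frac{b_n(x)}{s_n(x)} \asymp \frac{c_1}{c_2}\, h^2 \sqrt{nh}.$$
The optimal bandwidth is of order $h \asymp n^{-1/5}$, and plugging this in gives $h^2 \sqrt{nh} = h^{5/2} n^{1/2} \asymp 1$: the ratio converges to a nonzero constant rather than vanishing.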
We could in principle solve this problem by undersmoothing. If we choose a bandwidth smaller than the optimal bandwidth, then $b_n(x)/s_n(x) \to 0$ and the problem disappears. But this creates two new problems. First, the confidence interval is now centered around a non-optimal estimator. Second, and more importantly, we don’t have any practical, data-driven methods for choosing an undersmoothed bandwidth. There are some proposals in the theoretical literature, but they are, frankly, not very practical.
Of course, all of these problems only get worse when we consider simultaneous confidence bands for the whole function.
The same problem arises if we use Bayesian inference. In short, the 95 percent Bayesian interval will not have 95 percent frequentist coverage. The reasons are the same as for the frequentist interval. (One can of course “declare” that coverage is not important and then there is no problem. But I am assuming that we do want correct frequentist coverage.)
The only good solution I know is to simply admit that we can’t create a confidence interval for $p(x)$. Instead, we simply interpret $\hat p_h(x) \pm z_{\alpha/2}\, \hat s_n(x)$ as a confidence interval for $p_h(x)$, which is a smoothed out (biased) version of $p(x)$:
$$p_h(x) = \int \frac{1}{h} K\!\left( \frac{x - u}{h} \right) p(u)\, du = p(x) + b_n(x), \qquad b_n(x) \approx c\, h^2,$$
where $c$ is an unknown constant. In other words, we just live with the bias. (Of course, the bias gets smaller as the sample size $n$ gets larger and the bandwidth $h$ gets smaller.)
Once we take this point of view, we can easily get a confidence band using the bootstrap. We draw $X_1^*, \ldots, X_n^* \sim P_n$ from the empirical distribution $P_n$ and compute the density estimator $\hat p_h^*$ from the bootstrap sample. By repeating this many times, we can find the quantile $z_\alpha$ defined by
$$P\left( \sqrt{nh}\, \sup_x |\hat p_h^*(x) - \hat p_h(x)| > z_\alpha \,\Big|\, X_1, \ldots, X_n \right) = \alpha.$$
The confidence band is then $(\ell(x),\, u(x))$, where
$$\ell(x) = \hat p_h(x) - \frac{z_\alpha}{\sqrt{nh}}, \qquad u(x) = \hat p_h(x) + \frac{z_\alpha}{\sqrt{nh}}.$$
We then have that
$$P\Big( \ell(x) \le p_h(x) \le u(x) \ \ \text{for all } x \Big) \to 1 - \alpha$$
as $n \to \infty$. Again, this is a confidence band for $p_h$ rather than for $p$.
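As a concrete illustration of this bootstrap band, here is a minimal Python sketch (not from the post). It uses a Gaussian kernel, approximates the supremum by a maximum over a finite grid, and the function names, grid, and bandwidth are illustrative choices:

```python
import numpy as np
from scipy import stats

def kde(grid, data, h):
    """Kernel density estimate p_hat on a grid, using a Gaussian kernel."""
    u = stats.norm.pdf((grid[:, None] - data[None, :]) / h) / h
    return u.mean(axis=1)

def bootstrap_band(data, grid, h, alpha=0.05, B=1000, seed=0):
    """Bootstrap confidence band for the smoothed density p_h (not for p)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    p_hat = kde(grid, data, h)
    sup_stats = np.empty(B)
    for b in range(B):
        # Draw X_1^*, ..., X_n^* from the empirical distribution (resample with replacement)
        resample = rng.choice(data, size=n, replace=True)
        p_star = kde(grid, resample, h)
        # sqrt(nh) * sup |p_h^* - p_hat|, with the sup taken over the grid
        sup_stats[b] = np.sqrt(n * h) * np.max(np.abs(p_star - p_hat))
    z_alpha = np.quantile(sup_stats, 1 - alpha)
    half_width = z_alpha / np.sqrt(n * h)
    return p_hat - half_width, p_hat + half_width   # (lower, upper) band for p_h

# Illustrative usage on simulated data
rng = np.random.default_rng(1)
X = rng.normal(size=500)
grid = np.linspace(-3.0, 3.0, 200)
lower, upper = bootstrap_band(X, grid, h=0.3)
```

The resulting band is to be read as covering the smoothed density $p_h$, in line with the coverage statement above.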
Summary: There really is no good, practical method for getting true confidence bands in nonparametric function estimation. The reason is that the bias does not disappear fast enough as the sample size increases. The solution, I believe, is to just acknowledge that the confidence bands have a slight bias. Then we can use the bootstrap (or other methods) to compute the band. Like Dr. Strangelove, we just learn to stop worrying and love the bias.