Statistics Declares War on Machine Learning!
Normal Deviate 2013-03-18
STATISTICS DECLARES WAR ON MACHINE LEARNING!
Well I hope the dramatic title caught your attention. Now I can get to the real topic of the post, which is: finite sample bounds versus asymptotic approximations.
In my last post I discussed Normal limiting approximations. One commenter, Csaba Szepesvari, wrote the following interesting comment:
What still surprises me about statistics or the way statisticians do their business is the following: The Berry-Esseen theorem says that a confidence interval chosen based on the CLT is possibly shorter by a good amount of . Despite this, statisticians keep telling me that they prefer their “shorter” CLT-based confidence intervals to ones derived by using finite-sample tail inequalities that we, “machine learning people prefer” (lies vs. honesty?). I could never understood the logic behind this reasoning and I am wondering if I am missing something. One possible answer is that the Berry-Esseen result could be oftentimes loose. Indeed, if the summands are normally distributed, the difference will be zero. Thus, an interesting question is the study of the behavior of the worst-case deviation under assumptions (or: for what class of distributions is the Berry-Esseen result tight?). It would be interesting to read about this, or in general, why statisticians prefer to possibly make big mistakes to saying “I don’t know” more often.
I completely agree with Csaba (if I may call you by your first name).
A good example is the binomial. There are simple confidence intervals for the parameter of a binomial that have correct, finite sample coverage. These finite sample confidence sets have the property that
for every sample size . There is no reason to use the Normal approximation.
And yet …
I want to put on my Statistics hat and remove my Machine Learning hat and defend my fellow statisticians for using Normal approximations.
Open up a random issue of Science or Nature. There are papers on physics, biology, chemistry and so on, filled with results that look like, for example:
These are usually point estimates with standard errors based on asymptotic Normal approximations.
The fact is, science would come to a screeching halt without Normal approximations. The reason is that finite sample intervals are available in only a tiny fraction of statistical problems. I think that Machine Learners (remember, I am wearing my Statistics hat today) have a skewed view of the universe of statistical problems. The certitude of finite sample intervals is reassuring but it rules out the vast majority of statistics as applied to real scientific problems.
Here is a random list, off the top of my head, of problems where all we have are confidence intervals based on asymptotic Normal approximations:
(Most) maximum likelihood estimators, robust estimators, time series, nonlinear regression, generalized linear models, survival analysis (Cox proportional hazards model), partial correlations, longitudinal models, semiparametric inference, confidence bands for function estimators, population genetics, etc.
Biostatistics and cancer research would not exist without confidence intervals based on Normal approximations. In fact, I would guess that every medication or treatment you have ever had, was based on medical research that used these approximations.
When we use approximate confidence intervals, we always proceed knowing that they are approximations. In fact, the entire scientific process is based on numerous, hard to quantify, approximations.
Even finite sample intervals are only approximations. They assume that the data are iid draws from a distribution which is a convenient fiction. It is rarely exactly true.
Let me repeat: I prefer to use finite sample intervals when they are available. And I agree with Csaba that it would be interesting to state more clearly the conditions under which we can tighten the Berry-Esseen bounds. But approximate confidence intervals are a successful, indispensable part of science.
P.S. We’re comparing two frequentist ideas here: finite sample confidence intervals versus asymptotic confidence intervals. Die hard Bayesians will say: Bayes is always a finite sample method. Frequentists will reply: no, you’re just sweeping the complications under the rug. Let’s refrain from the Bayes vs Frequentist arguments for this post. The question being discussed is: given that you are doing frequentist inference, what are your thoughts on finite sample versus asymptotic intervals?