I was wrong about statistics
Win-Vector Blog 2015-08-18
I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about the statistical mistakes I have made, as a mere working data scientist, and more a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).
[Image: Used wrong (Justin Baeder, some rights reserved).]
I know, an unapologetic “apologia” opens me and my work to more criticism (or “two for flinching”). But this is what it can be like for a non-statistician to work with or in front of some statisticians (in fact many statisticians, but none of the big ones). I suspect, from speaking with other data scientists, it is not just me.
I apologize now for displaying a thin skin.
Background
I have made errors when speaking and writing about statistics. This shouldn’t come as a shock as there isn’t anybody who hasn’t made errors. I have no doubt made more errors in writing this article.
But what has been a bit disturbing is that many times, instead of being merely corrected (when I am wrong) or asked for clarification (when I may be right), I am publicly accused of being stupid, ignorant, or willfully disseminating falsehoods. I find this disappointing (and I come from a field, theoretical computer science, where if they get excited during your whiteboard presentation they may grab the marker out of your hand to contribute). It would seem graduate schools are not finishing schools.
I’ll admit, I am not a statistician. Some statisticians take that to mean I am ignorant and uneducated in issues of probability. In fact I am very interested in the theory of probability, and come to probability through a fairly long path. I was trained in mathematics and theoretical computer science before becoming a data science carpetbagger. Probability relevant topics I have studied include:
- Integral calculus (including the Riemann, Darboux, and Lebesgue integrals).
- Real and complex analysis.
- Measure theory (σ-algebras, Lebesgue measure, Borel measures and so-on).
- Transform theory (convolutions, Fourier transform, Laplace transform, Z transform).
- Distribution theory (generalized distributions as maps from functions to numbers, allowing things like a rigorous non-limit based treatment of the Dirac Delta Function).
- Kolmogorov axiomatic probability (random variables as measurable functions from probability spaces).
- Geometric probability, volumes, and mixed volumes.
- Markov chains, Martingales, heat/diffusion equations, and stochastic calculus.
- The probabilistic method (linearity of expectation, the Poisson paradigm, pseudorandomness, and more).
- Concentration inequalities (Chernoff bound, Hoeffding’s inequality, Talagrand’s concentration inequality, Lovász local lemma, …).
- The history of concepts of probability (von Mises collectives, de Finetti exchangeability).
- Information theory (entropy, asymptotic equipartition, Kolmogorov–Chaitin complexity, Cramer-Rao inequality, …).
- Betting systems (Kelly criterion, small Martingale, and A/B testing).
I am not trying to say I know a lot. What I am trying to say is: I probably know enough to understand the basis of a statistical concept. If you try to correct me politely, I may be capable of understanding your point and learning from you. My terminology may differ from statistical canon, as I may have learned a common concept in a different field (where it may in fact have a different common name).
My sins
I am going to confess a few of my sins. And as I am writing to a technical audience, I will allow myself some technical examples.
I wrote sum_{i=1...n} (x[i]-mean(x))^2/n for variance.
I wrote sum_{i=1...n} (x[i]-mean(x))^2/n for variance. Well, not quite. It was in R and I wrote (on page 155 of Practical Data Science with R):
# toy data and an intercept-only model
d <- data.frame(y=c(2,2,3,3,3))
m <- lm(y~1,data=d)
# degrees of freedom: rows minus the number of fit coefficients
df <- nrow(d)-length(coefficients(m))
# the "Residual standard error" reported by summary(m)
sqrt(sum((residuals(m)^2)/df))
Superficially, there is no “-1” in there. The reason I wrote this code is I was trying to teach the exact calculation needed to reproduce the “Residual standard error” line found in R’s summary(m) (one of the goals of Practical Data Science with R was to de-mystify summary.lm() by documenting and showing how to calculate every element of summary.lm()).
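As a quick check (re-using d, m, and df from the block above), the by-hand number should agree with what R prints; summary.lm() exposes the same quantity as its sigma component:
# the value summary(m) reports as "Residual standard error"
summary(m)$sigma
# the by-hand calculation above; the two should match
sqrt(sum((residuals(m)^2)/df))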
Now there are a few points in my defense.
- I am trying to reproduce standard calculations. So I don’t get to choose if I divide by n or n-1. To explain summary(m) I would have to pick the one that matches the existing summary(m).
- If you look at this for a moment you will notice that if n is the number of rows in our data frame, then the quantity I am dividing by is in fact n-1 (as we have length(coefficients(m))==1); the short check after this list makes that concrete. I am guessing those correcting me divide by n-1 because they have been told “population variance: divide by n; sample variance: divide by n-1” and refuse to accept that Bessel’s correction can be arrived at as an attempt to correct for the number of modeling parameters.
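A minimal check of that claim, using the same d and m as above:
# for the intercept-only model the divisor is rows minus number of coefficients,
# which is exactly n-1
nrow(d)                             # n = 5
length(coefficients(m))             # 1 (just the intercept)
nrow(d) - length(coefficients(m))   # 4 = n-1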
Or, on page 171 (actually by my co-author): “Null model has (number of data points – 1) degrees of freedom.” This is from code reproducing the elements of summary.glm() (actually my co-author went a bit further and included the chi-squared statistic, a standard significance-of-fit statistic that summary.glm() does not itself supply in its default implementation).
Yet, both of us have received feedback on these sections saying that we have no idea how to compute a sample variance.
Now I am not saying we are perfect. There may (or may not) be places we have written “S/n” instead of “S/(n-1)”. What I am saying is: even this wouldn’t indicate we didn’t understand the distinction. But when corrected I am never accused of mere carelessness (which I would in fact like to apologize for), but of willful ignorance. The aggressive correction is almost always “clearly you don’t know you need to divide by n-1” instead of “wouldn’t dividing by n-1 be better?”
In practice things are not as simple as always mechanically applying the one true method. We are always searching for better estimates (less bias, lower variance, weaker assumptions, or easier calculation). I guess if you are used to teaching statistics to out-of-major beginning students you develop a strong binary and didactic feeling of right (the procedure you actually taught in the course) and wrong (anything else the students write, often in fact gibberish). I am not trying to be inclusive or think relatively here; it is just that correcting people is a bit harder than one would think (so it pays to be polite when attempting it).
For example: is Sn = sum_{i=1...n} (x[i]-mean(x))^2/n in fact a wrong estimate of variance? It clearly is good enough as n gets large. I would also argue it is not in fact wrong. Oh, it is biased (tends to be too low on average), and Sn1 = sum_{i=1...n} (x[i]-mean(x))^2/(n-1) is unbiased. But bias is not the only concern we might be trading off. We know Sn is a maximum likelihood estimate (desirable for its own reasons) and also a lower variance estimate (or more statistically efficient) than Sn1.
[ By Cochran’s theorem we know that Sn is distributed as S chi-sq(n-1)/n and Sn1 as S chi-sq(n-1)/(n-1) (where S is the unknown true variance). But the chi-sq(n-1) distribution has mean n-1 and variance 2(n-1). So Sn1-S has mean 0 and variance 2 S^2/(n-1), and Sn-S has mean -S/n and variance 2(n-1) S^2/n^2. Thus in moving from Sn to Sn1 we removed S/n units of bias in exchange for additional variance (roughly 4 S^2/n^2 more for large n). For large enough n this is not an obvious good trade. ]
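To make the trade-off concrete, here is a small simulation sketch (normal data with true variance 1; var_n and var_n1 are helper names made up for this illustration, not anything from the book):
# compare the divide-by-n and divide-by-(n-1) variance estimates by simulation
set.seed(2015)
n <- 10
trueVar <- 1
var_n  <- function(x) sum((x-mean(x))^2)/length(x)      # biased low, lower variance
var_n1 <- function(x) sum((x-mean(x))^2)/(length(x)-1)  # unbiased (Bessel corrected)
sims <- replicate(100000, rnorm(n, sd=sqrt(trueVar)))
estN  <- apply(sims, 2, var_n)
estN1 <- apply(sims, 2, var_n1)
c(bias=mean(estN)-trueVar,  spread=var(estN))    # divide-by-n: biased low, tighter
c(bias=mean(estN1)-trueVar, spread=var(estN1))   # divide-by-(n-1): unbiased, wider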
I used the word “heteroscedastic” wrong.
In writing about linear regression I used the word “heteroscedastic” wrong. The issue is you are heteroscedastic “if there are sub-populations that have different variabilities from others” (Wikipedia heteroscedasticity). Based on my readings of the history of probability (the problems von Mises ran into with controlling selections of sub-populations, and the importance of exchangeability to de Finetti’s formulation of probability) I would propose altering the definition to: “computationally identifiable sub-populations (either through explicit or omitted variables).”
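For instance, a minimal simulated sketch of the sub-population flavor of the definition (all names below are made up for illustration): an indicator splits the data into two identifiable sub-populations with different error variances.
# simulate a heteroscedastic regression: one sub-population has noisier errors
set.seed(2015)
n <- 1000
group <- rbinom(n, 1, 0.5)           # computationally identifiable sub-population
x <- rnorm(n)
errSd <- ifelse(group==1, 3, 0.5)    # different variabilities by sub-population
y <- 2*x + rnorm(n, sd=errSd)
fit <- lm(y ~ x)
tapply(residuals(fit), group, sd)    # residual spread differs across sub-populations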
However, I (in error) wrote in a footnote: “Heteroscedastic errors are errors whose magnitude is correlated with the quantity to be predicted.”
There are in fact at least four things wrong with what I wrote.
- If we assume a generative model y[i] = b.x[i] + e[i] then of course the error terms are correlated with the ys, as they are in there! What I was thinking is: we have a problem if the error e is correlated with the signal or systematic portion of y: b.x or the x.
- I should have written “unsigned magnitude.” I am a computer scientist; to me magnitudes are the unsigned portion of a numeric representation (as in “sign and magnitude”). However, magnitude is also used for logarithmic scales of things like power, and in such senses can be negative (such as negative earthquake magnitudes).
- I should have written “depends upon” or “varies with” instead of “correlated.” I am fully aware we can have an unsigned magnitude that is a non-constant function of a numeric variable, yet not correlated with it. For example, with the (x,y) pairs {(-1,2),(0,1),(1,2)} we can write y as a function of x, but there is no linear correlation (see the short check after this list).
- I need to spend some time explaining we are talking about expected errors/variabilities, not mere instantiated empirical errors (observed or unobserved).
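A quick check of that dependence-without-correlation point in R:
# y is a deterministic function of x, yet the Pearson correlation is exactly zero
x <- c(-1, 0, 1)
y <- c( 2, 1, 2)
cor(x, y)   # 0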
I truly wish I could condense good statistical advice as efficiently as problems pile up (especially when you don’t have space for a lot of caveats and worked examples to indicate intent and meaning). I’ve tried to address this in errata, but still regret my error. I also regret bringing the issue up at all, given I didn’t have enough space to lay out enough context and caveats. Really what I wanted to discuss is when it is safe to apply linear regression (the topic of our next section).
I assumed normality in linear regression.
I said you need to check that the residuals are normal when using linear regression (and hopefully I never accidentally wrote you need the dependent or independent variables to be normal, as that is way too restrictive). This is open to criticism.
However there is a bit of gamesmanship going on here. You can pretty much criticize any position one takes on normality of errors. It goes like this:
- If the writer claims you need normal errors, you can point out that this is not among the requirements of the Gauss-Markov Theorem (the big-boy theorem in linear regression since at least 1809, though Gauss used a Bayesian argument based on normal distributions). More important are model structure, uniformly bounded variances, and homoscedasticity.
- If the writer doesn’t claim you need normal errors, you can point out that it is standard practice to check normality of errors through the q-q plot (so it is negligent not to worry about this; a minimal q-q check is sketched just after this list). You can further ask how the writer expects to apply coefficient t-tests and goodness-of-fit F-tests without strong distributional assumptions. You can also find references that claim you do need normal errors and even normal variables (which is not in fact necessary).
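For reference, the standard q-q check is just the following (fit here is a placeholder name for some fitted linear model):
# quantile-quantile plot of residuals against a normal reference
qqnorm(residuals(fit))
qqline(residuals(fit))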
The actual case is there is a lot of disagreement on what are the convenient assumptions needed for reliable linear regression (see here, here, here, and here). You can find well-regarded statisticians on just about any side of this, depending on the context (are they modeling, teaching, or proving theorems). Just don’t have an opinion either way if you are a non-statistician.
The minimal assumptions do not in fact seem to include normality of residuals, xs, or ys. However, minimal may not be the same as convenient in all contexts. For example: in teaching you might invoke normality to give a concrete property and simpler special-case proofs (which then get mis-quoted out of context as necessities in the general case). Or: if for domain-specific reasons you know the errors should be normal (such as expected measurement errors when estimating the orbits of celestial bodies) then it pays to check if the residuals are normal (as if the residuals are not normal, they must include things other than the known-to-be-normal errors).
There is also the issue of what you mean by “linear regression.” If you are just going to fit the model and use it without looking at the diagnostics, you may need fewer assumptions than somebody who (wisely) decides to look at the diagnostics. What additional assumptions you need depends on what tests you run. For example, if you run an F-test on the goodness of fit you are then “sensitive to non-normality”, but this is non-normality of the sampling distribution (which can itself be normal even with non-normal residuals under fairly mild assumptions, if you have enough data). However, some derivations of the F-test assume normality of errors (the strong assumption we would rather not have to make; for example result 7.2.1 on page 220 here). Similarly the t-tests on coefficient significance have some distributional assumptions, which can either be satisfied by strong assumptions on the residuals, or by weak assumptions on the residuals and a lot of data (making the sampling distribution well behaved).
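For concreteness, here is where those statistics show up in R output (a sketch on made-up data; nothing here is from the book):
# fit a simple linear model on simulated data and pull out the test statistics
set.seed(2015)
d2 <- data.frame(x=rnorm(50))
d2$y <- 1 + 2*d2$x + rnorm(50)
fit2 <- lm(y ~ x, data=d2)
coef(summary(fit2))         # per-coefficient estimates, standard errors, t-tests
summary(fit2)$fstatistic    # the overall goodness-of-fit F statistic and its dfs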
Frankly, a reason it is hard to state reasonable assumptions for regression is there doesn’t seem to be a community-agreed-upon good primary source for what are considered the strongest derivations (hence with the weakest assumptions) of the F-test (and for the chi-square or log-likelihood tests for logistic regression). A clear sign of what the community considers the best proof method would be a great boon (for instance, in mathematics collecting proofs by style is considered very important as it hints at what holds as we vary assumptions and domain; example: Proofs of Fermat’s little theorem). Most sources either neglect the F-test, merely use the F-test, derive it using strong assumptions of normality, derive it in a non-regression setting (single-variable contingency tables, or single-variable ANOVA), or derive it only for single regression (which doesn’t show how the number of parameters adjusts the degrees of freedom). With the right reference, identifying the right conditions is a simple matter of proof analysis. I have found derivations of the F-test (including its ANOVA history), but if they are all considered “bad” any conclusions derived from them would be hard to defend. So I’ll say it: I don’t know a good reference for the derivation of the F-test (especially one that answers whether residual normality is considered a standard requirement and one that deals directly with multiple regression), and I would appreciate a recommendation.
Our commitment
We are going to keep writing and keep teaching on statistical topics. We choose and arrange topics to reflect what we have found to be the most important issues in practice. We do try to make everything correct, but the order in which we address things is determined by historic project impact and not traditional teaching order. Just because we haven’t addressed a topic yet doesn’t mean we don’t know about it; it is just that something more urgent may have cut in front. Because of this we feel our work (book, course, blog) is one of the best ways to build data science mastery.
Conclusion
In 2009 Hal Varian, chief economist at Google, famously said:
“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”
(NYT 2009).
There is some concern that computer science style data science is stealing some of the thunder. Mostly I agree with Professor Matloff: statistics has a lot to offer that is needlessly being missed. Though I also wonder if we are both encroaching on a field that has been historically owned by operations research.
I myself am considered a relatively statistically friendly data scientist who often criticizes data science in statistical terms (for example my talk at the recent Data Science Summit: “Statistics in the age of data science, issues you can and can not ignore.”). I repeat: for all my faults I am one of the data scientists who has a more friendly (better than, say, median) relationship with statisticians.
That being said: I’ll weigh in on the “why is no statistical vision triumphant?” Data science is rising not just because of data engineering. It is also rising because the business environment is largely collaborative and requires effective consulting skills and behaviors. Collaborating in a positive way with others (valuing their work, even while improving their working style) is decisive. Live and teach that and you win. Yes, I am saying software engineering is somewhat socialized (has long dealt with issues of training, project management, agile practices, working with uncertainty, recording history, and remote collaboration). I am also saying: for any of us (including myself) self improvement and collaboration are always quicker ways to get ahead than trying to pull down others (as there are too many others for tearing down others to be an effective strategy).
I want more statisticians on my data science teams. We can even call it a statistical applications team, as long as that doesn’t get me kicked off the team.
“I was wrong” Sisters of Mercy
So I was wrong
I was wrong to ever doubt
I can get along without