Reading the Gauss-Markov theorem

Win-Vector Blog 2014-09-04

What is the Gauss-Markov theorem?

From “The Cambridge Dictionary of Statistics” B. S. Everitt, 2nd Edition:

A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense of having lower dispersion about the mean) than any other unbiased linear estimator.

This is pretty much considered the “big boy” reason least squares fitting can be considered a good implementation of linear regression.

Suppose you are building a model of the form:

    y(i) = B . x(i) + e(i)

where B is a vector (to be inferred), i is an index that runs over the available data (say 1 through n), x(i) is a per-example vector of features, and y(i) is the scalar quantity to be modeled. Only x(i) and y(i) are observed. The e(i) term is the un-modeled component of y(i) and you typically hope that the e(i) can be thought of unknowable effects, individual variation, ignorable errors, residuals, or noise. How weak/strong assumptions you put on the e(i) (and other quantities) depends on what you know, what you are trying to do, and which theorems you need to meet the pre-conditions of. The Gauss-Markov theorem assures a good estimate of B under weak assumptions.

How to interpret the theorem

The point of the Gauss-Markov theorem is that we can find conditions ensuring a good fit without requiring detailed distributional assumptions about the e(i) and without distributional assumptions about the x(i). However, if you are using Bayesian methods or generative models for predictions you may want to use additional stronger conditions (perhaps even normality of errors and even distributional assumptions on the xs).

We are going to read through the Wikipedia statement of the Gauss-Markov theorem in detail.

Wikipedia’s stated pre-conditions of the Gauss-Markov theorem

To apply the Gauss-Markov theorem the Wikipedia says you must assume your data has the following properties:

   E[e(i)] = 0                   (lack of structural errors, needed to avoid bias)   V[e(i)] = c                   (equal variance, one form of homoscedasticity)   cov[e(i),e(j)] = 0  for i!=j  (non-correlation of errors)

It is always important to know precisely what probability model the expectation (E[]), variance (V[]), and covariance (cov[]) operators are working over in the Wikipedia conditions. This is usually left implicit, but it is critical to know exactly what is being asserted. When reading/listening about statistical or probabilistic work you should always insist on a concrete description of the probability model underlying all the notation (the E[]s and V[]s). A lot of confusion and subtle tricks get hidden by not sharing an explicit description of the probability model.

Probability models

Two plausible probability models are:

  • Frequentist: unobserved parameters are held constant and all probabilities are over re-draws of the data. At first guess you would think this is the correct model for this problem, as the content of the Gauss-Markov theorem is about how estimates drawn from a larger population perform in expectation.
  • x-Generative: This is not standard and not immediately implied by the notation (and represents a fairly strong set of assumptions). In this model all of the observed xs are held constant and unobserved es and ys are regenerated with respect to the xs. This is similar to a Bayesian generative model, except in the usual Bayesian formulation all observables (both xs and ys) are held fixed. We only introduce this model as it seems to be the simplest one which makes for a workable interpretation of the Wikipedia statements.

The issue is: the conditions as stated are not strong enough to ensure actual homoscedasticity (or even non-structure of errors/bias) needed to apply the Gauss-Markov theorem under a strict frequentist model. So we must go venue-shopping and find what model is likely intended. An easy way to do this is to design synthetic data that is considered well-behaved under one model and not under the other.

A source of examples

Let’s use a deliberately naive empirical view of data. Suppose the entire possible universe of data is X(i),Y(i),Z(i) i=1...k for some k (k and X(i),Y(i),Z(i) all finite real vectors). Our chosen explicit probability model for generating the observed data x(i),y(i) and unobserved e(i) is the following. We pick a length-n sequence of integers s(1),...,s(n) where each s(i) is picked uniformly and independently from 1...k and add a bit of unique noise. Our sample data is then (only x(i),y(i) are observed, e(i) is an unobserved notional quantity):

   (x(i),y(i),e(i)) = (X(s(i)),Y(s(i))+t(i),Z(s(i))+t(i)) for i=1...n,        where t(i) is an independent normal variable with mean 0 and variance 1

This is similar to a standard statistical model (empirical re-sampling from a fixed set, and designed to be similar to a sampling distribution). Z(i) represents an idealized error term and e(i) represents a per-sample unobserved realization of Z(i). It is a nice model because the e(i) are independently identically distributed (and so are the x(i) and y(i), though obviously there can be dependencies between the x,y and es). This model can be thought of as “too nice” as it isn’t powerful enough to capture the full power of the Gauss-Markov theorem (it can’t express non- independent identically distributed situations). However it can concretely embody situations that do meet the Gauss-Markov conditions and be used to work clarifying examples.

Good examples under the frequentist probability model

Let’s see what conditions on X(i),Y(i),Z(i) i=1...k are needed to meet the Gauss-Markov pre-conditions assuming a frequentist probability model.

  • The first one is easy: E[e(i)] = 0 if and only if sum_{j=1...k} Z(j) = 0.
  • When we have E[e(i)]=0 the second condition (homoscedasticity as stated) simplifies to V[e(i)] = E[(e(i) - E[e(j)])^2] = E[e(i)^2] = E[Z^2] + 1 which is independent of i.
  • When we have E[e(i)]=0 the third condition simplifies to E[e(i) e(j)] = 0 for i!=j. And then follows immediately from our overly strong condition of the index selections s(i) being independent (giving us E[e(i) e(j)] = E[e(i)] E[e(j)] = 0 for i!=j).

So all we need is: sum_{j=1...k} Z(j) = 0 and then the other conditions hold. This seems too easy, and is evidence that the frequentist probability model is not the model intended by Wikipedia. We will confirm this with a specific counter example later.

Good examples under the x-generative probability model

Under the x-generative probability model (and this is not standard terminology) the Wikipedia conditions are more properly written conditionally:

   E[e(i)|x(i)] = 0   V[e(i)|x(i)] = c   cov[e(i),e(j)|x(i),x(j)] = 0  for i!=j

Or more precisely: if the conditions had been written in their conditional form we wouldn’t have to contrive a phrase like “x-generative model” to ensure the correct interpretation. These conditions are strict. Checking or ensuring these properties is a problem when x is continuous and we have a joint description of how x,y,e are generated (instead of a hierarchical one). These conditions as stated are strong enough to support the Gauss-Markov theorem, but probably in fact stronger than the minimum or canonical conditions. But let’s see how they work.

To meet these conditions our Z(i) must pretty much be free of dependence on x(i) (even one snuck through the index i). This is somewhat unsatisfying as our overly simple modeling framework (producing x,y,e from X,Y,Z) combined with these strong conditions don’t really model much more than identical independence (so do not capture the full breadth of the Gauss-Markov theorem). The frequentist conditions are too lenient to work and the x-generative/conditioned conditions seem too strong (at least when combined with our simplistic source of examples).

A good example

The following R example (also available here) shows a data set generated under our framework where the Gauss-Markov theorem applies (under either probability model). In this case the true y is produced as an actual linear function of x plus iid (independent identically distributed) noise. This model meets the pre-conditions of the Gauss-Markov condition (under both the frequentist and x-generative models). We observe that the empirical samples average out to the correct theoretical coefficients taken from the original universal population. All of the calculations are designed to match the quantities discussed in the Wikipedia derivations.

library(ggplot2)workProblem <- function(dAll,nreps,name,sampleSize=10) {   xAll <- matrix(data=c(dAll$x0,dAll$x1),ncol=2)   cAll <- solve(t(xAll) %*% xAll) %*% t(xAll)   beta <- as.numeric(cAll %*% dAll$y)   betaSamples <- matrix(data=0,nrow=2,ncol=nreps)   nrows <- dim(dAll)[[1]]   for(i in 1:nreps) {      dSample <- dAll[sample.int(nrows,sampleSize,replace=TRUE),]      individualError <- rnorm(sampleSize)      dSample$y <- dSample$y + individualError      dSample$e <- dSample$z + individualError      xSample <- matrix(data=c(dSample$x0,dSample$x1),ncol=2)      cSample <- solve(t(xSample) %*% xSample) %*% t(xSample)      betaS <- as.numeric(cSample %*% dSample$y)      betaSamples[,i] <- betaS   }   d <- c()   for(i in 1:(dim(betaSamples)[[1]])) {      coef <- paste('beta',(i-1),sep='')      mean <- mean(betaSamples[i,])      dev <- sqrt(var(betaSamples[i,])/nreps)      d <- rbind(d,data.frame(nsamples=nreps,model=name,coef=coef,         actual=beta[i],est=mean,estP=mean+2*dev,estM=mean-2*dev))   }   d}repCounts <- as.integer(floor(10^(0.25*(4:24))))print('good example')## [1] "good example"set.seed(2623496)dGood <- data.frame(x0=1,x1=0:10)dGood$y <- 3*dGood$x0 + 2*dGood$x1dGood$z <- dGood$y - predict(lm(y~0+x0+x1,data=dGood))print(dGood)##    x0 x1  y          z## 1   1  0  3 -9.326e-15## 2   1  1  5 -7.994e-15## 3   1  2  7 -7.105e-15## 4   1  3  9 -5.329e-15## 5   1  4 11 -5.329e-15## 6   1  5 13 -3.553e-15## 7   1  6 15 -1.776e-15## 8   1  7 17 -3.553e-15## 9   1  8 19  0.000e+00## 10  1  9 21  0.000e+00## 11  1 10 23  0.000e+00print(summary(lm(y~0+x0+x1,data=dGood)))## Warning: essentially perfect fit: summary may be unreliable## ## Call:## lm(formula = y ~ 0 + x0 + x1, data = dGood)## ## Residuals:##       Min        1Q    Median        3Q       Max ## -2.77e-15 -1.69e-15 -5.22e-16  4.48e-16  6.53e-15 ## ## Coefficients:##    Estimate Std. Error t value Pr(>|t|)    ## x0 3.00e+00   1.58e-15 1.9e+15   <2e-16 ***## x1 2.00e+00   2.67e-16 7.5e+15   <2e-16 ***## ---## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 2.8e-15 on 9 degrees of freedom## Multiple R-squared:     1,   Adjusted R-squared:     1 ## F-statistic: 1.47e+32 on 2 and 9 DF,  p-value: <2e-16print(workProblem(dGood,10,'good/works',10000))##   nsamples      model  coef actual   est  estP  estM## 1       10 good/works beta0      3 3.006 3.016 2.995## 2       10 good/works beta1      2 1.999 2.001 1.997pGood <- c()set.seed(2623496)for(reps in repCounts) {  pGood <- rbind(pGood,workProblem(dGood,reps,'goodData'))}ggplot(data=pGood,aes(x=nsamples)) +  geom_line(aes(y=actual)) +  geom_line(aes(y=est),linetype=2,color='blue') +  geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +  facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +  theme(axis.title.y=element_blank())

Notice the code is using the “return data frames” principle. The derived graph shows what we expect from an unbiased low-variance estimate: convergence to the correct values as we increase number of repetitions.

Hetexample1

A bad example

The following R example meets all of the Wikipedia stated conditions of the Gauss-Markov theorem under a frequentist probability model, but doesn’t even exhibit unbiased estimates- let alone a minimal variance such on small samples. It does produce correct estimates on large samples (so one could work with it), but we are not seeing unbiasedness (let alone low variance) on small samples. For this example: the ideal distribution and large samples are unbiased (but have some ugly structure), yet small samples appear biased.

This bad example is essentially given as: y = x^2 and we haven’t made x^2 available to the model (only x). So this data set doesn’t actually follow the assumed linear modeling structure. However, we can be sophists and claim the effect to model is y = 10*x - 15 + e (which is linear in the features we are making available) and the error term is in fact e=x^2 - 10*x + 15 + individualError (which does have an expected value of zero when x is sampled uniformly from the integers 0...10).

This data set is designed to slip past the Gauss-Markov theorem pre-conditions under the frequentist interpretation. As we have shown all we need to do is check sum_{k} Z(k) is zero and the rest of the properties follow. In our case we have sum_{k} Z(k) = sum_{x=0...10} (x^2 - 10*x + 15) = 0. This data set does not slip past the Gauss-Markov theorem pre-conditions under the x-generative model as the obviously structured error term is what they are designed to prohibit/avoid. This sets us up for the following syllogism.

  • This data set satisfies the Gauss-Markov theorem pre-conditions under the frequentist model.
  • Our R simulation shows the data set doesn’t satisfy the conclusions of the Gauss-Markov theorem.
  • We can then conclude the Gauss-Markov theorem pre-conditions can’t be based on the frequentist model.

We confirm this with the following R-simulation.

dBad <- data.frame(x0=1,x1=0:10)dBad$y <- dBad$x1^2 # or y = -15 + 10*x1 with structured errordBad$z <- dBad$y - predict(lm(y~0+x0+x1,data=dBad))print('bad example')## [1] "bad example"print(dBad)##    x0 x1   y   z## 1   1  0   0  15## 2   1  1   1   6## 3   1  2   4  -1## 4   1  3   9  -6## 5   1  4  16  -9## 6   1  5  25 -10## 7   1  6  36  -9## 8   1  7  49  -6## 9   1  8  64  -1## 10  1  9  81   6## 11  1 10 100  15print(summary(lm(y~0+x0+x1,data=dBad)))## ## Call:## lm(formula = y ~ 0 + x0 + x1, data = dBad)## ## Residuals:##    Min     1Q Median     3Q    Max ##  -10.0   -7.5   -1.0    6.0   15.0 ## ## Coefficients:##    Estimate Std. Error t value Pr(>|t|)    ## x0  -15.000      5.508   -2.72    0.023 *  ## x1   10.000      0.931   10.74    2e-06 ***## ---## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 9.76 on 9 degrees of freedom## Multiple R-squared:  0.966,  Adjusted R-squared:  0.959 ## F-statistic:  128 on 2 and 9 DF,  p-value: 2.42e-07print(workProblem(dBad,10,'bad/works',10000))##   nsamples     model  coef actual    est   estP    estM## 1       10 bad/works beta0    -15 -14.92 -14.81 -15.023## 2       10 bad/works beta1     10   9.99  10.01   9.971print(sum(dBad$z*dBad$x0))## [1] -7.816e-14print(sum(dBad$z*dBad$x1))## [1] -1.013e-13pBad <- c()set.seed(2623496)for(reps in repCounts) {  pBad <- rbind(pBad,workProblem(dBad,reps,'badData'))}ggplot(data=pBad,aes(x=nsamples)) +  geom_line(aes(y=actual)) +  geom_line(aes(y=est),linetype=2,color='blue') +  geom_ribbon(aes(ymax=estP,ymin=estM),alpha=0.2,fill='blue') +  facet_wrap(~coef,ncol=1,scales='free_y') + scale_x_log10() +  theme(axis.title.y=element_blank())

Hetexample2

Notice even when we drive the number of repetitions high enough to collapse the error bars we still have one of the coefficient estimates routinely below its ideal value. This is what a biased estimation procedure looks like. Again, it isn’t strictly correct to say we the problem is due to heteroscedasticity, as we are seeing bias (not just systematic changes in magnitude of variation).

The reason the average of small samples retains bias on this example is: least squares fitting is a non-linear function of the xs (it is only linear in the ys). Without an additional argument (such as the Gauss-Markov theorem) to appeal to there is no a priori reason to believe an average of non-linear estimates will converge to the original population values. However, we feel it is much easier to teach a conclusion like this from stronger assumptions such as identically independent distributed errors than from homoscedasticity. The gain in generality in basing inference on homoscedasticity is not really so large and the loss in clarity is expensive. The main downside of basing inference on identically independent distributed errors appears to be: you get accused of not knowing of the Gauss-Markov theorem.

What is homoscedasticity/heteroscedasticity

Heteroscedasticity is a general undesirable modeling situation where the variability of some of your variables changes from sub-population to sub-population. That is what the Wikipedia requirement is trying to get at with V[e(i)]=c. However as we move from informal text definitions to actual strict mathematics we have to precisely specify: what is varying with respect to what and which sub-populations do we consider identifiable?

Also be aware that while data with structured errors (the sign of errors being somewhat predictable from xs or even from omitted variables) can not be homoscedastic, it is not traditional to call such situations heteroscedastic (but to instead point out the structural error and say in the presence of such problems the question between homoscedastic/heteroscedastic does not apply).

We would also point out that B.S. Everitt’s “The Cambridge Dictionary of Statistics” 2nd edition does not have primary entries for homoscedastic or heteroscedastic. Our opinion is not that Everitt forgot them or did not know of them. But, likely Everitt found the criticism he would get for leaving these entries out of his dictionary would be less than the loss of clarity/conciseness that would come from including them (and the verbiage needed to respect their detailed historic definitions and conventions).

For our part: we have come to regret ever having used the term “heteroscedacity” (which we have only attempted out of respect to our sources, which use the term). It is far simpler to introduce an ad-hoc term like structural errors and supply a precise definition and examples of what is meant in concise mathematical notation. What turns out to be complicated is: using standard statistical terminology which comes with a lot of conventions and historic linguistic baggage. Part of the problem is of course our own background is mathematics, not statistics. In mathematics term definitions tend to be revised to fit use and intended meaning, instead of being frozen to document priority (as is more common in sciences).

Summary/conclusions

Many probability/statistical write-ups fail to explicitly identify what probability model is actually underling operators such as E[],V[], and cov[]. This is for brevity and pretty much the standard convention. Common probability models to consider include: frequentist (all parameters held constant and data regenerated), Bayesian (all observables held constant and probability statements are over distributions of unobserved quantities and parameters), and ad-hoc generative/conditional distributions (as we used here). The issue is: different probability models give different answers. Usually this is not a problem because by the same token: probability models encode so much about intent you can usually infer the right one from knowing intent.

Most common sampling questions use a frequentist model/interpretation (for example see Bayesian and Frequentist Approaches: Ask the Right Question). The issue is: under that rubric the statement there is a c such that V[e(i)] = c doesn’t carry a lot of content. What is probably meant/intended are strong conditional distribution statements like E[e(i)|x(i)]=0 and V[e(i)|x(i)]=c. A quick proof analysis shows the derivations in the Wikipedia article are definitely pushing the E[] operator through Xs as if the Xs are constants independent of the sample/experiment. This is not correct in general (as our bad example showed), but is a legitimate step if all operators are conditioned on X (but again, that is a fairly strong condition).

Part of this is just a reminder that the Wikipedia is an encyclopedia, not a primary source. The other part is: don’t let statistical bullies force you away from clear thoughts and definitions.

For example: it is considered vulgar or ignorant to assume something as strong as independent identically distributed errors. The feeling is: the conclusion of the Gauss-Markov theorem gives facts about only the first two moments of a distribution, so the invoked pre-conditions should only use facts about the first two moments of any input distributions. But philosophically: assuming identical errors makes sense: errors we can’t tell apart in some sense must be treated as identical (as we can’t tell them apart). A data scientist if asked why they believe the residuals hidden in their data may be homoscedastic is more likely to appeal to some sort of assumed independent generative structure in their problem (which is itself not as weak or as general as homoscedasticity) than to point to an empirical test of homoscedasticity (which can itself be unreliable).

A lot tends to be going on in statistics papers (probabilities, interpretation, reasoning over counterfactuals, math, and more) so expect technical terminology (or even argot), implied conventions, and telegraphic writing. Correct comprehension often requires introducing and working your own examples.