A pathological glm() problem that doesn’t issue a warning

Win-Vector Blog 2013-05-02

I know I have already written a lot about technicalities in logistic regression (see for example: How robust is logistic regression? and Newton-Raphson can compute an average). But I just ran into a simple case where R‘s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes the data is a bit pathological, but one would hope for a diagnostic or warning message from the fitter.Consider the following synthetic data set and glm() logistic regression fit (using “R version 3.0.0 (2013-04-03) — “Masked Marvel”" on OSX Mountain Lion):

> d <- data.frame(x=c(rep(1,200),rep(0,25)),y=c(rep(1,24),rep(0,176),rep(1,25)))> table(y=d$y,x=d$x)   xy     0   1  0   0 176  1  25  24> m <- glm(y~x,data=d,family=binomial(link='logit'))> print(summary(m))Call:glm(formula = y ~ x, family = binomial(link = "logit"), data = d)Deviance Residuals:     Min       1Q   Median       3Q      Max  -0.5056  -0.5056  -0.5056  -0.5056   2.0593  Coefficients:            Estimate Std. Error z value Pr(>|z|)(Intercept)    18.57    1304.53   0.014    0.989x             -20.56    1304.53  -0.016    0.987(Dispersion parameter for binomial family taken to be 1)    Null deviance: 235.84  on 224  degrees of freedomResidual deviance: 146.77  on 223  degrees of freedomAIC: 150.77Number of Fisher Scoring iterations: 17

Notice that no coefficient achieved significance and the error bars are in fact gigantic. It is almost like when logistic regression tries to run a coefficient to infinity on linearly separable data. In fact that is what is going on: for x=0 the y data is all in one class (separated or pseudo-separated) and any model of the form dc-term = B and x-term = -B -1.9924 is a good model for large positive B (always is correct on the x=1 distribution of y’s, and gets better at reproducing the x=0 distribution of y’s as B goes to infinity). An ideal optimizer would run B to +infinity. Likely the Newton method in glm() failed because the Hessian became numerically ill-conditioned or the gradient became near zero (but in a region where the loss function was flat, so not a good indication of an optimum). But that is the problem: glm() didn’t inform us of any issue. It should have run to infinity (bad), but instead it just stopped without diagnostic signaling (also bad).

The model is in fact good, it is the error bars that are a problem. It should not be hard to bound coefficients that are running to infinity away from zero. The standard error estimates (even if right) are in this case not able to show the coefficients are nowhere near zero (the usual use of the standard error estimates from a model summary!). It has always been a strange feature of logistic regression that it has problems with (and has to be defended from) data that is “too good.” Many other methods (like decision trees) do not have this issue.

The solution is simple: add a regularization term in the optimizer (or add reasonable prior if you are a Bayesian). Regularizing of course spoils the coefficient error-bar calculations; but you could either work out the math for error bars on a regularized estimate- or empirically estimate error bars by some sort of Bootstrap or empirical re-sampling scheme.

Unfortunately trying to regularize by adding some fuzzy data does not work until we add a fairly significant perturbation to the data:

> d$wt <- 1> fuzz <- data.frame(x=c(1,1,0,0),y=c(1,0,1,0))> fuzz$wt <- 0.5> d2 <- rbind(d,fuzz)> m2 <- glm(y~x,data=d2,family=binomial(link='logit'),weights=wt)Warning message:In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!> summary(m2)Call:glm(formula = y ~ x, family = binomial(link = "logit"), data = d2,     weights = wt)Deviance Residuals:     Min       1Q   Median       3Q      Max  -1.9878  -0.5099  -0.5099  -0.5099   2.0516  Coefficients:            Estimate Std. Error z value Pr(>|z|)    (Intercept)    3.932      1.428   2.754  0.00589 ** x             -5.906      1.444  -4.090 4.31e-05 ***---Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)    Null deviance: 239.37  on 228  degrees of freedomResidual deviance: 153.95  on 227  degrees of freedomAIC: 151.75Number of Fisher Scoring iterations: 6

But to even try these fixes you would have to know you have a problem. Right now the only sign of a problem are the enormous error bars.