Bad Bayes: an example of why you need hold-out testing

Win-Vector Blog 2014-02-02

We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.

The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.

Often there is a feeling that if a model is doing really well on training data then there must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false, as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is that you are working through variations of worthless models that only appear to be good on training data due to overfitting. And the more “tweaking, tuning, and fixing” you try, the more things only appear to improve: as you peek at your test data (some of which you really should have held out until the very end of the project for final acceptance), your test data becomes less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
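
To make the peeking effect concrete, here is a small sketch (not part of the experiments in this article). It assumes a hypothetical workflow where we try 1000 worthless candidate "models", each just a coin flip per row, and keep whichever one scores best on a re-used test set. The names nCandidates, yFresh, and so on are ours, chosen for illustration only.

```r
# Sketch: choosing among many worthless models by test-set score
# makes the re-used test set optimistic.
set.seed(2014)        # arbitrary seed, for repeatability only
n <- 100              # rows in the re-used test set and in the fresh set
nCandidates <- 1000   # number of worthless "tweaks" we try

yTest  <- sample(c('A','B'), size = n, replace = TRUE)
yFresh <- sample(c('A','B'), size = n, replace = TRUE)

# Each candidate "model" is a coin flip per row: it knows nothing about y.
predTest  <- replicate(nCandidates, sample(c('A','B'), size = n, replace = TRUE))
predFresh <- replicate(nCandidates, sample(c('A','B'), size = n, replace = TRUE))

# Accuracy of every candidate on the re-used test set.
accTest <- colMeans(predTest == yTest)
# The candidate we would keep after peeking at the test set.
best <- which.max(accTest)
# The same candidate scored on data that played no role in selecting it.
accFresh <- mean(predFresh[, best] == yFresh)

print(c(selectedOnTest = accTest[[best]], onFreshData = accFresh))
```

The selected candidate should score well above 50% on the test set it was chosen on, purely through selection, while its score on fresh data has no reason to differ from a coin flip.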

Any researcher who does not have proper per-feature significance checks or hold-out testing procedures will be fooled into promoting faulty models.

Many predictive NLP (natural language processing) applications require the use of very many very rare (almost unique) text features. A simple example would be 4-grams: sequences of 4 consecutive words from a document. At some point you are tracking phrases that occur in only 1 to 2 documents in your training corpus. A tempting intuition is that each of these rare features is in fact a low utility clue for document classification. The hope is that if we track enough of them then enough are available when scoring a given document to make a reliable classification.
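
As a rough illustration of how quickly 4-grams become rare, the following sketch counts how many documents each word 4-gram appears in. The three toy documents and the helper fourGrams are made up for this example; a real corpus would need proper tokenization.

```r
# Toy corpus (hypothetical, for illustration only).
docs <- c("the quick brown fox jumps over the lazy dog",
          "the quick brown fox sleeps under the old tree",
          "a lazy dog sleeps under the old tree")

# Distinct word 4-grams of a single document (naive whitespace tokenization).
fourGrams <- function(text) {
  words <- strsplit(text, " +")[[1]]
  if (length(words) < 4) return(character(0))
  grams <- vapply(seq_len(length(words) - 3),
                  function(i) paste(words[i:(i + 3)], collapse = " "),
                  character(1))
  unique(grams)
}

# Number of documents each 4-gram occurs in.
docFreq <- table(unlist(lapply(docs, fourGrams)))

# How many 4-grams occur in exactly 1 document, exactly 2 documents, and so on.
print(table(documentsContaining = as.vector(docFreq)))
```

Even in this tiny corpus most 4-grams occur in a single document, which is the regime where per-feature estimates are least reliable.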

These features may in fact be useful, but you must be careful to have procedures to determine which features are in fact useful and which are mere noise. The issue is that rare features are only seen in a few training examples, so it is hard to reliably estimate their value during training. We will demonstrate (in R) some absolutely useless variables masquerading as actual signal during training. Our example is artificial, but if you don’t have proper hold-out testing procedures you can easily fall into a similar trap.
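
A quick back-of-the-envelope calculation shows why so few examples are a problem. The sketch below assumes labels are independent fair coin flips (a simplification we introduce for illustration) and asks how often a feature seen in only k training rows lines up perfectly with one class by chance alone.

```r
# Chance that a feature seen in k training rows is "pure" (all rows share one label)
# when labels are independent fair coin flips (an assumption for illustration).
k <- 1:5
pPure <- 2 * 0.5^k     # all k rows are 'A', or all k rows are 'B'
nFeatures <- 400       # same feature count as the experiments in this article

print(data.frame(k = k,
                 pPure = pPure,
                 expectedPureFeatures = nFeatures * pPure))
```

Under these assumptions a feature seen in two training rows looks perfectly predictive half of the time, so hundreds of rare noise features can each appear to carry a small amount of signal.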

Our code to create a bad example is as follows:

runExample <- function(rows,features,rareFeature,trainer,predictor) {
   print(sys.call(0)) # print call and arguments
   set.seed(123525)   # make result deterministic
   yValues <- factor(c('A','B'))
   xValues <- factor(c('a','b','z'))
   d <- data.frame(y=sample(yValues,replace=T,size=rows),
                   group=sample(1:100,replace=T,size=rows))
   if(rareFeature) {
      mkRandVar <- function() {
         v <- rep(xValues[[3]],rows)
         signalIndices <- sample(1:rows,replace=F,size=2)
         v[signalIndices] <- sample(xValues[1:2],replace=T,size=2)
         v
      }
   } else {
      mkRandVar <- function() {
         sample(xValues[1:2],replace=T,size=rows)
      }
   }
   varValues <- as.data.frame(replicate(features,mkRandVar()))
   varNames <- colnames(varValues)
   d <- cbind(d,varValues)
   dTrain <- subset(d,group<=50)
   dTest <- subset(d,group>50)
   formula <- as.formula(paste('y',paste(varNames,collapse=' + '),sep=' ~ '))
   model <- trainer(formula,data=dTrain)
   tabTrain <- table(truth=dTrain$y,
      predict=predictor(model,newdata=dTrain,yValues=yValues))
   print('train set results')
   print(tabTrain)
   print(fisher.test(tabTrain))
   tabTest <- table(truth=dTest$y,
      predict=predictor(model,newdata=dTest,yValues=yValues))
   print('hold-out test set results')
   print(tabTest)
   print(fisher.test(tabTest))
}

This block of code builds a universe of examples of size rows. The ground truth we are trying to predict is whether y is “A” or “B”. Each row has a number of features (equal to features). These features are considered rare if we have rareFeature=T (if so, the feature spends almost all of its time parked at the constant “z”). The point is that each and every feature in this example is random and built without looking at the actual truth-values or y’s (and is therefore useless). We split the universe of data into a roughly 50/50 test/train split. We then build a model on the training data and show the performance of predicting the y-category on both the test and train sets. We use the Fisher contingency table test to see if we have what looks like a significant model. In all cases we get a deceptively good (very low) p-value on training that does not translate to any real effect on test data. We show the effect for Naive Bayes (a common text classifier), decision trees, logistic regression, and random forests (note that for the non-Naive-Bayes classifiers we use non-rare features to trick them into thinking there is a model).

Basically, if you don’t at least look at model diagnostics (such as coefficient p-values in logistic regression) or at test significance, you can fool yourself into thinking that a model that merely looks good in training is a good model. You may even feel that with the right sort of smoothing it should at least be usable in test. It will not. The most you can hope for is a training procedure that notices there is no useful signal. You can’t model your way out of having no useful features.
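
One simple form of per-feature check can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments here: we build pure-noise features, run one Fisher exact test per feature against y, and then correct for the number of tests with p.adjust. The choice of Bonferroni correction and the sizes used are our assumptions.

```r
set.seed(123525)   # repeatability only
rows <- 100
features <- 400

# Outcome and noise features, all built independently of each other.
y <- sample(c('A','B'), size = rows, replace = TRUE)
x <- as.data.frame(replicate(features,
                             sample(c('a','b'), size = rows, replace = TRUE)))

# One Fisher exact test per feature against the outcome.
pRaw <- vapply(x, function(col) fisher.test(table(col, y))$p.value, numeric(1))

# Correct for having run many tests (Bonferroni is one conservative option).
pAdj <- p.adjust(pRaw, method = 'bonferroni')

print(c(rawBelow0.05 = sum(pRaw < 0.05),
        adjustedBelow0.05 = sum(pAdj < 0.05)))
```

With this many tests some noise features can be expected to pass an uncorrected 0.05 threshold by chance, which is why the multiple-comparison correction matters when screening many weak variables.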

The results we get are as follows:

  • Naive Bayes train (looks good when it is not):
    > library(e1071)
    > runExample(rows=200,features=400,rareFeature=T,
        trainer=function(formula,data) { naiveBayes(formula,data) },
        predictor=function(model,newdata,yValues) {
            predict(model,newdata,type='class')
        } )
    runExample(rows = 200, features = 400, rareFeature = T, trainer = function(formula,
        data) {
        naiveBayes(formula, data)
    }, predictor = function(model, newdata, yValues) {
        predict(model, newdata, type = "class")
    })
    [1] "train set results"
         predict
    truth  A  B
        A 45  2
        B  0 49
    Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     131.2821      Inf
    sample estimates:
    odds ratio
           Inf
  • Naive Bayes hold-out test (is bad):
    [1] "hold-out test set results"
         predict
    truth  A  B
        A 17 41
        B 14 32
    Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 1
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3752898 2.4192687
    sample estimates:
    odds ratio
     0.9482474
  • Decision tree train (looks good when it is not):
    > library(rpart)
    > runExample(rows=200,features=400,rareFeature=F,
        trainer=function(formula,data) { rpart(formula,data) },
        predictor=function(model,newdata,yValues) {
            predict(model,newdata,type='class')
        } )
    runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula,
        data) {
        rpart(formula, data)
    }, predictor = function(model, newdata, yValues) {
        predict(model, newdata, type = "class")
    })
    [1] "train set results"
         predict
    truth  A  B
        A 42  5
        B 16 33
    Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value = 7.575e-09
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      5.27323 64.71322
    sample estimates:
    odds ratio
      16.69703
  • Decision tree hold-out test (is bad):
    [1] "hold-out test set results"
         predict
    truth  A  B
        A 33 25
        B 27 19
    Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 1
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.3932841 2.1838878
    sample estimates:
    odds ratio
     0.9295556
  • Logistic regression train (looks good when it is not):
    > runExample(rows=200,features=400,rareFeature=F,
        trainer=function(formula,data) {
            glm(formula,data,family=binomial(link='logit'))
        },
        predictor=function(model,newdata,yValues) {
            yValues[ifelse(predict(model,newdata=newdata,type='response')>=0.5,2,1)]
        } )
    runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula,
        data) {
        glm(formula, data, family = binomial(link = "logit"))
    }, predictor = function(model, newdata, yValues) {
        yValues[ifelse(predict(model, newdata = newdata, type = "response") >=
            0.5, 2, 1)]
    })
    [1] "train set results"
         predict
    truth  A  B
        A 47  0
        B  0 49
    Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     301.5479      Inf
    sample estimates:
    odds ratio
           Inf
  • Logistic regression hold-out test (is bad):
    [1] "hold-out test set results"
         predict
    truth  A  B
        A 35 23
        B 25 21
    Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 0.5556
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5425696 3.0069854
    sample estimates:
    odds ratio
      1.275218
    Warning messages:
    1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
      prediction from a rank-deficient fit may be misleading
    2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
      prediction from a rank-deficient fit may be misleading
  • Random Forests train (looks good, but is not):
    > library(randomForest)
    > runExample(rows=200,features=400,rareFeature=F,
        trainer=function(formula,data) { randomForest(formula,data) },
        predictor=function(model,newdata,yValues) {
            predict(model,newdata,type='response')
        } )
    runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula,
        data) {
        randomForest(formula, data)
    }, predictor = function(model, newdata, yValues) {
        predict(model, newdata, type = "response")
    })
    [1] "train set results"
         predict
    truth  A  B
        A 47  0
        B  0 49
    Fisher's Exact Test for Count Data
    data:  tabTrain
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     301.5479      Inf
    sample estimates:
    odds ratio
           Inf
  • Random Forests hold-out test (is bad):
    [1] "hold-out test set results"
         predict
    truth  A  B
        A 21 37
        B 13 33
    Fisher's Exact Test for Count Data
    data:  tabTest
    p-value = 0.4095
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5793544 3.6528127
    sample estimates:
    odds ratio
      1.435704
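
The confusion tables above can be boiled down to plain accuracy to make the train/test gap easier to see. The sketch below simply re-types the counts printed in the results; the helper accuracy and the list layout are ours.

```r
# Confusion counts copied from the printed tables above, in the order
# (truth A, predict A), (truth A, predict B), (truth B, predict A), (truth B, predict B).
results <- list(
  naiveBayes         = list(train = c(45, 2, 0, 49),  test = c(17, 41, 14, 32)),
  decisionTree       = list(train = c(42, 5, 16, 33), test = c(33, 25, 27, 19)),
  logisticRegression = list(train = c(47, 0, 0, 49),  test = c(35, 23, 25, 21)),
  randomForest       = list(train = c(47, 0, 0, 49),  test = c(21, 37, 13, 33))
)

# Fraction of rows on the diagonal of the confusion table.
accuracy <- function(counts) (counts[1] + counts[4]) / sum(counts)

summaryTable <- data.frame(
  model = names(results),
  trainAccuracy = sapply(results, function(r) accuracy(r$train)),
  testAccuracy  = sapply(results, function(r) accuracy(r$test)),
  row.names = NULL)

print(summaryTable, digits = 3)
```

Every model scores between roughly 78% and 100% on the training rows and close to 50% on the hold-out rows, which is what we expect from features that carry no information about y.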

The point is: good training performance means nothing (unless your trainer is in fact reporting cross-validated results). To avoid overfit you must at least examine model diagnostics, per-variable model coefficient significances, and should always report results on truly held-out data. It is not enough to look only at model-fit significance on training data. An additional risk is when you are in a situation where you are likely to encounter a mixture of rare useful features and rare noise features. As we have illustrated above the model fitting procedures can’t always tell the difference between features and noise. So it is easy to expect that the noise features can drown out rare useful features in practice. This should remind all of us of the need for good variable curation, selection and principled dimension reduction (domain knowledge sensitive and y-sensitive, not just broad principal components analysis). Lots of features (the so-called “wide data” style of analytics) are not always easy to work with (as opposed to “tall data” which is always good as you have more examples to falsify bad relations).
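
For trainers that do not report cross-validated results themselves, a hand-rolled k-fold estimate is easy to sketch. The example below is a minimal illustration under our own assumptions (20 pure-noise 0/1 features, 100 rows, 5 folds, logistic regression, and a 0.5 decision threshold); it is not the procedure used in the experiments above, and the helper fitAndScore is hypothetical.

```r
set.seed(123525)   # repeatability only
rows <- 100
features <- 20
k <- 5

# Noise features (columns V1..V20) and an outcome independent of all of them.
d <- as.data.frame(matrix(rbinom(rows * features, size = 1, prob = 0.5),
                          nrow = rows))
d$y <- rbinom(rows, size = 1, prob = 0.5)

# Random assignment of each row to one of k equally sized folds.
fold <- sample(rep(1:k, length.out = rows))

# Fit on 'train', report accuracy on 'test'.
fitAndScore <- function(train, test) {
  model <- glm(y ~ ., data = train, family = binomial(link = 'logit'))
  pred <- as.numeric(predict(model, newdata = test, type = 'response') >= 0.5)
  mean(pred == test$y)
}

# Accuracy measured on the same rows used for fitting.
trainAcc <- fitAndScore(d, d)
# Average accuracy over folds, each scored by a model that never saw that fold.
cvAcc <- mean(sapply(1:k, function(i) fitAndScore(d[fold != i, ], d[fold == i, ])))

print(c(trainAccuracy = trainAcc, crossValidatedAccuracy = cvAcc))
```

The training accuracy is scored on the rows the model was fit to and so tends to be optimistic, while the cross-validated figure is the one to compare against a coin flip. Note that cross-validation used for tuning still does not replace a final, truly held-out test set.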

We took the liberty of using the title “Bad Bayes” because this is where we have most often seen the use of many weak variables without enough data to really establish per-variable significance.

For more on feature selection and model testing please see Zumel, Mount, “Practical Data Science with R”.