Python Data Science Tip: Don’t use Default Cross Validation Settings

Win-Vector Blog 2020-03-03

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don't use the default cross validation settings. The default can be a deterministic, and even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don't touch the pseudo-random number generator they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not what one actually needs.
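The short version of the fix: pass an explicit, shuffled cross validation strategy instead of the integer shorthand. A minimal sketch (standard sklearn API; model, X, and y here are stand-ins for your own estimator and data):

import sklearn.model_selection

# an explicit strategy: shuffle which rows land in which fold
cvstrat = sklearn.model_selection.KFold(n_splits=3, shuffle=True)
preds = sklearn.model_selection.cross_val_predict(model, X, y, cv=cvstrat)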

To see what is going on, let’s work an example.

First we import our packages/modules.

import pandas
import numpy
import sklearn
import sklearn.model_selection
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import cross_val_predict

sklearn.__version__
'0.21.3'

Now let’s set up some simple example data.

nrow = 15
cv = 3
d = pandas.DataFrame({
    'const': ['a'] * nrow,
    'r1': numpy.random.normal(size=nrow),
    'row_id': range(nrow)
})
y = [2**i for i in range(nrow)]

d
   const        r1  row_id
0      a -0.090306       0
1      a -0.062128       1
2      a  0.530181       2
3      a -0.769375       3
4      a -2.082851       4
5      a  0.703230       5
6      a  0.404206       6
7      a -0.648879       7
8      a  0.149515       8
9      a -0.697519       9
10     a -0.177883      10
11     a  0.809709      11
12     a  0.956048      12
13     a  0.621239      13
14     a -0.579699      14

We now use sklearn.model_selection.cross_val_predict to land some derived columns. In this case we are going to land the global average of the outcome y as our estimate.

class CopyYMeanTransform(BaseEstimator, 
                         TransformerMixin):
    # Trivial "model": estimates every row's outcome
    # as the mean of the training outcomes.
    def __init__(self):
        self.est = 0
        BaseEstimator.__init__(self)
        TransformerMixin.__init__(self)

    def fit(self, X, y):
        # memorize the mean of the training outcome
        self.est = numpy.mean(y)
        return self

    def transform(self, X):
        # emit the memorized mean, once per row of X
        return [self.est] * X.shape[0]

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

    def predict(self, X):
        return self.transform(X)

    def fit_predict(self, X, y):
        return self.fit_transform(X, y)
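
As a quick side check (not part of the original exposition, but helpful for intuition), fitting this transform on all of the data emits the grand mean of y for every row:

est_all = CopyYMeanTransform().fit(d, y)
est_all.transform(d.head(2))  # grand mean of y, (2**nrow - 1)/nrow, about 2184.47, repeated twice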

ests1 = cross_val_predict(CopyYMeanTransform(), d, y, cv=cv)

pandas.DataFrame({'ests': ests1})
      ests
0   3273.6
1   3273.6
2   3273.6
3   3273.6
4   3273.6
5   3177.5
6   3177.5
7   3177.5
8   3177.5
9   3177.5
10   102.3
11   102.3
12   102.3
13   102.3
14   102.3

In the result we notice two things:

  • The estimated global average varies; it is not a constant. This is a very important feature of cross validated methods, and something I intend to write more on later.
  • The results come in orderly blocks. This is implied by the help, but not what one wants or expects in cross validated work. Order structure in the input data can survive this blocked cross validation and spoil results.

Let's re-encode the output to see what is going on. We deliberately chose the y values to be powers of 2: each fold estimate is a mean over nrow - nrow/cv rows, so v*(nrow-nrow/cv) recovers the sum of the y values used, and the bit positions of that sum tell us exactly which rows went into each calculation. We can view this as follows.

pandas.DataFrame({
    'blocks': 
    [format(int(v*(nrow-nrow/cv)), '#0' + str(nrow+2) + 'b') for v in ests1]
})
               blocks
0   0b111111111100000
1   0b111111111100000
2   0b111111111100000
3   0b111111111100000
4   0b111111111100000
5   0b111110000011111
6   0b111110000011111
7   0b111110000011111
8   0b111110000011111
9   0b111110000011111
10  0b000001111111111
11  0b000001111111111
12  0b000001111111111
13  0b000001111111111
14  0b000001111111111

The first row indicates it was derived from all rows except the first 5 (as the 5 lowest bit positions are zero). In fact the first five rows are all calculated in this manner. So we have 3-way cross validation (each row is calculated using 2/3rds of the data), but in consecutive blocks.

This happens because sklearn.model_selection.cross_val_predict defaults to using one of sklearn.model_selection.KFold or sklearn.model_selection.StratifiedKFold. These in turn both default to shuffle=False, which explains the observed behavior.
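
We can verify this directly by printing the fold plan an unshuffled KFold produces (a quick check using the standard KFold.split() interface):

kf_default = sklearn.model_selection.KFold(n_splits=3)  # shuffle=False is the default
for train_idx, test_idx in kf_default.split(d):
    print(test_idx)  # the held-out rows for each fold

On our 15 rows this prints the held-out indices as the consecutive blocks [0 1 2 3 4], [5 6 7 8 9], and [10 11 12 13 14].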

This is “as expected” in the sense that it is clearly documented. It is, however, not how a statistician would expect k-fold cross validation to work for a small k.

The solution is, as documented, to avoid the default by explicitly setting the cross validation strategy. We demonstrate this here.

cvstrat = sklearn.model_selection.KFold(shuffle=True, n_splits=3)
ests2 = sklearn.model_selection.cross_val_predict(CopyYMeanTransform(), d, y, cv=cvstrat)
    
pandas.DataFrame({
    'ests2': 
    [format(int(v*(nrow-nrow/cv)), '#0' + str(nrow+2) + 'b') for v in ests2]
})
                ests2
0   0b111101010110110
1   0b110111111001001
2   0b110111111001001
3   0b111101010110110
4   0b110111111001001
5   0b110111111001001
6   0b111101010110110
7   0b001010101111111
8   0b111101010110110
9   0b001010101111111
10  0b111101010110110
11  0b001010101111111
12  0b110111111001001
13  0b001010101111111
14  0b001010101111111

This is still a 3-fold cross validation strategy, as there are only 3 distinct calculations made. However the arrangement is now random, subject to the important constraint that the i-th row is not an input to the i-th result.
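
We can also check that constraint mechanically (a small sketch reusing the bit encoding above): for each row i, bit i of its re-encoded block must be zero, meaning row i was held out of its own calculation.

# recover the bit masks of rows used for each estimate; round() guards
# against floating point error before the integer conversion
blocks2 = [int(round(v * (nrow - nrow/cv))) for v in ests2]
assert all(((b >> i) & 1) == 0 for i, b in enumerate(blocks2))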

We can also confirm that the shuffle option shuffles the cross-validation plan, and not the data set rows.

class CopyXTransform(BaseEstimator, TransformerMixin):
    # Identity "model": returns its input unchanged, so we can
    # see what row order cross_val_predict reassembles.
    def __init__(self):
        BaseEstimator.__init__(self)
        TransformerMixin.__init__(self)

    def fit(self, X, y):
        return self

    def transform(self, X):
        return X.copy()

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)

    def predict(self, X):
        return self.transform(X)

    def fit_predict(self, X, y):
        return self.fit_transform(X, y)
    
preds = sklearn.model_selection.cross_val_predict(
    CopyXTransform(), d, y, cv=cvstrat)

pandas.DataFrame(preds)
    0          1   2
0   a -0.0903059   0
1   a -0.0621276   1
2   a   0.530181   2
3   a  -0.769375   3
4   a   -2.08285   4
5   a    0.70323   5
6   a   0.404206   6
7   a  -0.648879   7
8   a   0.149515   8
9   a  -0.697519   9
10  a  -0.177883  10
11  a   0.809709  11
12  a   0.956048  12
13  a   0.621239  13
14  a  -0.579699  14

The CopyXTransform copied out the input data in its original order, confirming that shuffle shuffles the plan, not the data rows.
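
One practical aside (standard sklearn usage, not part of the demonstration above): with shuffle=True the fold plan changes from run to run. If you need a repeatable shuffled plan, also pin the random_state argument:

cvstrat_repeatable = sklearn.model_selection.KFold(
    n_splits=3, shuffle=True, random_state=2020)  # seed value is arbitrary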

And that concludes our tip: don't use default cross validation settings.