Cross-Methods are a Leak/Variance Trade-Off

Win-Vector Blog 2020-03-11

We have a new Win Vector data science article to share:

Cross-Methods are a Leak/Variance Trade-Off

John Mount (Win Vector LLC), Nina Zumel(Win Vector LLC)

March 10, 2020

We work some exciting examples of when cross-methods (cross validation, and also cross-frames) work, and when they do not work.

Abstract

Cross-methods such as cross-validation, and cross-prediction are effective tools for many machine learning, statisitics, and data science related applications. They are useful for parameter selection, model selection, impact/target encoding of high cardinality variables, stacking models, and super learning. They are more statistically efficient than partitioning training data into calibration/training/holdout sets, but do not satisfy the full exchangeability conditions that full hold-out methods have. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.

Specifically, cross-methods can introduce an information leak into the modeling process. This information leak will be the subject of this post.

The entire article is a JupyterLab notebook, and can be found here. Please check it out, and share it with your favorite statisticians, machine learning researchers, and data scientits.