Fluid use of data
Win-Vector Blog 2015-12-22
Nina Zumel and I recently wrote a few articles and series on best practices in testing models and data:
- Random Test/Train Split is not Always Enough
- How Do You Know if Your Data Has Signal?
- How do you know if your model is going to work?
- A Simpler Explanation of Differential Privacy (explaining the reusable holdout set)
- Using differential privacy to reuse training data
- Preparing Data for Analysis using R: Basic through Advanced Techniques
What stands out in these presentations is that the simple practice of a static test/train split is merely a convenience to cut down on operational complexity and the difficulty of teaching. It is in no way optimal. That is, using slightly more complicated procedures can build better models on a given set of data.
Suggested static cal/train/test experiment design from the vtreat data treatment library.
When you think about data handling as being a part of the modeling process, you realize you can use your data with more statistical efficiency. You can build better models with the same amount of original data by trying one of:
- Jackknifing calibration/training/test data.
- Protecting calibration data by a significance threshold.
- Protecting calibration data by noising (differential privacy).
All of these techniques are demonstrated as examples in our articles. The idea is to get the most out of your data by fluidly re-arranging it for analysis (versus a rigid test/train split).
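To make the first idea concrete, here is a minimal sketch of jackknifed (leave-one-out) calibration for a single categorical variable. It is an illustration under stated assumptions, not the vtreat implementation: the function name `jackknife_level_means` and the toy data are hypothetical. Each row is coded with the mean outcome of the *other* rows sharing its level, so no row's own outcome leaks into the value later used to train on that row.

```python
from collections import defaultdict


def jackknife_level_means(levels, y):
    """Code each row by the mean of y over the OTHER rows with the same level.

    Hypothetical helper for illustration: a row whose level appears only once
    falls back to the leave-one-out grand mean.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for lev, yi in zip(levels, y):
        total[lev] += yi
        count[lev] += 1

    grand_total = sum(y)
    n_rows = len(y)

    coded = []
    for lev, yi in zip(levels, y):
        others = count[lev] - 1
        if others > 0:
            # Remove this row's own outcome before averaging.
            coded.append((total[lev] - yi) / others)
        else:
            # Singleton level: use the mean of all other rows instead.
            coded.append((grand_total - yi) / (n_rows - 1))
    return coded


# Toy data (made up for this sketch).
levels = ["a", "a", "a", "b", "b", "c"]
y = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
coded = jackknife_level_means(levels, y)
```

The coded column can then be used for training on the same rows with less risk of the nested-model overfit that a naive per-level mean would invite. It costs more bookkeeping than a fixed calibration split, but no rows are set aside.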
These techniques are more complicated than the traditional one-time test/train split (and much more complicated than the flawed approach of training and testing on the same single data set). For example, consider the following simple improvement: re-training a production model on all of your data after you are done scoring models on test/train splits. This produces the best model you can build (as it uses all of your data), but one whose performance you happen not to know (as you have no data disjoint from training to score it on). This is a good practice, and can make quite a lot of difference when your data is limited or expensive to produce.
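A rough sketch of that workflow, assuming a one-variable least-squares line as a stand-in for the real model and synthetic data (all names here, such as `fit_line` and `cross_val_rmse`, are hypothetical): score the modeling procedure on held-out folds, then refit on every row for production.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(model, xs, ys):
    """Root mean squared error of a (slope, intercept) model."""
    slope, intercept = model
    sq = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    return (sq / len(xs)) ** 0.5


def cross_val_rmse(xs, ys, k=5, seed=0):
    """Average held-out RMSE over k folds; every model is scored on unseen rows."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(rmse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / k


# Synthetic data (made up for this sketch): y = 2x + 1 plus unit Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

# Step 1: estimate how well the procedure does on data it has not seen.
estimated_rmse = cross_val_rmse(xs, ys)

# Step 2: refit on all rows. No disjoint data remains to score this model.
production_model = fit_line(xs, ys)
```

Note that `estimated_rmse` describes the models trained on the splits, not `production_model` itself. It is usually read as a somewhat pessimistic guide to the production model, which saw more data.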
Computer science dean and professor Dr. Merrick Furst taught:
The biggest difference between time and space is that you can’t reuse time.
For data science this might be:
The biggest difference between computation and data is you can’t always spin up more data.
Even in the “big data” era, data can be more valuable than processor cycles (such as when predicting rare events). Data handling is part of model design, and not something you can always leave to a framework.