Re-Share: vtreat Data Preparation Documentation and Video

Win-Vector Blog 2020-03-27

I would like to re-share vtreat (R version, Python version) a data preparation documentation for machine learning tasks.

vtreat is a system for preparing messy real world data for predictive modeling tasks (classification, regression, and so on). In particular it is very good at re-coding high-cardinality string-valued (or categorical) variables for later use.

A nice introductory video lecture on vtreat can be found here, and the latest copy of the lecture slides here. Or, you can check out chapter 8 “Advanced data preparation” of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019– which covers the use of vtreat.

The vtreat documentation is organized by task (regression, classification, multinomial classification, and unsupervised), language (R or Python) and interface style (design/prepare, or fit/prepare). In particular the R code now supports variations of the interfaces, allowing users to choose what works best with their coding style. Either design/prepare, which is very fluid when combined with wrapr::unpack notation or the fit/prepare (which uses mutable state to organize steps).