Nina Zumel and John Mount speaking on vtreat at PyData LA 2019

Win-Vector Blog 2019-11-27

As we have announced before, we have ported the R version of vtreat to a new Python version of vtreat.

Our latest news is: we are speaking about the Python version at PyData LA 2019 (Thursday 11:15 AM–12:00 PM in Track 2 Room).

Many R users have found that vtreat rapidly becomes an indispensable step in their supervised machine learning workflows.

This tool accepts real-world data that may have issues such as missing values, categorical variables with many levels, or even novel levels appearing during model application. It then faithfully and reliably converts this data into a machine-learning-ready data frame that is entirely numeric and free of missing values. By faithful we mean that most of the relevant modeling information is preserved. By reliable we mean that a number of subtle over-fitting (or nested model bias) traps are avoided.
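To make the idea concrete, here is a toy sketch in plain Python of the kind of conversion described above. It is not vtreat's implementation or API: the class name `ToyTreatment`, the column names, and the `_is_bad` and `_lev_` suffixes are hypothetical choices made for illustration. The sketch also leaves out what makes vtreat reliable in practice, such as the protections against nested model bias.

```python
import math


def _is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


class ToyTreatment:
    """Toy stand-in for a data-treatment step (hypothetical; not vtreat's API)."""

    def fit(self, rows, numeric_cols, categorical_cols):
        self.numeric_cols = list(numeric_cols)
        self.categorical_cols = list(categorical_cols)
        # Mean of the observed (non-missing) training values, used for imputation.
        self.means_ = {}
        for col in self.numeric_cols:
            seen = [r[col] for r in rows if not _is_missing(r.get(col))]
            self.means_[col] = sum(seen) / len(seen) if seen else 0.0
        # Levels observed during training; anything else counts as novel later.
        self.levels_ = {
            col: sorted({r[col] for r in rows if not _is_missing(r.get(col))})
            for col in self.categorical_cols
        }
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            new = {}
            for col in self.numeric_cols:
                v = r.get(col)
                missing = _is_missing(v)
                # Impute missing numbers and record that we did so.
                new[col] = self.means_[col] if missing else float(v)
                new[col + "_is_bad"] = 1.0 if missing else 0.0
            for col in self.categorical_cols:
                v = r.get(col)
                # One indicator per training level; missing or novel
                # levels simply get zeros in every indicator.
                for level in self.levels_[col]:
                    new[f"{col}_lev_{level}"] = 1.0 if v == level else 0.0
            out.append(new)
        return out


train = [
    {"x": 1.0, "c": "a"},
    {"x": None, "c": "b"},
    {"x": 3.0, "c": None},
]
# Application-time row with a missing number and a level never seen in training.
test = [{"x": float("nan"), "c": "zzz"}]

treatment = ToyTreatment().fit(train, numeric_cols=["x"], categorical_cols=["c"])
prepared = treatment.transform(test)
print(prepared)
```

Even with a missing number and a never-before-seen level, the prepared row is entirely numeric and has no missing values, which is the property downstream modeling code depends on.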

Once you get used to having this capability, it is hard to give up.

In our talk we will lay out the typical problems and how vtreat now also solves these problems for Python users.

We won’t have time to get deeply into it, but the Python version of vtreat is designed to “be Pythonic” (or at least to follow the patterns of other Python tools), so the calling conventions should be very familiar to scikit-learn users.