Using partial pooling when preparing data for machine learning applications

Win-Vector Blog 2018-04-18

Geoffrey Simmons writes:

I reached out to John Mount/Nina Zumel over at Win-Vector with a suggestion for their vtreat package, which automates many common challenges in preparing data for machine learning applications.

The default behavior for impact coding high-cardinality variables had been a naive Bayes approach, which I found to be problematic due to its multi-modal output (assigning probabilities close to 0 and 1 for levels with small sample sizes). This seemed like a natural fit for partial pooling, so I pointed them to your work/book and demonstrated its usefulness from my experience/applications. It’s now the basis of a custom-coding enhancement to their package.

You can find their write-up here.

Cool. I hope their next step will be to implement it in Stan.
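To make the idea in the note above concrete, here is a minimal sketch of partial pooling applied to a high-cardinality categorical variable with a binary outcome. It is not vtreat's implementation and not a fitted multilevel model: the function name `partial_pooling_codes` and the fixed `prior_strength` pseudo-count are assumptions made for illustration, standing in for the between-level variance that a real hierarchical model (in Stan, say) would estimate from the data.

```python
from collections import defaultdict


def partial_pooling_codes(levels, outcomes, prior_strength=10.0):
    """Return a shrunken outcome rate for each level of a categorical variable.

    Each level's rate is a weighted average of its own observed rate and the
    grand rate. `prior_strength` is a hypothetical pseudo-count: it acts like
    that many extra observations at the grand rate, so levels with few rows
    are pulled strongly toward the overall rate and large levels barely move.
    """
    if len(levels) != len(outcomes) or len(levels) == 0:
        raise ValueError("levels and outcomes must be non-empty and the same length")

    grand_mean = sum(outcomes) / len(outcomes)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1

    return {
        level: (sums[level] + prior_strength * grand_mean)
        / (counts[level] + prior_strength)
        for level in counts
    }


# Level "a" has 50 rows (25 positive); level "b" has a single positive row.
levels = ["a"] * 50 + ["b"]
outcomes = [1] * 25 + [0] * 25 + [1]
codes = partial_pooling_codes(levels, outcomes)
```

With no pooling, the single-row level "b" would be coded as a rate of 1.0, which is the near-0-or-1 behavior described above. Here it lands at roughly 0.55, close to the overall rate, while the well-populated level "a" stays near its observed 0.5.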

It’s also interesting to think of Bayesian or multilevel modeling being used as a preprocessing tool for machine learning, which is sort of the flipped-around version of an idea we posted the other day, on using black-box machine learning predictions as inputs to a Bayesian analysis. I like these ideas of combining different methods and getting the best of both worlds.

The post Using partial pooling when preparing data for machine learning applications appeared first on Statistical Modeling, Causal Inference, and Social Science.