y-aware scaling in context
Win-Vector Blog 2016-07-01
Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.
Judging from feedback, I am not sure everybody noticed that, in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it, or seen it already in use after visiting numerous clients). It has likely been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this note I’ll discuss some of the context of y-aware scaling.
y-aware scaling is a transform that has been available as “scale mode” in the vtreat R package since before the first public release on Aug 7, 2014 (derived from earlier proprietary work). It was always motivated by a “dimensional analysis” or “get the units consistent” argument. It is intended as a pre-processing step before operations that are metric sensitive, such as KNN classification and principal components regression. We didn’t really work on proving theorems about it, because in certain contexts it can be recognized as “the right thing to do.” It derives from considering inputs (independent variables, or columns) as single-variable models, and the combining of such variables as a nested model or ensemble model construction (chapter 6 of Practical Data Science with R, Nina Zumel and John Mount, Manning 2014, was somewhat organized with this idea behind the scenes). Considering y (the outcome to be modeled) during dimension reduction prior to predictive modeling is a natural concern, but it seems to be anathema in principal components analysis.
y-aware scaling is in fact simple: it involves multiplying each input by the slope coefficient from a single-variable linear regression (for a regression problem) or from a single-variable logistic regression (for a classification problem). This is different from multiplying by the outcome y, which would not be available during the application phase of a predictive model. The fact that it is simple makes it a bit hard to accept that it is both effective and novel. We are not saying it is unprecedented, but it is certainly not central in the standard literature (despite being an easy and effective technique).
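To make the regression case concrete, here is a minimal sketch in plain Python. It is not the vtreat implementation: the function names `fit_y_aware_scaling` and `apply_y_aware_scaling` and the toy data are hypothetical, and centering each column before multiplying by its slope is an assumption added here so that the rescaled columns have mean zero. The point to notice is that y is used only when fitting the slopes, never when applying the transform.

```python
from statistics import mean

def fit_y_aware_scaling(columns, y):
    """Fit one single-variable least-squares regression y ~ x per column.

    columns: dict mapping column name -> list of numeric values
    y: list of numeric outcomes, same length as each column
    Returns a dict mapping column name -> (column mean, slope).
    """
    y_bar = mean(y)
    plan = {}
    for name, x in columns.items():
        x_bar = mean(x)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        # A constant column has no usable slope; scale it to zero.
        slope = sxy / sxx if sxx > 0 else 0.0
        plan[name] = (x_bar, slope)
    return plan

def apply_y_aware_scaling(plan, columns):
    """Rescale each column to slope * (x - mean). No y is needed here."""
    return {
        name: [slope * (xi - x_bar) for xi in columns[name]]
        for name, (x_bar, slope) in plan.items()
    }

# Hypothetical toy data: x1 determines y; x2 has large units but little signal.
train = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [100.0, -200.0, 300.0, -100.0, -100.0],
}
y = [2.0, 4.0, 6.0, 8.0, 10.0]

plan = fit_y_aware_scaling(train, y)
scaled = apply_y_aware_scaling(plan, train)

for name, values in scaled.items():
    print(name, [round(v, 3) for v in values])
```

After the transform a unit change in any rescaled column corresponds to a unit change in predicted y, so x2 (which started with a range hundreds of times larger than x1) ends up with a smaller spread than x1. That is the “get the units consistent” argument in action: a metric-sensitive step such as PCA or KNN would now weight the columns by their relationship to y rather than by their original units.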
There is an extensive literature on scaling, filtering, transforming, and pre-conditioning data for principal components analysis (for example see “Centering, scaling, and transformations: improving the biological information content of metabolomics data”, Robert A van den Berg, Huub CJ Hoefsloot, Johan A Westerhuis, Age K Smilde and Mariët J van der Werf, BMC Genomics 7:142, 2006). However, these are all what we call x-only transforms.
When you consult references (such as The Elements of Statistical Learning, 2nd edition, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer 2009; and Applied Predictive Modeling, Max Kuhn, Kjell Johnson, Springer 2013) you basically see only two y-sensitive principal components style techniques (in addition to recommendations to use regularized regression): supervised PCA and partial least squares regression.
I would like to repeat (it is already implied in Nina’s article): y-aware scaling is not equivalent to either of these methods.
Supervised PCA is simply pruning the variables by inspecting small regressions prior to the PCA steps. To my mind, the fact that in 2006 you could get a publication by encouraging this natural step and giving it a name makes our point: principal components users do not usually consider the outcome, or y-variable, in their data preparation. I’ll repeat: filtering and pruning variables is common in many forms of data analysis, so it is remarkable how much work was required to sell the idea of supervised PCA.
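For contrast, the pruning idea can be sketched in a few lines of plain Python. This is only an illustration of “inspect small regressions, then drop weak variables”, not the published supervised PCA procedure: the helper names `univariate_score` and `prune_columns`, the choice of absolute correlation as the score, the threshold of 0.5, and the toy data are all assumptions made for the example.

```python
from statistics import mean

def univariate_score(x, y):
    """Absolute correlation of x with y: strength of the one-variable linear fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column (or constant outcome) carries no signal
    return abs(sxy) / (sxx * syy) ** 0.5

def prune_columns(columns, y, threshold):
    """Keep only the columns whose univariate score reaches the threshold."""
    return [name for name, x in columns.items()
            if univariate_score(x, y) >= threshold]

# Hypothetical toy data: x1 determines y; x2 is only weakly related to y.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [100.0, -200.0, 300.0, -100.0, -100.0],
}
y = [2.0, 4.0, 6.0, 8.0, 10.0]

kept = prune_columns(columns, y, threshold=0.5)
print(kept)
```

Note the difference in kind: pruning makes a keep-or-drop decision per variable and leaves the surviving columns in their original units, whereas y-aware scaling keeps every column and rescales it in proportion to its relationship with y.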
Partial Least Squares Regression is an interesting y-aware technique, but it is a different (and more complicated) technique than y-aware scaling. Here is an example (in R) showing the two methods having very different performance on an (admittedly artificial) problem: PLS.md.
In conclusion, I encourage you to take the time to read up on y-aware scaling and consider using it during your dimension reduction steps prior to predictive modeling.