Unprincipled Component Analysis
Win-Vector Blog 2014-02-11
As a data scientist I have seen variations of principal component analysis and factor analysis so often blindly misapplied and abused that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) believe that any claimed use of PCA prevents overfit (which is not always the case). In this note we comment on the intent of PCA like techniques, common abuses and other options.
The idea is to illustrate what can quietly go wrong in an analysis and what tests to perform to make sure you see the issue. The main point is some analysis issues can not be fixed without going out and getting more domain knowledge, more variables or more data. You can’t always be sure that you have insufficient data in your analysis (there is always a worry that some clever technique will make the current data work), but it must be something you are prepared to consider.
A typical misuse of principal component analysis is pointed out in iowahawk’s Fables of the Reconstruction. To be clear: iowahawk does good work pointing out the problems in the poor results of others. I would, however, advise against trying to draw any climate or political conclusions. iowahawk performed a very smart reconstruction of a typical poor result. It is a dangerous fallacy to think you can find truth by reversing the conclusion of a poor result. The technique wasn’t bad, it was just the data wasn’t enough to support the desired result and it was wrong to not check more thoroughly and to promote such a weak result. Technique can be fixed, but the hardest thing to get right is having enough good data. The original analysis ideas are fairly well organized and clever, the failings were: not checking enough and promoting the result when the data really wasn’t up to supporting all of the claimed steps.
The typical unprincipled component analysis is organized as follows:
- Collect a bunch of data in wildly different units related to the quantity you are trying to predict.
- Without any concern for dimensionality or scaling extract the principle components of the explanatory variables. Perform no research into appropriate model structure, data exchangeability, omitted variable issues and actual relations to the response variable.
- Do not examine or use any of the principle component diagnostics and include all principle components in your linear model (completely missing the point of PCA).
- Do not attempt to simulate your proposed application with a test/train holdout split and rely on qualitative measures to promote (not check) your result.
We can simulate this typical bad analysis ourselves (so it is me making the mistakes here, so in addition to having given deliberately bad advice we now offer a deliberately bad analysis) using data copied from iowahawk’s Open Office document (Which seems to have changed since I last looked at his example which had 23 variables when I last looked, a very good lesson in saving data and recording provenance. For example the copy of warming.ods we just downloaded has a shasum of 683d931e650f9bbf098985311c400f126a14e5cf. This is something we try to teach in our new data science book.). Note: that at the time I thought the PCA issue was due to including rotation elements matching near-zero principle components (which can be pure noise vectors as the need to be orthogonal to early components starts to dominate their calculation). I now think more of the error should be ascribed to simple overfitting and not to any special flaws of PCA- but just the fallacy that PCA always prevents overfit.
But let’s start the (bad) analysis. The task at hand was two part: show a recent spike in global temperature and show a very long history of global temperatures with no similar event. If done properly this would demonstrate recent warning, and help support the argument that such warming could be due to recent industrial activity (by finding no such spike in pre-history). Overall this was a reasonable plan. In fact it stands out as more planning than is typical. And for all its warts more data and details (code) ended up being shared than is common. But the analysis did fall sort and was over-sold. It needed some more domain specific steps and more checking that the type of model built actually could be used in the way proposed (sorting through correlations and causes and extrapolating into the past).
We are only going to concentrate on the second step: trying to establish a claim that there is no sign of large temperature changes in early history. We will ignore the steps of trying to establish a recent temperature spike and of trying to relate that spike to industrial activity and by-products (also done poorly in the study, but these are separate issues). Also: to be clear any criticism of one analysis is not a refutation of the theory of anthropogenic global warming (which I think is likely true), though obviously a bad study is certainly not a proof of anthropogenic global warming (and should not be promoted as being such). To emphasize: the step we are looking at is only try to estimate temperatures far in the past using measured outcomes (tree rings, ice layers); we are not trying to model the future and we are not trying to even estimate correlation with environmental conditions (CO2 isn’t even a variable in this data extract) or human activity.
A big issue is this data set has only recorded temperatures from 1856 through 2001, so you don’t have any direct temperature data even from 1400AD. So you are not yet in a position to establish a calm early climate. But you do have a number of measurements of things related to historic temperatures: tree ring widths, ice core layer thicknesses, and so on. With enough of these you could hope to get estimates of historic temperatures. With too many of these you just over-fit your known data and learn nothing about the past. And in this sort of model you have a lot of co-linearity (a good source of fitting problems) between these measurements and between expected temperature year to year (which we are assuming is a slow trend with high-autocorrelation hidden by a noise process).
The simple analysis doesn’t bother with any research on the relations between these variables and temperature (either through experiments, first principles or something like generalized additive models) and goes immediately to a PCA regression. The idea is sound: run the data through a constriction (the PCA) to make things more learnable (one of the good ideas in neural nets). The R code for such a shotgun analysis is roughly as follows (the steps happen to be easy in R, but with sufficiently clumsy tools you could write much longer code and really convince yourself that you are accomplishing something):
# load and prepare the dataurlBase <- 'http://www.win-vector.com/dfiles/PCA/'pv <- read.table(paste(urlBase,'ProxyVariables.csv',sep=''), sep=',',header=T,comment.char='')ot <- read.table(paste(urlBase,'ObservedTemps.csv',sep=''), sep=',',header=T,comment.char='')keyName <- 'Year'varNames <- setdiff(colnames(pv),keyName)yName = setdiff(colnames(ot),keyName)d <- merge(pv,ot,by=c(keyName))# perform a regression on new PCA variablespcomp <- prcomp(d[,varNames])synthNames <- colnames(pcomp$rotation)d <- cbind(d, as.matrix(d[,varNames]) %*% pcomp$rotation)f <- as.formula(paste(yName, paste(synthNames,collapse=' + '),sep=' ~ '))model <- step(lm(f,data=d))# print and plot a littleprint(summary(model)$r.squared)d$pred <- predict(model,newdata=d)library(ggplot2)ggplot(data=d) + geom_point(aes_string(x=keyName,y=yName)) + geom_line(aes_string(x=keyName,y='pred'))
And this produces an R-squared of 0.49 (not good, but often valid models achieve less) and the following graph showing recorded temperature (recorded as differences from a reference temperature) as points and the model as a line:
Remember: the purpose of this step is not to establish a temperature spike on the right. The goal is to show that we can use data from the recent past to infer temperatures of the further back past. A cursory inspection of the graph would seem to give the impression the modeling idea worked: we get a fit line through history that explains a non-negligible fraction of the variation and seems to track trends. Our step-wise regression model used only six principle components- so we have some generalization occurring.
On the minus side: since we used principle components and have no side knowledge of how temperatures are supposed to affect the stand-in variables we can’t check if the model coefficient signs make sense (an often important task, though we could at least compare signs from the complex model to signs from single variable models). We haven’t looked at the model summary (though it is likely to be forced to be “good” by the use of stepwise regression which rejects many models with bad summaries to get a final model).
And while we have a graph we could talk up, we have not simulated how this model would be used: to try to use relations fit from more recent data to extrapolate past temperatures from past signs. Let’s see if we can even do that on the data we have available.
Our intended use of the data is: use all of the data we have both temperatures and proxy variables for (1856 through 1980) to build a model that imputes older temperatures from proxy variables (at least back to 1400AD). To simulate this intended use of the model we will split our data into what I call a structured test and train split. The test and train split will not be random (which tends not to be an effective test for auto-correlated time series data) but instead split in a way related to our intended use: using the future to extrapolate into the past. We will use all data at or after the median known date (1918) for training and all older data to simulate trying to predict further into the past. If we can’t use data after 1917 to predict 1856 (a claim we can test because we have enough data) then we can’t expect to predict back to 1400AD (a claim we can’t check as we don’t know the “true” tempuratre for 1400AD).
The code to perform the analysis in this style is:
d$trainGroup <- d[,keyName]>=median(d[,keyName])dTrain <- subset(d,trainGroup)dTest <- subset(d,!trainGroup)model <- step(lm(f,data=dTrain))dTest$pred <- predict(model,newdata=dTest)dTrain$pred <- predict(model,newdata=dTrain)ggplot() + geom_point(data=dTest,aes_string(x=keyName,y=yName)) + geom_line(data=dTest,aes_string(x=keyName,y='pred'), color='blue',linetype=2) + geom_point(data=dTrain,aes_string(x=keyName,y=yName)) + geom_line(data=dTrain,aes_string(x=keyName,y='pred'), color='red',linetype=1)
What we are predicting is the difference in temperature from a reference temperature (fixes some issues of scale). The analysis produces the graph below which shows that modeling the past is harder than just running a regression through all of our data. Notice how the dashed line (predictions made on data not available during training) are all upwardly biased having just copied trends from the future into the past.
Our first graph only looked as “good” as it did because it was not even simulating the task we were going to try to use the derived model for: extrapolating into the past. In fact we find this model is worse than a type of null-model on the past: using a single (magically guessed) historic temperature difference of mean(dTest[,yName])
. The point is: the model is better than any single constant on the data it was trained on, but not better than all constants on data it did not see. This is a common symptom of overfitting.
We show a quantitative measure of the poor performance with root mean square error below:
rmse <- function(x,y) { sqrt(mean((x-y)^2))}print(rmse(dTrain[,yName],mean(dTrain[,yName])))## [1] 0.1404713print(rmse(dTrain[,yName],dTrain$pred))## [1] 0.1062604print(rmse(dTest[,yName],mean(dTest[,yName])))## [1] 0.1312853print(rmse(dTest[,yName],dTest$pred))## [1] 0.2004035
We used RMSE instead of correlation to track predictor performance for a subtle reason (see: Don’t use correlation to track prediction performance). And as we said: the RMSE error on test is bigger for the model than for using a single historic average: mean(dTest[,yName])
. Now the historic average is a constant that we would not know in actual application, but for evaluating the quality of the model it is appropriate to compare RMSE’s in this manner. In fact the whole point of this sort of evaluation is to expose the data you wouldn’t know in practice (the actual older temperatures) and see if your “looks plausible” model models them well or not. We can present the issue graphically by re-plotting the data and fit with a smoothing line run through the actual data (you want to work with a system that makes exploratory graphing easy, as to get your “hands in the data” you need produce a lot of plots and summaries).
ggplot() + geom_point(data=dTest,aes_string(x=keyName,y=yName)) + geom_line(data=dTest,aes_string(x=keyName,y='pred'), color='blue',linetype=2) + geom_segment(aes(x=min(dTest[,keyName]), xend=max(dTest[,keyName]), y=mean(dTest[,yName]), yend=mean(dTest[,yName]))) + geom_point(data=dTrain,aes_string(x=keyName,y=yName)) + geom_line(data=dTrain,aes_string(x=keyName,y='pred'), color='red',linetype=1) + geom_segment(aes(x=min(dTrain[,keyName]), xend=max(dTrain[,keyName]), y=mean(dTrain[,yName]), yend=mean(dTrain[,yName]))) + geom_smooth(data=d,aes_string(x=keyName,y=yName), color='black')
The problem is: that while the predictions that were made in the data range available during training (1918 and beyond the predictions being the red line plot through the data which is black dots) look okay the predictions in the test region (all data before 1918, the blue dotted line being the predictions and the data again being the black dots) are systematically off (mostly too high). The black smoothing curve run through all of the data shows this- notice how the predictions center around the curve in the training region, but mostly stay above the curve in the test region. The warning sign is: the behavior of the model is systematically off once we apply it on data it didn’t see during training. The model is good at memorizing what it has been shown, but can’t be trusted where it can’t be checked (as it may not have learned a useful general relation). A good result would have had the model continue to track the smoothing line in the held-out past.
Honestly the fitting ideas were pretty good, but the results just were not good enough to be used as intended (to prove a recent temperature spike would be rare by claiming no such spikes for an interval even longer than we have directly recorded temperature measurements).
Another thing we never did check is: if the PCA actually did anything for us? I doubt it made matters worse, but it does represent extra complexity that should not be allowed until we see an actual benefit (otherwise it is just slavish ritual). Lets check if there was any PCA benefit by comparing to a naive regression on the raw variables now:
g <- as.formula(paste(yName, paste(varNames,collapse=' + '),sep=' ~ '))model <- lm(g,data=dTrain)dTest$pred <- predict(model,newdata=dTest)dTrain$pred <- predict(model,newdata=dTrain)print(rmse(dTrain[,yName],mean(dTrain[,yName])))## [1] 0.1404713print(rmse(dTrain[,yName],dTrain$pred))## [1] 0.1018893print(rmse(dTest[,yName],mean(dTest[,yName])))## [1] 0.1312853print(rmse(dTest[,yName],dTest$pred))## [1] 0.1973062
Almost indistinguishable results. PCA, at least how we used it, brought us almost nothing. Any procedure you use without checking degenerates into mere ritual.
And just to be cruel we have hidden a few other common PCA errors in the above writeup.
First we used PCA in a “dimensionally invalid” way: by not setting the scale.=T
option in the prcomp()
(which should have been pcomp <- prcomp(d[,varNames],scale.=T)
). The point is without scaling you get different principal components depending on units used int the records. It is not true that something measured in millimeters has more range than the same quantity measured in kilometers, but if you don’t use scaling PCA makes exactly this mistake.
Also it was probably not a good idea to rely only on stepwise regression for variable selection, we should probably have not made the later principle components ever available to the analysis. This could be done by setting the synthetic variables to at least have some minimum variation before they are eligible for analysis. Something like the following could work (though you have to pick the threshold sensibly): synthNames <- colnames(pcomp$rotation)[pcomp$sdev>1]
.
We can easily re-run the PCA analysis with these fixes:
pcomp <- prcomp(d[,varNames],scale.=T)synthNames &tl;- colnames(pcomp$rotation)[pcomp$sdev>1]f <- as.formula(paste(yName, paste(synthNames,collapse=' + '),sep=' ~ '))model <- step(lm(f,data=dTrain))dTest$pred <- predict(model,newdata=dTest)dTrain$pred <- predict(model,newdata=dTrain)print(rmse(dTrain[,yName],mean(dTrain[,yName])))## [[1] 0.1404713print(rmse(dTrain[,yName],dTrain$pred))## [1] 0.1179709print(rmse(dTest[,yName],mean(dTest[,yName])))## [1] 0.1312853print(rmse(dTest[,yName],dTest$pred))## [1] 0.2778389
This gives us a model that uses only three principle components with even worse generalization error. Being able to recognize (and admit to) bad results or equivocal results is key skill of a good data scientist.
I originally made a funny mistake when trying this test. I accidentally wrote model <- step(lm(f,data=d))
and trained a model on all of the data without any holdout. This gave RMSEs on the dTrain and dTest sets that were both worse than using the means as the predictors. This is how I noticed I used the wrong data set to train (as least squares minimizes RMSE so it can not underperform a constant on its actual training set, so if I had actually trained on dTrain then I would have to have a non-horrible RMSE on dTrain). This is another (non-principled) bit of evidence we have a bad model: we fit on all the data yet, do not do better than the best constant on large subsets. This isn’t a reliable measure of generalization error, but it is a very bad sign. You don’t get a lot of credit for passing this sort of test (it is a fairly low bar), but when you fail this test things are bad (especially when the subsets your are testing on are structured to simulate your intended application as we have here).
But when PCA actually helps: what is it actually trying to do for us? The variables of this problem actually were set up in way that helps explain what we should be expecting (and do not confuse what we are expecting with what actually happens or with implementation details, the math can be as impressive as you want and still fail to deliver):
print(varNames)## [1] "Quelccaya.Ice.Core.summit..Ice.O.18" "Quelccaya.Ice.Core.summit..Ice.accumulation"## [3] "Quelccaya.Ice.Core..2..Ice.O.18" "Quelccaya.Ice.Core..2..Ice.accumulation"## [5] "Svalbard.Ice.melt" "west.Greenland.Ice.o.18"## [7] "tasmania.tree.ring.width" "north.patagonia.tree.ring.width"## [9] "NA.treeline.ring.widths" "Southeast.U.S.N..Carolina..Dendro.ring.widths"## [11] "Southeast.U.S.S..Carolina..Dendro.ring.widths" "Southeast.U.S.Georgia.Dendro.ring.widths"## [13] "Yakutia.Dendro.ring.widths" "Fennoscandia.Dendro.density"## [15] "Northern.Urals.Dendro.density"
Notice the variables fall into a few small groups: tree ring widths, dendro density, oxygen radioisotope measurements. Each of these is a measurement that is thought to be an outcome affected by temperature (so may record evidence of a temperature at given date). It would make sense to have factors that combine and summarize all measurements of a given type (such as an average). This is something that principal components loading could be expected to do for us (we have not confirmed if it actually has). PCA can be thought of as a simplification or regularize procedure to be combined with regression. The step-wise regressed PCA model is simpler than a full regression because only six synthetic variables are used instead of arbitrary combinations of all 15 original input variables.
However, once we start thinking in terms of variables groups and regularization we realize we have other ways to directly implement these improvements. We could attempt some sort of domain specific direct regularization on the least squares procedure such as adding a penalty term that says all of the coefficients related to a given variable type should have similar magnitudes (this would be a variant of Tikhonov regularization where instead of saying coefficient should be near zero we say they groups of them should be near each other). To do this we would have to use R’s linear algebra capabilities directly (as informing R’s linear regression fitter of the additional loss function would be difficult).
Or we could try a more Bayesian tack and say that all measurements in a given group are in fact noisy observations of hidden ideal variable. So there is an ideal “tree width variable” that has the unobserved temperature as an influence and our tree width measurements are all different derived observations. The Bayesian framework allows us to put a weighted criteria on model coefficients that reflects this pattern of dependence. With the right choice of structure and priors this would simplify the model in a meaningful way (again pushing model coefficients for related variables closer together) and help lessen the potential for overfitting. I don’t have an R package for this I currently recommend, but there are certainly a large number worth trying. The additional complexity of learning a Bayesian regression package is worth the extra control in being able to directly express sensible priors on the model parameters.
My biggest complaint about PCA is: PCA builds synthetic variables without every looking at their relation to the response variable. This is in principle bad: we can force the modeling system to ignore important variables by adding more and more large variance irrelevant variables. Their are some attempts to fix this sort of behavior but explanatory variable scale and variance dominate can undesirably dominate your analysis (such as Partial Least Squares, see Hastie, Tibshirani, Friedman “The Elements of Statistical Learning” 2nd edition, Springer).
So what is one to do? The answer is be careful: read the literature on how to perform a good analysis, use the right tools and test things related to your actual intended application.
Our advice is:
- Don’t use analyses without understanding. You don’t need to study complete implementation details, but you need to understand intent, limits, procedures and consequences.
- Check everything. Use a system (like R) where re-running and re-checking is no harder than copying and pasting. Unchecked procedures become mere ritual. Use the standard tests, but augment them with ad-hoc tests (tests you make up that may not prove anything if passed, but are very bad if they fail). Don’t be afraid to check if things are obviously wrong.
- Don’t treat the assurances (implicit or explicit) that come with a technique as replacements for direct tests of your results. And do not take cross-validation and hold-out techniques as inscrutable methods that always require randomness: see how you can directly re-adapt them to simulate your actual intended application.
It is no coincidence that PCA is most commonly used in fields like social science which have a large emphasis on proper experimental design, variable selection and variable pre-scaling (avoiding many of the potential pitfalls of PCA). We have in fact noticed strong per-field affinities for different modeling methods (and different specializations of methods). For example PLS literature starts in the field of chemometrics, which leads me to conclude it fixes problems that are important to them (scaling and controlling for explanatory variable relevance to the response variable) even if it is not a complete panacea.