PCA / EOF for data with missing values – a comparison of accuracy

R-bloggers 2014-09-15

Summary:

Not all Principal Component Analysis (PCA) (also called Empirical Orthogonal Function analysis, EOF) approaches are equal when it comes to dealing with a data field that contains missing values (i.e. is "gappy"). The following post compares several methods by assessing how accurately the derived PCs reconstruct the "true" data set, similar to the comparison conducted by Taylor et al. (2013). The gappy EOF methods to be compared are:
  1. LSEOF - "Least-Squares Empirical Orthogonal Functions" - The traditional approach, which modifies the covariance matrix used for the EOF decomposition by the number of paired observations, and further scales the projected PCs by these same weightings (see Björnsson and Venegas 1997, von Storch and Zwiers 1999 for details).
  2. RSEOF - "Recursively Subtracted Empirical Orthogonal Functions" - This approach modifies the LSEOF approach by recursively solving for the leading EOF, whose reconstructed field is then subtracted from the original field. This recursive subtraction is repeated until a given stopping point is reached (e.g. number of EOFs, % remaining variance) (see Taylor et al. 2013 for details).
  3. DINEOF - "Data Interpolating Empirical Orthogonal Functions" - This approach gradually solves for EOFs by means of an iterative algorithm that fits EOFs to a given number of non-missing reference points (a small percentage of the observations) via RMSE minimization (see Beckers and Rixen 2003 for details).
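As a rough illustration of the covariance estimate that LSEOF relies on, the following NumPy sketch divides each sum of cross-products by the number of paired observations. It is not taken from the "sinkr" package: the function name `gappy_covariance`, the use of column means computed over all observed values, and the division by the pair count (rather than the count minus one) are assumptions made for this example. The toy field is constructed so that the three pairwise estimates contradict one another.

```python
import numpy as np

def gappy_covariance(field):
    """Covariance between columns of `field` (rows = samples), NaN marks a gap.

    Hypothetical helper for illustration; each entry uses only the rows
    where both columns are observed.
    """
    observed = ~np.isnan(field)
    centered = field - np.nanmean(field, axis=0)   # column means ignore gaps
    filled = np.where(observed, centered, 0.0)     # gaps contribute nothing
    sums = filled.T @ filled                       # cross-products over shared rows
    counts = observed.astype(float).T @ observed.astype(float)  # paired observations
    cov = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return cov, counts

nan = np.nan
# Toy field: each pair of columns is observed together on only two rows
field = np.array([
    [1.0,  1.0,  nan],
    [-1.0, -1.0, nan],
    [nan,  1.0,  1.0],
    [nan, -1.0, -1.0],
    [1.0,  nan, -1.0],
    [-1.0, nan,  1.0],
])

cov, counts = gappy_covariance(field)
eigenvalues = np.linalg.eigvalsh(cov)  # ascending order
print(cov)
print(eigenvalues)  # the smallest eigenvalue is negative (about -1)
```

Because every entry is estimated from a different subset of rows, the resulting matrix need not be positive semi-definite, which is what the negative eigenvalue in this toy case shows.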
I have introduced both the LSEOF [link] and DINEOF [link] methods in the past, but have never directly compared them for the blog. The purpose of this post is to make this comparison and to also introduce a more general EOF function that is capable of conducting RSEOF. All analyses can be reproduced following installation of the "sinkr" package: https://github.com/menugget/sinkr

The basic problem comes down to the difficulty of decomposing a matrix that is not "positive-definite", i.e. the covariance matrix estimated from a gappy data set. DINEOF entirely avoids this issue by first interpolating the missing values to create a full data field, while LSEOF and RSEOF rely on decomposing this estimate. A known problem is that the trailing EOFs derived from such a matrix have amplified singular values, which can in turn amplify errors in field reconstructions that include them. The RSEOF approach attempts to remedy this by recursively solving for only the leading EOFs. In the following examples, I show the performance of the three approaches in terms of reconstructing the data field (including the "true" values).

Example 1 - Synthetic data set:

The first example uses the synthetic data set that Beckers and Rixen (2003) used in their introduction of the DINEOF approach. The accuracy of the reconstruction depends on the number of EOFs used. In a non-gappy example, a perfect reconstruction should be possible using the full set of EOFs. In fact, it only takes 9 EOFs when using the non-noisy true field, since it is a composite of 9 signals. In the case of the noisy, gappy data sets, reconstructions with trailing EOFs may increase errors. This can be seen in the figure at the top of the post, which shows RMSE vs. the number of EOFs used in the reconstruction. The figure shows the DINEOF approach to be the most accurate. The LSEOF approach has a clear RMSE minimum with 4 EOFs, while the RSEOF approach was largely able to remedy the amplification of error when using trailing EOFs.

The problem of error amplification is even more dramatic when viewed visually, as in the following figure, where the full set of EOFs has been used. It's clear that the LSEOF approach is only successful in reconstructing the non-gap values of the observed data, while values in the gaps are washed out by the error in trailing EOFs. In fact, given the amplitude of the noise, only about 4 EOF modes really carry any information across the entire field.

Again, DINEOF does a fine job of precisely estimating the EOF modes, even in cases where modes are quite similar in magnitude (e.g. EOFs 2 & 3). The LSEOF and RSEOF approaches create more of a mixture out of modes 2 & 3.
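The dependence of reconstruction accuracy on the number of EOFs can be illustrated for the simple gap-free, noise-free case with a short NumPy sketch. This is not the Beckers and Rixen (2003) data set or the "sinkr" code: the toy field (3 random signals mixed into 20 spatial points) and the function name `reconstruction_rmse` are assumptions made for this example. It only shows that the RMSE of a truncated reconstruction falls to numerical zero once the number of EOFs reaches the number of underlying signals.

```python
import numpy as np

def reconstruction_rmse(field, max_modes):
    """RMSE between `field` and its reconstruction from the first 1..max_modes EOFs.

    Hypothetical helper for illustration; assumes a complete (gap-free) field
    with rows as time steps and columns as spatial points.
    """
    mean = field.mean(axis=0)
    centered = field - mean
    # SVD of the centered field: rows of vt are the EOFs, u * s are the PCs
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    errors = []
    for k in range(1, max_modes + 1):
        recon = (u[:, :k] * s[:k]) @ vt[:k, :] + mean
        errors.append(float(np.sqrt(np.mean((field - recon) ** 2))))
    return errors

# Toy "true" field: a composite of 3 signals, with no noise and no gaps
rng = np.random.default_rng(0)
signals = rng.standard_normal((50, 3))    # 3 time series
patterns = rng.standard_normal((3, 20))   # 3 spatial patterns
true_field = signals @ patterns

rmse = reconstruction_rmse(true_field, max_modes=5)
for k, err in enumerate(rmse, start=1):
    print(k, err)  # error decreases with k and is numerically zero from k = 3
```

With noise and gaps added, the curve would no longer be guaranteed to decrease, which is the behaviour the post describes for the trailing EOFs of the LSEOF approach.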

Link:

http://feedproxy.google.com/~r/RBloggers/~3/gOz2-9s45vs/

From feeds:

Statistics and Visualization » R-bloggers

Tags:

r bloggers

Authors:

Marc in the box

Date tagged:

09/15/2014, 22:21

Date published:

09/15/2014, 06:16