principal component analysis (PCA) in R: which fun

Can anyone explain what the major differences between the prcomp and princomp functions are?

Is there any particular reason why I should choose one over the other? In case this is relevant, the type of application I am looking at is a quality control analysis for genomic (expression) data sets.

Thank you!

Tags: r linear-algebra pca unsupervised-learning

1 Answer

啃猪蹄的小仙女

#2 · 2019-03-13 21:32

These two functions differ with respect to:

the function parameters (what you can/must pass in when you call the function);
the values returned by each; and
the numerical technique used by each to calculate principal components.

Numerical Technique Used to Calculate PCA

In particular, princomp should be a lot faster (and the performance difference will increase with the size of the data matrix), given that it calculates principal components via an eigendecomposition of the covariance matrix, whereas prcomp calculates them via singular value decomposition (SVD) of the original (centered) data matrix.

Eigenvalue decomposition is only defined for square matrices (because the technique amounts to solving the characteristic polynomial), but that's not a practical limitation, because the eigenvalue decomposition is always preceded by the preliminary step of calculating the covariance matrix from the original data matrix.

Not only is the covariance matrix square, but it is usually much smaller than the original data matrix (as long as the number of attributes m is less than the number of rows n, i.e. m < n, which is true most of the time).

The former (eigendecomposition) is less accurate (the difference is often not material), but much faster because the computation is performed on the covariance matrix rather than on the original data matrix. For instance, if the data matrix has the usual shape such that n >> m, e.g., 1000 rows (n) and 10 columns (m), then the covariance matrix is 10 x 10; by contrast, prcomp calculates the SVD of the original 1000 x 10 matrix.
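To make the two numerical routes concrete, here is a minimal sketch in base R. It is not the internal code of either function, just the same two decompositions applied by hand to a synthetic 1000 x 10 matrix (the random data and the variable names are my own assumptions for illustration).

```r
# Synthetic data, purely illustrative: 1000 rows, 10 columns
set.seed(1)
X  <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
Xc <- scale(X, center = TRUE, scale = FALSE)   # mean-center each column

# Route 1: eigendecomposition of the 10 x 10 covariance matrix
e      <- eigen(cov(Xc), symmetric = TRUE)
sd_eig <- sqrt(e$values)

# Route 2: SVD of the centered 1000 x 10 data matrix
s      <- svd(Xc)
sd_svd <- s$d / sqrt(nrow(X) - 1)

# Both routes give the same component standard deviations,
# and the SVD route matches what prcomp reports
all.equal(sd_eig, sd_svd)
all.equal(sd_svd, prcomp(X)$sdev)

# The component directions agree up to the sign of each column
max_diff <- max(abs(abs(e$vectors) - abs(s$v)))
max_diff
```

On well-conditioned data like this the two routes agree to within floating-point noise. Any accuracy gap would only be expected to show up on badly conditioned data, since forming the covariance matrix squares the condition number.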

I don't know the shape of data matrices for genomic expression data, but if the rows are in the thousands or even hundreds, then prcomp will be noticeably slower than princomp. I don't know your context, e.g., whether PCA is performed as a single step in a larger data flow and whether net performance (execution speed) is of concern, so I can't say whether this performance difference is relevant for your use case. Likewise, it's difficult to say whether the difference in numerical accuracy between the two techniques is significant; in fact, it depends on the data.
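Rather than guess, you can time both functions on a matrix shaped like your own data. The sketch below uses synthetic data and arbitrary sizes and repetition counts (all my assumptions), so treat the numbers it prints as machine-dependent. It also checks one shape constraint worth knowing: princomp stops with an error when there are fewer rows than columns, while prcomp accepts such a matrix.

```r
# Synthetic tall matrix, sizes chosen arbitrarily for illustration
set.seed(42)
X <- matrix(rnorm(5000 * 20), nrow = 5000, ncol = 20)

# Repeat each call a few times so the elapsed time is measurable
t_prcomp   <- system.time(for (i in 1:20) prcomp(X))["elapsed"]
t_princomp <- system.time(for (i in 1:20) princomp(X))["elapsed"]
c(prcomp = unname(t_prcomp), princomp = unname(t_princomp))

# Wide matrix: 10 rows (samples), 100 columns (variables)
wide <- matrix(rnorm(10 * 100), nrow = 10, ncol = 100)

# princomp refuses this shape; capture the error message instead of stopping
res <- tryCatch(princomp(wide), error = function(e) conditionMessage(e))
res

# prcomp handles it and returns min(nrow, ncol) components
length(prcomp(wide)$sdev)
```

If your expression matrix has samples in rows and genes in columns, it is probably the wide case, which would settle the choice in favour of prcomp regardless of speed.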

Return Values

princomp returns a list comprised of seven items; prcomp returns a list of five.

> names(pc1)    # prcomp
    [1] "sdev"     "rotation" "center"   "scale"    "x"       

> names(pc2)    # princomp
    [1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"

For princomp, the most important items returned are the component scores and loadings.

The values returned by the two functions can be reconciled (compared) this way: prcomp returns, among other things, a matrix called rotation, which is equivalent (up to the sign of each column) to the loadings matrix returned by princomp.
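Here is a small check of that correspondence, using the built-in USArrests data set (my choice for illustration; any numeric data with more rows than columns would do). It also shows one difference that is easy to miss: princomp divides by n when computing variances, whereas prcomp divides by n - 1, so the sdev values differ by a constant factor.

```r
# USArrests ships with R (datasets package); used here only as an example
pc1 <- prcomp(USArrests)     # SVD route
pc2 <- princomp(USArrests)   # eigendecomposition route

# Strip the classes so both are plain numeric matrices
rot <- unclass(pc1$rotation)
lod <- unclass(pc2$loadings)

# Same directions, but individual columns may be flipped in sign
max_diff <- max(abs(abs(rot) - abs(lod)))
max_diff

# sdev differs by sqrt((n - 1) / n) because of the variance divisor
n <- nrow(USArrests)
all.equal(unname(pc2$sdev), pc1$sdev * sqrt((n - 1) / n))
```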

If you multiply the centered (and, if requested, scaled) data matrix by prcomp's rotation matrix, the result is the matrix stored as x (the component scores, which princomp calls scores).
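The relationship between the data, rotation and x can be verified by hand. This sketch again uses USArrests as a stand-in data set (an assumption for illustration) and subtracts the column means that prcomp stored in center before multiplying.

```r
X  <- as.matrix(USArrests)
pc <- prcomp(X, center = TRUE, scale. = FALSE)

# Subtract the stored column means, then project onto the rotation matrix
Xc     <- sweep(X, 2, pc$center)
scores <- Xc %*% pc$rotation

# The hand-computed scores match the x component returned by prcomp
all.equal(scores, pc$x, check.attributes = FALSE)
```

The same rotation and center can be applied to new observations, which is what predict() does for a prcomp object.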

Finally, prcomp has a plot method (princomp has one too) which gives a scree plot. It shows the variance captured by each principal component, while summary() reports the relative and cumulative proportions; this is the most useful visualization of PCA in my opinion.
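A short sketch of both the plot and the underlying numbers, once more on USArrests (an illustrative choice). The plot call is guarded by interactive() so that the snippet also runs cleanly in a script.

```r
pc <- prcomp(USArrests, scale. = TRUE)

# Scree plot: variance of each principal component
if (interactive()) {
  screeplot(pc, type = "lines")   # plot(pc) draws the bar version
}

# Standard deviation, proportion of variance, cumulative proportion
imp <- summary(pc)$importance
imp

# The same proportions computed by hand from sdev
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
cum_explained <- cumsum(var_explained)
cum_explained
```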

Function Arguments

prcomp will mean center your data and scale it to unit variance for you if the arguments center and scale. are set to TRUE (note the trailing dot in scale.; center is already TRUE by default, scale. is FALSE). That's a trivial difference between the two given that you can both scale and mean center your data in a single line using the scale function.
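For completeness, here is a sketch showing three ways to run PCA on standardized data that should give the same component standard deviations. With princomp, the counterpart of scaling is the cor = TRUE argument, which works from the correlation matrix. USArrests is again just an example data set.

```r
# Let prcomp center and scale internally
pc_a <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standardize first with scale(), then switch both options off
pc_b <- prcomp(scale(USArrests), center = FALSE, scale. = FALSE)

# princomp on the correlation matrix
pc_c <- princomp(USArrests, cor = TRUE)

all.equal(pc_a$sdev, pc_b$sdev)
all.equal(unname(pc_c$sdev), pc_a$sdev)
```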
