PCA: How does princomp() work and can I use it to

2019-01-23 00:31发布

问题:

I'm trying to use PCA to pick good predictors to use in the xreg argument of an arima model to try to forecast the tVar variable below. I am just using the reduced dataset below with just a few variables to make the example simple.

I am trying to understand how the formula argument in princomp works. For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])" ?

What I ultimately would like to do is create a new variable which explains most of the variance in tVar. Is that something I can do using PCA? If so, could someone please explain how or point me towards an example?

Code:

pc <- princomp(~xVar1+xVar2,
               data = na.omit(dfData[,c("tVar","xVar1","xVar2")]), 
               cor=TRUE)

Data:

dput(na.omit(dfData[1:100,c("tVar","xVar1","xVar2")]))
structure(list(tVar = c(11, 14, 17, 5, 5, 5.5, 8, 5.5, 
          6.5, 8.5, 4, 5, 9, 10, 11, 7, 6, 7, 7, 5, 6, 9, 9, 6.5, 9, 3.5, 
          2, 15, 2.5, 17, 5, 5.5, 7, 6, 3.5, 6, 9.5, 5, 7, 4, 5, 4, 9.5, 
          3.5, 5, 4, 4, 9, 4.5, 6, 10, 9.5, 15, 9, 5.5, 7.5, 12, 17.5, 
          19, 7, 14, 17, 3.5, 6, 15, 11, 10.5, 11, 13, 9.5, 9, 7, 4, 6, 
          15, 5, 18, 5, 6, 19, 19, 6, 7, 7.5, 7.5, 7, 6.5, 9, 10, 5.5, 
          5, 7.5, 5, 4, 10, 7, 5, 12), xVar1 = c(0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
          1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
          xVar2  = c(0L, 
          1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 
          2L, 3L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
          0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
          0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 3L, 1L, 0L, 1L, 2L,
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 
          1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 
          0L)), .Names = c("tVar", "xVar1", "xVar2"
          ), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 
          13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,25L, 
          26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,38L, 
          39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,51L, 
          52L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L,
          66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 
          79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 
          92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L),
          class  = "data.frame", na.action = structure(c(8L,53L),
          .Names = c("8", "53"), class = "omit"))

回答1:

(This is a very good post! It is interesting to have another post today regarding PCA. Though that question is more basic, regarding the difference between princomp and prcomp, but the mathematical details with R code I make in the answer may be beneficial to any one learning PCA.)

PCA is used for dimension reduction (low-rank approximation), when:

  1. you have a lot of (say p) correlated variables x1, x2, ..., xp;
  2. you want to shrink them to a small number (say k < p) of new, linearly independent variables z1, z2, ..., zk;
  3. you want to use z1, z2, ..., zk rather than x1, x2, ..., xp to predict a response variable y.

A fundamental picture and a little mathematics

Suppose you have a response variable y, a full linear regression without dropping any variables should take the formula:

y ~ x1 + x2 + ... + xp

However, we can do a reasonable approximate model, after PCA. Let X be the model matrix in above, i.e., the matrix by combining all observations of x1, x2, ... , xp by column, then

S <- cor(X)  ## get correlation matrix S
E <- eigen(S)  ## compute eigen decomposition of S
root_eigen_value <- sqrt(E$values)  ## square root of eigen values
eigen_vector_mat <- E$vectors  ## matrix of eigen vectors
X1 <- scale(X) %*% eigen_vector_mat  ## transform original matrix

Now, root_eigen_value (a length-p vector) is monotonically decreasing, i.e., the contribution to total covariance is decreasing, hence we can only select the first k values. Accordingly, we can select the first k columns of transformed matrix X1. Let's do:

Z <- X1[, 1:k]

Now, we have successfully reduced p variables to k variables, and each column of Z is the new variable z1, z2, ..., zk. Bear in mind that these variables are not a subset of original variables; they are completely new, without names. But since we are only interested in predicting y, it does not matter what name we give to z1, z2, ..., zk. Then we can fit an approximate linear model:

y ~ z1 + z2 + ... + zk

Use princomp()

In fact, things are easier, because princomp() does all the computation for us. By calling:

pc <- princomp(~ x1 + x2 + ... + xp, data, cor = TRUE)

we can get all we want. Among the a few returned values in pc:

  1. pc$sdev gives root_eigen_value. If you do plot(pc), you can see a barplot showing this. If your input data are highly correlated, then you are expected to see a near exponential decay in this figure, with only a few variables dominating the covariance. (Unfortunately, your toy data is not going to work. xVar1 and xVar2 are binary, and they are already linearly independent, hence after PCA, you will see that they both give equal contribution.)
  2. pc$loadings gives eigen_vector_mat;
  3. pc$scores gives X1.

Use arima()

The variable selection process is simple. If you decide to take the first k variables out of a total of p variables, by inspecting plot(pc), then you extract the first k columns of the pc$scores matrix. Each column forms z1, z2, ..., zk, and pass them to arima() via argument reg.


Back to your question about formula

For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])"

After my explanation, you should know the answer is "No". Do not mix response variable tVar used in the regression step, with the predictor variables xVar1, xVars, ... used in the PCA step.

princomp() allows three ways to pass in arguments:

  1. by formula and data;
  2. by model matrix;
  3. by covariance matrix.

You have chosen the first way. The formula is used to tell princomp() to extract data from data, and it will later compute model matrix, covariance matrix, correlation matrix, eigen decomposition, till we finally get the result of PCA.


Follow-up on your comments

So if I understand correctly, PCA is primarily for reducing the number of variables, and I shouldn't include the response variable tVar in the formula or data. But I was wondering why princomp(~xVar1+xVar2, data = na.omit(dfData[,c("tVar","xVar1","xVar2")]), cor=TRUE) and princomp(na.omit(dfData[,c("xVar1","xVar2")]), cor=TRUE) are basically equivalent?

The formula tells how to extract matrix from the data frame. Since you use the same formula ~ xVar1 + xVar2, whether you include tVars in the data frame to pass to princomp makes no difference, as that column will not be touched by princomp.

Do not include tVars in your formula for PCA. As I said, regression and PCA are different problems, and shall not be confused with each other.

To be clear, the strategy with PCA isn't to create a new variable which is a combination of xVar1 and xVar2 and explains most the variance in tVar, but rather to create a new variable which is a combination of xVar1 and xVar2 and explains most the variance of dfData[,c("xVar1","xVar2")]?

Yes. Regression (or arima() in your setting) is used to set up relation between your response tVars and predictor variables x1, x2, ..., xp or z1, z2, ..., zk. A regression/arima model will explain the mean and variance of response, in terms of predictors.

PCA is a different problem. It only select a low rank (fewer parameters) representation of your original predictor variables xVar1, xVar2, ..., so that you can use fewer variables in later regression / ARIMA modelling.

Still, you might need to think about whether you should do PCA for your problem.

  1. Do you have a lot of variables, say 10+? In statistical modelling, it is common to reach hundreds of thousands of parameters. Computation can get very slow if we use all of them. PCA is useful in this case, to reduce computational complexity, while giving a reasonable representation of original covariance.
  2. Are your variables highly correlated? If they are readily linearly independent of each other, PCA may not drop anything. For example, the toy data xVar1 and xVar2 you gave are just linearly independent, so dimension reduction is impossible. You can view the correlation in your data, by pairs(mydata). A better visualization may be to use corrplot R package. See this answer for examples on how to use it to plot covariance matrix.