I'm trying to use PCA to pick good predictors for the xreg argument of an arima model, in order to forecast the tVar variable below. I am using the reduced dataset below, with just a few variables, to keep the example simple.
I am trying to understand how the formula argument in princomp works. For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])"?
What I ultimately would like to do is create a new variable that explains most of the variance in tVar. Is that something I can do using PCA? If so, could someone please explain how, or point me towards an example?
Code:
pc <- princomp(~ xVar1 + xVar2,
               data = na.omit(dfData[, c("tVar", "xVar1", "xVar2")]),
               cor = TRUE)
Data:
dput(na.omit(dfData[1:100,c("tVar","xVar1","xVar2")]))
structure(list(tVar = c(11, 14, 17, 5, 5, 5.5, 8, 5.5,
6.5, 8.5, 4, 5, 9, 10, 11, 7, 6, 7, 7, 5, 6, 9, 9, 6.5, 9, 3.5,
2, 15, 2.5, 17, 5, 5.5, 7, 6, 3.5, 6, 9.5, 5, 7, 4, 5, 4, 9.5,
3.5, 5, 4, 4, 9, 4.5, 6, 10, 9.5, 15, 9, 5.5, 7.5, 12, 17.5,
19, 7, 14, 17, 3.5, 6, 15, 11, 10.5, 11, 13, 9.5, 9, 7, 4, 6,
15, 5, 18, 5, 6, 19, 19, 6, 7, 7.5, 7.5, 7, 6.5, 9, 10, 5.5,
5, 7.5, 5, 4, 10, 7, 5, 12), xVar1 = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
xVar2 = c(0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L,
2L, 3L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 3L, 1L, 0L, 1L, 2L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L,
1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L,
0L)), .Names = c("tVar", "xVar1", "xVar2"
), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L,
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L,
52L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L,
66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L,
79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L,
92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L),
class = "data.frame", na.action = structure(c(8L,53L),
.Names = c("8", "53"), class = "omit"))
(This is a very good post! It is interesting to have another post today regarding PCA. Though that question is more basic, about the difference between princomp and prcomp, the mathematical details with R code I give in the answer may benefit anyone learning PCA.)

PCA is used for dimension reduction (low-rank approximation), when:

- you have a number of (say p) correlated variables x1, x2, ..., xp;
- you want to shrink them down to a small number (say k < p) of new, linearly independent variables z1, z2, ..., zk;
- you want to use z1, z2, ..., zk rather than x1, x2, ..., xp to predict a response variable y.

A fundamental picture and a little mathematics
Suppose you have a response variable y. A full linear regression, without dropping any variables, would take the formula:

y ~ x1 + x2 + ... + xp

However, we can fit a reasonable approximate model after PCA. Let X be the model matrix above, i.e., the matrix combining all observations of x1, x2, ..., xp by column; then:
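(A base R sketch of this step; root_eigen_value, eigen_vector_mat and X1 are the names used in the rest of this answer, while S and E are scratch names.)

S <- cor(X)                           ## correlation matrix of the predictors
E <- eigen(S)                         ## eigen decomposition of S
root_eigen_value <- sqrt(E$values)    ## square roots of the eigenvalues
eigen_vector_mat <- E$vectors         ## matrix of eigenvectors
X1 <- scale(X) %*% eigen_vector_mat   ## transformed matrix: the component scores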
Now, root_eigen_value (a length-p vector) is monotonically decreasing, i.e., the contribution to the total covariance is decreasing, hence we can select only the first k values. Accordingly, we can select the first k columns of the transformed matrix X1. Let's do:
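Z <- X1[, 1:k]   ## a sketch: keep the first k columns as the new variables (k is your choice)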
Now we have successfully reduced p variables to k variables, and each column of Z is one of the new variables z1, z2, ..., zk. Bear in mind that these variables are not a subset of the original variables; they are completely new, without names. But since we are only interested in predicting y, it does not matter what names we give to z1, z2, ..., zk. Then we can fit an approximate linear model:

y ~ z1 + z2 + ... + zk

Use princomp()
In fact, things are easier, because princomp() does all the computation for us. By calling:
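## a sketch of the call: list all p predictors (never the response) in the
## formula; `dat` here stands for your data frame of predictors
pc <- princomp(~ x1 + x2, data = dat, cor = TRUE)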
we can get all we want. Among the values returned in pc:

- pc$sdev gives root_eigen_value. If you do plot(pc), you will see a screeplot (a barplot of the component variances). If your input data are highly correlated, you should expect a near-exponential decay in this figure, with only a few components dominating the covariance. (Unfortunately, your toy data will not show this: xVar1 and xVar2 are binary and already linearly independent, hence after PCA you will see that they both give an equal contribution.)
- pc$loadings gives eigen_vector_mat;
- pc$scores gives X1.

Use arima()
The variable selection process is simple. If, by inspecting plot(pc), you decide to take the first k variables out of the total of p variables, then you extract the first k columns of the pc$scores matrix. Each column forms one of z1, z2, ..., zk; pass them to arima() via the argument xreg.
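For example, with your data (a sketch; the ARIMA order is purely illustrative, not a recommendation):

dat <- na.omit(dfData[, c("tVar", "xVar1", "xVar2")])
pc  <- princomp(~ xVar1 + xVar2, data = dat, cor = TRUE)
Z   <- pc$scores[, 1:2]                               ## k = 2, i.e., all components of this toy example
fit <- arima(dat$tVar, order = c(1, 0, 0), xreg = Z)  ## AR(1) chosen only for illustration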
Back to your question about the formula
After my explanation, you should know the answer is "No". Do not mix the response variable tVar, used in the regression step, with the predictor variables xVar1, xVar2, ..., used in the PCA step.

princomp() allows three ways to pass in arguments:

- a formula plus a data frame;
- a data frame alone;
- a numeric matrix.

You have chosen the first way. The formula is used to tell princomp() to extract data from data, and it will later compute the model matrix, covariance matrix, correlation matrix, and eigen decomposition, till we finally get the result of PCA.
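For example (a sketch; df stands for a data frame holding only the predictor columns):

pc1 <- princomp(~ xVar1 + xVar2, data = df, cor = TRUE)   ## 1. formula plus data frame
pc2 <- princomp(df, cor = TRUE)                           ## 2. data frame
pc3 <- princomp(as.matrix(df), cor = TRUE)                ## 3. numeric matrix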
Follow-up on your comments

The formula tells princomp() how to extract the matrix from the data frame. Since you use the same formula ~ xVar1 + xVar2, whether you include tVar in the data frame you pass to princomp() makes no difference, as that column will not be touched by princomp().

Do not include tVar in your formula for PCA. As I said, regression and PCA are different problems and should not be confused with each other.

Yes. Regression (or arima(), in your setting) is used to set up the relation between your response tVar and the predictor variables x1, x2, ..., xp or z1, z2, ..., zk. A regression/ARIMA model explains the mean and variance of the response in terms of the predictors.

PCA is a different problem. It only selects a low-rank (fewer parameters) representation of your original predictor variables xVar1, xVar2, ..., so that you can use fewer variables in later regression / ARIMA modelling.
xVar1
andxVar2
you gave are just linearly independent, so dimension reduction is impossible. You can view the correlation in your data, bypairs(mydata)
. A better visualization may be to usecorrplot
R package. See this answer for examples on how to use it to plot covariance matrix.