I tried searching for this but could not find the info. I am conducting a linear regression using 10 variables (1 y variable and 9 x variables). All the variables are correlated. I want to see if I need all 9 variables or not. How do I use the data from PCA to eliminate variables?
I conducted PCA on all 10 variables using prcomp() and got the following results:
Importance of components:
                            PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8      PC9     PC10
Standard deviation       0.1021  0.04005  0.03464  0.03114  0.02414  0.02047  0.01708  0.01425  0.01308 0.003287
Proportion of Variance   0.6567  0.10101  0.07555  0.06104  0.03668  0.02639  0.01838  0.01278  0.01078 0.000680
Cumulative Proportion    0.6567  0.75773  0.83328  0.89432  0.93100  0.95738  0.97576  0.98854  0.99932 1.000000
Rotation:
               PC1          PC2          PC3          PC4          PC5          PC6          PC7          PC8          PC9         PC10
 [1,] -0.219033940  0.009323363   0.14371969   0.06987706   0.19302513  -0.02648874   0.16654618  -0.06567080 -0.925393447  0.005948459
 [2,] -0.007661133 -0.027804546  -0.24045564   0.13997803   0.00461297  -0.13195868   0.13625008   0.05140013 -0.005668700 -0.939724900
 [3,] -0.053184446 -0.212036806  -0.26744318   0.36220366  -0.53094911   0.24356319  -0.04692857  -0.62944042 -0.084900337  0.051564259
 [4,] -0.188804651  0.062154139  -0.08807850   0.18886008   0.19969440  -0.59987987  -0.68882923  -0.20548388 -0.004509710  0.024501524
 [5,] -0.299789863  0.080676352  -0.62720621  -0.23335343   0.37274825   0.50767975  -0.23796461   0.03549668 -0.025233090  0.023917725
 [6,] -0.013478134 -0.052386807  -0.58015768   0.34394876  -0.01276741  -0.38994226   0.42009710   0.31887185  0.002157408  0.334375266
 [7,] -0.380565266  0.227200067   0.23992808   0.40306010   0.46135693   0.09059073   0.35930614  -0.34019038  0.342613874  0.015991214
 [8,] -0.432463682  0.037822199   0.20765408   0.45337044  -0.30497494   0.26299209  -0.26947304   0.57196490  0.008807625 -0.029461460
 [9,] -0.654931547  0.158646794  -0.01629962  -0.51083458  -0.39357245  -0.27198634   0.20326283  -0.08572653  0.083798804 -0.010738521
[10,] -0.250287731 -0.928894500   0.10639604  -0.08339656   0.20266163  -0.03955488   0.02948133   0.03827340  0.106117791  0.002154660
It sounds like you are facing a model selection problem: you want to choose the best variables without overfitting, correct?
PCA may not be the way to go for feature selection; here's one discussion of it:
https://stats.stackexchange.com/questions/27300/using-pca-for-feature-selection
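The usual purpose of PCA is dimensionality reduction, i.e. describing relationships in your data using fewer dimensions than are actually present. A component that explains a lot of variance could be a good feature, but not necessarily: PCA ranks components by the variance they explain, not by how well they predict the response, so it's not exactly geared towards that purpose. Each component is also a linear combination of all the original variables, so dropping components does not drop variables.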
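To make that point concrete, here is a small sketch on simulated data (not your data; the names x1, x2 and y are made up for illustration). The predictor with the large variance has nothing to do with y, while the low-variance one drives it, so the first principal component explains almost all of the variance but is nearly useless for predicting y.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n, sd = 10)             # high-variance predictor, unrelated to y
x2 <- rnorm(n, sd = 1)              # low-variance predictor that drives y
y  <- 3 * x2 + rnorm(n, sd = 0.5)   # hypothetical response

# PCA on the predictors only (unscaled, so PC1 follows the high-variance x1)
pca <- prcomp(cbind(x1, x2))
prop_var <- summary(pca)$importance["Proportion of Variance", ]
print(prop_var)

# How well does each component predict y on its own?
pc1 <- pca$x[, 1]
pc2 <- pca$x[, 2]
r2_pc1 <- summary(lm(y ~ pc1))$r.squared
r2_pc2 <- summary(lm(y ~ pc2))$r.squared
print(c(PC1 = r2_pc1, PC2 = r2_pc2))
```

In this toy setup the component with the smaller share of variance should be the far better predictor, which is why variance explained alone is a poor guide to which features to keep.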
If what you want to do is pare down the number of features in your model, I would suggest using an information criterion like the AIC. You can easily use this in R with the stepAIC function from the MASS package. Run as backward elimination, at each step it trims out another feature, choosing the removal that lowers the AIC the most, and it stops when no removal lowers the AIC any further. There is a lot more that goes into model selection, and a lot of things to consider and adjust, so this is not a prescriptive guide; I just wanted to bring it up as something to consider.
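A minimal sketch of such a stepAIC call, using simulated data since I don't have yours. The data frame dat, the column names x1 to x9, and the coefficients are assumptions made up for illustration; you would swap in your own data frame and response.

```r
library(MASS)  # provides stepAIC

set.seed(42)
n <- 200

# Hypothetical data: 9 predictors, of which only x1, x2 and x3 affect y
dat <- as.data.frame(matrix(rnorm(n * 9), ncol = 9))
names(dat) <- paste0("x", 1:9)
dat$y <- 2 * dat$x1 - 1.5 * dat$x2 + dat$x3 + rnorm(n)

# Start from the full model with all 9 predictors
full_fit <- lm(y ~ ., data = dat)

# Backward elimination: drop one term at a time while the AIC keeps falling
reduced_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

print(formula(reduced_fit))  # the predictors that survived
print(c(full = AIC(full_fit), reduced = AIC(reduced_fit)))
```

Setting trace = TRUE prints each elimination step, which is useful for seeing the order in which variables are dropped. With strongly correlated predictors like yours, the set that survives can be sensitive to small changes in the data, so treat the result as a starting point rather than a final answer.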