I have performed PCA on my original dataset, and from the compressed dataset produced by the PCA I have selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify which of the original features are important in the reduced dataset. How do I find out which features are important and which are not among the remaining principal components after the dimensionality reduction? Here is my code:
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
Furthermore, I also tried to run a clustering algorithm on the reduced dataset, but to my surprise the score is lower than on the original dataset. How is that possible?
First of all, I assume that you call features the variables and not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example I am using the iris data.

Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest, in absolute value) of their coefficients (loadings). See my last paragraph after the plot for more details.
PART 1: I explain how to check the importance of the features and how to plot a biplot.
PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
PART 1:
Visualize what's going on using the biplot
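A minimal sketch of such a biplot, assuming the scaled iris data; the function name myplot and the plotting details are my own illustrative choices:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)  # scale the features first

pca = PCA()
x_new = pca.fit_transform(X)  # project the samples onto the PCs

def myplot(score, coeff, labels=None):
    # score: projected samples on the first two PCs, coeff: loadings with shape [n_features, 2]
    xs, ys = score[:, 0], score[:, 1]
    n = coeff.shape[0]
    # rescale the scores so points and loading arrows fit in the same plot
    plt.scatter(xs / (xs.max() - xs.min()), ys / (ys.max() - ys.min()), c=iris.target)
    for i in range(n):
        # one arrow per original feature; its direction/length is the loading on PC1 and PC2
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name, color='g', ha='center', va='center')
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid()

# scores of the first 2 PCs, loadings transposed to shape [n_features, 2]
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()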
Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude - higher importance)
Let's first see how much variance each PC explains.
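For example, a quick way to check this (the values in the comment are only approximate, for the scaled iris data used here):

print(pca.explained_variance_ratio_)
# roughly [0.72 0.23 0.04 0.01] for the scaled iris data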
PC1 explains 72% and PC2 explains 23%. Together, if we keep only PC1 and PC2, they explain 95%.

Now, let's find the most important features.
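A quick way to inspect the loadings (my own sketch) is to print their absolute values; the first row of the printed matrix is the PC1 row quoted below.

# each row of pca.components_ is one principal component,
# each column corresponds to one original feature
print(abs(pca.components_))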
Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row

[0.52237162 0.26335492 0.58125401 0.56561105]

we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important.

To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.

PART 2:
The important features are the ones that influence the components more and thus have a large absolute value/score on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe use this:
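A minimal sketch of the idea, assuming a small random toy dataset with features named 'a' to 'e' (the data, the seed and the column names are illustrative, so the features reported as most important may differ from the e and d mentioned below):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

np.random.seed(0)
train_features = np.random.rand(10, 5)            # toy data: 10 samples, 5 features
initial_feature_names = ['a', 'b', 'c', 'd', 'e']

model = PCA(n_components=2).fit(train_features)

n_pcs = model.components_.shape[0]                # number of kept components

# index of the feature with the largest absolute loading on EACH component
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

# map those indices back to the feature names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# build a small dataframe: one row per PC with its most important feature
dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}
df = pd.DataFrame(sorted(dic.items()), columns=['PC', 'most important feature'])
print(df)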
This prints the most important feature name for each PC. So on PC1 the feature named e is the most important, and on PC2 it is d.