I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.
Suppose I want to preserve the nf
features with the maximum variance. With scikit-learn
I am able to do it in this way:
from sklearn.decomposition import PCA
nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)
X_new = pca.transform(X)
Now, I get a new matrix X_new
that has a shape of n x nf. Is it possible to know which features have been discarded and which ones have been retained?
Thanks
Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behavior is like compression, not discarding. And X_proj is a better name than X_new, because it is the projection of X onto the principal components.

You can reconstruct X_rec as

X_rec = pca.inverse_transform(X_new)

Here, X_rec is close to X, but the less important information was dropped by PCA. So we can say X_rec is denoised. In my opinion, it is the noise that is discarded.
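For instance, a minimal round-trip sketch (the data, shapes, and variable names here are illustrative, not from the question):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(500, 30)                  # 500 samples, 30 features
pca = PCA(n_components=10)             # keep 10 components
X_proj = pca.fit_transform(X)          # shape (500, 10): projection onto components
X_rec = pca.inverse_transform(X_proj)  # shape (500, 30): back in the original feature space

# X_rec approximates X; what is lost lives in the low-variance directions
print(np.mean((X - X_rec) ** 2))       # mean squared reconstruction error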
The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.

Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.
If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection.
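For example, a minimal sketch of variance-based selection, where the retained column indices stay known (the data, threshold, and nf here are illustrative assumptions):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X = rng.rand(200, 50)  # 200 samples, 50 features

# Option 1: drop every column whose variance falls below a threshold
selector = VarianceThreshold(threshold=0.08)
X_kept = selector.fit_transform(X)
kept_idx = selector.get_support(indices=True)  # indices of the retained columns

# Option 2: keep exactly the nf highest-variance columns
nf = 10
top_idx = np.argsort(X.var(axis=0))[-nf:]
X_top = X[:, top_idx]

Unlike PCA, both variants keep a subset of the original columns, so you know exactly which features were retained and which were discarded.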
The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted by explained variance, so it can't be used to identify the important features:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
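Still, while components_ does not map one-to-one to original features, its rows (shape (n_components, n_features)) can be inspected to see how strongly each original feature loads on the retained components. A rough sketch of one common heuristic (not an official API, and the weighting choice is an assumption):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(300, 20)  # 300 samples, 20 features

pca = PCA(n_components=5).fit(X)

# Rows of components_ are sorted by explained variance, one row per component
loadings = np.abs(pca.components_)  # shape (5, 20)

# Heuristic: weight each component's loadings by its explained variance ratio,
# then sum per original feature to get a rough importance score
importance = loadings.T @ pca.explained_variance_ratio_  # shape (20,)
ranked = np.argsort(importance)[::-1]  # original feature indices, most influential first
print(ranked[:5])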