How to use scikit-learn PCA for feature reduction

Posted 2020-05-14 16:33

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.

Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:

from sklearn.decomposition import PCA

nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)

X_new = pca.transform(X)

Now, I get a new matrix X_new that has a shape of n x nf. Is it possible to know which features have been discarded or the retained ones?

Thanks

3 Answers
何必那么认真
#2 · 2020-05-14 16:56

Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behavior is like compression, not discarding.

X_proj is a better name than X_new, because it is the projection of X onto the principal components.

You can reconstruct X_rec as

X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new

Here, X_rec is close to X, but the less important information was dropped by PCA, so we can say X_rec is a denoised version of X. In that sense, what PCA discards is the noise.
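The project-then-reconstruct idea above can be sketched as follows (the data here is a hypothetical random matrix, just to show the shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 10)  # n=200 samples (rows), m=10 features (columns)

pca = PCA(n_components=3)
X_proj = pca.fit_transform(X)          # project onto 3 principal components
X_rec = pca.inverse_transform(X_proj)  # map back to the original feature space

# X_rec has the same shape as X, but only the variance captured by the
# 3 retained components; the rest (the "noise") is gone.
err = np.mean((X - X_rec) ** 2)
print(X_rec.shape, err)
```

Note that X_rec still has all 10 original columns; no individual feature was removed, each column is just approximated from the 3 retained directions.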

Emotional °昔
#3 · 2020-05-14 16:59

The features that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.

Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.

If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple general feature selection methods, you can take a look at sklearn.feature_selection.
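If you actually want to keep or drop original columns by variance (as the question asks), one minimal sketch uses VarianceThreshold from sklearn.feature_selection; the data and threshold here are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X = rng.rand(200, 10)   # 200 samples, 10 features
X[:, 0] *= 0.01         # make feature 0 nearly constant (low variance)

# drop any column whose variance falls below the threshold
selector = VarianceThreshold(threshold=1e-3)
X_sel = selector.fit_transform(X)

# unlike PCA, this tells you exactly which original features were kept
kept = selector.get_support(indices=True)
print(X_sel.shape, kept)
```

Here get_support answers the original question directly: the returned indices are the retained original features, and everything else was discarded.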

我想做一个坏孩纸
#4 · 2020-05-14 17:01

The answer marked above is incorrect. The sklearn site clearly states that the components_ array is sorted by explained variance, so it can't be used to identify which of your original features are important.

components_ : array, [n_components, n_features] Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
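The sorting described in the quoted documentation is easy to verify: the rows of components_ correspond to principal directions ordered by explained variance, not to original features (the data below is a hypothetical random matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 10)  # 200 samples, 10 features

pca = PCA(n_components=4).fit(X)

# shape is (n_components, n_features): each ROW is a direction,
# each row mixes weights over ALL 10 original features
print(pca.components_.shape)

# rows are ordered by how much variance they explain, descending
print(pca.explained_variance_)
```

So the first row of components_ is the highest-variance direction, not "the most important original feature"; every row has a weight for every original column.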
