PCA projection and reconstruction in scikit-learn

I can perform PCA in scikit by code below: X_train has 279180 rows and 104 columns.

from sklearn.decomposition import PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I must do following:

""" Projection """
comp = pca.components_ #30x104
com_tr = np.transpose(pca.components_) #104x30
proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 297180x30

But I am hesitating with this step, because Scikit documentation says:

components_: array, [n_components, n_features]

Principal axes in feature space, representing the directions of maximum variance in the data.

It seems to me, that it is already projected, but when I checked the source code, it returns only the eigenvectors.

What is the right way how to project it?

Ultimately, I am aiming to calculate the MSE of reconstruction.

""" Reconstruct """
recon = np.dot(proj,comp) #297180x30 * 30x104 = 279180x104

"""  MSE Error """
print "MSE = %.6G" %(np.mean((X_train - recon)**2))

标签： python machine-learning scikit-learn pca

2条回答

Evening l夕情丶

2楼-- · 2020-05-14 13:41

You can do

proj = pca.inverse_transform(X_train_pca)

That way you do not have to worry about how to do the multiplications.

What you obtain after pca.fit_transform or pca.transform are what is usually called the "loadings" for each sample, meaning how much of each component you need to describe it best using a linear combination of the components_ (the principal axes in feature space).

The projection you are aiming at is back in the original signal space. This means that you need to go back into signal space using the components and the loadings.

So there are three steps to disambiguate here. Here you have, step by step, what you can do using the PCA object and how it is actually calculated:

pca.fit estimates the components (using an SVD on the centered Xtrain):

from sklearn.decomposition import PCA
import numpy as np
from numpy.testing import assert_array_almost_equal

#Should this variable be X_train instead of Xtrain?
X_train = np.random.randn(100, 50)

pca = PCA(n_components=30)
pca.fit(X_train)

U, S, VT = np.linalg.svd(X_train - X_train.mean(0))

assert_array_almost_equal(VT[:30], pca.components_)

pca.transform calculates the loadings as you describe

X_train_pca = pca.transform(X_train)

X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)

assert_array_almost_equal(X_train_pca, X_train_pca2)

pca.inverse_transform obtains the projection onto components in signal space you are interested in

X_projected = pca.inverse_transform(X_train_pca)
X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_

assert_array_almost_equal(X_projected, X_projected2)

You can now evaluate the projection loss

loss = ((X_train - X_projected) ** 2).mean()

0人赞添加讨论(0) 举报

干净又极端

3楼-- · 2020-05-14 13:41

Adding on @eickenberg's post, here is how to do the pca reconstruction of digits' images:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn import decomposition

n_components = 10
image_shape = (8, 8)

digits = load_digits()
digits = digits.data

n_samples, n_features = digits.shape
estimator = decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True)
digits_recons = estimator.inverse_transform(estimator.fit_transform(digits))

# show 5 randomly chosen digits and their PCA reconstructions with 10 dominant eigenvectors
indices = np.random.choice(n_samples, 5, replace=False)
plt.figure(figsize=(5,2))
for i in range(len(indices)):
    plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits[indices[i],:], image_shape)), plt.axis('off')
plt.suptitle('Original', size=25)
plt.show()
plt.figure(figsize=(5,2))
for i in range(len(indices)):
    plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits_recons[indices[i],:], image_shape)), plt.axis('off')
plt.suptitle('PCA reconstructed'.format(n_components), size=25)
plt.show()

0人赞添加讨论(0) 举报

PCA projection and reconstruction in scikit-learn

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间