I have a dataset with 400 features.
What I did:
# approach 1: eigendecomposition of the covariance matrix
import numpy as np
from numpy import linalg as LA
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)  # assume these also get sorted by eigenvalue
# approach 2: scikit-learn PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=300)
d_fit = pca.fit_transform(d_train)
pc = pca.components_  # principal axes, one per row
Now, these two should be the same, right? PCA is just the eigendecomposition of the covariance matrix.
But in my case they come out very different.
How can that be? Am I making a mistake somewhere above?
Comparing variances:
import numpy as np
LA = np.linalg
d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
d_fit = pca.fit_transform(d_train)
pc = pca.components_
ve = pca.explained_variance_
#mypca[0,:], pc[0,:] pc.transpose()[0,:]
plt.plot(list(range(len(eigens))), [ x.transpose().dot(d_cov).dot(x) for x,y in zip(mypca, eigens) ])
plt.plot(list(range(len(ve))), ve)
plt.show()
print(mypca, '\n---\n' , pc)
You need to read the docs more carefully. NumPy's documentation is thorough, and very often you'll find the solution to your problem just by reading it.
Here is a modified version of your code (imports at the top of the snippet, .T instead of .transpose(), PEP 8):
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from numpy import linalg as LA
d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.T)
eigens, mypca = LA.eig(d_cov)
pca = PCA(n_components=10)
d_fit = pca.fit_transform(d_train)
pc = pca.components_
explained = pca.explained_variance_
my_explained = np.sort([x.T.dot(d_cov).dot(x) for x in mypca.T])[::-1]  # iterate over columns (eigenvectors), sort descending
plt.close('all')
plt.figure()
plt.plot(my_explained, label='me')
plt.plot(explained, label='sklearn')
plt.legend()
plt.show(block=False)
The two curves are exactly the same. The important thing is that I iterate over mypca.T, not mypca.
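To see why that matters, here is a tiny self-contained sketch (a toy 2x2 array, not the variables above): iterating a 2-D NumPy array yields its rows, so to walk over the eigenvectors, which np.linalg.eig returns as columns, you have to iterate over the transpose.

import numpy as np

a = np.array([[1, 2],
              [3, 4]])

print([row for row in a])    # rows of a:    [array([1, 2]), array([3, 4])]
print([col for col in a.T])  # columns of a: [array([1, 3]), array([2, 4])]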
Signature: np.linalg.eig(a)
Docstring:
Compute the eigenvalues and right eigenvectors of a square array.
Parameters
----------
a : (..., M, M) array
Matrices for which the eigenvalues and right eigenvectors will
be computed
Returns
-------
w : (..., M) array
# not important for you
v : (..., M, M) array
The normalized (unit "length") eigenvectors, such that the
column ``v[:,i]`` is the eigenvector corresponding to the
eigenvalue ``w[i]``.
The eigenvectors are returned as the columns of mypca, not its rows, so for x in mypca was iterating over rows.
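A quick self-contained check (my own sketch, not part of the quoted docstring) that it really is the columns of the returned matrix that satisfy the eigen-equation d_cov @ v == w[i] * v:

import numpy as np

d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.T)
eigens, vecs = np.linalg.eig(d_cov)

# every column vecs[:, i] is an eigenvector of d_cov with eigenvalue eigens[i]
print(all(np.allclose(d_cov @ vecs[:, i], eigens[i] * vecs[:, i])
          for i in range(len(eigens))))  # prints True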
I'm not an expert on PCA, but I seem to get similar values if I transpose one of the matrices.
>>> import numpy as np
>>> LA = np.linalg
>>> d_train = np.random.randn(100, 10)
>>> d_cov = np.cov(d_train.transpose())
>>> eigens, mypca = LA.eig(d_cov)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=10)
>>> d_fit = pca.fit_transform(d_train)
>>> pc = pca.components_
>>> mypca[0,:]
array([-0.44255435, -0.77430549, -0.14479638, -0.06459874, 0.24772212,
0.20780185, 0.22388151, -0.05069543, -0.14515676, -0.03385801])
>>> pc[0,:]
array([-0.44255435, -0.24050535, -0.17313927, 0.07182494, 0.09748632,
0.17910516, 0.26125107, 0.71309764, 0.17276004, 0.25095447])
>>> pc.transpose()[0,:]
array([-0.44255435, 0.77430549, 0.14479638, -0.06459874, 0.24772212,
-0.20780185, 0.22388151, -0.03385801, 0.14515676, 0.05069543])
>>> list(zip(pc.transpose()[:,0], mypca[:,0]))
[(-0.44255435328718207, -0.44255435328718096),
(-0.24050535133912765, -0.2405053513391287),
(-0.17313926714559819, -0.17313926714559785),
(0.07182494253930383, 0.0718249425393035),
(0.09748631534772645, 0.09748631534772684),
(0.17910516453826955, 0.17910516453826758),
(0.2612510722861703, 0.2612510722861689),
(0.7130976419217306, 0.7130976419217326),
(0.17276004381786172, 0.17276004381786136),
(0.25095447415020183, 0.2509544741502009)]
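The leftover differences are just ordering and sign: np.linalg.eig does not sort the eigenpairs by eigenvalue, sklearn returns its components in order of decreasing explained variance, and eigenvectors are only defined up to sign anyway. A small follow-up sketch (my own check, reusing the names from the session above) that compares the two up to those ambiguities:

order = np.argsort(eigens)[::-1]                        # sort my eigenpairs by eigenvalue, descending
my_components = mypca[:, order].T                       # one eigenvector per row, like pca.components_
signs = np.sign(np.sum(my_components * pc, axis=1))     # per-row sign alignment
print(np.allclose(my_components * signs[:, None], pc))  # should print True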