TruncatedSVD's explained variance ratio is not in descending order, unlike sklearn's PCA. I looked at the source code, and it seems the two classes calculate the explained variance ratio in different ways:
TruncatedSVD:
U, Sigma, VT = randomized_svd(X, self.n_components,
                              n_iter=self.n_iter,
                              random_state=random_state)
X_transformed = np.dot(U, np.diag(Sigma))
self.explained_variance_ = exp_var = np.var(X_transformed, axis=0)
if sp.issparse(X):
    _, full_var = mean_variance_axis(X, axis=0)
    full_var = full_var.sum()
else:
    full_var = np.var(X, axis=0).sum()
self.explained_variance_ratio_ = exp_var / full_var
PCA:
U, S, V = linalg.svd(X, full_matrices=False)
explained_variance_ = (S ** 2) / n_samples
explained_variance_ratio_ = (explained_variance_ /
                             explained_variance_.sum())
PCA uses sigma directly to calculate explained_variance_, and since sigma is in descending order, explained_variance_ is also in descending order. TruncatedSVD, on the other hand, uses the variance of the columns of the transformed matrix to calculate explained_variance_, so the variances are not necessarily in descending order.
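To make the difference concrete, here is a minimal NumPy-only sketch that applies the two formulas quoted above to a small toy matrix. The data is made up for illustration, and np.linalg.svd stands in for randomized_svd, so this is not the library code itself. Because the TruncatedSVD-style path does not center X, the first component mostly captures the constant offset and ends up with near-zero variance.

```python
import numpy as np

# Toy data (hypothetical): a large constant offset in column 0, spread in column 1.
X = np.array([[10.0, 0.0],
              [10.0, 1.0],
              [10.0, -1.0],
              [10.0, 2.0],
              [10.0, -2.0]])
n_samples = X.shape[0]

# TruncatedSVD-style: SVD of the *uncentered* data, then the column variances
# of the transformed matrix divided by the total variance of X.
U, sigma, VT = np.linalg.svd(X, full_matrices=False)
X_transformed = U * sigma  # same result as np.dot(U, np.diag(sigma))
tsvd_ratio = np.var(X_transformed, axis=0) / np.var(X, axis=0).sum()

# PCA-style: center first, then use the squared singular values directly.
Xc = X - X.mean(axis=0)
_, S, _ = np.linalg.svd(Xc, full_matrices=False)
explained_variance = (S ** 2) / n_samples
pca_ratio = explained_variance / explained_variance.sum()

print("singular values (uncentered):", sigma)
print("TruncatedSVD-style ratio:", tsvd_ratio)
print("PCA-style ratio:", pca_ratio)
```

On this toy input the singular values are still in descending order, but the TruncatedSVD-style ratio comes out as roughly [0, 1], while the PCA-style ratio is roughly [1, 0].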
Does this mean that I need to sort explained_variance_ratio_ from TruncatedSVD first in order to find the top k principal components?
You don't have to sort explained_variance_ratio_. The components are already returned in order of decreasing singular value, and the output contains only n_components values. From the documentation:
X_transformed contains the decomposition using only k components.
The example in the documentation should give you an idea.
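If you would still rather rank components by their explained variance ratio than by their position, an argsort is enough. The sketch below is only an illustration: the helper name top_k_by_explained_variance is hypothetical, and the arrays are made-up stand-ins for a fitted model's explained_variance_ratio_ and components_ attributes.

```python
import numpy as np

def top_k_by_explained_variance(ratio, components, k):
    """Return indices, ratios and component rows for the k largest ratios."""
    ratio = np.asarray(ratio)
    components = np.asarray(components)
    order = np.argsort(ratio)[::-1][:k]  # indices, largest ratio first
    return order, ratio[order], components[order]

# Hypothetical values standing in for a fitted model's attributes.
ratio = np.array([0.05, 0.60, 0.35])
components = np.eye(3)

order, top_ratio, top_components = top_k_by_explained_variance(ratio, components, k=2)
print(order)  # [1 2]
```

With a fitted TruncatedSVD instance, you would pass its explained_variance_ratio_ and components_ in place of the stand-in arrays.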