scikit-learn PLSSVD and cross-validation

Posted 2019-07-26 11:19

Question:

The sklearn.cross_decomposition.PLSSVD class in scikit-learn appears to fail when the response variable has shape (N,) instead of (N, 1), where N is the number of samples in the dataset.

However, sklearn.cross_validation.cross_val_score fails when the response variable has a shape of (N,1) instead of (N,). How can I use them together?

A snippet of code:

import numpy as np
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.cross_decomposition import PLSSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# x -> (N, 60) numpy array
# y -> (N, ) numpy array

# These are the classifier 'pieces' I'm using
plssvd = PLSSVD(n_components=5, scale=False)
logistic = LogisticRegression(penalty='l2', C=0.5)
scaler = StandardScaler(with_mean=True, with_std=True)

# Here's the pipeline that's failing
plsclf = Pipeline([('scaler', scaler),
                   ('plssvd', plssvd), 
                   ('logistic', logistic)])

# Just to show how I'm using the pipeline for a working classifier
logclf = Pipeline([('scaler', scaler),
                   ('logistic', logistic)])

##################################################################

# This works fine
log_scores = cross_validation.cross_val_score(logclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)

# This fails!
pls_scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)

Specifically, it fails in the _center_scale_xy function of cross_decomposition/pls_.pyc with 'IndexError: tuple index out of range' at line 103: y_std = np.ones(Y.shape[1]), because the shape tuple has only one element.

If I set scale=True in the PLSSVD constructor, it fails in the same function at line 99: y_std[y_std == 0.0] = 1.0, because it is attempting to do a boolean index on a float (y_std is a float, since it only has one dimension).

This seems like an easy fix: just make sure the y variable has two dimensions, (N, 1). However, if I create an array with dimensions (N, 1) out of the output variable y, it still fails. To reshape the array, I add this before running cross_val_score:

y = np.transpose(np.array([y]))

Then, it fails in sklearn/cross_validation.py at line 398:

File "my_secret_script.py", line 293, in model_create
    scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy', verbose=True, cv=5, n_jobs=4)
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1129, in cross_val_score
    cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1216, in _check_cv
    cv = StratifiedKFold(y, cv, indices=needs_indices)
File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 398, in __init__
    label_test_folds = test_folds[y == label]
ValueError: boolean index array should have 1 dimension

I'm running this on OS X, with NumPy version 1.8.0 and scikit-learn version 0.15-git.

Is there any way to use PLSSVD together with cross_val_score?

Answer 1:

Partial least squares projects both your data X and your target Y onto linear subspaces spanned by n_components vectors each. They are projected in a way that regression scores of one projected variable on the other are maximized.

The number of components, i.e. the dimension of the latent subspaces, is bounded by the number of features in your variables. Your variable Y has only one feature (one column), so the latent subspace is one-dimensional, which effectively reduces your construction to something more akin to (but not exactly the same as) linear regression. So using partial least squares in this specific situation is probably not useful.
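A minimal NumPy sketch of why a single target column leaves room for only one component. It assumes, as the scikit-learn documentation describes for PLSSVD, that the method amounts to an SVD of the cross-covariance matrix between the centered X and Y; the variable names here are only illustrative.

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_features_x = 20, 10
X = rng.randn(n_samples, n_features_x)
y = rng.randn(n_samples, 1)  # a single target column

# Center both blocks, as PLS-type methods do before decomposing.
Xc = X - X.mean(axis=0)
yc = y - y.mean(axis=0)

# Cross-covariance (up to a constant factor) between features and target.
C = np.dot(Xc.T, yc)

# With one target column, C has a single column, so the SVD can yield
# at most one singular value and one pair of singular vectors.
singular_values = np.linalg.svd(C, compute_uv=False)

print(C.shape)
print(singular_values.shape)
```

The number of singular values equals the smaller dimension of C, which here is the number of target columns. That is the bound on n_components mentioned above.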

Take a look at the following:

import numpy as np
rng = np.random.RandomState(42)
n_samples, n_features_x, n_features_y, n_components = 20, 10, 1, 1
X = rng.randn(n_samples, n_features_x)
y = rng.randn(n_samples, n_features_y)

from sklearn.cross_decomposition import PLSSVD
plssvd = PLSSVD(n_components=n_components)

X_transformed, Y_transformed = plssvd.fit_transform(X, y)

X_transformed and Y_transformed are arrays of shape (n_samples, n_components); they are the projected versions of X and Y.

The answer to your question about using PLSSVD within a Pipeline in cross_val_score is no, it will not work out of the box. The Pipeline object calls fit and transform with both X and Y as arguments where possible, which, as you can see in the code above, returns a tuple containing the projected X and Y values. The next step in the pipeline cannot process this, because it treats the tuple as the new X.

This type of failure is due to the fact that sklearn is only beginning to be systematic about multiple-target support. The PLSSVD estimator you are trying to use is inherently multi-target, even if you are only using it on one target.

Solution: Don't use partial least squares on 1D targets. There would be no gain from it even if it worked with the pipeline.