The sklearn.cross_decomposition.PLSSVD class in scikit-learn appears to fail when the response variable has shape (N,) instead of (N,1), where N is the number of samples in the dataset.
However, sklearn.cross_validation.cross_val_score fails when the response variable has shape (N,1) instead of (N,). How can I use them together?
A snippet of code:
import numpy as np
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.cross_decomposition import PLSSVD
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# x -> (N, 60) numpy array
# y -> (N, ) numpy array

# These are the classifier 'pieces' I'm using
plssvd = PLSSVD(n_components=5, scale=False)
logistic = LogisticRegression(penalty='l2', C=0.5)
scaler = StandardScaler(with_mean=True, with_std=True)

# Here's the pipeline that's failing
plsclf = Pipeline([('scaler', scaler),
                   ('plssvd', plssvd),
                   ('logistic', logistic)])

# Just to show how I'm using the pipeline for a working classifier
logclf = Pipeline([('scaler', scaler),
                   ('logistic', logistic)])

##################################################################

# This works fine
log_scores = cross_validation.cross_val_score(logclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)

# This fails!
pls_scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy',
                                              verbose=True, cv=5, n_jobs=4)
Specifically, it fails in the _center_scale_xy function of cross_decomposition/pls_.pyc with 'IndexError: tuple index out of range' at line 103: y_std = np.ones(Y.shape[1]), because the shape tuple has only one element.
If I set scale=True in the PLSSVD constructor, it fails in the same function at line 99: y_std[y_std == 0.0] = 1.0, because it is attempting to do a boolean index on a float (y_std is a float, since Y has only one dimension).
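To make the two failures concrete, here is a minimal sketch in plain NumPy. It mimics the two quoted lines rather than calling scikit-learn itself; the array Y and its values are made up for illustration, and the exact exception type in the second case may differ between NumPy versions.

```python
import numpy as np

# Hypothetical 1-D response with shape (N,), here N = 5
Y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

# scale=False path: Y.shape is (5,), so Y.shape[1] does not exist
try:
    y_std = np.ones(Y.shape[1])
    shape_error = None
except IndexError as exc:
    shape_error = exc

# scale=True path: the std of a 1-D array along axis 0 is a scalar,
# and a NumPy scalar does not support assignment by index
y_std = Y.std(axis=0, ddof=1)
try:
    y_std[y_std == 0.0] = 1.0
    scalar_error = None
except (TypeError, IndexError, ValueError) as exc:
    scalar_error = exc

# With a 2-D response of shape (N, 1), both lines run without error
Y2 = Y.reshape(-1, 1)
y_std_2d = Y2.std(axis=0, ddof=1)
y_std_2d[y_std_2d == 0.0] = 1.0
```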
Seems like an easy fix: just make sure the y variable has two dimensions, (N,1). However:
If I create an array with dimensions (N,1) out of the output variable y, it still fails. To change the array, I add this before running cross_val_score:
y = np.transpose(np.array([y]))
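For reference, a minimal sketch (with a made-up label vector) showing that this gives the same (N, 1) array as the more common reshape(-1, 1), and that ravel() converts it back to (N,):

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1])          # hypothetical labels, shape (5,)

y_col = np.transpose(np.array([y]))    # approach used above, shape (5, 1)
y_col_alt = y.reshape(-1, 1)           # same result, more idiomatic
y_flat = y_col.ravel()                 # back to shape (5,)
```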
Then, it fails in sklearn/cross_validation.py at line 398:
  File "my_secret_script.py", line 293, in model_create
    scores = cross_validation.cross_val_score(plsclf, x, y, scoring='accuracy', verbose=True, cv=5, n_jobs=4)
  File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1129, in cross_val_score
    cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
  File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1216, in _check_cv
    cv = StratifiedKFold(y, cv, indices=needs_indices)
  File "/Users/my.secret.name/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 398, in __init__
    label_test_folds = test_folds[y == label]
ValueError: boolean index array should have 1 dimension
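The last frame can be reproduced without scikit-learn. A minimal sketch, assuming a 1-D test_folds array and a column-vector y as in the traceback (the values are made up): comparing an (N, 1) array with a label yields an (N, 1) boolean mask, which cannot index a 1-D array. Newer NumPy versions may raise IndexError here rather than ValueError.

```python
import numpy as np

y = np.array([[0], [1], [1], [0]])     # hypothetical labels, shape (4, 1)
test_folds = np.zeros(4, dtype=int)    # 1-D, one entry per sample

mask = y == 1                          # shape (4, 1), not (4,)
try:
    test_folds[mask]
    mask_error = None
except (IndexError, ValueError) as exc:
    mask_error = exc

# Flattening y gives a 1-D mask that indexes test_folds as intended
selected = test_folds[y.ravel() == 1]
```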
I'm running this on OS X, NumPy version 1.8.0, scikit-learn version 0.15-git.
Any way to use PLSSVD together with cross_val_score?
Partial least squares projects both your data X and your target Y onto linear subspaces spanned by n_components vectors each. They are projected in such a way that the regression scores of one projected variable on the other are maximized.
The number of components, i.e. the dimension of the latent subspaces, is bounded by the number of features in your variables. Your variable Y only has one feature (one column), so the latent subspace is one-dimensional, effectively reducing your construction to something more akin to (but not exactly the same as) linear regression. So using partial least squares in this specific situation is probably not useful.
Take a look at the following:
X_transformed and Y_transformed are arrays of shape (n_samples, n_components); they are the projected versions of X and Y.
The answer to your question about using PLSSVD within a Pipeline in cross_val_score is no, it will not work out of the box, because the Pipeline object calls fit and transform using both variables X and Y as arguments if possible, which, as you can see in the code I wrote, returns a tuple containing the projected X and Y values. The next step in the pipeline will not be able to process this, because it will think that this tuple is the new X.
This type of failure is due to the fact that sklearn is only beginning to be systematic about multiple target support. The PLSSVD estimator you are trying to use is inherently multi-target, even if you are only using it on one target.
Solution: Don't use partial least squares on 1D targets; there would be no gain from it even if it worked with the pipeline.