I want to build a classifier that uses cross-validation, and then extract the important features (coefficients) from each fold so I can look at their stability. At the moment I am using cross_validate and a pipeline. I want to use a pipeline so that I can do feature selection and standardization within each fold. I'm stuck on how to extract the features from each fold. I've included an alternative to the pipeline further down, in case the pipeline is the problem.
This is my code so far (I want to try SVM and logistic regression). I've included a small df as an example:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import pandas as pd
df = pd.DataFrame({'length': [5, 8, 0.2, 10, 25, 3.2],
                   'width': [60, 102, 80.5, 30, 52, 81],
                   'group': [1, 0, 0, 0, 1, 1]})
array = df.values
y = array[:,2]
X = array[:,0:2]
select = SelectKBest(mutual_info_classif, k=2)
scl = StandardScaler()
svm = SVC(kernel='linear', probability=True, random_state=42)
logr = LogisticRegression(random_state=42)
pipeline = Pipeline([('select', select), ('scale', scl), ('svm', svm)])
split = KFold(n_splits=2, shuffle=True, random_state=42)
output = cross_validate(pipeline, X, y, cv=split,
                        scoring=('accuracy', 'f1', 'roc_auc'),
                        return_estimator=True,
                        return_train_score=True)
I thought I could do something like:
pipeline.named_steps['svm'].coef_
but I get the error message:
AttributeError: 'SVC' object has no attribute 'dual_coef_'
If it's not possible to do this using a pipeline, could I do the cross-validation 'by hand'? e.g.:
for train_index, test_index in split.split(X, y):
    kfoldtx = [X[i] for i in train_index]
    kfoldty = [y[i] for i in train_index]
But I'm not sure what to do next! Any help would be very appreciated.
You should use the output of cross_validate to get the parameters of the fitted model. The reason is that cross_validate clones the pipeline, so the pipeline variable you passed in is still unfitted after the call, which is why accessing coef_ on it raises an AttributeError. output is a dictionary that has estimator as one of its keys, and its value holds one fitted pipeline object per fold. From the documentation: the estimator entry contains the estimator objects for each CV split and is only available when return_estimator is set to True.
Try this!
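A minimal sketch of the idea, under a few assumptions that are mine rather than the question's: the data is synthetic (generated with make_classification, because six rows are too few for stable folds), the feature names f0 to f3 are hypothetical, mutual_info_classif is given a fixed random_state so the selection is repeatable, and only accuracy is scored to keep the example short. It loops over output['estimator'], uses get_support() on the selection step to see which columns survived in that fold, and pairs them with the linear SVM coefficients.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 60 samples, 4 features, binary target.
X, y = make_classification(n_samples=60, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
feature_names = np.array(['f0', 'f1', 'f2', 'f3'])  # hypothetical names

pipeline = Pipeline([
    # Fix the random_state so mutual information scores are repeatable.
    ('select', SelectKBest(partial(mutual_info_classif, random_state=0), k=2)),
    ('scale', StandardScaler()),
    ('svm', SVC(kernel='linear', random_state=42)),
])
split = KFold(n_splits=3, shuffle=True, random_state=42)

output = cross_validate(pipeline, X, y, cv=split, scoring='accuracy',
                        return_estimator=True)

# One fitted pipeline per fold; collect {feature name: coefficient} for each.
fold_coefs = []
for fold, fitted in enumerate(output['estimator']):
    mask = fitted.named_steps['select'].get_support()   # boolean mask of kept columns
    coefs = fitted.named_steps['svm'].coef_.ravel()     # shape (1, k) for a binary linear SVM
    fold_coefs.append(dict(zip(feature_names[mask], coefs)))
    print(fold, fold_coefs[-1])

# The pipeline passed to cross_validate was cloned, so it is still unfitted.
print(hasattr(pipeline.named_steps['svm'], 'coef_'))
```

Because the selection step runs inside each fold, different folds may keep different features, so comparing the keys of the per-fold dictionaries is itself a rough stability check. Swapping the last step for LogisticRegression should work the same way, since it also exposes coef_ after fitting.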