In my classification scheme, there are several steps:
1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criterion for feature selection
3. Standardization (Z-score normalisation)
4. SVC (Support Vector Classifier)
The main parameters to be tuned in the scheme above are the percentile (step 2) and the hyperparameters of the SVC (step 4), and I want to tune them via grid search.
The current solution builds a "partial" pipeline covering steps 3 and 4 of the scheme,

clf = Pipeline([('normal', preprocessing.StandardScaler()),
                ('svc', svm.SVC(class_weight='auto'))])

and breaks the scheme into two parts:
1) Tune the percentile of features to keep with a first grid search:
skf = StratifiedKFold(y)   # pre-0.18 cross_validation API: the folds are iterable directly
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes from the training data only (we want to keep the test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the features selected by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile)
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)   # f1_score expects (y_true, y_pred)
The f1 scores are stored and then averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The purpose of putting the percentile loop inside the fold loop is to allow a fair competition: within each fold partition, every percentile is evaluated on the same training data (including the synthesized samples).
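As a minimal sketch of that bookkeeping (the candidate percentiles, the fold count, and the scores array are assumptions about code not shown in the question):

import numpy as np

percentiles = [10, 20, 30, 40, 50]        # candidate percentiles (assumed values)
n_folds = 3                                # StratifiedKFold's default fold count

# scores[i, j] holds the f1 of fold partition i at percentiles[j],
# filled by 'scores[i, j] = f1' inside the loops above
scores = np.zeros((n_folds, len(percentiles)))

mean_f1 = scores.mean(axis=0)                           # average each percentile over folds
best_percentile = percentiles[int(np.argmax(mean_f1))]  # winner of the first grid search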
2) After determining the percentile, tune the hyperparameters with a second grid search:
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes from the training data only (we want to keep the test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    # Select the features based on the tuned percentile (this is independent of C and gamma,
    # so it can be done once per fold rather than once per parameter combination)
    selected_ind = Fisher(X_train, y_train, best_percentile)
    X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
    for parameters in parameter_comb:
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_test, y_predict)   # f1_score expects (y_true, y_pred)
This is done in a very similar way, except that we tune the hyperparameters of the SVC rather than the percentile of features to select.
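For completeness, parameter_comb is not defined in the question; it could be built with sklearn's ParameterGrid (the grid values below are assumptions):

from sklearn.grid_search import ParameterGrid   # moved to sklearn.model_selection in 0.18

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1]}
parameter_comb = list(ParameterGrid(param_grid))  # list of {'C': ..., 'gamma': ...} dicts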
My questions are:
I) In the current solution, I only involve steps 3 and 4 in clf, and do steps 1 and 2 "manually" in the two nested loops described above. Is there any way to include all four steps in a single pipeline and do the whole process at once?
II) If it is okay to keep the first nested loop, is it possible (and how) to simplify the second nested loop with a single pipeline

clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal', preprocessing.StandardScaler()),
                    ('svc', svm.SVC(class_weight='auto'))])

and simply use GridSearchCV(clf_all, parameter_comb) for tuning?
Please note that both SMOTE and Fisher (the ranking criterion) must be applied only to the training data in each fold partition. Any comments would be much appreciated.
EDIT: SMOTE and Fisher are shown below (implemented as smote and Fscore):
import numpy as np

def Fscore(X, y, percentile=None):
    # Fisher criterion: between-class scatter over within-class scatter, per feature
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(np.shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(np.shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    sort_F = np.argsort(F)[::-1]                       # feature indices, highest score first
    n_feature = (float(percentile)/100)*np.shape(X)[1]
    ind_feature = sort_F[:int(np.ceil(n_feature))]     # slice bound must be an int
    return ind_feature
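For example, on toy data (the shapes and the 20% figure are arbitrary):

X = np.random.randn(100, 50)                        # 100 samples, 50 features
y = np.concatenate([np.ones(20), np.zeros(80)])     # imbalanced binary labels
top_ind = Fscore(X, y, percentile=20)               # indices of the top 20% of features
X_selected = X[:, top_ind]                          # shape (100, 10)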
SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with the original labels and the synthesized ones.
def smote(X, y):
    n_pos, n_neg = sum(y==1), sum(y==0)
    n_syn = (n_neg-n_pos)/float(n_pos)                 # synthetic samples needed per minority sample
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)     # N is a percentage in the linked module
    y_syn = np.ones(np.shape(X_syn)[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return X, y
scikit-learn added a FunctionTransformer to the preprocessing module in version 0.17. It can be used in a similar manner to David's implementation of the Fisher class in the answer above, but with less flexibility. If the input/output of the function is configured properly, the transformer can implement the fit/transform/fit_transform methods for the function and thus allow it to be used in a scikit-learn pipeline.
For example, if the input to a pipeline is a series, the transformer wraps the preprocessing function and sits at the head of the pipeline, where vect is a tf-idf transformer, clf is a classifier, and train is the training dataset ("train.desc" being the series of text input to the pipeline).
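The original snippet did not survive the edit; the following is a minimal sketch under those names, where clean_text, the TfidfVectorizer/LogisticRegression stand-ins, and the toy train frame are my assumptions:

import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def clean_text(s):
    # hypothetical preprocessing step applied to the text Series
    return s.str.lower()

vect = TfidfVectorizer()      # the tf-idf transformer from the answer
clf = LogisticRegression()    # stand-in classifier (the answer does not specify one)

pipeline = Pipeline([
    ('clean', FunctionTransformer(clean_text, validate=False)),  # validate=False lets a Series through
    ('vect', vect),
    ('clf', clf)
])

# 'train' is assumed to be a DataFrame with a text column 'desc' and labels 'label'
train = pd.DataFrame({'desc': ['Spam offer now', 'Meeting at noon'], 'label': [1, 0]})
pipeline.fit(train.desc, train.label)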
I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. In order to do so you will need to write a wrapper class around those functions, though. The easiest way to do this is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
If this isn't making sense to you, post the details of at least one of your functions (the library it comes from, or your code if you wrote it yourself) and we can go from there.
EDIT:
I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do them beforehand as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.
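A minimal sketch of such a wrapper, reusing the Fscore function defined in the question (the class name is mine):

from sklearn.base import BaseEstimator, TransformerMixin

class FisherSelector(BaseEstimator, TransformerMixin):
    """Pipeline-compatible wrapper around the Fisher ranking (Fscore above)."""
    def __init__(self, percentile=10):
        self.percentile = percentile

    def fit(self, X, y):
        # learn the feature indices to keep from the training fold only
        self.selected_ind_ = Fscore(X, y, percentile=self.percentile)
        return self

    def transform(self, X):
        # apply the learned selection; y is untouched, which is why this works
        return X[:, self.selected_ind_]

With that in place, a step like ('fisher', FisherSelector(percentile=best_percentile)) could replace the manual selection inside the fold loop, while the SMOTE step would still have to be applied to the training data beforehand.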