Nested cross-validation in grid search for precomp

Posted 2019-05-10 17:17

Question:

I have a precomputed kernel of size NxN. I am using GridSearchCV to tune the C parameter of an SVM with kernel='precomputed' as follows:

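import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# kernel: precomputed NxN Gram matrix; data_label: the N class labels
C_range = 10. ** np.arange(-2, 9)
param_grid = dict(C=C_range)
grid = GridSearchCV(SVC(kernel='precomputed'), param_grid=param_grid, cv=StratifiedKFold(n_splits=10))
grid.fit(kernel, data_label)
print(grid.best_score_)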

This works fine. However, because I use the full data for prediction (with grid.predict(kernel)), it overfits (I get precision/recall = 1.0 most of the time).

So I would like to first split my data into 10 chunks (9 for training, 1 for testing) with cross-validation. In each fold, I want to run the grid search to tune the C value on the training set, then test on the testing set.
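The procedure described above is usually called nested cross-validation. Below is a minimal sketch of one way it might be set up, assuming a recent scikit-learn where GridSearchCV and cross_val_score live in sklearn.model_selection. The synthetic data, the linear kernel, the fold counts and the narrower C grid are all assumptions made so the example runs quickly; they are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 150 samples with a linear kernel.
X, y = make_classification(n_samples=150, n_features=20, random_state=0)
kernel = X @ X.T  # full NxN Gram matrix

# Inner folds tune C; outer folds estimate performance on held-out chunks.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Narrower grid than in the question (assumption), to keep the run short.
param_grid = {"C": 10.0 ** np.arange(-2, 3)}
grid = GridSearchCV(SVC(kernel="precomputed"), param_grid=param_grid, cv=inner_cv)

# Each outer fold refits the whole grid search on 9 chunks and scores on the 10th.
outer_scores = cross_val_score(grid, kernel, y, cv=outer_cv)
print(outer_scores.mean())
```

Because SVC(kernel="precomputed") is treated as a pairwise estimator, the splitting utilities index both rows and columns of the kernel, so no manual slicing should be needed in this setup.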

To do this, I sliced the kernel matrix into 100x100 and 50x50 submatrices, running grid.fit() on the 100x100 one and grid.predict() on the 50x50 one.

But I get the following error:

ValueError: X.shape[1] = 50 should be equal to 100, the number of features at training time

I guess the training kernel should have the same dimension as the testing kernel, but I don't understand why: I simply compute np.dot(X, X.T) separately for the 100x100 and the 50x50 case, so the final kernels have different dimensions.

Answer 1:

The scikit-learn documentation says:

Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel values between all training vectors and the test vectors must be provided.

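So I guess that it's not possible to do (simple) cross-validation with precomputed kernels by slicing out independent square blocks: the matrix passed to predict has to hold the kernel values between the test vectors (rows) and the training vectors (columns).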
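To illustrate what the quoted passage asks for, here is a minimal sketch of a manual train/test split on a precomputed linear kernel. The random data, the labels, the 100/50 split and C=1.0 are hypothetical and only mirror the sizes in the question. The point is the shape of the test kernel: (n_test, n_train), not (n_test, n_test).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(150, 5)               # hypothetical feature matrix
y = (X[:, 0] > 0).astype(int)       # hypothetical labels
kernel = X @ X.T                    # full 150x150 linear kernel

train_idx = np.arange(100)
test_idx = np.arange(100, 150)

# Training kernel: train rows x train columns.
K_train = kernel[np.ix_(train_idx, train_idx)]   # shape (100, 100)
# Test kernel: test rows x TRAIN columns, i.e. X_test @ X_train.T.
K_test = kernel[np.ix_(test_idx, train_idx)]     # shape (50, 100)

clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y[train_idx])
pred = clf.predict(K_test)
print(K_train.shape, K_test.shape, pred.shape)
```

A 50x50 block computed only among the test points has 50 columns instead of 100, which appears to be what triggers the ValueError shown in the question.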