I have a precomputed kernel matrix of size NxN. I am using GridSearchCV to tune the C parameter of an SVM with kernel='precomputed' as follows:
import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
# kernel is the NxN precomputed Gram matrix, data_label holds the N labels
C_range = 10. ** np.arange(-2, 9)
param_grid = dict(C=C_range)
grid = GridSearchCV(SVC(kernel='precomputed'), param_grid=param_grid, cv=StratifiedKFold(y=data_label, n_folds=10))
grid.fit(kernel, data_label)
print grid.best_score_
This works fine. However, since I use the full data for prediction (with grid.predict(kernel)), it overfits: I get precision/recall of 1.0 most of the time.
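For completeness, this is roughly how I evaluate on the full data (classification_report is just one convenient way to print precision and recall together):

from sklearn.metrics import classification_report
# predicting on the very same NxN kernel the grid was fitted on
pred = grid.predict(kernel)
print classification_report(data_label, pred)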
So I would like to first split my data into 10 chunks (9 for training, 1 for testing) using cross-validation, and within each fold run GridSearchCV to tune the C value on the training set and then test on the held-out test set.
In order to do this, I sliced the kernel matrix into a 100x100 and a 50x50 submatrix, running grid.fit() on one of them and grid.predict() on the other, roughly as sketched below.
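Concretely, the slicing looks roughly like this (the contiguous 100/50 split and the name X for the underlying feature matrix are just for illustration):

# 100 samples for training, the remaining 50 for testing (illustrative split)
X_train, X_test = X[:100], X[100:]
y_train, y_test = data_label[:100], data_label[100:]
# linear kernels computed separately for each chunk
kernel_train = np.dot(X_train, X_train.T)  # shape (100, 100)
kernel_test = np.dot(X_test, X_test.T)     # shape (50, 50)
grid.fit(kernel_train, y_train)
grid.predict(kernel_test)                  # raises the ValueError below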
But I get the following error:
ValueError: X.shape[1] = 50 should be equal to 100, the number of features at training time
I guess the training kernel should have the same dimensions as the testing kernel, but I don't understand why: I simply compute np.dot(X, X.T) for the 100-sample chunk and for the 50-sample chunk, so the resulting kernels naturally have different dimensions.
The scikit-learn docs say:
So I guess that it's not possible to do (simple) cross-validation with precomputed kernels.