Retraining after Cross Validation with libsvm

2019-01-07 06:48发布

问题:

I know that Cross validation is used for selecting good parameters. After finding them, i need to re-train the whole data without the -v option.

But the problem i face is that after i train with -v option, i get the cross-validation accuracy( e.g 85%). There is no model and i can't see the values of C and gamma. In that case how do i retrain?

Btw i applying 10 fold cross validation. e.g

optimization finished, #iter = 138
nu = 0.612233
obj = -90.291046, rho = -0.367013
nSV = 165, nBSV = 128
Total nSV = 165
Cross Validation Accuracy = 98.1273%

Need some help on it..

To get the best C and gamma, i use this code that is available in the LIBSVM FAQ

bestcv = 0;
for log2c = -6:10,
  for log2g = -6:3,
    cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
    cv = svmtrain(TrainLabel,TrainVec, cmd);
    if (cv >= bestcv),
      bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
    end
    fprintf('(best c=%g, g=%g, rate=%g)\n',bestc, bestg, bestcv);
  end
end

Another question : Is that cross-validation accuracy after using -v option similar to that we get when we train without -v option and use that model to predict? are the two accuracy similar?

Another question : Cross-validation basically improves the accuracy of the model by avoiding the overfitting. So, it needs to have a model in place before it can improve. Am i right? Besides that, if i have a different model, then the cross-validation accuracy will be different? Am i right?

One more question: In the cross-validation accuracy, what is the value of C and gamma then?

The graph is something like this

Then the values of C are 2 and gamma = 0.0078125. But when i retrain the model with the new parameters. The value is not the same as 99.63%. Could there be any reason? Thanks in advance...

回答1:

The -v option here is really meant to be used as a way to avoid the overfitting problem (instead of using the whole data for training, perform an N-fold cross-validation training on N-1 folds and testing on the remaining fold, one at-a-time, then report the average accuracy). Thus it only returns the cross-validation accuracy (assuming you have a classification problem, otherwise mean-squared error for regression) as a scalar number instead of an actual SVM model.

If you want to perform model selection, you have to implement a grid search using cross-validation (similar to the grid.py helper python script), to find the best values of C and gamma.

This shouldn't be hard to implement: create a grid of values using MESHGRID, iterate overall all pairs (C,gamma) training an SVM model with say 5-fold cross-validation, and choosing the values with the best CV-accuracy...

Example:

%# read some training data
[labels,data] = libsvmread('./heart_scale');

%# grid of parameters
folds = 5;
[C,gamma] = meshgrid(-5:2:15, -15:2:3);

%# grid search, and cross-validation
cv_acc = zeros(numel(C),1);
for i=1:numel(C)
    cv_acc(i) = svmtrain(labels, data, ...
                    sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end

%# pair (C,gamma) with best accuracy
[~,idx] = max(cv_acc);

%# contour plot of paramter selection
contour(C, gamma, reshape(cv_acc,size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%',cv_acc(idx)), ...
    'HorizontalAlign','left', 'VerticalAlign','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')

%# now you can train you model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
%# ...



回答2:

If you use your entire dataset to determine your parameters, then train on that dataset, you are going to overfit your data. Ideally, you would divide the dataset, do the parameter search on a portion (with CV), then use the other portion to train and test with CV. Will you get better results if you use the whole dataset for both? Of course, but your model is likely to not generalize well. If you want determine true performance of your model, you need to do parameter selection separately.