Question:
I want to evaluate a regression model built with scikit-learn using cross-validation, and I am getting confused about which of the two functions, cross_val_score or cross_val_predict, I should use.
One option would be:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

cvs = DecisionTreeRegressor(max_depth=depth)
scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Another one, using the CV predictions with the standard r2_score:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

cvp = DecisionTreeRegressor(max_depth=depth)
predictions = cross_val_predict(cvp, predictors, target, cv=cvfolds)
print("CV R^2-Score: {}".format(r2_score(target, predictions)))
I would assume that both methods are valid and give similar results. But that is only the case for a small number of folds. While the R^2 is roughly the same for 10-fold CV, it gets increasingly lower for higher k values in the first version using cross_val_score. The second version is mostly unaffected by the changing number of folds.
Is this behavior to be expected, or do I lack some understanding of CV in scikit-learn?
Answer 1:
cross_val_score returns the score of each test fold, whereas cross_val_predict returns the predicted y values for the test folds.
For cross_val_score(), you are using the average of the per-fold output, which will be affected by the number of folds, because there may be some folds with a high error (where the model did not fit well).
cross_val_predict(), on the other hand, returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. (Note that only cross-validation strategies that assign all elements to a test set exactly once can be used.) So increasing the number of folds only increases the training data available for each test element, and hence its result may not be affected much.
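To make the contrast concrete, here is a minimal sketch of what the two functions do internally, assuming synthetic data and a plain KFold split (none of these names come from the question itself): cross_val_score keeps one score per fold, while cross_val_predict fills in exactly one out-of-fold prediction per sample.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = rng.random(100)

fold_scores = []              # what cross_val_score collects
preds = np.empty_like(y)      # what cross_val_predict fills in
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = DecisionTreeRegressor(max_depth=3).fit(X[train_idx], y[train_idx])
    fold_preds = model.predict(X[test_idx])
    fold_scores.append(r2_score(y[test_idx], fold_preds))
    preds[test_idx] = fold_preds

print(np.mean(fold_scores))   # what averaging cross_val_score output gives
print(r2_score(y, preds))     # r2_score on the cross_val_predict output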
Hope this helps. Feel free to ask if anything is unclear.
Edit: answering the question in the comments.
Please have a look at the following answer on how cross_val_predict works: https://stackoverflow.com/a/41524968/3374996
I think that cross_val_predict can overfit because, as the folds increase, more data goes to training and less to testing, so the resulting predictions depend more heavily on the training data. Also, as mentioned above, the prediction for each sample is made only once, so it may be more susceptible to how the data happens to be split.
That's why most places and tutorials recommend using cross_val_score for analysis.
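To see that susceptibility concretely, here is a hedged sketch (the synthetic data, model, and seeds are my own assumptions): reshuffling the folds changes each sample's single out-of-fold prediction, and with it the pooled score.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import r2_score

# Synthetic regression data, assumed for illustration
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Each reshuffle changes which fold a sample lands in, and therefore
# its single out-of-fold prediction and the pooled score.
for seed in (0, 1, 2):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    preds = cross_val_predict(DecisionTreeRegressor(max_depth=3), X, y, cv=cv)
    print(seed, r2_score(y, preds))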
Answer 2:
I think the difference can be made clear by inspecting their outputs. Consider this snippet:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

# Last column is the label
print(X.shape)  # (7040, 133)

clf = MLPClassifier()
scores = cross_val_score(clf, X[:, :-1], X[:, -1], cv=5)
print(scores.shape)  # (5,)

y_pred = cross_val_predict(clf, X[:, :-1], X[:, -1], cv=5)
print(y_pred.shape)  # (7040,)
Notice the shapes: why are they like this?
scores.shape has length 5 because it is a score computed with cross-validation over 5 folds (see the argument cv=5). Therefore, a single real value is computed for each fold. That value is the score of the classifier: given the true labels and the predicted labels, how many of the predictor's answers were right in a particular fold?
In this case, the y labels given as input are used twice: to learn from the data and to evaluate the performance of the classifier.
On the other hand, y_pred.shape has length 7040, which is the length of the input dataset. This means that each value is not a score computed on multiple values, but a single value: the prediction of the classifier: given the input data and its labels, what does the classifier predict for a specific example that was in the test set of a particular fold?
Note that you do not know what fold was used: each output was computed on the test data of a certain fold, but you can't tell which (from this output, at least).
In this case, the labels are used just once: to train the classifier. It's your job to compare these outputs to the true outputs to compute the score. If you just average them, as you did, the output is not a score, it's just the average prediction.
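For instance, a short sketch of that last step, reusing X, y_pred, and scores from the snippet above:
from sklearn.metrics import accuracy_score

# Turn the pooled out-of-fold predictions into a score yourself
print(accuracy_score(X[:, -1], y_pred))  # fraction of correct predictions
print(scores.mean())                     # average of per-fold scores, for comparison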
Answer 3:
So this question also bugged me, and while the others made good points, they didn't answer all aspects of OP's question.
The true answer is: the divergence in scores for increasing k is due to the chosen metric, R2 (coefficient of determination). For metrics such as MSE, MSLE, or MAE there won't be any difference between using cross_val_score and cross_val_predict.
See the definition of R2:
R^2 = 1 - (MSE(ground truth, prediction) / MSE(ground truth, mean(ground truth)))
The denominator term, MSE(ground truth, mean(ground truth)), explains why the score starts to differ for increasing k: the more splits we have, the fewer samples there are in each test fold, and the higher the variance in the mean of that test fold.
Conversely, for small k, the mean of the test fold won't differ much from the full ground-truth mean, as the sample size is still large enough to keep that variance small.
Proof:
import numpy as np
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_log_error as msle, r2_score

predictions = np.random.rand(1000) * 100
groundtruth = np.random.rand(1000) * 20

def scores_for_increasing_k(score_func):
    # Pooled score over all samples, as with cross_val_predict
    skewed_score = score_func(groundtruth, predictions)
    print(f'skewed score (from cross_val_predict): {skewed_score}')
    # Per-fold scores averaged, as with cross_val_score
    for k in (2, 4, 5, 10, 20, 50, 100, 200, 250):
        fold_preds = np.split(predictions, k)
        fold_gtruth = np.split(groundtruth, k)
        correct_score = np.mean([score_func(g, p) for g, p in zip(fold_gtruth, fold_preds)])
        print(f'correct CV for k={k}: {correct_score}')

for name, score in [('MAE', mae), ('MSLE', msle), ('R2', r2_score)]:
    print(name)
    scores_for_increasing_k(score)
    print()
Output will be:
MAE
skewed score (from cross_val_predict): 42.25333901481263
correct CV for k=2: 42.25333901481264
correct CV for k=4: 42.25333901481264
correct CV for k=5: 42.25333901481264
correct CV for k=10: 42.25333901481264
correct CV for k=20: 42.25333901481264
correct CV for k=50: 42.25333901481264
correct CV for k=100: 42.25333901481264
correct CV for k=200: 42.25333901481264
correct CV for k=250: 42.25333901481264
MSLE
skewed score (from cross_val_predict): 3.5252449697327175
correct CV for k=2: 3.525244969732718
correct CV for k=4: 3.525244969732718
correct CV for k=5: 3.525244969732718
correct CV for k=10: 3.525244969732718
correct CV for k=20: 3.525244969732718
correct CV for k=50: 3.5252449697327175
correct CV for k=100: 3.5252449697327175
correct CV for k=200: 3.5252449697327175
correct CV for k=250: 3.5252449697327175
R2
skewed score (from cross_val_predict): -74.5910282783694
correct CV for k=2: -74.63582817089443
correct CV for k=4: -74.73848598638291
correct CV for k=5: -75.06145142821893
correct CV for k=10: -75.38967601572112
correct CV for k=20: -77.20560102267272
correct CV for k=50: -81.28604960074824
correct CV for k=100: -95.1061197684949
correct CV for k=200: -144.90258384605787
correct CV for k=250: -210.13375041871123
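The shrinking denominator can also be isolated directly. Here is a hedged sketch reusing the uniform synthetic setup from the proof above (the fixed seed is my own addition): the within-fold spread around the fold's own mean, which is exactly R2's denominator, gets smaller on average as k grows.
import numpy as np

rng = np.random.default_rng(0)
groundtruth = rng.random(1000) * 20  # same range as the proof above

# Per fold, R2's denominator is the mean squared deviation of the fold
# from its OWN mean; its expectation is ((n-1)/n) * var(groundtruth),
# so it shrinks as the folds get smaller (k grows) while the numerator
# stays roughly constant -- hence the increasingly negative R2 scores.
for k in (2, 10, 100, 250):
    folds = np.split(groundtruth, k)
    denom = np.mean([np.mean((f - f.mean()) ** 2) for f in folds])
    print(f'k={k}: average denominator = {denom:.3f}')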
Of course, there is another effect, not shown here, which was mentioned by others: with increasing k, more models are trained on more samples and validated on fewer samples, which will affect the final scores, but this is not induced by the choice between cross_val_score and cross_val_predict.