I'm trying to make a classifier on a data set. I first used XGBoost:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss"}
rounds = 180
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)
And the result is:

     test-logloss-mean  test-logloss-std  train-logloss-mean
0             0.683539          0.000141            0.683407
...
179           0.622302          0.001504            0.606452
We can see the test log loss is around 0.622.
But when I switch to sklearn using the exact same parameters (I think), the result is quite different. Below is my code:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd
train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)
estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))
and the result is: [-4.11429976 -2.08675843 -3.27346662], which even after negating the sign is still far from 0.622.
I set a breakpoint inside cross_val_score and saw that the classifier was making wild predictions: it assigned about 0.99 probability to the negative class for nearly every sample in the test set (the sketch below shows one way to reproduce this without a debugger).
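For reference, the same out-of-fold probabilities can be surfaced without a debugger via cross_val_predict; this is a minimal sketch, assuming the estimator, train_features, and train_labels defined above:

from sklearn.model_selection import cross_val_predict

# Out-of-fold class probabilities from the same default (unshuffled) splitter
# that cross_val_score uses; column 0 is P(negative), column 1 is P(positive).
proba = cross_val_predict(estimator, X=train_features, y=train_labels,
                          method="predict_proba")
print(proba[:10])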
I'm wondering where I have gone wrong. Could someone help me?
This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.
By default, cross_val_score uses KFold or StratifiedKFold, whose shuffle argument is False, so the folds are not drawn randomly from the data. So if you hand cross_val_score a splitter with shuffling enabled, you should get the same results.
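A minimal sketch of that fix, reusing the estimator, train_features, and train_labels from the question:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Shuffle the folds, mirroring xgboost.cv's randomized fold assignment;
# xgb.cv defaults to nfold=3, hence n_splits=3 here.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels,
                      scoring="neg_log_loss", cv=skf))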
Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
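Going one step further, xgboost.cv also accepts a folds argument, so you can hand it the very same splitter and have both libraries evaluate on identical folds; a sketch under the same assumptions as above (params, data_dmat, and rounds from the question, skf from the previous snippet):

# Passing the identical splits makes the fold assignment match exactly;
# folds overrides xgboost's own seed-driven fold construction.
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds,
                folds=skf, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result.tail(1))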