I'm trying to make a classifier on a data set. I first used XGBoost:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss"}
rounds = 180
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)
And the result is:

     test-logloss-mean  test-logloss-std  train-logloss-mean
0             0.683539          0.000141            0.683407
...
179           0.622302          0.001504            0.606452
We can see the test log loss is around 0.622.
But when I switch to sklearn using the exact same parameters (I think), the result is quite different. Below is my code:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd
train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)
estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))
and the result is: [-4.11429976 -2.08675843 -3.27346662], which even after negating the sign is still far from 0.622.
I set a breakpoint inside cross_val_score and saw that the classifier was making wild predictions: it assigned about 0.99 probability to the negative class for nearly every sample in the test set (the sketch below shows one way to reproduce this without a debugger).
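For reference, the same out-of-fold probabilities can be surfaced without a debugger via cross_val_predict; this is a minimal sketch, assuming the estimator, train_features, and train_labels defined above:

from sklearn.model_selection import cross_val_predict

# Out-of-fold class probabilities from the same default (unshuffled) splitter
# that cross_val_score uses; column 0 is P(negative), column 1 is P(positive).
proba = cross_val_predict(estimator, X=train_features, y=train_labels,
                          method="predict_proba")
print(proba[:10])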
I'm wondering where I have gone wrong. Could someone help me?
This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.
By default, cross_val_score uses KFold or StratifiedKFold, whose shuffle argument is False, so the folds are not drawn randomly from the data. So if you hand cross_val_score a splitter with shuffling enabled, you should get the same results.
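A minimal sketch of that fix, reusing the estimator, train_features, and train_labels from the question:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Shuffle the folds, mirroring xgboost.cv's randomized fold assignment;
# xgb.cv defaults to nfold=3, hence n_splits=3 here.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels,
                      scoring="neg_log_loss", cv=skf))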
Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
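Going one step further, xgboost.cv also accepts a folds argument, so you can hand it the very same splitter and have both libraries evaluate on identical folds; a sketch under the same assumptions as above (params, data_dmat, and rounds from the question, skf from the previous snippet):

# Passing the identical splits makes the fold assignment match exactly;
# folds overrides xgboost's own seed-driven fold construction.
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds,
                folds=skf, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result.tail(1))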