I am fairly new to scikit-learn and have been trying to hyperparameter tune XGBoost. My aim is to use grid search with cross-validation to tune the model parameters, and early stopping to control the number of trees and avoid overfitting.
Since I am already cross-validating for the grid search, I was hoping to use each fold's held-out data in the early stopping criteria as well. The code I have so far looks like this:
import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost as xgb
# Load the training and test data, filling missing values with a sentinel
train = pd.read_csv("train.csv").fillna(value=-999.0)
test = pd.read_csv("test.csv").fillna(value=-999.0)
# Separate the target from the features
y_train = train.price_doc
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)
# XGBoost - sklearn method
gbm = xgb.XGBRegressor()
xgb_params = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [2000],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 1],
    'subsample': [0.7, 1],
    'colsample_bytree': [0.7, 1]
}
# Parameters forwarded to XGBRegressor.fit() to enable early stopping
fit_params = {
    'early_stopping_rounds': 30,
    'eval_metric': 'mae',
    'eval_set': [(x_train, y_train)]
}
grid = model_selection.GridSearchCV(gbm, xgb_params, cv=5)
grid.fit(x_train, y_train, **fit_params)
The problem I am having is with the 'eval_set' parameter. I understand that it expects the predictor and response variables, but as written it evaluates on the full training set, and I am not sure how to make GridSearchCV pass each fold's held-out data as the early-stopping evaluation set.
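To make the goal concrete, here is a rough sketch of the behaviour I would like GridSearchCV to reproduce internally for each parameter combination (the params dict here is hypothetical, standing in for one combination drawn from xgb_params):
from sklearn.model_selection import KFold

params = {'learning_rate': 0.1, 'max_depth': 5}  # hypothetical: one combination from the grid
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(x_train):
    x_tr, x_val = x_train.iloc[train_idx], x_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
    model = xgb.XGBRegressor(**params)
    # Early stopping monitors the fold's held-out data, not the training data
    model.fit(x_tr, y_tr,
              early_stopping_rounds=30,
              eval_metric='mae',
              eval_set=[(x_val, y_val)])
That is, each fold's validation split would serve as the early-stopping set for that fold's fit.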
Does anyone know how to overcome this problem? Thanks.