I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold is then invoked within xgb.cv, which takes the remaining indices for each fold to be the indices of the training set for the respective fold.
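For reference, here is roughly how folds is used, on made-up data (each list element is one fold's validation indices; xgb.cv then trains each fold on all remaining rows):

library(xgboost)

d <- xgb.DMatrix(matrix(rnorm(100 * 5), nrow = 100), label = rnorm(100))
# each element is the *validation* index set of one fold
folds <- list(1:33, 34:66, 67:100)
cv <- xgb.cv(params = list(objective = "reg:linear"), data = d,
             nrounds = 10, folds = folds, verbose = FALSE)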
Question: Can I specify both the training and validation indices via any interfaces in the xgboost interface?
My primary motivation is performing time-series cross-validation, and I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do:
# assume I have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1: train on X_1-X_10, validate on X_11-X_20
fold2: train on X_1-X_20, validate on X_21-X_30
fold3: train on X_1-X_30, validate on X_31-X_40
...
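In code, these index sets could be built like so (assuming, for simplicity, that each strip X_i is a single row, so X_1-X_10 are rows 1:10):

# validation fold k: the 10 strips following the first 10*k strips
folds_valid <- lapply(1:9, function(k) (10 * k + 1):(10 * k + 10))
# desired training fold k: all strips up to the validation window
folds_train <- lapply(1:9, function(k) 1:(10 * k))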
Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate, since the remaining data greatly outnumber the training data and may have a very different distribution from the training data, especially for the earlier folds. Here's what I mean:
fold1: train on X_1-X_10, validate on X_11-X_100 # huge error
...
I'm open to solutions from other packages if they are convenient (i.e. wouldn't require me to pry open the source code) and do not nullify the efficiency of the original xgboost implementation.
I think the bottom part of the question is the wrong way round; it should probably say "force me to use the remaining examples as the training set".
It also seems that the mentioned helper function xgb.cv.mknfold is not around anymore. Note my version of xgboost is 0.71.2.
However, it does seem that this could be achieved fairly straightforwardly with a small modification of xgb.cv: I have just added an optional argument folds_train = NULL and used it later on inside the function.
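The relevant spot is where xgb.cv builds the per-fold data. A minimal sketch of the change, based on the 0.71.2-era source (the surrounding lines vary by version, so compare against what printing xgboost::xgb.cv shows for your install):

# excerpt from a modified copy of xgb.cv (not standalone): dall, folds,
# params and the new folds_train argument are all in scope in the function
bst_folds <- lapply(seq_along(folds), function(k) {
  dtest <- slice(dall, folds[[k]])
  if (is.null(folds_train)) {
    # default behaviour: train on everything outside fold k's validation set
    dtrain <- slice(dall, unlist(folds[-k]))
  } else {
    # new behaviour: train only on the indices supplied for fold k
    dtrain <- slice(dall, folds_train[[k]])
  }
  handle <- xgb.Booster.handle(params, list(dtrain, dtest))
  list(dtrain = dtrain, bst = handle,
       watchlist = list(train = dtrain, test = dtest), index = folds[[k]])
})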
Then you can use the new version of the function, e.g. like below:
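For instance, with my_xgb.cv standing in for whatever you call the modified copy (the data here is made up purely for illustration):

library(xgboost)

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)   # 100 time-ordered rows
dall <- xgb.DMatrix(X, label = rnorm(100))

# the expanding-window folds from the question, here just the first three
folds_valid <- lapply(1:3, function(k) (10 * k + 1):(10 * k + 10))
folds_train <- lapply(1:3, function(k) 1:(10 * k))

res <- my_xgb.cv(params = list(objective = "reg:linear"),
                 data = dall, nrounds = 50,
                 folds = folds_valid,        # validation indices, as before
                 folds_train = folds_train)  # new: explicit training indices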
So now you should be able to call the function with the extra argument, providing the additional indices for the training data.
Note that I have not had time to test this.