I'm having a lot of trouble figuring out how to correctly set num_class for xgboost.
I've got an example using the Iris data:
library(xgboost)

df <- iris
y <- df$Species
num.class <- length(levels(y))
levels(y) <- 1:num.class    # relabel the factor levels as 1, 2, 3
head(y)
df <- df[, 1:4]
y <- as.matrix(y)
df <- as.matrix(df)
param <- list("objective" = "multi:softprob",
"num_class" = 3,
"eval_metric" = "mlogloss",
"nthread" = 8,
"max_depth" = 16,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 1,
"colsample_bytree" = 1,
"min_child_weight" = 12)
model <- xgboost(param=param, data=df, label=y, nrounds=20)
This returns an error:
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label
If I change num_class to 2, I get the same error. If I increase num_class to 4, the model runs, but I get 600 predicted probabilities back (150 rows × 4 classes), which makes sense for 4 classes.
I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.
I was seeing the same error. My issue was that I was using an eval_metric that was only meant to be used for multiclass labels when my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all of the options.
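For the binary case, a minimal sketch of a working configuration (the setosa/not-setosa target here is an illustrative assumption, not from the original question):

library(xgboost)

# Hypothetical binary target: setosa vs. not setosa, encoded as 1/0
y_bin <- as.integer(iris$Species == "setosa")
x <- as.matrix(iris[, 1:4])

# "logloss" is the binary metric; "mlogloss" requires num_class and multiclass labels
model_bin <- xgboost(data = x, label = y_bin, nrounds = 20,
                     objective = "binary:logistic",
                     eval_metric = "logloss")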
The error label must be in [0, num_class) means the labels have to be zero-indexed. In your script, add

y <- y - 1

before model <- ...
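Applied to the script above, a sketch of the corrected version (converting the factor to integers explicitly, since subtracting 1 from a factor or character matrix will not work directly):

library(xgboost)

df <- as.matrix(iris[, 1:4])
# as.integer() on the factor gives 1, 2, 3; subtracting 1 gives the
# zero-based labels 0, 1, 2 that multi:softprob expects
y <- as.integer(iris$Species) - 1

param <- list("objective" = "multi:softprob",
              "num_class" = 3,
              "eval_metric" = "mlogloss")

model <- xgboost(param = param, data = df, label = y, nrounds = 20)

# 150 rows x 3 classes = 450 predicted probabilities
length(predict(model, df))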
I ran into this rather weird problem as well. It seemed in my case to be a result of not properly encoding the labels.
First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N + 1 buckets of predicted probabilities.

I re-encoded the labels as integers and then num_class worked fine when set to N. For example:
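(A minimal sketch of that re-encoding; the string labels below are made up for illustration.)

labels_chr <- c("cat", "dog", "bird", "dog", "cat")   # hypothetical string labels
labels_int <- as.integer(factor(labels_chr)) - 1      # zero-based: bird = 0, cat = 1, dog = 2
num.class <- length(unique(labels_int))               # N = 3, so num_class = 3, not 4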