ValueError: feature_names mismatch: in xgboost in

I have trained an XGBRegressor model. When I use this trained model to predict on new input, predict() raises a feature_names mismatch error, even though the input feature vector has the same structure as the training data.

Also, to build the feature vector in the same structure as the training data, I am doing a lot of inefficient processing, such as adding new empty columns (where data does not exist) and then rearranging the columns to match the training structure. Is there a better and cleaner way of formatting the input so that it matches the training structure?

Tags: python pandas machine-learning regression xgboost

6 answers

爱情/是我丢掉的垃圾

#2 · 2019-04-06 17:33

Do this while creating the DMatrix for XGB:

import numpy as np
import xgboost as xgb
dtrain = xgb.DMatrix(np.asarray(X_train), label=y_train)
dtest = xgb.DMatrix(np.asarray(X_test), label=y_test)

Do not pass X_train and X_test directly.


smile是对你的礼貌

#3 · 2019-04-06 17:34

Check the exception. You should see two arrays: one holds the column names of the dataframe you are passing in, and the other holds the XGBoost feature names. They should be the same length. If you put them side by side in an Excel spreadsheet, you will see that they are not in the same order. My guess is that the XGBoost names were written to a dictionary, so it would be a coincidence if the names in the two arrays were in the same order.

The fix is easy. Just reorder your dataframe columns to match the XGBoost names:

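# With the scikit-learn wrapper (e.g. XGBRegressor), the names live on the underlying Booster.
# For a raw Booster, model.feature_names works directly.
f_names = model.get_booster().feature_names
df = df[f_names]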
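
Rather than comparing the two arrays by eye, a small helper can report how they differ. This is only a sketch: the function name `compare_feature_names` and the example column names are made up for illustration and are not part of XGBoost.

```python
def compare_feature_names(expected, actual):
    """Summarise how two ordered lists of feature names differ."""
    expected, actual = list(expected), list(actual)
    # Names the model was trained on but the input lacks
    missing = [name for name in expected if name not in actual]
    # Names in the input that the model never saw
    unexpected = [name for name in actual if name not in expected]
    # Same set of names on both sides, but arranged differently
    order_differs = not missing and not unexpected and expected != actual
    return {
        "missing": missing,
        "unexpected": unexpected,
        "order_differs": order_differs,
    }


# Hypothetical names, for illustration only
trained_on = ["age", "income", "city_NY", "city_SF"]
passed_in = ["income", "age", "city_NY"]
report = compare_feature_names(trained_on, passed_in)
print(report)
```

If only `order_differs` is true, reordering the columns as above is enough. If `missing` is non-empty, reordering alone will raise a KeyError and the missing columns have to be added first.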


唯我独甜

#4 · 2019-04-06 17:36

I also had this problem when I used a pandas DataFrame (non-sparse representation).

I converted the training and testing data into NumPy ndarrays.

X_train = X_train.to_numpy()  # as_matrix() was removed in pandas 1.0
X_test = X_test.to_numpy()

This is how I got rid of that error!


爱情/是我丢掉的垃圾

#5 · 2019-04-06 17:38

Try converting the data into an ndarray before passing it to fit/predict. For example, if your training data is train_df and your test data is test_df, use the code below:

train_x = train_df.values
test_x = test_df.values

Now fit the model (here model is an estimator instance such as XGBRegressor(), not the xgboost module):

model.fit(train_x, train_y)

Finally, predict:

pred = model.predict(test_x)

Hope this helps!
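
One caveat with the ndarray approach: once the column names are gone, nothing checks the column order any more, so a mismatch produces wrong predictions instead of an error. The sketch below uses made-up data to show the difference between converting directly and aligning to the training column order first.

```python
import pandas as pd

# Hypothetical data: same columns, but in a different order
train_df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
test_df = pd.DataFrame({"b": [30, 40], "a": [3, 4]})

# Naive conversion: column 0 of the test array holds "b", not "a"
naive_test_x = test_df.values

# Align to the training column order first, then convert
aligned_test_x = test_df[train_df.columns].values

print(naive_test_x[0])
print(aligned_test_x[0])
```

Only `aligned_test_x` has its columns in the same positions as `train_df.values`.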


姐就是有狂的资本

#6 · 2019-04-06 17:40

I came across the same problem and solved it by applying the training dataframe's column order to the test dataframe with the following code:

test_df = test_df[train_df.columns]
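
Plain indexing like this raises a KeyError if the new input lacks one of the training columns. For the second part of the question (adding empty columns and then rearranging), pandas' `DataFrame.reindex` can do both in one step. This is a minimal sketch with made-up column names; filling missing columns with 0 is an assumption that suits one-hot encoded features but may not suit others.

```python
import pandas as pd

# Hypothetical training frame, used only for its column order
train_df = pd.DataFrame({"age": [30, 40], "city_NY": [1, 0], "city_SF": [0, 1]})

# New input: lacks city_SF, has an extra column, and is in a different order
new_df = pd.DataFrame({"city_NY": [1], "extra": [99], "age": [25]})

# reindex adds missing columns (filled with 0), drops unknown ones,
# and puts everything in the training order
aligned = new_df.reindex(columns=train_df.columns, fill_value=0)
print(aligned)
```

In practice it may be simpler to save `list(train_df.columns)` alongside the model at training time and reindex against that list at prediction time.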


聊天终结者

#7 · 2019-04-06 17:45

From what I could find, the predict function does not accept a DataFrame (or a sparse matrix) as input. This is a known bug, reported here: https://github.com/dmlc/xgboost/issues/1238

To get around this issue, convert the input first: use to_numpy() for a DataFrame (as_matrix() in older pandas; it was removed in pandas 1.0), or toarray() for a sparse matrix.

This is the only workaround until the bug is fixed or the feature is implemented in a different manner.

