XGBoost Categorical Variables: Dummification vs en

When using XGBoost we need to convert categorical variables into numeric.

Would there be any difference in performance/evaluation metrics between the methods of:

dummifying your categorical variables
encoding your categorical variables from e.g. (a,b,c) to (1,2,3)

ALSO:

Would there be any reasons not to go with method 2 by using for example labelencoder?

标签： python categorical-data xgboost

3条回答

Evening l夕情丶

2楼-- · 2020-05-11 10:36

Here is a code example of adding One hot encodings columns to a Pandas DataFrame with Categorical columns:

ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]
print("Starting DF shape: %d, %d" % df.shape)


for col in ONE_HOT_COLS:
    s = df[col].unique()

    # Create a One Hot Dataframe with 1 row for each unique value
    one_hot_df = pd.get_dummies(s, prefix='%s_' % col)
    one_hot_df[col] = s

    print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
    pre_len = len(df)

    # Merge the one hot columns
    df = df.merge(one_hot_df, on=[col], how="left")
    assert len(df) == pre_len
    print(df.shape)

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

3楼-- · 2020-05-11 10:42

I want to answer this question not just in terms of XGBoost but in terms of any problem dealing with categorical data. While "dummification" creates a very sparse setup, specially if you have multiple categorical columns with different levels, label encoding is often biased as the mathematical representation is not reflective of the relationship between levels.

For Binary Classification problems, a genius yet unexplored approach which is highly leveraged in traditional credit scoring models is to use Weight of Evidence to replace the categorical levels. Basically every categorical level is replaced by the proportion of Goods/ Proportion of Bads.

Can read more about it here.

Python library here.

This method allows you to capture the "levels" under one column and avoid sparsity or induction of bias that would occur through dummifying or encoding.

Hope this helps !

0人赞添加讨论(0) 举报

Juvenile、少年°

4楼-- · 2020-05-11 10:44

xgboost only deals with numeric columns.

if you have a feature [a,b,b,c] which describes a categorical variable (i.e. no numeric relationship)

Using LabelEncoder you will simply have this:

array([0, 1, 1, 2])

Xgboost will wrongly interpret this feature as having a numeric relationship! This just maps each string ('a','b','c') to an integer, nothing more.

Proper way

Using OneHotEncoder you will eventually get to this:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

This is the proper representation of a categorical variable for xgboost or any other machine learning tool.

Pandas get_dummies is a nice tool for creating dummy variables (which is easier to use, in my opinion).

Method #2 in above question will not represent the data properly

0人赞添加讨论(0) 举报

XGBoost Categorical Variables: Dummification vs en

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间