I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand, of course, that I need to encode it.
What I don't understand is how to pass the encoded feature to the logistic regression so that it is processed as a categorical feature, rather than having the integer value it received during encoding interpreted as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.
Especially with the first one!
You can create indicator variables for different categories. For example:
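Here is a minimal sketch, assuming a hypothetical animal column with three observations and the categories 'cat', 'dog' and 'mouse':

```python
import numpy as np

# Hypothetical categorical column with three observations
animals = np.array(['cat', 'dog', 'mouse'])

# One indicator (dummy) column per category we keep;
# 'mouse' is deliberately left without an indicator (see the note on multicollinearity below)
indicator_cat = (animals == 'cat').astype(int).reshape(-1, 1)
indicator_dog = (animals == 'dog').astype(int).reshape(-1, 1)
```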
Then we have:
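Running the sketch above, the indicator columns come out as:

```python
print(indicator_cat.ravel())  # [1 0 0]
print(indicator_dog.ravel())  # [0 1 0]
```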
And you can concatenate these onto your original data matrix:
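Continuing the same sketch, with X standing in for the rest of your (numeric) feature matrix:

```python
# X is a made-up numeric feature matrix with one row per observation
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

X_with_dummies = np.hstack([X, indicator_cat, indicator_dog])
# X_with_dummies has shape (3, 4) and can be passed to LogisticRegression as-is
```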
Remember, though, to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't have full column rank (or, in econometric terms, you have multicollinearity).
Suppose the type of each categorical variable is "object". First, you can create a pandas.Index of the categorical column names. Then, you can create the indicator variables with a for-loop like the one sketched below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
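A sketch of what that loop might look like, assuming the raw features live in a hypothetical DataFrame called df:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# pandas.Index of the categorical (object-typed) column names
cat_columns = df.select_dtypes(include=['object']).columns

for col in cat_columns:
    if df[col].nunique() == 2:
        # Binary category: encode in place as 0/1
        df[col] = LabelEncoder().fit_transform(df[col])
    else:
        # More than two categories: one indicator column per category,
        # dropping the first to avoid multicollinearity
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat([df.drop(columns=col), dummies], axis=1)
```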
They are completely different classes:
DictVectorizer.vocabulary_: after fit(), DictVectorizer knows all possible feature names, and it knows in which particular column it will place each particular value of a feature. So DictVectorizer.vocabulary_ contains the indices of features, not their values.
LabelEncoder, in contrast, maps each possible label (a label can be a string or an integer) to some integer value, and returns a 1D vector of these integer values.
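A small sketch of the difference, using made-up records just for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

records = [{'city': 'London'}, {'city': 'Paris'}, {'city': 'London'}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)
print(dv.vocabulary_)  # e.g. {'city=London': 0, 'city=Paris': 1} -> column indices, not values
print(X)               # [[1. 0.]
                       #  [0. 1.]
                       #  [1. 0.]]

le = LabelEncoder()
y = le.fit_transform(['London', 'Paris', 'London'])
print(y)               # [0 1 0] -> one integer per label, as a 1D array
```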