Using categorical data as features in sklearn LogisticRegression

Published 2020-07-03 05:04

Question:

I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand of course I need to encode it.

  1. What I don't understand is how to pass the encoded feature to the logistic regression so that it's processed as a categorical feature, rather than interpreting the int value it got during encoding as a standard quantifiable feature.

  2. (Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary_, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.

Especially with the first one!

Answer 1:

You can create indicator variables for different categories. For example:

animal_names = ['mouse', 'cat', 'dog']

indicator_cat = [int(name == 'cat') for name in animal_names]
indicator_dog = [int(name == 'dog') for name in animal_names]

Then we have:

indicator_cat = [0, 1, 0]
indicator_dog = [0, 0, 1]

And you can concatenate these onto your original data matrix:

import numpy as np
X_with_indicator_vars = np.column_stack([X, indicator_cat, indicator_dog])

Remember, though, to leave one category without an indicator if a constant term is included in the data matrix! Otherwise your data matrix won't be of full column rank (or, in econometric terms, you'll have multicollinearity).

[1  1  0  0
 1  0  1  0
 1  0  0  1]

Notice how a constant term, an indicator for mouse, an indicator for cat, and an indicator for dog lead to a less-than-full-column-rank matrix: the first column is the sum of the last three.
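
In Python, the same drop-one-category trick is a one-liner with pandas; a minimal sketch (the DataFrame and column name are made up for illustration):

import pandas as pd

df = pd.DataFrame({'animal': ['mouse', 'cat', 'dog']})

# drop_first=True leaves the alphabetically first category ('cat' here)
# without an indicator, avoiding the rank deficiency shown above
indicators = pd.get_dummies(df['animal'], drop_first=True)
print(indicators)   # columns: dog, mouse (newer pandas prints booleans)

Recent sklearn versions (0.21+) offer the same behaviour via OneHotEncoder(drop='first').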


Answer 2:

  1. The standard approach to converting categorical features into numerical ones is one-hot encoding (sklearn's OneHotEncoder); see the sketch after this list.
  2. They are completely different classes:

    DictVectorizer.vocabulary_

    A dictionary mapping feature names to feature indices.

    I.e., after fit(), DictVectorizer has seen all possible feature names, and it knows in which particular column each feature's value will be placed. So DictVectorizer.vocabulary_ contains the indices of features, but not their values.

    LabelEncoder, in contrast, maps each possible label (a label can be a string or an integer) to some integer value, and returns a 1D vector of these integer values.
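
A short sketch contrasting the three on toy data (the data is made up; OneHotEncoder accepts string input in sklearn >= 0.20, and exact arguments vary a bit across versions):

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

animals = ['cat', 'dog', 'mouse', 'cat']

# LabelEncoder: one integer per label, returned as a 1D vector
le = LabelEncoder()
print(le.fit_transform(animals))                  # [0 1 2 0]

# OneHotEncoder: one indicator column per category, returned as a 2D matrix
ohe = OneHotEncoder()
print(ohe.fit_transform([[a] for a in animals]).toarray())

# DictVectorizer: vocabulary_ maps feature names to column indices
dv = DictVectorizer()
dv.fit_transform([{'animal': a} for a in animals])
print(dv.vocabulary_)   # {'animal=cat': 0, 'animal=dog': 1, 'animal=mouse': 2}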



Answer 3:

Suppose the type of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:

import pandas as pd    
catColumns = df.select_dtypes(['object']).columns

Then, you can create the indicator variables using the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

for col in catColumns:
    n = len(df[col].unique())
    if n > 2:
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)  # drop one category to avoid multicollinearity
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        le.fit(df[col])
        df[col] = le.transform(df[col])
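
Once every categorical column is encoded this way, the frame can be passed straight to LogisticRegression. A minimal sketch, assuming the label lives in a (hypothetical) column named 'target':

from sklearn.linear_model import LogisticRegression

y = df['target']               # hypothetical label column
X = df.drop('target', axis=1)  # remaining columns are all numeric now

clf = LogisticRegression()
clf.fit(X, y)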