I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand of course I need to encode it.
What I don't understand is how to pass the encoded feature to the logistic regression so that it's processed as a categorical feature, rather than having the int value it got when encoding interpreted as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject but not very deeply.
Especially with the first one!
You can create indicator variables for different categories. For example:
animal_names = {'mouse';'cat';'dog'}
Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')
Then we have:
Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
And you can concatenate these onto your original data matrix:
X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
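Since the question is about scikit-learn, here is a rough NumPy sketch of the same idea in Python. The feature matrix X below is made up purely for illustration; it is not from the question.

```python
import numpy as np

animal_names = np.array(["mouse", "cat", "dog"])

# Elementwise comparison plays the role of strcmp: 1 where the name matches.
indicator_cat = (animal_names == "cat").astype(int)
indicator_dog = (animal_names == "dog").astype(int)

# Hypothetical numeric feature matrix with one row per animal.
X = np.array([[0.5], [1.2], [3.1]])

# Append the two indicators as extra columns of the data matrix.
X_with_indicator_vars = np.column_stack([X, indicator_cat, indicator_dog])
```

"mouse" gets no indicator of its own here, so it acts as the baseline category.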
Remember though to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't be full column rank (or in econometric terms, you have multicollinearity).
[1 1 0 0      Notice how a constant term, an indicator for mouse,
 1 0 1 0      an indicator for cat, and an indicator for dog
 1 0 0 1]     lead to a less than full column rank matrix:
              the first column is the sum of the last three.
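If you want to check this numerically, a small sketch with NumPy's matrix_rank could look like the following. The six rows are invented sample data, with each animal appearing twice so that there are more rows than columns.

```python
import numpy as np

# Hypothetical sample: each animal appears twice.
animals = np.array(["mouse", "cat", "dog", "cat", "mouse", "dog"])

constant = np.ones(len(animals), dtype=int)
mouse = (animals == "mouse").astype(int)
cat = (animals == "cat").astype(int)
dog = (animals == "dog").astype(int)

# Constant plus an indicator for every category: 4 columns.
full = np.column_stack([constant, mouse, cat, dog])
# Constant plus indicators for all but one category: 3 columns.
reduced = np.column_stack([constant, cat, dog])

rank_full = np.linalg.matrix_rank(full)        # 3, fewer than the 4 columns
rank_reduced = np.linalg.matrix_rank(reduced)  # 3, same as the 3 columns
```

Dropping the mouse indicator restores full column rank because the constant column is no longer the sum of the remaining indicators.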
Suppose the type of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:
import pandas as pd
catColumns = df.select_dtypes(['object']).columns
Then, you can create the indicator variables using the for-loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
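from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if n > 2:
        # one indicator column per category; the prefix keeps names from clashing across columns
        X = pd.get_dummies(df[col], prefix=col)
        # drop one category to avoid multicollinearity
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        # binary variable: encode the two categories as 0 and 1
        le.fit(df[col])
        df[col] = le.transform(df[col])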
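As an alternative to the loop, and not the approach used above, the encoding can also be done inside a scikit-learn pipeline so that the LogisticRegression only ever sees indicator columns. This is a minimal sketch assuming a reasonably recent scikit-learn (OneHotEncoder with the drop option); the toy DataFrame and the column names animal, weight and is_pet are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column, one numeric column, a binary target.
df = pd.DataFrame({
    "animal": ["mouse", "cat", "dog", "cat", "mouse", "dog"],
    "weight": [0.02, 4.0, 20.0, 5.0, 0.03, 25.0],
    "is_pet": [0, 1, 1, 1, 0, 1],
})

# One-hot encode "animal", dropping the first category; pass "weight" through unchanged.
encode = ColumnTransformer(
    [("animal", OneHotEncoder(drop="first"), ["animal"])],
    remainder="passthrough",
)

model = Pipeline([("encode", encode), ("logreg", LogisticRegression())])
model.fit(df[["animal", "weight"]], df["is_pet"])

# Two indicator columns (three categories minus the dropped one) plus "weight".
n_features = model.named_steps["logreg"].coef_.shape[1]
predictions = model.predict(df[["animal", "weight"]])
```

Because the model receives one 0/1 column per remaining category, it never treats an arbitrary integer code as a quantity, which is the concern raised in the question.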