I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.
I understand, of course, that I need to encode it.
What I don't understand is how to pass the encoded feature to the logistic regression so that it is processed as a categorical feature, rather than having the integer value it received during encoding interpreted as a standard quantifiable feature.
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.
Especially with the first one!
You can create indicator variables for different categories. For example:
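Here is a minimal sketch, assuming a hypothetical animal column with three observations and the categories 'cat', 'dog' and 'mouse':

```python
import numpy as np

# Hypothetical categorical column with three observations
animals = np.array(['cat', 'dog', 'mouse'])

# One indicator (dummy) column per category we keep;
# 'mouse' is deliberately left without an indicator (see the note on multicollinearity below)
indicator_cat = (animals == 'cat').astype(int).reshape(-1, 1)
indicator_dog = (animals == 'dog').astype(int).reshape(-1, 1)
```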
Then we have:
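Running the sketch above, the indicator columns come out as:

```python
print(indicator_cat.ravel())  # [1 0 0]
print(indicator_dog.ravel())  # [0 1 0]
```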
And you can concatenate these onto your original data matrix:
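Continuing the same sketch, with X standing in for the rest of your (numeric) feature matrix:

```python
# X is a made-up numeric feature matrix with one row per observation
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

X_with_dummies = np.hstack([X, indicator_cat, indicator_dog])
# X_with_dummies has shape (3, 4) and can be passed to LogisticRegression as-is
```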
Remember, though, to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't have full column rank (or, in econometric terms, you have multicollinearity).
Suppose the type of each categorical variable is "object". First, you can create a pandas.Index of the categorical column names. Then, you can create the indicator variables with a for-loop like the one sketched below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).
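A sketch of what that loop might look like, assuming the raw features live in a hypothetical DataFrame called df:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# pandas.Index of the categorical (object-typed) column names
cat_columns = df.select_dtypes(include=['object']).columns

for col in cat_columns:
    if df[col].nunique() == 2:
        # Binary category: encode in place as 0/1
        df[col] = LabelEncoder().fit_transform(df[col])
    else:
        # More than two categories: one indicator column per category,
        # dropping the first to avoid multicollinearity
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat([df.drop(columns=col), dummies], axis=1)
```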
They are completely different classes:
DictVectorizer.vocabulary_: after fit(), DictVectorizer knows all possible feature names, and it knows in which particular column it will place each particular value of a feature. So DictVectorizer.vocabulary_ contains the indices of features, not their values.
LabelEncoder, in contrast, maps each possible label (a label can be a string or an integer) to some integer value, and returns a 1D vector of these integer values.
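A small sketch of the difference, using made-up records just for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

records = [{'city': 'London'}, {'city': 'Paris'}, {'city': 'London'}]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)
print(dv.vocabulary_)  # e.g. {'city=London': 0, 'city=Paris': 1} -> column indices, not values
print(X)               # [[1. 0.]
                       #  [0. 1.]
                       #  [1. 0.]]

le = LabelEncoder()
y = le.fit_transform(['London', 'Paris', 'London'])
print(y)               # [0 1 0] -> one integer per label, as a 1D array
```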