Is numerical encoding necessary for the target var

2019-08-03 16:25发布

问题:

I am using sklearn for text classification, all my features are numerical but my target variable labels are in text. I can understand the rationale behind encoding features to numerics but don't think this applies for the target variable?

回答1:

If your target variable is in textual form, you can transform it into numeric form (or you can leave it alone, please see my note below) in order for any Scikit-learn algorithm to pick it in an OVA (One Versus All) scheme: your learning algorithm will try to guess each class as compared against the residual ones only when they will be transformed into numeric codes starting from 0 to (number of classes - 1).

For instance, in this example from the Scikit-Learn documentation, you can figure out the class of your iris because there are three models that evaluate each possible class:

  • class 0 versus classes 1 and 2
  • class 1 versus classes 0 and 2
  • class 2 versus classes 0 and 1

Naturally, classes 0, 1 and 2 are Setosa, Versicolor, and Virginica, but the algorithm needs them expressed as numeric codes, as you can verify by exploring the results of the example code:

list(iris.target_names)
['setosa', 'versicolor', 'virginica']

np.unique(Y)
array([0, 1, 2])

NOTE: it is true that Scikit-learn encodes by itself the target labels if they are strings. On Scikit-learn's Github page for logistic regression (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py) you can see at rows 1623 and 1624 where the code calls the label encoder and it encodes labels automatically:

# Encode for string labels
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)