I am using sklearn for text classification, all my features are numerical but my target variable labels are in text. I can understand the rationale behind encoding features to numerics but don't think this applies for the target variable?
问题:
回答1:
If your target variable is in textual form, you can transform it into numeric form (or you can leave it alone, please see my note below) in order for any Scikit-learn algorithm to pick it in an OVA (One Versus All) scheme: your learning algorithm will try to guess each class as compared against the residual ones only when they will be transformed into numeric codes starting from 0 to (number of classes - 1).
For instance, in this example from the Scikit-Learn documentation, you can figure out the class of your iris because there are three models that evaluate each possible class:
- class 0 versus classes 1 and 2
- class 1 versus classes 0 and 2
- class 2 versus classes 0 and 1
Naturally, classes 0, 1 and 2 are Setosa, Versicolor, and Virginica, but the algorithm needs them expressed as numeric codes, as you can verify by exploring the results of the example code:
list(iris.target_names)
['setosa', 'versicolor', 'virginica']
np.unique(Y)
array([0, 1, 2])
NOTE: it is true that Scikit-learn encodes by itself the target labels if they are strings. On Scikit-learn's Github page for logistic regression (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py) you can see at rows 1623 and 1624 where the code calls the label encoder and it encodes labels automatically:
# Encode for string labels label_encoder = LabelEncoder().fit(y) y = label_encoder.transform(y)