Is numerical encoding necessary for the target var

Posted 2019-08-03 16:14

I am using sklearn for text classification, all my features are numerical but my target variable labels are in text. I can understand the rationale behind encoding features to numerics but don't think this applies for the target variable?

Tags: python machine-learning sklearn-pandas

1 answer

Rolldiameter

Post #2 · 2019-08-03 17:19

If your target variable is text, you can convert it to numeric codes, or you can leave it as it is (please see my note below). A Scikit-learn algorithm that handles multiclass problems with an OVA (One Versus All) scheme fits one model per class, and each model tries to tell that class apart from all the remaining ones. Internally the classes are represented as numeric codes running from 0 to (number of classes - 1).
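As a minimal sketch of that encoding, assuming you want to do the conversion yourself with Scikit-learn's LabelEncoder (the label values here are made up for illustration and are not from the question):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels, invented for this illustration
labels = ["spam", "ham", "spam", "eggs", "ham"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are sorted, then numbered 0 .. (number of classes - 1)
classes = list(encoder.classes_)    # ['eggs', 'ham', 'spam']
codes_list = codes.tolist()         # [2, 1, 2, 0, 1]

# inverse_transform maps the numeric codes back to the text labels
restored = encoder.inverse_transform(codes).tolist()

print(classes, codes_list, restored)
```

Keeping the fitted encoder around lets you translate predictions back to readable labels later.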

For instance, in this example from the Scikit-learn documentation, you can figure out the class of your iris because three models are fitted, one for each possible class:

class 0 versus classes 1 and 2
class 1 versus classes 0 and 2
class 2 versus classes 0 and 1

Naturally, classes 0, 1 and 2 are Setosa, Versicolor, and Virginica, but the algorithm works with them as numeric codes, as you can verify by exploring the results of the example code:

>>> list(iris.target_names)
['setosa', 'versicolor', 'virginica']

>>> np.unique(Y)
array([0, 1, 2])
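To make the three one-versus-all models visible, here is a rough sketch that wraps a logistic regression in OneVsRestClassifier. This is my own choice for illustration and not necessarily what the documentation example does; the names ovr, n_models and predicted_names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X, Y = iris.data, iris.target

# Explicit one-versus-rest wrapper: one binary model is fitted per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

n_models = len(ovr.estimators_)        # one estimator per class
target_codes = np.unique(Y).tolist()   # numeric codes of the classes

# Predictions come back as numeric codes; index target_names to read them
predicted_names = iris.target_names[ovr.predict(X[:3])]

print(n_models, target_codes, list(predicted_names))
```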

NOTE: Scikit-learn does encode the target labels by itself when they are strings. In the logistic regression source on Scikit-learn's GitHub repository (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py) you can see at lines 1623 and 1624 that the code calls the label encoder and encodes the labels automatically:
# Encode for string labels
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
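You can check this behaviour yourself. The sketch below, which assumes the iris data with its numeric targets mapped back to their names, fits a logistic regression directly on string labels without any manual encoding:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Build a text target by mapping each numeric code to its class name
y_text = iris.target_names[iris.target]

# No manual encoding: the classifier accepts the string labels as they are
clf = LogisticRegression(max_iter=1000).fit(iris.data, y_text)

fitted_classes = clf.classes_.tolist()   # ['setosa', 'versicolor', 'virginica']
predictions = clf.predict(iris.data[:5])  # predictions are strings too

print(fitted_classes, list(predictions))
```

The fitted classes_ attribute holds the original strings, and predict returns them as well, so for the target variable the numeric encoding is a convenience and not a requirement.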
