How do I use use scikit LabelEncoder for new label

So my code like is:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(train["capital city"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

But what if in my test dataset, I has something like "beijing" but "beijing" does not exist in the training set? Is there a way for the encoder to handle this without adding in every possible capital city in the globe?

标签： pandas machine-learning scikit-learn sklearn-pandas scikits

3条回答

Juvenile、少年°

2楼-- · 2019-08-21 16:12

you can try solution from "sklearn.LabelEncoder with never seen before values" https://stackoverflow.com/a/48169252/9043549

0人赞添加讨论(0) 举报

地球回转人心会变

3楼-- · 2019-08-21 16:14

For a real world scenario, where all you have is training data and new classes can come up later, you can try my solution:

le.classes_ = np.append(le.classes_, "new_class_name")
le.transform(new_y)

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

4楼-- · 2019-08-21 16:20

You can pass a total list of df['capital city'] to the LabelEncoder.fit() before splitting the dataframe df into train and test.

For example, if df is like this:

df['capital city'] = ['amsterdam', 'paris', 'tokyo', 'beijing', 'tokyo', 'newyork', 'paris']

Then, you can use:

le = preprocessing.LabelEncoder();
le.fit(df['capital city'])

le.classes_
Output: ['amsterdam', 'beijing', 'newyork', 'paris', 'tokyo']

Then use transform() on train and test data to convert them to integers correctly.

train["capital city integers"] = le.transform(train["capital city"])
test["capital city integers"] = le.transform(test["capital city"])

Hope this helps.

Note: Although the above given siggestion will work for you and is perfectly acceptable when you are learning, but you should consider about the real world scenarios when employing this for real tasks. Because in real world, all od your available data will be training data (so you use and encode the capital cities), and then new data may come which contains a never before seen capital city value. What would you like to do in that case?

0人赞添加讨论(0) 举报

How do I use use scikit LabelEncoder for new label

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间