Getting ValueError: y contains new labels when usi

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.

During the training, I'm doing as follows:

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

Now, for testing/prediction, when I pass in new data, I want to transform its 'ID' column using le_id: if a value was already seen during training, encode it with the fitted label encoder; otherwise, assign it a new numerical value.

In the test file, I was doing as follows:

new_df['ID'] = le_id.transform(new_df.ID)

But, I'm getting the following error: ValueError: y contains new labels
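For reference, the error can be reproduced in a few lines. This is only a minimal sketch: the IDs are made up, and the exact wording of the message may differ between scikit-learn versions.

```python
from sklearn.preprocessing import LabelEncoder

le_id = LabelEncoder()
le_id.fit(["ABC123", "IDF345"])  # labels seen during training

try:
    # "XYZ000" was never passed to fit, so transform cannot encode it
    le_id.transform(["IDF345", "XYZ000"])
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```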

How do I fix this?? Thanks!

UPDATE:

My task is to use data like the below as training data and predict the 'High', 'Mod', 'Low' labels for new (BankNum, ID) combinations. The model should learn from the training dataset which characteristics lead to a 'High' and which lead to a 'Low'. For example, below, a 'High' is given when there are multiple entries with the same BankNum and different IDs.

df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict it on something like:

BankNum   | ID    |

00982222  | AB999 |
00982222  | AB999 |
00981111  | AB890 |

I'm doing something like this:

df['BankNum'] = df.BankNum.astype(np.float128)

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)
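The pattern described in the update ('High' where one BankNum has several distinct IDs) could also be exposed to the model as an explicit count, which does not depend on encoding individual IDs. The following is only a sketch of that idea on the sample rows above; the column name ids_per_bank is hypothetical and not part of the original code.

```python
import pandas as pd

# Sample training rows from the table above.
df = pd.DataFrame({
    "BankNum": ["0098-7772", "0098-7772", "0098-7772",
                "0870-7771", "0870-7771", "0098-2123"],
    "ID": ["AB123", "ED245", "ED343", "ED200", "ED100", "GH564"],
    "Labels": ["High", "High", "High", "Mod", "Mod", "Low"],
})

# For every row, count the distinct IDs that share its BankNum.
# "ids_per_bank" is a hypothetical feature name used for illustration.
df["ids_per_bank"] = df.groupby("BankNum")["ID"].transform("nunique")

print(df)
```

On these six rows the count is 3 for the 'High' group, 2 for 'Mod' and 1 for 'Low'. Whether that holds on the real data is not something the sample can confirm.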

Tags: python machine-learning scikit-learn prediction

2 Answers

姐就是有狂的资本

Reply #2 · 2019-06-24 04:51

You can try the solution from "sklearn.LabelEncoder with never seen before values": https://stackoverflow.com/a/48169252/9043549. The idea is to build a dictionary from the fitted classes, then map the new column through it and fill any new classes with some "known value".

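from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
suf = "_le"
col = "a"
# Fit on the training column and store the encoded values.
df[col + suf] = le.fit_transform(df[col])
# Dictionary from each known class to its integer code.
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col = "b"
# Values not in the dictionary map to NaN; fill them with the code of a known class ("c" here).
df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)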
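As an alternative to building the dictionary by hand, scikit-learn's OrdinalEncoder can encode unseen categories itself via handle_unknown="use_encoded_value". As far as I know this option requires scikit-learn 0.24 or later, so treat that as an assumption. This is a sketch rather than the method from the linked answer, and the IDs and the ID_enc column name are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up IDs for illustration only.
train = pd.DataFrame({"ID": ["ABC123", "IDF345", "ABC123"]})
test = pd.DataFrame({"ID": ["IDF345", "ZZZ999"]})  # ZZZ999 never seen in training

# Categories not seen during fit are encoded as -1 instead of raising ValueError.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# OrdinalEncoder expects 2-D input, hence the double brackets and ravel().
train["ID_enc"] = enc.fit_transform(train[["ID"]]).ravel().astype(int)
test["ID_enc"] = enc.transform(test[["ID"]]).ravel().astype(int)

print(test)
```

Note that every unseen ID collapses to the same code (-1), so the model cannot tell new IDs apart; that differs from assigning each new ID its own number as asked in the question.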
