How to handle One-Hot Encoding in production envir

In experiments, we usually train on 70% of the data and test on the remaining 30%. But what happens when your model is in production? The following may occur:

Training Set:

-----------------------
| Ser |Type Of Car    |
-----------------------
|  1  | Hatchback     |
|  2  | Sedan         |
|  3  | Coupe         |
|  4  | SUV           |
-----------------------

After one-hot encoding this, we get:

-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
|  1  |     1     |   0   |   0   |  0  |
|  2  |     0     |   1   |   0   |  0  |
|  3  |     0     |   0   |   1   |  0  |
|  4  |     0     |   0   |   0   |  1  |
-----------------------------------------

My model is trained, and now I want to deploy it across multiple dealerships. The model expects 4 features. Now, a certain dealership only sells Sedans and Coupes:

Test Set:

-----------------------
| Ser |Type Of Car    |
-----------------------
|  1  | Coupe         |
|  2  | Sedan         |
-----------------------

One-hot encoding results in:

---------------------------
| Ser | Coupe     | Sedan |
---------------------------
|  1  |     1     |   0   |
|  2  |     0     |   1   |
---------------------------

Here our test set has only 2 features, while the model was trained on 4. It does not make sense to build a new model for every dealership. How should such problems be handled in production? Is there another encoding method that can be used for categorical variables?

Tags: python machine-learning feature-selection one-hot-encoding

2 Answers

成全新的幸福

Reply #2 · 2019-05-31 01:31

The input to your model in production must have the same layout as during training. So if you one-hot encoded 4 categories during training, produce the same 4 columns in production. Fill the columns for categories that are missing from the production data with zeros, and drop the columns for categories you have not seen during training.
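One way to do this with pandas, as a minimal sketch: encode whatever is present in the production data, then reindex the result against the column list saved from training. The names `train_columns` and `prod` and the unseen 'Ferrari' value are assumptions made up for illustration; the four car types come from the question.

```python
import pandas as pd

# Columns produced by one-hot encoding the training set (from the question).
train_columns = ["Hatchback", "Sedan", "Coupe", "SUV"]

# Hypothetical production data: two known types and one unseen type.
prod = pd.DataFrame({"Type Of Car": ["Coupe", "Sedan", "Ferrari"]})

# Encode only what is present in the production data.
encoded = pd.get_dummies(prod["Type Of Car"], dtype=int)

# Align to the training layout: columns missing in production are added
# as zeros, and columns not seen during training (Ferrari) are dropped.
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The column list has to be stored together with the model at training time, so that every dealership is encoded against the same layout.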


beautiful°

Reply #3 · 2019-05-31 01:32

I'll assume you are using pandas to do the one-hot encoding. If not, you have to do some more work, but the logic is the same.

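import pandas as pd

known_categories = ['Sedan', 'Coupe', 'Limo']  # categories seen in the training set

car_type = pd.Series(['Sedan', 'Ferrari'])  # 'Ferrari' is a new category in production

# Values outside known_categories become NaN in the Categorical
car_type = pd.Categorical(car_type, categories=known_categories)

# dtype=float produces the 1.0/0.0 values shown in the result
pd.get_dummies(car_type, dtype=float)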

The result is:

    Sedan   Coupe   Limo
0   1.0      0.0    0.0    # Sedan entry
1   0.0      0.0    0.0    # Ferrari entry

Since Ferrari is not in the list of known categories, all the one-hot encoded entries for the Ferrari are zero. If you find a new car type in your production data, every column encoding the car type should be 0 for that row.
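If you use scikit-learn rather than pandas (an assumption, the question does not say), the same behaviour is available from `OneHotEncoder` with `handle_unknown="ignore"`. This is a rough sketch, not the only way to do it; the variable names and the 'Ferrari' row are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

# Training data from the question: one categorical column, four car types.
train = [["Hatchback"], ["Sedan"], ["Coupe"], ["SUV"]]

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Hypothetical production data, including a type not seen in training.
prod = [["Coupe"], ["Sedan"], ["Ferrari"]]

# transform returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(prod).toarray()

print(encoder.categories_[0])  # column order learned from the training data
print(encoded)
```

The fitted encoder can be saved alongside the model and reused in production, so the column layout always matches what the model was trained on.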

