How to handle One-Hot Encoding in production environment

Posted 2019-05-31 01:13

While running experiments, we usually train on 70% of the data and test on the remaining 30%. But what happens when your model is in production? The following may occur:

Training Set:

-----------------------
| Ser | Type of Car   |
-----------------------
|  1  | Hatchback     |
|  2  | Sedan         |
|  3  | Coupe         |
|  4  | SUV           |
-----------------------

After one-hot encoding this, we get:

-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
|  1  |     1     |   0   |   0   |  0  |
|  2  |     0     |   1   |   0   |  0  |
|  3  |     0     |   0   |   1   |  0  |
|  4  |     0     |   0   |   0   |  1  |
-----------------------------------------

My model is trained and now I want to deploy it across multiple dealerships. The model is trained on 4 features. Now, a certain dealership only sells Sedans and Coupes:

Test Set:

-----------------------
| Ser | Type of Car   |
-----------------------
|  1  | Coupe         |
|  2  | Sedan         |
-----------------------

One-hot encoding this results in:

---------------------------
| Ser | Coupe     | Sedan |
---------------------------
|  1  |     1     |   0   |
|  2  |     0     |   1   |
---------------------------

Here the test set has only 2 features, while the model expects 4. It does not make sense to build a separate model for every new dealership. How should such cases be handled in production? Is there another encoding method that can be used for categorical variables?
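
The mismatch described above can be reproduced in a few lines. This is only a sketch: it assumes pandas' `get_dummies` is the encoder (the question does not say which tool was used), and the column name `car_type` is made up for illustration.

```python
import pandas as pd

# Training data from the question: four car types.
train = pd.DataFrame({"car_type": ["Hatchback", "Sedan", "Coupe", "SUV"]})
# Data from a dealership that sells only two of them.
test = pd.DataFrame({"car_type": ["Coupe", "Sedan"]})

# Encoding each frame on its own creates one column per category
# that is present in that frame, so the two layouts differ.
train_encoded = pd.get_dummies(train["car_type"], dtype=int)
test_encoded = pd.get_dummies(test["car_type"], dtype=int)

print(train_encoded.shape)  # rows x columns for the training set
print(test_encoded.shape)   # fewer columns than the model was trained on
```

A model fitted on `train_encoded` cannot consume `test_encoded` directly, because the number of columns no longer matches.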

2 Answers
成全新的幸福
Reply #2 · 2019-05-31 01:31

The input to your model in production should have the same layout as during training. So if you one-hot encode 4 categories during training, encode the same 4 in production. Use zeros for categories that are missing from the production data, and drop columns for categories you did not see during training.
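
One possible way to apply this with pandas is to save the training-time column list and reindex the production encoding against it. This is a rough sketch, not the only option: `TRAIN_COLUMNS` and `encode_for_model` are hypothetical names, and it assumes the column order was stored next to the model at training time.

```python
import pandas as pd

# Hypothetical: column order saved at training time alongside the model.
TRAIN_COLUMNS = ["Hatchback", "Sedan", "Coupe", "SUV"]

def encode_for_model(car_types):
    """One-hot encode, then force the training-time column layout."""
    dummies = pd.get_dummies(pd.Series(car_types), dtype=int)
    # reindex adds the missing training columns (filled with 0) and
    # drops columns for categories that were not seen during training.
    return dummies.reindex(columns=TRAIN_COLUMNS, fill_value=0)

# 'Ferrari' was never seen in training, so its column is dropped
# and its row ends up as all zeros.
encoded = encode_for_model(["Coupe", "Sedan", "Ferrari"])
print(encoded)
```

Whatever the dealership sells, the result always has the same four columns in the same order as the training data.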

beautiful°
Reply #3 · 2019-05-31 01:32

I'll assume you are using pandas to do the one-hot encoding. If not, you have to do some more work, but the logic is still the same.

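import pandas as pd

# Categories seen in the training set
known_categories = ['Sedan', 'Coupe', 'Limo']

# Production data containing a new category, 'Ferrari'
car_type = pd.Series(['Sedan', 'Ferrari'])

# Restrict to the known categories; unknown values become NaN
car_type = pd.Categorical(car_type, categories=known_categories)

# dtype=float yields 0.0/1.0 columns (newer pandas defaults to booleans)
pd.get_dummies(car_type, dtype=float)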

Result is:

   Sedan  Coupe  Limo
0    1.0    0.0   0.0    # Sedan entry
1    0.0    0.0   0.0    # Ferrari entry

Since Ferrari is not in the list of known categories, all the one-hot encoded entries for the Ferrari row are zero. Any new car type that appears in your production data is encoded the same way: every car-type column in that row is 0.
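
Because unknown car types are silently turned into all-zero rows, it may be worth reporting them as well. A minimal sketch of that idea, assuming the same `known_categories` list as above; the helper name `encode_and_report` is hypothetical:

```python
import pandas as pd

# Categories seen in the training set, as in the answer above.
known_categories = ["Sedan", "Coupe", "Limo"]

def encode_and_report(values):
    """Encode against known_categories and list values not seen in training."""
    raw = pd.Series(values)
    # Anything outside the known list will become an all-zero row.
    unseen = sorted(set(raw.dropna()) - set(known_categories))
    cat = pd.Categorical(raw, categories=known_categories)
    return pd.get_dummies(cat, dtype=int), unseen

encoded, unseen = encode_and_report(["Sedan", "Ferrari", "Limo"])
print(encoded)
print("Unseen categories:", unseen)
```

If the unseen list keeps growing, that is probably a sign the model should be retrained with the new categories.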
