While running experiments, we usually train on 70% of the data and test on the remaining 30%. But what happens when your model is in production? The following may occur:
Training Set:
-----------------------
| Ser |Type Of Car |
-----------------------
| 1 | Hatchback |
| 2 | Sedan |
| 3 | Coupe |
| 4 | SUV |
-----------------------
After one-hot encoding this, this is what we get:
-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 |
-----------------------------------------
My model is trained and now I want to deploy it across multiple dealerships. The model is trained on 4 features. Now, a certain dealership only sells Sedans and Coupes:
Test Set :
-----------------------
| Ser |Type Of Car |
-----------------------
| 1 | Coupe |
| 2 | Sedan |
-----------------------
One-Hot Encoding results in :
---------------------------
| Ser | Coupe | Sedan |
---------------------------
| 1 | 1 | 0 |
| 2 | 0 | 1 |
---------------------------
Here our test set has only 2 features. It does not make sense to build a separate model for every new dealership. How do we handle such problems in production? Is there any other encoding method that can handle categorical variables?
The input to your model in production should have the same shape as during training. So if during training you one-hot encoded 4 categories, do the same in production: use zeros for the categories that are absent, and drop any category you did not see during training.
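The idea can be shown without any libraries: fix the vocabulary of categories at training time and always encode against that fixed list. This is a minimal sketch; the category list and the `one_hot` helper are illustrative names, not part of any particular library.

```python
# The 4 categories seen during training, in a fixed order.
TRAIN_CATEGORIES = ["Hatchback", "Sedan", "Coupe", "SUV"]

def one_hot(car_type):
    # Known types set exactly one column to 1; unknown types
    # (e.g. "Ferrari") match nothing and produce an all-zero vector.
    return [1 if car_type == c else 0 for c in TRAIN_CATEGORIES]

print(one_hot("Sedan"))    # [0, 1, 0, 0]
print(one_hot("Ferrari"))  # [0, 0, 0, 0]
```

A dealership that sells only Sedans and Coupes simply never sets the Hatchback and SUV columns to 1; the vector still has 4 entries, so the model's input shape never changes.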
I'll assume you are using pandas to do the one hot encoding. If not, you have to do some more work, but the logic is still the same.
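With pandas, one way to do this is to cast the production column to a categorical dtype whose categories are the ones from training, and then call `get_dummies`. A sketch, assuming the 4 training categories from the question and a production set that also contains an unseen "Ferrari":

```python
import pandas as pd

# Categories fixed at training time (from the training set above).
train_categories = ["Hatchback", "Sedan", "Coupe", "SUV"]

# Production data: two known types plus one never seen in training.
prod = pd.DataFrame({"Type Of Car": ["Coupe", "Sedan", "Ferrari"]})

# Values outside the known categories ("Ferrari") become NaN.
prod["Type Of Car"] = pd.Categorical(prod["Type Of Car"],
                                     categories=train_categories)

# get_dummies emits one column per *training* category, in a fixed
# order, even for categories absent from this dealership's data;
# the NaN row ("Ferrari") gets zeros in every column.
encoded = pd.get_dummies(prod["Type Of Car"])
print(encoded)
```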
Since Ferrari is not among the categories seen during training, all of its one-hot encoded entries are zero. Whenever a new car type appears in your production data, the columns encoding the car type for that row should all be 0.