In experiments, we usually train on about 70% of the data and test on the remaining 30%. But what happens when the model is in production? The following may occur:
Training Set:
---------------------
| Ser | Type Of Car |
---------------------
| 1   | Hatchback   |
| 2   | Sedan       |
| 3   | Coupe       |
| 4   | SUV         |
---------------------
After one-hot encoding, this is what we get:
-----------------------------------------
| Ser | Hatchback | Sedan | Coupe | SUV |
-----------------------------------------
| 1   | 1         | 0     | 0     | 0   |
| 2   | 0         | 1     | 0     | 0   |
| 3   | 0         | 0     | 1     | 0   |
| 4   | 0         | 0     | 0     | 1   |
-----------------------------------------
My model is trained, and now I want to deploy it across multiple dealerships. The model was trained on 4 features. Now, a certain dealership sells only Sedans and Coupes:
Test Set:
---------------------
| Ser | Type Of Car |
---------------------
| 1   | Coupe       |
| 2   | Sedan       |
---------------------
One-hot encoding results in:
-----------------------
| Ser | Coupe | Sedan |
-----------------------
| 1   | 1     | 0     |
| 2   | 0     | 1     |
-----------------------
Here our test set has only 2 features, so its columns no longer match the 4 the model was trained on. It does not make sense to build a model for every new dealership. How should such problems be handled in production? Is there any other encoding method that can be used to handle categorical variables?
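
To make the mismatch concrete, here is a minimal sketch in plain Python. It first reproduces the problem (columns derived from the dealership's own data) and then shows one commonly used remedy, which is an assumption on my part rather than an established answer to the question: learn the category list once at training time and reuse it for every new batch. The helper names `fit_categories` and `one_hot` are hypothetical, and mapping an unseen category to an all-zero row is a design choice of this sketch.

```python
def fit_categories(values):
    """Return the distinct categories in first-seen order (the learned vocabulary)."""
    return list(dict.fromkeys(values))


def one_hot(values, categories):
    """Encode each value against a fixed category list.

    A value that is not in `categories` becomes an all-zero row
    (an assumption of this sketch; raising an error is another option).
    """
    return [[1 if value == category else 0 for category in categories]
            for value in values]


train = ["Hatchback", "Sedan", "Coupe", "SUV"]
test = ["Coupe", "Sedan"]

# Naive approach: derive the columns from whatever data is at hand.
# The dealership's data yields only 2 columns, which the model cannot consume.
naive_columns = fit_categories(test)
naive_rows = one_hot(test, naive_columns)

# Fixed-vocabulary approach: learn the columns once from the training data,
# store them alongside the model, and reuse them for every dealership.
train_columns = fit_categories(train)
test_rows = one_hot(test, train_columns)

print(naive_columns, naive_rows)
print(train_columns, test_rows)
```

With the second approach the dealership's rows always have the same 4 columns, in the same order, as the training data. Library encoders can be used the same way: scikit-learn's `OneHotEncoder`, for instance, is fitted on the training set and then only applied with `transform` at inference time, and its `handle_unknown="ignore"` option encodes unseen categories as all zeros.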