I have 3 sets of data (training, validation and testing) and when I run:
training_x = pd.get_dummies(training_x, columns=['a', 'b', 'c'])
It gives me a certain number of features. But then when I run it across validation data, it gives me a different number and the same for testing. Is there any way to normalize (wrong word, I know) across all data sets so the number of features aligns?
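A minimal reproduction of what I mean (made-up data): the extra category `z` in the validation set produces an extra dummy column.

```python
import pandas as pd

# made-up frames: validation has an extra category 'z' in column 'a'
training_x = pd.DataFrame({'a': ['x', 'y', 'x'], 'b': ['p', 'q', 'p'], 'c': ['m', 'n', 'm']})
validation_x = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': ['p', 'q', 'p'], 'c': ['m', 'n', 'm']})

n_train = pd.get_dummies(training_x, columns=['a', 'b', 'c']).shape[1]
n_valid = pd.get_dummies(validation_x, columns=['a', 'b', 'c']).shape[1]
print(n_train, n_valid)  # 6 7 -- the feature counts no longer match
```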
As already stated, you should normally do one-hot encoding before splitting. But there is another problem: one day you will surely want to apply your trained ML model to data in the wild, i.e. data you have not seen before, and you will need to perform exactly the same dummy transformation as when you trained the model. Then you may have to deal with two cases: the new data may contain categories you did not see during training, or it may be missing categories you did see.
You can address this by using the sklearn equivalent of get_dummies (with just a little more work): OneHotEncoder.
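A minimal sketch of fitting the encoder on the training set only and reusing it on validation data (made-up frames and column names from the question):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up stand-ins for the question's frames
training_x = pd.DataFrame({'a': ['x', 'y'], 'b': ['p', 'q'], 'c': ['m', 'n']})
validation_x = pd.DataFrame({'a': ['x', 'z'], 'b': ['p', 'p'], 'c': ['n', 'n']})
cat_cols = ['a', 'b', 'c']

# fit on the training data only; handle_unknown='ignore' makes unseen
# categories (like 'z' above) encode as all zeros instead of raising an error
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(training_x[cat_cols])

train_ohe = enc.transform(training_x[cat_cols])
valid_ohe = enc.transform(validation_x[cat_cols])
# both now have the same number of columns, fixed by the training data
```

The fitted `enc` can then be persisted (e.g. with joblib) and reused when the model is applied to new data.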
OneHotEncoder lets you separate the identification of the categories from the actual one-hot encoding (the creation of the dummies). You can also save the fitted encoder, so you can apply it later when the model is used. Note the handle_unknown option, which tells the encoder to simply ignore anything unknown it encounters later, instead of raising an error.

One simple solution is to align your validation and test sets to the training dataset after applying the dummies function. Here is how:
Referenced from kaggle: Link
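A sketch of that alignment trick (made-up frames): `DataFrame.align` with `join='left'` keeps exactly the training columns, and `fill_value=0` fills the dummies that are missing on the other side.

```python
import pandas as pd

# made-up frames: category 'c' appears only in train
train = pd.get_dummies(pd.DataFrame({'cat': ['a', 'b', 'c']}), columns=['cat'])
test = pd.get_dummies(pd.DataFrame({'cat': ['a', 'b']}), columns=['cat'])

# keep exactly the training columns; missing dummies in test become 0
train, test = train.align(test, join='left', axis=1, fill_value=0)
```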
Don't forget to add fill_value=0 to avoid NaN in test.

You can convert the datatype of the columns that need to be converted to dummy variables to category.

Dummies should be created before dividing the dataset into train, test or validate.
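The category-dtype suggestion above can be sketched like this (made-up categories): declaring the full category set once makes get_dummies emit the same columns for any split, seen or not.

```python
import pandas as pd

# made-up data; declare the full category set once and reuse it everywhere
cat_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7, 8])

train = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6]}).astype({'col': cat_type})
test = pd.DataFrame({'col': [5, 6, 7, 8]}).astype({'col': cat_type})

# get_dummies emits one column per declared category, seen or not
train_d = pd.get_dummies(train, columns=['col'])
test_d = pd.get_dummies(test, columns=['col'])
```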
Suppose I have train and test dataframes as follows:
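Something like this made-up sketch:

```python
import pandas as pd

# made-up frames: categories 7 and 8 occur only in test
train = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6]})
test = pd.DataFrame({'col': [5, 6, 7, 8]})

n_train = pd.get_dummies(train, columns=['col']).shape[1]  # 6 dummy columns
n_test = pd.get_dummies(test, columns=['col']).shape[1]    # 4 dummy columns
```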
So the dummies for categories 7 and 8 will only be present in test, and thus the two frames will end up with different numbers of features.
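One way to follow that advice, sketched with made-up data: create the dummies on the full dataset, then split.

```python
import pandas as pd

# made-up full dataset: create the dummies first, then split into train/test
data = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6, 5, 6, 7, 8]})
dummies = pd.get_dummies(data, columns=['col'])  # one column per category, 8 total

train_d = dummies.iloc[:6]  # first six rows as train
test_d = dummies.iloc[6:]   # remaining rows as test
```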
Now train and test will have the same set of features.