I'm trying to encode categorical values as dummy vectors. pandas.get_dummies does a perfect job, but the dummy vectors depend on the values present in the DataFrame. How can I encode a second DataFrame using the same dummy vectors as the first DataFrame?
import pandas as pd
df=pd.DataFrame({'cat1':['A','N','K','P'],'cat2':['C','S','T','B']})
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(b)
cat1_A cat1_K cat1_N cat1_P
0 1 0 0 0
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
df_test=pd.DataFrame({'cat1':['A','N'],'cat2':['T','B']})
c=pd.get_dummies(df_test['cat1'],prefix='cat1').astype('int')
print(c)
cat1_A cat1_N
0 1 0
1 0 1
How can I get this output?
cat1_A cat1_K cat1_N cat1_P
0 1 0 0 0
1 0 0 1 0
I was thinking of manually computing the unique values for each column and then creating a dictionary to map the second DataFrame, but I'm sure there is already a function for that... Thanks!
I always use categorical_encoding because it has a great choice of encoders. It also works very nicely with pandas, is pip installable, and is written in line with the sklearn API. This means you can quickly test different types of encoders with the fit and transform methods or in a Pipeline.
If you wish to encode just the first column, as in your example, you can do so by listing it in cols.
By default, the encoded column names use numerical suffixes instead of the original letters, which is helpful when the categories are long strings. This can be changed by passing the use_cat_names=True kwarg, as mentioned by Arthur.
Now we can use the transform method to encode your second DataFrame.
Note that not setting cols defaults to encoding all string columns.

I had the same problem before. This is what I did; it is not necessarily the best way to do it, but it works for me.
Use the astype function with the category values passed in as a parameter.
To apply the same categories to all DataFrames, it is better to store the category values in a variable first.
Then pass that variable to astype when converting the column in each DataFrame.
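The snippets for this answer are missing, so the following is a sketch of the idea rather than the original code. It assumes a recent pandas, where the categories are passed to astype through pd.CategoricalDtype; the name cat1_dtype is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})

# Store the categories seen in the first DataFrame in a reusable dtype.
# Sorting mirrors the column order get_dummies produced for the first DataFrame.
cat1_dtype = pd.CategoricalDtype(categories=sorted(df['cat1'].unique()))

# Cast the second DataFrame's column to that dtype before encoding.
# get_dummies then emits one column per stored category, observed or not.
test_cat1 = df_test['cat1'].astype(cat1_dtype)
c = pd.get_dummies(test_cat1, prefix='cat1').astype('int')
print(c)
```

With this approach, a value in df_test that is not among the stored categories becomes NaN after the cast and ends up as an all-zero row.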