How to encode two Pandas dataframes according to t

Posted 2019-06-09 07:39

Question:

I'm trying to encode categorical values to dummy vectors. pandas.get_dummies does a perfect job, but the dummy vectors depend on the values present in the Dataframe. How to encode a second Dataframe according to the same dummy vectors as the first Dataframe?

import pandas as pd


df=pd.DataFrame({'cat1':['A','N','K','P'],'cat2':['C','S','T','B']})
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')
print(b)



   cat1_A  cat1_K  cat1_N  cat1_P
0       1       0       0       0
1       0       0       1       0
2       0       1       0       0
3       0       0       0       1



df_test=pd.DataFrame({'cat1':['A','N'],'cat2':['T','B']})
c=pd.get_dummies(df_test['cat1'],prefix='cat1').astype('int')
print(c)

   cat1_A  cat1_N
0       1       0
1       0       1

How can I get this output instead?

   cat1_A  cat1_K  cat1_N  cat1_P
0       1       0       0       0
1       0       0       1       0

I was thinking of manually computing the unique values of each column and then building a dictionary to map the second Dataframe, but I'm sure there is already a function for that... Thanks!

Answer 1:

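I had the same problem before. This is what I did. It is not necessarily the best way, but it works for me.

import pandas as pd

df=pd.DataFrame({'cat1':['A','N'],'cat2':['C','S']})

# astype('category', categories=...) was removed from pandas; pass a CategoricalDtype instead
df['cat1'] = df['cat1'].astype(pd.CategoricalDtype(categories=['A','N','K','P']))
# then run get_dummies
b=pd.get_dummies(df['cat1'],prefix='cat1').astype('int')

The trick is to convert the column with astype, passing a CategoricalDtype that lists every category, so that get_dummies emits a column for each category even if it does not occur in the data.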

To apply the same categories to every DataFrame, it is better to store the category values in variables, like this:

cat1_categories = ['A','N','K','P']
cat2_categories = ['C','S','T','B']

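Then use astype like this:

df_test=pd.DataFrame({'cat1':['A','N'],'cat2':['T','B']})
# astype('category', categories=...) was removed from pandas; pass a CategoricalDtype instead
df_test['cat1'] = df_test['cat1'].astype(pd.CategoricalDtype(categories=cat1_categories))
c=pd.get_dummies(df_test['cat1'],prefix='cat1').astype('int')
print(c)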

   cat1_A  cat1_N  cat1_K  cat1_P
0       1       0       0       0
1       0       1       0       0
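
If you would rather not type the category lists by hand, they can be learned from the first DataFrame. The following is only a minimal sketch of that idea, assuming the first DataFrame contains every value you care about; the names df_train, cat1_dtype, train_dummies and test_dummies are made up for this example.

```python
import pandas as pd

df_train = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})

# Learn the category list from the training frame (sorted for a stable column order).
cat1_dtype = pd.CategoricalDtype(categories=sorted(df_train['cat1'].unique()))

# Casting both frames to the same dtype makes get_dummies emit the same columns.
train_dummies = pd.get_dummies(df_train['cat1'].astype(cat1_dtype), prefix='cat1').astype('int')
test_dummies = pd.get_dummies(df_test['cat1'].astype(cat1_dtype), prefix='cat1').astype('int')
print(test_dummies)
```

Note that a value in the second DataFrame that is not in the learned categories becomes NaN after the cast, so its row ends up with zeros in every dummy column.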


Answer 2:

I always use category_encoders because it has a great choice of encoders. It also works very nicely with Pandas, is pip installable, and is written in line with the sklearn API. That means you can quickly test different types of encoders with the fit and transform methods or in a Pipeline.

If you wish to encode just the first column, like in your example, we can do so.

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1'])
# cols=None, all string columns encoded

df_trans = enc_ohe.fit_transform(df)
print(df_trans)

   cat1_0  cat1_1  cat1_2  cat1_3 cat2
0       0       1       0       0    C
1       0       0       0       1    S
2       1       0       0       0    T
3       0       0       1       0    B

By default the column names carry a numeric suffix instead of the original letters, which is helpful when the categories are long strings. To keep the original values in the column names, pass the use_cat_names=True kwarg, as mentioned by Arthur.

Now we can use the transform method to encode your second DataFrame.

df_test = pd.DataFrame({'cat1':['A','N',],'cat2':['T','B']})
df_test_trans = enc_ohe.transform(df_test)

print(df_test_trans)

   cat1_1  cat1_3 cat2
0       1       0    T
1       0       1    B

As the comment on line 5 of the first snippet says, leaving cols unset encodes all string columns.
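
To show what the fit and transform split buys you, here is a rough pandas-only sketch of the same pattern. It is not how category_encoders is implemented; DummyEncoder and its attributes are hypothetical names invented for this example. It simply remembers the dummy columns seen in fit and reindexes to them in transform.

```python
import pandas as pd


class DummyEncoder:
    """Hypothetical helper: remembers the dummy columns seen during fit."""

    def __init__(self, cols):
        self.cols = cols
        self.columns_ = None

    def fit(self, df):
        # Record the dummy column names produced by the first DataFrame.
        self.columns_ = pd.get_dummies(df[self.cols], columns=self.cols, prefix=self.cols).columns
        return self

    def transform(self, df):
        dummies = pd.get_dummies(df[self.cols], columns=self.cols, prefix=self.cols)
        # Columns missing from this frame are added as zeros;
        # columns never seen during fit are dropped.
        return dummies.reindex(columns=self.columns_, fill_value=0).astype('int')


df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})

enc = DummyEncoder(cols=['cat1', 'cat2']).fit(df)
df_test_trans = enc.transform(df_test)
print(df_test_trans)
```

Unlike the library, this sketch silently drops categories that were not present during fit, so check for unseen values yourself if that matters.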