How do I pivot one DataFrame column to a truth tab

2019-06-25 12:09发布

问题:

I have one df with a user_id and a category. I'd like to transform this to a truth table for whether or not that user has at least one entry for that category. However, the final table should also include columns for all categories that appear in 'df_list', which may not appear at all in df.

Right now I create the truth table with a groupby + size and then check if any columns are missing, and then manually set those columns to False, but I was wondering if there was a way to accomplish this in the initial groupby step.

Here's an example:

import pandas as pd
df = pd.DataFrame({'user_id': [1,1,1,2,2],
                 'category': ['A', 'B', 'D', 'A', 'F']})
df_list = pd.DataFrame({'category': ['A', 'B', 'C', 'D', 'E', 'F']})

df_truth = df.groupby(['user_id', 'category']).size().unstack(fill_value=0).astype(bool)
#category     A      B      D      F
#user_id                            
#1         True   True   True  False
#2         True  False  False   True

To then get to the desired output I then do:

missing_vals = df_list.category.unique()[~pd.Series(df_list.category.unique()).isin(df_truth.columns)]
for element in missing_vals:
    df_truth.loc[:,element] = False
#category     A      B      D      F      C      E
#user_id                                          
#1         True   True   True  False  False  False
#2         True  False  False   True  False  False

回答1:

Option 1
crosstab
I'd recommend converting that column to a categorical dtype. crosstab/pivot will then handle the rest.

i = df.user_id
j = pd.Categorical(df.category, categories=df_list.category)

pd.crosstab(i, j).astype(bool)

col_0       A      B      C      D      E      F
user_id                                         
1        True   True  False   True  False  False
2        True  False  False  False  False   True 

Option 2
unstack + reindex
To fix your existing code, you can simplify the second step with reindex:

(df.groupby(['user_id', 'category'])
   .size()
   .unstack(fill_value=0)
   .reindex(df_list.category, axis=1, fill_value=0)
   .astype(bool)
)

category     A      B      C      D      E      F
user_id                                          
1         True   True  False   True  False  False
2         True  False  False  False  False   True