I have one df
with a user_id
and a category
. I'd like to transform this to a truth table for whether or not that user has at least one entry for that category. However, the final table should also include columns for all categories that appear in 'df_list', which may not appear at all in df
.
Right now I create the truth table with a groupby
+ size
and then check if any columns are missing, and then manually set those columns to False
, but I was wondering if there was a way to accomplish this in the initial groupby
step.
Here's an example:
import pandas as pd
df = pd.DataFrame({'user_id': [1,1,1,2,2],
'category': ['A', 'B', 'D', 'A', 'F']})
df_list = pd.DataFrame({'category': ['A', 'B', 'C', 'D', 'E', 'F']})
df_truth = df.groupby(['user_id', 'category']).size().unstack(fill_value=0).astype(bool)
#category A B D F
#user_id
#1 True True True False
#2 True False False True
To then get to the desired output I then do:
missing_vals = df_list.category.unique()[~pd.Series(df_list.category.unique()).isin(df_truth.columns)]
for element in missing_vals:
df_truth.loc[:,element] = False
#category A B D F C E
#user_id
#1 True True True False False False
#2 True False False True False False
Option 1
crosstab
I'd recommend converting that column to a categorical dtype.
crosstab
/pivot
will then handle the rest.Option 2
unstack
+reindex
To fix your existing code, you can simplify the second step with
reindex
: