Given a dataframe, collapsing values into a set per group for a column is straightforward:
df.groupby('A')['B'].apply(set)
But how do you do it in a pythonic way if you want to do it on multiple columns and the result to be in a dataframe?
For example for the following dataframe:
import pandas as pd
df = pd.DataFrame({'user_id': [1, 2, 3, 4, 1, 2, 3],
'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga', 'Ju-jitsu','Krav Maga', 'Karate'],
'instructor': ['Bob', 'Alice','Bob', 'Alice','Alice', 'Alice','Bob']})
The result wanted is the data frame below produced in a pythonic way:
|user_id|class_type |instructor |
|-------|-----------------------|---------------|
| 1 | {Krav Maga, Ju-jitsu} | {Bob, Alice} |
| 2 | {Krav Maga, Yoga} | {Alice} |
| 3 | {Karate, Ju-jitsu} | {Bob} |
| 4 | {Krav Maga} | {Alice} |
This is a dummy example. The question spurred from: "what if I have a table with 30 columns and I want to achieve this in a pythonic way?"
Currently I have a solution but I don't think is the best way to do it:
df[['grouped_B', 'grouped_C']] = df.groupby('A')[['B','C']].transform(set)
deduped_and_collapsed_df = df.groupby('A')[['A','grouped_B', 'grouped_C']].head(1)
Thank you in advance!