I have a multiple dataframe like below.
df1 = pd.DataFrame({'Col1':["aaa","ddd","ggg"],'Col2':["bbb","eee","hhh"],'Col3':"ccc","fff","iii"]})
df2= pd.DataFrame({'Col1':["aaa","zzz","qqq"],'Col2':["bbb","xxx","eee"],'Col3':["ccc", yyy","www"]})
df3= pd.DataFrame({'Col1':"rrr","zzz","qqq","ppp"],'Col2':"ttt","xxx","eee","ttt"],'Col3':"yyy","yyy","www","qqq"]})
The dataframe has 3 columns and sometimes their rows overlap among the dataframes. (e.g. df1 and df2 has an identical row as "aaa, bbb, ccc").
I want to know how the rows overlap among dataframes and want to make an output like below.
In this output, if an identical row is detected in the dataframe, the output will be 1, otherwise 0. Does anyone know how to make this output?
In the actual data, I have ~100 dataframes. I first tried to use pd.merge but could not apply this to 100 dataframes...
Thank you very much for your help.
Here is one way using concat
and get_dummies
:
l = [df1,df2,df3] #create a list of dataframes
final = pd.concat([i.assign(key=f"df{e+1}") for e,i in enumerate(l)],sort=False)
final = (final.assign(**pd.get_dummies(final.pop('key')))
.groupby(['Col1','Col2','Col3']).max().reset_index())
Col1 Col2 Col3 df1 df2 df3
0 aaa bbb ccc 1 1 0
1 ddd eee fff 1 0 0
2 ggg hhh iii 1 0 0
3 ppp ttt qqq 0 0 1
4 qqq eee www 0 1 1
5 rrr ttt yyy 0 0 1
6 zzz xxx yyy 0 1 1
Using pandas.concat
and groupby
:
dfs = [df1,df2,df3]
dfs = [d.assign(df='df%s' % n) for n, d in enumerate(dfs, start=1)]
new_df = pd.concat(dfs).groupby(['Col1', 'Col2', 'Col3','df']).size().unstack(fill_value=0)
print(new_df)
Output:
df df1 df2 df3
Col1 Col2 Col3
aaa bbb ccc 1 1 0
ddd eee fff 1 0 0
ggg hhh iii 1 0 0
ppp ttt qqq 0 0 1
qqq eee www 0 1 1
rrr ttt yyy 0 0 1
zzz xxx yyy 0 1 1
Setup:
df1 = pd.DataFrame({'Col1':["aaa","ddd","ggg"],'Col2':["bbb","eee","hhh"],'Col3':["ccc","fff","iii"]})
df2= pd.DataFrame({'Col1':["aaa","zzz","qqq"],'Col2':["bbb","xxx","eee"],'Col3':["ccc", "yyy","www"]})
df3= pd.DataFrame({'Col1':["rrr","zzz","qqq","ppp"],'Col2':["ttt","xxx","eee","ttt"],'Col3':["yyy","yyy","www","qqq"]})
Solution:
First create a indicate column for each dataframe, then concat, groupby and sum.
df1['df1'] = df2['df2'] = df3['df3'] = 1
(
pd.concat([df1, df2, df3], sort=False)
.groupby(by=['Col1', 'Col2', 'Col3'])
.max().astype(int)
.reset_index()
)
Col1 Col2 Col3 df1 df2 df3
0 aaa bbb ccc 1 1 0
1 ddd eee fff 1 0 0
2 ggg hhh iii 1 0 0
3 ppp ttt qqq 0 0 1
4 qqq eee www 0 1 1
5 rrr ttt yyy 0 0 1
6 zzz xxx yyy 0 1 1