Concat Python dataframes based on unique rows

Posted 2019-02-25 22:01

Question:

My dataframes read like:

df1

user_id    username firstname lastname
 123         abc      abc       abc
 456         def      def       def
 789         ghi      ghi       ghi

df2

user_id    username firstname lastname
 111         xyz      xyz       xyz
 456         def      def       def
 234         mnp      mnp       mnp

Now I want an output dataframe like:

user_id    username firstname lastname
 123         abc      abc       abc
 456         def      def       def
 789         ghi      ghi      ghi
 111         xyz      xyz       xyz
 234         mnp      mnp       mnp

As user_id 456 is common across both dataframes, it should appear only once in the output. I have tried grouping on user_id with groupby(['user_id']), but it looks like groupby needs to be followed by some aggregation, which I don't want here.

Answer 1:

Use concat + drop_duplicates:

df = pd.concat([df1, df2]).drop_duplicates('user_id').reset_index(drop=True)
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp
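
By default drop_duplicates keeps the first occurrence, so on a clash the row from df1 wins. If df2's version should win instead, the keep parameter can be flipped; a minimal sketch:

# Keep df2's row for a duplicated user_id by keeping the *last* occurrence.
df = (pd.concat([df1, df2])
        .drop_duplicates('user_id', keep='last')
        .reset_index(drop=True))

Note the kept row stays at its position within df2's part of the concatenation, so the output order shifts accordingly.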

A solution using groupby and aggregating with first is slower:

df = pd.concat([df1, df2]).groupby('user_id', as_index=False, sort=False).first()
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp
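
One behavioral difference worth knowing: GroupBy.first returns the first non-null value per column within each group, whereas drop_duplicates keeps a whole row as-is. A small sketch with hypothetical one-row frames containing a NaN shows the divergence:

import numpy as np
import pandas as pd

d1 = pd.DataFrame({'user_id': [456], 'username': [np.nan]})
d2 = pd.DataFrame({'user_id': [456], 'username': ['def']})
both = pd.concat([d1, d2])

print(both.drop_duplicates('user_id'))                   # username stays NaN
print(both.groupby('user_id', as_index=False).first())   # username is 'def'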

EDIT:

Another solution with boolean indexing and numpy.in1d:

df = pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp
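
The same mask can be written with pandas' own Series.isin, which reads a bit more naturally and avoids dropping to NumPy (np.isin is also the modern replacement for np.in1d); a sketch:

# Equivalent to the np.in1d version: keep the df2 rows whose user_id
# is absent from df1, then append them to df1.
df = pd.concat([df1, df2[~df2['user_id'].isin(df1['user_id'])]],
               ignore_index=True)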


Answer 2:

One approach with masking -

def app1(df1, df2):
    # Keep only the df2 rows whose user_id is absent from df1.
    df20 = df2[~df2.user_id.isin(df1.user_id)]
    return pd.concat([df1, df20], axis=0)

Three more approaches work on the underlying array data, using np.in1d or np.searchsorted to get the mask of matches, then stacking the two pieces and constructing an output dataframe from the stacked array -

def app2(df1, df2):
    # Mask df2's rows by whether their user_id already appears in df1.
    df20_arr = df2.values[~np.in1d(df2.user_id.values, df1.user_id.values)]
    arr = np.vstack((df1.values, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out

def app3(df1, df2):
    a = df1.values
    b = df2.values

    # Same masking as app2, but working on the extracted arrays directly.
    df20_arr = b[~np.in1d(b[:, 0], a[:, 0])]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out

def app4(df1, df2):
    a = df1.values
    b = df2.values

    b0 = b[:, 0].astype(int)
    as0 = np.sort(a[:, 0].astype(int))
    # Clip the insertion indices so ids larger than df1's maximum
    # don't index past the end of as0.
    idx = np.minimum(np.searchsorted(as0, b0), len(as0) - 1)
    df20_arr = b[as0[idx] != b0]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
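
For reference, the sample frames from the question can be reconstructed like this before running the timings below (a sketch of the presumed setup):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'user_id': [123, 456, 789],
                    'username': ['abc', 'def', 'ghi'],
                    'firstname': ['abc', 'def', 'ghi'],
                    'lastname': ['abc', 'def', 'ghi']})

df2 = pd.DataFrame({'user_id': [111, 456, 234],
                    'username': ['xyz', 'def', 'mnp'],
                    'firstname': ['xyz', 'def', 'mnp'],
                    'lastname': ['xyz', 'def', 'mnp']})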

Timings for the given sample -

In [49]: %timeit app1(df1,df2)
    ...: %timeit app2(df1,df2)
    ...: %timeit app3(df1,df2)
    ...: %timeit app4(df1,df2)
    ...: 
1000 loops, best of 3: 753 µs per loop
10000 loops, best of 3: 192 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 171 µs per loop

# @jezrael's edited solution
In [85]: %timeit pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
1000 loops, best of 3: 614 µs per loop

It would be interesting to see how these fare on larger datasets.
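
A minimal sketch of such a test, assuming mostly-unique random integer ids and arbitrary string columns (the sizes and names here are illustrative, not from the original benchmarks):

import numpy as np
import pandas as pd

def make_frame(n, seed):
    # Draw n unique ids from a range 10x larger, so overlap between
    # two independently generated frames is possible but sparse.
    rng = np.random.default_rng(seed)
    ids = rng.choice(10 * n, size=n, replace=False)
    s = pd.Series(ids).astype(str)
    return pd.DataFrame({'user_id': ids, 'username': s,
                         'firstname': s, 'lastname': s})

big1, big2 = make_frame(100_000, 0), make_frame(100_000, 1)
# Then e.g.: %timeit app1(big1, big2)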



Answer 3:

Another approach is to use np.setdiff1d to find the user_id values that appear only in df2, and keep just those rows:

pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])

Or use a set to collect the unique rows from the stacked records of df1 and df2. This one seems to be a few times faster, though note that it deduplicates on entire rows rather than on user_id alone, and it does not preserve row order.

pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values,df2.values]}),columns=df1.columns)

Timings:

%timeit pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])
1000 loops, best of 3: 2.48 ms per loop

%timeit pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values,df2.values]}),columns=df1.columns)
1000 loops, best of 3: 632 µs per loop