My dataframes read like this:
df1

    user_id  username  firstname  lastname
    123      abc       abc        abc
    456      def       def        def
    789      ghi       ghi        ghi

df2

    user_id  username  firstname  lastname
    111      xyz       xyz        xyz
    456      def       def        def
    234      mnp       mnp        mnp
Now I want an output dataframe like:
    user_id  username  firstname  lastname
    123      abc       abc        abc
    456      def       def        def
    789      ghi       ghi        ghi
    111      xyz       xyz        xyz
    234      mnp       mnp        mnp
user_id 456 is common across both dataframes, so it should appear only once in the output. I have tried groupby on user_id with groupby(['user_id']), but it looks like groupby needs to be followed by some aggregation, which I don't want here.
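For reference, the frames above can be built like this (a minimal sketch; the column dtypes are assumed):

    import pandas as pd

    df1 = pd.DataFrame({'user_id': [123, 456, 789],
                        'username': ['abc', 'def', 'ghi'],
                        'firstname': ['abc', 'def', 'ghi'],
                        'lastname': ['abc', 'def', 'ghi']})

    df2 = pd.DataFrame({'user_id': [111, 456, 234],
                        'username': ['xyz', 'def', 'mnp'],
                        'firstname': ['xyz', 'def', 'mnp'],
                        'lastname': ['xyz', 'def', 'mnp']})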
Use concat + drop_duplicates:
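Something like the following, assuming user_id alone identifies a row (drop_duplicates keeps the first occurrence by default, so df1's version of a shared row wins):

    df = (pd.concat([df1, df2], ignore_index=True)
            .drop_duplicates(subset=['user_id'])
            .reset_index(drop=True))
    print(df)

This gives the five unique users in the desired order.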
A solution with groupby and the aggregation first is slower:
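A sketch of that variant (sort=False keeps the groups in order of first appearance):

    df = (pd.concat([df1, df2], ignore_index=True)
            .groupby('user_id', as_index=False, sort=False)
            .first())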
EDIT:
Another solution with boolean indexing and numpy.in1d:
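Roughly like this - keep all of df1 plus only the df2 rows whose user_id is new (np.in1d is called np.isin in newer NumPy):

    import numpy as np

    # boolean mask of df2 ids that already occur in df1
    mask = np.in1d(df2['user_id'], df1['user_id'])
    df = pd.concat([df1, df2[~mask]], ignore_index=True)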
:One approach with masking -
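A sketch using Series.isin as the mask (an equivalent way to flag the overlapping ids):

    # mask the df2 rows whose user_id already appears in df1
    mask = df2['user_id'].isin(df1['user_id'])
    out = pd.concat([df1, df2[~mask]], ignore_index=True)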
Two more approaches use the underlying array data, with np.in1d or np.searchsorted to get the mask of matches, then stack the two arrays and construct an output dataframe from the stacked array data -
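A sketch of both variants on the raw arrays (the searchsorted route assumes df1's ids may be sorted; both masks come out identical, and the rebuilt frame has object dtype):

    import numpy as np
    import pandas as pd

    ids1 = df1['user_id'].values
    ids2 = df2['user_id'].values

    # variant 1: np.in1d flags the df2 ids not present in df1
    mask = ~np.in1d(ids2, ids1)

    # variant 2: np.searchsorted against the sorted df1 ids
    sorted_ids = np.sort(ids1)
    pos = np.searchsorted(sorted_ids, ids2)
    pos[pos == len(sorted_ids)] = 0      # clamp out-of-range positions
    mask2 = sorted_ids[pos] != ids2      # identical to mask

    # stack df1's rows with the unmatched df2 rows, rebuild a dataframe
    out = pd.DataFrame(np.vstack((df1.values, df2.values[mask])),
                       columns=df1.columns)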
It would be interesting to see how these fare on larger datasets.
Another approach is to use np.in1d to check for duplicate user_id.
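For example (a sketch; the same idea as the boolean-indexing answer above):

    import numpy as np

    # df2 rows whose user_id is not already present in df1
    extra = df2[~np.in1d(df2['user_id'].values, df1['user_id'].values)]
    out = pd.concat([df1, extra], ignore_index=True)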
Or use a set to collect the unique rows from the combined records of df1 and df2. This one seems to be a few times faster.
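A sketch of the set route - note it deduplicates on the full row (which works here because the shared row is identical in both frames), and since a set loses the input order, the result is sorted by user_id for determinism:

    # union of unique row tuples from both frames
    rows = set(map(tuple, df1.values)) | set(map(tuple, df2.values))
    out = pd.DataFrame(sorted(rows), columns=df1.columns)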