I have two dataframes of different size (df1
nad df2
). I would like to remove from df1
all the rows which are stored within df2
.
So if I have df2
equals to:
A B
0 wer 6
1 tyu 7
And df1
equals to:
A B C
0 qwe 5 a
1 wer 6 s
2 wer 6 d
3 rty 9 f
4 tyu 7 g
5 tyu 7 h
6 tyu 7 j
7 iop 1 k
The final result should be like so:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I was able to achieve my goal by using a for loop but I would like to know if there is a better and more elegant and efficient way to perform such operation.
Here is the code I wrote in case you need it:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
for i, row in df2.iterrows():
df1 = df1[(df1['A']!=row['A']) & (df1['B']!=row['B'])].reset_index(drop=True)
Use merge
with outer join with filter by query
, last remove helper column by drop
:
df = pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
.query("_merge != 'both'")
.drop('_merge', axis=1)
.reset_index(drop=True)
print (df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
You can use np.in1d to check if any row in df1 exists in df2. And then use it as a reversed mask to select rows from df1.
df1[~df1[['A','B']].apply(lambda x: np.in1d(x,df2).all(),axis=1)]\
.reset_index(drop=True)
Out[115]:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
pandas
has a method called isin
, however this relies on unique indices. We can define a lambda function to create columns we can use in this from the existing 'A'
and 'B'
of df1
and df2
. We then negate this (as we want the values not in df2
) and reset the index:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True)
printing:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
The cleanest way I found was to use drop from pandas using the index of the dataframe you want to drop:
df1.drop(df2.index, axis=0,inplace=True)