How to remove a pandas dataframe from another dataframe, just like the set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
And now we have two pandas dataframe, how to remove df2 from df1:
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
a b
0 1 2
1 3 4
2 5 6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
a b
0 1 2
1 5 6
Then we expect df1-df2 result will be:
In [14]: df
Out[14]:
a b
0 3 4
How to do it?
Thank you.
An easiest option is to use indexes.
Append df1 and df2 and reset their indexes.
df = df1.concat(df2)
df.reset_index(inplace=True)
e.g:
This will give df2 indexes
indexes_df2 = df.index[ (df["a"].isin(df2["a"]) ) & (df["b"].isin(df2["b"]) ) result_index = df.index[~index_df2] result_data = df.iloc[ result_index,:]
Hope it will help to new readers, although the question posted a little time ago :)
My shot with merge df1 and df2 from the question.
Using 'indicator' parameter
You can use
.duplicated
, which has the benefit of being fairly expressive:For comparison:
In sum, using the
np.array
comparison is fastest. Don't need the.tolist()
there.Solution
Use
pd.concat
followed bydrop_duplicates(keep=False)
It looks like
Explanation
pd.concat
adds the twoDataFrame
s together by appending one right after the other. if there is any overlap, it will be captured by thedrop_duplicates
method. However,drop_duplicates
by default leaves the first observation and removes every other observation. In this case, we want every duplicate removed. Hence, thekeep=False
parameter which does exactly that.A special note to the repeated
df2
. With only onedf2
any row indf2
not indf1
won't be considered a duplicate and will remain. This solution with only onedf2
only works whendf2
is a subset ofdf1
. However, if we concatdf2
twice, it is guaranteed to be a duplicate and will subsequently be removed.I think the first
tolist()
needs to be removed, but keep the second one:A masking approach