How to remove a pandas dataframe from another data

2020-05-15 15:04发布

How to remove a pandas dataframe from another dataframe, just like the set subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

And now we have two pandas dataframe, how to remove df2 from df1:

In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6


In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6

Then we expect df1-df2 result will be:

In [14]: df
Out[14]:
   a  b
0  3  4

How to do it?

Thank you.

7条回答
等我变得足够好
2楼-- · 2020-05-15 15:07

An easiest option is to use indexes.

  1. Append df1 and df2 and reset their indexes.

    df = df1.concat(df2)
    df.reset_index(inplace=True)

  2. e.g:
    This will give df2 indexes

    indexes_df2 = df.index[ (df["a"].isin(df2["a"]) ) & (df["b"].isin(df2["b"]) ) result_index = df.index[~index_df2] result_data = df.iloc[ result_index,:]

Hope it will help to new readers, although the question posted a little time ago :)

查看更多
【Aperson】
3楼-- · 2020-05-15 15:10

My shot with merge df1 and df2 from the question.

Using 'indicator' parameter

In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]: 
   a  b
1  3  4
查看更多
Melony?
4楼-- · 2020-05-15 15:11

You can use .duplicated, which has the benefit of being fairly expressive:

%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]

1000 loops, best of 3: 875 µs per loop

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

100 loops, best of 3: 4.57 ms per loop


%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

1000 loops, best of 3: 987 µs per loop


%timeit df2[df2.apply(lambda x: x.value not in df2.values, axis=1)]

1000 loops, best of 3: 546 µs per loop

In sum, using the np.array comparison is fastest. Don't need the .tolist() there.

查看更多
可以哭但决不认输i
5楼-- · 2020-05-15 15:20

Solution

Use pd.concat followed by drop_duplicates(keep=False)

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

It looks like

   a  b
1  3  4

Explanation

pd.concat adds the two DataFrames together by appending one right after the other. if there is any overlap, it will be captured by the drop_duplicates method. However, drop_duplicates by default leaves the first observation and removes every other observation. In this case, we want every duplicate removed. Hence, the keep=False parameter which does exactly that.

A special note to the repeated df2. With only one df2 any row in df2 not in df1 won't be considered a duplicate and will remain. This solution with only one df2 only works when df2 is a subset of df1. However, if we concat df2 twice, it is guaranteed to be a duplicate and will subsequently be removed.

查看更多
爱情/是我丢掉的垃圾
6楼-- · 2020-05-15 15:26

I think the first tolist() needs to be removed, but keep the second one:

df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]
查看更多
Animai°情兽
7楼-- · 2020-05-15 15:30

A masking approach

df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

   a  b
1  3  4
查看更多
登录 后发表回答