I'm looking for the fastest and idiomatic analog to SQL MINUS (AKA EXCEPT) operator.

Here is what I mean - given two Pandas DataFrames as follows:

In [77]: d1
Out[77]:
   a  b  c
0  0  0  1
1  0  1  2
2  1  0  3
3  1  1  4
4  0  0  5
5  1  1  6
6  2  2  7

In [78]: d2
Out[78]:
   a  b   c
0  1  1  10
1  0  0  11
2  1  1  12

How to find a result of d1 MINUS d2 taking into account only columns "a" and "b" in order to get the following result:

In [62]: res
Out[62]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

MVCE:

d1 = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 1, 2], 
    'b': [0, 1, 0, 1, 0, 1, 2], 
    'c': [1, 2, 3, 4, 5, 6, 7]
})

d2 = pd.DataFrame({
    'a': [1, 0, 1], 
    'b': [1, 0, 1], 
    'c': [10, 11, 12]
})

What have I tried:

In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])

In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)

In [67]: res = d1.loc[tmp1.loc[idx, "index"]]

In [68]: res
Out[68]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

it gives me correct results, but I have a feeling that there must be a more idiomatic and nicer / cleaner way to achieve that.

PS DataFrame.isin() method won't help in this case as it'll produce a wrong result set

标签： python sql pandas dataframe

5条回答

够拽才男人

2楼-- · 2020-01-30 01:47

I am thinking a little bit like excel here:

d1[~d1[['a','b']].astype(str).sum(axis=1).isin(d2[['a','b']].astype(str).sum(axis=1))]

0人赞添加讨论(0) 举报

ら.Afraid

3楼-- · 2020-01-30 01:58

We can use pandas.concat with drop_duplicates here and pass it the argument to drop all duplicates with keep=False:

pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)

   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

Edit after comment by OP

If you want to make sure that unique rows in df2 arnt taken into account, we can duplicate that df:

pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)

   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

0人赞添加讨论(0) 举报

Luminary・发光体

4楼-- · 2020-01-30 01:58

I had similar question, I tried your idea

(
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])

In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)

In [67]: res = d1.loc[tmp1.loc[idx, "index"]]

)

for test and it works.

However, I use the way in my sqlite, tow databases that have the same Structure,that means its tables and tables' columns are the same, and it occurred some mistakes, it shows that this two df seems don't have the same shap.

if u r happy to give me a hand and want more details, we can have a further conversation thanks a lot

0人赞添加讨论(0) 举报

倾城　Initia

5楼-- · 2020-01-30 02:06

One possible solution with merge and indicator=True:

df = (d1.reset_index()
        .merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
        .query('_merge == "left_only"')
        .set_index('index')
        .rename_axis(None)
        .reindex(d1.columns, axis=1))
print (df)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

Solution with isin:

df = d1[~d1.set_index(["a", "b"]).index.isin(d2.set_index(["a","b"]).index)]
print (df)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

0人赞添加讨论(0) 举报

迷人小祖宗

6楼-- · 2020-01-30 02:06

Execution time comparison for larger data sets:

In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)

In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)

In [102]: df1.shape
Out[102]: (700000, 3)

In [103]: df2.shape
Out[103]: (300000, 3)

`pd.concat().drop_duplicates()` approach:

In [10]: %%timeit
    ...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
    ...:
    ...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

multi-index NOT IS IN approach:

In [11]: %%timeit
    ...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
    ...:
    ...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

multi-index difference approach:

In [12]: %%timeit
    ...: tmp1 = df1.reset_index().set_index(["a", "b"])
    ...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
    ...: res = df1.loc[tmp1.loc[idx, "index"]]
    ...:
    ...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

`merge(how="outer")` approach - gives me a `MemoryError`:

In [106]: %%timeit
     ...: res =  (df1.reset_index()
     ...:         .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
     ...:         .query('_merge == "left_only"')
     ...:         .set_index('index')
     ...:         .rename_axis(None)
     ...:         .reindex(df1.columns, axis=1))
     ...:
     ...:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)

compare concatenated strings approach:

In [13]: %%timeit
    ...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
    ...:
    ...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0人赞添加讨论(0) 举报

Pandas analogue to SQL MINUS / EXCEPT operator, us

Execution time comparison for larger data sets:

pd.concat().drop_duplicates() approach:

multi-index NOT IS IN approach:

multi-index difference approach:

merge(how="outer") approach - gives me a MemoryError:

compare concatenated strings approach:

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间

`pd.concat().drop_duplicates()` approach:

`merge(how="outer")` approach - gives me a `MemoryError`: