I'm looking for the fastest and idiomatic analog to SQL MINUS (AKA EXCEPT) operator.
Here is what I mean - given two Pandas DataFrames as follows:
In [77]: d1
Out[77]:
a b c
0 0 0 1
1 0 1 2
2 1 0 3
3 1 1 4
4 0 0 5
5 1 1 6
6 2 2 7
In [78]: d2
Out[78]:
a b c
0 1 1 10
1 0 0 11
2 1 1 12
How to find a result of d1 MINUS d2
taking into account only columns "a"
and "b"
in order to get the following result:
In [62]: res
Out[62]:
a b c
1 0 1 2
2 1 0 3
6 2 2 7
MVCE:
d1 = pd.DataFrame({
'a': [0, 0, 1, 1, 0, 1, 2],
'b': [0, 1, 0, 1, 0, 1, 2],
'c': [1, 2, 3, 4, 5, 6, 7]
})
d2 = pd.DataFrame({
'a': [1, 0, 1],
'b': [1, 0, 1],
'c': [10, 11, 12]
})
What have I tried:
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
In [68]: res
Out[68]:
a b c
1 0 1 2
2 1 0 3
6 2 2 7
it gives me correct results, but I have a feeling that there must be a more idiomatic and nicer / cleaner way to achieve that.
PS DataFrame.isin() method won't help in this case as it'll produce a wrong result set
I am thinking a little bit like excel here:
We can use
pandas.concat
withdrop_duplicates
here and pass it the argument to drop all duplicates withkeep=False
:Edit after comment by OP
If you want to make sure that unique rows in
df2
arnt taken into account, we can duplicate thatdf
:I had similar question, I tried your idea
for test and it works.
However, I use the way in my sqlite, tow databases that have the same Structure,that means its tables and tables' columns are the same, and it occurred some mistakes, it shows that this two df seems don't have the same shap.
if u r happy to give me a hand and want more details, we can have a further conversation thanks a lot
One possible solution with
merge
andindicator=True
:Solution with
isin
:Execution time comparison for larger data sets:
pd.concat().drop_duplicates()
approach:multi-index NOT IS IN approach:
multi-index difference approach:
merge(how="outer")
approach - gives me aMemoryError
:compare concatenated strings approach: