Inequality joins in Pandas?

I usually use Dataframe.merge to combine dataframes in pandas. From my understanding, this only works on equality joins. What is the idiomatic way to join two dataframes using other types of checks (e.g. inequality)?

标签： python join pandas merge dataframe

2条回答

够拽才男人

2楼-- · 2020-03-01 18:30

merge() is fairly limited. You can accomplish more complex joins using pandasql.sqldf. You can write pretty much any sql query and refer to your existing dataframes as table names in the sql statements.
https://github.com/yhat/pandasql/ A known bug is the inability to select multiple tables in product joins, such as

select d1.something, d2.something else from df1 as d1, df2 as d2 where d1.date=d2.date

However, if you can do joins without any problem, and a statement like I have above can be translated into a join.

0人赞添加讨论(0) 举报

唯我独甜

3楼-- · 2020-03-01 18:34

Pandas merge() allows for outer, left, right joins (not just inner joins) between two data frames, so you can return unmatched records. Additionally, merge() can even be generalized to return a cross join (all combination matches between two data frames) and with filtering afterwards you can return unmatched records. Still more, there is the isin() pandas method.

Consider the following demonstration. Below are two data frames of something we come to enjoy, computer languages. As seen, the first data frame is a subset of second data frame. An outer join returns records in both with NaN for unmatched columns which can be later filtered out. A cross join returns full complete rows which can be filtered and isin() searches values within columns:

import pandas as pd

df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'],
                    'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'], 
                    'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']})

df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP',
                                 'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'],
                    'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web',
                            'application', 'computing', 'web', 'application', 'web', 'computing'],
                    'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source',
                            'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary',
                            'Open-Source', 'Proprietary']})    

# OUTER JOIN 
mergedf = pd.merge(df1, df2, on=['Languages'], how='outer')
# FILTER OUT LANGUAGES IN SMALLER THAT IS NULL
mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']]

#     Languages       Uses_y       Type_y
#6         Perl  application  Open-Source
#7            R    computing  Open-Source
#8         Ruby          web  Open-Source
#9       VB.NET  application  Proprietary
#10  Javascript          web  Open-Source
#11      Matlab    computing  Proprietary


# ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER
unequaldf = df2[~df2.Languages.isin(df1['Languages'])]

#     Languages         Type         Uses
#6         Perl  Open-Source  application
#7            R  Open-Source    computing
#8         Ruby  Open-Source          web
#9       VB.NET  Proprietary  application
#10  Javascript  Open-Source          web
#11      Matlab  Proprietary    computing


# CROSS JOIN 
df1['key'] = 1                 # (REQUIRES A JOIN KEY OF SAME VALUE)
df2['key'] = 1                    
crossjoindf = pd.merge(df1, df2, on=['key'])
# FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN)
crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\
                    [['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates()

#   Languages_y       Uses_y       Type_y
#6         Perl  application  Open-Source
#7            R    computing  Open-Source
#8         Ruby          web  Open-Source
#9       VB.NET  application  Proprietary
#10  Javascript          web  Open-Source
#11      Matlab    computing  Proprietary

Admittedly, the cross join may be redundant and verbose here but should your unmatched needs require permutations across data frames, it can be handy.

0人赞添加讨论(0) 举报

Inequality joins in Pandas?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间