I usually use Dataframe.merge to combine dataframes in pandas. From my understanding, this only works on equality joins. What is the idiomatic way to join two dataframes using other types of checks (e.g. inequality)?
问题:
回答1:
Pandas merge() allows for outer
, left
, right
joins (not just inner
joins) between two data frames, so you can return unmatched records. Additionally, merge()
can even be generalized to return a cross join (all combination matches between two data frames) and with filtering afterwards you can return unmatched records. Still more, there is the isin() pandas method.
Consider the following demonstration. Below are two data frames of something we come to enjoy, computer languages. As seen, the first data frame is a subset of second data frame. An outer join returns records in both with NaN
for unmatched columns which can be later filtered out. A cross join returns full complete rows which can be filtered and isin()
searches values within columns:
import pandas as pd
df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'],
'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'],
'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']})
df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP',
'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'],
'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web',
'application', 'computing', 'web', 'application', 'web', 'computing'],
'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source',
'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary',
'Open-Source', 'Proprietary']})
# OUTER JOIN
mergedf = pd.merge(df1, df2, on=['Languages'], how='outer')
# FILTER OUT LANGUAGES IN SMALLER THAT IS NULL
mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']]
# Languages Uses_y Type_y
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
# ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER
unequaldf = df2[~df2.Languages.isin(df1['Languages'])]
# Languages Type Uses
#6 Perl Open-Source application
#7 R Open-Source computing
#8 Ruby Open-Source web
#9 VB.NET Proprietary application
#10 Javascript Open-Source web
#11 Matlab Proprietary computing
# CROSS JOIN
df1['key'] = 1 # (REQUIRES A JOIN KEY OF SAME VALUE)
df2['key'] = 1
crossjoindf = pd.merge(df1, df2, on=['key'])
# FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN)
crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\
[['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates()
# Languages_y Uses_y Type_y
#6 Perl application Open-Source
#7 R computing Open-Source
#8 Ruby web Open-Source
#9 VB.NET application Proprietary
#10 Javascript web Open-Source
#11 Matlab computing Proprietary
Admittedly, the cross join may be redundant and verbose here but should your unmatched needs require permutations across data frames, it can be handy.
回答2:
merge() is fairly limited. You can accomplish more complex joins using pandasql.sqldf. You can write pretty much any sql query and refer to your existing dataframes as table names in the sql statements.
https://github.com/yhat/pandasql/
A known bug is the inability to select multiple tables in product joins, such as
select d1.something, d2.something else from df1 as d1, df2 as d2 where d1.date=d2.date
However, if you can do joins without any problem, and a statement like I have above can be translated into a join.