How to do left outer join in spark sql?

Published 2019-02-06 00:59

Question:

I am trying to do a left outer join in spark (1.6.2) and it doesn't work. My sql query is like this:

sqlContext.sql("""select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
where t.created_year = 2016
and p.created_year = 2016""").show()

The result is like this:

+--------------------+--------------------+--------------------+
|                type|                uuid|                uuid|
+--------------------+--------------------+--------------------+
|              tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
|             swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|

I get the same result whether I use LEFT JOIN or LEFT OUTER JOIN (the second uuid is never null).

I would expect the second uuid column to be null for unmatched rows. How do I do a left outer join correctly?

=== Additional information ===

If I use the DataFrame API to do the left outer join, I get the correct result.

s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')

s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month).show()

I get a result like this:

+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|               type|              s_uuid|           p_uuid|        created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|

Thanks,

Answer 1:

I don't see any issue in your code. Both "left join" and "left outer join" work fine. Please check your data again: the rows you are showing are only the matched ones.

You can also perform the join through the DataFrame API (Scala syntax):

// Left outer join, with the join type given explicitly
df1.join(df2, df1("col1") === df2("col1"), "left_outer")


Answer 2:

You are filtering out the null values for p.created_year (and therefore for p.uuid) with

where t.created_year = 2016 
and p.created_year = 2016

The way to avoid this is to move the filtering clause for p into the ON clause:

sqlContext.sql("""select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p
ON t.uuid = p.uuid
and p.created_year = 2016
where t.created_year = 2016""").show()

This is correct but can be inefficient, because we would also like to filter on t.created_year before the join happens. So it is recommended to use subqueries:

sqlContext.sql("""select t.type, t.uuid, p.uuid
from (
  SELECT type, uuid FROM symptom_type WHERE created_year = 2016
) t LEFT JOIN (
  SELECT uuid FROM plugin WHERE created_year = 2016
) p
ON t.uuid = p.uuid""").show()


Answer 3:

I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN for what you want. For more information, look at the Spark documentation.