Difference between SAS merge and full outer join [

2019-08-09 04:40发布

This question already has an answer here:

Table t1:

person | visit | code_num1 | code_desc1
     1       1         100         OTD
     1       2         101         SED
     2       3         102         CHM
     3       4         103         OTD 
     3       4         103         OTD
     4       5         101         SED

Table t2:

 person | visit | code_num2 | code_desc2
     1       1         104         DME
     1       6         104         DME
     3       4         103         OTD 
     3       4         103         OTD
     3       7         103         OTD
     4       5         104         DME

I have the following SAS code that merges the two tables t1 and t2 by person and visit:

DATA t3;
    MERGE t1 t2;
    BY person visit;
RUN;

Which produces the following output:

person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
      1       1         100         OTD        104          DME
      1       2         101         SED   
      1       6                                104          DME           
      2       3         102         CHM 
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       7                                103          OTD
      4       5         101         SED        104          DME

I want to replicate this in a hive query, and tried using a full outer join:

create table t3 as 
select case when a.person is null then b.person else a.person end as person,
       case when a.visit is null then b.visit else a.visit end as visit,
       a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
       from t1 a 
       full outer join t2 b
       on a.person=b.person and a.visit=b.visit

Which yields the table:

person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
      1       1         100         OTD        104          DME
      1       2         101         SED        null        null
      1       6         null        null       104          DME           
      2       3         102         CHM        null        null
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       7         null        null       103          OTD
      4       5         101         SED        104          DME

Which is almost the same as SAS, but we have 2 extra rows for (person=3, visit=4). I assume this is because hive is matching each row in one table with two rows in the other, producing the 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?

1条回答
ゆ 、 Hurt°
2楼-- · 2019-08-09 05:27

If you merge two data sets and they have variables with the same names (besides the by variables) then variables from the second data set will overwwrite any variables having the same name in the first data set. So your sas code creates a overlaid dataset. A full outer join does not do this.

It seems to me if you first dedupe the right side table then do a full outer join you should get the equivalent table in hive. I don't see a need for the case when statements either as Joe pointed out. Just do a join on the key values:

create table t3 as 
select  coalesce(a.person, b.person) as person
      , coalesce(a.visit, b.visit) as visit
      , a.code_num1
      , a.code_desc1
      , b.code_num2
      , b.code_desc2
   from 
   (select * from t1) a 
   full outer join
   (select person, visit, code_num2, code_desc2
       group by person, visit, code_num2, code_desc2 from t2) b
   on a.person=b.person and a.visit=b.visit
   ;

I can't test this code currently so be sure to test it. Good luck.

查看更多
登录 后发表回答