This question already has an answer here:
- How to replicate a SAS merge 2 answers
Table t1:
person | visit | code_num1 | code_desc1
1 1 100 OTD
1 2 101 SED
2 3 102 CHM
3 4 103 OTD
3 4 103 OTD
4 5 101 SED
Table t2:
person | visit | code_num2 | code_desc2
1 1 104 DME
1 6 104 DME
3 4 103 OTD
3 4 103 OTD
3 7 103 OTD
4 5 104 DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED
1 6 104 DME
2 3 102 CHM
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 103 OTD
4 5 101 SED 104 DME
I want to replicate this in a hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED null null
1 6 null null 104 DME
2 3 102 CHM null null
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 null null 103 OTD
4 5 101 SED 104 DME
Which is almost the same as SAS, but we have 2 extra rows for (person=3, visit=4). I assume this is because hive is matching each row in one table with two rows in the other, producing the 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?
If you merge two data sets and they have variables with the same names (besides the by variables) then variables from the second data set will overwwrite any variables having the same name in the first data set. So your sas code creates a overlaid dataset. A full outer join does not do this.
It seems to me if you first dedupe the right side table then do a full outer join you should get the equivalent table in hive. I don't see a need for the case when statements either as Joe pointed out. Just do a join on the key values:
I can't test this code currently so be sure to test it. Good luck.