I have two tables, t1 and t2:
t1
person | visit | code1 | type1
1 1 50 50
1 1 50 50
1 2 75 50
t2
person | visit | code2 | type2
1 1 50 50
1 1 50 50
1 1 50 50
When SAS runs the following code:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
It generates the following dataset:
person | visit | code1 | type1 | code2 | type2
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 2 75 50
I want to replicate this process in SQL, and my idea was to use a full-outer-join. This works unless there are duplicate rows. When we have duplicate rows like in the above example, a full outer join produces the following table:
person | visit | code1 | type1 | code2 | type2
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 2 75 50
I'm wondering how I would get the SQl table to match the SAS table.
You can replicate a SAS merge by adding a
row_number()
to each table:Notes:
??
means to put in the column(s) used for ordering. SAS datasets have an intrinsic order. SQL tables do not, so the ordering needs to be specified.t1.*, t2.*
in the outer query). I think SAS only includesperson
andvisit
once in the resulting dataset.EDIT:
Note: the above produces separate values for the key columns. This is easy enough to fix:
That fixes the columns issue. You can fix the copying issue by using
first_value()
/last_value()
or by using a more complicatedjoin
condition:This implements the "keep the last row" logic in a single join. Probably for performance reasons, you would want to put this into separate
left join
s on the original logic.Gordon's answer is close; but it misses one point. Here's its output:
The third row's nulls are incorrect, while the fourth's are correct.
As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:
I think the last two might be workable into one query, but I think the second and third have to be separate queries. You can union everything together, of course.
So here's an example, using some temporary tables that are a little more well suited to see what's going on. Note that the third row is now filled in for
code1
andtype1
, even though those are 'extra'. I've only added three of the five criteria - the three you had in your initial example - but the other two aren't too hard.Note that this is an example of something far faster in SAS - because SAS has a row-wise concept, ie, it's capable of going one row at a time. SQL tends to take a lot longer at these, with large tables, unless it's possible to partition things very neatly and have very good indexes - and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept of course - SQL has its own advantages, one of which being probably price...
Here's my example code. I'm sure it's not terribly elegant, hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables), same thing should work with some changes (to use temporary tables) in other variants, assuming they implement windowing. (SAS of course can't do this particular thing - as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this please feel free to edit and/or copy to a new/existing answer any bit you wish.