I have two tables, t1 and t2:
t1
person | visit | code1 | type1
1 1 50 50
1 1 50 50
1 2 75 50
t2
person | visit | code2 | type2
1 1 50 50
1 1 50 50
1 1 50 50
When SAS runs the following code:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
It generates the following dataset:
person | visit | code1 | type1 | code2 | type2
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 2 75 50
I want to replicate this process in SQL, and my idea was to use a full-outer-join. This works unless there are duplicate rows. When we have duplicate rows like in the above example, a full outer join produces the following table:
person | visit | code1 | type1 | code2 | type2
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 1 50 50 50 50
1 2 75 50
I'm wondering how I would get the SQl table to match the SAS table.
Gordon's answer is close; but it misses one point. Here's its output:
person visit code1 type1 seqnum person visit code2 type2 seqnum
1 1 1 1 1 1 1 1 1 1
1 1 2 2 2 1 1 2 2 2
NULL NULL NULL NULL NULL 1 1 3 3 3
1 2 1 3 1 NULL NULL NULL NULL NULL
The third row's nulls are incorrect, while the fourth's are correct.
As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:
- Matching person/visit, Matching seqnums
- Matching person/visit, Left has more seqnums
- Matching person/visit, Right has more seqnums
- Left has unmatched person/visit
- Right has unmatched person/visit
I think the last two might be workable into one query, but I think the second and third have to be separate queries. You can union everything together, of course.
So here's an example, using some temporary tables that are a little more well suited to see what's going on. Note that the third row is now filled in for code1
and type1
, even though those are 'extra'. I've only added three of the five criteria - the three you had in your initial example - but the other two aren't too hard.
Note that this is an example of something far faster in SAS - because SAS has a row-wise concept, ie, it's capable of going one row at a time. SQL tends to take a lot longer at these, with large tables, unless it's possible to partition things very neatly and have very good indexes - and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept of course - SQL has its own advantages, one of which being probably price...
Here's my example code. I'm sure it's not terribly elegant, hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables), same thing should work with some changes (to use temporary tables) in other variants, assuming they implement windowing. (SAS of course can't do this particular thing - as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this please feel free to edit and/or copy to a new/existing answer any bit you wish.
declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
declare @t2 table (person INT, visit INT, code2 INT, type2 INT);
insert into @t1 values (1,1,1,1)
insert into @t1 values (1,1,2,2)
insert into @t1 values (1,2,1,3)
insert into @t2 values (1,1,1,1)
insert into @t2 values (1,1,2,2)
insert into @t2 values (1,1,3,3)
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
row_number() over (partition by person, visit order by type1) as seqnum
from @t1
) t1 inner join
(select *,
row_number() over (partition by person, visit order by type2) as seqnum
from @t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum
union all
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (
(select person, visit, MAX(seqnum) as max_rownum from (
select person, visit,
row_number() over (partition by person, visit order by type1) as seqnum
from @t1) t1_f
group by person, visit
) t1_m inner join
(select *, row_number() over (partition by person, visit order by type1) as seqnum
from @t1
) t1
on t1.person=t1_m.person and t1.visit=t1_m.visit
and t1.seqnum=t1_m.max_rownum
inner join
(select *,
row_number() over (partition by person, visit order by type2) as seqnum
from @t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum < t2.seqnum
)
union all
select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
from @t1 t1 left join @t2 t2
on t2.person=t1.person and t2.visit=t1.visit
where t2.code2 is null
You can replicate a SAS merge by adding a row_number()
to each table:
select t1.*, t2.*
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
Notes:
- The
??
means to put in the column(s) used for ordering. SAS datasets have an intrinsic order. SQL tables do not, so the ordering needs to be specified.
- You should list the columns explicitly (instead of using
t1.*, t2.*
in the outer query). I think SAS only includes person
and visit
once in the resulting dataset.
EDIT:
Note: the above produces separate values for the key columns. This is easy enough to fix:
select coalesce(t1.person, t2.person) as person,
coalesce(t1.key, t2.key) as key,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
t1.seqnum = t2.seqnum;
That fixes the columns issue. You can fix the copying issue by using first_value()
/last_value()
or by using a more complicated join
condition:
select coalesce(t1.person, t2.person) as person,
coalesce(t1.visit, t2.visit) as visit,
t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t1
) t1 full outer join
(select t2.*,
count(*) over (partition by person, visit) as cnt,
row_number() over (partition by person, visit order by ??) as seqnum
from t2
) t2
on t1.person = t2.person and t1.visit = t2.visit and
(t1.seqnum = t2.seqnum or
(t1.cnt > t2.cnt and t1.seqnum > t2.seqnum and t2.seqnum = t2.cnt) or
(t2.cnt > t1.cnt and t2.seqnum > t1.seqnum and t1.seqnum = t1.cnt)
This implements the "keep the last row" logic in a single join. Probably for performance reasons, you would want to put this into separate left join
s on the original logic.