How do I group people who are related, even indirectly? Concretely, using the first two columns of the data set like below, how do I in SAS (maybe using a DATA step or PROC SQL) programmatically derive the third column? Is there a non-iterative algorithm?
Background: Each person has multiple addresses. Through each address, each person is connected to zero or more persons. If two people are connected, they get the same group ID. If person A is directly connected to B and B is connected to C, then persons A, B, and C share a group.
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;
The general methods for finding all connected components of a graph are Breadth-First-Search or Depth-First-Search. SAS is not the best tool for implementing such algorithms since they require using such data structures as queues.
Still it can be done with hash objects. Here's the code for BF-search.
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;
run;
Create adjacency list - all pairs of people with a common address. And empty variable cluster
which will be populated later with groups' IDs:
proc sql;
create table connections as
select distinct a.person_id as person_id_a, b.person_id as person_id_b, . as cluster
from people a
inner join people b
on a.address_id=b.address_id
;
quit;
Here goes the BF-search itself:
data _null_;
Declare hash object and its iterator for all unique people (vertices of the graph):
if 0 then set Connections;
dcl hash V(dataset:'Connections', ordered:'y');
V.defineKey('person_id_a');
V.defineData('person_id_a','cluster');
dcl hiter Vi('V');
V.defineDone();
Declare hash object for all connections (edges of the graph):
dcl hash E(dataset:'Connections', multidata:'y');
E.defineKey('person_id_a');
E.defineData('person_id_a','person_id_b');
E.defineDone();
Declare hash object and its iterator for the queue:
dcl hash Q(ordered:'y');
Q.defineKey('qnum','person_id_a');
Q.defineData('qnum','person_id_a');
dcl hiter Qi('Q');
Q.defineDone();
The outermost loop - for taking a new person without assigned cluster to be a root of the next cluster, when the queue is empty:
rc1=Vi.first();
do while(rc1=0);
if missing(cluster) then do;
qnum=1; Q.add(); *qnum-number of the person in the queue, to ensure that new people are added to the end of the queue.;
n+1; cluster=n;
V.replace();*assign cluster number to a person;
In the following two nested loops we de-queue the first person in the queue and look for all people connected to this person in adjacency list. Every found 'connection' we add to the end of the queue. When done with the first person, we delete him/her and de-queue the next one (who became the first now). All of them will be in the same cluster. And so on, until the queue is empty. Then we take a new root person for a new cluster.
rc2=Qi.first();
do while(rc2=0);
qnum=qnum+Q.num_items-1;
rc3=E.find();
do while(rc3=0);
person_id_a=person_id_b;
rc4=V.find();
if rc4=0 and missing(cluster) then do;
qnum+1; Q.add();
cluster=n;
V.replace();
end;
rc3=E.find_next();
end;
Qi.first();
Qi.delete();
Q.remove();
Qi=_new_ hiter ('Q');
rc2=Qi.first();
end;
end;
rc1=Vi.next();
end;
Output list of people with assigned clusters.
V.output(dataset:'clusters');
run;
proc sort data=clusters; by cluster; run;
This is a common problem that has complex solutions. How complex you need depends primarily on the complexity of your data. How often are linkages more than single linkages - ie, in your example above, C and D are linked by 5
. Can you have an E that is linked to D by 6
? If so then this requires either a different approach or a resolution step.
I show one simple method here. This is a very simplistic solution, but it sometimes is easier to understand and implement. Record linkage is a well covered subject that has a lot of papers available to explore; much better solutions exist that are more able to handle multiple linkage than the below solution (which handles 2 level linkage but not further, and has some weaknesses in handling data crosslinkages).
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
6 E 3
7 E 3
8 B 2
;
run;
data links(keep=link:);
set people;
by person_id address_id;
retain link_start;
if first.person_id and not last.person_id then do;
link_start = address_id;
end;
if first.address_id and not first.person_id then do;
link_end = address_id;
output;
end;
run;
data for_fmt;
set links;
start=link_end;
label=link_Start;
retain fmtname '$linkf';
output;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data people_linked;
set people;
new_addressid = put(address_id,$linkf.);
new_addressid = put(new_addressid, $linkf.);
run;
proc sort data=people_linked;
by new_addressid;
run;
data people_final;
set people_linked;
by new_addressid;
if first.new_addressID then
new_householdID+1;
run;
I have been working with a problem that requires a similar thing. I was able to solve using SAS OR using proc OPTNET (statement CONCOMP). Documentation even bring an example that illustrate the concept very well .
Thanks,
Murilo