Finding all possible paths in a dataset using sas

2020-08-03 04:22发布

问题:

I want to compare two columns in the dataset shown below

Pid       cid
1          2
2          3
2          5
3          6
4          8
8          9
9          4

Then produce a result like below

1 2 3 6
1 2 5
2 3 6
2 5
3 6
4 8 9 4
8 9 4
9 4

First we print the first two values 1 and 2, search for 2 in first column, if its present print its corresponding value from column 2, which is 3. Search for 3 in column 1, if present print the corresponding value from column 2 which is 6

How can this be done using SAS?

回答1:

The links comprise a directed graph and need recursion to traverse the paths.

In data step, the multiple children of a parent can be stored in a Hash of Hashes structure, but recursion in data step is quite awkward (you would have to manually maintain your own stack and local variables in yet another hash)

In Proc DS2 recursion is far more traditional and obvious, and Package Hash is available. However, the Package Hash hashing is different than data step. The data values are only allowed to be scalars, so Hash of Hashes is out :(.

The lack of hash of hashes can be remediated by setting up the hash to have multidata. Each data (child) of a key (parent) are retrieved with the pattern find, and loop for has_next, with find_next.

Another issue with hashes in DS2 is that they must be global to the data step, and the same for any host variables used for keys and data. This makes for tricky management of variables during recursion. The code at scope depth N can not have any reliance on global variables that can get changed at scope depth N+1.

Fortunately, an anonymous hash can be created in any scope and it's reference is maintained locally... but the key and data variables must still be global; so more careful attention is needed.

The anonymous hash is used to store the multidata retrieved by a key; this is necessary because recursion would affect the has_next get_next operation.

Sample code. Requires a rownum variable to prevent cycling that would occur when a child is allowed to act as a parent in a prior row.

data have; rownum + 1;input
Pid       cid;datalines;
1          2
2          3
2          5
3          6
4          8
5          12
6          2
8          9
9          4
12         1
12         2
12         14
13         15
14         20
14         21
14         21
15         1
run;

proc delete data=paths;
proc delete data=rows;

%let trace=;

proc ds2 libs=work;
data _null_ ;
  declare double rownum pid cid id step pathid;
  declare int hIndex;

  declare package hash rows();
  declare package hash links();
  declare package hash path();
  declare package hash paths();

  method leaf(int _rootRow, int _step);
    declare double _idLast _idLeaf;

&trace. put ' ';
&trace. put 'LEAF';
&trace. put ' ';
    * no children, at a leaf -- output path;
    rownum = _rootRow;
    if _step < 2 then return;

    * check if same as last one;

    do step = 0 to _step;
      paths.find();  _idLast = id;
      path.find();   _idLeaf = id;
      if _idLast ne _idLeaf then leave;
    end;

    if _idLast = _idLeaf then return;

    pathid + 1;

    do step = 0 to _step;
      path.find();
      paths.add();
    end;
  end;

  method saveStep(int _step, int _id);
&trace. put 'PATH UPDATE' _step ',' _id '               <-------';
    step = _step;
    id = _id;
    path.replace();
  end;

  method descend(int _rootRow, int _fromRow, int _id, int _step);
    declare package hash h;
    declare double _hIndex;
    declare varchar(20) p;

    if _step > 10 then return;

    p = repeat (' ', _step-1);
&trace. put p 'DESCEND:' _rootRow= _fromRow= _id= _step=;

    * given _id as parent, track in path and descend by child(ren);

    * find links to children;
    pid = _id;
&trace. put p 'PARENT KEY:' pid=;
    if links.find() ne 0 then do;
&trace. put p 'NO KEY';
      saveStep(_step, _id);
      leaf(_rootRow, _step);
      return; 
    end;

    * convert multidata to hash, emulating hash of hash;
    * if not, has_next / find_next multidata traversal would be
    * corrupted by a find in the recursive use of descent;

        * new hash reference in local variable;
        h = _new_ hash ([hindex], [cid rownum], 0,'','ascending');

        hIndex = 1;

&trace. put p 'CHILD' hIndex= cid= rownum=;
        if rownum > _fromRow then h.add();

        do while (links.has_next() = 0);
          hIndex + 1;
          links.find_next();

&trace. put p 'CHILD' hIndex= cid= rownum=;
          if rownum > _fromRow then h.add();
        end;

    if h.num_items = 0 then do;
      * no eligble (forward rowed) children links;
&trace. put p 'NO FORWARD CHILDREN';
      leaf(_rootRow, _step-1);
      return;
    end;

    * update data for path step;
    saveStep (_step, _id);

    * traverse hash that was from multidata;
    * locally instantiated hash is protected from meddling outside current scope;
    * hIndex is local variable;
    do _hIndex = 1 to hIndex;
      hIndex = _hIndex;
      h.find();

&trace. put p 'TRAVERSE:' hIndex= cid= rownum= ;

      descend(_rootRow, rownum, cid, _step+1);
    end;

&trace. put p 'TRAVERSE DONE:' _step=;
  end;

  method init(); 
    declare int index;

    * data keyed by rownum;
    rows.keys([rownum]);
    rows.data([rownum pid cid]);
    rows.ordered('A');
    rows.defineDone();

    * multidata keyed by pid;
    links.keys([pid]);
    links.data([cid rownum]);
    links.multidata('yes');
    links.defineDone();

    * recursively discovered ids of path;
    path.keys([step]);
    path.data([step id]);
    path.ordered('A');
    path.defineDone();

    * paths discovered;
    paths.keys([pathid step]);
    paths.data([pathid step id]);
    paths.ordered('A');
    paths.defineDone();
  end;

  method run();
    set have;
    rows.add();
    links.add();
  end;

  method term();
    declare package hiter rowsiter('rows');
    declare int n;

    do while (rowsiter.next() = 0);
      step = 0;
      saveStep (step, pid);
      descend (rownum, rownum, cid, step+1);
    end;

    paths.output('paths');
    rows.output('rows');
  end;
run;
quit;

proc transpose data=paths prefix=ID_ out=paths_across(drop=_name_);
  by pathid;
  id step;
  var id;
  format id_: 4.;
run;


回答2:

As the comments says, infinite cycle and searching path are not clear at least. So let's start with the most simple case: always search from top to bottom and nerve look back.

Just start at create data set:

data test;
    input Pid Cid;
    cards;
    1 2
    2 3
    2 5
    3 6
    4 8
    8 9
    9 4
    ;
run;

With this assumption, my thought is:

  1. Generate a row indicator, E.g. Ord +1 ;
  2. Use a left-join with connection condition a.Pid = b.Cid and a.Ord > b.Ord where both a and b stand test;
  3. Compare the new data set and the old data set;
  4. Loop 2 and 3 while new data set differs from the old data set;

Well, sometimes we may care result more than path, so here is another answer:

data _null_;
    set test nobs = nobs;

    do i = 1 to nobs;
        set test(rename=(Pid=PidTmp Cid=CidTmp)) point = i;
        if Cid = PidTmp then Cid = CidTmp;
    end;
    put (Pid Cid)(=);
run;

Result:

Pid=1 Cid=6
Pid=2 Cid=6
Pid=2 Cid=5
Pid=3 Cid=6
Pid=4 Cid=4
Pid=8 Cid=4
Pid=9 Cid=4


回答3:

I have tried the below, the result is not perfect though

data want;
  obs1 = 1; 
  do i=1 to 6;
    set ar ;
    obs2 = obs1 + 1;
    set
      ar(
        rename=(
        pid = pid2 
        cid = cid2
        )
      ) point=obs2
    ;
       if cid =pid2
    then k=catx("",pid,cid,cid2);
    else k=catx("",pid,cid);
    output; 
    obs1 + 1; 

  end; 

run;

Result:

pid cid k
1   2   1 2 3
2   3   2 3
2   5   2 5
3   6   3 6
4   8   4 8 9
8   9   8 9 4
9   4   9 4


回答4:

Not enough reputation, so this is another answer, hahaha.
First of all, I can see @Richard answer is very good though I can not use ds2 and hash so skillfully yet. It's a nice example to learn recursion.
So now I know your purpose is definitely the path not the end point, store each result while recursing each observation then be necessary. Your own answer has reflecting this but with a failed do-loop, obs1 = 1 and obs2 = obs1 + 1 and obs1 + 1 would always return obs2 = _N_ + 1 which result in only once loop.
I supplemented and improved the original code this time:

data test;
    set test nobs = nobs;

    array Rst[*] Cid Cid1-Cid10;
    do i = _N_ to nobs;
        set test(rename=(Pid=PidTmp Cid=CidTmp)) point = i;
        do j = 1 to dim(Rst);
            if Rst[j] = PidTmp then Rst[j+1] = CidTmp;
        end;
    end;
run;

I use an oversize array to store the path and change do i = 1 to nobs; to do i = _N_ to nobs; since I find do i = 1 to nobs; will cause the loop look back.



回答5:

proc ds2;
data _null_;
    declare int t1[7];
    declare int t2[7];
    declare varchar(100) lst;

    method f2(int i, int Y);
        do while (y ^= t1[i] and i < dim(t1));
            i+1;
        end;
        if y = t1[i] then do; 
           lst = cat(lst,t2[i]);
           f2(i, t2[i]);  
        end;
    end;

    method f1(int n, int x, int y);
        dcl int i;
        dcl int match;
        match=0;
        do i = n to dim(t1);
            lst = cat(x,y); 
            if (y = t1[i]) then do;
               f2(i,y);
               put lst=;
               match = 1;
            end;
        end;
        if ^match then put lst=;
    end;

    method init();
    dcl int i;
        t1 := (1 2 2 3 4 8 9);
        t2 := (2 3 5 6 8 9 4);
        do i = 1 to dim(t1);
           f1(i, t1[i], t2[i]);
        end;
    end;
enddata;
run;
quit;`enter code here`


标签: sas sas-macro