Keeping or deleting a group of observations based

2019-09-15 03:33发布

问题:

I answered a SAS question a few minutes ago and realized there is a generalization that might be more useful than that one (here). I didn't see this question already in StackOverflow.

The general question is: How can you process and keep an entire BY-group based on some characteristic of the BY-group that you might not know until you have looked at all the observations in the group?

Using input data similar to that from the earlier question:

* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
    input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;

回答1:

Double DoW loop solution:

data have;
    input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;

data want;
do _n_ = 1 by 1 until(last.id_2);
    set have;
    by id id_2;
    flag = sum(flag,value=0);
end;
do _n_ = 1 to _n_;
    set have;
    if flag then output;
end;
drop flag;
run;

I've tested this against the point approach using ~55m rows and found no appreciable difference in performance. Dataset used:

data have;
do ID = 1 to 10000000;
    do id_2 = 1 to ceil(ranuni(1)*10);
        do value = floor(ranuni(2) * 5);
            output;
        end;
    end;
end;
run;


回答2:

My answer might not be the most efficient, especially for large datasets, and I'm interested in seeing other possible answers. Here it is:

* For some reason, we are tasked with keeping only observations that
* are in groups of ID_1 and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
    input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;

* I realize the data are already sorted, but I think it is better
* not to assume they are.;
proc sort data=x;
    by id id_2;
run;

data obstokeep;
    keep id id_2 value;
    retain startptr haszero;

    * This SET statement reads through the dataset in sequence and 
    * uses the CUROBS option to obtain the observation number. In
    * most situations, this will be the same as the _N_ automatic
    * variable, but CUROBS is probably safer.;
    set x curobs=myptr;
    by id id_2;

    * When this is the first observation in a BY-group, save the 
    * current observation number (pointer).
    * Also initialize a flag variable that will become 1 if any 
    * obs contains a VALUE of 0; 
    * The variables are in a RETAIN statement, so they keep their
    * values as the SET statement above is executed for each obs
    * in the BY-group.;
    if first.id_2
     then do;
        startptr=myptr;
        haszero=0;
     end;

    * This statement is executed for each observation. We check
    * whether VALUE is 0 and, if so, record that fact.;
    if value = 0
     then haszero=1;

    * At the end of the BY-group, we check to see if there were
    * any observations with VALUE = 0. If so, we go back using
    * another SET statement, re-read them via direct access, and
    * write them to the output dataset.
    * (Note that if VALUE order is not relevant, you can gain a bit
    * more efficiency by writing the current obs first, then going
    * back to get the rest.);
    if last.id_2 and haszero
     then do;
        * When LAST and FIRST at the same time, there is only one
        * obs, so no need to backtrack, just output and go on.;
        if first.id_2
         then output obstokeep;
         else do;
            * Here we assume that the observations are sequential
            * (which they will be for a sequential SET statement),
            * so we re-read these observations using another SET 
            * statement with the POINT option for direct access 
            * starting with the first obs of the by-group (the 
            * saved pointer) and ending with the current one (the
            * current pointer).;
            do i=startptr to myptr;
              set x point=i;
              output obstokeep;
            end;
         end;
     end;

run;


回答3:

proc sql;
   select a.*,b.value from (select id,id_2 from have where value=0)a left join have b
   on a.id=b.id and a.id_2=b.id_2;
quit;


标签: sas