When using multiple datasets in a SET or a MERGE w

2019-07-06 02:25发布

问题:

When running a data step with two datasets in the set statement, sometimes variables do not reset to missing between iterations. This is also true of merge when you have duplicate by values (ie, when your by variables do not guarantee a unique record).

For example:

data have1;
  do x=1 to 5;
    y=1;
    output;
  end;
run;

data have2;
  do x = 6 to 10;
     z=x+1;
     output;
  end;
run;

data want;
  set have1 have2;
  if missing(y) and mod(z,2)=0 then y=2;
run;

Here, y is given a value of 2 for every record coming from have2, as opposed to only the even z values.

Similarly,

data have1;
  do x = 1 to 5;
    y=1;
    output;    
  end;
run;

data have2;
  do x = 1 to 5;
    do z = 1 to 4;
       output;
    end;
  end;
run;

data want;
  merge have1 have2;
  by x;
  if mod(z,4)=3 then y=3;
run;

Why does this happen, and how can I prevent it from causing unexpected consequences?

回答1:

Why is this happening?

As discussed at length in the SAS documentation in Combining SAS Datsets: Methods, this arises from the fact that variables that are defined on a set, merge, or update statement are not set to missing on each iteration of the data step (This is the equivalent to using retain for all variables on the incoming data sets).

For the first example, this follows naturally from the retain concept: y is retained, so when it is not replaced by a new record from set having a value of y on it, it keeps its last value. (As we'll see later, it is cleared once: when the set dataset changes, hence why it doesn't still have the earlier value from the previous dataset).

However, this doesn't quite explain the functionality of the merge (how it goes back and forth). That is caused by a different behavior when a by group is involved.

Specifically, variables are not set to missing between each data step iteration; however, they are set to missing for each new by group or data set. From the documentation:

The values of the variables in the program data vector are set to missing each time SAS starts to read a new data set and when the BY group changes.

The implications of this are why the second example has y set back to 1 for the first two iterations of z but is kept at 3 for the z=4 iteration.

In order, labelling each iteration by its z value:

  • Z=1: first record of by group, so everything is set to missing. HAVE1 is read, HAVE2 is read. X=1, Y=1, Z=1 are all set.
  • Z=2: Second record of have2 is read. y retains its value of 1 from the previous iteration.
  • Z=3: Third record of have2 is read. y is set to 3.
  • Z=4: Fourth record of have2 is read. y retains its value of 3 from the previous iteration.

Note that HAVE1 is only read once, on the z=1 iteration. If this were a many-to-many merge, HAVE1 would be read once for each different row that had the same x value on it.

How do we prevent it from happening?

You have several options to deal with this, assuming you want it to act as if it was not automatically retained.

  • Add a by statement

As was noted before, on new by values it will automatically reset everything to missing. So if you ran

data want;
  set have1 have2;
  by x;
  if missing(y) and mod(z,2)=0 then y=2;
run;

This would work as expected (though giving a slightly different result here).

  • Set some or all variables to missing on your own

You can do this in two places:

data want;
  set have1 have2;
  if missing(y) and mod(z,2)=0 then y=2;
  output;
  call missing(of _all_);
run;

or

data want;
  y=.;
  set have1 have2;
  if missing(y) and mod(z,2)=0 then y=2;
run;

One or the other may be more appropriate for your program depending on your needs (the first sets everything to missing, but requires an extra statement (output;), while the second only sets y to missing (which is all that's needed) but changes the variable order by putting y first).

For a merge with duplicate by values, if you want to preserve the value of y you may need to do something like:

data want;
  merge have1 have2;
  by x;
  y_new=y;
  if mod(z,4)=3 then y_new=3;
  rename y_new=y;
  drop y;
run;

which gets around things by using a separate variable to store the new value. You also can set it to missing similarly to the above, if that is what is desired.



标签: sas