When running a data step with two datasets in the set
statement, sometimes variables do not reset to missing between iterations. This is also true of merge
when you have duplicate by values (ie, when your by
variables do not guarantee a unique record).
For example:
data have1;
do x=1 to 5;
y=1;
output;
end;
run;
data have2;
do x = 6 to 10;
z=x+1;
output;
end;
run;
data want;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
run;
Here, y
is given a value of 2 for every record coming from have2
, as opposed to only the even z
values.
Similarly,
data have1;
do x = 1 to 5;
y=1;
output;
end;
run;
data have2;
do x = 1 to 5;
do z = 1 to 4;
output;
end;
end;
run;
data want;
merge have1 have2;
by x;
if mod(z,4)=3 then y=3;
run;
Why does this happen, and how can I prevent it from causing unexpected consequences?
Why is this happening?
As discussed at length in the SAS documentation in Combining SAS Datsets: Methods, this arises from the fact that variables that are defined on a set
, merge
, or update
statement are not set to missing on each iteration of the data step (This is the equivalent to using retain
for all variables on the incoming data sets).
For the first example, this follows naturally from the retain
concept: y
is retained, so when it is not replaced by a new record from set
having a value of y
on it, it keeps its last value. (As we'll see later, it is cleared once: when the set
dataset changes, hence why it doesn't still have the earlier value from the previous dataset).
However, this doesn't quite explain the functionality of the merge (how it goes back and forth). That is caused by a different behavior when a by
group is involved.
Specifically, variables are not set to missing between each data step iteration; however, they are set to missing for each new by group or data set. From the documentation:
The values of the variables in the program data vector are set to
missing each time SAS starts to read a new data set and when the BY
group changes.
The implications of this are why the second example has y
set back to 1 for the first two iterations of z
but is kept at 3 for the z=4
iteration.
In order, labelling each iteration by its z
value:
- Z=1: first record of by group, so everything is set to missing.
HAVE1
is read, HAVE2
is read. X=1
, Y=1
, Z=1
are all set.
- Z=2: Second record of
have2
is read. y
retains its value of 1 from the previous iteration.
- Z=3: Third record of
have2
is read. y
is set to 3.
- Z=4: Fourth record of
have2
is read. y
retains its value of 3 from the previous iteration.
Note that HAVE1
is only read once, on the z=1
iteration. If this were a many-to-many merge, HAVE1
would be read once for each different row that had the same x
value on it.
How do we prevent it from happening?
You have several options to deal with this, assuming you want it to act as if it was not automatically retained.
As was noted before, on new by
values it will automatically reset everything to missing. So if you ran
data want;
set have1 have2;
by x;
if missing(y) and mod(z,2)=0 then y=2;
run;
This would work as expected (though giving a slightly different result here).
- Set some or all variables to missing on your own
You can do this in two places:
data want;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
output;
call missing(of _all_);
run;
or
data want;
y=.;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
run;
One or the other may be more appropriate for your program depending on your needs (the first sets everything to missing, but requires an extra statement (output;
), while the second only sets y
to missing (which is all that's needed) but changes the variable order by putting y
first).
For a merge
with duplicate by
values, if you want to preserve the value of y
you may need to do something like:
data want;
merge have1 have2;
by x;
y_new=y;
if mod(z,4)=3 then y_new=3;
rename y_new=y;
drop y;
run;
which gets around things by using a separate variable to store the new value. You also can set it to missing similarly to the above, if that is what is desired.