Could someone explain why the following two pieces of code give different results? I would like to simulate some simple time series processes in SAS, but I'm struggling with the lag function.
Specifically, in program 1, the variable b contains no data, which is unexpected. In program 2, the lag function works as expected.
/*Program 1*/
data lagtest;
a = 1;
b=lag(a);
output;
a = 2;
b= lag(a);
output;
a = 3;
b= lag(a);
output;
run;
/*Program 2*/
data lagtest2;
input a;
datalines;
1
2
3
;
run;
data lagtest2;
set lagtest2;
b= lag(a);
run;
I've been reading about the lag function, but can't find references to its use in a datastep that does not take an input dataset.
Thanks very much for any help.
Keith's roughly correct in that the correct approach is what he shows, but the reasoning isn't accurate. LAG works on data; whether that data is input or output is irrelevant (and not really a meaningful distinction). It is, in fact, quite possible to make this work with only programmatically provided data.
data lagtest;
do a=1 to 3;
b=lag(a);
output;
end;
run;
Similarly, it's possible to make the second example not work, with a somewhat absurd example:
data lagtest2;
p=1;
set lagtest2 point=p;
b= lag(a);
output;
p=2;
set lagtest2 point=p;
b=lag(a);
output;
p=3;
set lagtest2 point=p;
b=lag(a);
output;
stop;
run;
The reason the first example doesn't work isn't the source of data; it's the number of lag calls. One of the most common mistakes is to believe that lag retrieves a value from the previous record; that isn't true. The way lag works is that each call to lag in the code creates its own queue. Each time that lag statement is executed, whatever value is in the argument is pushed onto the queue, and if the queue is at least the defined length+1 long, the value at the front of the queue is popped off and returned; otherwise the result is missing. (For lag or lag1, the queue must be 2 long; for lag2 it must be 3 long; etc. - i.e., the number of the function plus the value just pushed on.)
In your first example, you call lag three times, so three separate queues are created, and none of them is ever called a second time. In your second example, you call lag once, so one queue is created, and it is called three times.
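To see the one-queue-per-call behaviour directly, here is a minimal sketch (the dataset name lagdemo and the variables b2 and c are made up for illustration). It runs lag and lag2 on every pass through a loop, plus a third lag call that only executes when a is even:

```sas
/* Sketch: each LAG call in the code owns its own queue. */
data lagdemo;
    do a = 1 to 4;
        b  = lag(a);          /* one call, executed every iteration */
        b2 = lag2(a);         /* separate, longer queue; also executed every iteration */
        if mod(a, 2) = 0 then
            c = lag(a);       /* third queue, fed only when a is even */
        else
            c = .;            /* reset explicitly; c would otherwise keep its last value */
        output;
    end;
run;

proc print data=lagdemo noobs;
run;
```

If I've traced it correctly, b comes out as ., 1, 2, 3 and b2 as ., ., 1, 2, while c is missing everywhere except the last row, where it is 2 rather than 3: the conditional call's queue only ever saw a=2 and a=4, so it returns the last value that was pushed to it, not the value from the previous row.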
The LAG function works on input data, not output data. In your first example there is no input data, just output, therefore the lag value is always blank.
In your second example you don't need the two sections of code; you could just write:
data lagtest2;
input a;
b= lag(a);
datalines;
1
2
3
;
run;