-->

Partitioned Running Totals in SAS

2020-04-20 12:02发布

问题:

How can I create a new column for an existing dataset that is the running total of an existing column - partitioned by some identifier?

ID |  Value |     New Value
---|--------|--------------------
1  |   10   |     10
1  |   5    |     15  = 10 + 5
1  |   3    |     18  = 10 + 5 + 3
2  |   45   |     45
2  |   15   |     60  = 45 + 15

I'm used to accomplishing this in SQL (Oracle) with a simple SUM() OVER() statement, but that syntax is apparently not supported in PROC SQL.

If possible, I would like to accomplish this within PROC SQL (I am much more experienced with SQL than SAS coding).

Thanks!

Mike.

回答1:

In the data step this is accomplished with the sum statement.

data want;
  set have;
  by id;
  if first.id then running_total=0;
  running_Total + value;
run;

In PROC SQL this would be impossible unless you have an ordering variable (in which you could do something like this):

proc sql;
  create table want as
   select id, value,
     (select sum(value) from have V
     where V.id=H.id and V.ordervar le H.ordervar
     ) as running_total
   from have H
 ;
quit;

But SAS doesn't have a partition by concept - the SAS data step is far more powerful than that.



回答2:

I generally do running totals in SAS using a data step rather than proc sql. You could theoretically do that with proc sql and a cross-join, but that's not very practical in the real-world.

Step 1: Sort the data by your ID value

proc sort data=have;
     by ID;
run;

Step 2: Calculate the running total using a sum statement

data want;
    set have;
    by ID;

    New_Value+Value;

    output;

    if(last.ID) then New_Value = 0;
 run;

Explanation

By-group processing is one of SAS's most powerful tools. When you sort by your ID value and use a by-statement on it in a data step, you unlock two new special boolean variables: first.<by-group variable> and last.<by-group variable>. SAS knows exactly when a group of ID variables starts and stops when you do by-group processing (you can actually use a special option called notsorted to get away without sorting the ID values, but that is an advanced concept). In your case, let's look and see how this works:

ID Value first.ID last.ID
1  10    1        0
1  5     0        0
1  3     0        1
2  45    1        0 
2  15    0        1

If there was only one observation in an ID group, first.ID and last.ID would both be 1.

There is a special statement in SAS called a sum statement that does the job of both retaining and summing up a variable. SAS is inherently a looping language; whenever it runs through the data step, it is looking at one record and one record only. When it hits the bottom of the data step and goes back up to the top, begins to read the next record in the table, and assumes all variables are now missing again until it reads or calculates the variables' value. The thing that you are literally outputting are the contents of something called the Program Data Vector. In the background, this is what you are manipulating.

By default, it will output to a data set only after it hits the run boundary. If you explicitly tell SAS to output, it will output only when you tell it to. To cumulatively sum up a variable, we want to tell SAS four things:

  1. Retain the previous value of a variable (don't reset it to missing on a when reading a new record)
  2. Add the variable to itself
  3. Output after I add the value to itself
  4. After I output and it's the last observation in a group of ID variables, reset my cumulative variable to 0

To achieve Step 1 & 2, you can do two options:

 data want;
      set have;
      retain New_Value;

      New_Value = sum(New_Value, Value);

OR

 data want;
       set have;
       New_Value+Value;

Note that the 2nd option does exactly what the first does but with less work. Think of it as a shortcut. This is called a Sum Statement. It looks like it's syntactically wrong, but it's a special, and very useful, case.

To achieve step 3, we simply tell SAS to output right afterwards rather than at the end of the data step by default:

data want;
    set have;

    New_Value+Value;

    output;

If you run the code above as it is, New_Value will add up indefinitely until it reaches the very end of the file. We want to reset this value once it reaches a new ID group. We use by-group processing to unlock those two boolean variables to reset New_Value to 0 during a specified condition after we output:

data want;
    set have;
    by ID;

    New_Value+Value;

    output;

    if(last.ID) then New_Value = 0;
run;

New_Value will not be reset to 0 unless we are in the very last of observation of the specified ID group. Note that we put the conditional if statement below the output statement. If this were above, you would see the following phenomenon:

ID Value New_Value first.ID last.ID
1  10    10         1        0
1  5     15         0        0
1  3      0         0        1
2  45    45         1        0 
2  15     0         0        1

We want to output the cumulative sum before New_Value is reset to 0.

There are some other SAS procedures that you can use to do similar things, but they are designed for specific situations. You can repurpose them to do what you want in this case, but it's better to learn data step processing before jumping into procedure repurposing.



回答3:

Here's another option:

data want;
do until (last.id);
    set have;
    by id;
    new_value + value;
    output;
end;
new_value = 0;
run;


回答4:

Joe - your answer did not work for whatever reason, but got me on the right track to figure it out. Thank you!

data want;
    set have;
    by id;
    if first.id then running_total = 0;
    if first.id then retained_total = 0;
    running_total = retained_total + value;
    retained_total = running_total;
    retain retained_total;
run;