How can I create a new column for an existing dataset that is the running total of an existing column - partitioned by some identifier?
ID | Value | New Value
---|--------|--------------------
1 | 10 | 10
1 | 5 | 15 = 10 + 5
1 | 3 | 18 = 10 + 5 + 3
2 | 45 | 45
2 | 15 | 60 = 45 + 15
I'm used to accomplishing this in SQL (Oracle) with a simple SUM() OVER() statement, but that syntax is apparently not supported in PROC SQL.
If possible, I would like to accomplish this within PROC SQL (I am much more experienced with SQL than SAS coding).
Thanks!
Mike.
Here's another option:
I generally do running totals in SAS using a
data
step rather thanproc sql
. You could theoretically do that withproc sql
and a cross-join, but that's not very practical in the real-world.Step 1: Sort the data by your ID value
Step 2: Calculate the running total using a sum statement
Explanation
By-group processing is one of SAS's most powerful tools. When you sort by your ID value and use a by-statement on it in a data step, you unlock two new special boolean variables:
first.<by-group variable>
andlast.<by-group variable>.
SAS knows exactly when a group of ID variables starts and stops when you do by-group processing (you can actually use a special option callednotsorted
to get away without sorting the ID values, but that is an advanced concept). In your case, let's look and see how this works:If there was only one observation in an ID group,
first.ID
andlast.ID
would both be 1.There is a special statement in SAS called a sum statement that does the job of both retaining and summing up a variable. SAS is inherently a looping language; whenever it runs through the
data
step, it is looking at one record and one record only. When it hits the bottom of thedata
step and goes back up to the top, begins to read the next record in the table, and assumes all variables are now missing again until it reads or calculates the variables' value. The thing that you are literally outputting are the contents of something called the Program Data Vector. In the background, this is what you are manipulating.By default, it will output to a data set only after it hits the
run
boundary. If you explicitly tell SAS to output, it will output only when you tell it to. To cumulatively sum up a variable, we want to tell SAS four things:To achieve Step 1 & 2, you can do two options:
OR
Note that the 2nd option does exactly what the first does but with less work. Think of it as a shortcut. This is called a Sum Statement. It looks like it's syntactically wrong, but it's a special, and very useful, case.
To achieve step 3, we simply tell SAS to output right afterwards rather than at the end of the data step by default:
If you run the code above as it is,
New_Value
will add up indefinitely until it reaches the very end of the file. We want to reset this value once it reaches a new ID group. We use by-group processing to unlock those two boolean variables to resetNew_Value
to 0 during a specified condition after we output:New_Value
will not be reset to 0 unless we are in the very last of observation of the specified ID group. Note that we put the conditionalif
statement below the output statement. If this were above, you would see the following phenomenon:We want to output the cumulative sum before
New_Value
is reset to 0.There are some other SAS procedures that you can use to do similar things, but they are designed for specific situations. You can repurpose them to do what you want in this case, but it's better to learn
data
step processing before jumping into procedure repurposing.In the data step this is accomplished with the
sum statement
.In PROC SQL this would be impossible unless you have an ordering variable (in which you could do something like this):
But SAS doesn't have a
partition by
concept - the SAS data step is far more powerful than that.Joe - your answer did not work for whatever reason, but got me on the right track to figure it out. Thank you!