Summing a Column By Group In a Dataset With Macros

2020-05-07 06:59发布

问题:

I have a dataset that looks like:

 Month   Cost_Center      Account    Actual    Annual_Budget
 June     53410           Postage       13      234
 June     53420           Postage       0       432
 June     53430           Postage       48      643
 June     53440           Postage       0       917
 June     53710           Postage       92      662
 June     53410           Phone         73      267
 June     53420           Phone         103     669
 June     53430           Phone         90      763
 ...

I would like to first sum the Actual and Annual columns, respectively and then create a variable where it flags if the Actual extrapolated for the entire year is greater than than Annual column.

I have the following code:

 Data Test; 
 set Combined;
 %All_CC; /*MACRO TO INCLUDE ALL COST CENTERS*/
 %Total_Other_Expenses;/*MACRO TO INCLUDE SPECIFIC Account Descriptions*/
 Sum_Actual = sum(Actual);
 Sum_Annual = sum(Annual_Budget);
 Run_Rate = Sum_Actual*12;
 if Run_Rate > Sum_Annual then Over_Budget_Alarm = 1;
 run; 

However, when I run this code, it does not sum by group, for example, this is the output I get:

 Account_Description    Sum_Actual Sum_Annual   Run_Rate  Over_Budget_Alarm
      Postage             13      234             146           
      Postage             0       432              0 
      Postage             48      643             963            1
      Postage             0       917             0
      Postage             92      662             634            1

I'm looking for output where all the 'postage' are summed for Actual and Annual, leaving just one row of data.

回答1:

  1. Use PROC MEANS to summarize the data
  2. Use a data step and IF/THEN statement to create your flags.

    proc means data=have N SUM NWAY STACKODS;
       class account;
       var amount annual_budget;
       ods output summary = summary_stats1;
       output out = summary_stats2 N = SUM= / AUTONAME;
    run;
    
    data want;
      set summary_stats;
      if sum_actual > sum_annual_budget then flag=1; 
      else flag=0;
    run;
    


回答2:

SAS DATA step behavior is quite complex ("About DATA Step Execution" in SAS Language Reference: Concepts). The default behavior, that you're seeing, is: at the end of each iteration (i.e. for each input row) the row is written to the output data set, and the PDV - all data step variables - is reset.

You can't expect to write Base SAS "intuitively" without spending a few days learning it first, so I recommend using PROC SQL, unless you have a reason not to.

If you really want to aggregate in data step, you have to use something called BY groups processing: after ensuring the input data set is sorted by the BY vars, you can use something like the following:

data Test (keep = Month Account Sum_Actual Sum_Annual /*...your Run_Rate and Over_Budget_Alarm...*/);
    set Combined; /* the input table */
    by Month Account; /* must be sorted by these */

    retain Sum_Actual Sum_Annual; /* don't clobber for each input row */
    if first.account then do; /* instead do it manually for each group */
        Sum_Actual = 0;
        Sum_Annual = 0;
    end;

    /* accumulate the values from each row */
    Sum_Actual = sum(Sum_Actual, Actual);
    Sum_Annual = sum(Sum_Annual, Annual_Budget);
    /* Note that Sum_Actual = Sum_Actual+Actual; will not work if any of the input values is 'missing'. */

    if last.account then do;
        /* The group has been processed.
           Do any additional processing for the group as a whole, e.g.
           calculate Over_Budget_Alarm. */
        output; /* write one output row per group */
    end;
run;


回答3:

Proc SQL can be very effective for understanding aggregate data examination. With out seeing what the macros do, I would say perform the run rate checks after outputting data set test.

You don't show rows for other months, but I must presume the annual_budget values are constant across all months -- if so, I don't see a reason to ever sum annual_budget; comparing anything to sum(annual_budget) is probably at the incorrect time scale and not useful.

From the show data its hard to tell if you want to know any of these

  • which (or if some) months had a run_rate that exceeded the annual_budget
  • which (or if some) months run_rate exceeded the balance of annual_budget (i.e. the annual_budget less the prior months expenditure)

Presume each row in test is for a single year/month/costCenter/account -- if not the underlying data would have to be aggregated to that level.

Proc SQL;
  * retrieve presumed constant annual_budget values from data;
  * this information might (should) already exist in another table;

  * presume constant annual budget value at each cost center | account combination;
  * distinct because there are multiple months with the same info;

  create table annual_budgets as
  select distinct Cost_Center, Account, Annual_Budget
  from test;

  create table account_budgets as
  select account, sum(annual_budget) as annual_budget 
  from annual_budgets
  group by account;

  * flag for some run rate condition;

  create table annual_budget_mon_runrate_check as
  select 
    2019 as year,
    account,
    sum(actual) as yr_actual,  /* across all month/cost center */
    min (
      select annual_budget from account_budgets as inner
      where inner.account = outer.account
    ) as account_budget,

    max (
      case when actual * 12 > annual_budget then 1 else 0 end
    ) as
      excessive_runrate_flag label="At least one month had a cost center run rate that would exceed its annual_budget")
  from 
    test as outer
  group by
    year, account;

You can add a where clause to restrict the accounts processed.

Changing the max to sum in the flag computation would return the number of cost center months with excessive run rates.



标签: sas