Summing a Column By Group In a Dataset With Macros

2020-05-07 06:21发布

I have a dataset that looks like:

 Month   Cost_Center      Account    Actual    Annual_Budget
 June     53410           Postage       13      234
 June     53420           Postage       0       432
 June     53430           Postage       48      643
 June     53440           Postage       0       917
 June     53710           Postage       92      662
 June     53410           Phone         73      267
 June     53420           Phone         103     669
 June     53430           Phone         90      763
 ...

I would like to first sum the Actual and Annual columns, respectively and then create a variable where it flags if the Actual extrapolated for the entire year is greater than than Annual column.

I have the following code:

 Data Test; 
 set Combined;
 %All_CC; /*MACRO TO INCLUDE ALL COST CENTERS*/
 %Total_Other_Expenses;/*MACRO TO INCLUDE SPECIFIC Account Descriptions*/
 Sum_Actual = sum(Actual);
 Sum_Annual = sum(Annual_Budget);
 Run_Rate = Sum_Actual*12;
 if Run_Rate > Sum_Annual then Over_Budget_Alarm = 1;
 run; 

However, when I run this code, it does not sum by group, for example, this is the output I get:

 Account_Description    Sum_Actual Sum_Annual   Run_Rate  Over_Budget_Alarm
      Postage             13      234             146           
      Postage             0       432              0 
      Postage             48      643             963            1
      Postage             0       917             0
      Postage             92      662             634            1

I'm looking for output where all the 'postage' are summed for Actual and Annual, leaving just one row of data.

标签: sas
3条回答
迷人小祖宗
2楼-- · 2020-05-07 07:00

SAS DATA step behavior is quite complex ("About DATA Step Execution" in SAS Language Reference: Concepts). The default behavior, that you're seeing, is: at the end of each iteration (i.e. for each input row) the row is written to the output data set, and the PDV - all data step variables - is reset.

You can't expect to write Base SAS "intuitively" without spending a few days learning it first, so I recommend using PROC SQL, unless you have a reason not to.

If you really want to aggregate in data step, you have to use something called BY groups processing: after ensuring the input data set is sorted by the BY vars, you can use something like the following:

data Test (keep = Month Account Sum_Actual Sum_Annual /*...your Run_Rate and Over_Budget_Alarm...*/);
    set Combined; /* the input table */
    by Month Account; /* must be sorted by these */

    retain Sum_Actual Sum_Annual; /* don't clobber for each input row */
    if first.account then do; /* instead do it manually for each group */
        Sum_Actual = 0;
        Sum_Annual = 0;
    end;

    /* accumulate the values from each row */
    Sum_Actual = sum(Sum_Actual, Actual);
    Sum_Annual = sum(Sum_Annual, Annual_Budget);
    /* Note that Sum_Actual = Sum_Actual+Actual; will not work if any of the input values is 'missing'. */

    if last.account then do;
        /* The group has been processed.
           Do any additional processing for the group as a whole, e.g.
           calculate Over_Budget_Alarm. */
        output; /* write one output row per group */
    end;
run;
查看更多
狗以群分
3楼-- · 2020-05-07 07:11

Proc SQL can be very effective for understanding aggregate data examination. With out seeing what the macros do, I would say perform the run rate checks after outputting data set test.

You don't show rows for other months, but I must presume the annual_budget values are constant across all months -- if so, I don't see a reason to ever sum annual_budget; comparing anything to sum(annual_budget) is probably at the incorrect time scale and not useful.

From the show data its hard to tell if you want to know any of these

  • which (or if some) months had a run_rate that exceeded the annual_budget
  • which (or if some) months run_rate exceeded the balance of annual_budget (i.e. the annual_budget less the prior months expenditure)

Presume each row in test is for a single year/month/costCenter/account -- if not the underlying data would have to be aggregated to that level.

Proc SQL;
  * retrieve presumed constant annual_budget values from data;
  * this information might (should) already exist in another table;

  * presume constant annual budget value at each cost center | account combination;
  * distinct because there are multiple months with the same info;

  create table annual_budgets as
  select distinct Cost_Center, Account, Annual_Budget
  from test;

  create table account_budgets as
  select account, sum(annual_budget) as annual_budget 
  from annual_budgets
  group by account;

  * flag for some run rate condition;

  create table annual_budget_mon_runrate_check as
  select 
    2019 as year,
    account,
    sum(actual) as yr_actual,  /* across all month/cost center */
    min (
      select annual_budget from account_budgets as inner
      where inner.account = outer.account
    ) as account_budget,

    max (
      case when actual * 12 > annual_budget then 1 else 0 end
    ) as
      excessive_runrate_flag label="At least one month had a cost center run rate that would exceed its annual_budget")
  from 
    test as outer
  group by
    year, account;

You can add a where clause to restrict the accounts processed.

Changing the max to sum in the flag computation would return the number of cost center months with excessive run rates.

查看更多
来,给爷笑一个
4楼-- · 2020-05-07 07:19
  1. Use PROC MEANS to summarize the data
  2. Use a data step and IF/THEN statement to create your flags.

    proc means data=have N SUM NWAY STACKODS;
       class account;
       var amount annual_budget;
       ods output summary = summary_stats1;
       output out = summary_stats2 N = SUM= / AUTONAME;
    run;
    
    data want;
      set summary_stats;
      if sum_actual > sum_annual_budget then flag=1; 
      else flag=0;
    run;
    
查看更多
登录 后发表回答