Variable check and summary out

2019-08-31 12:22发布

问题:

Problem/Question

I'm trying to do a simple check on on a list of variables in a data set (revenue, costs, profits, and vcosts) that grabs the largest and second largest from each variable, checks if their total is greater than 90% of the sum of the variable, and if so, flags that variable. I want to also check that the largest variable is not larger than 60% of the total sum.

I got a bit of help from this Macro that outputs table with testing results of SAS tableMacro that outputs table with testing results of SAS table but now I'm trying to answer a much more basic question. This doesn't seem to hard, but I can't figure out how to setup the basic table at the end.

I know all the variable names.

Here is a sample dataset I've created : https://www.dropbox.com/s/x575w5d551uu47p/dataset%20%281%29.csv?dl=0

Desired Output

I would like to turn this basic table :

Into another table like this :

Reproducible example

/* Create some dummy data with three variables to assess */
data have;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

回答1:

Based on your comment on the previous answer. It looks like top_2_total is sum of the 2 maximum total values. For that purpose you would need to code some extra step. I'm using proc transpose and a datastep to get what was already acheieved in the previous answer. I have coded PROC SUMMARY to get the top 2 maximum total values and reusing the dataset to create the final answer. Let me know if it helps.

data have;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

proc transpose data=have out=want prefix=top_;
    var revenue--vcost;
run;

data want;
set want end=eof;
    array top(*) top_3-top_1;
    call sortn(of top[*]);
    total=sum(of top[*]);
run;
/* Getting the maximum 2 total values using PROC SUMMARY*/
proc summary data=want nway;
    output out=total_top_2_rec(drop=_:) idgroup(max(total) out[2](total)=);
run;

data want;
/* Loop to get the values from previous step and generate TOP_2_TOTAL variable */
if _n_=1 then set total_top_2_rec;
    top_2_total=sum(total_1,total_2);

set want;
    if sum(top_1,top_2) > 0.9  * top_2_total then Flag90=1; else Flag90=0;
    if top_1 > top_2_total * 0.6 then Flag60=1; else Flag60=0;

drop total_1 total_2;
run;

proc print data=want;run;

EDIT : I have added a logic before my PROC TRANSPOSE where you can add the variables to consider for the calculation and rest is done by the code. No manual changes would be required to be done by code executor after that. The variables should be entered as space delimited list.

data have;
infile 'C:\dataset (1).csv' missover dsd dlm=',' firstobs=2;
input firm v1 v2 v3;
run;

/* add/remove columns here to consider variable */
%let variable_to_consider=v1 
                          v2 
                          v3
                          ;

%let variable_to_consider=%cmpres(&variable_to_consider);
proc sql noprint;
  select count(*) into : obs_count from have;
quit;
%let obs_count=&obs_count;

proc transpose data=have out=want prefix=top_;
    var &variable_to_consider; 
run;

data want;
set want end=eof;
    array top(*) top_&obs_count.-top_1;
    x=dim(top);
    call sortn(of top[*]);
    total=sum(of top[*]);

keep total top_1 top_2 _name_;
run;

/* Getting the maximum 2 total values using PROC SUMMARY*/
proc summary data=want nway;
    output out=total_top_2_rec(drop=_:) idgroup(max(total) out[2](total)=);
run;

data want;
/* Loop to get the values from previous step and generate TOP_2_TOTAL variable */
if _n_=1 then set total_top_2_rec;
    top_2_total=sum(total_1,total_2);

set want;
    if sum(top_1,top_2) > 0.9  * top_2_total then Flag90=1; else Flag90=0;
    if top_1 > top_2_total * 0.6 then Flag60=1; else Flag60=0;

drop total_1 total_2;
run;

proc print data=want;run;

EDIT 2014-04-05 : As discussed, i have updated the logic and fixed the issues. Below is the updated code.

data have1;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

data have2;
infile 'dataset (1).csv' missover dsd dlm=',' firstobs=2;
input firm v1 v2 v3;
run;
/* add/remove columns here to consider variable */

%macro mymacro(input_dataset= ,output_dataset=, variable_to_consider=);

%let variable_to_consider=%cmpres(&variable_to_consider);
proc sql noprint;
  select count(*) into : obs_count from &input_dataset;
quit;
%let obs_count=&obs_count;

proc transpose data=&input_dataset out=&output_dataset prefix=top_;
    var &variable_to_consider; 
run;

data &output_dataset;
set &output_dataset end=eof;
    array top(*) top_&obs_count.-top_1;
    x=dim(top);
    call sortn(of top[*]);
    total=sum(of top[*]);

top_2_total=sum(top_1, top_2);
    if sum(top_1,top_2) > 0.9  * total then Flag90=1; else Flag90=0;
    if top_1 > total * 0.6 then Flag60=1; else Flag60=0;

keep total top_1 top_2 _name_ top_2_total total Flag60 Flag90;

run;
%mend mymacro;

%mymacro(input_dataset=have1, output_dataset=want1 ,variable_to_consider=revenue costs profits vcost)
%mymacro(input_dataset=have2, output_dataset=want2 ,variable_to_consider=v1 v2 v3 )


proc print data=want1;run;
proc print data=want2;run;


回答2:

The hard part here is extracting the top 2 values for each variable. This is simple in most implementations of SQL, but in SAS I don't think proc sql supports the select top n syntax.

I can think of several possible ways of doing this:

  1. Sort the dataset in descending order by each variable of interest, retrieve the values from the first 2 observations, transpose, and append them all together - this very inefficient due to multiple sorts, and it isn't much simpler than the other approaches.

  2. Write a (fairly complex) data step to extract the top 2 values for each variable.

  3. Get proc univariate to extract the top values for you and then transpose the output dataset into the right format.

Data step approach

data top2;
  array v{4} revenue costs profits vcost;
  array top1{4} (4*0);
  array top2{4} (4*0);
  set have end = eof;
  do i = 1 to 4;
    if v[i] > top1[i] then do;
      top2[i] = top1[i];
      top1[i] = v[i];
    end;
    if top2[i] < v[i] < top1[i] then top2[i] = v[i];
  end;
  length varname $8;
  if eof then do i = 1 to 4;
    varname = vname(v[i]);
    top_1    = top1[i];
    top_2    = top2[i];
    top_2_total = top_1 + top_2;
    output;
  end;
  keep varname top_:;
run;

Proc univariate approach

ods _all_ close;
ods output extremeobs = extremeobs(keep = varname high);
proc univariate data = have(drop = firm);
run;
ods listing;

data top2_b;
    set extremeobs;
    by varname notsorted;
    if first.varname then do;
        i = 0;
        call missing(top_2);
    end;
    i + 1;
    retain top_2;
    if i = 4 then top_2 = high;
    if i = 5 then do;
        top_1 = high;
        top_2_total = top_1 + top_2;
        output;
    end;
    drop i high;
run;

Once you've got this you can merge it with your existing simple table from proc means / proc summary and compute any further measures of interest.



回答3:

The flag1 and flag2 in the last step will have a positive integer for values with numerator greater than or equal to the denominator and zero if the numerator less than denominator.

data have(drop=firm);
    do firm = 1 to 4;
        VarName = 'Variable';
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        vcost = rand("uniform");
        output;
    end;
run;

Proc Transpose data=have out=transout
name=Variable
prefix=Var_;
run;

options Mprint;

%Macro calcflag(Varlist);
proc sql;
create table outtable as
select Variable,
sum(&Varlist) as Sum_var,
Largest(1,&Varlist) as Top_1,
Largest(2,&Varlist) as Top_2,
sum(Largest(1,&Varlist),Largest(2,&Varlist)) as Top_2_total,
floor(sum(Largest(1,&Varlist),Largest(2,&Varlist))/(sum(&Varlist)*0.9)) as flag1,
floor(Largest(1,&Varlist)/(sum(&Varlist)*0.6)) as flag2 
from transout;
quit;
%mend;

%calcflag(%str(Var_1,Var_2,Var_3,Var_4));


标签: sas