The overal goal is to stratify quantitative variables based on their percentile. I would like to break it up into 10 levels (e.g. 10th, 20th, ...100th percentile) and recode it as 1 if it falls into the 10th percentile, 2 if it falls into the 20th percentile, etc. This method needs to be applicable across any data set I plug in and I want this process to be as automated as possible. Below I have generated some test data:
data test (drop=i);
do i=1 to 1000;
a=round(uniform(1)*4,.01);
b=round(uniform(1)*10,.01);
c=round(uniform(1)*7.5,.01);
output;
end;
stop;
run;
The following macro is used to create a table of values that tells you the cut off for the 10 percentiles of each variable. I have added a picture of the example output below the code.
/*Recode variables based on quartiles from boxplot*/
%macro percentiles(var);
/* Count the number of values in the strinrecode */
%let count=%sysfunc(countw(&var));
/* Loop throurecodeh the total number of values */
%do i = 1 %to &count;
%let variables=%qscan(&var,&i,%str(,));
proc univariate data=test noprint;
var &variables;
output out=pcts pctlpts = 10 20 30 40 50 60 70 80 90 100
pctlpre = &variables;
run;
proc transpose data=pcts out=&variables (rename=(col1=&variables) drop=_NAME_ _LABEL_);
run;
%end;
data percentiles (drop=i);
do i=1 to 10;
recode=i;
percentile=i*10;
output;
end;
stop;
run;
data pcts;
merge percentiles %sysfunc(tranwrd(&var.,%str(,),%str( )));
run;
%mend;
%percentiles(%str(a,b,c));
The following code is how I am currently recoding my variables. I use the table generated in the above macro to fill in the cut off points for each percentile for each variable. As you can see, this is very tedious and will become prohibitive if I have a large number of variables to recode. Is there a better process for this or preferably a way I could automate this part?
data test;
set test;
if a <= .415 then recode_a = 1; else if a <= .785 then recode_a = 2; else if a <= 1.255 then recode_a = 3;
else if a <= 1.61 then recode_a = 4; else if a <= 2.03 then recode_a = 5; else if a <= 2.42 then recode_a = 6;
else if a <= 2.76 then recode_a = 7; else if a <= 3.18 then recode_a = 8; else if a <= 3.64 then recode_a = 9;
else if a <= 3.99 then recode_a = 10;
if b <= .845 then recode_b = 1; else if b <= 1.88 then recode_b = 2; else if b <= 2.86 then recode_b = 3;
else if b <= 4.005 then recode_b = 4; else if b <= 5.03 then recode_b = 5; else if b <= 6.07 then recode_b = 6;
else if b <= 6.995 then recode_b = 7; else if b <= 8.035 then recode_b = 8; else if b <= 9.16 then recode_b = 9;
else if b <= 10 then recode_b = 10;
if c <= .86 then recode_c = 1; else if c <= 1.58 then recode_c = 2; else if c <= 2.34 then recode_c = 3;
else if c <= 3.15 then recode_c = 4; else if c <= 3.85 then recode_c = 5; else if c <= 4.615 then recode_c = 6;
else if c <= 5.315 then recode_c = 7; else if c <= 5.96 then recode_c = 8; else if c <= 6.75 then recode_c = 9;
else if c <= 7.5 then recode_c = 10;
run;
proc print data=test (obs=5);
run;
A different option - PROC RANK. You could probably make it more 'automated' but it's pretty straightforward. Using PROC RANK you could also specify different ways of dealing with ties. Note that it would go from 0 to 9 rather than 1 to 10 but that's trivial to change.
No need for all of that code generation. Just use an array. Basically load the percentiles from the dataset generated by PROC UNIVARIATE into an two dimensional array and then find the decile rank for your actual values.
So if I use your generated sample here is what the log looks like with the MPRINT option turned on.
And the first five observations are:
The following should work for you dynamically with no hard-coding -- I edited to compact it into a single macro. Essentially it puts your desired variables into a list, creates a dataset using your output, and then uses the variable contents to put your data steps into long strings. These strings are then put into a macro variable and you can call it in your final data step. Again, no hard-coding involved.