Split large SAS dataset into smaller datasets

2020-03-23 17:56发布

问题:

I need some assistance with splitting a large SAS dataset into smaller datasets.

Each month I'll have a dataset containing a few million records. This number will vary from month to month. I need to split this dataset into multiple smaller datasets containing 250,000 records each. For example, if I have 1,050,000 records in the original dataset then I need the end result to be 4 datasets containing 250,000 records and 1 dataset containing 50,000 records.

From what I've been looking at it appears this will require using macros. Unfortunately I'm fairly new to SAS (unfamiliar with using macros) and don't have a lot of time to accomplish this. Any help would be greatly appreciated.

回答1:

Building on Joe's answer, maybe you could try something like this :

%MACRO SPLIT(DATASET);

%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET__ID., NLOBS));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);

DATA 
  %DO I=1 %TO &NB_DATASETS.;
    WANT&I. 
  %END;;

  SET WANT;

  %DO I=1 %TO &NB_DATASETS.;

    %IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;

  %END;
RUN;
%MEND SPLIT;    


回答2:

You can do it without macros at all, if you don't mind asking for datasets that may not exist, and have a reasonable bound on things.

data want1 want2 want3 want4 want5 want6 want7 want8 want9;
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;

Macros would make that more efficient to program and cleaner to read, but wouldn't change how it actually runs in reality.



回答3:

You can do it without macros, using CALL EXECUTE(). It creates SAS-code as text strings and then executes it, after your "manually written" code completed.

data _null_;
    if 0 then set have nobs=n;
    do i=1 to ceil(n/250000);
        call execute (cats("data want",i)||";");
        call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
        call execute ("run;");
    end;
run;


回答4:

A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:

data in_data;
  do recid = 1 to 1.000001e7;
    datavar = 1;
    output;
  end;
run;


data _null_;
  if 0 then set in_data;
  declare hash h_out();
  h_out.defineKey('_n_');
  h_out.defineData('recid','datavar');
  h_out.defineDone();

  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;

We define a hash object with the appropriate parameters, and simply tell it to output every 250k records, and clear it. We could do a hash-of-hashes here also, particularly if it weren't just "Every 250k records" but some other criteria drove things, but then you'd have to fit all of the records in memory, not just 250k of them.

Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:

data _null_;
  if 0 then set in_data;
  declare hash h_out(dataset:'in_data(obs=0)');
  h_out.defineKey('recid');
  h_out.defineData(all:'y');
  h_out.defineDone();

  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;

Since we can't use _n_ anymore for the hash ID due to using the dataset option on the constructor (necessary for the all:'y' functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.



回答5:

Here is a basic approach. This requires manual adjustment of the intervals, but is easy to understand.

* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;


data output2;
set df;
if 5 <= _N_ < 10 then output;
run;


data output3;
set df;
if 10 <= _N_ < 15 then output;
run;


data output4;
set df;
if 15 <= _N_ < 22 then output;
run;


回答6:

The first result on Google is from the SAS User Group International (SUGI) These folks are your friends.

The article is here: http://www2.sas.com/proceedings/sugi27/p083-27.pdf

The code is:

%macro split(ndsn=2); 
data %do i = 1 %to &ndsn.; dsn&i. %end; ; 
 retain x; 
 set orig nobs=nobs; 
 if _n_ eq 1 
 then do; 
 if mod(nobs,&ndsn.) eq 0 
 then x=int(nobs/&ndsn.); 
 else x=int(nobs/&ndsn.)+1; 
 end; 
 if _n_ le x then output dsn1; 
 %do i = 2 %to &ndsn.; 
 else if _n_ le (&i.*x) 
 then output dsn&i.; 
 %end; 
 run; 
%mend split; 

%split(ndsn=10);

All you need to do is replace the 10 digit in "%split(ndsn=10);" with the number you require. In Line 4, "set orig nobs=nobs;", simply replace orig with your dataset name.

Hey presto!



标签: macros split sas