Random sampling without replacement in longitudinal data

Posted 2019-05-26 19:58

Question:

My data is longitudinal.

VISIT ID   VAR1
1     001  ...
1     002  ...
1     003  ...
1     004  ...
...
2     001  ...
2     002  ...
2     003  ...
2     004  ...

Our end goal is to pick 10% of the IDs at each visit to run a test. I tried PROC SURVEYSELECT to do simple random sampling (SRS) without replacement, using VISIT as the strata variable, but the final sample can contain duplicate IDs. For example, ID=001 might be selected in both VISIT=1 and VISIT=2.

Is there a way to draw the sample so that no ID is selected more than once, using SURVEYSELECT or another procedure (R is also fine)? Thanks a lot.

Answer 1:

This is possible with some fairly creative data step programming. The code below uses a greedy approach: it processes each visit in turn and samples only from ids that have not already been sampled at an earlier visit. Because the 10% rate is applied to the ids still available for a visit, a visit contributes fewer than 10% of its ids once some of them have been sampled earlier. In the extreme case, when every id for a visit has already been sampled, no rows are output for that visit.

/*Create some test data*/
data test_data;
  call streaminit(1);
  do visit = 1 to 1000;
    do id = 1 to ceil(rand('uniform')*1000);
      output;
    end;
  end;
run;


data sample;
  /*Create a hash object to keep track of unique IDs not sampled yet*/
  if 0 then set test_data;
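  /*A seed of 0 gives a clock-based stream, so the sample differs on every run; use a positive seed for a reproducible sample*/
  call streaminit(0);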
  if _n_ = 1 then do;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('available');
    rc = h.definedone();
  end;
  /*Find out how many not-previously-sampled ids there are for the current visit*/
  do ids_per_visit = 1 by 1 until(last.visit);
    set test_data;
    by visit;
    if h.find() ne 0 then do;
      available = 1;
      rc = h.add();
    end;
    available_per_visit = sum(available_per_visit,available);
  end;
  /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
  samprate = 0.1;
  number_to_sample = round(available_per_visit * samprate,1);
  do _n_ = 1 to ids_per_visit;
    set test_data;
    if available_per_visit > 0 then do;
      rc = h.find();
      if available = 1 then do;
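        /*Selection sampling: choosing each remaining id with probability (still needed)/(still available) yields exactly number_to_sample ids*/
        if rand('uniform') < number_to_sample / available_per_visit then do;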
          available = 0;
          rc = h.replace();
          samples_per_visit = sum(samples_per_visit,1);
          output;
          number_to_sample = number_to_sample - 1;
        end;
        available_per_visit = available_per_visit - 1;
      end;
    end;
  end;
run;
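
Since the greedy approach can return fewer than 10% of the ids for later visits, it may be worth checking the rate actually achieved per visit. The following is only a sketch, assuming the test_data and sample data sets created above; the names sample_rates, n_ids, n_picked, n_sampled and achieved_rate are made up for illustration.

```sas
/*Hypothetical check: ids present at each visit vs. ids actually sampled from it*/
proc sql;
  create table sample_rates as
  select t.visit,
         t.n_ids,
         coalesce(s.n_picked, 0) as n_sampled,
         calculated n_sampled / t.n_ids as achieved_rate
  from (select visit, count(*) as n_ids
          from test_data
          group by visit) as t
  left join
       (select visit, count(*) as n_picked
          from sample
          group by visit) as s
  on t.visit = s.visit
  order by t.visit;
quit;
```

Visits with no sampled rows are kept by the left join and show n_sampled = 0, which makes the extreme case described above easy to spot.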

/*Check that there are no duplicate IDs: sample_dedup should have as many rows as sample*/
proc sort data = sample out = sample_dedup nodupkey;
  by id;
run;
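
The sort above removes duplicate ids rather than reporting them, so the check depends on comparing the row counts of sample and sample_dedup in the log. A more direct check, sketched here on the assumption that the sample data set from the earlier step exists, counts rows and distinct ids in one query; the column names n_rows and n_distinct_ids are illustrative.

```sas
/*The two counts should be equal if no id was sampled more than once*/
proc sql;
  select count(*) as n_rows,
         count(distinct id) as n_distinct_ids
  from sample;
quit;
```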