Replace missing value with previous or next observ

2019-08-15 08:00发布

问题:

Here is a sample dataset. The scenario below represents a list of account transaction history:

  • Two customers A and B;
  • Customer A has 5 accounts 1, 2, 3, 4, 5 at Bank BA and Bank UB;
  • Customer B has 2 accounts 1, 6 at Bank WF;
  • Note that Bank WF has issued an account ID to B that's same as A has at Bank BA (which could be true in reality).

    data have;
     input Customer $ Account_id Bank_id $ Year;
    datalines;
    A  1  BA    .
    A  2  UB    .
    A  3  UB 2012
    A  3  UB    .
    A  4  UB    .
    A  4  UB 2013
    A  5  UB    .
    B  1  WF 2014
    B  1  WF    .
    B  6  WF    .
    ;
    

Some of the account open dates are missing, but for the same account, the open date should be the same. What I'm trying to do is to replace the missing date of an account with its previous/later value, if there's any; otherwise, keep it null.

    data want;
     input Customer $ Account_id Bank_id $ Year;
    datalines;
    A  1  BA    .
    A  2  UB    .
    A  3  UB 2012
    A  3  UB 2012
    A  4  UB 2013
    A  4  UB 2013
    A  5  UB    .
    B  1  WF 2014
    B  1  WF 2014
    B  6  WF    .
    ;

I looked at this example: How to write first non-missing value to first missing observations which was helpful, but I can't tweak it to work for my case, which has multiple groups.

回答1:

You can do this in a single proc sql step :

proc sql ;
  create table want2 as
  select a.Customer, a.Account_id, a.Bank_id, max(a.Year,b.Year) as Year
  from have a
       left join
       (select Customer, Account_id, Bank_id, max(Year) as Year
        from have
        group by Customer, Account_id, Bank_id) b on a.Customer = b.Customer
                                                    and a.Account_id = b.Account_id
                                                    and a.Bank_id = b.Bank_id ;
quit ;


回答2:

This should do it:

proc sort data=have;
  by Customer
     Bank_id
     Account_id
     descending Year;
run;

data want;
  set have;
  by Customer Bank_id Account_id;
  retain year_tmp (.);

  if not last.Account_id and Year ne . then year_tmp=Year;
  else if Year = . then Year = year_tmp;
  if last.Account_id then year_tmp = .;

  drop year_tmp;
run;

What we did was to declare a retain variable, that is, one that keeps its value from one row to the next. Then using the last flag, we either store the existing Year in that retain variable (or attribute the already stored value to a missing Year). Finally we reset the retain variable at the last row of each Account_id.



标签: sas