How can I find and replace specific text in a SAS

2019-08-01 11:18发布

问题:

I have a data set with 400 observations of 4 digit codes which I would like to pad with a space on both sides

ex. Dataset 
obs code
1   1111 
2   1112
3   3333
.
.
.
400 5999

How can I go through another large data set and replace every occurrence of any of the padded 400 codes with a " ".

ex. Large Dataset
obs text 
1   abcdef 1111 abcdef
2   abcdef 1111 abcdef 1112 8888
3   abcdef 1111 abcdef 11128888
... 

Data set that I want

ex. New Data set
obs text
1   abcdef   abcdef
2   abcdef   abcdef   8888
3   abcdef   abcdef 11128888
...

Note: I'm only looking to replace 4 digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.

I've tried doing the following proc sql statement, but it only finds and replaces the first match, instead of all the matches.

proc sql;  
    select   
    *,  
    tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text  
from large_dataset  
    left join small_dataset  
    on findw(large_dataset.text, trim(small_dataset.code))
;
quit;

回答1:

You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use TRANWRD() function then you will need to add extra space characters.

data want ;
  set have ;
  length code $4 ;
  do i=1 to nobs while (text ne ' ');
    set codes(keep=code) nobs=nobs point=i ;
    text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
  end;
  drop code;
run;

The DO loop will read the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause will stop if the TEXT string is empty since there is no need to keep looking for codes to replace at that point.

If your list of codes is small enough and you can get the right regular expression then you might try using PRXCHANGE() function instead. You can use an SQL step to generate the codes as a list that you can use in the regular expression.

proc sql noprint ;
  select code into :codelist separated by '|'
  from codes
;
quit;

data want ;
  set have ;
  text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;


回答2:

There might be more efficient ways of doing this, but this seems to work fairly well:

/*Create test datasets*/
data codes;
input code;
cards;
1111 
1112
3333
5999
;
run;

data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;

/*Get the number of codes to use for array definition*/
data _null_;
    set codes(obs = 1) nobs = nobs;
    call symput('ncodes',nobs);
run;

%put ncodes = &ncodes;

data want;
    set big_dataset;
    /*Define and populate array with padded codes*/ 
    array codes{&ncodes} $6 _temporary_;
    if _n_ = 1 then do i = 1 to &ncodes;    
        set codes;
        codes[i] = cat(' ',put(code,4.),' '); 
    end;
    do i = 1 to &ncodes;
        text = tranwrd(text,codes[i],' ');
    end;
    drop i code;
run;

I expect a solution using prxchange is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.



回答3:

Taking Tom's solution and putting the code-lookup into a hash-table. Thereby the dataset will only be read once and the actual lookup is quite fast. If the Large Dataset is really large this will make a huge difference.

data want ;
  if _n_ = 1 then do;
    length code $4 ;
    declare hash h(dataset:"codes (keep=code)") ; 
    h.defineKey("code") ;
    h.defineDone() ;
    call missing (code);
    declare hiter hiter('h') ;
  end;
  set big_dataset ;

  rc = hiter.first() ;
  do while (rc = 0 and text ne ' ') ;
    text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
    rc = hiter.next() ;
  end ;
  drop code rc ;
run;


回答4:

Use array and regular express:

proc transpose data=codes out=temp;
var code;
run;

data want;
if _n_=1 then  set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;