I have a data set with 400 observations of 4 digit codes which I would like to pad with a space on both sides
ex. Dataset
obs code
1 1111
2 1112
3 3333
.
.
.
400 5999
How can I go through another large data set and replace every occurrence of any of the padded 400 codes with a " ".
ex. Large Dataset
obs text
1 abcdef 1111 abcdef
2 abcdef 1111 abcdef 1112 8888
3 abcdef 1111 abcdef 11128888
...
Data set that I want
ex. New Data set
obs text
1 abcdef abcdef
2 abcdef abcdef 8888
3 abcdef abcdef 11128888
...
Note: I'm only looking to replace 4 digit codes that are padded on both sides by a space. So in obs 3, 1112 won't be replaced.
I've tried doing the following proc sql statement, but it only finds and replaces the first match, instead of all the matches.
proc sql;
select
*,
tranwrd(large_dataset.text, trim(small_dataset.code), ' ') as new_text
from large_dataset
left join small_dataset
on findw(large_dataset.text, trim(small_dataset.code))
;
quit;
You could just use a DO loop to scan through the small dataset of codes for each record in the large dataset. If you want to use TRANWRD()
function then you will need to add extra space characters.
data want ;
set have ;
length code $4 ;
do i=1 to nobs while (text ne ' ');
set codes(keep=code) nobs=nobs point=i ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2);
end;
drop code;
run;
The DO loop will read the records from your CODES list. Using the POINT= option on the SET statement lets you read the file multiple times. The WHILE clause will stop if the TEXT string is empty since there is no need to keep looking for codes to replace at that point.
If your list of codes is small enough and you can get the right regular expression then you might try using PRXCHANGE()
function instead. You can use an SQL step to generate the codes as a list that you can use in the regular expression.
proc sql noprint ;
select code into :codelist separated by '|'
from codes
;
quit;
data want ;
set have ;
text=prxchange("s/\b(&codelist)\b/ /",-1,text);
run;
There might be more efficient ways of doing this, but this seems to work fairly well:
/*Create test datasets*/
data codes;
input code;
cards;
1111
1112
3333
5999
;
run;
data big_dataset;
infile cards truncover;
input text $100.;
cards;
abcdef 1111 abcdef
abcdef 1111 abcdef 1112 8888
abcdef 1111 abcdef 11128888
;
run;
/*Get the number of codes to use for array definition*/
data _null_;
set codes(obs = 1) nobs = nobs;
call symput('ncodes',nobs);
run;
%put ncodes = &ncodes;
data want;
set big_dataset;
/*Define and populate array with padded codes*/
array codes{&ncodes} $6 _temporary_;
if _n_ = 1 then do i = 1 to &ncodes;
set codes;
codes[i] = cat(' ',put(code,4.),' ');
end;
do i = 1 to &ncodes;
text = tranwrd(text,codes[i],' ');
end;
drop i code;
run;
I expect a solution using prxchange
is also possible, but I'm not sure how much work it is to construct a regex that matches all of your codes compared to just substituting them one by one.
Taking Tom's solution and putting the code-lookup into a hash-table. Thereby the dataset will only be read once and the actual lookup is quite fast. If the Large Dataset is really large this will make a huge difference.
data want ;
if _n_ = 1 then do;
length code $4 ;
declare hash h(dataset:"codes (keep=code)") ;
h.defineKey("code") ;
h.defineDone() ;
call missing (code);
declare hiter hiter('h') ;
end;
set big_dataset ;
rc = hiter.first() ;
do while (rc = 0 and text ne ' ') ;
text = substr(tranwrd(' '||text,' '||code||' ',' '),2) ;
rc = hiter.next() ;
end ;
drop code rc ;
run;
Use array and regular express:
proc transpose data=codes out=temp;
var code;
run;
data want;
if _n_=1 then set temp;
array var col:;
set big_dataset;
do over var;
text = prxchange(cats('s/\b',var,'\b//'),-1,text);
end;
drop col:;
run;