Find matches by condition between 2 datasets in SA

2020-04-18 03:22发布

I'm trying to improve the processing time used via an already existing for-loop in a *.jsl file my classmates and I are using in our programming course using SAS. My question: is there a PROC or sequence of statements that exist that SAS offers that can replicate a search and match condition? Or a way to go through unsorted files without going line by line looking for matching condition(s)?

Our current scrip file is below:

if( roadNumber_Fuel[n]==roadNumber_TO[m] & 
    fuelDate[n]>=tripStart[m] & fuelDate[n]<=TripEnd[m],
                newtripID[n] = tripID[m];
            );

I have 2 sets of data simplified below.

DATA1:
ID1     Date1
1      May 1, 2012
2      Jun 4, 2013
3      Aug 5, 2013
..
.

&

DATA2:
ID2    Date2         Date3       TRIP_ID
1      Jan 1 2012    Feb 1 2012  9876 
2      Sep 5 2013    Nov 3 2013  931 
1      Dec 1 2012    Dec 3 2012  236
3      Mar 9 2013    May 3 2013  390
2      Jun 1 2013    Jun 9 2013  811 
1      Apr 1 2012    May 5 2012  76

...
..
.

I need to check a lot of iterations but my goal is to have the code check:

Data1.ID1 = Data2.ID2 AND (Date1 >Date2 and Date1 < Date3)

My desired output dataset woudld be

ID1     Date1         TRIP_ID
1      May 1, 2012    76
2      Jun 4, 2013    811

Thanks for any insight!

标签: sas
1条回答
疯言疯语
2楼-- · 2020-04-18 03:52

You can do range matches in two ways. First off, you can match using PROC SQL if you're familiar with SQL:

proc sql;
 create tableC as
  select * from table A
   left join table B
     on A.id=B.id and A.date > B.date1 and A.date < B.date2
  ;
quit;

Second, you can create a format. This is usually the faster option if it's possible to do this. This is tricky when you have IDs, but you can do it.

First, create a new variable, ID+date. Dates are numbers around 18,000-20,000, so multiply your ID by 100,000 and you're safe.

Second, create a dataset from the range dataset where START=lower date plus id*100,000, END=higher date + id*100,000, FMTNAME=some string that will become the format name (must start with A-Z or _ and have A-Z, _, digits only). LABEL is the value you want to retrieve (Trip_ID in the above example).

data b_fmts; 
 set b;
 start=id*100000+date1;
 end  =id*100000+date2;
 label=value_you_want_out;
 fmtname='MYDATEF';
run;

Then use PROC FORMAT with CNTLIN=` option to import formats.

proc format cntlin=b_fmts;
quit;

Make sure your date ranges don't overlap - if they do this will fail.

Then you can use it easily:

data a_match;
 set a;
 trip_id=put(id*100000+date,MYDATEF.);
run;
查看更多
登录 后发表回答