可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a very large table with an indexed datetime field. I want to do by group processing on the dataset by month and only output the last observation in each month.

The problem is that it doesn't contain a month field so I can't use something like this:

if last.month then do;
  output;
end;

Is there a way I can achieve this kind of behaviour without having to add a month field in a previous datastep? The table is 50 gig compressed so I want to avoid any unnecessary steps.

Thanks

回答1:

You can actually achieve this using 'by groupformat' against your original dataset, formatting the datetime field as 'dtmonyy5.' As the name implies, this groups by the formatted values instead of the original.

data new1;
set old;
format datetime dtmonyy5.;
by groupformat datetime;
if last.datetime;
run;

Another method is to use Proc Summary, although this can be memory intensive, particularly against large datasets. Here is the code.

proc summary data=old nway;
class datetime;
format datetime dtmonyy5.;
output out=new2 (drop=_:) maxid(datetime(_all_))=;
run;

Just a quick note on the previous answer, the 'month' function works against date fields, not datetime, so you would need to add the datepart function to the line.

month = month(datepart(datetime));

回答2:

This is a good situation to use a data view. This allows you to add variables that are created on-the-fly when the data set is processed, without creating a new physical data set.

data new / view=new;
 set old;
 month = month(datepart(datetime));
run;

If you run this, you'll see the processing time is less than a second, so you know it's not generating a new 50 GB data set. But you can just use the data view new with your code, as if it were a data set.