Calculate rolling sum for one column in time inter

2019-08-29 23:03发布

问题:

I have one problem and I think there is not much to correct to work right. I have table (with desired output column 'sum_usage'):

id  opt    t_purchase            t_spent       bonus   usage sum_usage
a    1  10NOV2017:12:02:00  10NOV2017:14:05:00   100     9        15
a    1  10NOV2017:12:02:00  10NOV2017:15:07:33   100     0        15
a    1  10NOV2017:12:02:00  10NOV2017:13:24:50   100     6        6
b    1  10NOV2017:13:54:00  10NOV2017:14:02:58   100     3        10
a    1  10NOV2017:12:02:00  10NOV2017:20:22:07   100    12        27
b    1  10NOV2017:13:54:00  10NOV2017:13:57:12   100     7 .      7

So, I need to sum all usage values from time_purchase (for one id, opt combination (group by id, opt) there is just one unique time_purchase) until t_spent. Also, I have about milion rows, so hash table would be the best solution. I've tried with:

data want;
 if _n_=1 then do;
  if 0 then set have(rename=(usage=_usage));
  declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
  h.definekey('id','opt', 't_purchase', 't_spent');
  h.definedata('_usage');
  h.definedone();
 end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
 if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;

The fifth line from the bottom is not correct for sure (do i=intck('second', t_purchase, t_spent), but have no idea how to approach this. So, the main problem is how to set up time interval to calculate this. I have already one function in this hash table func with the same keys, but without time interval, so it would be pretty good to write this one too, but it's not necessary.

回答1:

Personally, I would ditch the hash and use SQL.

Example Data:

data have;

input id $ opt    
    t_purchase  datetime20.
    t_spent     datetime20.
    bonus   usage sum_usage;

format 
    t_purchase  datetime20.
    t_spent     datetime20.;

datalines;
a    1  10NOV2017:12:02:00  10NOV2017:14:05:00   100     9        15
a    1  10NOV2017:12:02:00  10NOV2017:15:07:33   100     0        15
a    1  10NOV2017:12:02:00  10NOV2017:13:24:50   100     6        6
b    1  10NOV2017:13:54:00  10NOV2017:14:02:58   100     3        10
a    1  10NOV2017:12:02:00  10NOV2017:20:22:07   100    12        27
b    1  10NOV2017:13:54:00  10NOV2017:13:57:12   100     7       7
;

I'm leaving your sum_usage column here for comparison.

Now, create a table of sums. New value is sum_usage2.

proc sql noprint;
create table sums as
select a.id,
       a.opt,
       a.t_purchase,
       a.t_spent,
       sum(b.usage) as sum_usage2
    from have as a,
         have as b
    where a.id = b.id
      and a.opt = b.opt
      and b.t_spent <= a.t_spent
      and b.t_spent >= a.t_purchase
    group by a.id, 
       a.opt,
       a.t_purchase,
       a.t_spent;
quit;

Now that you have the sums, join them back to the original table:

proc sql noprint;
create table want as
select a.*,
       b.sum_usage2
    from have as a
      left join
         sums as b
      on a.id = b.id
      and a.opt = b.opt
      and a.t_spent = b.t_spent
      and a.t_purchase = b.t_purchase;
quit;

This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a Data Step (which can be faster given the size).

data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
    %create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;

rc = lk.find();

drop rc;
run;

%create_hash() macro available here https://github.com/FinancialRiskGroup/SASPerformanceAnalytics



回答2:

I believe this question is a morph of one your earlier ones where you compute a rolling sum by do a hash lookup for every second over a 3 hour period for each record in your data set. Hopefully you realized the simplicity of that approach has a large cost of 3*3600 hash lookups per record as well as having to load the entire data vector into a hash.

The time log data presented has new records inserted at the top of the data, and I presume the data to be descending monotonic in time.

A DATA Step can, in a single pass of monotonic data, compute the rolling sum over a time window. The technique uses 'ring' arrays, where-in index advancement is adjusted by modulus. One array is for the time and the other is for the metric (usage). The required array size is the maximum number of items that could occur within the time window.

Consider some generated sample data with time steps of 1, 2, and one jump of 200 seconds:

data have;
  time = '12oct2017:11:22:32'dt;
  usage = 0;
  do _n_ = 1 to &have_count;
     time + 2; *ceil(25*ranuni(123));
     if _n_ > 30 then time + -1;
     if _n_ = 145 then time + 200;
     usage = floor(180*ranuni(123));
     delta = time-lag(time);
     output;
  end;
run;

Start with the case of computing a rolling sum from prior items when sorted time ascending. (The descending case will follow):

The example parameters are RING_SIZE 16 and TIME_WINDOW of 12 seconds.

%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;

data want;
  array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
  array ring_time  [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);

  retain ring_tail 0 ring_head -1 span 0 span_usage 0;

  set have;
  by time ; * cause error if data not sorted per algorithm requirement;

  * unload from accumulated usage the tail items that fell out the window;
  do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
    span + -1;

    span_usage + -ring_usage(ring_tail);
    ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
  end;

  ring_head = mod ( ring_head + 1, &RING_SIZE );
  span + 1;

  if span > 1 and (ring_head = ring_tail) then do;
    _n_ = dim(ring_time);
    put 'ERROR: Ring array too small, size=' _n_;
    abort cancel;
  end;

  * update the ring array;
  ring_time(ring_head) = time;
  ring_usage(ring_head) = usage;

  span_usage + usage;

  drop ring_tail ring_head span;
run;

For the case of data sorted descending, you could jiggle things; sort ascending, compute rolling and resort descending.

What to do if such a jiggle can't be done, or you just want a single pass?

The items to be part of the rolling calculation have to be from 'lead' rows, or rows not yet read via SET. How is this possible ? A second SET statement can be used to open a separate channel to the data set, and thus obtain lead values.

There is a little more bookkeeping for processing lead data -- premature overwrite and diminished window at the end of data need to be handled.

data want2;
  array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
  array ring_time  [-1:%eval(&RING_SIZE-1)] _temporary_;

  retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;

  set have;

&debug put / _N_ ':' time= ring_head=;

  * unload ring_head slotted item from sum;
  span + -1;
  span_usage + -ring_usage(ring_head);

  * advance ring_head slot by 1, the vacated slot will be overwritten by lead;
  ring_head = mod ( ring_head + 1, &RING_SIZE ); 

&debug put +2 ring_time(ring_head)= span= 'head';

  * load ring with lead values via a second SET of the same data;
  if not end2 then do;

    do until (_n_ > 1 or lead_index = 0 or end2);
      set have(keep=time usage rename=(time=t usage=u)) end=end2;  * <--- the second SET ;

      if end2 then guard_index = lead_index;

&debug if end2 then put guard_index=;

      ring_time(lead_index) = t;
      ring_usage(lead_index) = u;

&debug put +2 ring_time(lead_index)=  'lead';

      lead_index = mod ( lead_index + 1, &RING_SIZE);
    end;
  end;

  * advance ring_tail to cover the time window;
  if ring_tail ne guard_index then do;

      ring_tail_was = ring_tail;
      ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;

      do while (time - ring_time(ring_tail) <= &TIME_WINDOW);

          span + 1;
          span_usage + ring_usage(ring_tail);

&debug put +2 ring_time(ring_tail)= span= 'seek';

          ring_tail_was = ring_tail;
          ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;

          if ring_tail_was = guard_index then leave;

          if span > 1 and (ring_head = ring_tail) then do;
            _n_ = dim(ring_time);
            put 'ERROR: Ring array too small, size=' _n_;
            abort cancel;
          end;
      end;

      * seek went beyond window, back tail off to prior index;
      ring_tail = ring_tail_was;

  end;

&debug put +2 ring_time(ring_tail)= span= 'mark';

  drop lead_index t u ring_: guard_index span;

  format ring: span: usage 6.;
run;
options source;

Confirm both methods have the same computation:

proc sort data=want2; by time;
run;

proc compare noprint data=want compare=want2 out=diff outnoequal;
  id time;
  var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.

I have not benchmarked ring-array versus SQL versus Proc EXPAND versus Hash.

Caution: Dead reckoning rolling values using +in and -out operations can experience round-off errors when dealing with non-integer values.



标签: sas hashtable