Say I have data that looks like this:
date, user, items_bought, event_number
2013-01-01, x, 2, 1
2013-01-02, x, 1, 2
2013-01-03, x, 0, 3
2013-01-04, x, 0, 4
2013-01-04, x, 1, 5
2013-01-04, x, 2, 6
2013-01-05, x, 3, 7
2013-01-06, x, 1, 8
2013-01-01, y, 1, 1
2013-01-02, y, 1, 2
2013-01-03, y, 0, 3
2013-01-04, y, 5, 4
2013-01-05, y, 6, 5
2013-01-06, y, 1, 6
To get the cumulative sum per user per data point, I was doing:
data.frame(cum_items_bought=unlist(tapply(as.numeric(data$items_bought), data$user, FUN = cumsum)))
The output from this looks like:
date, user, cum_items_bought
2013-01-01, x, 2
2013-01-02, x, 3
2013-01-03, x, 3
2013-01-04, x, 3
2013-01-04, x, 4
2013-01-04, x, 6
2013-01-05, x, 9
2013-01-06, x, 10
2013-01-01, y, 1
2013-01-02, y, 2
2013-01-03, y, 2
2013-01-04, y, 7
2013-01-05, y, 13
2013-01-06, y, 14
However, I want to restrict my sum to only add up the purchases that happened within 3 days of each row (relative to the user), i.e. the output needs to look like this:
date, user, cum_items_bought_3_days
2013-01-01, x, 2
2013-01-02, x, 3
2013-01-03, x, 3
2013-01-04, x, 1
2013-01-04, x, 2
2013-01-04, x, 4
2013-01-05, x, 6
2013-01-06, x, 7
2013-01-01, y, 1
2013-01-02, y, 2
2013-01-03, y, 2
2013-01-04, y, 6
2013-01-05, y, 11
2013-01-06, y, 12
It seems like the packages `xts` and `zoo` contain functions that do what you want, although you may have the same problems with the size of your actual dataset as with @alexis_laz's answer. Using the functions from the `xts` answer to this question seems to do the trick.

First I took the code from the answer I link to above and made sure it worked for just one user. I include the `apply.daily` function because I believe from your edits/comments that you have multiple observations for some days for some users - I added an extra line to the toy dataset to reflect this.

I thought the output could look nicer (more like the example output from your question). I haven't worked with `zoo` objects much, but the answer to this question gave me some pointers for putting the info into a `data.frame`.

Once I had this worked out for one user, it was straightforward to expand this to the entire toy dataset. This is where speed could become an issue. I use `lapply` and `do.call` for this step.
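A sketch along those lines (my reconstruction rather than the linked answer's exact code; it assumes the sample data from the question with consecutive daily dates, and it returns one row per user and date):

```r
library(xts)
library(zoo)

data$date <- as.Date(data$date)

one_user <- function(d) {
  x     <- xts(d$items_bought, order.by = d$date)
  daily <- apply.daily(x, sum)     # collapse multiple observations per day
  # trailing 3-day sum; width 3 equals 3 days only for consecutive dates,
  # and partial = TRUE shortens the window at the start of the series
  roll  <- rollapplyr(as.zoo(daily), 3, sum, partial = TRUE)
  data.frame(date = index(roll), user = d$user[1],
             cum_items_bought_3_days = as.numeric(roll))
}

res <- do.call(rbind, lapply(split(data, data$user), one_user))
```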
I like James' answer better, but here's an alternative:
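For instance, a base R sketch that keeps `cumsum` but uses `findInterval` to subtract everything that falls out of the 3-day window (my own illustration of one possible alternative; it assumes rows are sorted by `user`, `date`, and `event_number`):

```r
data$date <- as.Date(data$date)

window_sum <- function(d) {
  cs <- cumsum(d$items_bought)
  # number of rows whose date falls strictly before the window [date - 2, date]
  before <- findInterval(as.numeric(d$date) - 3, as.numeric(d$date))
  cs - c(0, cs)[before + 1]
}

data$cum_items_bought_3_days <- unlist(lapply(split(data, data$user), window_sum))
```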
Here's a `dplyr` solution which will produce the desired result (14 rows) as specified in the question. Note that it takes care of duplicate date entries, for example, 2013-01-04 for user x.

In my answer I use a custom function `myfunc` inside a `dplyr` chain. This is done using the `do` operator from `dplyr`. The custom function is passed the subsetted df by `user` groups. It then uses `sapply` to pass each `event_number` and calculate the sums of `items_bought`. The last line of the `dplyr` chain deselects the undesired columns.
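A minimal sketch of such a chain (my reconstruction of the idea; the original function body may have differed):

```r
library(dplyr)

# for each event, sum the same user's items_bought over rows whose date lies
# in the 3-day window ending at that event's date, up to the current event
myfunc <- function(x) {
  x$cum_items_bought_3_days <- sapply(x$event_number, function(e) {
    cur_date <- x$date[x$event_number == e]
    sum(x$items_bought[x$event_number <= e & x$date >= cur_date - 2])
  })
  x
}

result <- data %>%
  mutate(date = as.Date(date)) %>%
  group_by(user) %>%
  do(myfunc(.)) %>%
  ungroup() %>%
  select(-items_bought, -event_number)   # deselect the undesired columns
```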
Let me know if you'd like a more detailed explanation.

Edit after comment by OP:
If you need more flexibility to also conditionally sum up other columns, you can adjust the code as follows. I assume here that the other columns should be summed up the same way as `items_bought`. If that is not correct, please specify how you want to sum up the other columns.

I first create two additional columns with random numbers in the data (I'll post a `dput` of the data at the bottom of my answer):
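For illustration (the actual random values used in the original are unknown; the seed here is only for reproducibility):

```r
set.seed(1)   # assumed seed, purely for reproducibility
data$newCol1 <- sample(0:10, nrow(data), replace = TRUE)
data$newCol2 <- sample(0:10, nrow(data), replace = TRUE)
```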
Next, you can modify `myfunc` to take 2 arguments instead of 1. The first argument will remain the subsetted data.frame as before (represented by `.` inside the dplyr chain and `x` in the function definition of `myfunc`), while the second argument to `myfunc` will specify the column to sum up (`colname`).
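A sketch of the two-argument version:

```r
myfunc <- function(x, colname) {
  sapply(x$event_number, function(e) {
    cur_date <- x$date[x$event_number == e]
    # sum the chosen column over the 3-day window, up to the current event
    sum(x[[colname]][x$event_number <= e & x$date >= cur_date - 2])
  })
}
```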
Then, you can use `myfunc` several times if you want to conditionally sum up several columns:
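For example (assuming `data$date` is of class Date):

```r
result <- data %>%
  group_by(user) %>%
  do(mutate(.,
            cum_items_bought_3_days = myfunc(., "items_bought"),
            cum_newCol1_3_days      = myfunc(., "newCol1"),
            cum_newCol2_3_days      = myfunc(., "newCol2"))) %>%
  ungroup()
```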
Now you have created conditional sums of the columns `items_bought`, `newCol1`, and `newCol2`. You can also leave out any of the sums in the dplyr chain or add more columns to sum up.

Edit #2 after comment by OP:
To calculate the cumulative sum of distinct (unique) items bought per user, you could define a second custom function `myfunc2` and use it inside the dplyr chain. This function is as flexible as `myfunc`, so you can define the columns to which you want to apply the function. The code would then be:
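A sketch, with the caveat that "distinct" is interpreted here as summing only the unique values of the chosen column within each window:

```r
myfunc2 <- function(x, colname) {
  sapply(x$event_number, function(e) {
    cur_date <- x$date[x$event_number == e]
    sum(unique(x[[colname]][x$event_number <= e & x$date >= cur_date - 2]))
  })
}

result <- data %>%
  group_by(user) %>%
  do(mutate(., cum_distinct_items_bought_3_days = myfunc2(., "items_bought"))) %>%
  ungroup()
```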
Here is the data I used:
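Rebuilt from the sample in the question (the random `newCol1`/`newCol2` values from the original `dput` are not reproduced here; they are generated with `sample` above):

```r
data <- data.frame(
  date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-04", "2013-01-04", "2013-01-05", "2013-01-06",
                   "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-05", "2013-01-06")),
  user = rep(c("x", "y"), c(8, 6)),
  items_bought = c(2, 1, 0, 0, 1, 2, 3, 1, 1, 1, 0, 5, 6, 1),
  event_number = c(1:8, 1:6),
  stringsAsFactors = FALSE
)
```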
The following looks valid:
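A minimal base R sketch of such an approach (my reconstruction; it assumes the rows are sorted by `user` and `event_number`, as in the sample):

```r
data$date <- as.Date(data$date)

# for each row, sum the same user's purchases whose date falls in the
# 3-day window [date - 2, date], counting only rows up to the current one
data$cum_items_bought_3_days <- unlist(lapply(split(data, data$user),
  function(d) sapply(seq_len(nrow(d)), function(i)
    sum(d$items_bought[1:i][d$date[1:i] >= d$date[i] - 2]))))
```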
Where `data` is the sample from the question, with `date` converted to the Date class (as in the first line of the sketch above).

I'd like to propose an additional `data.table` approach combined with the `zoo` package's `rollapplyr` function.

First, we will aggregate the `items_bought` column per `user` per unique `date` (as you pointed out, there can be more than one row per date for a user):
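A sketch of this step (assuming the sample data from the question, with `date` of class Date):

```r
library(data.table)

setDT(data)
# one row per user and date, summing purchases within each day
agg <- data[, .(items_bought = sum(items_bought)), by = .(user, date)]
```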
Next, we will compute `rollapplyr` combined with `sum` and `partial = TRUE` in order to cover the margins (thanks for the advice @G. Grothendieck) in 3-day intervals.

This is the data set I've used: the same 14-row sample from the question (see the construction given earlier).
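A sketch of the rolling step (a width of 3 rows corresponds to 3 days only because, after aggregation, each user has one row per consecutive date; `partial = TRUE` shortens the window at the start):

```r
library(zoo)

agg[, cum_items_bought_3_days := rollapplyr(items_bought, 3, sum, partial = TRUE),
    by = user]
```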
Here is an approach that doesn't use cumsum but a nested `lapply` instead. The first one goes over the users and then, for each user, the second `lapply` constructs the desired data frame by summing all items bought within the last 2 days of each date. Note that if `data$date` were not sorted, it would have to be sorted in ascending order first.
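A sketch of that approach (assuming `data$date` is of class Date, sorted ascending, and one row per user and date; the edit below deals with duplicate dates):

```r
res <- do.call(rbind, lapply(split(data, data$user), function(d)
  do.call(rbind, lapply(seq_len(nrow(d)), function(i)
    data.frame(date = d$date[i],
               user = d$user[i],
               cum_items_bought_3_days =
                 sum(d$items_bought[d$date >= d$date[i] - 2 &
                                    d$date <= d$date[i]]))))))
```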
Edit

To deal with the issue of having several timestamps for each day (more than 1 row per date), I would first aggregate by summing all items bought at each time in the same day. You can do that e.g. using the built-in function `aggregate`, but if your data is too large you can also use `data.table` for speed. I'll call your original data frame (with more than 1 row per date) `predata` and the aggregated one (1 row per date) `data`. So by calling the snippet below, you get a data frame containing one row per date and the columns date, user, and items_bought.
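For example, using the formula interface:

```r
# one row per user/date; re-sort the result if a particular order is needed
data <- aggregate(items_bought ~ date + user, data = predata, FUN = sum)
```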
Now, I think the following way will be faster than the nested `lapply` above, but I am not sure since I cannot test it on your data. I am using data.table because it is meant to be fast (if used the right way, which I am not sure this is). The inner loop will be replaced by a function `f`:
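A sketch of what this could look like (my reading of the idea; `f` computes one user's 3-day window sum for a given date):

```r
library(data.table)

dt <- as.data.table(data)

f <- function(day, d) sum(d$items_bought[d$date >= day - 2 & d$date <= day])

# the outer loop over users is the by = user grouping;
# the inner loop over dates is the sapply call
dt[, cum_items_bought_3_days := sapply(date, f, d = .SD), by = user]
```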
I do not know if there is a neater way, avoiding this function and replacing the double loop with only one call to data.table, or how to write a data.table call that would execute faster.

Another way, which doesn't use data.table, assuming that you have enough RAM (again, I don't know the size of your data), is to store the items bought 1 day before in a vector, then the items bought 2 days before in another vector, etc., and to sum them up in the end. Something like:
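A sketch (assuming one row per user and date, as after the aggregation above, with `date` of class Date; missing days contribute 0):

```r
key   <- paste(data$user, data$date)
# each user's purchases exactly 1 and 2 days earlier (NA if that day is absent)
prev1 <- data$items_bought[match(paste(data$user, data$date - 1), key)]
prev2 <- data$items_bought[match(paste(data$user, data$date - 2), key)]

data$cum_items_bought_3_days <- data$items_bought +
  ifelse(is.na(prev1), 0, prev1) +
  ifelse(is.na(prev2), 0, prev2)
```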
A final thing I would try would be to parallelize the `lapply` calls, e.g. by using the function `mclapply` instead, or by re-writing the code using the parallel functionality of `foreach` or `plyr`. Depending on the strength of your PC and the size of the task, this may outperform the single-core data.table performance...