I'm trying to learn R and there are a few things I've done for 10+ years in SAS that I cannot quite figure out the best way to do in R. Take this data:
id class t count desired
-- ----- ---------- ----- -------
1 A 2010-01-15 1 1
1 A 2010-02-15 2 3
1 B 2010-04-15 3 3
1 B 2010-09-15 4 4
2 A 2010-01-15 5 5
2 B 2010-06-15 6 6
2 B 2010-08-15 7 13
2 B 2010-09-15 8 21
I want to calculate the column desired as a rolling sum by id, class, and within a 4 months rolling window. Notice that not all months are present for each combination of id and class.
In SAS I'd typically do this in one of 2 ways:
RETAIN
plus a by id & class.PROC SQL
with a left join from df as df1 to df as df2 on id, class and the df1.d-df2.d within the appropriate window
What is the best R approach to this type of problem?
t <- as.Date(c("2010-01-15","2010-02-15","2010-04-15","2010-09-15",
"2010-01-15","2010-06-15","2010-08-15","2010-09-15"))
class <- c("A","A","B","B","A","B","B","B")
id <- c(1,1,1,1,2,2,2,2)
count <- seq(1,8,length.out=8)
desired <- c(1,3,3,4,5,6,13,21)
df <- data.frame(id,class,t,count,desired)
A farily efficient answer to this problem could be found using the data.table library.
Here are a few solutions:
1) zoo Using
ave
, for each group create a monthly series,m
, by merging the original series,z
, with a grid,g
. Then calculate the rolling sum and retain only the original time points:which gives:
Note We have assumed the times are ordered within each group (as in the question). If that is not so then sort
df
first.2) sqldf
which gives:
Note: If the merge should be too large to fit into memory then use
sqldf("...", dbname = tempfile())
to cause the intermediate results to be stored in a database which it creates on the fly and automatically destroys afterwards.3) Base R The sqldf solution motivates this base R solution which just translates the SQL into R:
The result is:
Note: This does do a merge in memory which might be a problem if the data set is very large.
UPDATE: Minor simplifications of first solution and also added second solution.
UPDATE 2: Added third solution.
I'm almost embarrassed to post this. I'm usually pretty good as these, but there's got to be a better way.
This first uses
zoo
'sas.yearmon
to get the dates in terms of just month and year, then reshapes it to get one column for eachid
/class
combination, then fills in with zeros before, after, and for missing months, then useszoo
to get the rolling sum, then pulls out just the desired months and merges back with the original data frame.