I'm very very new to R and am looking at ways of recreating an Excel VBA macro and Excel worksheet functions such as SUMIFS. SUMIFS sums a column if the row has entries matching multiple conditions on its other columns.
I have the below data frame and I want to compute a new column. The new column is the sum of Sample
for all rows that overlap with the Start Date
and EndDate
range. For example on line 1
it would be 697
(the sum of the first 3 lines
). The criteria for the sum specifically: include Sample
if EndDate >= StartDate[i] & StartDate <=EndDate[i]
StartDate EndDate Sample *SUMIFS example*
10/01/14 24/01/14 139 *697*
12/01/14 26/01/14 136
19/01/14 02/02/14 422
25/01/14 08/02/14 762
29/01/14 12/02/14 899
05/02/14 19/02/14 850
07/02/14 21/02/14 602
09/02/14 23/02/14 180
18/02/14 04/03/14 866
Any comments or pointers would be greatly appreciated.
You could do this with a loop or with a Cartesian merge. I don't know of any built in functions to do exactly this.
library(dplyr)
x = structure(list(StartDate = structure(c(1389312000, 1389484800,
1390089600, 1390608000, 1390953600, 1391558400, 1391731200, 1391904000,
1392681600), tzone = "UTC", class = c("POSIXct", "POSIXt")),
EndDate = structure(c(1390521600, 1390694400, 1391299200,
1391817600, 1392163200, 1392768000, 1392940800, 1393113600,
1393891200), tzone = "UTC", class = c("POSIXct", "POSIXt"
)), Sample = c(139L, 136L, 422L, 762L, 899L, 850L, 602L,
180L, 866L)), .Names = c("StartDate", "EndDate", "Sample"
), row.names = c(NA, -9L), class = "data.frame")
x2 = x
names(x2)=c('StartDate2','EndDate2','Sample2')
x3 = merge(x,x2,allow.cartesian =T)
x4 = summarise(group_by(x3,StartDate,EndDate),
sumifs=sum(Sample2[EndDate2 >= StartDate & StartDate2 <= EndDate]))
x_sumifs = merge(x,x4,by=c('StartDate','EndDate'))
This is what the output looks like.
> x_sumifs
StartDate EndDate Sample sumifs
1 2014-01-10 2014-01-24 139 697
2 2014-01-12 2014-01-26 136 1459
3 2014-01-19 2014-02-02 422 2358
4 2014-01-25 2014-02-08 762 3671
5 2014-01-29 2014-02-12 899 3715
6 2014-02-05 2014-02-19 850 4159
7 2014-02-07 2014-02-21 602 4159
8 2014-02-09 2014-02-23 180 3397
9 2014-02-18 2014-03-04 866 2498
You could use lapply/sapply
from base R
to do this. x
from @cameron.bracken's post.
x$sumifs <- sapply(seq_len(nrow(x)), function(i) with(x,
sum(Sample[EndDate >= StartDate[i] & StartDate <= EndDate[i]])))
x
# StartDate EndDate Sample sumifs
#1 2014-01-10 2014-01-24 139 697
#2 2014-01-12 2014-01-26 136 1459
#3 2014-01-19 2014-02-02 422 2358
#4 2014-01-25 2014-02-08 762 3671
#5 2014-01-29 2014-02-12 899 3715
#6 2014-02-05 2014-02-19 850 4159
#7 2014-02-07 2014-02-21 602 4159
#8 2014-02-09 2014-02-23 180 3397
#9 2014-02-18 2014-03-04 866 2498
Assuming you have the above data in a data frame called df
:
sum(df$Sample[EndDate >= df$StartDate & StartDate <= df$EndDate])
That is:
df$Sample[...]
selects the Sample
column, with conditions specified in [...]
EndDate >= df$StartDate
and StartDate <= df$EndDate
are from your example, converted to R conditions, with &
in between to require both conditions to be true at the same time. Notice that there are no i
indexes in the expression. That's how it works in R, the expression is evaluated for each row in the data frame, and the result of df$Sample[...]
is a vector of values, only the values where the expression in [...]
was true
sum
is of course a built-in function to calculate the sum, naturally
You can use the 'by' function to get the value. In 'by' data frame is split by row into data frames subsetted by the values of one or more factors, and a function is applied to each subset in turn.
x$sumifs <- by(Sample[EndDate >= StartDate[i] & StartDate <= EndDate[i]],sum)
More details about the function can be found here