I have a signal position indicator vector consisting of -1s and 1s. In addition, I have volume data that I want to sum based on the value of Signal. The basic data table looks like this:
df <- cbind(Signal, Volume)
head(df, 20)
Signal Volume
2016-01-04 NA 37912403
2016-01-05 -1 23258238
2016-01-06 -1 25096183
2016-01-07 -1 45172906
2016-01-08 -1 35402298
2016-01-11 -1 29932385
2016-01-12 -1 28395390
2016-01-13 -1 33410553
2016-01-14 -1 48658623
2016-01-15 1 46132781
2016-01-19 1 30998256
2016-01-20 -1 59051429
2016-01-21 1 30518939
2016-01-22 1 30495387
2016-01-25 1 32482015
2016-01-26 -1 26877080
2016-01-27 -1 58699359
2016-01-28 1 107475327
2016-01-29 1 62739548
2016-02-01 1 46132726
What I would like to achieve (without using a for loop) is to produce a vector of cumulative Volume that is reset every time the Signal changes. In addition, the values of Volume should be multiplied by the value of the Signal, i.e. when Signal is -1 it should add -Volume to the current cumulative Volume.
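To make the target concrete, here is a toy sketch of the desired behaviour (made-up numbers, not taken from the table above):
signal <- c(-1, -1, 1, 1, -1)    # hypothetical positions
volume <- c(10, 20, 30, 40, 50)  # hypothetical volumes
# desired signed cumulative volume, restarting at every change of signal:
# -10, -30, 30, 70, -50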
Based on similar questions on SO I have tried
ave(df$a, cumsum(c(F, diff(sign(diff(df$a))) != 0)*df$Volume), FUN=seq_along)
which produces the right grouping of Signal, but the Volume is not included for some reason. Without the reset, the solution is fairly straightforward (posted on SO):
require(data.table)
DT <- data.table(df)
DT[, Cum.Sum := cumsum(Volume), by=Signal]
Does anyone know a dplyr or data.table kind of solution for both resetting and conditioning a cumulative sum? Thanks.
This can be achieved by:
library(tidyverse)
library(data.table)
z %>%
  group_by(rleid(Signal)) %>%                 # rleid advances every time Signal changes; group by that
  mutate(cum = Signal * cumsum(Volume)) %>%   # cumulative sum within each group, signed by Signal
  ungroup() %>%                               # ungroup so the grouping column can be dropped
  select(-4)                                  # drop the grouping column
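Here rleid() assigns a new integer id every time Signal changes, so consecutive rows with the same Signal fall into the same group; a tiny sketch with made-up values (not the data above):
library(data.table)
rleid(c(NA, -1, -1, 1, 1, -1))
# [1] 1 2 2 3 3 4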
or, without data.table, by using rle:
z %>%
  mutate(rl = rep(seq_along(rle(Signal)$lengths), times = rle(Signal)$lengths)) %>%  # run id per block of equal Signal
  group_by(rl) %>%
  mutate(cum = Signal * cumsum(Volume)) %>%
  ungroup() %>%
  select(-4)                                  # drop the rl column
#output
date Signal Volume cum
<fct> <int> <int> <int>
1 2016-01-04 NA 37912403 NA
2 2016-01-05 -1 23258238 -23258238
3 2016-01-06 -1 25096183 -48354421
4 2016-01-07 -1 45172906 -93527327
5 2016-01-08 -1 35402298 -128929625
6 2016-01-11 -1 29932385 -158862010
7 2016-01-12 -1 28395390 -187257400
8 2016-01-13 -1 33410553 -220667953
9 2016-01-14 -1 48658623 -269326576
10 2016-01-15 1 46132781 46132781
11 2016-01-19 1 30998256 77131037
12 2016-01-20 -1 59051429 -59051429
13 2016-01-21 1 30518939 30518939
14 2016-01-22 1 30495387 61014326
15 2016-01-25 1 32482015 93496341
16 2016-01-26 -1 26877080 -26877080
17 2016-01-27 -1 58699359 -85576439
18 2016-01-28 1 107475327 107475327
19 2016-01-29 1 62739548 170214875
20 2016-02-01 1 46132726 216347601
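The rep()/rle() line in the second pipeline builds the same run ids by hand; a minimal sketch with made-up values:
sig <- c(NA, -1, -1, 1, 1, -1)
r <- rle(sig)                                 # lengths and values of consecutive runs
rep(seq_along(r$lengths), times = r$lengths)  # one id per run, repeated to full length
# [1] 1 2 2 3 3 4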
data:
z <- read.table(text = "date Signal Volume
2016-01-04 NA 37912403
2016-01-05 -1 23258238
2016-01-06 -1 25096183
2016-01-07 -1 45172906
2016-01-08 -1 35402298
2016-01-11 -1 29932385
2016-01-12 -1 28395390
2016-01-13 -1 33410553
2016-01-14 -1 48658623
2016-01-15 1 46132781
2016-01-19 1 30998256
2016-01-20 -1 59051429
2016-01-21 1 30518939
2016-01-22 1 30495387
2016-01-25 1 32482015
2016-01-26 -1 26877080
2016-01-27 -1 58699359
2016-01-28 1 107475327
2016-01-29 1 62739548
2016-02-01 1 46132726", header = T)
A pure dplyr way would be:
df %>%
  na.omit() %>%                                         # drop the NA row so we don't multiply by NA
  mutate(isStep = (Signal - lag(Signal, 1)) != 0) %>%   # dummy variable marking a change in Signal
  mutate(isStep = ifelse(is.na(isStep), FALSE, isStep)) %>%
  mutate(grp = cumsum(isStep)) %>%                      # new group id based on the steps
  group_by(grp) %>%                                     # group by the step-based id
  mutate(res = cumsum(Signal * Volume)) %>%             # cumulative sum within each group
  select(x, Signal, Volume, res)
# # A tibble: 19 x 5
# # Groups: grp [6]
# grp x Signal Volume res
# <int> <fctr> <int> <int> <int>
# 1 0 2016-01-05 -1 23258238 -23258238
# 2 0 2016-01-06 -1 25096183 -48354421
# 3 0 2016-01-07 -1 45172906 -93527327
# 4 0 2016-01-08 -1 35402298 -128929625
# 5 0 2016-01-11 -1 29932385 -158862010
# 6 0 2016-01-12 -1 28395390 -187257400
# 7 0 2016-01-13 -1 33410553 -220667953
# 8 0 2016-01-14 -1 48658623 -269326576
# 9 1 2016-01-15 1 46132781 46132781
# 10 1 2016-01-19 1 30998256 77131037
# 11 2 2016-01-20 -1 59051429 -59051429
# 12 3 2016-01-21 1 30518939 30518939
# 13 3 2016-01-22 1 30495387 61014326
# 14 3 2016-01-25 1 32482015 93496341
# 15 4 2016-01-26 -1 26877080 -26877080
# 16 4 2016-01-27 -1 58699359 -85576439
# 17 5 2016-01-28 1 107475327 107475327
# 18 5 2016-01-29 1 62739548 170214875
# 19 5 2016-02-01 1 46132726 216347601
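The grouping trick above boils down to flagging every row where Signal differs from the previous row and cumulatively summing those flags; a minimal sketch with made-up values:
library(dplyr)
sig <- c(-1, -1, 1, 1, -1)
isStep <- (sig - lag(sig, 1)) != 0              # TRUE where Signal changed
isStep <- ifelse(is.na(isStep), FALSE, isStep)  # first row has no predecessor
cumsum(isStep)                                  # group ids: 0 0 1 1 2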
As suggested by @docendo this should work:
df[, cum := cumsum(Volume) * Signal, by = .(rleid(Signal))]
date Signal Volume cum
1: 2016-01-04 NA 37912403 NA
2: 2016-01-05 -1 23258238 -23258238
3: 2016-01-06 -1 25096183 -48354421
4: 2016-01-07 -1 45172906 -93527327
5: 2016-01-08 -1 35402298 -128929625
6: 2016-01-11 -1 29932385 -158862010
7: 2016-01-12 -1 28395390 -187257400
8: 2016-01-13 -1 33410553 -220667953
9: 2016-01-14 -1 48658623 -269326576
10: 2016-01-15 1 46132781 46132781
11: 2016-01-19 1 30998256 77131037
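Note that this one-liner assumes df is already a data.table; a minimal setup sketch (using the z data frame defined above and as.data.table(), which is my assumption, not part of the original answer):
library(data.table)
dt <- as.data.table(z)  # convert the example data to a data.table
dt[, cum := cumsum(Volume) * Signal, by = .(rleid(Signal))]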