Suppose I have x = rnorm(100000)
and instead of doing a 1000
length sliding window moving average, I wanted to do a 1000
length sliding window that counts all the times that x
is above 0.2
.
For example,
x <- rnorm(1004)
start <- 1:1000
record <- list()
while(start[length(start)] <= length(x)) {
record[[length(record) + 1]] <- length(which(x[start] > 0.2))/length(start)
start <- start + 1
print(record[[length(record)]]);flush.console()
}
This becomes unmanegable for large length(x)
. What is a highly efficient method?
My contribution is to calculate the lagged difference between the cumulative sum of the condition
cumdiff = function(x) diff(c(0, cumsum( x > .2)), 20)
which along with
filt = function(x) filter(x > 0.2, rep(1, 20), sides=1)
library(TTR); ttr = function(x) runSum(x > .2, 20)
cumsub = function(x) { z <- cumsum(c(0, x>0.2)); tail(z,-20) - head(z,-20) }
performs ok
> library(microbenchmark)
> set.seed(123); xx = rnorm(100000)
> microbenchmark(cumdiff(xx), filt(xx), ttr(xx), cumsub(xx))
Unit: milliseconds
expr min lq median uq max neval
cumdiff(xx) 11.192005 12.387862 12.469253 12.77588 13.72404 100
filt(xx) 20.979503 22.058045 22.442765 23.02612 62.91730 100
ttr(xx) 8.390923 10.023934 10.119772 10.46309 11.04173 100
cumsub(xx) 7.015654 8.483432 8.538171 8.73596 9.65421 100
These differ in the specifics of how the result is represented (filt
and ttr
have leading NAs, for instance) and only filter
deals with embedded NA's
> xx[22] = NA
> head(cumdiff(xx)) # NA's propagate, silently
[1] 9 9 NA NA NA NA
> ttr(xx)
Error in runSum(x > 0.2, 20) : Series contains non-leading NAs
> tail(filt(xx), -19)
[1] 9 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8 8 9
...
For example's sake I'm using a vector of length 100 and counting over a window of 20 elements
x = rnorm(100)
y <- x > 0.2
z <- filter(y, rep(1,20), sides=1)
The filter
command is the trick here. filter
takes the vector y
and walks through each element. The rep(1,20)
tells it to take the 20 elements at a time, multiply them by 1, and sum them (you could do re(1/20,20) to get the average for instance). The sides = 1
tells filter to look to the right of the element, i.e. consider the 20 elements previous to the current element.
> x
[1] -0.561107013 -0.687590079 -0.512261471 0.335243662 -2.483070553 -0.995055193 1.711199893 0.165077559 -0.130174687
[10] 0.671520592 -0.034204031 0.793000050 -1.407499532 0.929966998 0.464658383 1.291478397 1.122711651 -0.732289539
[19] -0.947808154 1.601688862 0.755147769 0.962415501 -0.254079608 -0.283894519 1.091660426 0.926601058 0.249405592
[28] -1.195970776 -0.384030894 -0.826048400 1.065626378 -1.935038267 1.464437755 0.675437142 0.201552948 0.367500849
[37] -0.232627068 -1.472755498 -1.894668101 -1.173382831 -0.339885920 -0.248978811 -0.378980900 1.545109833 0.280077032
[46] 0.684595064 1.867086057 0.320364730 -1.653199399 -1.365201143 -0.012793364 -0.874006845 -0.271929369 -0.119850417
[55] -0.361720731 0.361642939 0.676907910 0.816370133 0.685732967 -0.868583473 -1.830861530 1.423500571 -0.760462112
[64] 0.517177871 0.229760540 -0.187106487 -0.324125717 -0.515600949 -0.772414150 -1.144808696 0.003509157 0.138944467
[73] -0.386850873 0.268739173 0.348214296 0.431076350 -0.418408389 -1.398493830 -0.331882279 -0.203976286 -0.646023792
[82] -0.729974273 -0.744186350 0.877314843 -0.994216252 0.022162018 -1.101533562 -1.328293297 0.974986850 -0.152985355
[91] -0.714722527 0.428360138 -0.055607165 -1.015410904 0.320052726 1.951380333 2.288315801 1.295683645 -2.494534350
[100] 1.171319898
> y
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[21] TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[41] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
[61] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[81] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
> z
Time Series:
Start = 1
End = 100
Frequency = 1
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 9 10 11 11 10 11 12 12 12 12 11 12 11 12 12 12 12 11 11 11 10
[41] 9 8 8 9 9 9 9 10 10 10 9 9 8 7 6 6 7 8 9 9 9 10 10 10 10 9 8 7 7 7 7 7 7 8 9 9 8 7 6 6
[81] 6 5 5 5 4 4 4 4 5 5 5 6 6 5 5 5 6 7 7 8
One option is to use rollaply:
library(zoo)
rollapply(zoo(x), width=1000, FUN=function(x) sum(x>0.2))
UPD:
The same solution is faster with TTR:
library(TTR)
runSum(rnorm(n)>0.2, 1000)
Some benchmarking (f1 - zoo solution, f2 - original solution, f3 - TTR solution):
benchmark(f1(5e4), f2(5e4), f3(5e4), replications=10)
test replications elapsed relative user.self sys.self user.child sys.child
1 f1(50000) 10 29.39 326.556 29.36 0 NA NA
2 f2(50000) 10 55.57 617.444 55.50 0 NA NA
3 f3(50000) 10 0.09 1.000 0.09 0 NA NA