I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide the code to simulate a large data as that's the point of this question. My actual problem size is of list length = 1 million entries.
Say, I've two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:
- Each element of the list
x
always sums up to 100. - Each element of
y
will always be sorted and will be always between 1 and 100.
The problem:
Now, what I'd like is this. Taking x[[1]]
and y[[1]]
, I'd like to find the count of numbers in y[[1]]
that are 1) <= 82 and 2) > 82 and <= 100. That would be, c(4, 1) because numbers <= 82 are c(1,2,3,55)
and number between 83 and 100 is c(90)
. Similarly for x[[2]]
and y[[2]]
, c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
Let me know if this is still unclear.
Simulated data with 1 million entries
set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
probs <- sample(100:1000, len[ix])
probs <- probs/sum(probs)
oo <- round(N * probs)
if (sum(oo) != 100) {
oo[1] <- oo[1] + (100 - sum(oo))
}
oo
})
require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)),by=grp]$V1
What I've done so far:
Using mapply
(slow):
I thought of using rank
with ties.method="first"
and mapply
(obvious choice with 2 lists) first and tried out this:
tt1 <- mapply(y, x, FUN=function(a,b) {
tt <- rank(c(a, cumsum(b)), ties="first")[-(1:length(a))]; c(tt[1]-1, diff(tt)-1)
})
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank
and diff
that many times adds to it. This takes 241 seconds!
Therefore, I decided to try and overcome the usage of rank
and diff
by using data.table
and sorting with a "group" column. I came up with a longer but much faster solution shown below:
Using data.table
(faster):
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <-rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1
This takes 26 seconds. So it's about 9 times faster. I'm wondering if it's possible to get much more speedup as I'll have to recursively compute this on 5-10 such 1 million elements. Thank you.