This question already has an answer here:
-
Efficient way to filter one data frame by ranges in another
3 answers
I want to get a list of values that fall in between multiple ranges.
library(data.table)
values <- data.table(value = c(1:100))
range <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))
I need the results to include only the values that fall in between those ranges:
results <- c(6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34, 35, 87, 88, 89, 90, 91, 92)
I am currently doing this with a for loop,
results <- data.table(NULL)
for (i in 1:NROW(range){
results <- rbind(results,
data.table(result = values[value >= range[i, start] &
value <= range[i, end], value]))}
however the actual dataset is quite large and I am looking for a more efficient way.
Any suggestions are appreciated! Thank you!
Using the non-equi join possibility of data.table
:
values[range, on = .(value >= start, value <= end), .(results = x.value)]
which gives:
results
1: 6
2: 7
3: 8
4: 9
5: 10
6: 29
7: 30
8: 31
9: 32
10: 33
11: 34
12: 35
13: 87
14: 88
15: 89
16: 90
17: 91
18: 92
Or as per the suggestion of @Henrik: values[value %inrange% range]
. This works also very well on data.table's with multiple columns:
# create new data
set.seed(26042017)
values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))
> values2[value %inrange% range]
value let num
1: 6 v 70
2: 7 f 77
3: 8 u 21
4: 9 x 66
5: 10 g 58
6: 29 f 7
7: 30 w 48
8: 31 c 50
9: 32 e 5
10: 33 c 8
11: 34 y 19
12: 35 s 97
13: 87 j 80
14: 88 o 4
15: 89 h 65
16: 90 c 94
17: 91 k 22
18: 92 g 46
If you have the latest CRAN version of data.table you can use non-equi joins. For example, you can create an index which you can then use to subset your original data:
idx <- values[range, on = .(value >= start, value <= end), which = TRUE]
# [1] 6 7 8 9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
values[idx]
Here is one method using lapply
and %between%
rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]]))
This method loops through the ranges data.table and subsets values in each iteration according to the variable in ranges. lapply
returns a list, which rbindlist
constructs into a data.table. If you want a vector, replace rbindlist
with unlist
.
benchmarks
Just to check the speeds of each suggestion on the given data, I ran a quick comparison
microbenchmark(
lmo=rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])),
dd={idx <- values[range, on = .(value >= start, value <= end), which = TRUE]; values[idx]},
jaap=values[range, on = .(value >= start, value <= end), .(results = x.value)],
inrange=values[value %inrange% range])
This returned
Unit: microseconds
expr min lq mean median uq max neval cld
lmo 1238.472 1460.5645 1593.6632 1520.8630 1613.520 3101.311 100 c
dd 688.230 766.7750 885.1826 792.8615 825.220 3609.644 100 b
jaap 798.279 897.6355 935.9474 921.7265 970.906 1347.380 100 b
inrange 463.002 518.3110 563.9724 545.5375 575.758 1944.948 100 a
As might be expected, my looping solution is quite a bit slower than the others. However, the clear winner is %inrange%
, which is essentially a vectorized extension of %between%
.