I asked a related question here but realized I was spending too much time calculating that complex measure (the goal is to use it in a randomization test, so speed is an issue). So I've decided to throw out the weighting and just use the minimum distance between the two measures. Here I have two logical vectors (shown in a data frame for demo purposes, but in reality they are two separate vectors):
x y
1 FALSE TRUE
2 FALSE FALSE
3 TRUE FALSE
4 FALSE FALSE
5 FALSE TRUE
6 FALSE FALSE
7 FALSE FALSE
8 TRUE FALSE
9 FALSE TRUE
10 TRUE TRUE
11 FALSE FALSE
12 FALSE FALSE
13 FALSE FALSE
14 FALSE TRUE
15 TRUE FALSE
16 FALSE FALSE
17 TRUE TRUE
18 FALSE TRUE
19 FALSE FALSE
20 FALSE TRUE
21 FALSE FALSE
22 FALSE FALSE
23 FALSE FALSE
24 FALSE FALSE
25 TRUE FALSE
Here is some code I've worked out to find the minimal distance from each TRUE position in x to the nearest TRUE position in y, but I need more speed (removal of unnecessary calls and better vectorization). Maybe I can't go any faster in base R.
## MWE EXAMPLE: THE DATA
x <- y <- rep(FALSE, 25)
x[c(3, 8, 10, 15, 17, 25)] <- TRUE
y[c(1, 5, 9, 10, 14, 17, 18, 20)] <- TRUE
## Code to Find Distances
xw <- which(x)
yw <- which(y)
min_dist <- function(xw, yw) {
    unlist(lapply(xw, function(x) {
        min(abs(x - yw))
    }))
}
min_dist(xw, yw)
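For reference, on this sample data the result should be the distance from each TRUE position in x (3, 8, 10, 15, 17, 25) to its nearest TRUE position in y:
## [1] 2 1 0 1 0 5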
Is there any way to improve performance, either in base R or using dplyr or data.table? My vectors are much longer (10,000+ elements).
Edit per flodel's benchmarking: flodel, there's an issue I hadn't anticipated in my MWE and I'm not sure how to fix it either. The problem arises if any x position is less than the minimum y position.
x <- y <- rep(FALSE, 25)
x[c(3, 8, 9, 15, 17, 25)] <- TRUE
y[c(5, 9, 10, 13, 15, 17, 19)] <- TRUE
xw <- which(x)
yw <- which(y)
flodel <- function(xw, yw) {
    i <- findInterval(xw, yw)
    pmin(xw - yw[i], yw[i+1L] - xw, na.rm = TRUE)
}
flodel(xw, yw)
## [1] -2 -1 -6 -2 -2 20
## Warning message:
## In xw - yw[i] :
## longer object length is not a multiple of shorter object length
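The cause seems to be that findInterval() returns 0 for any x position smaller than the smallest y position, so yw[i] drops those entries, the subtraction recycles the shorter vector (hence the warning), and some differences come out negative. A quick check with the y positions above:
findInterval(3, yw)
## [1] 0
yw[0]
## integer(0)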
flodel <- function(x, y) {
    xw <- which(x)
    yw <- which(y)
    # all.inside = TRUE keeps the index within 1..(length(yw) - 1),
    # so both yw[i] and yw[i + 1L] are valid candidates on either side
    i <- findInterval(xw, yw, all.inside = TRUE)
    pmin(abs(xw - yw[i]), abs(xw - yw[i+1L]), na.rm = TRUE)
}
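As a quick sanity check on the problem case above (x TRUE at 3, 8, 9, 15, 17, 25; y TRUE at 5, 9, 10, 13, 15, 17, 19), this version should now return non-negative distances:
flodel(x, y)
## [1] 2 1 0 0 0 6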
GG1 <- function(x, y) {
    require(zoo)
    # positions of TRUE values in y, NA elsewhere
    yy <- ifelse(y, TRUE, NA) * seq_along(y)
    # nearest y position at or before / at or after each TRUE x
    fwd <- na.locf(yy, fromLast = FALSE)[x]
    bck <- na.locf(yy, fromLast = TRUE)[x]
    wx <- which(x)
    pmin(wx - fwd, bck - wx, na.rm = TRUE)
}
GG2 <- function(x, y) {
    require(data.table)
    dtx <- data.table(x = which(x))
    dty <- data.table(y = which(y), key = "y")
    # rolling join of each x position to the nearest y position
    dty[dtx, abs(x - y), roll = "nearest"]
}
Sample data:
x <- y <- rep(FALSE, 25)
x[c(3, 8, 10, 15, 17, 25)] <- TRUE
y[c(1, 5, 9, 10, 14, 17, 18, 20)] <- TRUE
X <- rep(x, 100)
Y <- rep(y, 100)
Unit test:
identical(flodel(X, Y), GG1(X, Y))
# [1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y))
# Unit: microseconds
# expr min lq median uq max neval
# flodel(X, Y) 115.546 131.8085 168.2705 189.069 1980.316 100
# GG1(X, Y) 2568.045 2828.4155 3009.2920 3376.742 63870.137 100
# GG2(X, Y) 22210.708 22977.7340 24695.7225 28249.410 172074.881 100
[Edit by Matt Dowle] 24695 microseconds = 0.024 seconds. Inferences made on microbenchmarks with tiny data rarely hold on meaningful data sizes.
[Edit by flodel] My vectors had length 2500, which was rather meaningful given Tyler's statement (10k+), but fine, let's try with vectors of length 2.5e7. I hope you will forgive me for using system.time given the circumstances:
X <- rep(x, 1e6)
Y <- rep(y, 1e6)
system.time(flodel(X,Y))
# user system elapsed
# 0.694 0.205 0.899
system.time(GG1(X,Y))
# user system elapsed
# 31.250 16.496 112.967
system.time(GG2(X,Y))
# Error in `[.data.table`(dty, dtx, abs(x - y), roll = "nearest") :
# negative length vectors are not allowed
[Edit from Arun] - Benchmark for 2.5e7 using data.table 1.8.11:
[Edit 2 from Arun] - Updating timings after Matt's recent faster binary search/merge
require(data.table)
arun <- function(x, y) {
    dtx <- data.table(x = which(x))
    # mark the (already sorted) position columns as keys without re-sorting
    setattr(dtx, 'sorted', 'x')
    dty <- data.table(y = which(y))
    setattr(dty, 'sorted', 'y')
    # keep a copy of y: in the rolling join the key column takes the values from dtx
    dty[, y1 := y]
    dty[dtx, roll = "nearest"][, abs(y - y1)]
}
# minimum of three consecutive runs
system.time(ans1 <- arun(X,Y))
# user system elapsed
# 1.036 0.138 1.192
# minimum of three consecutive runs
system.time(ans2 <- flodel(X,Y))
# user system elapsed
# 0.983 0.197 1.221
identical(ans1, ans2) # [1] TRUE
Here are two solutions. Neither uses a loop or an apply function.
1) The first is the same as the solution I posted to your prior question if z is 1, except that the simplified assumptions here allow us to shorten it somewhat, and we have reduced the answer by 1 relative to that one.
library(zoo)
yy <- ifelse(y, TRUE, NA) * seq_along(y)   # positions of TRUE values in y, NA elsewhere
fwd <- na.locf(yy, fromLast = FALSE)[x]    # nearest y position at or before each TRUE x
bck <- na.locf(yy, fromLast = TRUE)[x]     # nearest y position at or after each TRUE x
wx <- which(x)
pmin(wx - fwd, bck - wx, na.rm = TRUE)     # distance to the closer side
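To sketch how this plays out on the sample data (x TRUE at 3, 8, 10, 15, 17, 25; y TRUE at 1, 5, 9, 10, 14, 17, 18, 20), fwd and bck should hold the nearest y position on each side of every TRUE x, and the final pmin takes the smaller gap:
fwd
## [1]  1  5 10 14 17 20
bck
## [1]  5  9 10 17 17 NA
pmin(wx - fwd, bck - wx, na.rm = TRUE)
## [1] 2 1 0 1 0 5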
2) The second is a data.table solution. data.table can take a roll = "nearest" argument which seems to be just what you need:
library(data.table)
dtx <- data.table(x = which(x))               # TRUE positions of x
dty <- data.table(y = which(y), key = "y")    # TRUE positions of y, keyed for the join
dty[dtx, abs(x - y), roll = "nearest"]        # roll each x to the nearest y and take the distance
I am not sure if it matters but I am using data.table version 1.8.11 (the CRAN version is currently 1.8.10).