What I want to do is more or less a combination of the problems discussed in the following two threads:
- Perform non-pairwise all-to-all comparisons between two unordered character vectors --- The opposite of intersect --- all-to-all setdiff
- Merge data frames based on numeric rownames within a chosen threshold and keeping unmatched rows as well
I have two numeric vectors:
b_1 <- c(543.4591, 489.36325, 12.03, 896.158, 1002.5698, 301.569)
b_2 <- c(22.12, 53, 12.02, 543.4891, 5666.31, 100.1, 896.131, 489.37)
I want to compare all elements in b_1 against all elements in b_2, and vice versa. If element_i in b_1 does not fall within element_j ± 0.045 for any element_j in b_2, then element_i must be reported. Likewise, if element_j in b_2 does not fall within element_i ± 0.045 for any element_i in b_1, then element_j must be reported.
Therefore, the expected answer based on the vectors provided above would be:
### based on threshold = 0.045
in_b1_not_in_b2 <- c(1002.5698, 301.569)
in_b2_not_in_b1 <- c(22.12, 53, 5666.31, 100.1)
Is there an R function that would do this?
A vectorized beast:
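The code for this answer is not reproduced above; a minimal sketch in the same spirit, using outer to compute all pairwise differences at once (the variable name tol is my own), could look like this:

tol <- 0.045
## all pairwise absolute differences: a length(b_1) x length(b_2) matrix
d <- abs(outer(b_1, b_2, "-"))
## an element is reported when it has no partner within the tolerance
in_b1_not_in_b2 <- b_1[rowSums(d <= tol) == 0]
in_b2_not_in_b1 <- b_2[colSums(d <= tol) == 0]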
hours later...
Henrik shared a question complaining about the memory explosion when using outer for long vectors: Matching two very very large vectors with tolerance (fast! but working space sparing). However, the memory bottleneck of outer is easily removed by blocking (a sketch of such a blocked function follows): outer is applied chunk by chunk, so it never uses more memory than two chunk.size x chunk.size matrices.
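The blocking function itself is not reproduced above; a minimal sketch of the idea, with my own function and argument names, might be:

## Sketch only: compare b1 and b2 block by block, so that at most two
## chunk.size x chunk.size matrices are in memory at any time
fuzzy_setdiff_blocked <- function(b1, b2, tol = 0.045, chunk.size = 5000) {
  matched1 <- logical(length(b1))
  matched2 <- logical(length(b2))
  for (i in seq(1, length(b1), by = chunk.size)) {
    ii <- i:min(i + chunk.size - 1, length(b1))
    for (j in seq(1, length(b2), by = chunk.size)) {
      jj <- j:min(j + chunk.size - 1, length(b2))
      hit <- abs(outer(b1[ii], b2[jj], "-")) <= tol
      matched1[ii] <- matched1[ii] | (rowSums(hit) > 0)
      matched2[jj] <- matched2[jj] | (colSums(hit) > 0)
    }
  }
  list(in_b1_not_in_b2 = b1[!matched1],
       in_b2_not_in_b1 = b2[!matched2])
}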
Now let's do something crazy. If we did a simple outer on two vectors of length 1e+5, we would need memory for two 1e+5 x 1e+5 matrices, which is up to 149 GB. However, with blocking, the computation is feasible even on my Sandy Bridge (2011) laptop with only 4 GB of RAM. The performance is actually good enough, given that we have been using a very poor algorithm.
All answers here do an exhaustive search, which has complexity length(b1) x length(b2). We could reduce this to roughly length(b1) + length(b2) by working on sorted arrays, but such a deeply optimized algorithm would have to be implemented in a compiled language to be efficient.
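A compiled two-pointer merge is beyond base R, but a close, fully vectorized approximation of the sorted-array idea can be written with findInterval() (this sketch and its names are mine, not from any answer above):

## nearest_gap(): distance from each x to its nearest value in table,
## found via binary search on the sorted table (roughly O((n + m) log m))
nearest_gap <- function(x, table) {
  s <- sort(table)
  i <- findInterval(x, s)                    # s[i] <= x < s[i + 1]
  lo <- abs(x - s[pmax(i, 1)])               # candidate at or below x
  hi <- abs(x - s[pmin(i + 1, length(s))])   # candidate above x
  pmin(lo, hi)
}
tol <- 0.045
in_b1_not_in_b2 <- b_1[nearest_gap(b_1, b_2) > tol]
in_b2_not_in_b1 <- b_2[nearest_gap(b_2, b_1) > tol]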
If you are happy to use a non-base package, data.table::inrange is a convenient function.
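For the example data, a direct application could look like this (a sketch; inrange(x, lower, upper) tests whether each value of x falls inside any of the given intervals):

library(data.table)
tol <- 0.045
## an element is unmatched when it lies in none of the tolerance intervals
in_b1_not_in_b2 <- b_1[!inrange(b_1, b_2 - tol, b_2 + tol)]
in_b2_not_in_b1 <- b_2[!inrange(b_2, b_1 - tol, b_1 + tol)]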
inrange is also efficient on larger data sets. On e.g. 1e5 vectors, inrange is more than 700 times faster than the two other alternatives, and yes, they give the same result.
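The answer's benchmark code is not reproduced above; a sketch of how such a comparison could be run, reusing the earlier sketches as the other alternatives (timings will of course vary by machine):

library(data.table)
set.seed(1)
x <- runif(1e5, 0, 1e6)
y <- runif(1e5, 0, 1e6)
tol <- 0.045
system.time(r_inrange <- x[!inrange(x, y - tol, y + tol)])
system.time(r_sorted  <- x[nearest_gap(x, y) > tol])
system.time(r_blocked <- fuzzy_setdiff_blocked(x, y, tol)$in_b1_not_in_b2)
## same result (up to floating-point ties exactly at the boundary)
identical(r_inrange, r_sorted)
identical(r_inrange, r_blocked)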
Several related, potentially useful answers come up when searching for inrange on SO.

Here is an alternative approach:
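The code for this alternative is likewise not shown above; one straightforward base-R version of the idea (my own illustration) tests each element against the whole other vector:

tol <- 0.045
## keep the elements of x that have no partner in table within tol
no_match <- function(x, table, tol) {
  x[vapply(x, function(v) !any(abs(table - v) <= tol), logical(1))]
}
in_b1_not_in_b2 <- no_match(b_1, b_2, tol)
in_b2_not_in_b1 <- no_match(b_2, b_1, tol)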