-->

More efficient strategy for which() or match()

2020-02-26 11:07发布

问题:

I have a vector of positive and negative numbers

vec<-c(seq(-100,-1), rep(0,20), seq(1,100))

the vector is larger than the example, and takes on a random set of values. I have to repetitively find the number of negative numbers in the vector... I am finding this is quite inefficient.

Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).

Currently I am using this code to find the length

length(which(vec<0))

but this forces R to go through the entire vector, but since it is sorted, there is no need.

I could use

match(0, vec)

but my vector does not always have 0s

So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?

Thank you

回答1:

The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:

f3 <- function(x) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

For comparison with the other suggestions

f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L

and for fun

library(compiler)
f3.c <- cmpfun(f3)

Leading to

> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100

Probably there are some tricky edge cases that I've got wrong! Moving to C, I did

library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")

with

> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100

findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse that implementing f3 or f4) then

f5.i <- function(v)
    .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))

is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).



回答2:

Use sum() and logical comparison:

sum( vec < 0 )
[1] 100

This will be pretty quick, and when you sum a logical, TRUE is 1 and FALSE is 0 so the total will be the number of negative values.

Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5

library(microbenchmark)
vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )

Unit: milliseconds
                      expr      min       lq   median       uq       max neval
 (which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911   100
            (sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088  2.662164   100


回答3:

You could use which.min

 which.min(vec < 0) - 1L

This will return the first FALSE value, i.e. the first 0.