I have a vector of positive and negative numbers
vec<-c(seq(-100,-1), rep(0,20), seq(1,100))
the vector is larger than the example, and takes on a random set of values. I have to repetitively find the number of negative numbers in the vector... I am finding this is quite inefficient.
Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).
Currently I am using this code to find the length
length(which(vec<0))
but this forces R to go through the entire vector, but since it is sorted, there is no need.
I could use
match(0, vec)
but my vector does not always have 0s
So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?
Thank you
The solutions offered so far all imply creating a logical(length(vec))
and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:
f3 <- function(x) {
imin <- 1L
imax <- length(x)
while (imax >= imin) {
imid <- as.integer(imin + (imax - imin) / 2)
if (x[imid] >= 0)
imax <- imid - 1L
else
imin <- imid + 1L
}
imax
}
For comparison with the other suggestions
f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L
and for fun
library(compiler)
f3.c <- cmpfun(f3)
Leading to
> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
expr min lq median uq max neval
f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903 100
f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293 100
f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889 100
f3(vec) 51.715 56.050 75.4495 78.5295 100.730 100
f3.c(vec) 11.612 17.147 28.5570 31.3160 49.781 100
Probably there are some tricky edge cases that I've got wrong! Moving to C, I did
library(inline)
f4 <- cfunction(c(x = "numeric"), "
int imin = 0, imax = Rf_length(x) - 1, imid;
while (imax >= imin) {
imid = imin + (imax - imin) / 2;
if (REAL(x)[imid] >= 0)
imax = imid - 1;
else
imin = imid + 1;
}
return ScalarInteger(imax + 1);
")
with
> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
expr min lq median uq max neval
f3(vec) 52096 53192.0 54918.5 55539.0 69491 100
f3.c(vec) 10924 12233.5 12869.0 13410.0 20038 100
f4(vec) 553 796.0 893.5 1004.5 2908 100
findInterval
came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec
is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse that implementing f3 or f4) then
f5.i <- function(v)
.Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).
Use sum()
and logical comparison:
sum( vec < 0 )
[1] 100
This will be pretty quick, and when you sum a logical, TRUE
is 1 and FALSE
is 0 so the total will be the number of negative values.
Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5
library(microbenchmark)
vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )
Unit: milliseconds
expr min lq median uq max neval
(which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911 100
(sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088 2.662164 100
You could use which.min
which.min(vec < 0) - 1L
This will return the first FALSE
value, i.e. the first 0.