I have a sorted vector of positive and negative numbers:
vec<-c(seq(-100,-1), rep(0,20), seq(1,100))
The actual vector is larger than this example and takes on a random set of values. I have to repeatedly find the number of negative numbers in the vector, and I am finding this quite inefficient.
Since I only need to find the number of negative numbers, and the vector is sorted, I only need to know the index of the first 0 or positive number (there may be no 0s in the actual random vectors).
Currently I am using this code to find the length
length(which(vec<0))
but this forces R to scan the entire vector, which is unnecessary because the vector is sorted.
I could use
match(0, vec)
but my vector does not always contain 0s.
So my question is, is there some kind of match() function that applies a condition instead of finding a specific value? Or is there a more efficient way to run my which() code?
Thank you
You could use which.min on the logical comparison vec < 0. This will return the index of the first FALSE value, i.e. the first 0 or positive number.
Use sum() and a logical comparison. This will be pretty quick, and when you sum a logical, TRUE is 1 and FALSE is 0, so the total will be the number of negative values.
Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5
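As a minimal sketch of the two suggestions above (which.min and sum() on a logical comparison), assuming the example vector from the question. The variable names and the rough base-R timing loop are illustrative assumptions, not the benchmark that was originally run:

```r
vec <- c(seq(-100, -1), rep(0, 20), seq(1, 100))

# sum() over a logical vector counts the TRUE values.
n_neg_sum <- sum(vec < 0)

# which.min() on a logical returns the index of the first FALSE
# (FALSE sorts below TRUE), i.e. the first element that is not negative.
# Caveat: if every element is negative there is no FALSE, which.min()
# returns 1, and this would wrongly report 0 negatives.
n_neg_whichmin <- which.min(vec < 0) - 1L

# Rough timing on a sorted random vector of length 2e5 (assumed data).
set.seed(1)
big <- sort(rnorm(2e5))
system.time(for (i in 1:100) sum(big < 0))
system.time(for (i in 1:100) which.min(big < 0) - 1L)
```

Both approaches still build the full logical vector vec < 0 before doing anything else, so neither takes advantage of the sorting.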
The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:
For comparison with the other suggestions
and for fun
Leading to
Probably there are some tricky edge cases that I've got wrong! Moving to C, I did
with
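The binary search idea described in this answer can be sketched in plain R roughly as follows. This is an illustration only, not the answerer's own R or C code: count_negative is a hypothetical name, and the sketch assumes the vector is sorted in ascending order and contains no NA values.

```r
# Count negative values in an ascending sorted numeric vector by
# binary-searching for the first element that is >= 0.
# count_negative is a hypothetical helper, not from the original answer.
count_negative <- function(vec) {
  lo <- 1L                # lowest index that could hold the first non-negative
  hi <- length(vec) + 1L  # one past the end means "no non-negative element"
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (vec[mid] < 0) {
      lo <- mid + 1L      # first non-negative lies to the right of mid
    } else {
      hi <- mid           # vec[mid] is a candidate; keep searching left
    }
  }
  lo - 1L                 # every element before index lo is negative
}

vec <- c(seq(-100, -1), rep(0, 20), seq(1, 100))
count_negative(vec)  # 100
```

Only about log2(length(vec)) elements are inspected, and no logical vector of the full length is allocated. Using a half-open range [lo, hi) is one way to keep the index arithmetic manageable for the empty, all-negative, and no-negative cases.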
findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse than implementing f3 or f4) then a variant that skips those checks is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).
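For reference, a minimal sketch of the safe findInterval route on the question's example vector. It assumes an R version where findInterval has the left.open argument (R 3.3.0 or later); with left.open = TRUE the result is the number of elements strictly below the lookup value.

```r
vec <- c(seq(-100, -1), rep(0, 20), seq(1, 100))

# Number of elements strictly less than 0 (the negatives).
n_neg <- findInterval(0, vec, left.open = TRUE)

# Default intervals are closed on the left, so this counts elements <= 0.
n_nonpos <- findInterval(0, vec)

# Vectorized lookup: count of elements below each of several cut points.
below <- findInterval(c(-50, 0, 50), vec, left.open = TRUE)

n_neg     # 100
n_nonpos  # 120
below     # 50 100 169
```

The difference between two such counts gives the number of elements in a range, which is the range-like use mentioned above.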