Question:
I have the following sorted vector:
> v
[1] -1 0 1 2 4 5 2 3 4 5 7 8 5 6 7 8 10 11
How can I remove the -1, 0, and 11 entries without looping over the whole vector, either with a user loop or implicitly with a language keyword? That is, I want to trim the vector at each edge and only at each edge, such that the sorted sequence is within my min,max parameters 1 and 10. The solution should assume that the vector is sorted to avoid checking every element.
This kind of solution can come in handy in vectorized operations on very large vectors, when we want to use the items in the vector as indexes into another object. For one application see this thread.
Answer 1:
All of the previous solutions implicitly check every element of the vector. As @Robert Kubrick points out, this does not take advantage of the fact that the vector is already sorted.
To take advantage of the sorted nature of the vector, you can use binary search (through findInterval) to find the start and end indexes without looking at every element:
n <- 1e9
v <- -3:(n + 3)
system.time(a <- v[v >= 1 & v <= n]) # 68 s
system.time(b <- v[do.call(seq, as.list(findInterval(c(1, n), v)))]) # 15 s
identical(a, b) # TRUE
It is a little clumsy, and there is some discussion that the binary search in findInterval may not be entirely efficient, but the general idea is there.
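To make the index lookup concrete, here is a minimal sketch on a small sorted vector (the values are illustrative, not from the original post). It also shows what happens when a bound is not itself an element of the vector:

```r
# Small sorted vector (illustrative values)
x <- c(2, 4, 6, 8, 10)

# findInterval(q, x) returns, for each q, the largest i with x[i] <= q
idx <- findInterval(c(4, 8), x)
idx                       # 2 4
x[seq(idx[1], idx[2])]    # 4 6 8

# A lower bound that is not an element of x lands one position too early
idx2 <- findInterval(c(5, 8), x)
idx2                      # 2 4
x[seq(idx2[1], idx2[2])]  # 4 6 8, although 4 < 5
```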
As was pointed out in the comments, the above only works when both bounds are themselves elements of the vector. Here is a function that I think will work in the general case:
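in.range <- function(x, lo = -Inf, hi = +Inf) {
  # Largest index with x[i] <= lo, clamped to 1..(length(x) - 1)
  lo.idx <- findInterval(lo, x, all.inside = TRUE)
  # Largest index with x[i] <= hi
  hi.idx <- findInterval(hi, x)
  # Step forward by one when x[lo.idx] is still below lo; the parentheses
  # matter because + binds tighter than the comparison
  lo.idx <- lo.idx + (x[lo.idx] < lo)
  x[seq(lo.idx, hi.idx)]
}
system.time(b <- in.range(v, 1, n)) # 15s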
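For inputs with repeated values, or with bounds that lie entirely outside the data, a more defensive variant may be useful. The following is only a sketch: the name trim_sorted is hypothetical, it assumes x is sorted in non-decreasing order, and it relies on the left.open argument of findInterval (available from R 3.3.0).

```r
# Hypothetical helper (name and arguments are not from the original answer).
# Assumes x is sorted in non-decreasing order and R >= 3.3.0 (left.open).
trim_sorted <- function(x, lo = -Inf, hi = Inf) {
  # Number of elements strictly below lo, plus one: first index with x >= lo
  lo_idx <- findInterval(lo, x, left.open = TRUE) + 1L
  # Number of elements <= hi: last index with x <= hi
  hi_idx <- findInterval(hi, x)
  if (lo_idx > hi_idx) return(x[0])  # nothing falls inside [lo, hi]
  x[lo_idx:hi_idx]
}

trim_sorted(c(-1, 0, 1, 1, 2, 5, 10, 11), 1, 10)  # 1 1 2 5 10
trim_sorted(c(2, 4, 6), 3, 5)                     # 4
trim_sorted(c(2, 4, 6), 7, 9)                     # numeric(0)
```

Both lookups are still binary searches, so no element between the two edges is inspected.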
Answer 2:
To include elements in a vector by index:
v[2:10]
To exclude certain elements:
v[-c(1, 11)]
To include only a certain range:
v <- v[v >= 1 & v <= 10]
If I may assume that, as in your example, the number of elements to be trimmed is much smaller than the number of elements in the vector, then I think I can beat the binary search:
> n<-1e8
> v<--3:(n+3)
>
> min <- 1
> max <- length(v)
>
> calcMin <- function(v, minVal){
+ while(v[min] < minVal){
+ min <- min + 1
+ }
+ min
+ }
>
> calcMax <- function(v, maxVal){
+ while(v[max] > maxVal){
+ max <- max - 1
+ }
+ max
+ }
>
> #Compute the min and max indices and create a sequence
> system.time(a <- v[calcMin(v, 1):calcMax(v,n)])
user system elapsed
1.030 0.269 1.298
>
> #do a binary search to find the elements (as suggested by @nograpes)
> system.time(b <- v[do.call(seq,as.list(findInterval(c(1,n),v)))])
user system elapsed
2.208 0.631 2.842
>
> #use negative indexing to remove elements
> system.time(c <- v[-c(1:(calcMin(v, 1)-1), (calcMax(v,n)+1):length(v))])
user system elapsed
1.449 0.256 1.704
>
> #use head and tail to trim the vector
> system.time(d <- tail(head(v, n=(calcMax(v,n)-length(v))), n=-calcMin(v, 1)+1))
user system elapsed
2.994 0.877 3.871
>
> identical(a, b)
[1] TRUE
> identical(a, c)
[1] TRUE
> identical(a, d)
[1] TRUE
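The edge scan in the transcript above reads its starting positions from the global variables min and max, which also shadow the base functions of the same name. A self-contained sketch of the same idea, with local counters and a guard for the case where nothing is left, might look like this (trim_edges is a hypothetical name, and v is assumed to be sorted in non-decreasing order):

```r
# Hypothetical rewrite of the edge scan; names are illustrative.
# Assumes v is sorted in non-decreasing order.
trim_edges <- function(v, lo, hi) {
  first <- 1L
  last <- length(v)
  # Walk inwards from the left until a value >= lo is found
  while (first <= last && v[first] < lo) first <- first + 1L
  # Walk inwards from the right until a value <= hi is found
  while (last >= first && v[last] > hi) last <- last - 1L
  if (first > last) return(v[0])  # every element was trimmed
  v[first:last]
}

v <- -3:103
identical(trim_edges(v, 1, 100), 1:100)  # TRUE
```

The cost is proportional to the number of trimmed elements, so this only pays off when the edges to remove are short relative to the vector.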
Answer 3:
There are many ways to do it; here are some:
> v <- -1:11 # creating your vector
> v[v %in% 1:10]
[1] 1 2 3 4 5 6 7 8 9 10
> setdiff(v, c(-1,0,11))
[1] 1 2 3 4 5 6 7 8 9 10
> intersect(v, 1:10)
[1] 1 2 3 4 5 6 7 8 9 10
Two more options, not so elegant.
> na.omit(match(v, 1:10))
> na.exclude(match(v, 1:10))
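These options are not interchangeable once the vector contains repeated values, as the vector in the question does. A small sketch with an illustrative vector w (not from the original post):

```r
# Illustrative vector with one repeated value
w <- c(0, 1, 2, 2, 3, 11)

kept <- w[w %in% 1:10]        # logical indexing keeps duplicates: 1 2 2 3
common <- intersect(w, 1:10)  # set operation, duplicates dropped: 1 2 3
rest <- setdiff(w, c(0, 11))  # set operation, duplicates dropped: 1 2 3
```

Note also that match(w, 1:10) returns positions in 1:10 rather than elements of w; the two coincide here only because the lookup table starts at 1.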
Answer 4:
You can also use %in%:
vv <- c(-1, 0 ,1 ,2 ,4 ,5, 2 ,3 ,4, 5, 7 ,8, 5, 6, 7, 8, 10, 11)
vv[vv %in% 1:10]
[1] 1 2 4 5 2 3 4 5 7 8 5 6 7 8 10