How to trim an R vector?

Posted 2019-04-11 11:08

Question:

I have the following sorted vector:

> v
 [1] -1  0  1  2  4  5  2  3  4  5  7  8  5  6  7  8 10 11

How can I remove the -1, 0, and 11 entries without looping over the whole vector, either with an explicit loop or implicitly with a language keyword? That is, I want to trim the vector at each edge, and only at each edge, so that the remaining sequence lies within my min and max parameters, 1 and 10. The solution should assume that the vector is sorted, to avoid checking every element.

This kind of solution can come in handy in vectorized operations on very large vectors, when we want to use the items in the vector as indexes into another object. For one application, see this thread.

Answer 1:

All of the previous solutions implicitly check every element of the vector. As @Robert Kubrick points out, this does not take advantage of the fact that the vector is already sorted.

To take advantage of the sorted nature of the vector, you can use binary search (through findInterval) to find the start and end indexes without looking at every element:

n <- 1e9
v <- -3:(n + 3)
system.time(a <- v[v >= 1 & v <= n])  # 68 s
system.time(b <- v[do.call(seq, as.list(findInterval(c(1, n), v)))])  # 15 s
identical(a, b)  # TRUE

It is a little clumsy, and there is some discussion that the binary search in findInterval may not be entirely efficient, but the general idea is there.
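
To make the index arithmetic concrete, here is a small sketch on a toy sorted vector (the names `x`, `idx`, and `trimmed` are illustrative and not part of the answer). For each query value, `findInterval(q, x)` reports how many elements of `x` are less than or equal to it, so when both bounds occur in `x` and the lower bound is not duplicated, the two results are the first and last positions to keep.

```r
# Toy sorted vector; the real use case would be far longer
x <- c(-1L, 0L, 1L, 2L, 4L, 5L, 7L, 8L, 10L, 11L)

# For each bound, findInterval() counts the elements of x that are <= it
idx <- findInterval(c(1, 10), x)
print(idx)

# Slice between the two positions
trimmed <- x[idx[1]:idx[2]]
print(trimmed)
```

Note that `findInterval` refuses unsorted input, so this sketch only applies to a vector that really is non-decreasing.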


As was pointed out in the comments, the above only works when both bounds actually occur in the vector. Here is a function that I think will work:

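in.range <- function(x, lo = -Inf, hi = +Inf) {
   # Assumes x is sorted, has at least two elements, and has at least one
   # element inside [lo, hi]
   lo.idx <- findInterval(lo, x, all.inside = TRUE)
   hi.idx <- findInterval(hi, x)
   # Move one position right when x[lo.idx] is still below lo. The
   # parentheses are required: `+` binds more tightly than `<`.
   lo.idx <- lo.idx + (x[lo.idx] < lo)
   x[seq(lo.idx, hi.idx)]
}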

system.time(b <- in.range(v, 1, n))  # 15 s
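
As a possible variant, the lower-bound adjustment can be avoided with the `left.open` argument of `findInterval`, which is present in recent R versions but absent from older releases. The sketch below is not the answerer's function: `trim_sorted`, `first.idx`, and `last.idx` are hypothetical names, and it assumes `x` is sorted in non-decreasing order with no `NA`s. With `left.open = TRUE` the call counts the elements strictly below `lo`, so adding one gives the first position to keep, even when `lo` is duplicated or absent.

```r
# Hypothetical helper (not from the answer): trim a sorted vector to [lo, hi]
trim_sorted <- function(x, lo = -Inf, hi = Inf) {
  # Count of elements strictly below lo, so first.idx is the first x >= lo
  first.idx <- findInterval(lo, x, left.open = TRUE) + 1L
  # Count of elements <= hi, which is the last position to keep
  last.idx <- findInterval(hi, x)
  # Nothing falls inside the range: return an empty vector of the same type
  if (first.idx > last.idx) return(x[0])
  x[first.idx:last.idx]
}

x <- c(-1, 0, 2, 2, 4, 5, 8, 11)
print(trim_sorted(x, 1, 10))   # bounds absent from x
print(trim_sorted(x, 2, 8))    # bounds present, lower bound duplicated
print(trim_sorted(x, 20, 30))  # nothing in range
```

The explicit empty-range check matters because `first.idx:last.idx` would otherwise count downwards and return the wrong elements.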


Answer 2:

To include elements in a vector by index:

v[2:10]

To exclude certain elements:

v[-c(1, 11)]

To include only a certain range:

v <- v[v >= 1 & v <= 10]
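
As an illustration, the three forms can be applied to the 18-element vector from the question. This is only a sketch: the positions are worked out by hand for that particular vector, and the names `by_position`, `by_exclusion`, and `by_value` are made up for the example.

```r
v <- c(-1, 0, 1, 2, 4, 5, 2, 3, 4, 5, 7, 8, 5, 6, 7, 8, 10, 11)

# Keep by position: drop the first two elements and the last one
by_position <- v[3:(length(v) - 1)]

# Exclude by negative index: the same three positions, named explicitly
by_exclusion <- v[-c(1, 2, length(v))]

# Keep by value: the logical test visits every element
by_value <- v[v >= 1 & v <= 10]

print(by_value)
```

All three give the same result here, but only the last one works without knowing in advance how many elements to drop at each end.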

If I may assume that, as in your example, the number of elements to trim is much smaller than the number of elements in the vector, then I think a plain scan inward from each end can beat the binary search:

> n<-1e8
> v<--3:(n+3)
> 
> min <- 1
> max <- length(v)
> 
> calcMin <- function(v, minVal){
+   while(v[min] < minVal){
+       min <- min + 1
+   }
+   min
+ }
> 
> calcMax <- function(v, maxVal){
+   while(v[max] > maxVal){
+       max <- max - 1
+   }
+   max
+ }
> 
> #Compute the min and max indices and create a sequence
> system.time(a <- v[calcMin(v, 1):calcMax(v,n)])
   user  system elapsed 
  1.030   0.269   1.298 
> 
> #do a binary search to find the elements (as suggested by @nograpes)
> system.time(b <- v[do.call(seq,as.list(findInterval(c(1,n),v)))])
   user  system elapsed 
  2.208   0.631   2.842 
> 
> #use negative indexing to remove elements
> system.time(c <- v[-c(1:(calcMin(v, 1)-1), (calcMax(v,n)+1):length(v))])
   user  system elapsed 
  1.449   0.256   1.704 
> 
> #use head and tail to trim the vector
> system.time(d <- tail(head(v, n=(calcMax(v,n)-length(v))), n=-calcMin(v, 1)+1))
   user  system elapsed 
  2.994   0.877   3.871 
> 
> identical(a, b)
[1] TRUE
> identical(a, c)
[1] TRUE
> identical(a, d)
[1] TRUE
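
The edge scan in the session above relies on global `min` and `max` variables, which also shadow the base functions of the same name. A self-contained sketch of the same idea, using a hypothetical helper `trim_edges` with local counters and a guard for the case where nothing lies in range, might look like this:

```r
# Hypothetical helper: scan inward from each end of the vector
trim_edges <- function(v, minVal, maxVal) {
  lo <- 1L
  hi <- length(v)
  # Advance past leading values below minVal
  while (lo <= hi && v[lo] < minVal) lo <- lo + 1L
  # Retreat past trailing values above maxVal
  while (hi >= lo && v[hi] > maxVal) hi <- hi - 1L
  # Empty result when nothing lies in [minVal, maxVal]
  if (lo > hi) return(v[0])
  v[lo:hi]
}

v <- -3:103
trimmed <- trim_edges(v, 1, 100)
print(range(trimmed))
```

The cost is proportional to the number of elements trimmed, so this only pays off when that number is small relative to the length of the vector.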


Answer 3:

There are many ways to do it; here are some:

> v <- -1:11 # creating your vector
> v[v %in% 1:10]
 [1]  1  2  3  4  5  6  7  8  9 10
> setdiff(v, c(-1,0,11))
 [1]  1  2  3  4  5  6  7  8  9 10
> intersect(v, 1:10)
 [1]  1  2  3  4  5  6  7  8  9 10

Two more options, though less elegant:

> na.omit(match(v, 1:10))
> na.exclude(match(v, 1:10))
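
One caveat about the two `match` variants: `match` returns positions within the lookup table rather than elements of `v`, so the results coincide with the wanted values only because the table here is `1:10`. A small sketch with an illustrative table that does not start at 1 (the names `keep`, `positions`, and `values` are made up for the example) shows the difference:

```r
v <- -1:11
keep <- 3:8   # illustrative range that does not start at 1

# Positions of the matches inside `keep`, with the NAs dropped
positions <- as.vector(na.omit(match(v, keep)))
print(positions)

# The elements of v that occur in `keep`
values <- v[!is.na(match(v, keep))]
print(values)
```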


Answer 4:

You can also use %in%:

 vv <- c(-1, 0, 1, 2, 4, 5, 2, 3, 4, 5, 7, 8, 5, 6, 7, 8, 10, 11)
 vv[vv %in% 1:10]

 [1]  1  2  4  5  2  3  4  5  7  8  5  6  7  8 10