Fast check whether one vector is part of another v

2019-05-04 22:25发布

问题:

I have a list of vectors, say vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7). I want to remove all the list members that if it is contained in another list member. For example, vecs$vec1 is part of vecs$vec2, (or vecs$vec4), so I want to remove it.

I want a very fast implementation, because length(vecs) is very big. What I did is first sort vecs member by length vecs = vecs[ order(unlist(lapply(vecs, length))) ], then for check_member = vecs[i], check whether it is part of vecs[ i+1 ], vecs[ i+2 ]... .Is there any better strategy? Complete code:

vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7, vec5=2:3)
vecs = vecs[ order(unlist(lapply(vecs, length))) ] ##sort by member length
vecs_len = length(vecs)
toRemove = numeric(vecs_len ) ##record whether to remove this member
for( i in 1:(vecs_len-1 ))
for( j in (i+1):(vecs_len ))
{
   if( length( setdiff(vecs[[i]],vecs[[j]]) )==0  ) {toRemove[i] = 1; break} ##check whether vecs[[i]] is part of vecs[[j]]

}
vecs = vecs[!toRemove]

回答1:

Please try this on your large data. I am starting from vectors of characters as you clarified in your comment. I am also assuming the vectors do not contain duplicates (easy to fix via lapply(vecs, unique) if they do)

vecs <- list(vec1=letters[1:3],
             vec2=letters[1:5],
             vec3=letters[2:6],
             vec4=letters[1:7])

First, turn your data into a list of factors (i.e. integers) sharing the same levels:

vlevels <- unique(unlist(vecs))
nlev    <- length(vlevels)
fvecs   <- lapply(vecs, factor, levels = vlevels)

Then, I convert the data into a matrix of 0 and 1 for included/not-included:

vtabs <- vapply(fvecs, tabulate, integer(nlev), nbins = nlev)
#      vec1 vec2 vec3 vec4
# [1,]    1    1    0    1
# [2,]    1    1    1    1
# [3,]    1    1    1    1
# [4,]    0    1    1    1
# [5,]    0    1    1    1
# [6,]    0    0    1    1
# [7,]    0    0    0    1

Next, I compute the crossproduct to see how many elements two vectors have in common, and compare with the length of each vector to decide whether the first one is a subset of the second one.

num.match <- crossprod(vtabs)
vlen <- sapply(vecs, length)
is.subset.mat <- num.match == vlen
diag(is.subset.mat) <- FALSE
#       vec1  vec2  vec3  vec4
# vec1 FALSE  TRUE FALSE  TRUE
# vec2 FALSE FALSE FALSE  TRUE
# vec3 FALSE FALSE FALSE  TRUE
# vec4 FALSE FALSE FALSE FALSE

Finally, I sum row-wise to tell if each vector is a subset of any other:

is.subset <- rowSums(is.subset.mat) > 0L
# vec1  vec2  vec3  vec4 
# TRUE  TRUE  TRUE FALSE 

You are only left with filtering:

out <- vecs[!is.subset]

If your initial list is so large that this is too slow, then I would suggest that you run it chunk-by-chunk, then again on the results.



回答2:

After doing some benchmarking here we can see you could just slightly modify your original code for a nice speedup. Using %in% and any is more efficient than setdiff and length.

# Your code
fun1 <- function(vecs){
  vecs_len = length(vecs)
  toRemove = numeric(vecs_len ) ##record whether to remove this member
  for( i in 1:(vecs_len-1 ))
    for( j in (i+1):(vecs_len ))
    {
      if( length( setdiff(vecs[[i]],vecs[[j]]) )==0  ) toRemove[i] = 1 ##check whether vecs[[i]] is part of vecs[[j]]

    }
  vecs = vecs[!toRemove]
  return(vecs)
}

# flodel's code
fun2 <- function(vecs){
  vlevels <- unique(unlist(vecs))
  nlev    <- length(vlevels)
  fvecs   <- lapply(vecs, factor, levels = vlevels)
  vtabs <- vapply(fvecs, tabulate, integer(nlev), nbins = nlev)
  num.match <- crossprod(vtabs)
  vlen <- sapply(vecs, length)
  is.subset.mat <- num.match == vlen
  diag(is.subset.mat) <- FALSE
  is.subset <- rowSums(is.subset.mat) > 0L
  out <- vecs[!is.subset]
  return(out)  
}

# slight modification
fun3 <- function(vecs){
  vecs_len = length(vecs)
  toRemove = numeric(vecs_len ) ##record whether to remove this member
  for( i in 1:(vecs_len-1 ))
    for( j in (i+1):(vecs_len ))
    {
      if( any(vecs[[i]] %in% vecs[[j]])  ) toRemove[i] = 1 ##check whether vecs[[i]] is part of vecs[[j]]

    }
  vecs = vecs[!toRemove]
  return(vecs)
}

microbenchmark(fun1(vecs), fun2(vecs), fun3(vecs), times=100L)
Unit: microseconds
       expr     min       lq      mean   median       uq     max neval
 fun1(vecs) 154.919 166.4245 179.62297 172.5880 180.3950 356.681   100
 fun2(vecs) 220.255 231.5555 279.99874 237.7185 290.9335 609.398   100
 fun3(vecs)  50.544  54.0370  64.86082  57.3250  64.5155 291.345   100


回答3:

I think something like the following should work. It will give you a list of those vectors which are subsets of one of the others. It will be a subset of itself - so I only do it for those where it's a subset of a least two vectors (ie, itself and one other). Credit to https://stat.ethz.ch/pipermail/r-help/2005-March/068491.html for the %in% function.

dfSubsets <- list()

ds <- lapply(vecs,function(x){
        subsets = 0
        lapply( vecs, function(y){
            if(all(x %in% y)){subsets <<- subsets + 1}
        })
        if(subsets > 1){
            dfSubsets <<- c(dfSubsets,list(x))
        }
}
)

dfSubsets

If you want a list of the indices of those vectors which are subsets, let me know.

I'm sure dplyr could probably help as well, but I'm not familiar with that.

Apply generally leads to a gain in efficiency - of course, if you only care about if it's the subset of one vector, it will sometimes be faster using a for loop and break, but you can't know that in advance.

This method works for strings or integers.



标签: r vector