I have a list of vectors, say vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7)
. I want to remove all the list members that if it is contained in another list member. For example, vecs$vec1
is part of vecs$vec2
, (or vecs$vec4
), so I want to remove it.
I want a very fast implementation, because length(vecs)
is very big. What I did is first sort vecs
member by length vecs = vecs[ order(unlist(lapply(vecs, length))) ]
, then for check_member = vecs[i]
, check whether it is part of vecs[ i+1 ], vecs[ i+2 ]...
.Is there any better strategy? Complete code:
vecs = list(vec1=1:3, vec2=1:5, vec3=2:6, vec4=1:7, vec5=2:3)
vecs = vecs[ order(unlist(lapply(vecs, length))) ] ##sort by member length
vecs_len = length(vecs)
toRemove = numeric(vecs_len ) ##record whether to remove this member
for( i in 1:(vecs_len-1 ))
for( j in (i+1):(vecs_len ))
{
if( length( setdiff(vecs[[i]],vecs[[j]]) )==0 ) {toRemove[i] = 1; break} ##check whether vecs[[i]] is part of vecs[[j]]
}
vecs = vecs[!toRemove]
Please try this on your large data. I am starting from vectors of characters as you clarified in your comment. I am also assuming the vectors do not contain duplicates (easy to fix via lapply(vecs, unique)
if they do)
vecs <- list(vec1=letters[1:3],
vec2=letters[1:5],
vec3=letters[2:6],
vec4=letters[1:7])
First, turn your data into a list of factors (i.e. integers) sharing the same levels:
vlevels <- unique(unlist(vecs))
nlev <- length(vlevels)
fvecs <- lapply(vecs, factor, levels = vlevels)
Then, I convert the data into a matrix of 0
and 1
for included/not-included:
vtabs <- vapply(fvecs, tabulate, integer(nlev), nbins = nlev)
# vec1 vec2 vec3 vec4
# [1,] 1 1 0 1
# [2,] 1 1 1 1
# [3,] 1 1 1 1
# [4,] 0 1 1 1
# [5,] 0 1 1 1
# [6,] 0 0 1 1
# [7,] 0 0 0 1
Next, I compute the crossproduct to see how many elements two vectors have in common, and compare with the length of each vector to decide whether the first one is a subset of the second one.
num.match <- crossprod(vtabs)
vlen <- sapply(vecs, length)
is.subset.mat <- num.match == vlen
diag(is.subset.mat) <- FALSE
# vec1 vec2 vec3 vec4
# vec1 FALSE TRUE FALSE TRUE
# vec2 FALSE FALSE FALSE TRUE
# vec3 FALSE FALSE FALSE TRUE
# vec4 FALSE FALSE FALSE FALSE
Finally, I sum row-wise to tell if each vector is a subset of any other:
is.subset <- rowSums(is.subset.mat) > 0L
# vec1 vec2 vec3 vec4
# TRUE TRUE TRUE FALSE
You are only left with filtering:
out <- vecs[!is.subset]
If your initial list is so large that this is too slow, then I would suggest that you run it chunk-by-chunk, then again on the results.
After doing some benchmarking here we can see you could just slightly modify your original code for a nice speedup. Using %in%
and any
is more efficient than setdiff
and length
.
# Your code
fun1 <- function(vecs){
vecs_len = length(vecs)
toRemove = numeric(vecs_len ) ##record whether to remove this member
for( i in 1:(vecs_len-1 ))
for( j in (i+1):(vecs_len ))
{
if( length( setdiff(vecs[[i]],vecs[[j]]) )==0 ) toRemove[i] = 1 ##check whether vecs[[i]] is part of vecs[[j]]
}
vecs = vecs[!toRemove]
return(vecs)
}
# flodel's code
fun2 <- function(vecs){
vlevels <- unique(unlist(vecs))
nlev <- length(vlevels)
fvecs <- lapply(vecs, factor, levels = vlevels)
vtabs <- vapply(fvecs, tabulate, integer(nlev), nbins = nlev)
num.match <- crossprod(vtabs)
vlen <- sapply(vecs, length)
is.subset.mat <- num.match == vlen
diag(is.subset.mat) <- FALSE
is.subset <- rowSums(is.subset.mat) > 0L
out <- vecs[!is.subset]
return(out)
}
# slight modification
fun3 <- function(vecs){
vecs_len = length(vecs)
toRemove = numeric(vecs_len ) ##record whether to remove this member
for( i in 1:(vecs_len-1 ))
for( j in (i+1):(vecs_len ))
{
if( any(vecs[[i]] %in% vecs[[j]]) ) toRemove[i] = 1 ##check whether vecs[[i]] is part of vecs[[j]]
}
vecs = vecs[!toRemove]
return(vecs)
}
microbenchmark(fun1(vecs), fun2(vecs), fun3(vecs), times=100L)
Unit: microseconds
expr min lq mean median uq max neval
fun1(vecs) 154.919 166.4245 179.62297 172.5880 180.3950 356.681 100
fun2(vecs) 220.255 231.5555 279.99874 237.7185 290.9335 609.398 100
fun3(vecs) 50.544 54.0370 64.86082 57.3250 64.5155 291.345 100
I think something like the following should work. It will give you a list of those vectors which are subsets of one of the others. It will be a subset of itself - so I only do it for those where it's a subset of a least two vectors (ie, itself and one other). Credit to https://stat.ethz.ch/pipermail/r-help/2005-March/068491.html for the %in% function.
dfSubsets <- list()
ds <- lapply(vecs,function(x){
subsets = 0
lapply( vecs, function(y){
if(all(x %in% y)){subsets <<- subsets + 1}
})
if(subsets > 1){
dfSubsets <<- c(dfSubsets,list(x))
}
}
)
dfSubsets
If you want a list of the indices of those vectors which are subsets, let me know.
I'm sure dplyr could probably help as well, but I'm not familiar with that.
Apply generally leads to a gain in efficiency - of course, if you only care about if it's the subset of one vector, it will sometimes be faster using a for loop and break, but you can't know that in advance.
This method works for strings or integers.