I have a list of numeric vectors, and I need to reduce it to a list containing only one copy of each vector. Since there is no list method for identical, I wrote a function that checks every vector against every other one.
F1 <- function(x) {
  to_remove <- c()
  for (i in seq_along(x)) {
    for (j in seq_along(x)) {
      # flag only the later copy, so one instance of each vector is kept
      if (j > i && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
    }
  }
  if (is.null(to_remove)) x else x[-to_remove]
}
The problem is that this function becomes very slow as the input list x grows, partly due to the two large index vectors allocated for the nested for loops. I'm hoping for a method that runs in under one minute for a list of 1.5 million vectors, each of length 15, but that may be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be of equal length.
Sample output is shown below.
x <- list(1:4, 1:4, 2:5, 3:6)
F1(x)
# list(1:4, 2:5, 3:6)
As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine, and thus so should unique(ll). I previously suggested a method using sapply with the idea of not checking every element in the list (I deleted that answer, as I think using unique makes more sense).
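As a quick check on the sample data from the question, both give the expected result:
x <- list(1:4, 1:4, 2:5, 3:6)
identical(x[!duplicated(x)], unique(x))
# [1] TRUE
unique(x)
# [[1]]
# [1] 1 2 3 4
#
# [[2]]
# [1] 2 3 4 5
#
# [[3]]
# [1] 3 4 5 6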
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data: 15 random permutations of 1:100,
# then draw 1000 of them with replacement so the list has many duplicates
xx <- lapply(rep(100, 15), sample)
ll <- as.list(sample(xx, 1000, TRUE))
Putting these up against some benchmarks:
library(digest)  # needed for fun2's hashing

# fun1: keep an element only if it does not appear earlier in the list
fun1 <- function(ll) {
  ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
# fun2: hash each vector, then drop elements whose hash was already seen
fun2 <- function(ll) {
  ll[!duplicated(sapply(ll, digest))]
}
# fun3: duplicated() works directly on lists
fun3 <- function(ll) {
  ll[!duplicated(ll)]
}
# fun4: unique() does the whole job in one step
fun4 <- function(ll) {
  unique(ll)
}
# Make sure all four functions return the same result
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(rbenchmark)
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
#         test elapsed relative user.self sys.self
# 3     unique   0.048    1.000     0.049    0.000
# 2 duplicated   0.050    1.042     0.050    0.000
# 1     digest   8.427  175.563     8.415    0.038
# fun1 was dropped from the benchmark, since it runs extremely slowly when ll is large
Fastest Option:
unique(ll)
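As a rough sanity check against the one-minute goal in the question, something like the following could be timed. This is only a sketch: the pool of 1000 distinct vectors and the names xx_big, ll_big, and res are arbitrary choices for illustration, and timings will vary by machine.
# Approximately the scale described in the question:
# 1.5 million vectors, each of length 15
xx_big <- lapply(rep(15, 1000), sample)           # 1000 random length-15 vectors
ll_big <- sample(xx_big, 1.5e6, replace = TRUE)   # 1.5 million, heavily duplicated
system.time(res <- unique(ll_big))
length(res)  # at most 1000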
You could hash each of the vectors and then use !duplicated()
to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"