challenge: optimize unlisting [easy]

2019-03-25 05:05发布

问题:

Because SO is a bit slow lately, I'm posting an easy question. I would appreciate it if big fishes stayed on the bench for this one and give rookies a chance to respond.

Sometimes we have objects that have a ridiculous amount of large list elements (vectors). How would you "unlist" this object into a single vector. Show proof that your method is faster than unlist().

回答1:

If you don't need names and your list is one level deep, then if you can beat

.Internal(unlist(your_list, FALSE, FALSE))

I will vote up everything you do on SO for the next 1 year!!!

[Update: if non-unique names are needed and the list is not recursive, here is a version which improves over the unlist 100 times

 myunlist <- function(l){
    names <- names(l)
    vec <- unlist(l, F, F)
    reps <- unlist(lapply(l, length), F, F)
    names(vec) <- rep(names, reps)
    vec
    }

 myunlist(list(a=1:3, b=2))
 a a a b 
 1 2 3 2 

 > tl <- list(a = 1:20000, b = 1:5000, c = 2:30)
 > system.time(for(i in 1:200) unlist(tl))
 user  system elapsed 
 22.97    0.00   23.00 

 > system.time(for(i in 1:200) myunlist(tl))
 user  system elapsed 
 0.2     0.0     0.2 

 > system.time(for(i in 1:200) unlist(tl, F, F))
 user  system elapsed 
 0.02    0.00    0.02 

]

[Update2: Responce to challenge Nr3 from Richie Cotton.

bigList3 <- replicate(500, rnorm(1e3), simplify = F)

unlist_vit <- function(l){
    names(l) <- NULL
    do.call(c, l)
    }

library(rbenchmark)

benchmark(unlist = unlist(bigList3, FALSE, FALSE),
          rjc    = unlist_rjc(bigList3),
          vit    = unlist_vit(bigList3),
          order  = "elapsed",
          replications = 100,
          columns = c("test", "relative", "elapsed")
          )

    test  relative elapsed
1 unlist   1.0000    2.06
3    vit   1.4369    2.96
2    rjc   3.5146    7.24

]

PS: I assume a "big fish" is the one with more reputation than you. So I am pretty much small here :).



回答2:

A non-unlist() solution would have to be pretty darned fast to beat unlist() would it not? Here it takes less than two second to unlist a list with 2000 numeric vectors of length 100,000 each.

> bigList2 <- as.list(data.frame(matrix(rep(rnorm(1000000), times = 200), 
+                                       ncol = 2000)))
> print(object.size(bigList2), units = "Gb")
1.5 Gb
> system.time(foo <- unlist(bigList2, use.names = FALSE))
   user  system elapsed 
  1.897   0.000   2.019

With bigList2 and foo in my workspace, R is using ~9Gb of my available memory. The key is use.names = FALSE. Without it unlist() is painfully slow. Exactly how slow I'm still waiting to find out...

We can speed this up a little bit more by setting recursive = FALSE and then we have effectively the same as VitoshKa's answer (two representative timings):

> system.time(foo <- unlist(bigList2, recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.379   0.001   1.416
> system.time(foo <- .Internal(unlist(bigList2, FALSE, FALSE)))
   user  system elapsed 
  1.335   0.000   1.344

... finally the use.names = TRUE version finished...:

> system.time(foo <- unlist(bigList2, use = TRUE))
    user   system  elapsed 
2307.839   10.978 2335.815

and it was consuming all my systems 16Gb of RAM so I gave up at that point...



回答3:

c() has the logical argument recursive which will recursively unlist a vector when set to TRUE (default is obviously FALSE).

l <- replicate(500, rnorm(1e3), simplify = F)

microbenchmark::microbenchmark(
  unlist = unlist(l, FALSE, FALSE),
  c = c(l, recursive = TRUE, use.names = FALSE)
)

# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# unlist 3.083424 3.121067 4.662491 3.172401 3.985668 27.35040   100
#      c 3.084890 3.133779 4.090520 3.201246 3.920646 33.22832   100


回答4:

As a medium-size fish, I'm jumping in with a first-attempt solution that gives a benchmark for little fishes to beat. It's about 3 times slower than unlist.

I'm using a smaller version of ucfagls's test list. (Since it fits in memory better.)

bigList3 <- as.list(data.frame(matrix(rep(rnorm(1e5), times = 200), ncol = 2000)))

The basic idea is to create one long vector to store the answer, then loop over list items copying values from the list.

unlist_rjc <- function(l)
{
  lengths <- vapply(l, length, FUN.VALUE = numeric(1), USE.NAMES = FALSE)
  total_len <- sum(lengths)
  end_index <- cumsum(lengths)
  start_index <- 1 + c(0, end_index)
  v <- numeric(total_len)
  for(i in seq_along(l))
  {
    v[start_index[i]:end_index[i]] <- l[[i]]
  }
  v
}

t1 <- system.time(for(i in 1:10) unlist(bigList2, FALSE, FALSE))
t2 <- system.time(for(i in 1:10) unlist_rjc(bigList2))
t2["user.self"] / t1["user.self"]  # 3.08

Challenges for little fishes:
1. Can you extend it to deal with other types than numeric?
2. Can you get it to work with recursion (nested lists)?
3. Can you make it faster?

I'll upvote anyone with less points than me whose answer meets one or more of these mini-challenges.