vectorizing & parallelizing the disaggregation of a list of data.frames

Posted 2019-09-08 15:41

Question:

Here's some code that generates a list of data.frames and then converts that original list into a new list whose i-th element is a list containing the i-th row of each data.frame.

E.g.
- l1 has length 10 and each element is a data.frame with 1000 rows.
- l2 is a list of length 1000 (nrow(l1[[k]])) and each element is a list of length 10 (length(l1)) containing row-vectors from the elements of l1

l1 <- vector("list", length= 10)
set.seed(65L)
for (i in 1:10) {
  l1[[i]] <- data.frame(matrix(rnorm(10000),ncol=10))
}

l2 <- vector(mode="list", length= nrow(l1[[1]]))
for (i in 1:nrow(l1[[1]])) {
  l2[[i]] <- lapply(l1, function(l) return(unlist(l[i,])))
}

Edit: To clarify how l1 relates to l2, here is language-agnostic pseudocode.

for (j in 1:length(l1)) {
  for (i in 1:nrow(l1[[1]])) { # where nrow(l1[[1]]) == nrow(l1[[k]]), k = 2,...,10
    l2[[i]][[j]] <- l1[[j]][i,]
  }
}
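
A quick sanity check of this relationship (this snippet is illustrative and not part of the original post); note that the actual l2 stores unlisted row vectors rather than one-row data.frames:

# l2[[i]][[j]] should equal the i-th row of l1[[j]], as an unlisted vector
all.equal(l2[[3]][[7]], unlist(l1[[7]][3, ]))  # TRUE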

How do I speed up the creation of l2 via vectorization or parallelization? The problem I'm having is that parallel::parLapplyLB splits lists; however, I don't want to split the list l1 itself. What I want to do is split the rows within each element of l1. An intermediate solution would vectorize my current approach by using some *apply function to replace the for-loop; this could obviously be extended to a parallel solution as well.
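
For reference, here is a minimal sketch of the two directions hinted at above: an *apply replacement for the outer for-loop, and a parallel variant that distributes the row indices (rather than the elements of l1) across workers. The two-worker cluster and the names l2_serial/l2_parallel are arbitrary choices for illustration, not a benchmarked solution:

library(parallel)

# serial: replace the outer for-loop with lapply over the row indices
l2_serial <- lapply(seq_len(nrow(l1[[1]])), function(i) {
  lapply(l1, function(l) unlist(l[i, ]))
})

# parallel: split the row indices, not l1 itself, across the workers
cl <- makeCluster(2L)
clusterExport(cl, "l1")
l2_parallel <- parLapplyLB(cl, seq_len(nrow(l1[[1]])), function(i) {
  lapply(l1, function(l) unlist(l[i, ]))
})
stopCluster(cl)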

If I solve this on my own before an acceptable solution, I'll post my answer here.

Answer 1:

I would break the structure completely and rebuild the second list via split(). This approach needs much more memory than the original one, but at least for the given example it is >10x faster:

sgibb <- function(x) {
  ## get the lengths of all data.frames (equal to `sapply(x, ncol)`)
  n <- lengths(x)
  ## destroy the list structure
  y <- unlist(x, use.names = FALSE)
  ## generate row indices (stores the information which row the element in y
  ## belongs to)
  rowIndices <- unlist(lapply(n, rep.int, x=1L:nrow(x[[1L]])))
  ## split y first by rows
  ## and subsequently loop over these lists to split by columns
  lapply(split(y, rowIndices), split, f=rep.int(seq_along(n), n))
}

alex <- function(x) {
  l2 <- vector(mode="list", length= nrow(x[[1]]))
  for (i in 1:nrow(x[[1]])) {
    l2[[i]] <- lapply(x, function(l) return(unlist(l[i,])))
  }
  l2
}

## check.attributes is needed because the names differ
all.equal(alex(l1), sgibb(l1), check.attributes=FALSE)
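
A hypothetical illustration of where the attributes differ (not part of the original answer): alex() keeps the column names on each row vector, whereas sgibb() drops element names via use.names = FALSE and instead names the list levels by index.

names(alex(l1)[[1]][[1]])   # "X1" "X2" ... "X10"
names(sgibb(l1)[[1]][[1]])  # NULL
names(sgibb(l1)[[1]])       # "1" "2" ... "10"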

library(rbenchmark)
benchmark(alex(l1), sgibb(l1), order = "relative", replications = 10)
#       test replications elapsed relative user.self sys.self user.child sys.child
#2 sgibb(l1)           10   0.808    1.000     0.808        0          0         0
#1  alex(l1)           10  11.970   14.814    11.972        0          0         0