R - Extracting information from list of lists of d

Posted 2019-06-07 11:27


I have two needs, both connected to a dataset similar to the reproducible one below. I have a list of 18 entities, each composed of a list of 17-19 data.frames. Reproducible dataset follows (there are matrices instead of data.frames, but I do not suppose that makes a difference):

test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
             list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))

  1. I need to split each data.frame/matrix into two parts (by a given number of rows) and save the result to a new list of lists.
  2. I need to extract and save given column(s) from every data.frame in the list of lists.

I have no idea how to go about doing it apart from for(), but I am sure it should be possible with the apply() family of functions.

Thank you for reading


My expected output would look as follows:

extractedColumns <- list(list(matrix(10:(50-1), ncol = 10)[, 2], matrix(60:(100-1), ncol = 10)[, 2], matrix(110:(150-1), ncol = 10)[, 2]),
                         list(matrix(200:(500-1), ncol = 10)[, 2], matrix(600:(1000-1), ncol = 10)[, 2], matrix(1100:(1500-1), ncol = 10)[, 2]))

numToSubset <- 3
# first (nrow - numToSubset) rows via head(), last numToSubset rows via tail()
subsetFrames <- list(list(list(head(matrix(10:(50-1), ncol = 10), -numToSubset), tail(matrix(10:(50-1), ncol = 10), numToSubset)),
                          list(head(matrix(60:(100-1), ncol = 10), -numToSubset), tail(matrix(60:(100-1), ncol = 10), numToSubset)),
                          list(head(matrix(110:(150-1), ncol = 10), -numToSubset), tail(matrix(110:(150-1), ncol = 10), numToSubset))))
# ... and likewise for the second entity

It looks very messy, but I hope you can follow what I want.


You can use two nested lapply calls:

lapply(test, function(x) lapply(x, '[', c(2, 3)))


[[1]]
[[1]][[1]]
[1] 11 12

[[1]][[2]]
[1] 61 62

[[1]][[3]]
[1] 111 112


[[2]]
[[2]][[1]]
[1] 201 202

[[2]][[2]]
[1] 601 602

[[2]][[3]]
[1] 1101 1102


The first lapply is applied to the two lists in test. Each of those two lists contains three matrices. The second lapply iterates over those three matrices and subsets each one (that's the '[' function in the second lapply) with the index c(2, 3).

Note: In the case of a matrix, [ with a single index vector will subset elements 2 and 3, but the same call will subset columns 2 and 3 when used on a data.frame.
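To make the note above concrete, here is a small sketch comparing a single-index [ on a matrix and on a data.frame, followed by the explicit [rows, columns] form that extracts column 2 in either case. The names m, df, elems and cols are only illustrative; extractedColumns is the name used in the question.

```r
# Same reproducible data as in the question
test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
             list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))

m  <- test[[1]][[1]]      # 4 x 10 integer matrix
df <- as.data.frame(m)    # same data as a data.frame (columns V1..V10)

elems <- m[c(2, 3)]       # single index on a matrix: elements 2 and 3
cols  <- df[c(2, 3)]      # single index on a data.frame: columns V2 and V3

# The explicit [rows, columns] form selects column 2 for both classes
extractedColumns <- lapply(test, function(x) lapply(x, function(y) y[, 2]))
```

If the result should stay a one-column matrix or data.frame instead of being simplified to a vector, y[, 2, drop = FALSE] can be used in place of y[, 2].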

Subsetting rows and columns

lapply is very flexible when combined with anonymous functions. You can change the code to:

#change rows and columns into what you need
lapply(test, function(x) lapply(x, function(y) y[rows, columns]))

You can specify any combination of rows or columns you want.
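For the first need in the question (splitting every matrix into the leading rows and the last numToSubset rows), one possible sketch follows. splitRows is a hypothetical helper name, not something from the question or from base R, and it assumes numToSubset never exceeds the number of rows.

```r
# Same reproducible data as in the question
test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
             list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))

numToSubset <- 3  # value taken from the question

# Hypothetical helper: split m into its first (nrow - n) rows and its last n rows
splitRows <- function(m, n) {
  stopifnot(n >= 0, n <= nrow(m))
  cut <- nrow(m) - n
  list(first = m[seq_len(cut), , drop = FALSE],      # drop = FALSE keeps a 1-row result as a matrix
       last  = m[cut + seq_len(n), , drop = FALSE])
}

# Outer lapply walks the entities, inner lapply walks the matrices of one entity
subsetFrames <- lapply(test, function(x) lapply(x, splitRows, n = numToSubset))
```

Because the helper only relies on nrow() and [rows, , drop = FALSE], it should behave the same way on data.frames as on matrices.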


To piggyback on @LyzandeR's answer, consider the often ignored sibling of the apply family, rapply, which can recursively run functions on lists of vectors/matrices and return the same nested structure. It can often be used in place of nested lapply or its variants, vapply/sapply:

newtest1 <- lapply(test, function(x) lapply(x, '[', c(2, 3)))

newtest2 <- rapply(test, function(x) `[`(x, c(2, 3)), classes="matrix", how="list")

all.equal(newtest1, newtest2)
# [1] TRUE
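The same comparison can be sketched for the column extraction asked about in the question. One caveat worth checking on your own data: a data.frame is itself a list, so rapply descends into it and calls the function on each column rather than on the data.frame as a whole. The names viaLapply, viaRapply, dfList and perColumn are illustrative only.

```r
# Same reproducible data as in the question
test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
             list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))

# Column 2 of every matrix, once with nested lapply and once with rapply
viaLapply <- lapply(test, function(x) lapply(x, function(y) y[, 2]))
viaRapply <- rapply(test, function(y) y[, 2], classes = "matrix", how = "list")
sameResult <- isTRUE(all.equal(viaLapply, viaRapply))

# With data.frames, rapply recurses into the columns instead of seeing the whole frame
dfList <- list(list(data.frame(a = 1:3, b = 4:6)))
perColumn <- rapply(dfList, function(col) sum(col), how = "list")
```

So for a list of lists of data.frames, as in the real dataset described in the question, nested lapply is probably the safer choice when the function needs the whole data.frame.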

Interestingly, to my amazement, rapply runs slower in this use case compared to nested lapply! Hmmmm, back to the lab I go...


library(microbenchmark)

microbenchmark(newtest1 <- lapply(test, function(x) lapply(x, '[', c(2, 3))))
# Unit: microseconds
#     mean median     uq    max neval
# 31.92804 31.278 32.241 74.587   100

microbenchmark(newtest2 <- rapply(test, function(x) `[`(x, c(2, 3)),
                                        classes="matrix", how="list"))    
# Unit: microseconds
#    min    lq     mean median      uq    max neval
# 69.293 72.18 79.53353 73.143 74.5865 219.91   100

Even more interesting: replacing the [ function call with ordinary matrix bracket notation makes nested lapply slightly faster by median and rapply slightly slower!

microbenchmark(newtest3 <- lapply(test, function(x) 
                                  lapply(x, function(y) y[c(2, 3), 1])))
# Unit: microseconds
#    min     lq     mean median     uq    max neval
# 26.947 28.391 32.00987 29.354 30.798 100.09   100

all.equal(newtest1, newtest3)
# [1] TRUE

microbenchmark(newtest4 <- rapply(test, function(x) x[c(2,3), 1], 
                                  classes="matrix", how="list"))
# Unit: microseconds
#    min     lq     mean median     uq     max neval
# 74.105 76.752 80.37076 77.955 78.918 203.549   100

all.equal(newtest2, newtest4)
# [1] TRUE
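As a possible way to rerun the comparison, both expressions can be passed to a single microbenchmark() call so that they are timed in the same run and reported side by side. This is only a sketch: f_lapply and f_rapply are hypothetical wrapper names, the benchmark is skipped if the microbenchmark package is not installed, and the timings will differ by machine and R version.

```r
# Same reproducible data as in the question
test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
             list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))

# Hypothetical wrappers around the two approaches being compared
f_lapply <- function() lapply(test, function(x) lapply(x, function(y) y[c(2, 3), 1]))
f_rapply <- function() rapply(test, function(y) y[c(2, 3), 1], classes = "matrix", how = "list")

# Check that both approaches return the same values before timing them
resultsAgree <- isTRUE(all.equal(f_lapply(), f_rapply()))

# Time both expressions in one call; skipped when the package is unavailable
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(lapply = f_lapply(), rapply = f_rapply(), times = 100))
}
```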