I have two needs, both connected to a dataset similar to the reproducible one below. I have a list of 18 entities, each composed of a list of 17-19 data.frames. Reproducible dataset follows (there are matrices instead of data.frames, but I do not suppose that makes a difference):
test <- list(list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10), matrix(110:(150-1), ncol = 10)),
list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10), matrix(1100:(1500-1), ncol = 10)))
- First, I need to split each data.frame/matrix into two parts (by a given number of rows) and save the result to a new list of lists.
- Secondly, I need to extract and save a given column (or columns) out of every data.frame in the list of lists.
I have no idea how to go about doing it apart from for(), but I am sure it should be possible with the apply() family of functions.
Thank you for reading
EDIT:
My expected output would look as follows:
extractedColumns <- list(list(matrix(10:(50-1), ncol = 10)[, 2], matrix(60:(100-1), ncol = 10)[, 2], matrix(110:(150-1), ncol = 10)[, 2]),
list(matrix(200:(500-1), ncol = 10)[, 2], matrix(600:(1000-1), ncol = 10)[, 2], matrix(1100:(1500-1), ncol = 10)[, 2]))
numToSubset <- 3
substetFrames <- list(list(list(matrix(10:(50-1), ncol = 10)["first length - numToSubset rows", ], matrix(10:(50-1), ncol = 10)["last numToSubset rows", ]),
list(matrix(60:(100-1), ncol = 10)["first length - numToSubset rows", ], matrix(60:(100-1), ncol = 10)["last numToSubset rows", ]),
list(matrix(110:(150-1), ncol = 10)["first length - numToSubset rows", ], matrix(110:(150-1), ncol = 10)["last numToSubset rows", ])),
etc...)
It gets to look very messy, hope you can follow what I want.
You can use two nested lapply calls:
lapply(test, function(x) lapply(x, '[', c(2, 3)))
Output:
[[1]]
[[1]][[1]]
[1] 11 12
[[1]][[2]]
[1] 61 62
[[1]][[3]]
[1] 111 112
[[2]]
[[2]][[1]]
[1] 201 202
[[2]][[2]]
[1] 601 602
[[2]][[3]]
[1] 1101 1102
Explanation
The first lapply is applied to the two lists inside test. Each of those two lists contains three matrices. The second lapply iterates over those three matrices and subsets each one with the index c(2, 3) (that's the '[' function in the second lapply).
Note: In the case of a matrix, '[' with a single index will subset elements 2 and 3, but the same call will subset columns 2 and 3 when used on a data.frame.
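To make that difference concrete, here is a small sketch. The names m, df, elems, df_cols and col2 are only for illustration and are not part of the question. It also shows one possible way to get the extractedColumns result from the question by leaving the row index empty, assuming every element is a matrix or data.frame with at least two columns.

```r
# Illustrative stand-ins for a single element of the nested list
m  <- matrix(10:49, ncol = 10)   # 4 rows x 10 columns, filled column-wise
df <- as.data.frame(m)           # same values as a data.frame (columns V1..V10)

elems   <- `[`(m, c(2, 3))       # single index on a matrix: elements 2 and 3 (11, 12)
df_cols <- `[`(df, c(2, 3))      # single index on a data.frame: columns V2 and V3

# To pull a column out of a matrix, leave the row index empty
col2 <- m[, 2]                   # 14 15 16 17

# The reproducible data from the question
test <- list(
  list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10),
       matrix(110:(150-1), ncol = 10)),
  list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10),
       matrix(1100:(1500-1), ncol = 10))
)

# Column 2 of every matrix, keeping the list-of-lists shape
extractedColumns <- lapply(test, function(x) lapply(x, function(y) y[, 2]))
```

The y[, 2] form works for both matrices and data.frames, so it is probably the safer choice if you are not sure which of the two you will get.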
Subsetting rows and columns
lapply is very flexible when combined with anonymous functions. By changing the code into:
#change rows and columns into what you need
lapply(test, function(x) lapply(x, function(y) y[rows, columns]))
You can specify any combination of rows or columns you want.
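As a sketch of the first need in the question (splitting every matrix into its leading rows and its last numToSubset rows), something along these lines should work. split_rows is a hypothetical helper name and subsetFrames is just my name for the result; the code assumes numToSubset is never larger than the number of rows of any matrix.

```r
# The reproducible data from the question
test <- list(
  list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10),
       matrix(110:(150-1), ncol = 10)),
  list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10),
       matrix(1100:(1500-1), ncol = 10))
)

numToSubset <- 3  # number of trailing rows to split off, as in the question

# Hypothetical helper: split one matrix/data.frame into two row blocks.
# Assumes n <= nrow(y); drop = FALSE keeps one-row results two-dimensional.
split_rows <- function(y, n) {
  cut <- nrow(y) - n
  list(
    first = y[seq_len(cut), , drop = FALSE],      # first nrow(y) - n rows
    last  = y[cut + seq_len(n), , drop = FALSE]   # last n rows
  )
}

# Same nested lapply pattern, one level deeper in the result
subsetFrames <- lapply(test, function(x) lapply(x, split_rows, n = numToSubset))

dim(subsetFrames[[1]][[1]]$first)  # 1 10
dim(subsetFrames[[1]][[1]]$last)   # 3 10
```

Extra arguments such as n = numToSubset are passed through lapply to the function, so no second anonymous function is needed for the inner call.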
To piggyback on @LyzandeR's answer, consider the often ignored sibling of the apply family, rapply, which can recursively run a function on lists of vectors/matrices and return the same nested structure. It can often replace nested lapply calls or their variants vapply/sapply:
newtest1 <- lapply(test, function(x) lapply(x, '[', c(2, 3)))
newtest2 <- rapply(test, function(x) `[`(x, c(2, 3)), classes="matrix", how="list")
all.equal(newtest1, newtest2)
# [1] TRUE
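For the column extraction asked about in the question, a rough sketch with rapply could look like the following. It relies on the default classes = "ANY", so the function is applied to every non-list leaf, and it assumes every leaf is a matrix with at least two columns. The names nested and flat are only for illustration.

```r
# The reproducible data from the question
test <- list(
  list(matrix(10:(50-1), ncol = 10), matrix(60:(100-1), ncol = 10),
       matrix(110:(150-1), ncol = 10)),
  list(matrix(200:(500-1), ncol = 10), matrix(600:(1000-1), ncol = 10),
       matrix(1100:(1500-1), ncol = 10))
)

# how = "list" keeps the list-of-lists shape, like the nested lapply
nested <- rapply(test, function(m) m[, 2], how = "list")

# how = "unlist" (the default) flattens everything into one vector
flat <- rapply(test, function(m) m[, 2], how = "unlist")

# Same result as the nested lapply version
byLapply <- lapply(test, function(x) lapply(x, function(y) y[, 2]))
all.equal(nested, byLapply)
```

Which form to use mostly depends on whether you need the nesting afterwards; how = "unlist" is handy when you only want the values.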
Interestingly, to my amazement, rapply runs slower in this use case compared to nested lapply! Hmmmm, back to the lab I go...
library(microbenchmark)
microbenchmark(newtest1 <- lapply(test, function(x) lapply(x, '[', c(2, 3))))
# Unit: microseconds
# mean median uq max neval
# 31.92804 31.278 32.241 74.587 100
microbenchmark(newtest2 <- rapply(test, function(x) `[`(x, c(2, 3)),
classes="matrix", how="list"))
# Unit: microseconds
# min lq mean median uq max neval
# 69.293 72.18 79.53353 73.143 74.5865 219.91 100
Even more interesting: when the '[' function call is replaced with the equivalent matrix bracket notation, nested lapply runs even faster and rapply even slower!
microbenchmark(newtest3 <- lapply(test, function(x)
lapply(x, function(y) y[c(2, 3), 1])))
# Unit: microseconds
# min lq mean median uq max neval
# 26.947 28.391 32.00987 29.354 30.798 100.09 100
all.equal(newtest1, newtest3)
# [1] TRUE
microbenchmark(newtest4 <- rapply(test, function(x) x[c(2,3), 1],
classes="matrix", how="list"))
# Unit: microseconds
# min lq mean median uq max neval
# 74.105 76.752 80.37076 77.955 78.918 203.549 100
all.equal(newtest2, newtest4)
# [1] TRUE