I have a list of lists containing data.frames, from which I want to select only a few rows. I can achieve this in a for-loop, where I create a sequence based on the number of rows and select only the row indices given by that sequence.
But with more deeply nested lists it no longer works. I am also sure that there is a better way of doing this without a loop.
What would be an efficient and generic approach to sample from nested lists that vary in their dimensions and contain data.frames or matrices?
## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
## Code to optimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) {
tmp <- do.call(rbind, crdOrig[[r]])
filterInd <- seq(1,nrow(tmp), by = filterBy)
FiltRef[[r]] <- tmp[filterInd,]
}
crdResult <- do.call(rbind, FiltRef)
# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = FALSE))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)
The code above also works if a list contains several data.frames (data below).
## Dummy Data (Multiple DF)
crdOrig <- list(
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
But if a list contains multiple lists, it throws an error when trying to bind the result (FiltRef) together.
The result can be a data.frame with 2 columns (x, y), like crdResult, or a one-dimensional list like FiltRef (from the first example).
## Dummy Data (Multiple Lists)
crdOrig <- list(
list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)
+1 and Thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim as his solution is the most flexible and extensive, although they all deserve to be checked!
I too would flatten the list-of-lists into a standard representation (and do all analysis on the flattened representation, not just the subsetting), but keep track of relevant indexing information, e.g.,
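A minimal sketch of such a flattening step, assuming every leaf of the nested list is a data.frame. The names flatten_indexed, id and depth are placeholders chosen for this illustration; only the internal .f() and the counter i follow the description in this answer.

```r
# Hypothetical helper: flatten a nested list of data.frames into one
# data.frame, tagging each row with the data.frame it came from (id)
# and the nesting depth at which that data.frame was found (depth).
flatten_indexed <- function(x) {
  i <- 0L  # running identifier, shared across recursive calls via <<-
  .f <- function(elt, depth) {
    if (is.data.frame(elt)) {
      i <<- i + 1L
      cbind(elt, id = rep(i, nrow(elt)), depth = rep(depth, nrow(elt)))
    } else {
      # assumed to be a list: recurse into each element, then row-bind
      do.call(rbind, lapply(elt, .f, depth = depth + 1L))
    }
  }
  .f(x, 0L)
}

set.seed(1)
nested <- list(
  list(list(data.frame(x = runif(20), y = runif(20))),
       list(data.frame(x = runif(20), y = runif(20)))),
  list(data.frame(x = runif(30), y = runif(30)))
)
flat <- flatten_indexed(nested)

# keep every 10th row within each original data.frame
keep <- unlist(lapply(split(seq_len(nrow(flat)), flat$id),
                      function(rows) rows[seq(1, length(rows), by = 10)]))
result <- flat[keep, ]
```

Because the id column survives the flattening, any later step (subsetting, summarising, plotting by group) can still be related back to the original data.frames.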
The internal function .f() visits each element of a list. If the element is a data.frame, it adds a unique identifier to index it. If it's a list, then it calls itself on each element of the list (incrementing a depth counter, in case this is useful; one could also add a 'group' counter) and then row-binds the elements. I use an internal function so that I can have a variable i to increment across function calls. The end result is a single data frame with an index to use for referencing the original results.
The overall pattern of .f() can be adjusted for additional data types, e.g., (some details omitted)
Preparation and implementation of flatten
Well, there are many other answers which are in principle the same.
I meanwhile implemented for fun the flattening of nested lists.
Since I am thinking in Lisp:
Implemented first car and cdr from Lisp.
Some predicate functions:
Which are necessary to build flatten (for data frame lists)
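A rough sketch of how these pieces could fit together, assuming the leaves are always data frames. car, cdr and flatten are the names used in this answer; the predicate names is.empty, is.df and is.nested.list and all function bodies are guesses made for illustration.

```r
# Lisp-style accessors for R lists
car <- function(l) l[[1]]   # first element
cdr <- function(l) l[-1]    # everything but the first element

# Predicates (hypothetical names)
is.empty       <- function(l) length(l) == 0
is.df          <- function(x) is.data.frame(x)
is.nested.list <- function(x) is.list(x) && !is.data.frame(x)

# Flatten a nested list of data frames into a flat list of data frames.
# acc accumulates the data frames found so far, in order of appearance.
flatten <- function(l, acc = list()) {
  if (is.empty(l)) {
    acc
  } else if (is.df(l)) {
    c(acc, list(l))
  } else if (is.nested.list(car(l))) {
    flatten(cdr(l), flatten(car(l), acc))
  } else {
    flatten(cdr(l), c(acc, list(car(l))))
  }
}

df1 <- data.frame(x = 1:3, y = 4:6)
df2 <- data.frame(x = 7:8, y = 9:10)
df3 <- data.frame(x = 11L, y = 12L)
flat <- flatten(list(list(list(df1), list(df2)), list(df3)))
```

The data-frame check has to come before treating something as a list, since a data.frame is itself a list of columns and would otherwise be taken apart.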
After this, the actual function is defined using a sampling function.
Defining sampling function
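One possible shape for such a sampler, as a sketch only: the name sample.one.nth.of.rows is the one used in this answer, while the signature, the default fraction of 1/10 and the rounding rule are assumptions.

```r
# Draw a random fraction of the rows of a data frame (without replacement).
# The drawn rows are returned in their original order.
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
  n <- nrow(df)
  k <- round(n * fraction)          # number of rows to keep (assumed rule)
  df[sort(sample.int(n, size = k)), , drop = FALSE]
}

set.seed(42)
df <- data.frame(x = runif(100, 10, 20), y = runif(100, 40, 60))
s  <- sample.one.nth.of.rows(df)    # 10 of the 100 rows
```

Using drop = FALSE keeps the result a data frame even if it has only one column.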
The actual collector function (from nested data-frame-lists)
collect.df.samples first flattens the nested list construct of data frames df.list.construct to a flat list of data frames. It applies the function sample.one.nth.of.rows to each element of the list (lapply). Thereby it produces a list of sampled data frames (each containing a fraction, here 1/10th, of the original data frame's rows). These sampled data frames are rbind-ed across the list. The resulting data frame is returned. It consists of the sampled rows of each of the data frames.
Testing on example
Refactoring for later modifications
By writing the collect.df.samples function to:
One can make the sampler function replaceable. (And if not: by changing the fraction parameter, one can increase or reduce the number of rows collected from each data frame.)
The sampler function is in this definition easily exchangeable
For choosing every nth (e.g. every 10th) row in the data frame, instead of a random sampling, you could e.g. use the sampler function:
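A sketch of what that could look like. collect.df.samples and df.sampler.fun are the names used in this answer; sample.every.nth.row and flatten.dfs are hypothetical, and the collector shown here is a simplified stand-in rather than the original definition.

```r
# Deterministic sampler: keep rows 1, 1 + n, 1 + 2n, ...
sample.every.nth.row <- function(df, n = 10) {
  idx <- seq_len(nrow(df))
  df[idx[(idx - 1) %% n == 0], , drop = FALSE]
}

# Stand-in flattener (assumes all leaves are data frames)
flatten.dfs <- function(x) {
  if (is.data.frame(x)) list(x) else do.call(c, lapply(x, flatten.dfs))
}

# Stand-in collector: flatten, apply the sampler to each data frame,
# then row-bind the sampled pieces. Extra arguments go to the sampler.
collect.df.samples <- function(df.list.construct, df.sampler.fun, ...) {
  flat <- flatten.dfs(df.list.construct)
  do.call(rbind, lapply(flat, df.sampler.fun, ...))
}

nested <- list(
  list(list(data.frame(x = 1:25, y = 25:1))),
  list(data.frame(x = 101:130, y = 1:30))
)
res <- collect.df.samples(nested, df.sampler.fun = sample.every.nth.row, n = 10)
```

With this layout, swapping random for systematic sampling only means passing a different function as df.sampler.fun.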
and input it as df.sampler.fun = in collect.df.samples. Then, this function will be applied to every data frame in the nested df list object and collected to one data frame.
Here's an answer in base borrowing from a custom "rapply" function mentioned here: rapply to nested list of data frames in R
Consider a recursive call conditionally checking if first item is a data.frame or list class.
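As a sketch of that idea in base R, assuming every list is non-empty and holds either only data.frames or only further lists. The function name filter_nested and its by argument are made up for this illustration; the body mirrors the loop from the question.

```r
# If the first item is a data.frame, bind the data.frames at this level and
# keep every by-th row; otherwise recurse into each sub-list and bind results.
filter_nested <- function(x, by = 10) {
  if (is.data.frame(x[[1]])) {
    tmp <- do.call(rbind, x)
    tmp[seq(1, nrow(tmp), by = by), , drop = FALSE]
  } else {
    do.call(rbind, lapply(x, filter_nested, by = by))
  }
}

set.seed(1)
n1 <- 100; n2 <- 300; n3 <- 100
crdOrig <- list(
  list(list(data.frame(x = runif(n1, 10, 20), y = runif(n1, 40, 60))),
       list(data.frame(x = runif(n1, 10, 20), y = runif(n1, 40, 60)))),
  list(data.frame(x = runif(n2, 10, 20), y = runif(n2, 40, 60))),
  list(data.frame(x = runif(n3, 10, 20), y = runif(n3, 40, 60)))
)
crdResult <- filter_nested(crdOrig, by = 10)
```

Only the first item of each list is inspected, so a list mixing data.frames and sub-lists at the same level would need an extra check.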
I would just flatten the whole darn thing and work on a clean list.