I quite often come across data that is structured something like this:
employees <- list(
  list(id = 1,
       dept = "IT",
       age = 29,
       sportsteam = "softball"),
  list(id = 2,
       dept = "IT",
       age = 30,
       sportsteam = NULL),
  list(id = 3,
       dept = "IT",
       age = 29,
       sportsteam = "hockey"),
  list(id = 4,
       dept = NULL,
       age = 29,
       sportsteam = "softball"))
In many cases such lists can be tens of millions of items long, so memory use and efficiency are always a concern.
I would like to turn the list into a dataframe, but if I run:
library(data.table)
employee.df <- rbindlist(employees)
I get errors because of the NULL values. My normal strategy is to use a function like:
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}
and then:
employees <- lapply(employees, nullToNA)
employee.df <- rbindlist(employees)
which returns:
   id dept age sportsteam
1:  1   IT  29   softball
2:  2   IT  30         NA
3:  3   IT  29     hockey
4:  4   NA  29   softball
However, the nullToNA function is very slow when applied to 10 million cases, so it would be good to have a more efficient approach.
One point that seems to slow the process down is that is.null can only be applied to one item at a time (unlike is.na, which can scan a full list in one go).
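For example, on a small list:
is.na(c(1, NA, 3))                 # FALSE  TRUE FALSE -- vectorised over the elements
is.null(list(1, NULL, 3))          # FALSE            -- only tests the object as a whole
sapply(list(1, NULL, 3), is.null)  # FALSE  TRUE FALSE -- needs an explicit loop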
Any advice on how to do this operation efficiently on a large dataset?
All of these solutions (I think) are hiding the fact that the data table is still a list of lists and not a list of vectors (I did not notice it in my application either until it started throwing unexpected errors during :=). Try this:
data.table(t(sapply(employees, function(x)
  unlist(lapply(x, function(x) ifelse(is.null(x), NA, x))))))
I believe it works fine, but I am not sure whether it will suffer from slowness or whether it can be optimized further.
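A quick way to check whether the columns came out as plain vectors or are still lists:
sapply(employee.df, is.list)  # TRUE flags a column that is still a list rather than an atomic vector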
Many efficiency problems in R are solved by first changing the original data into a form that makes the processes that follow as fast and easy as possible. Usually, this is matrix form.
If you bring all the data together with rbind, your nullToNA function no longer has to search through nested lists, and therefore sapply serves its purpose (looking through a matrix) more efficiently. In theory, this should make the process faster; a sketch of the idea follows below.
Good question, by the way.
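A minimal sketch of that approach, reusing nullToNA from the question (not this answer's exact code):
m <- do.call(rbind, employees)  # one list-matrix: rows are employees, columns are fields
m <- nullToNA(m)                # a single sapply() pass over the whole matrix replaces every NULL
# split back into rows and let rbindlist() build atomic columns
employee.df <- rbindlist(lapply(seq_len(nrow(m)), function(i) m[i, ]))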
A two-step approach: first create a dataframe by combining the rows with rbind, then replace the NULLs. I am using the string "NULL" here because R does not keep a true NULL when you load the data; it reads the value in as a character string instead.
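A sketch of what the two steps could look like (not the original code; the inner lapply() only turns the NULLs into "NULL" strings to mimic data loaded from a file or database):
employees_chr <- lapply(employees, lapply,
                        function(x) if (is.null(x)) "NULL" else x)

# step 1: combine the rows with rbind into a dataframe
employee.df <- do.call(rbind, lapply(employees_chr, as.data.frame))

# step 2: recode the "NULL" strings as NA
employee.df[employee.df == "NULL"] <- NA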
A tidyverse solution that I find easier to read is to write a function that works on a single element and map it over all of your NULLs.
I'll use @rich-scriven's rbind and lapply approach to create a matrix, and then turn that into a dataframe.
Then we can use purrr::modify_depth() at a depth of 2 to apply replace_x():
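A sketch of that pipeline; replace_x() and the intermediate names are reconstructions rather than the exact original code:
library(purrr)

# works on a single element: a NULL cell becomes NA, anything else is kept
replace_x <- function(x, replacement = NA) {
  if (is.null(x)) replacement else x
}

mat <- do.call(rbind, employees)       # rbind the lists into a matrix of cells
df  <- as.data.frame(mat)              # at this point every column is a list column
df  <- modify_depth(df, 2, replace_x)  # depth 2 reaches each cell inside each column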
I often find do.call() constructions hard to read. A solution I use daily (with a MySQL output containing "NULL" character values) looks roughly like the sketch below.
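A minimal sketch, assuming the table has already been read into a data frame; mysql.df and its columns are placeholder names:
# mysql.df stands in for a table read from MySQL where missing values
# arrive as the character string "NULL"
mysql.df <- data.frame(id   = c(1, 2),
                       dept = c("IT", "NULL"),
                       stringsAsFactors = FALSE)

mysql.df[mysql.df == "NULL"] <- NA  # recode every "NULL" string to NA in one pass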
But for all solutions, please remember that NA cannot be used in calculations without na.rm = TRUE, whereas with NULL you can. NaN gives the same problem. For example:
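sum(c(1, 2, NA))                # NA: an NA in the input makes the whole result NA
sum(c(1, 2, NA), na.rm = TRUE)  # 3: the NA has to be removed explicitly
sum(c(1, 2, NaN))               # NaN: same problem as NA
sum(c(1, 2, NULL))              # 3: NULL is simply dropped, no na.rm needed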