Binding dataframes in list after data cleaning on

2019-08-20 12:06发布

问题:

This is a follow up on my last question (Rbinding large list of dataframes after I did some data cleaning on the list). I've gotten smarter and the former question got messy.

I have 43 xlsx files which I loaded in to a list in R:

file.list <- list.files(recursive=T,pattern='*.xlsx')

dat = lapply(file.list, function(i){
x = read_xlsx(i, sheet=1, col_names = T)

# Create column with file name  
x$file = i
# Return data
x
})

I then added some column names:

my_names <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
          "IART", "HTRANS", "DTRANS", "BELOB", "REGD",
          "BOGFD", "VALORD", "UDLIGND", 
          "UÅ", "AFSTEMNGL", "NRBASIS","FIBILAG", "FILE")
dat <- lapply(dat, setNames, my_names)

Then I removed some columns:

dat <- lapply(dat, function(x) { x["UÅ"] <- NULL; x })
dat <- lapply(dat, function(x) { x["FIBILAG"] <- NULL; x })

I don't really need to remove them, but when I tried to merge the dataframes I kept getting errors about the class of these. So I just removed them.

Then I changes all my columns to character. I'm kinda new to R, so I'm aware that this code isn't very sexy and you probably could have made a loop or just one function for this. But this is what I did:

dat <- lapply(dat, function(x) { x["ID"] <- as.character(x["ID"]); x })

[I did the same for all columns]

And then I go and bind the data.

df <- rbindlist(dat)

EDIT:

I've found that it isn't my binding methods that's the problem (thank you for your inputs on that). I've deleted the part about the binding methods.

The problem lies in how I change the coltypes on my columns in my dataframes within the list.

I've also tried:

    dat <- lapply(dat, function(x) { x[,"ID"] <- as.character(x[,"ID"]); x })

I added a comma before "ID". This didn't help. I feel like I need to use unlist, but I'm not sure how to use it here?

回答1:

If you have a list of data frames dat i would recommend

df <- dplyr::bind_rows(dat)

to row bind them to one big data frame.

Alternatively you should use the purrr map family and return a row binded data.frame directly?

df <- purrr::map_dfr(file.list, function(x) readxl::read_xslx(x))


回答2:

I found a solution!!

Thank you for your help!

Apparently the problem wasn't in the binding of the data frames in the nested list. The problem was that I was changing the column types the wrong way.

Here's my code - and it works! And it's super faster than the other one!

file.list <- list.files(recursive=T,pattern='*.xlsx')

dat = lapply(file.list, function(i){
x = read_xlsx(i, sheet=1, col_names = T)

# Create column with file name  
x$file = i
# Return data
x
})

# Setting column names
my_names <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
          "IART", "HTRANS", "DTRANS", "BELOB", "REGD",
          "BOGFD", "VALORD", "UDLIGND", 
          "UÅ", "AFSTEMNGL", "NRBASIS","FIBILAG", "FILE")

dat <- lapply(dat, setNames, my_names)

# Removing problematic columns
dat <- lapply(dat, function(x) { x["UÅ"] <- NULL; x })
dat <- lapply(dat, function(x) { x["FIBILAG"] <- NULL; x })


dat2 <- lapply(dat, function(df) setDT(df)[, (1:16) := lapply(.SD, as.character), .SDcols = 1:16])

# Merging
df <- rbindlist(dat2)

Oh and several people told me to use bind_rows instead (@atomman and @Probel)

And I want to give credit to the person whom I stole the first part from, but I can't remember...