R, rbind with multiple files defined by a variable

Posted 2019-03-05 07:41

First off, this is related to a homework question for the Coursera R programming course. I have found other ways to do what I want, but my research has led me to a question I'm curious about. I have a variable number of CSV files that I need to pull data from, and I then need to take the mean of the "pollutant" column across those files. The files are named by id number in their directory. I put together the following code, which works fine for a single CSV file but doesn't work for multiple CSV files:

pollutantmean <- function (directory, pollutant, id = 1:332) {
  id <- formatC(id, width=3, flag="0")
  dataset<-read.csv(paste(directory, "/", id,".csv",sep=""),header=TRUE)
  mean(dataset[,pollutant], na.rm = TRUE)
}

I also know how to rbind multiple CSV files together if I know the ids when I am writing the function, but I am not sure how to apply rbind to a variable range of ids, or if that's even possible. I found other ways to do it, such as calling lapply and then unlisting the data; I'm just curious whether there is an easier way.

Tags: r rbind
2 answers
小情绪 Triste *
Reply #2 · 2019-03-05 08:05

read.csv(file, ...) does not accept a vector of paths for 'file'; each call reads a single file.

Below is a slight modification of your code. A vector of file paths is created, and sapply loops over it, calling the function once per file.

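# Build the full vector of file paths, one per id
files <- paste("directory-name/", formatC(1:332, width = 3, flag = "0"),
               ".csv", sep = "")
# Mean of one pollutant column in a single file
pollutantmean <- function(file, pollutant) {
    dataset <- read.csv(file, header = TRUE)
    mean(dataset[, pollutant], na.rm = TRUE)
}
# "sulfate" is a placeholder; pass the name of the column you need
sapply(files, pollutantmean, pollutant = "sulfate")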
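
Note that sapply here yields one mean per file, not a single overall mean. Averaging those per-file means only matches the overall mean when every file contributes the same number of non-missing values. A minimal illustration, with made-up numbers standing in for two files:

```r
# Hypothetical readings from two files of different lengths
a <- c(1, 2, 3)   # first file: three values
b <- c(10)        # second file: one value

# Averaging the per-file means weights each file equally
mean_of_means <- mean(c(mean(a), mean(b)))

# Pooling all values weights each observation equally
pooled_mean <- mean(c(a, b))

print(mean_of_means)  # 6
print(pooled_mean)    # 4
```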
Summer. ? 凉城
Reply #3 · 2019-03-05 08:13

Well, this uses an lapply, but it might be what you want.

file_list <- list.files("*your directory*", full.names = TRUE)

combined_data <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))

This will turn all of your files into one large dataset, and from there it's easy to take the mean. Is that what you wanted?
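
To tie this back to the original function, here is a rough sketch of pollutantmean that rbinds a variable range of ids. It assumes the files are named 001.csv, 002.csv, and so on inside `directory`; the demo directory, file contents, and the "sulfate" column name are made up for illustration.

```r
# Sketch only: assumes zero-padded file names such as "001.csv"
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # One path per requested id
  files <- file.path(directory,
                     paste0(formatC(id, width = 3, flag = "0"), ".csv"))
  # Read every file, then stack the data frames row-wise
  combined <- do.call(rbind, lapply(files, read.csv, header = TRUE))
  mean(combined[[pollutant]], na.rm = TRUE)
}

# Throwaway demo data (hypothetical values) in a temporary directory
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_mean <- pollutantmean(demo_dir, "sulfate", id = 1:2)
print(demo_mean)  # 3
```

Because the ids are turned into a vector of paths before reading, the same function works for a single id or any range of them.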

An alternative way of doing this would be to step through file by file, taking sums and number of observations and then taking the mean afterwards, like so:

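# Assumes `pollutant` holds the column name as a string (e.g. a function argument)
sums <- numeric(length(file_list))
n <- numeric(length(file_list))
for (i in seq_along(file_list)) {
  temp_df <- read.csv(file_list[i], header = TRUE)
  values <- temp_df[[pollutant]]
  sums[i] <- sum(values, na.rm = TRUE)  # skip missing readings
  n[i] <- sum(!is.na(values))           # count only non-missing readings
}
new_mean <- sum(sums) / sum(n)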

Note that both of these methods require that the folder contains only the CSV files you want. If it also holds files you're not interested in, you can use the pattern argument in the list.files call to filter them out.
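
As a small illustration of the pattern argument, the sketch below creates a few empty files in a temporary directory (the file names are hypothetical) and lists only those ending in .csv:

```r
# Throwaway directory with two CSV files and one unrelated file
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("001.csv", "002.csv", "notes.txt")))

# pattern is a regular expression; "\\.csv$" matches names ending in ".csv"
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
print(basename(csv_files))
```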
