I have a directory containing ~40,000 csv files, each ranging in size from ~400 bytes to ~11 MB. I have written a function that reads in a csv file and calculates some basic numbers for it (e.g. how many values of "female" are in each csv file). This code ran successfully on the same number of csv files when the files themselves were smaller.
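For context, this is roughly the kind of per-file summary I mean; the "gender" column name here is only a placeholder for the example, not my actual schema:

library(data.table)

# read one csv and count how many rows are marked "female"
count.females <- function(csvfile) {
  dt <- fread(csvfile,header=T,sep=",")
  sum(dt$gender == "female", na.rm=TRUE)
}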
I'm using the parallel and doParallel packages to run this on my machine, and I receive the following error:
Error in unserialize(node$con) : error reading from connection
I suspect I'm running out of memory, but I'm not sure how best to handle the increased size of the files I'm working with. My code is as follows:
Say that 'fpath' is the path to my directory where all these csvs live:
library(parallel)
library(doParallel)
library(data.table)

fpath <- "../Desktop/bigdirectory"
# site csvs in specified directory
filelist <- as.vector(list.files(fpath,pattern="site.csv"))
(f <- file.path(fpath,filelist))
# demographics csv in specified directory
dlist <- as.vector(list.files(fpath,pattern="demographics.csv"))
(d <- file.path(fpath,dlist))
demos <- fread(d,header=T,sep=",")
# start a 4-worker cluster and register it
cl <- makeCluster(4)
registerDoParallel(cl)
setDefaultCluster(cl)
# make the helper function and data.table available on each worker
clusterExport(NULL,c('transit.grab'))
clusterEvalQ(NULL,library(data.table))
# pull the demographics of interest with my demo.grab() helper (not shown)
sdemos <- demo.grab(d[[1]],demos)
# read and join each site file in parallel; stmp is a list of joined tables
stmp <- parLapply(cl,f,FUN=transit.grab,sdemos)
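(The snippet ends there; for completeness, shutting the workers down afterwards is just the standard call from parallel:)

stopCluster(cl)  # stop the 4 worker processes and release their memory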
And the function 'transit.grab' is the following:
transit.grab <- function(sitefile,selected.demos){
require(sqldf)
  # local copy so the sqldf() query below can refer to the table as 'demos'
  demos <- selected.demos
# Renaming sitefile columns
sf <- fread(sitefile,header=T,sep=",")
names(sf)[1] <- c("id")
# Selecting transits from site file using list of selected users
sdat <- sqldf('select sf.* from sf inner join demos on sf.id=demos.id')
return(sdat)
}
I'm not looking for someone to debug my code, since I know it runs properly on a smaller amount of data, but I desperately need suggestions on how to run this code over ~6.7 GB of data. Any and all feedback is welcome, thanks!
UPDATE:
As suggested, I replaced sqldf() with merge(), which cut my computation time in half when tested on a smaller directory. Watching memory usage via Activity Monitor, it stays fairly flat. BUT now when I try running my code on the large directory, my R session crashes.
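For reference, the replacement join inside transit.grab() looks roughly like this (a sketch; merge() defaults to an inner join, though unlike the sf.*-only SQL result it also keeps the demographics columns):

# replaces the sqldf() call inside transit.grab()
sdat <- merge(sf,demos,by="id")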