I have a directory containing ~40000 csv files, each ranging in size from ~400 bytes to ~11 MB. I have written a function that reads in a csv file and calculates some basic numbers for it (e.g. how many values of "female" it contains). The same code ran successfully on the same number of csv files when the files were smaller.
I'm using the packages parallel and doParallel to run this on my machine and receive the following error:
Error in unserialize(node$con) : error reading from connection
I suspect I'm running out of memory but I am not sure how best to handle the increased size of the files I'm working with. My code is as follows:
Say that 'fpath' is the path to my directory where all these csvs live:
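library(data.table)  # fread()
library(sqldf)       # sqldf(), used inside transit.grab
library(parallel)    # makeCluster(), parLapply()
library(doParallel)
# site csvs in specified directory
filelist <- list.files(fpath, pattern = "site.csv")
(f <- file.path(fpath, filelist))
# demographics csv in specified directory
dlist <- list.files(fpath, pattern = "demographics.csv")
(d <- file.path(fpath, dlist))
demos <- fread(d, header = TRUE, sep = ",")
cl <- makeCluster(4)
# demo.grab() and dl are defined elsewhere (not shown)
sdemos <- demo.grab(dl[[1]], demos)
# apply transit.grab to every site file across the 4 workers
stmp <- parLapply(cl, f, fun = transit.grab, sdemos)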
And the function 'transit.grab' is the following:
transit.grab <- function(sitefile, selected.demos) {
  demos <- selected.demos
  # Reading the site file and renaming its first column
  sf <- fread(sitefile, header = TRUE, sep = ",")
  names(sf)[1] <- "id"
  # Selecting transits from site file using list of selected users
  sdat <- sqldf('select sf.* from sf inner join demos on sf.id=demos.id')
  # ... (rest of the function not shown)
}
I'm not looking for someone to debug code, as I know it runs properly for a smaller amount of data, but rather I desperately need suggestions on how to implement this code for ~6.7 GB of data. Any and all feedback is welcome, thanks!
As suggested, I replaced sqldf() with merge(), which cut my computation time in half when tested on a smaller directory. Memory usage, as observed in Activity Monitor, stays fairly flat. But now when I run my code on the large directory, my R session crashes.
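For reference, a minimal sketch of what that kind of sqldf()-to-merge() replacement can look like. It uses toy tables in place of a real site file and the selected demographics; the columns other than id (stop, gender) and all values are made up for illustration, and this is not necessarily the exact call used in transit.grab.

```r
library(data.table)

# Toy stand-ins for one site file and the selected demographics (hypothetical data)
sf <- data.table(id = c(1, 2, 3, 4), stop = c("a", "b", "c", "d"))
demos <- data.table(id = c(2, 4, 5), gender = c("female", "male", "female"))

# sqldf version: select sf.* from sf inner join demos on sf.id=demos.id
# merge() version: inner join on id (all = FALSE is the default).
# Passing only the id column of demos keeps the result limited to sf's columns.
sdat <- merge(sf, demos[, .(id)], by = "id")

print(sdat)
```

Subsetting demos to its id column before the join mirrors the "select sf.*" in the SQL and avoids carrying the demographic columns into every per-file result.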