Convert R read.csv to a readLines batch?

Published 2019-07-24 09:13

Question:

I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.

I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:

trainingdata <- read.csv('in.csv', stringsAsFactors = FALSE)
fit <- mymodel(Y~., data=trainingdata)

newdata <- read.csv('newstuff.csv', stringsAsFactors = FALSE)
preds <- predict(fit, newdata)
write.csv(preds, file=filename)

to something like:

trainingdata <- read.csv('in.csv', stringsAsFactors = FALSE)
fit <- mymodel(Y~., data=trainingdata)

con <- file("newstuff.csv", open = "r")
i <- 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
    i <- i + 1
    newdata <- as.data.frame(mylines, stringsAsFactors = FALSE)
    preds <- predict(fit, newdata)
    write.csv(preds, file = paste(filename, i, '.csv', sep = ''))
}
close(con)

However, when I print the mylines object inside the loop, it isn't parsed into columns the way read.csv output is: the header row is just another element of the character vector, and nothing splits each line into the separate columns that read.csv produces.

Whenever I find myself writing barbaric things like cutting off the first row and wrapping the columns by hand, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines CSV connection?
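One way to get read.csv-style parsing from readLines chunks is to read the header line once, then feed each chunk (with the header re-attached) back through read.csv via its text argument. A minimal, self-contained sketch of the idea, using a small made-up CSV in place of the real newstuff.csv and printing dimensions where predict/write.csv would go:

```r
# Sketch: parse readLines() chunks with read.csv(text = ...).
# 'newstuff.csv' and chunk_size = 500 are placeholders from the question;
# the example file below is made up so the sketch runs on its own.
chunk_size <- 500
write.csv(data.frame(x = 1:7, y = letters[1:7]), "newstuff.csv",
          row.names = FALSE)

con <- file("newstuff.csv", open = "r")
header <- readLines(con, n = 1)   # read the header row once
i <- 0
while (length(mylines <- readLines(con, n = chunk_size, warn = FALSE)) > 0) {
  i <- i + 1
  # Re-attach the header so read.csv() assigns the same column names
  newdata <- read.csv(text = paste(c(header, mylines), collapse = "\n"),
                      stringsAsFactors = FALSE)
  # predict()/write.csv() would go here; print dimensions as a stand-in
  print(dim(newdata))
}
close(con)
```

Because the header travels with every chunk, each newdata has the same columns as a one-shot read.csv would produce.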

Answer 1:

You can read your data into memory in chunks with read.csv by using the skip and nrows arguments. In pseudo-code:

read_chunk = function(start, n) {
  read.csv(file, skip = start, nrows = n)
}

start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
  dat = read_chunk(x, chunk_size)
  pred = predict(fit, dat)
  write.csv(pred, file = paste0('pred', x, '.csv'))
})
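Filled out, the pseudo-code needs to account for the header row: read the column names once, then pass header = FALSE and col.names for the later chunks, since skip would otherwise treat the header as data. A self-contained sketch (file name, chunk size, and the pred_ output names are made up; predict(fit, dat) is commented out so the sketch runs on its own):

```r
# Sketch of the skip/nrows chunker above, with header handling.
# A small made-up CSV stands in for the real data.
write.csv(data.frame(x = 1:10, y = 11:20), "newstuff.csv", row.names = FALSE)

chunk_size <- 4
cols <- names(read.csv("newstuff.csv", nrows = 1))  # column names from header

read_chunk <- function(start, n) {
  # skip = start skips the header plus the rows already read
  read.csv("newstuff.csv", skip = start, nrows = n,
           header = FALSE, col.names = cols)
}

start <- 1  # row 0 is the header
repeat {
  # read.csv errors when skip runs past the end of the file, so guard it
  dat <- tryCatch(read_chunk(start, chunk_size), error = function(e) NULL)
  if (is.null(dat) || nrow(dat) == 0) break
  # pred <- predict(fit, dat)  # real scoring would go here
  write.csv(dat, file = paste0("pred_", start, ".csv"), row.names = FALSE)
  start <- start + chunk_size
  if (nrow(dat) < chunk_size) break  # short chunk means we hit the end
}
```

One caveat of this approach versus a readLines connection: every read.csv call re-scans the file from the top to honor skip, so it does more total I/O on very large files.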

Alternatively, you could put the data into an SQLite database and use the RSQLite package to query it in chunks. See also this answer, or do some digging with [r] large csv on SO.
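The SQLite route can look like the following sketch, which uses the DBI and RSQLite packages (both assumed installed). The table is built in-memory from made-up data; dbFetch(res, n = ...) pulls the rows a batch at a time, which is where predict would slot in:

```r
# Sketch: chunked scoring from an SQLite database via DBI/RSQLite.
# Table name 'newstuff' and the pred_db_ file names are made up.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "newstuff", data.frame(x = 1:10, y = 11:20))

res <- dbSendQuery(con, "SELECT * FROM newstuff")
i <- 0
while (!dbHasCompleted(res)) {
  dat <- dbFetch(res, n = 4)   # pull 4 rows at a time
  if (nrow(dat) == 0) break
  i <- i + 1
  # pred <- predict(fit, dat)  # scoring would go here
  write.csv(dat, file = paste0("pred_db_", i, ".csv"), row.names = FALSE)
}
dbClearResult(res)
dbDisconnect(con)
```

Unlike the skip/nrows approach, the database keeps a cursor, so each batch is fetched without re-scanning everything before it.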