I have a large dataset (several GB) that I need to process before analysing it. I tried creating a connection, which lets me loop through the dataset and extract one chunk at a time. This allows me to quarantine the rows that satisfy certain conditions.
My problem is that I am not able to create an indicator for the connection that flags when the end of the dataset has been reached, so that I can then execute close(con). Moreover, for the first chunk of extracted data I have to skip 17 lines, since the file contains a header that R cannot read.
A manual attempt that works:
filename <- "nameoffile.txt"
con <<- file(description = filename, open = "r")
data <- read.table(con, nrows = 1000, skip = 17, header = FALSE)
data <- read.table(con, nrows = 1000, skip = 0, header = FALSE)
# ... repeated manually until the end of the dataset
Since I want to avoid manually keying in the above commands until I reach the end of the dataset, I attempted to write a loop to automate the process, but it was unsuccessful.
My attempt with a loop, which failed:
filename <- "nameoffile.txt"
con <<- file(description = filename, open = "r")
data <- read.table(con, nrows = 1000, skip = 17, header = FALSE)
if (nrow(data) == 0) {
  con <<- NULL
  close(con)
} else {
  if (nrow(data) != 0) {
    con <<- file(description = filename, open = "r")
    data <- read.table(con, nrows = 1000, skip = 0, header = FALSE)
  }
}
Looks like you're on the right track. Just open the connection once (you don't need <<-, just <-), and use a larger chunk size so that R's vectorized operations can be used to process each chunk efficiently, along the lines of the sketch below. Iteration seems to me like a good strategy here, especially for a file that you're going to process once rather than, say, reference repeatedly like a database. The answer has been modified to try to be more robust about detecting a read at the end of the file.
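A minimal sketch of such a loop (the file name and the 17 skipped header lines come from the question; the chunk size of 100000 and the tryCatch guard for the case where the file length is an exact multiple of the chunk size are assumptions you can adjust):

filename <- "nameoffile.txt"
chunk_size <- 100000   # bigger chunks give R's vectorized operations more to work on per read
con <- file(description = filename, open = "r")

# first chunk: skip the 17 header lines R cannot parse
data <- read.table(con, nrows = chunk_size, skip = 17, header = FALSE)

repeat {
    if (nrow(data) == 0)            # nothing was read: past the end of the file
        break
    # ... process / quarantine the rows in 'data' here ...
    if (nrow(data) < chunk_size)    # short read, so this was the final chunk
        break
    # later chunks: the connection is already positioned, so nothing to skip
    data <- tryCatch(
        read.table(con, nrows = chunk_size, skip = 0, header = FALSE),
        error = function(err) {
            # read.table() signals an error, not an empty result, when the
            # connection is already at end-of-file; matching the message only
            # works when R's error messages are not translated
            if (identical(conditionMessage(err), "no lines available in input"))
                data.frame()
            else
                stop(err)
        }
    )
}

close(con)

Because the connection stays open, each read.table() call picks up where the previous one stopped, so skip is only needed for the first chunk, and close(con) is called exactly once, after the loop ends.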