I want to stream a big data table into R line by line, and if the current line meets a specific condition (let's say the first column is > 15), add the line to a data frame in memory. I have written the following code:
count <- 1
Mydata <- NULL
fin <- FALSE
while (!fin) {
  if (count == 1) {
    Myrow <- read.delim(pipe('cat /dev/stdin'), header = FALSE, sep = "\t", nrows = 1)
    Mydata <- rbind(Mydata, Myrow)
    count <- count + 1
  } else {
    count <- count + 1
    Myrow <- read.delim(pipe('cat /dev/stdin'), header = FALSE, sep = "\t", nrows = 1)
    if (Myrow != "") {
      if (MyCONDITION) {
        Mydata <- rbind(Mydata, Myrow)
      }
    } else {
      fin <- TRUE
    }
  }
}
print(Mydata)
But I get the error "data not available". Please note that my data is big and I don't want to read it all in at once and then apply my condition (in that case it would be easy).
I think it would be wiser to use an R function like readLines. readLines supports reading only a specified number of lines at a time, e.g. 1. Combining that with opening a file connection first and then calling readLines repeatedly gets you what you want: each call reads the next n lines from the connection. In R code:
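A minimal sketch of this approach (the file name mybigfile.txt is a placeholder, and the filter is the first-column > 15 condition from your question):

con <- file("mybigfile.txt", open = "r")
Mydata <- NULL
# readLines returns a zero-length vector at end of file, which ends the loop
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  Myrow <- unlist(strsplit(oneLine, "\t"))
  # keep the row only if the first column is greater than 15
  if (as.numeric(Myrow[1]) > 15) {
    Mydata <- rbind(Mydata, Myrow)
  }
}
close(con)
print(Mydata)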
Additional comments:

- Take a look at stdin(). I suggest you use this instead of pipe('cat /dev/stdin'). It probably makes the script more robust, and definitely more cross-platform.
- You set Mydata to NULL at the beginning and keep growing it using rbind. If the number of lines you rbind becomes large, this will get really slow. This has to do with the fact that each time the object grows, a new memory location needs to be found for it, which ends up taking a lot of time. Better is to pre-allocate Mydata, or to use apply-style loops; see the sketch after this list.
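A rough sketch combining both comments, assuming the script is run through Rscript so that file("stdin") refers to the process's standard input (the initial list size of 1000 is an arbitrary guess at the number of matching rows):

con <- file("stdin", open = "r")
kept <- vector("list", 1000)  # pre-allocated container; R extends it automatically if needed
i <- 0
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  Myrow <- unlist(strsplit(oneLine, "\t"))
  if (as.numeric(Myrow[1]) > 15) {
    i <- i + 1
    kept[[i]] <- Myrow
  }
}
close(con)
# bind all kept rows once at the end instead of growing Mydata row by row
Mydata <- do.call(rbind, kept[seq_len(i)])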