I'm new to R and the ff package, and am trying to better understand how ff lets users work with large datasets (>4 GB). I have spent a considerable amount of time trawling the web for tutorials, but the ones I could find generally go over my head.
I learn best by doing, so as an exercise I would like to know how to create a long-format time-series dataset, similar to R's built-in "Indometh" dataset, using arbitrary values. Then I would like to reshape it into wide format, and finally save the output as a CSV file.
With small datasets this is simple, and can be achieved using the following script:
##########################################
#Generate the data frame
DF <- data.frame()
for (Subject in 1:6) {
  for (time in 1:11) {
    DF <- rbind(DF, c(Subject, time, runif(1)))
  }
}
names(DF) <- c("Subject", "time", "conc")
##########################################
#Reshape to wide format
DF<-reshape(DF, v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
##########################################
#Save csv file
write.csv(DF, file = "DF.csv")
But I would like to learn to do this for file sizes of approximately 10 GB. How would I do this using the ff package? Thanks in advance.
The function reshape does not explicitly exist for ffdf objects, but it is quite straightforward to achieve with functionality from package ffbase. Just use ffdfdply from package ffbase, split by Subject, and apply reshape inside the function.
An example on the Indometh dataset with 1000000 subjects.
require(ffbase)
require(datasets)
data(Indometh)
## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
dim(x)
[1] 11000000 3
## and reshape to wide format
result <- ffdfdply(x = x, split = x$Subject, FUN = function(datawithseveralsplitelements){
  df <- reshape(datawithseveralsplitelements,
                v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
[1] "ffdf"
colnames(result)
[1] "Subject" "conc.0.25" "conc.0.5" "conc.0.75" "conc.1" "conc.1.25" "conc.2" "conc.3" "conc.4" "conc.5" "conc.6" "conc.8"
dim(result)
[1] 1000000 12
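To cover the last step of the question (saving as CSV) without pulling the whole ffdf into RAM, ffbase also provides chunked writers. A minimal sketch, assuming the result ffdf from above (the file name "DF.csv" is arbitrary):

```r
require(ffbase)

## write.csv.ffdf (from ffbase) streams the ffdf to disk chunk by chunk,
## so the 1,000,000 x 12 result never has to fit in memory at once.
write.csv.ffdf(result, file = "DF.csv")
```

If you need more control (separator, chunk size), write.table.ffdf from the same package exposes those options.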
You would be hard put to construct a less efficient method than the one you offer. Growing a data.frame with rbind one row at a time inside a loop is incredibly inefficient. Try this instead to create a six-thousand-row dataset for 6 subjects:
DF <- data.frame(Subj = rep(1:6, each = 1000), matrix(runif(6000 * 11), nrow = 6000))
Scaling it up to a billion items (US billion, not UK billion) should give you about a 10 GB object, so maybe try 80 million lines or so?
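The same vectorized idea can produce the wide layout directly, with no reshape step at all, since each subject becomes one row and each time point one column. A base-R sketch (the 11 time points and "conc." column names are illustrative, chosen to mirror the reshaped output in the other answer):

```r
## Build the wide-format data in one shot: one row per subject,
## one column per time point, filled with random concentrations.
times  <- 1:11
n_subj <- 6
conc   <- matrix(runif(n_subj * length(times)), nrow = n_subj)

DF <- data.frame(Subject = 1:n_subj, conc)
names(DF) <- c("Subject", paste("conc", times, sep = "."))

## Save without the row-name column that write.csv adds by default.
write.csv(DF, file = "DF_wide.csv", row.names = FALSE)
```

For genuinely 10 GB-scale output you would still want to generate and append in chunks rather than build one in-memory object, but the vectorized construction above is the right building block for each chunk.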
I think asking for a tutorial on the ff package is out of scope for SO. Please read the FAQ. Such questions are generally closed as too broad.