I tried to aggregate a large dataset with the ffdfdply function from the
'ffbase' package in R.
Let's say I have three variables called Date, Item and sales. I want to aggregate sales over Date and Item using the sum function. Could you please guide me to the proper syntax in R?
Here is what I tried:
grp_qty <- ffdfdply(x=data[c("sales","Date","Item")], split=as.character(data$sales),
  FUN = function(data) summaryBy(Date+Item~sales, data=data, FUN=sum))
I would appreciate your help.
Note that ffdfdply is part of ffbase, not ff.
To show an example of how ffdfdply is used, let's generate an ffdf
with about 50 million rows.
require(ffbase)
data <- expand.ffgrid(
  Date = ff(seq.Date(Sys.Date(), Sys.Date() + 10000, by = "day")),
  Item = ff(factor(paste("Item", 1:5000)))
)
data$sales <- ffrandom(n = nrow(data))
# Split by date, assuming that all sales of one date fit into RAM
splitby <- as.character(data$Date, by = 250000)
grp_qty <- ffdfdply(
  x = data[c("sales", "Date", "Item")],
  split = splitby,
  FUN = function(data) {
    # This runs in RAM on a chunk containing **several** split elements,
    # so data.table, which works well for in-RAM computing, can be used here
    require(data.table)
    data <- as.data.table(data)
    result <- data[, list(sales = sum(sales, na.rm = TRUE)), by = list(Date, Item)]
    as.data.frame(result)
  }
)
dim(grp_qty)
Note that grp_qty is an ffdf,
which resides on disk.
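
To see why splitting by Date is safe for this aggregation, here is a minimal in-RAM sketch using base R only. The toy data frame and its values are made up for illustration, and split/lapply merely stand in for the chunking that ffdfdply does on disk; this is not how ffdfdply is implemented. Because every row of a given Date lands in the same chunk, no Date/Item group is spread over two chunks, so summing per chunk gives the same result as summing over the whole data set.

```r
# Toy in-memory data; the values are hypothetical and only for illustration.
toy <- data.frame(
  Date  = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02")),
  Item  = c("Item 1", "Item 1", "Item 2", "Item 1"),
  sales = c(10, 5, 3, 7)
)

# Sum sales for every Date/Item combination on the full data set.
toy_sums <- aggregate(sales ~ Date + Item, data = toy, FUN = sum)
toy_sums <- toy_sums[order(toy_sums$Date, toy_sums$Item), ]

# Same aggregation, but chunk by chunk: one chunk per Date, then combine.
chunks <- split(toy, toy$Date)
per_chunk <- do.call(rbind, lapply(chunks, function(chunk) {
  aggregate(sales ~ Date + Item, data = chunk, FUN = sum)
}))
per_chunk <- per_chunk[order(per_chunk$Date, per_chunk$Item), ]

# Both approaches should agree on the aggregated sales.
print(toy_sums)
print(all.equal(per_chunk$sales, toy_sums$sales))
```

If you split on a column that is not one of the grouping columns (for example sales), groups could end up divided across chunks and you would get several partial sums per Date/Item pair.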