I am working on a large dataframe in R of 2.3 million records that contains transactions of users at locations, with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user per location. Let's call this hourly connected.
Transactions can last from 8 minutes to 48 hours, so the goal dataframe will be around 100 million records and will grow each month.
The code below shows how the final dataframe is developed, although the total code is much more complex. Running the total code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128GB RAM.
```r
library(dplyr)

# Simulate 1 million sessions with random users, locations, start times
# and per-session durations (up to 10 hours each)
numsessions <- 1000000
startdate <- as.POSIXlt(runif(numsessions, 1, 365 * 60 * 60) * 24, origin = "2015-1-1")

df.Sessions <- data.frame(userID = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr = format(startdate, "%w"),
                          start = startdate,
                          end = startdate + runif(numsessions, 1, 60 * 60 * 10))

# Expand each session into one row per connected hour
dfhourly.connected <- df.Sessions %>%
  rowwise %>%
  do(data.frame(userID = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr = format(seq(.$start, .$end, by = 60 * 60), "%H")))
```
We want to parallelize this procedure over (some of) the 16 cores to speed it up. A first attempt was to use the multidplyr package, with the partition made based on daynr:
```r
library(multidplyr)

df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise %>%
  do(data.frame(userID = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr = format(seq(.$start, .$end, by = 60 * 60), "%H"))) %>%
  collect()
```
Now, the `rowwise` function appears to require a dataframe as input instead of a partition.
My questions are:

- Is there a workaround to perform a rowwise calculation on partitions per core?
- Has anyone got a suggestion to perform this calculation with a different R package and method?
(I think posting this as an answer could benefit future readers who have an interest in efficient coding.)
R is a vectorized language, thus operations by row are among the most costly operations, especially if you are evaluating lots of functions, dispatching methods, converting classes, and creating new data sets while you're at it.
Hence, the first step is to reduce the "by row" operations. Looking at your code, it seems that you are enlarging the size of your data set according to `userID`, `start` and `end` - all the rest of the operations can come afterwards (and hence be vectorized). Also, running `seq` (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling `seq.POSIXt` explicitly on a `POSIXt` class will save you the overhead of method dispatch.

I'm not sure how to do this efficiently with `dplyr`, because `mutate` can't handle it and the `do` function (IIRC) has always proved itself to be highly inefficient. Hence, let's try the `data.table` package, which can handle this task easily:
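(The expansion code from the original answer isn't shown here; below is a minimal sketch of what it could look like, assuming the column names from the question and storing the expanded hourly timestamps in a column I've called `conn_time`.)

```r
library(data.table)

# Convert to a data.table by reference, then enlarge the data set by
# (userID, start, end): one row per connected hour per session.
# seq.POSIXt is called explicitly, once per session, so we pay a single
# function call per group and skip seq()'s method dispatch.
res <- setDT(df.Sessions)[, .(conn_time = seq.POSIXt(start, end, by = 3600)),
                          by = .(userID, start, end)]
```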
Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.
Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from here on.
Though, vectorizing isn't the end of the story. We also need to take into consideration class conversions, method dispatching, etc. For instance, we can create both the `hourlydate` and `hournr` columns using either different `Date` class functions, or `format`, or maybe even `substr`. The trade-off that needs to be taken into account is that, for instance, `substr` will be the fastest, but the result will be a `character` vector rather than a `Date` one - it's up to you to decide if you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the `hournr` variable:
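(The benchmark code isn't preserved; here is a sketch of such a comparison using the `microbenchmark` package - the three candidates and the `conn_time` column name are my assumptions.)

```r
library(microbenchmark)

microbenchmark(
  format = format(res$conn_time, "%H"),                    # character result, S3 dispatch
  substr = substr(as.character(res$conn_time), 12L, 13L),  # character result
  hour   = data.table::hour(res$conn_time),                # integer result
  times = 10L
)
```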
`data.table::hour` is the clear winner by both speed and quality (results are in an integer vector rather than a character one), while improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).

Now let's try 3 different ways of calculating the `hourlydate` variable:
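(Again a sketch only - the exact candidates from the original answer aren't shown, so the three below, and their order, are my guess based on the discussion that follows.)

```r
microbenchmark(
  as.IDate = as.IDate(res$conn_time),                     # data.table's integer-backed Date
  format   = as.Date(format(res$conn_time, "%Y-%m-%d")),  # round trip through character
  as.Date  = as.Date(res$conn_time),                      # base Date (double storage)
  times = 10L
)
```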
Seems like the first and third options are pretty much the same speed-wise, while I prefer `as.IDate` because of the `integer` storage mode.

Now that we know where both efficiency and quality lie, we can simply finish the task by running:
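(The finishing line of code isn't preserved either; given the choices above it would presumably look like this, assuming the `conn_time` column from the earlier sketch.)

```r
# Add both derived columns by reference - fully vectorized, no copies
res[, `:=`(hourlydate = as.IDate(conn_time),
           hournr = data.table::hour(conn_time))]
```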
(You can then easily remove the unnecessary columns using similar syntax, i.e. `res[, yourcolname := NULL]`, which I'll leave to you.)

There are probably more efficient ways of solving this, but this demonstrates a possible way of making your code more efficient.
As a side note, if you want to investigate `data.table` syntax/features further, here's a good read: https://github.com/Rdatatable/data.table/wiki/Getting-started