I've been working with a data.frame of 8 million records, and I need to speed up a loop that analyzes this data.
I will describe each step of the problem I am trying to solve. First, I have to sort the whole data.frame in ascending order by three fields: ClientID, Date and Time (this part is done). Then, using the sorted data.frame, I must compute the difference between consecutive observations, which should only be done when they belong to the same ClientID. For example:
ClientID|Date(YMD)|Time(HMS)
A|20120101|110000
A|20120101|111500
A|20120101|120000
B|20120202|010000
B|20120202|012030
Given the data above, the result I want to obtain is the following:
ClientID|Date(YMD)|Time(HMS)|Difference(minutes)
A|20120101|110000|0.00
A|20120101|111500|15.00
A|20120101|120000|45.00
B|20120202|010000|0.00
B|20120202|012030|20.50
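For reference, the row-by-row loop I am trying to replace looks roughly like this (a simplified sketch, not my exact code; Date and Time are assumed to be stored as fixed-width character strings such as "20120101" and "010000"):

```r
# Step 1: sort by ClientID, Date, Time (already done)
df <- df[order(df$ClientID, df$Date, df$Time), ]

# Step 2: minutes elapsed since the previous row of the same client;
# the first observation of each client keeps a difference of 0
df$Difference <- 0
for (i in 2:nrow(df)) {
  if (df$ClientID[i] == df$ClientID[i - 1]) {
    t_prev <- as.POSIXct(paste(df$Date[i - 1], df$Time[i - 1]),
                         format = "%Y%m%d %H%M%S", tz = "UTC")
    t_curr <- as.POSIXct(paste(df$Date[i], df$Time[i]),
                         format = "%Y%m%d %H%M%S", tz = "UTC")
    df$Difference[i] <- as.numeric(difftime(t_curr, t_prev, units = "mins"))
  }
}
```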
The problem is that analyzing the full data.frame of 8 million observations this way takes about three days. I would like to parallelize the process. My idea is to segment the data.frame into clusters, keeping the sorted order rather than splitting randomly, and then, using the foreach package or a similar library, send each cluster's analysis to one of the available cores (see the sketch after the table below). For example:
Cluster|ClientID|Date(YMD)|Time(HMS)
CORE 1|
1|A|20120101|110000
1|A|20120101|111500
1|A|20120101|120000
CORE 2|
2|B|20120202|010000
2|B|20120202|012030
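A sketch of this idea using foreach with a doParallel backend (all names here are illustrative, and this is the approach I am asking about, not something I have benchmarked):

```r
library(foreach)
library(doParallel)

n_cores <- max(1, detectCores() - 1)   # leave one core free
cl <- makeCluster(n_cores)
registerDoParallel(cl)

# Assign each ClientID to one of n_cores clusters, in sorted order,
# so that no client is split across two workers.
ids    <- unique(df$ClientID)          # df is already sorted
groups <- split(ids, cut(seq_along(ids), n_cores, labels = FALSE))

result <- foreach(g = groups, .combine = rbind) %dopar% {
  sub <- df[df$ClientID %in% g, ]
  sec <- as.numeric(as.POSIXct(paste(sub$Date, sub$Time),
                               format = "%Y%m%d %H%M%S", tz = "UTC"))
  # consecutive differences in minutes within each client
  sub$Difference <- ave(sec, sub$ClientID,
                        FUN = function(x) c(0, diff(x)) / 60)
  sub
}

stopCluster(cl)
```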
I wouldn't recommend trying to parallelize this. Using the data.table package, and working with times stored in an integer format, this should take a pretty trivial amount of time.

Generate some example data:
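A minimal stand-in, reusing the five sample rows from the question (a real benchmark would generate millions of random rows instead):

```r
library(data.table)

dt <- data.table(
  ClientID = c("A", "A", "A", "B", "B"),
  Date     = c("20120101", "20120101", "20120101", "20120202", "20120202"),
  Time     = c("110000", "111500", "120000", "010000", "012030")
)
dt
```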
which gives (exact print formatting depends on your data.table version):
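```
   ClientID     Date   Time
1:        A 20120101 110000
2:        A 20120101 111500
3:        A 20120101 120000
4:        B 20120202 010000
5:        B 20120202 012030
```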
Calculate the time differences within each ClientID:
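A sketch of one way to do it: convert each Date/Time pair to numeric seconds, then take consecutive differences per client (the POSIXct conversion is only one option; data.table's integer-backed IDate/ITime classes are the natural fit for the integer-storage point above):

```r
# Combine Date and Time into a single timestamp, stored as numeric seconds
dt[, Seconds := as.numeric(as.POSIXct(paste(Date, Time),
                                      format = "%Y%m%d %H%M%S", tz = "UTC"))]

# Sort by client and timestamp, then take consecutive differences (in
# minutes) within each client; the first row of each client gets 0.
setorder(dt, ClientID, Seconds)
dt[, Difference := c(0, diff(Seconds) / 60), by = ClientID]
dt
```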
Results (Difference is in decimal minutes, so 20 minutes 30 seconds shows as 20.5):
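```
   ClientID     Date   Time    Seconds Difference
1:        A 20120101 110000 1325415600        0.0
2:        A 20120101 111500 1325416500       15.0
3:        A 20120101 120000 1325419200       45.0
4:        B 20120202 010000 1328144400        0.0
5:        B 20120202 012030 1328145630       20.5
```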