I want to perform calculations for each company number in the column PERMNO of my data frame, the summary of which can be seen here:
> summary(companydataRETS)
     PERMNO            RET
 Min.   :10000   Min.   :-0.971698
 1st Qu.:32716   1st Qu.:-0.011905
 Median :61735   Median : 0.000000
 Mean   :56788   Mean   : 0.000799
 3rd Qu.:80280   3rd Qu.: 0.010989
 Max.   :93436   Max.   :19.000000
My solution so far was to create a variable with all possible company numbers
compns <- companydataRETS[!duplicated(companydataRETS[,"PERMNO"]),"PERMNO"]
And then use a foreach loop with parallel computing that calls my function get.rho(), which in turn performs the desired calculations:
# assumes a parallel backend (e.g. doParallel) has been registered for %dopar%
rhos <- foreach(i = 1:length(compns), .combine = rbind) %dopar%
    get.rho(subset(companydataRETS[, "RET"], companydataRETS$PERMNO == compns[i]))
I tested it for a subset of my data and it all works. The problem is that I have 72 million observations, and even after leaving the computer working overnight, it still didn't finish.
I am new to R, so I imagine my code structure can be improved upon and that there is a better (quicker, less computationally intensive) way to perform this same task, perhaps using apply or with, neither of which I understand. Any suggestions?
There are many ways to do something like this, and your foreach solution is one of them. Looking only at the code you supplied, I can only guess at the most appropriate solution... However, I assume the biggest slowdown in your code is actually your get.rho function, not the looping or subsetting. If you'd like to share that function, I bet you'll get some amazing answers that will both speed things up and clarify some "R-isms". With that said, there are also many alternatives to doing what you're doing.
The plyr package is tailor-made for this type of computation. It uses a split-apply-combine strategy, and the first two letters of each function name indicate the input and output data types. Since you're inputting a data.frame and outputting a data.frame, ddply is the function to use. If you're not on Windows, you can also easily multithread this calculation by registering a parallel backend and passing .parallel = TRUE; both variants are shown in the sketch below.
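A minimal sketch, assuming get.rho() takes a numeric vector of returns and gives back something that can be wrapped in a data.frame (the column name rho below is just for illustration):

library(plyr)

# split companydataRETS by PERMNO, apply get.rho to each company's returns,
# and stack the per-company results back into a data.frame
rhos <- ddply(companydataRETS, "PERMNO",
              function(x) data.frame(rho = get.rho(x$RET)))

# non-Windows only: register a multicore backend (doMC is one option) and
# let plyr process the groups in parallel
library(doMC)
registerDoMC()
rhos <- ddply(companydataRETS, "PERMNO",
              function(x) data.frame(rho = get.rho(x$RET)),
              .parallel = TRUE)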
tapply is also a perfect candidate; a sketch follows below.
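A sketch, assuming get.rho() returns a single number per company; note that the result is a vector named by PERMNO rather than a data.frame:

# split RET by PERMNO and apply get.rho to each group in one call
rhos <- tapply(companydataRETS$RET, companydataRETS$PERMNO, get.rho)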
The data.table package, as mentioned in the comments, is also excellent at this, but I'll leave that code as an exercise to the reader. However, as I said above, if your get.rho function is slow, then no matter how clever you get with your subsetting and looping technique, the calculations will take a long time.

Edit, for the function code added to the post:
If this is time series data, or data that can be treated as such, there are many packages and functions that do this sort of lag comparison. I'm not very well versed in them, but a quick perusal of Google and the CRAN task views will give you an excellent overview of your options.
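For instance, and this is purely an assumption about what get.rho() computes, if the quantity of interest is essentially a lag-1 autocorrelation, base R can produce it directly:

# hypothetical: lag-1 autocorrelation of one company's returns via acf()
x   <- companydataRETS$RET[companydataRETS$PERMNO == compns[1]]
rho <- acf(x, lag.max = 1, plot = FALSE)$acf[2]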
I haven't benchmarked it, but I think it's safe to assume the slowest section of your code is in the lm call. Doing this on a sample of your data instead of the full set will speed things up dramatically. But I'm sure someone out there will have a much better and more complete solution.
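One concrete way to act on the sampling suggestion, as a sketch (the object names are just for illustration): prototype the whole pipeline on a small random subset of companies before running it on all 72 million rows.

set.seed(1)
sample_compns <- sample(compns, 30)   # e.g. 30 companies out of the full set
sample_data   <- subset(companydataRETS, PERMNO %in% sample_compns)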
As suggested by Joran, I looked into the data.table library. The modifications to the code are along the lines of the sketch below.
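A sketch of the modification, with the exact data.table syntax being my reconstruction: convert the data.frame to a data.table keyed on PERMNO, and replace the subset call inside the loop with a keyed (binary-search) lookup:

library(foreach)
library(data.table)

companydataRETS <- data.table(companydataRETS)
setkey(companydataRETS, PERMNO)   # sort once so each company can be found by binary search

rhos <- foreach(i = 1:length(compns), .combine = rbind) %do%
    get.rho(companydataRETS[J(compns[i])]$RET)   # keyed lookup instead of scanning every row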
I ran the code as I originally had it (using subset) and once using data.table, with the variable compns comprising only 30 of the 28659 companies in the dataset. Here are the outputs of system.time() for the two versions:

Using subset:

Using data.table:

(For some reason, using %do% instead of %dopar% for the original code made it run faster. The system.time() for subset is the one using %do%, the faster of the two in this case.)

I had left the original code running overnight and it hadn't finished after 5 hours, so I gave up and killed it. With this small modification I had my results in less than 5 minutes (I think about 3 minutes)!
EDIT
There is an even easier way to do it using data.table, without the use of foreach, which involves substituting the last line of the code above by a single grouped call, sketched below.
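A sketch of that one-liner (the rho column name is just for illustration); data.table splits RET by PERMNO and applies get.rho() to each group internally:

# one grouped call replaces the whole foreach loop
rhos <- companydataRETS[, list(rho = get.rho(RET)), by = PERMNO]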