I often need to apply a function to each pair of columns in a data frame or matrix and return the results in a matrix. At the moment I always write a loop to do this. For instance, to make a matrix containing the p-values of correlations I write:
df <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100))
n <- ncol(df)
foo <- matrix(0,n,n)
for (i in 1:n)
{
  for (j in i:n)
  {
    foo[i,j] <- cor.test(df[,i],df[,j])$p.value
  }
}
foo[lower.tri(foo)] <- t(foo)[lower.tri(foo)]
foo
          [,1]      [,2]      [,3]
[1,] 0.0000000 0.7215071 0.5651266
[2,] 0.7215071 0.0000000 0.9019746
[3,] 0.5651266 0.9019746 0.0000000
which works, but is quite slow for very large matrices. I can write a function for this in R (not bothering with cutting time in half by assuming a symmetrical outcome as above):
Papply <- function(x,fun)
{
  n <- ncol(x)
  foo <- matrix(0,n,n)
  for (i in 1:n)
  {
    for (j in 1:n)
    {
      foo[i,j] <- fun(x[,i],x[,j])
    }
  }
  return(foo)
}
Or a function with Rcpp:
library("Rcpp")
library("inline")
src <-
'
  NumericMatrix x(xR);
  Function f(fun);
  NumericMatrix y(x.ncol(),x.ncol());
  for (int i = 0; i < x.ncol(); i++)
  {
    for (int j = 0; j < x.ncol(); j++)
    {
      y(i,j) = as<double>(f(wrap(x(_,i)),wrap(x(_,j))));
    }
  }
  return wrap(y);
'
Papply2 <- cxxfunction(signature(xR="numeric",fun="function"),src,plugin="Rcpp")
But both are quite slow even on a fairly small dataset of 100 variables (I thought the Rcpp function would be faster, but I guess the constant conversion between R and C++ takes its toll):
> system.time(Papply(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
   user  system elapsed
   3.73    0.00    3.73
> system.time(Papply2(matrix(rnorm(100*300),300,100),function(x,y)cor.test(x,y)$p.value))
   user  system elapsed
   3.71    0.02    3.75
So my question is:
- Due to the simplicity of these functions I assume this is already somewhere in R. Is there an apply or plyr function that does this? I have looked for it but haven't been able to find it.
- If so, is it faster?
It wouldn't be faster, but you can use outer to simplify the code. It does require a vectorized function, so here I've used Vectorize to make a vectorized version of the function to get the correlation between two columns.
92% of the time is being spent in cor.test.default and the routines it calls, so it's hopeless trying to get faster results by simply rewriting Papply (other than the savings from computing only the entries above or below the diagonal, assuming that your function is symmetric in x and y).
I'm not sure if this addresses your problem in a proper manner, but take a look at William Revelle's psych package. corr.test returns a list of matrices with correlation coefficients, number of observations, t-test statistics, and p-values. I know I use it all the time (and AFAICS you're also a psychologist, so it may suit your needs as well). Writing loops is not the most elegant way of doing this.
You can use mapply, but as the other answers state, it's unlikely to be much faster, as most of the time is being used up by cor.test.
You could reduce the amount of work mapply does by using the symmetry assumption and noting the zero diagonal, e.g.: