I am trying to compute a 300,000 x 300,000 matrix in R. My code works correctly, but it has been running for days now. Attached is a subset of what I'm working with; the IDs extend to 300,000. How can I make the code more efficient so it finishes in minutes rather than days?
fam <- structure(list(ID = c(1L, 2L, 3L, 4L, 6L, 5L, 7L), dad = c(0L,
0L, 1L, 1L, 1L, 3L, 5L), mum = c(0L, 0L, 0L, 2L, 4L, 4L, 6L),
GEN = c(1L, 1L, 2L, 2L, 3L, 3L, 4L)), class = "data.frame", row.names = c(NA,
-7L))
hom <- function(fam) {
  library(Matrix)
  library(foreach)

  n <- max(as.numeric(fam[, "ID"]))  # highest individual ID
  t <- min(as.numeric(fam[, "ID"]))  # current individual (row) being processed
  A <- Matrix(0, nrow = n, ncol = n, sparse = TRUE)  # relationship matrix

  while (t <= n) {
    s <- max(fam[t, "dad"], fam[t, "mum"])  # larger parent ID (0 if both unknown)
    d <- min(fam[t, "dad"], fam[t, "mum"])  # smaller parent ID (0 if unknown)

    if (s > 0 & d > 0) {  # both parents known
      if (fam[t, "GEN"] == 999 & s != d) {
        warning("both dad and mum should be the same, different for at least one individual")
      }
      A[t, t] <- 2 - 0.5^(fam[t, "GEN"] - 1) +
        0.5^(fam[t, "GEN"]) * A[fam[t, "dad"], fam[t, "mum"]]
      foreach(j = 1:(t - 1), .verbose = TRUE, .combine = 'c',
              .packages = c("Matrix", "foreach")) %do% {
        A[t, j] <- 0.5 * (A[j, fam[t, "dad"]] + A[j, fam[t, "mum"]])
        A[j, t] <- A[t, j]
      }
    }

    if (s > 0 & d == 0) {  # only one parent known
      if (fam[t, "GEN"] == 999) {
        warning("both dad and mum should be the same, one parent equal to zero for at least one individual")
      }
      A[t, t] <- 2 - 0.5^(fam[t, "GEN"] - 1)
      foreach(j = 1:(t - 1), .verbose = TRUE, .combine = 'c',
              .packages = c("Matrix", "foreach")) %do% {
        A[t, j] <- 0.5 * A[j, s]
        A[j, t] <- A[t, j]
      }
    }

    if (s == 0) {  # founder: no parents known
      A[t, t] <- 2 - 0.5^(fam[t, "GEN"] - 1)
    }

    cat(" MatbyGEN: ", t, "\n")  # progress indicator
    t <- t + 1
  }
  A
}
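For reference, the output shown below can be reproduced by calling the function on the example pedigree and writing the sparse result out in MatrixMarket format. The export step is my assumption (the question doesn't show it), and the file name is just illustrative:

A <- hom(fam)
# forceSymmetric() stores only one triangle, which matches the
# "coordinate real symmetric" header in the output below
Matrix::writeMM(Matrix::forceSymmetric(A), "A_example.mtx")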
Output of the above example
%%MatrixMarket matrix coordinate real symmetric
7 7 26
1 1 1
3 1 .5
4 1 .5
5 1 .75
6 1 .5
7 1 .625
2 2 1
4 2 .5
5 2 .25
6 2 .25
7 2 .25
3 3 1.5
4 3 .25
5 3 .375
6 3 .875
7 3 .625
4 4 1.5
5 4 1
6 4 .875
7 4 .9375
5 5 1.8125
6 5 .6875
7 5 1.25
6 6 1.78125
7 6 1.234375
7 7 1.91796875
The issue is getting this to work faster for a 300k x 300k matrix: at the current pace it would take days or weeks to finish, and it has already been running for a while. What can I do to make it run faster?
N.B.: save the example as "anything.txt", then read the file in with fam <- read.delim("anything.txt", header = TRUE, sep = "").
The problem you have is that this is recursive: each iteration depends on the results of the previous iterations. Therefore, you can't really vectorize your way out of it.
If you want to use R for this, your best bet is to look into Rcpp. I'm not that good with Rcpp, but I do have some suggestions.

The easiest thing to do is to get rid of the foreach loop and replace it with a regular for loop. There is a lot of overhead in using parallel workers, and when a function is recursive it's hard for the workers to really do better on their own.

The next thing to do is to consider whether you really need a sparse matrix. If you're not running into memory problems, you might as well use a regular matrix.
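As a sketch of those first two changes, here is the recursion from the question unchanged, but with plain for loops and a dense base-R matrix (hom_dense is just an illustrative name, and the GEN == 999 checks are omitted for brevity):

# Sketch: same recursion as hom(), but with plain for loops and a dense
# base-R matrix instead of foreach and a sparse Matrix.
# Note: a dense 300,000 x 300,000 double matrix needs roughly 720 GB of RAM,
# so the dense option only helps if memory allows it.
hom_dense <- function(fam) {
  n <- max(fam$ID)
  A <- matrix(0, nrow = n, ncol = n)
  for (t in seq_len(n)) {
    s <- max(fam[t, "dad"], fam[t, "mum"])
    d <- min(fam[t, "dad"], fam[t, "mum"])
    if (s > 0 && d > 0) {            # both parents known
      A[t, t] <- 2 - 0.5^(fam[t, "GEN"] - 1) +
        0.5^(fam[t, "GEN"]) * A[fam[t, "dad"], fam[t, "mum"]]
      for (j in seq_len(t - 1)) {
        A[t, j] <- 0.5 * (A[j, fam[t, "dad"]] + A[j, fam[t, "mum"]])
        A[j, t] <- A[t, j]
      }
    } else if (s > 0) {              # one parent known
      A[t, t] <- 2 - 0.5^(fam[t, "GEN"] - 1)
      for (j in seq_len(t - 1)) {
        A[t, j] <- 0.5 * A[j, s]
        A[j, t] <- A[t, j]
      }
    } else {                         # founder
      A[t, t] <- 2 - 0.5^(fam[t, "GEN"] - 1)
    }
  }
  A
}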
The last thing to do is to rethink how you initialize everything. Parts of the code get repeated multiple times, like the assignment to the diagonal. Since we're summing separate terms, we can initialize the diagonal with the part common to all three branches, 2 - 0.5^(fam[t, 'GEN'] - 1). This is important because it allows us to skip ahead: your original data had something like 1,000 rows with 0s for both 'mum' and 'dad', and with this initialization we can skip right ahead to the first row with a non-zero 'mum' or 'dad':
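Sketched out, that initialization could look like this (start is just an illustrative name):

n <- max(fam$ID)
A <- matrix(0, nrow = n, ncol = n)

# part of the diagonal common to all three branches
diag(A) <- 2 - 0.5^(fam$GEN - 1)

# first row with at least one known parent; the main loop can start here
start <- which(fam$dad > 0 | fam$mum > 0)[1L]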
In the interest of skipping if statements, I decided to use sum(c(..., ...)) to add everything up. That way, if the subset resulted in NULL, I could still sum. Altogether:
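A minimal sketch of what the combined version could look like (this is my reconstruction of the approach described above, assuming the fam layout from the question; hom_fast is an illustrative name):

hom_fast <- function(fam) {
  n <- max(fam$ID)
  A <- matrix(0, nrow = n, ncol = n)

  # diagonal term common to every individual, set up front
  diag(A) <- 2 - 0.5^(fam$GEN - 1)

  # skip the leading founders: start at the first row with a known parent
  start <- which(fam$dad > 0 | fam$mum > 0)[1L]
  if (is.na(start)) return(A)

  for (t in start:n) {
    # keep only the non-zero parents; sum() then covers the one-parent and
    # two-parent cases without separate if branches
    parents <- c(fam$dad[t], fam$mum[t])
    parents <- parents[parents > 0]
    if (length(parents) == 0L) next            # founder, diagonal already set

    if (length(parents) == 2L) {
      # extra diagonal term when both parents are known (as in the original code)
      A[t, t] <- A[t, t] + 0.5^(fam$GEN[t]) * A[parents[1L], parents[2L]]
    }
    for (j in seq_len(t - 1L)) {
      A[t, j] <- 0.5 * sum(A[j, parents])
      A[j, t] <- A[t, j]
    }
  }
  A
}

# usage on the example pedigree from the question:
# A <- hom_fast(fam)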