storage problem in R. alternative to nested loop f

Posted 2020-02-15 03:11

Question:

With the following pieces of information, I can easily create an array of matrices:

b0=data.frame(b0_1=c(11.41,11.36),b0_2=c(8.767,6.950))
b1=data.frame(b1_1=c(0.8539,0.9565),b1_2=c(-0.03179,0.06752))
b2=data.frame(b2_1=c(-0.013020 ,-0.016540),b2_2=c(-0.0002822,-0.0026720))
T.val=data.frame(T1=c(1,1),T2=c(1,2),T3=c(2,1))
dt_data=cbind(b0,b1,b2,T.val)
fu.time=seq(0,50,by=0.8)
pat=ncol(T.val) #number of T's
nit=2 #no of rows

pt.array1=array(NA, dim=c(nit,length(fu.time),pat)) 

for ( it.er in 1:nit){
  for ( ti in 1:length(fu.time)){
    for (pt in 1:pat){
      pt.array1[it.er,ti,pt]=b0[it.er,T.val[it.er,pt]]+b1[it.er,T.val[it.er,pt]]*fu.time[ti]+b2[it.er,T.val[it.er,pt]]*fu.time[ti]^2
    }
  }
}

pt.array_mean=apply(pt.array1, c(3,2), mean)
pt.array_LCL=apply(pt.array1, c(3,2), quantile, prob=0.25)
pt.array_UCL=apply(pt.array1, c(3,2), quantile, prob=0.975)

Now with these additional data, I can create three plots as follows:

mydata
   pt.ID      time IPSS
1      1  0.000000   10
2      1  1.117808    8
3      1  4.504110    5
4      1  6.410959   14
5      1 13.808220   10
6      1 19.890410    4
7      1 28.865750   15
8      1 35.112330    7
9      2  0.000000    6
10     2  1.117808    7
11     2  4.109589    8
12     2 10.093151    7
13     2 16.273973   11
14     2 18.345205   18
15     2 21.567120   14
16     2 25.808220   12
17     2 56.087670    5
18     3  0.000000    8
19     3  1.413699    3
20     3  4.405479    3
21     3 10.389041    8


pdf("plots.pdf")
par(mfrow=c(3,2))
for( pt.no in 1:pat){
  plot(IPSS[pt.ID==pt.no]~time[pt.ID==pt.no],xlim=c(0,57),ylim=c(0,35),type="l",col="black",
      xlab="f/u time", ylab= "",main = paste("patient", pt.no),data=mydata)
  points(IPSS[pt.ID==pt.no]~time[pt.ID==pt.no],data=mydata)
  lines(pt.array_mean[pt.no,]~fu.time, col="blue")
  lines(pt.array_LCL[pt.no,]~fu.time, col="green")
  lines(pt.array_UCL[pt.no,]~fu.time, col="green")
}
dev.off()

The problem arises when the number of rows in b0, b1 and b2 is much larger, say 10000: creating pt.array1 then takes too much computation time. Is there a faster alternative using built-in functions? Can I also avoid allocating storage for pt.array1, since I do not use it afterwards? I only need pt.array_mean, pt.array_UCL and pt.array_LCL for my plots. Any help is appreciated.

Answer 1:

There are a couple of other approaches you can employ.

First, you largely have a model of b0 + b1*fu + b2*fu^2. Therefore, you could make the coefficients and apply the fu after the fact:

ind <- expand.grid(nits = seq_len(nit), pats = seq_len(pat))
mat_ind <- cbind(ind[, 'nits'], T.val[as.matrix(ind)])

b_mat <- matrix(c(b0[mat_ind], b1[mat_ind], b2[mat_ind]), ncol = 3)

b_mat
       [,1]     [,2]       [,3]
[1,] 11.410  0.85390 -0.0130200
[2,] 11.360  0.95650 -0.0165400
[3,] 11.410  0.85390 -0.0130200
[4,]  6.950  0.06752 -0.0026720
[5,]  8.767 -0.03179 -0.0002822
[6,] 11.360  0.95650 -0.0165400

Now if we apply the model to each row, we will get all of your raw results. The only problem is that the layout does not match your original output: each slice pt.array1[, ti, ] of your array corresponds to row ti of my matrix output.

pt_array <- apply(b_mat, 1, function(x) x[1] + x[2] * fu.time + x[3] * fu.time^2)

pt_array[1,]
[1] 11.410 11.360 11.410  6.950  8.767 11.360

pt.array1[, 1, ]
      [,1]  [,2]   [,3]
[1,] 11.41 11.41  8.767
[2,] 11.36  6.95 11.360

That's OK because we can fix the shape as we compute the summary statistics: for each row, converted to a 2 x 3 (nit x pat) matrix, we just need colMeans2 and colQuantiles:

library(matrixStats)

pt_summary <- array(
  t(apply(pt_array, 1, function(row) {
    # one time point: reshape to nit x pat, then summarise each patient column
    M <- matrix(row, ncol = pat)
    c(colMeans2(M), colQuantiles(M, probs = c(0.25, 0.975)))
  })),
  dim = c(length(fu.time), pat, 3),
  dimnames = list(NULL, paste0('pat', seq_len(pat)), c('mean', 'LCL', 'UCL'))
)

pt_summary[1, ,] # slice at the first time point (fu.time = 0)

        mean      LCL      UCL
pat1 11.3850 11.37250 11.40875
pat2  9.1800  8.06500 11.29850
pat3 10.0635  9.41525 11.29518

# rm(pt.array1)
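If the goal is to never hold the full nit x time x patient array, one option is to build and summarise one patient at a time. A base-R sketch under that assumption, using the data from the question; summarise_patient and pt_summary2 are hypothetical names, and the 0.25 lower probability is kept as in the original post:

```r
b0 <- data.frame(b0_1 = c(11.41, 11.36), b0_2 = c(8.767, 6.950))
b1 <- data.frame(b1_1 = c(0.8539, 0.9565), b1_2 = c(-0.03179, 0.06752))
b2 <- data.frame(b2_1 = c(-0.013020, -0.016540), b2_2 = c(-0.0002822, -0.0026720))
T.val <- data.frame(T1 = c(1, 1), T2 = c(1, 2), T3 = c(2, 1))
fu.time <- seq(0, 50, by = 0.8)
nit <- nrow(T.val)   # number of iterations (rows)
pat <- ncol(T.val)   # number of patients

# Convert once so matrix indexing does not repeat the conversion per patient
B0 <- as.matrix(b0); B1 <- as.matrix(b1); B2 <- as.matrix(b2)
X <- cbind(1, fu.time, fu.time^2)   # length(fu.time) x 3 design matrix

# Hypothetical helper: summaries for a single patient
summarise_patient <- function(pt) {
  idx <- cbind(seq_len(nit), T.val[[pt]])    # (iteration, treatment) pairs
  B <- cbind(B0[idx], B1[idx], B2[idx])      # nit x 3 coefficients
  fit <- B %*% t(X)                          # nit x length(fu.time), this patient only
  cbind(mean = colMeans(fit),
        LCL  = apply(fit, 2, quantile, probs = 0.25),
        UCL  = apply(fit, 2, quantile, probs = 0.975))
}

# Result is time x statistic x patient; only one nit x time matrix exists at a time
pt_summary2 <- vapply(seq_len(pat), summarise_patient,
                      matrix(0, length(fu.time), 3))

print(pt_summary2[1, , ])   # summaries at the first time point, one column per patient
```

For plotting, pt_summary2[, , pt.no] is already a time x 3 matrix, so it should be usable as the y argument of matlines in the same way as pt_summary[, pt.no, ].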

Then to do your final graphing, I simplified it - the data argument can be a subset(mydata, pt.ID == pt.no). Additionally, since the summary statistics are now in an array format, matlines allows everything to be done at once:

par(mfrow=c(3,2))

for( pt.no in 1:pat){
  plot(IPSS~time, data=subset(mydata, pt.ID == pt.no),
       xlim=c(0,57), ylim=c(0,35),
       type="l",col="black", xlab="f/u time", ylab= "",
       main = paste("patient", pt.no)
       )

  points(IPSS~time, data=subset(mydata, pt.ID == pt.no))

  matlines(y = pt_summary[,pt.no ,], x = fu.time, col=c("blue", 'green', 'green'))
}