I'm having trouble getting this double for loop to store the calculated values in the matrix shown below. I chose a double for loop rather than apply() or mean() because I want only the unique combinations of the two columns, without redundancy (explained below). Here is an example:
A<-c(1,2,3,4,5)
B<-c(2,3,4,5,6)
Q1<-data.frame(cbind(A,B))
mean<-matrix(nrow=5, ncol = 5)
for(i in 1: length(Q1$A)){
for(j in 2: length(Q1$B)){
mean[i,j]<-sum(Q1$A[i]+Q1$B[j])/2
}
}
Here, I attempted to run the entire A vector against the entire B vector while eliminating redundancy, so that A[1] is paired with the four values B[2] through B[5], A[2] with the three values B[3] through B[5], and so on. However, this was my result:
[,1] [,2] [,3] [,4] [,5]
[1,] NA 2.0 2.5 3.0 3.5
[2,] NA 2.5 3.0 3.5 4.0
[3,] NA 3.0 3.5 4.0 4.5
[4,] NA 3.5 4.0 4.5 5.0
[5,] NA 4.0 4.5 5.0 5.5
While the first column was what I expected, the other columns contain values I didn't want. What I want instead is the matrix below:
[,1] [,2] [,3] [,4] [,5]
[1,] NA 2.0 2.5 3.0 3.5
[2,] NA NA 3.0 3.5 4.0
[3,] NA NA NA 4.0 4.5
[4,] NA NA NA NA 5.0
[5,] NA NA NA NA NA
Any suggestions?
[Original Solution (see Update 2 for the faster solutions)]
f.m <- function(Q1) {
z <- matrix(nrow=nrow(Q1),ncol=nrow(Q1))
b <- row(z) < col(z)
z[b] <- (Q1$A[row(z)[b]] + Q1$B[col(z)[b]])/2  # rows index A, columns index B
z
}
[Sample output]
f.m(Q1)
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 2 2.5 3.0 3.5
# [2,] NA NA 3.0 3.5 4.0
# [3,] NA NA NA 4.0 4.5
# [4,] NA NA NA NA 5.0
# [5,] NA NA NA NA NA
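The logical mask b does most of the work in f.m(): row(z) < col(z) is TRUE exactly for the cells strictly above the main diagonal. A minimal sketch on a 3x3 matrix (z and b follow the function above; the comparison with upper.tri() is added here only for illustration):

```r
# Build an empty 3x3 matrix and the "strictly above the diagonal" mask.
z <- matrix(nrow = 3, ncol = 3)
b <- row(z) < col(z)

# The mask selects cells (1,2), (1,3) and (2,3), in column-major order.
picked <- cbind(row = row(z)[b], col = col(z)[b])
print(picked)

# Base R's upper.tri() builds the same mask.
same_as_upper_tri <- all(b == upper.tri(z))
print(same_as_upper_tri)
```

Since row(z)[b] and col(z)[b] come out in the same order as z[b], they can be used directly as indices into A and B on the right-hand side of the assignment.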
[Benchmarking Setup]
f0 <- function(Q1) {
mean<-matrix(nrow=nrow(Q1), ncol = nrow(Q1))
for(i in 1: length(Q1$A)){
for(j in 2: length(Q1$B)){
mean[i,j]<-sum(Q1$A[i]+Q1$B[j])/2
}
}
mean
}
f1 <- function(Q1) {
mean<-matrix(nrow=nrow(Q1), ncol = nrow(Q1))
for(i in 2: length(Q1$A)){
for(j in i: length(Q1$B)){
mean[i,j]<-sum(Q1$A[i]+Q1$B[j])/2
}
}
mean
}
# Note that f0() and f1() don't return the desired result for the sample output
f2 <- function(Q1) {
mean<-outer(1: length(Q1$A),
1: length(Q1$B),
Vectorize(function(i,j){
if(i >= j)
return(NA)
else
return(sum(Q1$A[i]+Q1$B[j])/2)
}))
mean
}
library(rbenchmark)
[Benchmarking Result]
A <- B <- 1:100
Q1<-data.frame(A,B)
benchmark(f0(Q1), f1(Q1), f2(Q1), f.m(Q1), replications = 10)
test replications elapsed relative user.self sys.self user.child sys.child
4 f.m(Q1) 10 0.011 1.000 0.012 0.000 0 0
1 f0(Q1) 10 3.018 274.364 3.007 0.008 0 0
2 f1(Q1) 10 1.477 134.273 1.474 0.003 0 0
3 f2(Q1) 10 1.777 161.545 1.774 0.002 0 0
[Update 1]
Roughly another order of magnitude in running time can be saved by computing the entire matrix directly, which avoids subsetting (costly compared to summation):
f.m2 <- function(Q1) outer(Q1$A,Q1$B,'+')*0.5
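For orientation, outer(X, Y, "+") returns a matrix whose [i, j] entry is X[i] + Y[j], so rows follow A and columns follow B. A small sketch with deliberately different vectors (the values are chosen only for illustration) makes that visible, which the symmetric benchmark inputs would hide:

```r
# Different magnitudes in A and B so the orientation is easy to read off.
A <- c(1, 2, 3)
B <- c(10, 20, 30)

m <- outer(A, B, "+") * 0.5

m[1, 2]  # (A[1] + B[2]) / 2 = 10.5
m[2, 1]  # (A[2] + B[1]) / 2 = 6
```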
Another round of benchmarking:
A <- B <- 1:1000
Q1<-data.frame(A,B)
#benchmark(f0(Q1), f1(Q1), f2(Q1), f.m(Q1), replications = 10)
benchmark(f.m(Q1), f.m2(Q1), replications = 10)
test replications elapsed relative user.self sys.self user.child sys.child
1 f.m(Q1) 10 1.839 10.274 1.746 0.093 0 0
2 f.m2(Q1) 10 0.179 1.000 0.144 0.035 0 0
[Update 2]
1) As noted by David Arenburg, the function f.m2() does not produce exactly the expected output, because the lower-left triangle and the main diagonal of the output should be filled with NAs. f.m2() can be fixed to produce the proper answer at the cost of performance (see the benchmarking below).
# Suggested by David Arenburg
f.m2.1 <- function(Q1) {
Res <- outer(Q1$A,Q1$B,'+')*0.5;
Res[lower.tri(Res, diag = TRUE)] <- NA;
Res
}
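To show what the lower.tri() step does in isolation, here is a minimal sketch on a small matrix (the matrix m is a made-up example, not part of the original data):

```r
# 3x3 matrix filled column by column with 1..9.
m <- matrix(1:9, nrow = 3)

# Blank out everything on or below the main diagonal.
m[lower.tri(m, diag = TRUE)] <- NA
m
#      [,1] [,2] [,3]
# [1,]   NA    4    7
# [2,]   NA   NA    8
# [3,]   NA   NA   NA
```

With diag = FALSE (the default) the diagonal entries would be kept, which is not what the question asks for.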
2) Here is another approach suggested by David Arenburg, which makes use of the CJ function from the data.table package:
library(data.table)
f.DA <- function(Q1){
Res <- matrix(rowMeans(CJ(Q1$A, Q1$B, sorted = FALSE)), ncol = nrow(Q1), byrow = TRUE)  # CJ varies B fastest, so fill by row
Res[lower.tri(Res, diag = TRUE)] <- NA
Res
}
3) Here is an Rcpp-based approach:
library(Rcpp)
cppFunction('NumericMatrix fC(NumericVector A, NumericVector B) {
int n = A.size();
NumericMatrix out(n,n);
std::fill( out.begin(), out.end(), NumericVector::get_na() ) ;
for (int i = 0; i < n; i++) {
for (int j = i+1; j < n; j++) {
out(i,j) = 0.5*(A[i] + B[j]);
}
}
return out;
}')
4) And another benchmarking study:
A <- B <- 1:3000
Q1<-data.frame(A,B)
benchmark(f.m2(Q1), f.m2.1(Q1), f.DA(Q1), fC(Q1$A, Q1$B), replications = 10)
test replications elapsed relative user.self sys.self user.child sys.child
3 f.DA(Q1) 10 7.442 11.556 6.200 1.209 0 0
2 f.m2.1(Q1) 10 5.111 7.936 4.404 0.661 0 0
1 f.m2(Q1) 10 1.007 1.564 0.733 0.263 0 0
4 fC(Q1$A, Q1$B) 10 0.644 1.000 0.525 0.116 0 0
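Because the benchmark inputs use A equal to B, the full sum matrix is symmetric and a transposed result would go unnoticed. One way to guard against that is to compare each candidate with a slow reference on asymmetric input before timing it. The sketch below uses base R only; ref() and cand() are hypothetical helper names, and cand() simply follows the outer() plus lower.tri() idea from above:

```r
# Reference: explicit loops over i < j (slow, but easy to verify by eye).
ref <- function(A, B) {
  n <- length(A)
  out <- matrix(NA_real_, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) out[i, j] <- (A[i] + B[j]) / 2
  }
  out
}

# Candidate: full outer sum, then blank the lower triangle and diagonal.
cand <- function(A, B) {
  out <- outer(A, B, "+") * 0.5
  out[lower.tri(out, diag = TRUE)] <- NA
  out
}

# Asymmetric random input, so a swapped row/column index would show up.
set.seed(1)
A <- runif(6)
B <- runif(6)

ok <- isTRUE(all.equal(ref(A, B), cand(A, B)))
ok
```

Any other implementation that takes two vectors can be checked the same way by swapping it in for cand().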
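The second for loop should start at i + 1. The outer loop then has to stop at length(Q1$A) - 1, because on the last pass (i+1):length(Q1$B) would count backwards (6:5) and index out of bounds:
for(i in 1:(length(Q1$A)-1))
for(j in (i+1):length(Q1$B))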
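Written out as a complete loop, this might look like the sketch below. It uses seq_len(n - 1) for the outer index so that the inner range (i + 1):n never counts backwards, and it names the result res (a hypothetical name) rather than mean, to avoid masking base R's mean() function:

```r
A <- c(1, 2, 3, 4, 5)
B <- c(2, 3, 4, 5, 6)

n <- length(A)
res <- matrix(NA_real_, nrow = n, ncol = n)

# Only visit cells with j > i; everything else stays NA.
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    res[i, j] <- (A[i] + B[j]) / 2
  }
}
res
```

The last row is never visited, so it remains all NA, as in the desired output.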
Alternatively, you can use the next keyword to skip the operations you don't need, as in:
A<-c(1,2,3,4,5)
B<-c(2,3,4,5,6)
Q1<-data.frame(cbind(A,B))
mean<-matrix(nrow=5, ncol = 5)
for(i in 1: length(Q1$A))
for(j in 2: length(Q1$B)){
if(i >= j)
next
mean[i,j]<-sum(Q1$A[i]+Q1$B[j])/2
}
or you could make the range of the inner for loop depend on the value of the outer index, as in:
mean<-matrix(nrow=5, ncol = 5)
for(i in 1:(length(Q1$A)-1)){
for(j in (i+1): length(Q1$B)){
mean[i,j]<-sum(Q1$A[i]+Q1$B[j])/2
}
}
or you could use outer() as in:
mean<-outer(1: length(Q1$A),
1: length(Q1$B),
Vectorize(function(i,j){
if(i >= j)
return(NA)
else
return(sum(Q1$A[i]+Q1$B[j])/2)
}))
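The Vectorize() wrapper matters here because outer() calls its function once with two long vectors rather than once per cell, and a plain if () cannot handle a vector condition. A minimal sketch of the difference (f is a hypothetical helper; the ifelse() variant is an alternative added for comparison, not part of the answer above):

```r
# Scalar-only helper: fine for one (i, j) pair, not for whole vectors.
f <- function(i, j) if (i >= j) NA else (i + j) / 2

# Passing f directly fails, because outer() hands it vectors of length 9.
bad <- try(outer(1:3, 1:3, f), silent = TRUE)
failed <- inherits(bad, "try-error")

# Vectorize() applies f element by element, so outer() gets 9 results back.
good <- outer(1:3, 1:3, Vectorize(f))

# An already vectorised function needs no wrapper.
alt <- outer(1:3, 1:3, function(i, j) ifelse(i >= j, NA, (i + j) / 2))

same <- isTRUE(all.equal(good, alt))
```

Vectorize() is convenient but still evaluates f once per cell, which is consistent with f2() being far slower than f.m() in the benchmark above.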
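Not exactly a double for loop, but you could just use the outer function to calculate the averages:
outer(Q1$A, Q1$B, "+")/2
This fills the whole matrix; if you only want the entries above the diagonal, set the rest to NA afterwards (for example with lower.tri(x, diag = TRUE)).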